Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/32290
Full metadata record
DC Field | Value | Language
dc.contributor.author | Cao, Y | -
dc.contributor.author | Wang, F | -
dc.contributor.author | Zheng, Q | -
dc.coverage.spatial | London, UK | -
dc.date.accessioned | 2025-11-05T12:58:27Z | -
dc.date.available | 2025-11-05T12:58:27Z | -
dc.date.issued | 2025-10-01 | -
dc.identifier | ORCiD: Fang Wang https://orcid.org/0000-0003-1987-9150 | -
dc.identifier | Article number: 06003 | -
dc.identifier.citation | Cao, Y., Wang, F. and Zheng, Q. (2025) 'Visual transformer with depthwise separable convolution projections for video-based human action recognition', MATEC Web of Conferences, 413, 06003, pp. 1 - 5. doi: 10.1051/matecconf/202541306003. | en_US
dc.identifier.issn | 2274-7214 | -
dc.identifier.uri | https://bura.brunel.ac.uk/handle/2438/32290 | -
dc.description.abstract | Human action recognition is the task of recognizing human actions in videos using algorithms. Transformer-based algorithms have attracted growing attention in recent years. However, transformer networks often suffer from slow convergence and require large amounts of training data, because they are unable to prioritize information from neighboring pixels. To address these issues, we propose a novel network architecture that combines a depthwise separable convolution layer with transformer modules. The proposed network has been evaluated on the medium-sized benchmark dataset UCF101, and the results demonstrate that the proposed model converges quickly during training and achieves competitive performance compared with the SOTA pure transformer network, while reducing the parameter count by approximately 7.4 million. (An illustrative code sketch of this projection idea appears below the metadata table.) | en_US
dc.description.sponsorship | This work is supported by the Zhongyuan University of Technology-Brunel University London (ZUT-BUL) Joint Doctoral Training Programme and funded by the ZUT/BRUNEL scholarship. | en_US
dc.format.extent | 1 - 5 | -
dc.format.medium | Print-Electronic | -
dc.language | English | -
dc.language.iso | en_US | en_US
dc.publisher | EDP Sciences | en_US
dc.rights | Creative Commons Attribution 4.0 International | -
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | -
dc.source | International Conference on Measurement, AI, Quality and Sustainability (MAIQS 2025) | -
dc.title | Visual transformer with depthwise separable convolution projections for video-based human action recognition | en_US
dc.type | Conference Paper | en_US
dc.date.dateAccepted | 2025-06-08 | -
dc.identifier.doi | https://doi.org/10.1051/matecconf/202541306003 | -
dc.relation.isPartOf | MATEC Web of Conferences | -
pubs.finish-date | 2025-08-28 | -
pubs.publication-status | Published | -
pubs.start-date | 2025-08-26 | -
pubs.volume | 413 | -
dc.identifier.eissn | 2261-236X | -
dc.rights.license | https://creativecommons.org/licenses/by/4.0/legalcode.en | -
dcterms.dateAccepted | 2025-06-08 | -
dc.rights.holder | The Authors | -
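
The abstract above describes combining depthwise separable convolutions with transformer modules so that tokens carry a locality prior from neighboring pixels. Below is a minimal sketch of that general idea, not the authors' code: it assumes PyTorch, and the module names (DepthwiseSeparableConv, ConvProjectionAttention), the choice to apply the convolutions to the Q/K/V projections, and all hyperparameters are illustrative assumptions; the paper's actual architecture may differ.

```python
# Illustrative sketch only (assumed design, not the paper's implementation):
# a self-attention block whose Q/K/V projections are depthwise separable
# convolutions over the 2D token grid, giving tokens a locality prior.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a
    pointwise 1x1 conv (cross-channel mixing); far fewer parameters
    than a dense projection of the same receptive field."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class ConvProjectionAttention(nn.Module):
    """Multi-head self-attention with depthwise separable convolutional
    Q/K/V projections (hypothetical configuration for demonstration)."""

    def __init__(self, dim: int = 192, num_heads: int = 3):
        super().__init__()
        self.proj_q = DepthwiseSeparableConv(dim)
        self.proj_k = DepthwiseSeparableConv(dim)
        self.proj_v = DepthwiseSeparableConv(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) token grid
        b, c, h, w = x.shape
        # Convolutional projections keep the spatial layout ...
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)
        # ... then tokens are flattened to (batch, h*w, channels) for attention.
        q, k, v = (t.flatten(2).transpose(1, 2) for t in (q, k, v))
        out, _ = self.attn(q, k, v)
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # Two 14x14 grids of 192-dim tokens, e.g. patches of two video frames.
    tokens = torch.randn(2, 192, 14, 14)
    print(ConvProjectionAttention()(tokens).shape)  # torch.Size([2, 192, 14, 14])
```

The depthwise stage filters each channel spatially on its own and the pointwise stage mixes channels, so a dim-channel projection costs roughly dim*k*k + dim*dim weights rather than the dim*dim*k*k of a full convolution, which is consistent with the abstract's claim of a reduced parameter count.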
Appears in Collections:Dept of Mechanical and Aerospace Engineering Research Papers

Files in This Item:
File | Description | Size | Format
FullText.pdf | Copyright © The Authors, published by EDP Sciences, 2025. Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0 (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. | 323.91 kB | Adobe PDF


This item is licensed under a Creative Commons License.