Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/32290

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Cao, Y | - |
| dc.contributor.author | Wang, F | - |
| dc.contributor.author | Zheng, Q | - |
| dc.coverage.spatial | London, UK | - |
| dc.date.accessioned | 2025-11-05T12:58:27Z | - |
| dc.date.available | 2025-11-05T12:58:27Z | - |
| dc.date.issued | 2025-10-01 | - |
| dc.identifier | ORCiD: Fang Wang https://orcid.org/0000-0003-1987-9150 | - |
| dc.identifier | Article number: 06003 | - |
| dc.identifier.citation | Cao, Y., Wang, F. and Zheng, Q. (2025) 'Visual transformer with depthwise separable convolution projections for video-based human action recognition', MATEC Web of Conferences, 413, 06003, pp. 1 - 5. doi: 10.1051/matecconf/202541306003. | en_US |
| dc.identifier.issn | 2274-7214 | - |
| dc.identifier.uri | https://bura.brunel.ac.uk/handle/2438/32290 | - |
| dc.description.abstract | Human action recognition is a task that utilizes algorithms to recognize human actions from videos. Transformer-based algorithms have attracted growing attention in recent years. However, transformer networks often suffer from slow convergence and require large amounts of training data, due to their inability to prioritize information from neighboring pixels. To address these issues, we propose a novel network architecture that combines a depthwise separable convolution layer with transformer modules. The proposed network has been evaluated on the medium-sized benchmark dataset UCF101, and the results demonstrate that the proposed model converges quickly during training and achieves competitive performance compared with a state-of-the-art (SOTA) pure transformer network, while using approximately 7.4 million fewer parameters. | en_US |
| dc.description.sponsorship | This work is supported by the Zhongyuan University of Technology-Brunel University London (ZUT-BUL) Joint Doctoral Training Programme. This work is funded by the ZUT/BRUNEL scholarship. | en_US |
| dc.format.extent | 1 - 5 | - |
| dc.format.medium | Print-Electronic | - |
| dc.language | English | - |
| dc.language.iso | en_US | en_US |
| dc.publisher | EDP Sciences | en_US |
| dc.rights | Creative Commons Attribution 4.0 International | - |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | - |
| dc.source | International Conference on Measurement, AI, Quality and Sustainability (MAIQS 2025) | - |
| dc.title | Visual transformer with depthwise separable convolution projections for video-based human action recognition | en_US |
| dc.type | Conference Paper | en_US |
| dc.date.dateAccepted | 2025-06-08 | - |
| dc.identifier.doi | https://doi.org/10.1051/matecconf/202541306003 | - |
| dc.relation.isPartOf | MATEC Web of Conferences | - |
| pubs.finish-date | 2025-08-28 | - |
| pubs.publication-status | Published | - |
| pubs.start-date | 2025-08-26 | - |
| pubs.volume | 413 | - |
| dc.identifier.eissn | 2261-236X | - |
| dc.rights.license | https://creativecommons.org/licenses/by/4.0/legalcode.en | - |
| dcterms.dateAccepted | 2025-06-08 | - |
| dc.rights.holder | The Authors | - |
Appears in Collections: Dept of Mechanical and Aerospace Engineering Research Papers
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| FullText.pdf | Copyright © The Authors, published by EDP Sciences, 2025. Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0 (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. | 323.91 kB | Adobe PDF | View/Open |
This item is licensed under a Creative Commons License
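The abstract attributes the parameter savings to using depthwise separable convolution projections in place of standard ones. A minimal sketch of the parameter-count arithmetic behind that claim (the channel and kernel sizes below are illustrative assumptions, not figures taken from the paper):

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: one k x k kernel per (input, output) channel pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise step: one k x k kernel per input channel (groups = c_in),
    # followed by a pointwise 1 x 1 convolution that mixes channels.
    return c_in * k * k + c_in * c_out

# Hypothetical example: projecting 3-channel 16 x 16 patches to a 768-dim embedding
standard = conv_params(3, 768, 16)                   # 589,824 parameters
separable = depthwise_separable_params(3, 768, 16)   # 3,072 parameters
print(standard, separable)
```

The same substitution applied across a network's projection layers is the kind of change that can account for a multi-million-parameter reduction like the approximately 7.4 million reported in the abstract.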