Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/23405
Full metadata record
DC Field: Value [Language]
dc.contributor.author: Meng, H
dc.contributor.author: Huang, J
dc.contributor.author: Huang, Y
dc.contributor.author: Wang, Q
dc.contributor.author: Yang, W
dc.date.accessioned: 2021-10-28T09:27:41Z
dc.date.available: 2021-10-28T09:27:41Z
dc.date.issued: 2021-09-20
dc.identifier: ORCiD: Jing Huang https://orcid.org/0000-0003-3445-6164
dc.identifier: ORCiD: Yan Huang https://orcid.org/0000-0001-7868-093X
dc.identifier: ORCiD: Qicong Wang https://orcid.org/0000-0001-7324-0433
dc.identifier: ORCiD: Hongying Meng https://orcid.org/0000-0002-8836-1382
dc.identifier.citation: Huang, J. et al. 'Self-supervised Representation Learning for Videos by Segmenting via Sampling Rate Order Prediction,' IEEE Transactions on Circuits and Systems for Video Technology, 32 (6), pp. 3475-3489. doi: 10.1109/TCSVT.2021.3114209. [en_US]
dc.identifier.issn: 1051-8215
dc.identifier.uri: https://bura.brunel.ac.uk/handle/2438/23405
dc.description.abstract: Self-supervised representation learning for videos has recently attracted great interest because such methods exploit information obtained from the video itself rather than manually annotated labels, which are time-consuming to produce. However, existing methods ignore the importance of global observation while performing spatio-temporal transformation perception, which severely limits the expressive capability of the learned video representation. This paper proposes a novel pretext task that combines temporal information perception of the video with motion-amplitude perception of moving objects to learn the spatio-temporal representation of the video. Specifically, given a video clip containing several video segments, each segment is sampled at a different sampling rate and the order of the segments is disrupted. The network is then trained to regress the sampling rate of each segment and to classify the order of the input segments. In the pre-training stage, the network learns rich spatio-temporal semantic information, and content-related contrastive learning is introduced to make the learned video representation more discriminative. To alleviate the appearance dependency caused by contrastive learning, we design a novel and robust vector similarity measurement approach that takes feature alignment into consideration. Moreover, a view synthesis framework is proposed to further improve the performance of contrastive learning by automatically generating reasonable transformed views. We conduct benchmark experiments with three 3D backbone networks on two datasets. The results show that the proposed method outperforms existing state-of-the-art methods across all three backbones on two downstream tasks: human action recognition and video retrieval.
dc.description.sponsorship: Shenzhen Science and Technology Projects (Grant Numbers JCYJ20200109143035495 and JCYJ20180306173210774). [en_US]
dc.format.extent: 3475-3489
dc.format.medium: Print-Electronic
dc.language.iso: en_US [en_US]
dc.publisher: IEEE [en_US]
dc.rights: Copyright © 2021 Institute of Electrical and Electronics Engineers (IEEE). Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. See: https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/post-publication-policies/
dc.rights.uri: https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/post-publication-policies/
dc.subject: task analysis [en_US]
dc.subject: semantics [en_US]
dc.subject: streaming media [en_US]
dc.subject: feature extraction [en_US]
dc.subject: data mining [en_US]
dc.subject: motion segmentation [en_US]
dc.subject: image segmentation [en_US]
dc.title: Self-supervised Representation Learning for Videos by Segmenting via Sampling Rate Order Prediction [en_US]
dc.type: Article [en_US]
dc.identifier.doi: https://doi.org/10.1109/TCSVT.2021.3114209
dc.relation.isPartOf: IEEE Transactions on Circuits and Systems for Video Technology
pubs.issue: 6
pubs.publication-status: Published online
pubs.volume: 32
dc.identifier.eissn: 1558-2205
dcterms.dateAccepted: 2021-09-15
dc.rights.holder: Institute of Electrical and Electronics Engineers (IEEE)
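
As a rough illustration of the pretext task described in the abstract above, the sketch below (Python/NumPy; the function name, segment count, and sampling rates are illustrative assumptions, not the authors' implementation) builds one training sample: the clip is cut into segments, each segment is subsampled at a different rate, and the segments are shuffled, yielding the two self-supervised targets (per-segment sampling rate for regression, segment order for classification).

    # Hypothetical sketch of the input construction for the pretext task;
    # names and default values are assumptions, not the authors' code.
    import numpy as np

    def build_pretext_sample(video, num_segments=3, seg_len=16,
                             rates=(1, 2, 4), rng=np.random):
        """video: frame array of shape (T, H, W, C)."""
        T = video.shape[0]
        # Assign a distinct sampling rate to each segment.
        seg_rates = rng.permutation(np.array(rates))[:num_segments]
        segments = []
        for r in seg_rates:
            span = int(seg_len * r)                   # frames spanned at rate r
            start = rng.randint(0, max(1, T - span))  # random temporal crop
            segments.append(video[start:start + span:r])  # keep every r-th frame
        # Disrupt the temporal order of the segments; the permutation is the
        # classification target, the rates are the regression target.
        order = rng.permutation(num_segments)
        shuffled = [segments[i] for i in order]
        return shuffled, seg_rates[order].astype(float), order

During pre-training, one network head would regress the returned sampling rates and another would classify the returned order, as the abstract describes.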
Appears in Collections: Dept of Electronic and Electrical Engineering Research Papers

Files in This Item:
File: FullText.pdf
Description: Copyright © 2021 Institute of Electrical and Electronics Engineers (IEEE); see dc.rights above for the full rights statement.
Size: 2.97 MB
Format: Adobe PDF


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.