
Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition

Published in: International Journal of Computer Vision

Abstract

Social group activity recognition is a challenging task that extends group activity recognition: social groups must be recognized along with their activities and their members. Existing methods tackle this task by leveraging region features of individuals, following prior group activity recognition methods. However, the effectiveness of region features is susceptible to person localization errors and to the variable semantics of individual actions. To overcome these issues, we propose leveraging attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate features for a social group, each of which is assigned to a group member without duplication. Because of this non-duplicated assignment, the number of embeddings must be large to avoid missing group members, which renders attention in transformers ineffective. To find optimal attention designs with a large number of embeddings, we explore several design choices for the feature-aggregation queries and the self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective for social group activity recognition.
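The two mechanisms named in the abstract — query embeddings that aggregate features for a group via attention, and a non-duplicated assignment of embeddings to group members — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the random "token" features, and the hypothetical member targets are all assumptions, and the one-to-one assignment is realized here with the standard Hungarian algorithm from SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
d = 8             # feature dimension (assumed)
num_queries = 6   # embeddings; kept larger than the expected group size
num_tokens = 20   # flattened image-feature tokens (assumed)

queries = rng.standard_normal((num_queries, d))
tokens = rng.standard_normal((num_tokens, d))

# Scaled dot-product cross-attention: each query embedding aggregates
# a weighted combination of the image-feature tokens.
scores = queries @ tokens.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax over tokens
group_features = weights @ tokens               # (num_queries, d)

# Non-duplicated assignment: match queries to (hypothetical) member
# targets one-to-one via the Hungarian algorithm, so no two queries
# are assigned to the same member.
members = rng.standard_normal((3, d))           # hypothetical targets
cost = -group_features @ members.T              # lower cost = better match
q_idx, m_idx = linear_sum_assignment(cost)      # 3 distinct query indices
```

Because the assignment is one-to-one, missing a member can only be avoided by provisioning more queries than members — which is exactly the regime of "a large number of embeddings" that the paper's attention designs target.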


Data Availability

All the datasets used in this paper are publicly available.



Acknowledgements

Computational resources of the AI Bridging Cloud Infrastructure (ABCI), provided by the National Institute of Advanced Industrial Science and Technology (AIST), were used.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Masato Tamura.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tamura, M. Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02082-y

