View-Invariant; Human Action Recognition; Human Pose Estimation
Abstract :
This work presents a new view-invariant action recognition system that classifies human actions using a single RGB camera, including from challenging camera viewpoints. Understanding actions from different viewpoints remains an extremely challenging problem because of depth ambiguities, occlusion, and a large variety of appearances and scenes. Moreover, relying only on 2D information yields different interpretations of the same action when it is seen from different viewpoints. Our system operates in two consecutive stages. The first stage estimates the 2D human pose using a convolutional neural network. In the next stage, the 2D human poses are lifted to 3D human poses using a temporal convolutional neural network that enforces temporal coherence over the estimated 3D poses. The estimated 3D poses from different viewpoints are then aligned to the same camera reference frame. Finally, we propose a temporal convolutional network-based classifier for cross-view action recognition.
Our results show that we can achieve state-of-the-art view-invariant action recognition accuracy, even for challenging viewpoints, using only RGB videos and without pre-training on synthetic or motion capture data.
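The alignment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the alignment is computed, so everything here is an assumption made for illustration: the joint layout (index 0 as the pelvis root, indices 1 and 2 as the right and left hips), the choice of y as the vertical axis, and the helper names `align_pose` and `simulate_view` are all hypothetical. The sketch moves the root joint to the origin and rotates the skeleton about the vertical axis until the hip line points along +x, so that the same pose seen from two camera placements ends up with the same coordinates.

```python
import math


def align_pose(pose, root=0, right_hip=1, left_hip=2):
    """Center a 3D pose on its root joint and rotate it about the y axis
    so that the left-to-right hip vector lies along +x.

    The joint indices and the vertical axis are assumptions for this sketch.
    """
    rx, ry, rz = pose[root]
    centered = [(x - rx, y - ry, z - rz) for x, y, z in pose]

    # Direction of the hip line projected onto the horizontal (xz) plane.
    hx = centered[right_hip][0] - centered[left_hip][0]
    hz = centered[right_hip][2] - centered[left_hip][2]
    theta = math.atan2(hz, hx)
    c, s = math.cos(theta), math.sin(theta)

    # Rotate by -theta about y, which maps the hip direction onto +x.
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in centered]


def simulate_view(pose, angle, shift):
    """Hypothetical helper: the same pose as seen after rotating the scene
    about the y axis by `angle` and translating it by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * z + shift[0], y + shift[1], s * x + c * z + shift[2])
        for x, y, z in pose
    ]


# Toy skeleton: pelvis, right hip, left hip, head.
pose = [(0.0, 1.0, 0.0), (0.1, 1.0, 0.0), (-0.1, 1.0, 0.0), (0.0, 1.6, 0.05)]

view_a = align_pose(simulate_view(pose, 0.3, (1.0, 0.0, 2.0)))
view_b = align_pose(simulate_view(pose, -1.2, (-0.5, 0.0, 3.0)))

# Largest coordinate difference between the two aligned views.
max_diff = max(
    abs(a - b) for pa, pb in zip(view_a, view_b) for a, b in zip(pa, pb)
)
```

This sketch only removes translation and rotation about the vertical axis. A camera that is tilted or rolled would need a full 3D rotation estimate, which is outside what the sketch covers.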
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SIGCOM
Disciplines :
Computer science
Author, co-author :
Adel Musallam, Mohamed
BAPTISTA, Renato ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
AL ISMAEIL, Kassem ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
AOUADA, Djamila ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
External co-authors :
no
Language :
English
Title :
Temporal 3D Human Pose Estimation for Action Recognition from Arbitrary Viewpoints
Publication date :
December 2019
Event name :
6th Annual Conf. on Computational Science & Computational Intelligence
Event organizer :
https://americancse.org/events/csci2019
Event date :
5-7 December 2019
Audience :
International
Main work title :
6th Annual Conf. on Computational Science & Computational Intelligence, Las Vegas 5-7 December 2019
Publisher :
Conference Publishing Services
Peer reviewed :
Peer reviewed
Focus Area :
Computational Sciences
European Projects :
H2020 - 689947 - STARR - Decision SupporT and self-mAnagement system for stRoke survivoRs
FnR Project :
FNR10415355 - 3d Action Recognition Using Refinement And Invariance Strategies For Reliable Surveillance, 2015 (01/06/2016-31/05/2019) - Bjorn Ottersten