[en] Redundant and noisy pose transformations are among the main challenges in skeleton-based action recognition systems. Earlier works in skeleton-based action recognition explored different approaches for filtering linear noise transformations, but neglected to address potential nonlinear transformations. In this paper, we present an unsupervised learning approach for estimating nonlinear noise transformations in pose estimates. Our approach starts by decoupling linear and nonlinear noise transformations. While the linear transformations are modelled explicitly, the nonlinear transformations are learned from data. Subsequently, we use an autoencoder with an L2-norm reconstruction error and show that it does capture nonlinear noise transformations and recovers a denoised pose estimate, which in turn improves performance significantly. We validate our approach on a publicly available dataset, NW-UCLA.
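The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the translation-and-scale normalization standing in for the explicitly modelled linear part, the single tanh hidden layer, and all function names and hyperparameters are assumptions made for illustration only. The sketch trains a small autoencoder by gradient descent on a squared L2 reconstruction error and uses its output as the denoised pose.

```python
import numpy as np


def make_synthetic_poses(n, joints, seed=0):
    """Hypothetical data: one base pose bent nonlinearly by a latent t, plus noise."""
    rng = np.random.default_rng(seed)
    base = rng.normal(size=(joints, 3))
    d1 = rng.normal(size=(joints, 3))
    d2 = rng.normal(size=(joints, 3))
    t = rng.uniform(-1.0, 1.0, size=(n, 1, 1))
    noise = 0.02 * rng.normal(size=(n, joints, 3))
    return base + 0.3 * np.sin(3.0 * t) * d1 + 0.3 * t ** 2 * d2 + noise


def normalize_poses(poses):
    """Remove translation and scale from poses of shape (n, joints, 3).

    Assumption: this stands in for the explicitly modelled linear transformations.
    """
    centered = poses - poses.mean(axis=1, keepdims=True)
    flat = centered.reshape(len(poses), -1)
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)


def train_autoencoder(x, hidden_dim=8, lr=0.05, epochs=300, seed=0):
    """Fit a one-hidden-layer autoencoder on the mean squared L2 reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0.0, 0.1, (d, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(0.0, 0.1, (hidden_dim, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ w1 + b1)           # encoder
        err = (h @ w2 + b2) - x            # decoder output minus input
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Backpropagate the loss through decoder and encoder.
        g_out = 2.0 * err / n
        g_hid = (g_out @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (x.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (w1, b1, w2, b2), losses


def denoise(params, x):
    """Pass poses through the trained autoencoder; the reconstruction is the denoised pose."""
    w1, b1, w2, b2 = params
    return np.tanh(x @ w1 + b1) @ w2 + b2


poses = make_synthetic_poses(n=200, joints=5)
x = normalize_poses(poses)
params, losses = train_autoencoder(x)
x_hat = denoise(params, x)
print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this toy setting the autoencoder is trained on the same poses it later reconstructs, which matches the unsupervised framing of the abstract but says nothing about the architecture or training schedule used in the paper.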
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SIGCOM
Disciplines :
Computer science
Author, co-author :
Demisse, Girum ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Papadopoulos, Konstantinos ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Aouada, Djamila ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Ottersten, Björn ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
External co-authors :
no
Language :
English
Title :
Pose Encoding for Robust Skeleton-Based Action Recognition
Publication date :
18 June 2018
Event name :
CVPRW: Visual Understanding of Humans in Crowd Scene
Event date :
from 18-06-2018 to 22-06-2018
Main work title :
CVPRW: Visual Understanding of Humans in Crowd Scene, Salt Lake City, Utah, June 18-22, 2018
Peer reviewed :
Peer reviewed
Focus Area :
Security, Reliability and Trust
FnR Project :
FNR10415355 - 3D Action Recognition Using Refinement And Invariance Strategies For Reliable Surveillance, 2015 (01/06/2016-31/05/2019) - Björn Ottersten