Abstract :
[en] Learning-based approaches have recently become popular for various computer vision tasks such as facial expression recognition, action recognition, banknote identification, image captioning, medical image segmentation, etc. The learning-based approach allows the constructed model to learn features, which result in high performance. Recently, the backbone of most learning-based approaches are deep neural networks (DNNs). Importantly, it is believed that increasing the depth of DNNs invariably leads to improved generalization performance. Thus, many state-of-the-art DNNs have over 30 layers of feature representations. In fact, it is not uncommon to find DNNs with over 100 layers in the literature. However, training very DNNs that have over 15 layers is not trivial. On one hand, such very DNNs generally suffer optimization problems. On the other hand, very DNNs are often overparameterized such that they overfit the training data, and hence incur generalization loss. Moreover, overparameterized DNNs are impractical for applications that require low latency, small Graphic Processing Unit (GPU) memory for operation and small memory for storage. Interestingly, skip connections of various forms have been shown to alleviate the difficulty of optimizing very DNNs. In this thesis, we propose to improve the optimization and generalization of very DNNs with and without skip connections by reformulating their training schemes. Specifically, the different modifications proposed allow the DNNs to achieve state-of-the-art results on several benchmarking datasets.
The second part of the thesis presents the theoretical analyses of DNNs without and with skip connections based on several concepts from linear algebra and random matrix theory. The theoretical results obtained provide new insights into why DNNs with skip connections are easy to optimize, and generalize better than DNNs without skip connections. Ultimately, the theoretical results are shown to agree with practical DNNs via extensive experiments. The third part of the thesis addresses the problem of compressing large DNNs into smaller models. Following the identified drawbacks of the conventional group LASSO for compressing large DNNs, the debiased elastic group least absolute shrinkage and selection operator (DEGL) is employed. Furthermore, the layer-wise subspace learning (SL) of latent representations in large DNNs is proposed. The objective of SL is learning a compressed latent space for large DNNs. In addition, it is observed that SL improves the performance of LASSO, which is popularly known not to work well for compressing large DNNs. Extensive experiments are reported to validate the effectiveness of the different model compression approaches proposed in this thesis.
Finally, the thesis addresses the problem of multimodal learning using DNNs, where data from different modalities are combined into useful representations for improved learning results. Different interesting multimodal learning frameworks are applied to the problems of facial expression and object recognition. We show that under the right scenarios, the complementary information from multimodal data leads to better model performance.