M. Wang and W. Deng, "Deep face recognition: A survey," Neurocomputing, vol. 429, pp. 215-244, 2021.
D. Dong, H. Wu, W. He, D. Yu, and H. Wang, "Multi-task learning for multiple language translation," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1723-1732.
O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350-354, 2019.
X. Gu, H. Zhang, and S. Kim, "Deep code search," in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 933-944.
R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker et al., "Codenet: A large-scale AI for code dataset for learning a diversity of coding tasks," arXiv preprint arXiv:2105.12655, 2021.
Y. Shi, T. Mao, T. Barnes, M. Chi, and T. W. Price, "More with less: Exploring how to use deep learning effectively through semi-supervised learning for automatic bug detection in student code," in Proceedings of the 14th International Conference on Educational Data Mining (EDM), 2021.
Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "Codebert: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
W. Ma, M. Zhao, E. Soremekun, Q. Hu, J. Zhang, M. Papadakis, M. Cordy, X. Xie, and Y. L. Traon, "Graphcode2vec: Generic code embedding via lexical and program dependence analyses," arXiv preprint arXiv:2112.01218, 2021.
Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," Advances in Neural Information Processing Systems, vol. 32, 2019.
C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, pp. 1-48, 2019.
S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, "A survey of data augmentation approaches for nlp," arXiv preprint arXiv:2105.03075, 2021.
T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah, "Data augmentation for graph neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 11015-11023.
M. Allamanis, H. Jackson-Flux, and M. Brockschmidt, "Self-supervised bug detection and repair," Advances in Neural Information Processing Systems, vol. 34, pp. 27865-27876, 2021.
M. V. Pour, Z. Li, L. Ma, and H. Hemmati, "A search-based testing framework for deep neural networks of source code embedding," in 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2021, pp. 36-46.
N. Yefet, U. Alon, and E. Yahav, "Adversarial examples for models of code," Proceedings of the ACM on Programming Languages, vol. 4, no. OOPSLA, pp. 1-30, 2020.
N. D. Bui, Y. Yu, and L. Jiang, "Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 511-521.
D. Wang, Z. Jia, S. Li, Y. Yu, Y. Xiong, W. Dong, and X. Liao, "Bridging pre-trained models and downstream tasks for source code understanding," in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 287-298.
S. Yu, T. Wang, and J. Wang, "Data augmentation by program transformation," Journal of Systems and Software, vol. 190, p. 111304, 2022.
P. Bielik and M. Vechev, "Adversarial robustness for code," in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13-18 Jul 2020, pp. 896-907. [Online]. Available: https://proceedings.mlr.press/v119/bielik20a.html.
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
H. Xu and S. Mannor, "Robustness and generalization," Machine Learning, vol. 86, no. 3, pp. 391-423, 2012.
K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, "Generalization in deep learning," arXiv preprint arXiv:1710.05468, 2017.
Y. Li, S. Wang, and T. N. Nguyen, "Dlfix: Context-based code transformation learning for automated program repair," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 602-614.
R. Gupta, S. Pal, A. Kanade, and S. Shevade, "Deepfix: Fixing common c language errors by deep learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
S. Bhatia, P. Kohli, and R. Singh, "Neuro-symbolic program corrector for introductory programming assignments," in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 60-70.
Y. Pu, K. Narasimhan, A. Solar-Lezama, and R. Barzilay, "sk_p: a neural program corrector for moocs," in Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity, 2016, pp. 39-40.
Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus, "Sequencer: Sequence-to-sequence learning for end-to-end program repair," IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1943-1959, 2019.
M. Yasunaga and P. Liang, "Graph-based, self-supervised program repair from diagnostic feedback," in International Conference on Machine Learning. PMLR, 2020, pp. 10799-10808.
E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang, "Hoppity: Learning graph transformations to detect and fix bugs in programs," in International Conference on Learning Representations (ICLR), 2020.
Z. Chen, V. Hellendoorn, P. Lamblin, P. Maniatis, P.-A. Manzagol, D. Tarlow, and S. Moitra, "Plur: A unifying, graph-based view of program learning, understanding, and repair," Advances in Neural Information Processing Systems, vol. 34, 2021.
D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., "Measuring coding challenge competence with apps," arXiv preprint arXiv:2105.09938, 2021.
L. Zhang, G. Rosenblatt, E. Fetaya, R. Liao, W. Byrd, M. Might, R. Urtasun, and R. Zemel, "Neural guided constraint logic programming for program synthesis," Advances in Neural Information Processing Systems, vol. 31, 2018.
X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, "Deep code comment generation," in 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC). IEEE, 2018, pp. 200-210.
U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1-29, 2019.
D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., "Graphcodebert: Pre-training code representations with data flow," arXiv preprint arXiv:2009.08366, 2020.
M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.
P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao et al., "Wilds: A benchmark of in-the-wild distribution shifts," in International Conference on Machine Learning. PMLR, 2021, pp. 5637-5664.
Y. Zhu, T. Ko, and B. Mak, "Mixup learning strategies for text-independent speaker verification," in Interspeech, 2019, pp. 4345-4349.
A. Kaur and M. Kaur, "Analysis of code refactoring impact on software quality," in MATEC Web of Conferences, vol. 57. EDP Sciences, 2016, p. 02012.
G. Lacerda, F. Petrillo, M. Pimenta, and Y. G. Guéhéneuc, "Code smells and refactoring: A tertiary systematic review of challenges and observations," Journal of Systems and Software, vol. 167, p. 110610, 2020.
H. Guo, Y. Mao, and R. Zhang, "Augmenting data with mixup for sentence classification: An empirical study," arXiv preprint arXiv:1905.08941, 2019.
J. B. McDonald and Y. J. Xu, "A generalization of the beta distribution with applications," Journal of Econometrics, vol. 66, no. 1-2, pp. 133-152, 1995.
M. Wei, Y. Huang, J. Yang, J. Wang, and S. Wang, "Cocofuzzing: Testing neural code models with coverage-guided fuzzing," arXiv preprint arXiv:2106.09242, 2021.
T. H. M. Le, H. Chen, and M. A. Babar, "Deep learning for source code modeling and generation: models, applications, and challenges," ACM Comput. Surv., vol. 53, no. 3, jun 2020. [Online]. Available: https://doi.org/10.1145/3383458.
D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, "Augmix: A simple data processing method to improve robustness and uncertainty," arXiv preprint arXiv:1912.02781, 2019.
D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt, "Pixmix: Dreamlike pictures comprehensively improve safety measures," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16783-16792.
L. Zhang, Z. Deng, K. Kawaguchi, A. Ghorbani, and J. Zou, "How does mixup help with robustness and generalization?" arXiv preprint arXiv:2010.04819, 2020.
Y. Hu, U. Z. Ahmed, S. Mechtaev, B. Leong, and A. Roychoudhury, "Refactoring based program repair applied to programming assignments," in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 388-398.
H. Zhong and Z. Su, "An empirical study on real bug fixes," in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1. IEEE, 2015, pp. 913-923.
Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., "Codexglue: A machine learning benchmark dataset for code understanding and generation," arXiv preprint arXiv:2102.04664, 2021.
X. Zhou, D. Han, and D. Lo, "Assessing generalizability of codebert," in 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2021, pp. 425-436.
A. Mastropaolo, S. Scalabrino, N. Cooper, D. N. Palacio, D. Poshyvanyk, R. Oliveto, and G. Bavota, "Studying the usage of text-to-text transfer transformer to support code-related tasks," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 336-347.
Z. Sun, L. Li, Y. Liu, and X. Du, "On the importance of building high-quality training datasets for neural code search," arXiv preprint arXiv:2202.06649, 2022.
Z. Chen and M. Monperrus, "The codrep machine learning on source code competition," arXiv preprint arXiv:1807.03200, 2018.
N. D. Bui, Y. Yu, and L. Jiang, "Infercode: Self-supervised learning of code representations by predicting subtrees," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1186-1197.
A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in International Conference on Machine Learning. PMLR, 2020, pp. 5110-5121.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
L. Buratti, S. Pujar, M. Bornea, S. McCarley, Y. Zheng, G. Rossiello, A. Morari, J. Laredo, V. Thost, Y. Zhuang et al., "Exploring software naturalness through neural language models," arXiv preprint arXiv:2006.12641, 2020.
I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
X. Zhang, Y. Zhou, T. Han, and T. Chen, "Training deep code comment generation models via data augmentation," in 12th Asia-Pacific Symposium on Internetware, 2020, pp. 185-188.
H. Zhang, Z. Li, G. Li, L. Ma, Y. Liu, and Z. Jin, "Generating adversarial examples for holding robustness of source code processing models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 1169-1176.
Q. Mi, Y. Xiao, Z. Cai, and X. Jia, "The effectiveness of data augmentation in code readability classification," Information and Software Technology, vol. 129, p. 106378, 2021.
L. Sun, C. Xia, W. Yin, T. Liang, P. S. Yu, and L. He, "Mixup-transformer: dynamic data augmentation for nlp tasks," arXiv preprint arXiv:2010.02394, 2020.
J. Chen, Z. Yang, and D. Yang, "Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification," arXiv preprint arXiv:2004.12239, 2020.
R. Zhang, Y. Yu, and C. Zhang, "Seqmix: Augmenting active sequence labeling via sequence mixup," arXiv preprint arXiv:2010.02322, 2020.
D. Walawalkar, Z. Shen, Z. Liu, and M. Savvides, "Attentive cutmix: An enhanced data augmentation approach for deep learning based image classification," arXiv preprint arXiv:2003.13048, 2020.
A. Uddin, M. Monira, W. Shin, T. Chung, S.-H. Bae et al., "Saliencymix: A saliency guided data augmentation strategy for better regularization," arXiv preprint arXiv:2006.01791, 2020.
J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang, "Resizemix: Mixing data with preserved object information and true labels," arXiv preprint arXiv:2012.11101, 2020.
J.-H. Kim, W. Choo, and H. O. Song, "Puzzle mix: Exploiting saliency and local statistics for optimal mixup," in International Conference on Machine Learning. PMLR, 2020, pp. 5275-5285.
J.-H. Kim, W. Choo, H. Jeong, and H. O. Song, "Co-mixup: Saliency guided joint mixup with supermodular diversity," arXiv preprint arXiv:2102.03065, 2021.
V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, "Manifold mixup: Better representations by interpolating hidden states," in International Conference on Machine Learning. PMLR, 2019, pp. 6438-6447.
S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "Cutmix: Regularization strategy to train strong classifiers with localizable features," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023-6032.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
Z. Liu, S. Li, D. Wu, Z. Chen, L. Wu, J. Guo, and S. Z. Li, "Unveiling the power of mixup for stronger classifiers," arXiv preprint arXiv:2103.13027, 2021.
S. Yoon, G. Kim, and K. Park, "Ssmix: Saliency-based span mixup for text classification," arXiv preprint arXiv:2106.08062, 2021.
Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, "Mixup for node and graph classification," in Proceedings of the Web Conference 2021, 2021, pp. 3663-3674.