M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, "A survey of machine learning for big code and naturalness," ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1-37, 2018.
Q. U. Ain, W. H. Butt, M. W. Anwar, F. Azam, and B. Maqbool, "A systematic review on code clone detection," IEEE Access, vol. 7, pp. 86121-86144, 2019.
Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "VulDeePecker: A deep learning-based system for vulnerability detection," arXiv preprint arXiv:1801.01681, 2018.
R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker et al., "CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks," 2021.
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in ICML, vol. 70, 2017, pp. 1263-1272.
Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," in NeurIPS, 2019.
D. Hendrycks and T. Dietterich, "Benchmarking neural network robustness to common corruptions and perturbations," in ICLR, 2019.
S.-A. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. A. Mann, "Data augmentation can improve robustness," in NeurIPS, vol. 34, 2021.
L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," arXiv preprint arXiv:1712.04621, 2017.
M. Allamanis, H. R. Jackson-Flux, and M. Brockschmidt, "Self-supervised bug detection and repair," in NeurIPS, 2021.
S. Yu, T. Wang, and J. Wang, "Data augmentation by program transformation," Journal of Systems and Software, vol. 190, p. 111304, 2022.
R. Gupta, S. Pal, A. Kanade, and S. Shevade, "DeepFix: Fixing common C language errors by deep learning," in AAAI, 2017, pp. 1345-1351.
E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang, "Hoppity: Learning graph transformations to detect and fix bugs in programs," in ICLR, 2020.
D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, "Measuring coding challenge competence with APPS," in NeurIPS Datasets and Benchmarks Track (Round 2), 2021.
L. Zhang, G. Rosenblatt, E. Fetaya, R. Liao, W. Byrd, M. Might, R. Urtasun, and R. Zemel, "Neural guided constraint logic programming for program synthesis," NeurIPS, vol. 31, 2018.
X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, "Deep code comment generation," in ICPC, 2018, pp. 200-210.
W. Wang, G. Li, B. Ma, X. Xia, and Z. Jin, "Detecting code clones with graph neural network and flow-augmented abstract syntax tree," in SANER, 2020, pp. 261-271.
D. A. Van Dyk and X.-L. Meng, "The art of data augmentation," Journal of Computational and Graphical Statistics, vol. 10, no. 1, pp. 1-50, 2001.
B. Li, Y. Hou, and W. Che, "Data augmentation approaches in natural language processing: A survey," AI Open, vol. 3, pp. 71-90, 2022.
S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, "A survey of data augmentation approaches for NLP," in ACL-IJCNLP, 2021, pp. 968-988.
M. V. Pour, Z. Li, L. Ma, and H. Hemmati, "A search-based testing framework for deep neural networks of source code embedding," in ICST, 2021, pp. 36-46.
N. Yefet, U. Alon, and E. Yahav, "Adversarial examples for models of code," OOPSLA, vol. 4, pp. 1-30, 2020.
N. D. Bui, Y. Yu, and L. Jiang, "Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations," in SIGIR, 2021, pp. 511-521.
D. Wang, Z. Jia, S. Li, Y. Yu, Y. Xiong, W. Dong, and X. Liao, "Bridging pre-trained models and downstream tasks for source code understanding," in ICSE, 2022, pp. 287-298.
X. Gao, R. K. Saha, M. R. Prasad, and A. Roychoudhury, "Fuzz testing based data augmentation to improve robustness of deep neural networks," in ICSE, 2020, pp. 1147-1158.
A. Kaur and M. Kaur, "Analysis of code refactoring impact on software quality," in MATEC Web of Conferences, vol. 57, 2016, p. 02012.
G. Lacerda, F. Petrillo, M. Pimenta, and Y. G. Guéhéneuc, "Code smells and refactoring: A tertiary systematic review of challenges and observations," Journal of Systems and Software, vol. 167, p. 110610, 2020.
Z. Dong, Q. Hu, Y. Guo, M. Cordy, M. Papadakis, Z. Zhang, Y. L. Traon, and J. Zhao, "MixCode: Enhancing code classification by mixup-based data augmentation," in SANER, 2023, pp. 379-390.
P. Bielik and M. Vechev, "Adversarial robustness for code," in ICML, vol. 119, 2020, pp. 896-907.
T. Zhao, G. Liu, S. Günnemann, and M. Jiang, "Graph data augmentation for graph machine learning: A survey," arXiv preprint arXiv:2202.08871, 2022.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," JMLR, vol. 15, no. 1, pp. 1929-1958, 2014.
W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang, "Graph random neural networks for semi-supervised learning on graphs," in NeurIPS, vol. 33, 2020, pp. 22092-22103.
Y. Rong, W. Huang, T. Xu, and J. Huang, "DropEdge: Towards deep graph convolutional networks on node classification," in ICLR, 2020.
Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, "GraphCrop: Subgraph cropping for graph classification," 2020.
L. Zhang, Z. Deng, K. Kawaguchi, A. Ghorbani, and J. Zou, "How does mixup help with robustness and generalization?" in ICLR, 2021.
T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah, "Data augmentation for graph neural networks," AAAI, vol. 35, no. 12, pp. 11015-11023, 2021.
V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, "Manifold mixup: Better representations by interpolating hidden states," in ICML, 2019, pp. 6438-6447.
Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, "Mixup for node and graph classification," in Proceedings of the Web Conference, 2021, pp. 3663-3674.
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in ICLR, 2018.
Y. Hu, U. Z. Ahmed, S. Mechtaev, B. Leong, and A. Roychoudhury, "Re-factoring based program repair applied to programming assignments," in ASE, 2019, pp. 388-398.
H. Zhong and Z. Su, "An empirical study on real bug fixes," in ICSE, vol. 1, 2015, pp. 913-923.
Z. Yang, J. Shi, J. He, and D. Lo, "Natural attack for pre-trained models of code," in ICSE, 2022, pp. 1482-1493.
J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in ICSME, 2014, pp. 476-480.
T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in ICLR, 2019.
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in ICLR, 2018.
Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, "Gated graph sequence neural networks," in ICLR, 2016.
W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in NeurIPS, 2017, pp. 1025-1035.
M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," in ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
Z. Chen and M. Monperrus, "The CodRep machine learning on source code competition," 2018.
C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, pp. 1-48, 2019.
H. Li, C. Miao, C. Leung, Y. Huang, Y. Huang, H. Zhang, and Y. Wang, "Exploring representation-level augmentation for code search," in EMNLP, 2022, pp. 4924-4936.
Z. Dong, Q. Hu, Y. Guo, M. Cordy, M. Papadakis, Y. L. Traon, and J. Zhao, "Enhancing mixup-based graph learning for language processing via hybrid pooling," arXiv preprint arXiv:2210.03123, 2022.
J. K. Siow, S. Liu, X. Xie, G. Meng, and Y. Liu, "Learning program semantics with code representations: An empirical study," in SANER, 2022, pp. 554-565.
C. Niu, C. Li, V. Ng, D. Chen, J. Ge, and B. Luo, "An empirical comparison of pre-trained models of source code," 2023.
A. Mastropaolo, L. Pascarella, E. Guglielmi, M. Ciniselli, S. Scalabrino, R. Oliveto, and G. Bavota, "On the robustness of code generation techniques: An empirical study on GitHub Copilot," 2023.
S. Yan, H. Yu, Y. Chen, B. Shen, and L. Jiang, "Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries," 2020.
Q. Hu, Y. Guo, X. Xie, M. Cordy, L. Ma, M. Papadakis, and Y. L. Traon, "CodeS: Towards code model generalization under distribution shift," in ICSE: New Ideas and Emerging Results (NIER), 2023.