Abstract Syntax Trees; Applications of AI; Artificial Intelligence; Code representation; Code typesetting; Neural networks; Performance; Rich structure; Spatial-aware neural network; State of the art; Structural information; Software
Abstract:
[en] Code representation is a key step in applying AI to software engineering. Generic NLP representations are effective but do not exploit all of the rich structure inherent in code. Recent work has focused on extracting abstract syntax trees (ASTs) and integrating their structural information into code representations. These AST-enhanced representations advanced the state of the art and accelerated new applications of AI to software engineering. ASTs, however, neglect important aspects of code structure, notably control and data flow, leaving some potentially relevant code signal unexploited. For example, purely image-based representations perform nearly as well as AST-based representations, despite the fact that they must learn even to recognize tokens, let alone their semantics. This result, from prior work, is strong evidence that these new code representations can still be improved; it also raises the question of just what signal image-based approaches are exploiting. We answer this question. We show that code is spatial and exploit this fact to propose CodeGrid, a new representation that embeds tokens into a grid that preserves code layout. Unlike some of the existing state of the art, CodeGrid is agnostic to the downstream task: whether that task is generation or classification, CodeGrid can complement the learning algorithm with spatial signal. For example, we show that CNNs, which are inherently spatially aware models, can exploit CodeGrid outputs to effectively tackle fundamental software engineering tasks, such as code classification, code clone detection, and vulnerability detection. PixelCNN leverages CodeGrid's grid representations to achieve code completion. Through extensive experiments, we validate our spatial code hypothesis, quantifying model performance as we vary the degree to which the representation preserves the grid.
To demonstrate its generality, we show that CodeGrid augments existing models, improving their performance on a range of tasks. On clone detection, CodeGrid improves ASTNN's F1 score by 3.3%.
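The core idea described in the abstract, embedding tokens into a grid that preserves code layout, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the hash-based toy embedding are all assumptions made purely for illustration.

```python
# Minimal sketch of a CodeGrid-style, layout-preserving grid representation.
# Assumptions (not from the paper): whitespace tokenization, a toy hash-based
# token embedding, and a fixed grid size with truncation.
import numpy as np

def code_to_grid(source: str, embed, dim: int, rows: int, cols: int):
    """Place each token's embedding at its (line, column) start position."""
    grid = np.zeros((rows, cols, dim))
    for r, line in enumerate(source.splitlines()[:rows]):
        col = 0
        for tok in line.split():
            c = line.index(tok, col)  # column where this token starts
            if c < cols:
                grid[r, c] = embed(tok)
            col = c + len(tok)
    return grid

def toy_embed(tok: str, dim: int = 8):
    """Illustrative pseudo-embedding: a fixed random vector per token."""
    rng = np.random.default_rng(abs(hash(tok)) % (2**32))
    return rng.standard_normal(dim)

snippet = "def add(a, b):\n    return a + b\n"
g = code_to_grid(snippet, toy_embed, dim=8, rows=4, cols=32)
print(g.shape)  # (4, 32, 8)
```

A spatially aware model such as a CNN can then consume `g` like an image, which is the property the abstract exploits: indentation and column alignment survive in the representation instead of being flattened into a token sequence.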
Disciplines:
Computer science
Author, co-author:
KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Barr, Earl T.; University College London, United Kingdom ; Google DeepMind
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Bissyandé, Tegawendé F.; University of Luxembourg, Luxembourg
External co-authors:
yes
Document language:
English
Title:
CodeGrid: A Grid Representation of Code
Publication date:
12 July 2023
Event name:
Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
Event location:
Seattle, USA
Event dates:
17-07-2023 to 21-07-2023
Event scope:
International
Title of the main work:
ISSTA 2023 - Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis
This work was partly supported (1) by the Luxembourg National Research Fund (FNR) - NERVE project, ref. 14591304, (2) by the Luxembourg Ministry of Foreign and European Affairs through their Digital4Development (D4D) portfolio under project LuxWAyS and (3) by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Project NATURAL - grant agreement N° 949014).