From 848905645a5a895f2ba11bc5484d60779e679700 Mon Sep 17 00:00:00 2001 From: Moshi Wei Date: Sun, 12 Jul 2020 22:49:06 -0400 Subject: [PATCH 001/297] Update leclair2019neural.markdown --- _publications/leclair2019neural.markdown | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_publications/leclair2019neural.markdown b/_publications/leclair2019neural.markdown index 2f12444d..972969d7 100644 --- a/_publications/leclair2019neural.markdown +++ b/_publications/leclair2019neural.markdown @@ -5,6 +5,9 @@ authors: A. LeClair, S. Jiang, C. McMillan conference: ICSE year: 2019 bibkey: leclair2019neural +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/pdf/1902.01954.pdf"} + - {name: "Code and Data", url: "https://s3.us-east-2.amazonaws.com/icse2018/index.html"} tags: ["summarization", "documentation"] --- Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature. From bc8c4ba2dc526f5c62db5873a5fde577a84f3e21 Mon Sep 17 00:00:00 2001 From: Moshi Wei Date: Sun, 12 Jul 2020 22:20:17 -0400 Subject: [PATCH 002/297] added code link Just want to add another missing code link --- _publications/karampatsis2019deep.markdown | 1 + 1 file changed, 1 insertion(+) diff --git a/_publications/karampatsis2019deep.markdown b/_publications/karampatsis2019deep.markdown index a6a1e878..554608a7 100644 --- a/_publications/karampatsis2019deep.markdown +++ b/_publications/karampatsis2019deep.markdown @@ -7,6 +7,7 @@ year: 2019 bibkey: karampatsis2019deep additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/1903.05734"} + - {name: "Code", url: "https://github.com/mast-group/OpenVocabCodeNLM"} tags: ["language model"] --- Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code has not been introduced in the literature.
We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.
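The open-vocabulary idea above comes down to segmenting rare identifiers into subword units chosen by a BPE-style compression criterion, so that no token is ever out-of-vocabulary. A minimal sketch of the segmentation step, assuming a toy hand-written subword vocabulary (the paper learns its vocabulary from a corpus; the authors' actual implementation lives in the OpenVocabCodeNLM repository linked in the patch above):

```python
# Greedy longest-match segmentation of a code token into subword units.
# The merge vocabulary here is a hypothetical toy example; in the paper
# it is learned from data with a BPE-style compression criterion.

def segment(token, subwords):
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):        # try longest match first
            if token[i:j] in subwords or j == i + 1:
                pieces.append(token[i:j])         # fall back to single chars
                i = j
                break
    return pieces

vocab = {"get", "File", "Name", "read", "er"}
print(segment("getFileName", vocab))  # ['get', 'File', 'Name']
print(segment("readerXy", vocab))     # ['read', 'er', 'X', 'y'] -- never OOV
```

Because unmatched characters degrade gracefully to single-character pieces, any identifier can be represented, which is exactly the property a fixed-vocabulary model lacks.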
From bc8c4ba2dc526f5c62db5873a5fde577a84f3e21 Mon Sep 17 00:00:00 2001 From: Moshi Wei Date: Sun, 12 Jul 2020 22:14:19 -0400 Subject: [PATCH 003/297] added code links Hello! I am a DL researcher and I really like the work you guys have done. I want to contribute as well! --- _publications/ahmad2020transformer.markdown | 1 + 1 file changed, 1 insertion(+) diff --git a/_publications/ahmad2020transformer.markdown b/_publications/ahmad2020transformer.markdown index d74c7dae..48903d5f 100644 --- a/_publications/ahmad2020transformer.markdown +++ b/_publications/ahmad2020transformer.markdown @@ -7,6 +7,7 @@ year: 2020 bibkey: ahmad2020transformer additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2005.00653"} + - {name: "Code", url: "https://github.com/wasiahmad/NeuralCodeSum"} tags: ["summarization"] --- Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has shown to be effective in capturing long-range dependencies. In this work, we show that despite the approach being simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens' position hinders, while relative encoding significantly improves the summarization performance. We have made our code publicly available to facilitate future research. From 1d7021cbbc665ff2dbe7a2350b2d546fb50171c9 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 13 Jul 2020 09:29:21 +0100 Subject: [PATCH 004/297] Add Graph4Code --- _publications/abdelaziz2020graph4code.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/abdelaziz2020graph4code.markdown diff --git a/_publications/abdelaziz2020graph4code.markdown b/_publications/abdelaziz2020graph4code.markdown new file mode 100644 index 00000000..7bbffa5c --- /dev/null +++ b/_publications/abdelaziz2020graph4code.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Graph4Code: A Machine Interpretable Knowledge Graph for Code" +authors: Ibrahim Abdelaziz, Julian Dolby, James P. McCusker, Kavitha Srinivas +conference: +year: 2020 +bibkey: abdelaziz2020graph4code +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2002.09440"} +tags: ["dataset"] +--- +Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use. From b4b07a2213ddd9b93c612e6850df65a8e08a9e17 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 13 Jul 2020 09:31:48 +0100 Subject: [PATCH 005/297] Minor additions. --- _publications/abdelaziz2020graph4code.markdown | 1 + resources.md | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/_publications/abdelaziz2020graph4code.markdown b/_publications/abdelaziz2020graph4code.markdown index 7bbffa5c..81450aac 100644 --- a/_publications/abdelaziz2020graph4code.markdown +++ b/_publications/abdelaziz2020graph4code.markdown @@ -7,6 +7,7 @@ year: 2020 bibkey: abdelaziz2020graph4code additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2002.09440"} + - {name: "Website", url: "https://wala.github.io/graph4code/"} tags: ["dataset"] --- Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use.
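Because Graph4Code is distributed as RDF with named graphs, SPARQL is the natural query interface. A hedged sketch using rdflib over a toy in-memory graph — the `ex:` namespace and property names below are illustrative placeholders, not the project's real schema (the linked website documents the actual one):

```python
# Toy Graph4Code-style query with rdflib; namespace and properties are made up.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/graph4code/")  # hypothetical namespace

g = Graph()
g.add((EX["pandas.read_csv"], RDF.type, EX.Function))
g.add((EX["pandas.read_csv"], EX.documentation,
       Literal("Read a comma-separated values (csv) file into DataFrame.")))

# Find every function node together with its attached documentation.
query = """
    SELECT ?func ?doc WHERE {
        ?func a ex:Function ;
              ex:documentation ?doc .
    }"""
for func, doc in g.query(query, initNs={"ex": EX}):
    print(func, "->", doc)
```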
diff --git a/resources.md b/resources.md index 0a3a7f6a..8e12e12d 100644 --- a/resources.md +++ b/resources.md @@ -40,6 +40,7 @@ The last few years a few workshops have been organized in this area. Please, fee * [Software Analysis](http://rightingcode.org/) in Univ. of Pennsylvania. It is a great introduction to Program Analysis [[videos](https://www.youtube.com/playlist?list=PLF3-CvSRq2SaApl3Lnu6Tu_ecsBr94543)] ### Competitions +* [nlc2cmd](http://nlc2cmd.us-east.mybluemix.net/#/) in NeurIPS 2020 by Project CLAI. Starts July 2020. * [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://github.com/github/CodeSearchNet) by Github. Starts Sep 2019. * [CodRep 2019: Machine Learning on Source Code Competition](https://github.com/KTH/codrep-2019) by KTH. Starts on April 25th 2019. * [CodRep 2018: Machine Learning on Source Code Competition](https://github.com/KTH/CodRep-competition) by KTH. Starts on April 14th 2018. @@ -49,4 +50,4 @@ papers in the area. You can access the list [here](https://github.com/src-d/awesome-machine-learning-on-source-code). * [Automated Program Repair](https://www.monperrus.net/martin/automatic-software-repair) has a curated list of pointers for helping newcomers to understand the field, -maintained by [Martin Monperrus](www.monperrus.net). \ No newline at end of file +maintained by [Martin Monperrus](www.monperrus.net). From 0525be02d4e20cfd9ee5a9c4af2c0f59ba920683 Mon Sep 17 00:00:00 2001 From: Reshinth Adithyan <36307201+reshinthadithyan@users.noreply.github.com> Date: Wed, 15 Jul 2020 05:44:18 +0530 Subject: [PATCH 006/297] Create jain2020contrastive.markdown Added Contrastive Code Representation Learning. --- _publications/jain2020contrastive.markdown | 26 ++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 _publications/jain2020contrastive.markdown diff --git a/_publications/jain2020contrastive.markdown b/_publications/jain2020contrastive.markdown new file mode 100644 index 00000000..a9c0c6f7 --- /dev/null +++ b/_publications/jain2020contrastive.markdown @@ -0,0 +1,26 @@ +--- +layout: publication +title: "Contrastive Code Representation Learning" +authors: Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica +conference: ICML +year: 2020 +bibkey: jain2020contrastive +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2007.04973"} + - {name: "Website", url: "https://parasj.github.io/contracode/"} + - {name: "GitHub", url : "https://github.com/parasj/contracode"} +tags: ["Contrastive Learning"] +--- +Machine-aided programming tools such as type predictors and code summarizers +are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised +algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on +the raw text of programs. In particular, we design an unsupervised pretext task by +generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to +identify variants of an anchor program within a large batch of negatives.
To solve +this task, the network must extract program features representing the functionality, +not form, of the program. This is the first application of instance discrimination +to code representation learning to our knowledge. We pre-train models over 1.8m +unannotated JavaScript methods mined from GitHub. ContraCode pre-training +improves code summarization accuracy by 7.9% over supervised approaches and +4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves +the accuracy of existing baselines.
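The pretext task just described is, at its core, a batch-wise instance-discrimination (InfoNCE) objective: each program embedding must identify its own semantics-preserving variant among every other program in the batch. A minimal sketch assuming PyTorch and pre-computed embeddings (ContraCode's real encoders and compiler transforms are in the repository linked above):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, variant_emb, temperature=0.07):
    """InfoNCE over a batch: positives on the diagonal, negatives elsewhere."""
    a = F.normalize(anchor_emb, dim=1)
    v = F.normalize(variant_emb, dim=1)
    logits = a @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))       # i-th anchor matches i-th variant
    return F.cross_entropy(logits, targets)

# 8 programs and their transformed variants, embedded into 128 dimensions.
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```

Maximizing the diagonal of the similarity matrix while pushing down the rest is what forces the encoder to represent functionality rather than surface form.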
From 5d4d68a43a5062d236314a0cb25846bb596ee783 Mon Sep 17 00:00:00 2001 From: Reshinth Adithyan <36307201+reshinthadithyan@users.noreply.github.com> Date: Wed, 15 Jul 2020 15:42:17 +0530 Subject: [PATCH 007/297] Update _publications/jain2020contrastive.markdown Co-authored-by: Miltos --- _publications/jain2020contrastive.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/jain2020contrastive.markdown b/_publications/jain2020contrastive.markdown index a9c0c6f7..ce5d9184 100644 --- a/_publications/jain2020contrastive.markdown +++ b/_publications/jain2020contrastive.markdown @@ -9,7 +9,7 @@ additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2007.04973"} - {name: "Website", url: "https://parasj.github.io/contracode/"} - {name: "GitHub", url : "https://github.com/parasj/contracode"} -tags: ["Contrastive Learning"] +tags: ["representation", "pretraining"] --- Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised From 4f39a13baa6daa56156792ce01efae21cf303c13 Mon Sep 17 00:00:00 2001 From: Reshinth Adithyan <36307201+reshinthadithyan@users.noreply.github.com> Date: Wed, 15 Jul 2020 15:43:45 +0530 Subject: [PATCH 008/297] Update _publications/jain2020contrastive.markdown Co-authored-by: Miltos --- _publications/jain2020contrastive.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/jain2020contrastive.markdown b/_publications/jain2020contrastive.markdown index ce5d9184..5759f90a 100644 --- a/_publications/jain2020contrastive.markdown +++ b/_publications/jain2020contrastive.markdown @@ -2,7 +2,7 @@ layout: publication title: "Contrastive Code Representation Learning" authors: Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica -conference: ICML +conference: year: 2020 bibkey: jain2020contrastive additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2007.04973"} - {name: "Website", url: "https://parasj.github.io/contracode/"} - {name: "GitHub", url : "https://github.com/parasj/contracode"} From 014e1846574513ea43a9372188261a9bc036e12f Mon Sep 17 00:00:00 2001 From: Reshinth Adithyan <36307201+reshinthadithyan@users.noreply.github.com> Date: Thu, 30 Jul 2020 12:25:26 +0530 Subject: [PATCH 009/297] Update lachaux2020unsupervised.markdown --- _publications/lachaux2020unsupervised.markdown | 1 + 1 file changed, 1 insertion(+) diff --git a/_publications/lachaux2020unsupervised.markdown b/_publications/lachaux2020unsupervised.markdown index 85785810..110b8c2d 100644 --- a/_publications/lachaux2020unsupervised.markdown +++ b/_publications/lachaux2020unsupervised.markdown @@ -7,6 +7,7 @@ year: 2020 bibkey: lachaux2020unsupervised additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2006.03511"} + - {name: "GitHub", url: "https://github.com/facebookresearch/TransCoder"} tags: ["migration"] --- A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin. From 2c602f9632e1690626ce2a84d7b07731a4dba1d4 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 1 Aug 2020 20:10:31 +0100 Subject: [PATCH 010/297] Add papers.
--- _publications/brody2020neural.markdown | 12 ++++++++++++ _publications/nair2020funcgnn.markdown | 12 ++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 _publications/brody2020neural.markdown create mode 100644 _publications/nair2020funcgnn.markdown diff --git a/_publications/brody2020neural.markdown b/_publications/brody2020neural.markdown new file mode 100644 index 00000000..5fb0a1f2 --- /dev/null +++ b/_publications/brody2020neural.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Neural Edit Completion" +authors: Shaked Brody, Uri Alon, Eran Yahav +conference: +year: 2020 +bibkey: brody2020neural +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2005.13209"} +tags: ["edit", "AST", "autocomplete"] +--- +We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program's Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. We make our code, dataset, and trained models publicly available. diff --git a/_publications/nair2020funcgnn.markdown b/_publications/nair2020funcgnn.markdown new file mode 100644 index 00000000..7c8c9ec3 --- /dev/null +++ b/_publications/nair2020funcgnn.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "funcGNN: A Graph Neural Network Approach to Program Similarity" +authors: Aravind Nair, Avijit Roy, Karl Meinke +conference: ESEM +year: 2020 +bibkey: nair2020funcgnn +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2007.13239"} +tags: ["GNN", "clone"] +--- +Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. 
We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN possesses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions. From 8f3e996e785d9f64b4b9056eba7c6b705e430565 Mon Sep 17 00:00:00 2001 From: Miltos Date: Sun, 2 Aug 2020 11:25:10 +0100 Subject: [PATCH 011/297] Update _config.yml --- _config.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_config.yml b/_config.yml index b99a8666..2b7f88d2 100644 --- a/_config.yml +++ b/_config.yml @@ -13,3 +13,6 @@ collections: plugins_dir: - jekyll-sitemap - jekyll-seo-tag + +sass: + style: compressed From bcbb0bb5d2d9d80075c3b7bf4ca7e085f39f19ba Mon Sep 17 00:00:00 2001 From: Anton Date: Sun, 2 Aug 2020 18:14:15 +0300 Subject: [PATCH 012/297] new paper added paper MISIM: An End-to-End Neural Code Similarity System added --- _publications/ye2020misim.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/ye2020misim.markdown diff --git a/_publications/ye2020misim.markdown b/_publications/ye2020misim.markdown new file mode 100644 index 00000000..2131ee74 --- /dev/null +++ b/_publications/ye2020misim.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "MISIM: An End-to-End Neural Code Similarity System" +authors: Fangke Ye and Shengtian Zhou and Anand Venkat and Ryan Marcus and Nesime Tatbul and Jesmin Jahan Tithi and Paul Petersen and Timothy Mattson and Tim Kraska and Pradeep Dubey and Vivek Sarkar and Justin Gottschlich +conference: +year: 2020 +bibkey: ye2020misim +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2006.05265"} +tags: ["code similarity"] +--- +Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).
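funcGNN and MISIM above share one recipe: encode each program into a vector with a learned encoder, then score pairs with a simple similarity function. A deliberately crude sketch of the scoring half, with a bag-of-tokens encoder standing in for the learned GNN or context-aware structure (which is where all the real work happens):

```python
import math
from collections import Counter

def encode(code):
    """Stand-in encoder: bag of whitespace-separated tokens."""
    return Counter(code.split())

def similarity(code_a, code_b):
    """Cosine similarity between the two program vectors."""
    va, vb = encode(code_a), encode(code_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(similarity("x = a + b ; return x", "y = a + b ; return y"))
```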
From 77459e8e49ec1852d26d676177bb84af2edb4602 Mon Sep 17 00:00:00 2001 From: Miltos Date: Sun, 2 Aug 2020 18:53:23 +0100 Subject: [PATCH 013/297] Adjust author list with commas --- _publications/ye2020misim.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/ye2020misim.markdown b/_publications/ye2020misim.markdown index 2131ee74..ccf641f4 100644 --- a/_publications/ye2020misim.markdown +++ b/_publications/ye2020misim.markdown @@ -1,7 +1,7 @@ --- layout: publication title: "MISIM: An End-to-End Neural Code Similarity System" -authors: Fangke Ye and Shengtian Zhou and Anand Venkat and Ryan Marcus and Nesime Tatbul and Jesmin Jahan Tithi and Paul Petersen and Timothy Mattson and Tim Kraska and Pradeep Dubey and Vivek Sarkar and Justin Gottschlich +authors: Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich conference: year: 2020 bibkey: ye2020misim From 3265fc3e0225698c9c154d75b5f2d7e980c172f3 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 8 Aug 2020 13:12:33 +0300 Subject: [PATCH 014/297] Add Rabin et al. --- _publications/rabin2020generalizability.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/rabin2020generalizability.markdown diff --git a/_publications/rabin2020generalizability.markdown b/_publications/rabin2020generalizability.markdown new file mode 100644 index 00000000..10f65219 --- /dev/null +++ b/_publications/rabin2020generalizability.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "On the Generalizability of Neural Program Analyzers with respect to Semantic-Preserving Program Transformations" +authors: Md. Rafiqul Islam Rabin, Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour +conference: +year: 2020 +bibkey: rabin2020generalizability +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.01566"} +tags: ["adversarial", "GNN", "AST"] +--- +With the prevalence of publicly available source code repositories to train deep neural network models, neural program analyzers can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analyzers. Although such analyzers have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is impossible to test neural program analyzers on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program analyzers with respect to semantic-preserving transformations: a generalizable neural program analyzer should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. More specifically, we compare the results of various neural program analyzers for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and Gated Graph Neural Networks (GGNN), to build nine such neural program analyzers for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program analyzers often fail to generalize their performance.
Our results also suggest that neural program analyzers based on data and control dependencies in programs generalize better than neural program analyzers based only on abstract syntax trees. On the positive side, we observe that as the size of training dataset grows and diversifies the generalizability of correct predictions produced by the analyzers can be improved too. From f70ea26df925ea44605dc5f7afe85f8ebbbd2bed Mon Sep 17 00:00:00 2001 From: Marcelo Martins Date: Mon, 7 Sep 2020 09:45:32 -0300 Subject: [PATCH 015/297] add a paper and a code link (CoNCRA: A Convolutional Neural Network Code Retrieval Approach) --- _publications/derezendemartins2020concra.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 _publications/derezendemartins2020concra.md diff --git a/_publications/derezendemartins2020concra.md b/_publications/derezendemartins2020concra.md new file mode 100644 index 00000000..883c2851 --- /dev/null +++ b/_publications/derezendemartins2020concra.md @@ -0,0 +1,14 @@ +--- +layout: publication +title: "CoNCRA: A Convolutional Neural Network Code Retrieval Approach" +authors: Marcelo de Rezende Martins and Marco Aurélio Gerosa +conference: SBES '20 +year: 2020 +bibkey: derezendemartins2020concra +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.01959"} + - {name: "code", url: "https://github.com/mrezende/concra"} +tags: ["search"] +--- +Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language. We evaluated our approach's efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions by almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval. + From afc77d9a772913ab4417f7451dde75e9326f9b71 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 7 Sep 2020 20:06:12 +0100 Subject: [PATCH 016/297] Update keyword to pre-existing one --- _publications/derezendemartins2020concra.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/derezendemartins2020concra.md b/_publications/derezendemartins2020concra.md index 883c2851..3dcc4f87 100644 --- a/_publications/derezendemartins2020concra.md +++ b/_publications/derezendemartins2020concra.md @@ -8,7 +8,7 @@ bibkey: derezendemartins2020concra additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2009.01959"} - {name: "code", url: "https://github.com/mrezende/concra"} -tags: ["search"] +tags: ["retrieval"] --- Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer's intent, expressed in natural language. 
We evaluated our approach's efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions by almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval. From 3b5517597f752c7f4e0aaa1a98abb0b4d774b686 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 11 Sep 2020 09:38:39 +0100 Subject: [PATCH 017/297] Add Johnson et al. --- _publications/johnson2020learning.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/johnson2020learning.markdown diff --git a/_publications/johnson2020learning.markdown b/_publications/johnson2020learning.markdown new file mode 100644 index 00000000..34014b80 --- /dev/null +++ b/_publications/johnson2020learning.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Learning Graph Structure With A Finite-State Automaton Layer" +authors: Daniel D. Johnson, Hugo Larochelle, Daniel Tarlow +conference: +year: 2020 +bibkey: johnson2020learning +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2007.04929"} +tags: ["GNN", "program analysis"] +--- +Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types. From 5844b5b11f7b48e4e0e865e4f788933cbcbc11a6 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 11 Sep 2020 12:50:13 +0100 Subject: [PATCH 018/297] Add two more papers. 
--- _publications/heyman2020neural.markdown | 12 ++++++++++++ _publications/mammadli2020static.markdown | 12 ++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 _publications/heyman2020neural.markdown create mode 100644 _publications/mammadli2020static.markdown diff --git a/_publications/heyman2020neural.markdown b/_publications/heyman2020neural.markdown new file mode 100644 index 00000000..c502e194 --- /dev/null +++ b/_publications/heyman2020neural.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent" +authors: Geert Heyman, Tom Van Cutsem +conference: +year: 2020 +bibkey: heyman2020neural +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.12193"} +tags: ["search"] +--- +In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to 20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code. diff --git a/_publications/mammadli2020static.markdown b/_publications/mammadli2020static.markdown new file mode 100644 index 00000000..7632d77f --- /dev/null +++ b/_publications/mammadli2020static.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Static Neural Compiler Optimization via Deep Reinforcement Learning" +authors: Rahim Mammadli, Ali Jannesari, Felix Wolf +conference: +year: 2020 +bibkey: mammadli2020static +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.08951"} +tags: ["compilation"] +--- +The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence developers have to choose the set of optimization passes, their parameters and ordering within a sequence. Resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade the performance when compared to unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. 
We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace the hand-crafted optimization sequences. From 8470ffd672f54132ca84ca83023572aca9664ef9 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 22 Sep 2020 08:44:30 +0100 Subject: [PATCH 019/297] Add workshop "survey-like" paper. --- _publications/devanbu2020deep.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/devanbu2020deep.markdown diff --git a/_publications/devanbu2020deep.markdown b/_publications/devanbu2020deep.markdown new file mode 100644 index 00000000..437d4546 --- /dev/null +++ b/_publications/devanbu2020deep.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Deep Learning & Software Engineering: State of Research and Future Directions" +authors: Prem Devanbu, Matthew Dwyer, Sebastian Elbaum, Michael Lowry, Kevin Moran, Denys Poshyvanyk, Baishakhi Ray, Rishabh Singh, Xiangyu Zhang +conference: +year: 2020 +bibkey: devanbu2020deep +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.08525"} +tags: ["survey"] +--- +Given the current transformative potential of research that sits at the intersection of Deep Learning (DL) and Software Engineering (SE), an NSF-sponsored community workshop was conducted in co-location with the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE'19) in San Diego, California. The goal of this workshop was to outline high priority areas for cross-cutting research. While a multitude of exciting directions for future work were identified, this report provides a general summary of the research areas representing the areas of highest priority which were discussed at the workshop. The intent of this report is to serve as a potential roadmap to guide future work that sits at the intersection of SE & DL. From 93408789929a0b3bbf47769fb90eb41262214aea Mon Sep 17 00:00:00 2001 From: Shubhadeep Roychowdhury Date: Mon, 21 Sep 2020 15:54:58 +0200 Subject: [PATCH 020/297] adding GraphCodeBERT --- _publications/guo2020graphcodebert.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/guo2020graphcodebert.markdown diff --git a/_publications/guo2020graphcodebert.markdown b/_publications/guo2020graphcodebert.markdown new file mode 100644 index 00000000..1e17918a --- /dev/null +++ b/_publications/guo2020graphcodebert.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: GraphCodeBERT: Pre-training Code Representations with Data Flow +authors: Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, Ming Zhou +year: 2020 +bibkey: guo2020graphcodebert +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.08366"} +tags: ["pretraining"] +--- +Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. 
Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
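The "where-the-value-comes-from" relation that GraphCodeBERT pre-trains on can be approximated for straight-line Python with the standard ast module. A toy sketch under that assumption — GraphCodeBERT itself extracts data flow with a fuller, multi-language analysis; this version only links each variable use to its most recent prior assignment:

```python
import ast

def dataflow_edges(source):
    """Return (use_line, def_line, name) edges for straight-line code."""
    last_def, edges = {}, []
    for stmt in ast.parse(source).body:
        # Uses in this statement point back to earlier definitions...
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((node.lineno, last_def[node.id], node.id))
        # ...then this statement's own targets become the latest definitions.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = node.lineno
    return edges

print(dataflow_edges("a = 1\nb = a + 2\nc = a + b\n"))
# [(2, 1, 'a'), (3, 1, 'a'), (3, 2, 'b')]
```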
From 7a9b90d28532638d937df4cc9fc9285c25a44a67 Mon Sep 17 00:00:00 2001 From: Shubhadeep Roychowdhury Date: Tue, 22 Sep 2020 10:54:29 +0200 Subject: [PATCH 021/297] Update guo2020graphcodebert.markdown --- _publications/guo2020graphcodebert.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/guo2020graphcodebert.markdown b/_publications/guo2020graphcodebert.markdown index 1e17918a..2d5d07dd 100644 --- a/_publications/guo2020graphcodebert.markdown +++ b/_publications/guo2020graphcodebert.markdown @@ -1,6 +1,6 @@ --- layout: publication -title: GraphCodeBERT: Pre-training Code Representations with Data Flow +title: "GraphCodeBERT: Pre-training Code Representations with Data Flow" authors: Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, Ming Zhou year: 2020 bibkey: guo2020graphcodebert From 0d6280082345e6986b7def98c5eaf0f63b356993 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 25 Sep 2020 19:27:59 +0100 Subject: [PATCH 022/297] Add Wang et al --- _publications/wang2020blended.markdown | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 _publications/wang2020blended.markdown diff --git a/_publications/wang2020blended.markdown b/_publications/wang2020blended.markdown new file mode 100644 index 00000000..2494cbae --- /dev/null +++ b/_publications/wang2020blended.markdown @@ -0,0 +1,10 @@ +--- +layout: publication +title: "Blended, precise semantic program embeddings" +authors: Ke Wang, Zhendong Su +conference: PLDI +year: 2020 +bibkey: wang2020blended +tags: ["dynamic"] +--- +Learning neural program embeddings is key to utilizing deep neural networks in program languages research --- precise and efficient program representations enable the application of deep models to a wide range of program analysis tasks. Existing approaches predominately learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trained models with highly variant quality. This paper tackles these inherent weaknesses of prior approaches by introducing a new deep neural network, Liger, which learns program representations from a mixture of symbolic and concrete execution traces. We have evaluated Liger on two tasks: method name prediction and semantics classification. Results show that Liger is significantly more accurate than the state-of-the-art static model code2seq in predicting method names, and requires on average around 10x fewer executions covering nearly 4x fewer paths than the state-of-the-art dynamic model DYPRO in both tasks. Liger offers a new, interesting design point in the space of neural program embeddings and opens up this new direction for exploration. From 399bff5cde3d8e1de35d6b4964f6d7244eb6b28e Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Thu, 8 Oct 2020 11:04:32 +0100 Subject: [PATCH 023/297] Add Devign --- _publications/zhou2019devign.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/zhou2019devign.markdown diff --git a/_publications/zhou2019devign.markdown b/_publications/zhou2019devign.markdown new file mode 100644 index 00000000..f2499ab5 --- /dev/null +++ b/_publications/zhou2019devign.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" +authors: Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu +conference: NeurIPS +year: 2019 +bibkey: zhou2019devign +additional_links: + - {name: "Paper", url: "http://papers.nips.cc/paper/9209-devign-effective-vulnerability-identification-by-learning-comprehensive-program-semantics-via-graph-neural-networks"} +tags: ["GNN", "static analysis"] +--- +Vulnerability identification is crucial to protect the software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate high complexity and variety of real source code instead of synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the arts significantly with an average of 10.51% higher accuracy and 8.68% F1 score, increases averagely 4.66% accuracy and 6.37% F1 by the Conv module. From 0ad926be190b90cd87dcd490090faf41724cc97d Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sun, 11 Oct 2020 19:21:34 +0100 Subject: [PATCH 024/297] Add papers.
--- _publications/bui2020efficient.markdown | 12 ++++++++++++ _publications/clement2020pymt5.markdown | 12 ++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 _publications/bui2020efficient.markdown create mode 100644 _publications/clement2020pymt5.markdown diff --git a/_publications/bui2020efficient.markdown b/_publications/bui2020efficient.markdown new file mode 100644 index 00000000..416899ca --- /dev/null +++ b/_publications/bui2020efficient.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Efficient Framework for Learning Code Representations through Semantic-Preserving Program Transformations" +authors: Nghi D. Q. Bui +conference: +year: 2020 +bibkey: bui2020efficient +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.02731"} +tags: ["pre-training"] +--- +Recent learning techniques for the representation of code depend mostly on human-annotated (labeled) data. In this work, we are proposing Corder, a self-supervised learning system that can learn to represent code without having to label data. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning paradigm. We use a set of semantic-preserving transformation operators to generate snippets that are syntactically diverse but semantically equivalent. The contrastive learning objective, at the same time, maximizes agreement between different views of the same snippets and minimizes agreement between transformed views of different snippets. We train different instances of Corder on 3 neural network encoders, which are Tree-based CNN, ASTNN, and Code2vec over 2.5 million unannotated Java methods mined from GitHub. Our result shows that the Corder pre-training improves code classification and method name prediction with large margins. Furthermore, the code vectors generated by Corder are adapted to code clustering which has been shown to significantly beat the other baselines. diff --git a/_publications/clement2020pymt5.markdown b/_publications/clement2020pymt5.markdown new file mode 100644 index 00000000..40b0b0a7 --- /dev/null +++ b/_publications/clement2020pymt5.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "PyMT5: multi-mode translation of natural language and Python code with transformers" +authors: Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan +conference: EMNLP +year: 2020 +bibkey: clement2020pymt5 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2010.03150"} +tags: ["bimodal"] +--- +Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. 
On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. From 0c71ff68b746e5ab3a7faa110f72b684ba1a8a82 Mon Sep 17 00:00:00 2001 From: Sean Date: Thu, 8 Oct 2020 11:27:05 +0100 Subject: [PATCH 025/297] Create luan2019aroma.markdown --- _publications/luan2019aroma.markdown | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 _publications/luan2019aroma.markdown diff --git a/_publications/luan2019aroma.markdown b/_publications/luan2019aroma.markdown new file mode 100644 index 00000000..618177cb --- /dev/null +++ b/_publications/luan2019aroma.markdown @@ -0,0 +1,9 @@ +--- +layout: publication +title: "Aroma: code recommendation via structural code search" +authors: Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, Satish Chandra +conference: PACMPL +year: 2019 +bibkey: luan2019aroma +--- +Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful.
Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently. From 050fe2d3c1c1451252501a7160c8185786f74105 Mon Sep 17 00:00:00 2001 From: Sean Date: Mon, 12 Oct 2020 12:33:18 +0100 Subject: [PATCH 026/297] Update _publications/luan2019aroma.markdown Co-authored-by: Miltos --- _publications/luan2019aroma.markdown | 1 + 1 file changed, 1 insertion(+) diff --git a/_publications/luan2019aroma.markdown b/_publications/luan2019aroma.markdown index 618177cb..b09c47cd 100644 --- a/_publications/luan2019aroma.markdown +++ b/_publications/luan2019aroma.markdown @@ -5,5 +5,6 @@ authors: Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, Satish Chandra conference: PACMPL year: 2019 bibkey: luan2019aroma +tags: ["retrieval"] --- Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. From 74b36da56d9a2918a952b19fa9ad97732a323526 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 13 Oct 2020 09:10:23 +0100 Subject: [PATCH 027/297] Add some recent papers. --- _publications/gros2020code.markdown | 12 ++++++++++++ _publications/panthaplackel2020deep.markdown | 12 ++++++++++++ _publications/wang2020modular.markdown | 12 ++++++++++++ 3 files changed, 36 insertions(+) create mode 100644 _publications/gros2020code.markdown create mode 100644 _publications/panthaplackel2020deep.markdown create mode 100644 _publications/wang2020modular.markdown diff --git a/_publications/gros2020code.markdown b/_publications/gros2020code.markdown new file mode 100644 index 00000000..e28a223a --- /dev/null +++ b/_publications/gros2020code.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Code to Comment \"Translation\": Data, Metrics, Baselining & Evaluation" +authors: David Gros, Hariharan Sezhiyan, Prem Devanbu, Zhou Yu +conference: +year: 2020 +bibkey: gros2020code +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2010.01410"} +tags: ["bimodal", "documentation"] +--- +The relationship of comments to code, and in particular, the task of generating useful comments given the code, has long been of interest. The earliest approaches have been based on strong syntactic theories of comment-structures, and relied on textual templates. More recently, researchers have applied deep learning methods to this task, and specifically, trainable generative translation models which are known to work very well for Natural Language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data.
Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality), using "affinity pairs" of methods, from different projects, in the same project, in the same class, etc. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area. diff --git a/_publications/panthaplackel2020deep.markdown b/_publications/panthaplackel2020deep.markdown new file mode 100644 index 00000000..e6984f84 --- /dev/null +++ b/_publications/panthaplackel2020deep.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Deep Just-In-Time Inconsistency Detection Between Comments and Source Code" +authors: Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, Raymond J. Mooney +conference: +year: 2020 +bibkey: panthaplackel2020deep +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2010.01625"} +tags: ["edit", "bimodal", "documentation"] +--- +Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes. diff --git a/_publications/wang2020modular.markdown b/_publications/wang2020modular.markdown new file mode 100644 index 00000000..debae771 --- /dev/null +++ b/_publications/wang2020modular.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Modular Tree Network for Source Code Representation Learning" +authors: Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, Zhi Jin +conference: TOSEM +year: 2020 +bibkey: wang2020modular +additional_links: + - {name: "ACM", url: "https://dl.acm.org/doi/10.1145/3409331"} +tags: ["AST", "representation"] +--- +Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure in the source code, they cannot capture the richness of different types of substructure in programs. In this article, we propose a modular tree network that dynamically composes different neural network units into tree structures based on the input AST.
Different from previous tree-structural neural network models, a modular tree network can capture the semantic differences between types of AST substructures. We evaluate our model on two tasks: program classification and code clone detection. Our model achieves the best performance compared with state-of-the-art approaches in both tasks, showing the advantage of leveraging more elaborate structure information of the source code. From 775ceb341a48293da6903816fc4b0e35aa2db1e0 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 16 Oct 2020 14:47:52 +0100 Subject: [PATCH 028/297] Add Tabassum et al. --- _publications/tabassum2020code.markdown | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_publications/tabassum2020code.markdown b/_publications/tabassum2020code.markdown index 4326e230..6298dded 100644 --- a/_publications/tabassum2020code.markdown +++ b/_publications/tabassum2020code.markdown @@ -7,6 +7,7 @@ year: 2020 bibkey: tabassum2020code additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2005.01634"} + - {name: "Code", url: "https://github.com/jeniyat/StackOverflowNER/"} tags: ["dataset", "information extraction"] --- -There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We also present the SoftNER model that combines contextual information with domain specific knowledge using an attention network. The code token recognizer combined with an entity segmentation model we proposed, consistently improves the performance of the named entity tagger. Our proposed SoftNER tagger outperforms the BiLSTM-CRF model with an absolute increase of +9.73 F-1 score on StackOverflow data. +There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F-1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. From 2e530862a1aa8daffb7ff9ccb2bee0176056b8ca Mon Sep 17 00:00:00 2001 From: avinashbhat Date: Sat, 17 Oct 2020 21:54:14 +0530 Subject: [PATCH 029/297] Added Wang et al.
--- _publications/wang2020detecting.markdown | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 _publications/wang2020detecting.markdown diff --git a/_publications/wang2020detecting.markdown b/_publications/wang2020detecting.markdown new file mode 100644 index 00000000..e4f624af --- /dev/null +++ b/_publications/wang2020detecting.markdown @@ -0,0 +1,13 @@ +--- +layout: publication +title: Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree +authors: Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin +conference: IEEE International Conference on Software Analysis, Evolution, and Reengineering +year: 2020 +bibkey: wang2020detecting +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2002.08653"} +tags: ["clone detection", "GNN"] +--- + +Code clones are pairs of semantically similar code fragments that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches for detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we are aware, we are the first to apply graph neural networks to the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks. \ No newline at end of file From 13966e885aa494ecfa20e1955618867beaeb9adf Mon Sep 17 00:00:00 2001 From: avinashbhat Date: Mon, 19 Oct 2020 13:48:55 +0530 Subject: [PATCH 030/297] Added Shido et al.
--- _publications/shido2019automatic.markdown | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 _publications/shido2019automatic.markdown diff --git a/_publications/shido2019automatic.markdown b/_publications/shido2019automatic.markdown new file mode 100644 index 00000000..fef2ffb1 --- /dev/null +++ b/_publications/shido2019automatic.markdown @@ -0,0 +1,14 @@ +--- +layout: publication +title: "Automatic Source Code Summarization with Extended Tree-LSTM" +authors: Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, Tadayuki Matsumura +conference: International Joint Conference on Neural Networks +year: 2019 +bibkey: shido2019automatic +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/1906.08094"} + - {name: "Dataset", url: "https://github.com/xing-hu/DeepCom"} + - {name: "code", url: "https://github.com/sh1doy/summarization_tf"} +tags: ["Code Summarization", "LSTM"] +--- +Neural machine translation models are used to automatically generate a document from given source code since this can be regarded as a machine translation task. Source code summarization is one of the components for automatic document generation, which generates a summary in natural language from given source code. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be used for source code summarization. However, there is a considerable difference between source code and natural language: Source code is essentially structured, having loops and conditional branching, etc. Therefore, there are some obstacles to applying known machine translation models to source code. Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM is proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: It cannot handle a tree that contains nodes having an arbitrary number of children and their order simultaneously, yet ASTs generally have such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM and apply it for source code summarization. As a result of computational experiments, our proposal achieved better results when compared with several state-of-the-art techniques. From 3ca0013a6faae90c1d83b3099273ca3c9fe18278 Mon Sep 17 00:00:00 2001 From: Miltos Date: Tue, 20 Oct 2020 09:34:32 +0100 Subject: [PATCH 031/297] Update _publications/shido2019automatic.markdown --- _publications/shido2019automatic.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/shido2019automatic.markdown b/_publications/shido2019automatic.markdown index fef2ffb1..14f793b7 100644 --- a/_publications/shido2019automatic.markdown +++ b/_publications/shido2019automatic.markdown @@ -9,6 +9,6 @@ additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/1906.08094"} - {name: "Dataset", url: "https://github.com/xing-hu/DeepCom"} - {name: "code", url: "https://github.com/sh1doy/summarization_tf"} -tags: ["Code Summarization", "LSTM"] +tags: ["summarization", "AST"] --- Neural machine translation models are used to automatically generate a document from given source code since this can be regarded as a machine translation task.
Source code summarization is one of the components for automatic document generation, which generates a summary in natural language from given source code. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be used for source code summarization. However, there is a considerable difference between source code and natural language: Source code is essentially structured, having loops and conditional branching, etc. Therefore, there are some obstacles to applying known machine translation models to source code. Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM is proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: It cannot handle a tree that contains nodes having an arbitrary number of children and their order simultaneously, yet ASTs generally have such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM and apply it for source code summarization. As a result of computational experiments, our proposal achieved better results when compared with several state-of-the-art techniques. From 6149975a0c81b856f58a9026dd961b8b7b326c79 Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Thu, 22 Oct 2020 00:26:39 +0330 Subject: [PATCH 032/297] Fixed a broken link --- resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resources.md b/resources.md index 8e12e12d..b892b57f 100644 --- a/resources.md +++ b/resources.md @@ -50,4 +50,4 @@ The last few years a few workshops have been organized in this area. Please, fee papers in the area. You can access the list [here](https://github.com/src-d/awesome-machine-learning-on-source-code). * [Automated Program Repair](https://www.monperrus.net/martin/automatic-software-repair) has a curated list of pointers for helping newcomers to understand the field, -maintained by [Martin Monperrus](www.monperrus.net). +maintained by [Martin Monperrus](https://www.monperrus.net/martin/). From e46ecf2b924727f51ffcd69592d106549cf58f14 Mon Sep 17 00:00:00 2001 From: Colin Clement Date: Thu, 29 Oct 2020 19:56:44 +0000 Subject: [PATCH 033/297] added/updated msft c+ai papers --- _publications/clement2020pymt5.markdown | 4 ++-- _publications/svyatkovskiy2020intellicode.markdown | 6 ++---- _publications/tufano2020generating.markdown | 12 ++++++++++++ _publications/tufano2020unit.markdown | 12 ++++++++++++ 4 files changed, 28 insertions(+), 6 deletions(-) create mode 100644 _publications/tufano2020generating.markdown create mode 100644 _publications/tufano2020unit.markdown diff --git a/_publications/clement2020pymt5.markdown b/_publications/clement2020pymt5.markdown index 40b0b0a7..a3f054fa 100644 --- a/_publications/clement2020pymt5.markdown +++ b/_publications/clement2020pymt5.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: clement2020pymt5 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2010.03150"} -tags: ["bimodal"] +tags: ["bimodal", "generative", "summarization", "documentation", "language model", "pretraining", "pre-training"] --- -Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding.
Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. +Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. diff --git a/_publications/svyatkovskiy2020intellicode.markdown b/_publications/svyatkovskiy2020intellicode.markdown index 21130ed9..c515f635 100644 --- a/_publications/svyatkovskiy2020intellicode.markdown +++ b/_publications/svyatkovskiy2020intellicode.markdown @@ -2,14 +2,12 @@ layout: publication title: "IntelliCode Compose: Code Generation Using Transformer" authors: Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, Neel Sundaresan -conference: year: 2020 bibkey: svyatkovskiy2020intellicode additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2005.08025"} -tags: ["autocompletion"] +tags: ["autocompletion", "generative", "synthesis", "language model", "pretraining", "pre-training"] --- In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. - In this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. 
IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. -Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language. +Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language. diff --git a/_publications/tufano2020generating.markdown b/_publications/tufano2020generating.markdown new file mode 100644 index 00000000..1d645b57 --- /dev/null +++ b/_publications/tufano2020generating.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers" +authors: Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan +conference: ICSE +year: 2020 +bibkey: tufano2020unit +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.05634"} +tags: ["generative", "synthesis"] +--- +Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage. diff --git a/_publications/tufano2020unit.markdown b/_publications/tufano2020unit.markdown new file mode 100644 index 00000000..362b423f --- /dev/null +++ b/_publications/tufano2020unit.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Unit Test Case Generation with Transformers" +authors: Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan +conference: ICSE +year: 2020 +bibkey: tufano2020unit +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2009.05617"} +tags: ["generative", "synthesis"] +--- +Automated Unit Test Case generation has been the focus of extensive literature within the research community. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult to read or understand for developers. In this paper we propose AthenaTest, an approach that aims at generating unit test cases by learning from real-world, developer-written test cases. 
Our approach relies on a state-of-the-art sequence-to-sequence transformer model which is able to write useful test cases for a given method under test (i.e., focal method). We also introduce methods2test - the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 630k test cases mined from 70k open-source repositories hosted on GitHub. We use this dataset to train a transformer model to translate focal methods into the corresponding test cases. We evaluate the ability of our model in generating test cases using natural language processing as well as code-specific criteria. First, we assess the quality of the translation compared to the target test case, then we analyze properties of the test case such as syntactic correctness and number and variety of testing APIs (e.g., asserts). We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated test cases. From 6f77549ac0bb07fd1d050598cf90b33c259ce055 Mon Sep 17 00:00:00 2001 From: Colin Clement Date: Thu, 29 Oct 2020 20:05:02 +0000 Subject: [PATCH 034/297] added tag for test generation --- _publications/tufano2020generating.markdown | 2 +- _publications/tufano2020unit.markdown | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/_publications/tufano2020generating.markdown b/_publications/tufano2020generating.markdown index 1d645b57..b0fdbb88 100644 --- a/_publications/tufano2020generating.markdown +++ b/_publications/tufano2020generating.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: tufano2020generating additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2009.05634"} -tags: ["generative", "synthesis"] +tags: ["generative", "synthesis", "test generation"] --- Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage.
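Both Tufano et al. entries frame test and assert generation as sequence-to-sequence translation with a pretrained transformer. As a rough sketch of how such a fine-tuned model could be queried — the checkpoint name, separator convention, and example inputs below are illustrative assumptions, not the authors' released artifacts:

```python
# Hypothetical usage sketch: a seq2seq transformer fine-tuned to map a
# (test prefix, focal method) pair to an assert statement. The checkpoint
# name is a placeholder; no such public model is implied.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "my-org/assert-generation-demo"  # placeholder fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

test_prefix = 'public void testAdd() { int r = Calculator.add(2, 3); "<AssertPlaceHolder>" }'
focal_method = "public static int add(int a, int b) { return a + b; }"

# Concatenate the two inputs with a separator and decode with beam search.
inputs = tokenizer(test_prefix + " </s> " + focal_method, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# e.g. assertEquals(5, r);
```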
diff --git a/_publications/tufano2020unit.markdown b/_publications/tufano2020unit.markdown index 362b423f..0f68a605 100644 --- a/_publications/tufano2020unit.markdown +++ b/_publications/tufano2020unit.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: tufano2020unit additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2009.05617"} -tags: ["generative", "synthesis"] +tags: ["generative", "synthesis", "test generation"] --- Automated Unit Test Case generation has been the focus of extensive literature within the research community. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult to read or understand for developers. In this paper we propose AthenaTest, an approach that aims at generating unit test cases by learning from real-world, developer-written test cases. Our approach relies on a state-of-the-art sequence-to-sequence transformer model which is able to write useful test cases for a given method under test (i.e., focal method). We also introduce methods2test - the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 630k test cases mined from 70k open-source repositories hosted on GitHub. We use this dataset to train a transformer model to translate focal methods into the corresponding test cases. We evaluate the ability of our model in generating test cases using natural language processing as well as code-specific criteria. First, we assess the quality of the translation compared to the target test case, then we analyze properties of the test case such as syntactic correctness and number and variety of testing APIs (e.g., asserts). We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated test cases. From d57c2ed738edc504fef25a5596001dd85f961d25 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 30 Oct 2020 14:09:57 +0000 Subject: [PATCH 035/297] Unify tags "pre-training" -> "pretraining" --- _publications/bui2020efficient.markdown | 2 +- _publications/clement2020pymt5.markdown | 2 +- _publications/svyatkovskiy2020intellicode.markdown | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_publications/bui2020efficient.markdown b/_publications/bui2020efficient.markdown index 416899ca..1e86470d 100644 --- a/_publications/bui2020efficient.markdown +++ b/_publications/bui2020efficient.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: bui2020efficient additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2009.02731"} -tags: ["pre-training"] +tags: ["pretraining"] --- Recent learning techniques for the representation of code depend mostly on human-annotated (labeled) data. In this work, we are proposing Corder, a self-supervised learning system that can learn to represent code without having to label data. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning paradigm. We use a set of semantic-preserving transformation operators to generate snippets that are syntactically diverse but semantically equivalent. 
The contrastive learning objective, at the same time, maximizes agreement between different views of the same snippets and minimizes agreement between transformed views of different snippets. We train different instances of Corder on 3 neural network encoders, which are Tree-based CNN, ASTNN, and Code2vec over 2.5 million unannotated Java methods mined from GitHub. Our result shows that the Corder pre-training improves code classification and method name prediction with large margins. Furthermore, the code vectors generated by Corder are adapted to code clustering which has been shown to significantly beat the other baselines. diff --git a/_publications/clement2020pymt5.markdown b/_publications/clement2020pymt5.markdown index a3f054fa..ffef4177 100644 --- a/_publications/clement2020pymt5.markdown +++ b/_publications/clement2020pymt5.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: clement2020pymt5 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2010.03150"} -tags: ["bimodal", "generative", "summarization", "documentation", "language model", "pretraining", "pre-training"] +tags: ["bimodal", "generative", "summarization", "documentation", "language model", "pretraining"] --- Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. diff --git a/_publications/svyatkovskiy2020intellicode.markdown b/_publications/svyatkovskiy2020intellicode.markdown index c515f635..276bfe9e 100644 --- a/_publications/svyatkovskiy2020intellicode.markdown +++ b/_publications/svyatkovskiy2020intellicode.markdown @@ -6,7 +6,7 @@ year: 2020 bibkey: svyatkovskiy2020intellicode additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2005.08025"} -tags: ["autocompletion", "generative", "synthesis", "language model", "pretraining", "pre-training"] +tags: ["autocompletion", "generative", "synthesis", "language model", "pretraining"] --- In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. 
It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. From 63857e54a358678e5bd5790ae5d4784d566ddc77 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 6 Nov 2020 15:40:32 +0000 Subject: [PATCH 036/297] Add some relevant papers. --- _publications/bieber2020learning.markdown | 12 ++++++++++++ _publications/husain2019codesearchnet.markdown | 2 +- _publications/sun2020pscs.markdown | 12 ++++++++++++ _publications/tian2020evaluating.markdown | 12 ++++++++++++ 4 files changed, 37 insertions(+), 1 deletion(-) create mode 100644 _publications/bieber2020learning.markdown create mode 100644 _publications/sun2020pscs.markdown create mode 100644 _publications/tian2020evaluating.markdown diff --git a/_publications/bieber2020learning.markdown b/_publications/bieber2020learning.markdown new file mode 100644 index 00000000..1a7d8f28 --- /dev/null +++ b/_publications/bieber2020learning.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks" +authors: David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow +conference: NeurIPS +year: 2020 +bibkey: bieber2020learning +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2010.12621"} +tags: ["representation", "dynamic"] +--- +Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Networks (IPA-GNN), which achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks. 
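The "soft instruction pointer" behind the IPA-GNN can be illustrated in a few lines of numpy: probability mass over statements is redistributed along branch probabilities at every propagation step. In this toy sketch the branch probabilities are fixed constants; in the actual model they would be predicted from per-statement hidden states:

```python
# Toy sketch (assumptions throughout): a "soft instruction pointer" over a
# control-flow graph, in the spirit of IPA-GNN. Each node carries a
# probability of being the current statement; mass flows along branches.
import numpy as np

n = 4                        # statements 0..3; node 3 is the exit
p = np.zeros(n); p[0] = 1.0  # all probability mass starts at the entry

# succ[u] = list of (v, branch_prob); outgoing probabilities sum to 1.
# Node 1 is a branch (loop back via node 2, or exit to node 3).
succ = {0: [(1, 1.0)], 1: [(2, 0.7), (3, 0.3)], 2: [(1, 1.0)], 3: [(3, 1.0)]}

for step in range(8):        # fixed number of propagation steps
    nxt = np.zeros(n)
    for u, edges in succ.items():
        for v, b in edges:
            nxt[v] += p[u] * b
    p = nxt

print(p.round(3))            # mass gradually accumulates at the exit node
```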
diff --git a/_publications/husain2019codesearchnet.markdown b/_publications/husain2019codesearchnet.markdown index 52b7e378..e01a0573 100644 --- a/_publications/husain2019codesearchnet.markdown +++ b/_publications/husain2019codesearchnet.markdown @@ -9,7 +9,7 @@ additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/1909.09436"} - {name: "Code and other info", url: "https://github.com/github/CodeSearchNet"} - {name: "Leaderboard", url: "https://app.wandb.ai/github/codesearchnet/benchmark"} -tags: ["dataset", "retrieval"] +tags: ["dataset", "retrieval", "search"] --- Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. diff --git a/_publications/sun2020pscs.markdown b/_publications/sun2020pscs.markdown new file mode 100644 index 00000000..62746b96 --- /dev/null +++ b/_publications/sun2020pscs.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "PSCS: A Path-based Neural Model for Semantic Code Search" +authors: Zhensu Sun, Yan Liu, Chen Yang, Yu Qian +conference: +year: 2020 +bibkey: sun2020pscs +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.03042"} +tags: ["AST", "retrieval", "search"] +--- +To obtain code snippets for reuse, programmers prefer to search for related documents, e.g., blogs or Q&A, instead of code itself. The major reason is due to the semantic diversity and mismatch between queries and code snippets. Deep learning models have been proposed to address this challenge. Compared with approaches using information retrieval techniques, deep learning models do not suffer from the information loss caused by refining user intention into keywords. However, the performance of previous works is not satisfactory because they ignore the importance of code structure. When the semantics of code (e.g., identifier names, APIs) are ambiguous, code structure may be the only feature for the model to utilize. In that case, previous works relearn the structural information from lexical tokens of code, which is extremely difficult for a model without any domain knowledge. In this work, we propose PSCS, a path-based neural model for semantic code search. Our model encodes both the semantics and structures of code represented by AST paths. We train and evaluate our model over 330k-19k query-function pairs, respectively. The evaluation results demonstrate that PSCS achieves a SuccessRate of 47.6% and a Mean Reciprocal Rank (MRR) of 30.4% when considering the top-10 results with a match. The proposed approach significantly outperforms both DeepCS, the first approach that applies deep learning to code search task, and CARLCS, a state-of-the-art approach that introduces a co-attentive representation learning model on the basis of DeepCS. The importance of code structure is demonstrated with an ablation study on code features, which enlightens model design for further studies. 
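A minimal way to picture this style of embedding-based retrieval (a toy stand-in, not PSCS itself, with random vectors in place of learned ones): snippets and queries are embedded into a shared vector space and ranked by cosine similarity; PSCS additionally encodes AST paths rather than bare tokens.

```python
# Toy embedding-based code search: average token embeddings, rank by cosine.
# All vectors here are random stand-ins for learned embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=64) for w in
         ["read", "file", "open", "close", "sort", "list", "lines"]}

def embed(tokens):
    vecs = [vocab[t] for t in tokens if t in vocab]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)          # unit-normalize for cosine

corpus = {
    "readLines":  ["open", "file", "read", "lines", "close"],
    "sortValues": ["sort", "list"],
}
query = embed(["read", "file"])
scores = {name: float(embed(toks) @ query) for name, toks in corpus.items()}
print(max(scores, key=scores.get))        # likely "readLines": shared tokens
```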
diff --git a/_publications/tian2020evaluating.markdown b/_publications/tian2020evaluating.markdown new file mode 100644 index 00000000..ecc8d51b --- /dev/null +++ b/_publications/tian2020evaluating.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair" +authors: Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, Tegawendé F. Bissyandé +conference: +year: 2020 +bibkey: tian2020evaluating +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.02944"} +tags: ["edit", "defect"] +--- +A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature. From 3c853cc3137776d3af224fa058aa66d8c7eae33f Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 6 Nov 2020 16:01:02 +0000 Subject: [PATCH 037/297] One more paper. --- _publications/chirkova2020empirical.markdown | 12 ++++++++++++ _publications/hellendoorn2020global.markdown | 2 +- 2 files changed, 13 insertions(+), 1 deletion(-) create mode 100644 _publications/chirkova2020empirical.markdown diff --git a/_publications/chirkova2020empirical.markdown b/_publications/chirkova2020empirical.markdown new file mode 100644 index 00000000..ab9fe59c --- /dev/null +++ b/_publications/chirkova2020empirical.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Empirical Study of Transformers for Source Code" +authors: Nadezhda Chirkova, Sergey Troshin +conference: +year: 2020 +bibkey: chirkova2020empirical +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2010.07987"} +tags: ["transformers"] +--- +Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks.
In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model. diff --git a/_publications/hellendoorn2020global.markdown b/_publications/hellendoorn2020global.markdown index fad2804b..1446ac53 100644 --- a/_publications/hellendoorn2020global.markdown +++ b/_publications/hellendoorn2020global.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: hellendoorn2020global additional_links: - {name: "OpenReview", url: "https://openreview.net/forum?id=B1lnbRNtwr¬eId=B1lnbRNtwr"} -tags: ["variable misuse", "defect", "GNN"] +tags: ["variable misuse", "defect", "GNN", "transformers"] --- Models of code can learn distributed representations of a program's syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters. 
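The GREAT idea of biasing attention with edge information admits a compact sketch. This is a deliberately simplified, unbatched single head, and adding one learned scalar per edge type to the attention logits is just one of several variants one could implement:

```python
# Rough sketch: relation-aware attention in the spirit of GREAT.
# A learned per-edge-type scalar is added to the attention logits.
import torch
import torch.nn.functional as F

T, d, n_edge_types = 6, 16, 3
q = torch.randn(T, d); k = torch.randn(T, d); v = torch.randn(T, d)

# edge_type[i, j] in {0..n_edge_types-1}, or -1 when tokens i, j are unrelated
edge_type = torch.randint(-1, n_edge_types, (T, T))
edge_bias = torch.nn.Parameter(torch.zeros(n_edge_types))  # learned in training

scores = (q @ k.T) / d ** 0.5
bias = torch.where(edge_type >= 0,
                   edge_bias[edge_type.clamp(min=0)],
                   torch.zeros_like(scores))
out = F.softmax(scores + bias, dim=-1) @ v  # (T, d) relation-aware mixing
```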
From cdd2facd3b8259d362db7ebecc7bf4a3fae52e2e Mon Sep 17 00:00:00 2001 From: Orestis Floros Date: Tue, 17 Nov 2020 10:15:42 +0100 Subject: [PATCH 038/297] david2019neural: Add GNN tag See 5.4 and https://github.com/tech-srl/Nero --- _publications/david2019neural.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/david2019neural.markdown b/_publications/david2019neural.markdown index 784f467a..893468f3 100644 --- a/_publications/david2019neural.markdown +++ b/_publications/david2019neural.markdown @@ -7,7 +7,7 @@ year: 2019 bibkey: david2019neural additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/1902.09122"} -tags: ["naming", "deobfuscation"] +tags: ["naming", "deobfuscation", "GNN"] --- We address the problem of predicting procedure names in stripped executables which contain no debug information. Predicting procedure names can dramatically ease the task of reverse engineering, saving precious time and human effort. From 4e4c86940eb5e15770f939372c8437dcf264be0c Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Thu, 19 Nov 2020 08:08:29 +0000 Subject: [PATCH 039/297] Add paper. --- _publications/aye2020learning.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/aye2020learning.markdown diff --git a/_publications/aye2020learning.markdown b/_publications/aye2020learning.markdown new file mode 100644 index 00000000..ad45fc06 --- /dev/null +++ b/_publications/aye2020learning.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Learning Autocompletion from Real-World Datasets" +authors: Gareth Ari Aye, Seohyun Kim, Hongyu Li +conference: +year: 2020 +bibkey: aye2020learning +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2011.04542"} +tags: ["autocompletion"] +--- +Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study When Code Completion Fails: a Case Study on Real-World Completions demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models. From 87cbfb9f7f37b706c46cd431f685d53b4d4567bf Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 25 Nov 2020 12:15:06 +0000 Subject: [PATCH 040/297] Add sinkfinder. 
--- _publications/bian2020sinkfinder.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/bian2020sinkfinder.markdown diff --git a/_publications/bian2020sinkfinder.markdown b/_publications/bian2020sinkfinder.markdown new file mode 100644 index 00000000..5400b116 --- /dev/null +++ b/_publications/bian2020sinkfinder.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed" +authors: Pan Bian, Bin Liang, Jianjun Huang, Wenchang Shi, Xidong Wang, Jian Zhang +conference: FSE +year: 2020 +bibkey: bian2020sinkfinder +tags: ["program analysis"] +--- +Mastering the knowledge about security-sensitive functions that can potentially result in bugs is valuable to detect them. However, identifying this kind of function is not a trivial task. Introducing machine learning-based techniques to do the task is a natural choice. Unfortunately, the approach also requires considerable prior knowledge, e.g., sufficient labelled training samples. In practice, the requirement is often hard to meet. + +In this paper, to solve the problem, we propose a novel and practical method called SinkFinder to automatically discover function pairs that we are interested in, which only requires very limited prior knowledge. SinkFinder first takes just one pair of well-known interesting functions as the initial seed to infer enough positive and negative training samples by means of sub-word word embedding. By using these samples, a support vector machine classifier is trained to identify more interesting function pairs. Finally, checkers equipped with the obtained knowledge can be easily developed to detect bugs in target systems. The experiments demonstrate that SinkFinder can successfully discover hundreds of interesting functions and detect dozens of previously unknown bugs from large-scale systems, such as Linux, OpenSSL and PostgreSQL. From ac21fabeb39954a0bdd871af9ab32cc1eb39c3fa Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 9 Dec 2020 13:28:45 +0000 Subject: [PATCH 041/297] Merge tags --- _publications/aye2020learning.markdown | 2 +- _publications/kim2020code.markdown | 2 +- _publications/svyatkovskiy2020fast.markdown | 2 +- _publications/svyatkovskiy2020intellicode.markdown | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/_publications/aye2020learning.markdown b/_publications/aye2020learning.markdown index ad45fc06..37a7806c 100644 --- a/_publications/aye2020learning.markdown +++ b/_publications/aye2020learning.markdown @@ -7,6 +7,6 @@ year: 2020 bibkey: aye2020learning additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2011.04542"} -tags: ["autocompletion"] +tags: ["autocomplete"] --- Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study When Code Completion Fails: a Case Study on Real-World Completions demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively.
We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers' actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models. diff --git a/_publications/kim2020code.markdown b/_publications/kim2020code.markdown index de1e2ce1..52f6f5a8 100644 --- a/_publications/kim2020code.markdown +++ b/_publications/kim2020code.markdown @@ -8,7 +8,7 @@ bibkey: kim2020code additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2003.13848"} - {name: "Code", url: "https://github.com/facebookresearch/code-prediction-transformer"} -tags: ["autocompletion"] +tags: ["autocomplete"] --- In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset. diff --git a/_publications/svyatkovskiy2020fast.markdown b/_publications/svyatkovskiy2020fast.markdown index eabfd40b..65701f46 100644 --- a/_publications/svyatkovskiy2020fast.markdown +++ b/_publications/svyatkovskiy2020fast.markdown @@ -7,7 +7,7 @@ year: 2020 bibkey: svyatkovskiy2020fast additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2004.13651"} -tags: ["autocompletion"] +tags: ["autocomplete"] --- Code completion is one of the most widely used features of modern integrated development environments (IDEs). Deep learning has recently made significant progress in the statistical prediction of source code. However, state-of-the-art neural network models consume prohibitively large amounts of memory, causing computational burden to the development environment, especially when deployed in lightweight client devices. diff --git a/_publications/svyatkovskiy2020intellicode.markdown b/_publications/svyatkovskiy2020intellicode.markdown index 276bfe9e..4eb75028 100644 --- a/_publications/svyatkovskiy2020intellicode.markdown +++ b/_publications/svyatkovskiy2020intellicode.markdown @@ -6,7 +6,7 @@ year: 2020 bibkey: svyatkovskiy2020intellicode additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2005.08025"} -tags: ["autocompletion", "generative", "synthesis", "language model", "pretraining"] +tags: ["autocomplete", "generative", "synthesis", "language model", "pretraining"] --- In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. 
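Several of the completion systems above decode with beam search, so a toy decoder loop may help fix ideas. The "language model" here is a hard-coded lookup table standing in for a trained network; everything else is generic:

```python
# Toy beam-search decoder over a stand-in language model, illustrating the
# kind of decoding loop a completion service needs (not any vendor's code).
import math

def lm_next(prefix):
    # Stand-in LM: returns {token: prob}; a real system would query a model.
    table = {(): {"return": 0.6, "raise": 0.4},
             ("return",): {"x": 0.7, "None": 0.3},
             ("raise",): {"ValueError": 0.9, "KeyError": 0.1}}
    return table.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(width=2, steps=3):
    beams = [([], 0.0)]                      # (tokens, log-probability)
    for _ in range(steps):
        candidates = []
        for toks, lp in beams:
            for tok, p in lm_next(toks).items():
                candidates.append((toks + [tok], lp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
    return beams

for toks, lp in beam_search():
    print(" ".join(toks), round(math.exp(lp), 3))
# return x <eos> 0.42
# raise ValueError <eos> 0.36
```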
From 3d1c7dfac8b856f58a9026dd961b8b7b326c79 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 18 Dec 2020 12:11:15 +0000 Subject: [PATCH 042/297] More papers, the first to be published in 2021! --- _publications/bui2021infercode.markdown | 12 ++++++++++++ _publications/cummins2020programl.markdown | 18 ++++++++++++++++++ _publications/pradel2020neural.markdown | 12 ++++++++++++ _publications/schuster2021you.markdown | 16 ++++++++++++++++ _publications/wang2020learning2.markdown | 14 ++++++++++++++ 5 files changed, 72 insertions(+) create mode 100644 _publications/bui2021infercode.markdown create mode 100644 _publications/cummins2020programl.markdown create mode 100644 _publications/pradel2020neural.markdown create mode 100644 _publications/schuster2021you.markdown create mode 100644 _publications/wang2020learning2.markdown diff --git a/_publications/bui2021infercode.markdown b/_publications/bui2021infercode.markdown new file mode 100644 index 00000000..6040a7a1 --- /dev/null +++ b/_publications/bui2021infercode.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees" +authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang +conference: ICSE +year: 2021 +bibkey: bui2021infercode +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2012.07023"} +tags: ["representation"] +--- +Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder of a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, cross-language code search or reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model with a significant margin for most tasks including those involving different programming languages.
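The self-supervised objective can be pictured as follows — a schematic sketch only, with a stand-in encoder and random data. InferCode itself uses a TBCNN over ASTs and predicts sampled subtrees with a softmax, rather than this simplified multi-label head:

```python
# Schematic sketch of a subtree-prediction pretraining objective: given a
# code vector from some encoder, predict which subtree types occur in the
# AST (multi-label classification). Encoder and data are stand-ins.
import torch
import torch.nn as nn

n_subtree_types, d = 1000, 128
encoder = nn.Sequential(nn.Linear(64, d), nn.ReLU())  # stand-in for a TBCNN
head = nn.Linear(d, n_subtree_types)                  # subtree "vocabulary"
loss_fn = nn.BCEWithLogitsLoss()

code_features = torch.randn(32, 64)   # batch of (pre-extracted) AST features
targets = (torch.rand(32, n_subtree_types) < 0.01).float()  # subtrees present

code_vec = encoder(code_features)     # reusable code representation
loss = loss_fn(head(code_vec), targets)  # predict the subtrees of each AST
loss.backward()                       # no labels needed: targets come from ASTs
```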
diff --git a/_publications/cummins2020programl.markdown b/_publications/cummins2020programl.markdown
new file mode 100644
index 00000000..e991eaf2
--- /dev/null
+++ b/_publications/cummins2020programl.markdown
@@ -0,0 +1,18 @@
+---
+layout: publication
+title: "ProGraML: Graph-based Deep Learning for Program Optimization and Analysis"
+authors: Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather
+conference:
+year: 2020
+bibkey: cummins2020programl
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2003.10536"}
+  - {name: "Dataset", url: "https://zenodo.org/record/4122437"}
+  - {name: "Code", url: "https://github.com/ChrisCummins/ProGraML"}
+tags: ["dataset", "GNN"]
+---
+The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics, but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation.
+
+We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low-level, language-agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks.
+
+ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.

diff --git a/_publications/pradel2020neural.markdown b/_publications/pradel2020neural.markdown
new file mode 100644
index 00000000..d51876a8
--- /dev/null
+++ b/_publications/pradel2020neural.markdown
@@ -0,0 +1,12 @@
+---
+layout: publication
+title: "Neural Software Analysis"
+authors: Michael Pradel, Satish Chandra
+conference:
+year: 2020
+bibkey: pradel2020neural
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2011.07986"}
+tags: ["program analysis", "survey"]
+---
+Many software development problems can be addressed by program analysis tools, which traditionally are based on precise, logical reasoning and heuristics to ensure that the tools are practical. Recent work has shown tremendous success through an alternative way of creating developer tools, which we call neural software analysis. The key idea is to train a neural machine learning model on numerous code examples, which, once trained, makes predictions about previously unseen code. In contrast to traditional program analysis, neural software analysis naturally handles fuzzy information, such as coding conventions and natural language embedded in code, without relying on manually encoded heuristics. This article gives an overview of neural software analysis, discusses when to (not) use it, and presents three example analyses. The analyses address challenging software development problems: bug detection, type prediction, and code completion. The resulting tools complement and outperform traditional program analyses, and are used in industrial practice.

diff --git a/_publications/schuster2021you.markdown b/_publications/schuster2021you.markdown
new file mode 100644
index 00000000..1d4460a7
--- /dev/null
+++ b/_publications/schuster2021you.markdown
@@ -0,0 +1,16 @@
+---
+layout: publication
+title: "You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion"
+authors: Roei Schuster, Congzheng Song, Eran Tromer, Vitaly Shmatikov
+conference: USENIX Security
+year: 2021
+bibkey: schuster2021you
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2007.02220"}
+tags: ["autocomplete", "adversarial"]
+---
+Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context.
+
+We demonstrate that neural code autocompleters are vulnerable to poisoning attacks. By adding a few specially-crafted files to the autocompleter's training corpus (data poisoning), or else by directly fine-tuning the autocompleter on these files (model poisoning), the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can "teach" the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. Moreover, we show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for files from a specific repo or specific developer.
+
+We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then evaluate existing defenses against poisoning attacks and show that they are largely ineffective.

diff --git a/_publications/wang2020learning2.markdown b/_publications/wang2020learning2.markdown
new file mode 100644
index 00000000..3b9c0d73
--- /dev/null
+++ b/_publications/wang2020learning2.markdown
@@ -0,0 +1,14 @@
+---
+layout: publication
+title: "Learning to Represent Programs with Heterogeneous Graphs"
+authors: Wenhan Wang, Kechi Zhang, Ge Li, Zhi Jin
+conference:
+year: 2020
+bibkey: wang2020learning2
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2012.04188"}
+tags: ["GNN", "summarization"]
+---
+Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing research uses abstract syntax trees (ASTs).
+A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in the AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different.
+
+To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new approach for building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of the programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experimental results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

From f1c7472903770ed440ea1db644f0de202eb7aad4 Mon Sep 17 00:00:00 2001
From: Alexander
Date: Sat, 19 Dec 2020 10:53:28 +0100
Subject: [PATCH 043/297] Add NeurIPS 2020 CAP workshop

---
 resources.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/resources.md b/resources.md
index b892b57f..f6755244 100644
--- a/resources.md
+++ b/resources.md
@@ -26,6 +26,7 @@ Please, feel free to submit a pull request to adding more links in this page.
 ### Workshops and Other Academic Events
 The last few years a few workshops have been organized in this area. Please, feel free to add any missing or future workshops here.
+* [Workshop on Computer-Assisted Programming](https://capworkshop.github.io/) 12 December 2020, NeurIPS 2020, virtual
 * [ML on Code devroom at FOSDEM19](https://fosdem.org/2019/schedule/track/ml_on_code/) 2-3 February 2019, Brussels, EU [[videos](https://video.fosdem.org/2019/H.2213/)]
 * [Machine Learning for Programming](http://ml4p.org/) 18–19 July 2018, Oxford, UK [[videos](https://www.youtube.com/watch?v=dQaAp9wdFtQ&list=PLMPy362FkW9pd96bwh0BuCGMo6fdMQ2aw)]
 * [International Workshop on Machine Learning techniques for Programming Languages](https://conf.researchr.org/track/ecoop-issta-2018/ML4PL-2018-papers) 16 - 21 July 2018 Amsterdam, Netherlands

 ### Courses on Important Relevant Background
-* [Sofware Analysis](http://rightingcode.org/) in Univ. of Pennsylvania. It is a great introduction to Program Analysis [[videos](https://www.youtube.com/playlist?list=PLF3-CvSRq2SaApl3Lnu6Tu_ecsBr94543)]
+* [Software Analysis](http://rightingcode.org/) at Univ. of Pennsylvania. It is a great introduction to Program Analysis [[videos](https://www.youtube.com/playlist?list=PLF3-CvSRq2SaApl3Lnu6Tu_ecsBr94543)]
+* [Applications of Data Science for Software Engineering 2020](https://www.youtube.com/watch?v=34hcH7Js41I&list=PLmAXH4O57P5_0IflYjLIg8l0IupZPbdlY) at TU Eindhoven.

 ### Competitions
 * [nlc2cmd](http://nlc2cmd.us-east.mybluemix.net/#/) in NeurIPS 2020 by Project CLAI. Starts July 2020.
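As a concrete illustration of the heterogeneous program graphs described in the Wang et al. patch above, the sketch below types every node by its AST class and every edge by the grammar field connecting parent to child. It uses Python's standard `ast` module rather than the paper's ASDL-based construction, so the details are illustrative assumptions, not the paper's exact graph schema.

```python
# Toy construction of a "heterogeneous" program graph from a Python AST:
# node type = the AST node's class, edge type = (parent type, grammar field).
import ast

def build_heterogeneous_graph(code: str):
    tree = ast.parse(code)
    nodes = {}   # node id -> node type
    edges = []   # (source id, edge type, target id)
    for parent in ast.walk(tree):
        nodes[id(parent)] = type(parent).__name__
        for field, value in ast.iter_fields(parent):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    # Edge type combines the parent's node type with the
                    # grammar field name, e.g. ("FunctionDef", "body").
                    edges.append(
                        (id(parent), (type(parent).__name__, field), id(child))
                    )
    return nodes, edges

nodes, edges = build_heterogeneous_graph("def add(a, b):\n    return a + b\n")
print(sorted(set(nodes.values())))      # distinct node types
print(sorted({e[1] for e in edges}))    # distinct edge types
```

A heterogeneous GNN would then learn separate message-passing parameters per node and edge type instead of treating the graph as homogeneous.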
From 2828c8d5d63e54593067e523cb86e4fffb987305 Mon Sep 17 00:00:00 2001
From: Alexander Serebrenik
Date: Sat, 19 Dec 2020 20:33:00 +0100
Subject: [PATCH 044/297] Updating the official name of the university

Thank you very much for including the link! Just a minor fix of the official English name of the university.
---
 resources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resources.md b/resources.md
index f6755244..4f1c83fd 100644
--- a/resources.md
+++ b/resources.md
@@ -39,7 +39,7 @@ The last few years a few workshops have been organized in this area. Please, fee
 ### Courses on Important Relevant Background
 * [Software Analysis](http://rightingcode.org/) at Univ. of Pennsylvania. It is a great introduction to Program Analysis [[videos](https://www.youtube.com/playlist?list=PLF3-CvSRq2SaApl3Lnu6Tu_ecsBr94543)]
-* [Applications of Data Science for Software Engineering 2020](https://www.youtube.com/watch?v=34hcH7Js41I&list=PLmAXH4O57P5_0IflYjLIg8l0IupZPbdlY) at TU Eindhoven.
+* [Applications of Data Science for Software Engineering 2020](https://www.youtube.com/watch?v=34hcH7Js41I&list=PLmAXH4O57P5_0IflYjLIg8l0IupZPbdlY) at Eindhoven University of Technology.

 ### Competitions
 * [nlc2cmd](http://nlc2cmd.us-east.mybluemix.net/#/) in NeurIPS 2020 by Project CLAI. Starts July 2020.

From 44cf6925c17530258d9c91d7697a182c00a678ed Mon Sep 17 00:00:00 2001
From: shakedbr
Date: Thu, 24 Dec 2020 10:18:47 +0200
Subject: [PATCH 045/297] Updated "Neural Edit Completion" publication - conference, updated name, code

---
 ...y2020neural.markdown => brody2020structural.markdown} | 9 +++++----
 contributors.md                                          | 1 +
 2 files changed, 6 insertions(+), 4 deletions(-)
 rename _publications/{brody2020neural.markdown => brody2020structural.markdown} (81%)

diff --git a/_publications/brody2020neural.markdown b/_publications/brody2020structural.markdown
similarity index 81%
rename from _publications/brody2020neural.markdown
rename to _publications/brody2020structural.markdown
index 5fb0a1f2..3722a81b 100644
--- a/_publications/brody2020neural.markdown
+++ b/_publications/brody2020structural.markdown
@@ -1,12 +1,13 @@
 ---
 layout: publication
-title: "Neural Edit Completion"
+title: "A Structural Model for Contextual Code Changes"
 authors: Shaked Brody, Uri Alon, Eran Yahav
-conference:
+conference: OOPSLA
 year: 2020
-bibkey: brody2020neural
+bibkey: brody2020structural
 additional_links:
   - {name: "ArXiV", url: "https://arxiv.org/abs/2005.13209"}
+  - {name: "Code", url: "https://github.com/tech-srl/c3po"}
 tags: ["edit", "AST", "autocomplete"]
 ---
-We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program's Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. We make our code, dataset, and trained models publicly available.
+We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program's Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. Our code, dataset, and trained models are publicly available at https://github.com/tech-srl/c3po/.

diff --git a/contributors.md b/contributors.md
index 65cefdff..57efa5b6 100644
--- a/contributors.md
+++ b/contributors.md
@@ -15,4 +15,5 @@ Please, feel free to add your name below, once you contribute to this website.
 A comprehensive list can be found [here](https://github.com/ml4code/ml4code.github.io/graphs/contributors).

 * [Uri Alon](http://www.cs.technion.ac.il/~urialon/) Technion, Israel
+* [Shaked Brody](https://shakedbr.cswp.cs.technion.ac.il/) Technion, Israel
 * [Nghi D. Q. Bui](https://bdqnghi.github.io/) Singapore Management University, Singapore
\ No newline at end of file

From 483b46190617d91457da7f170e77d6f8c32b15cf Mon Sep 17 00:00:00 2001
From: Miltos Allamanis
Date: Tue, 29 Dec 2020 19:27:20 +0000
Subject: [PATCH 046/297] Add two papers.

---
 _publications/liu2020automating.markdown  | 10 ++++++++++
 _publications/tian2020evaluating.markdown |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)
 create mode 100644 _publications/liu2020automating.markdown

diff --git a/_publications/liu2020automating.markdown b/_publications/liu2020automating.markdown
new file mode 100644
index 00000000..369f3a9a
--- /dev/null
+++ b/_publications/liu2020automating.markdown
@@ -0,0 +1,10 @@
+---
+layout: publication
+title: "Automating Just-In-Time Comment Updating"
+authors: Zhongxin Liu, Xin Xia, Meng Yan, Shanping Li
+conference: ASE
+year: 2020
+bibkey: liu2020automating
+tags: ["documentation"]
+---
+Code comments are valuable for program comprehension and software maintenance, and also require maintenance with code evolution. However, when changing code, developers sometimes neglect updating the related comments, bringing in inconsistent or obsolete comments (a.k.a. bad comments).
Such comments are detrimental since they may mislead developers and lead to future bugs. Therefore, it is necessary to fix and avoid bad comments. In this work, we argue that bad comments can be reduced and even avoided by automatically performing comment updates with code changes. We refer to this task as “Just-In-Time (JIT) Comment Updating” and propose an approach named CUP (Comment UPdater) to automate this task. CUP can be used to assist developers in updating comments during code changes and can consequently help avoid the introduction of bad comments. Specifically, CUP leverages a novel neural sequence-to-sequence model to learn comment update patterns from extant code-comment co-changes and can automatically generate a new comment based on its corresponding old comment and code change. Several customized enhancements, such as a special tokenizer and a novel co-attention mechanism, are introduced in CUP to handle the characteristics of this task. We build a dataset with over 108K comment-code co-change samples and evaluate CUP on it. The evaluation results show that CUP outperforms an information-retrieval-based baseline and a rule-based baseline by substantial margins, and can reduce the edits developers must make for JIT comment updating. In addition, the comments generated by our approach are identical to those updated by developers in 1612 (16.7%) test samples, 7 times more than the best-performing baseline.

diff --git a/_publications/tian2020evaluating.markdown b/_publications/tian2020evaluating.markdown
index ecc8d51b..5fd6c68f 100644
--- a/_publications/tian2020evaluating.markdown
+++ b/_publications/tian2020evaluating.markdown
@@ -7,6 +7,6 @@ year: 2020
 bibkey: tian2020evaluating
 additional_links:
   - {name: "ArXiV", url: "https://arxiv.org/abs/2008.02944"}
-tags: ["edit", "defect"]
+tags: ["repair", "transformers"]
 ---
 A large body of the literature on automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.

From aaf819964f833df9a078fcc68956c2091828ea17 Mon Sep 17 00:00:00 2001
From: Miltos Allamanis
Date: Thu, 14 Jan 2021 13:33:05 +0000
Subject: [PATCH 047/297] Add ribbon to (hopefully) nudge more participation.
---
 _includes/sidebar.html |  1 +
 public/css/hyde.css    | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/_includes/sidebar.html b/_includes/sidebar.html
index 1535bd8c..4029daef 100644
--- a/_includes/sidebar.html
+++ b/_includes/sidebar.html
@@ -1,3 +1,4 @@
+Contribute to ML4Code