From a3d816d97f22be40bc40028aaecd2154d877a861 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 29 Jun 2022 14:58:56 +0300 Subject: [PATCH 001/114] Add paper. --- _publications/dinella2022toga.markdown | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) create mode 100644 _publications/dinella2022toga.markdown diff --git a/_publications/dinella2022toga.markdown b/_publications/dinella2022toga.markdown new file mode 100644 index 00000000..fbc8ff55 --- /dev/null +++ b/_publications/dinella2022toga.markdown @@ -0,0 +1,28 @@ +--- +layout: publication +title: "TOGA: A Neural Method for Test Oracle Generation" +authors: Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, Shuvendu K. Lahiri +conference: ICSE +year: 2022 +additional_links: + - {name: "Preprint", url: "https://www.seas.upenn.edu/~edinella/icse-camera-ready.pdf"} +tags: ["code generation", "Transformer", "test generation"] +--- +Testing is widely recognized as an important stage of the software +development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. +In terms of documentation, unit tests express a unit’s intended +functionality, as conceived by the developer. A test oracle, typically expressed as a condition, documents the intended behavior +of a unit under a given test prefix. Synthesizing a functional test +oracle is a challenging problem, as it must capture the intended +functionality rather than the implemented functionality. +In this paper, we propose TOGA (a neural method for Test Oracle +GenerAtion), a unified transformer-based neural approach to infer +both exceptional and assertion test oracles based on the context of +the focal method. Our approach can handle units with ambiguous +or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy +and functional bug-finding. Our technique improves accuracy by +33% over existing oracle inference approaches, achieving 96% overall accuracy on a held out test dataset. Furthermore, we show that +when integrated with an automated test generation tool (EvoSuite), +our approach finds 57 real world bugs in large-scale Java programs, +including 30 bugs that are not found by any other automated testing +method in our evaluation. From 303d6bda39f4f909f514a87719560ee95f8b6277 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 29 Jun 2022 14:59:54 +0300 Subject: [PATCH 002/114] fix tag --- _publications/nye2021show.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/nye2021show.markdown b/_publications/nye2021show.markdown index 05437381..3bb58a6f 100644 --- a/_publications/nye2021show.markdown +++ b/_publications/nye2021show.markdown @@ -6,6 +6,6 @@ conference: year: 2021 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2112.00114"} -tags: ["Transformer", "excecution"] +tags: ["Transformer", "execution"] --- Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations.
In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations. From 7d463cd53dbbafb1f7b6ac11113fb862b3215832 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 29 Jun 2022 22:09:31 +0300 Subject: [PATCH 003/114] Add paper --- _publications/shrivastava2020repository.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/shrivastava2020repository.markdown diff --git a/_publications/shrivastava2020repository.markdown b/_publications/shrivastava2020repository.markdown new file mode 100644 index 00000000..5af6a384 --- /dev/null +++ b/_publications/shrivastava2020repository.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Repository-Level Prompt Generation for Large Language Models of Code" +authors: Disha Shrivastava, Hugo Larochelle, Daniel Tarlow +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2206.12839"} +tags: ["Transformer", "code completion"] +--- +With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using a set of rules. These rules take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn't require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our proposed rules gives up to 36% relative improvement over Codex, showing the quality of the rules. Further, we show that when we train a model to select the best rule, we can achieve significant performance gains over Codex. The code for our work can be found at: https://github.com/shrivastavadisha/repo_level_prompt_generation . From 7474bbd959777267c9376df3fc8bb5539f93712e Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 2 Jul 2022 09:55:30 +0300 Subject: [PATCH 004/114] Add papers. --- _publications/bareiss2022code.markdown | 11 +++++++ _publications/zeng2022extensive.markdown | 38 ++++++++++++++++++++++++ 2 files changed, 49 insertions(+) create mode 100644 _publications/bareiss2022code.markdown create mode 100644 _publications/zeng2022extensive.markdown diff --git a/_publications/bareiss2022code.markdown b/_publications/bareiss2022code.markdown new file mode 100644 index 00000000..9d2578fc --- /dev/null +++ b/_publications/bareiss2022code.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Code Generation Tools (Almost) for Free? 
A Study of Few-Shot, Pre-Trained Language Models on Code" +authors: Patrick Bareiß, Beatriz Souza, Marcelo d'Amorim, Michael Pradel +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2206.01335"} +tags: ["Transformer"] +--- +Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par (test oracle generation), or even outperform their respective traditionally built tool (test case generation), while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks. diff --git a/_publications/zeng2022extensive.markdown b/_publications/zeng2022extensive.markdown new file mode 100644 index 00000000..f9418aa2 --- /dev/null +++ b/_publications/zeng2022extensive.markdown @@ -0,0 +1,38 @@ +--- +layout: publication +title: "An Extensive Study on Pre-trained Models for Program Understanding and Generation" +authors: Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, Lingming Zhang +conference: ISSTA +year: 2022 +additional_links: + - {name: "Author Version", url: "http://lingming.cs.illinois.edu/publications/issta2022.pdf"} +tags: ["Transformer", "evaluation"] +--- +Automatic program understanding and generation techniques could +significantly advance the productivity of programmers and have +been widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop +general-purpose pre-trained models which can be applied for a +broad range of program understanding and generation tasks. Such +pre-trained models, derived by self-supervised objectives on large +unlabelled corpora, can be fine-tuned in downstream tasks (such +as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior +techniques, they seldom follow equivalent evaluation protocols, e.g., +they are hardly evaluated on the identical benchmarks, tasks, or settings. 
Consequently, there is a pressing need for a comprehensive +study of the pre-trained models on their effectiveness, versatility +as well as the limitations to provide implications and guidance for +the future development in this area. To this end, we first perform +an extensive study of eight open-access pre-trained models over +a large benchmark on seven representative code tasks to assess +their reproducibility. We further compare the pre-trained models +and domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the +pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we +can in general replicate the original performance of the pre-train +models on their evaluated tasks and adopted benchmarks, subtle +performance fluctuations can refute the findings in their original +papers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models +can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform +the first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a +simple random attack approach can easily fool the state-of-the-art +pre-trained models and thus incur security issues. At last, we also +provide multiple practical guidelines for advancing future research +on pre-trained models for program understanding and generation. From 00e6f1caa4e426f03c813f9420c05c76455e01f9 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 2 Jul 2022 10:36:20 +0300 Subject: [PATCH 005/114] Add DeepPerf --- _publications/garg2022deepperf.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/garg2022deepperf.markdown diff --git a/_publications/garg2022deepperf.markdown b/_publications/garg2022deepperf.markdown new file mode 100644 index 00000000..4b2e6b28 --- /dev/null +++ b/_publications/garg2022deepperf.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "DeepPERF: A Deep Learning-Based Approach For Improving Software Performance" +authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, Chen Wu +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2206.13619"} +tags: ["Transformer", "optimization"] +--- +Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and Source code corpora and followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. 
Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations. So far we've submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners. From b68ab767f9f5d6043d3d7852060eb0706ee407be Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 2 Jul 2022 11:02:33 +0300 Subject: [PATCH 006/114] Add DiverseTyper --- _publications/jesse2022learning.markdown | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 _publications/jesse2022learning.markdown diff --git a/_publications/jesse2022learning.markdown b/_publications/jesse2022learning.markdown new file mode 100644 index 00000000..994a909e --- /dev/null +++ b/_publications/jesse2022learning.markdown @@ -0,0 +1,9 @@ +--- +layout: publication +title: "Learning To Predict User-Defined Types" +authors: Kevin Jesse, Premkumar T. Devanbu, Anand Sawant +conference: TSE +year: 2022 +tags: ["Transformer", "types"] +--- +TypeScript is a widely adopted gradual typed language where developers can optionally type variables, functions, parameters and more. Probabilistic type inference approaches with ML (machine learning) work well especially for commonly occurring types such as boolean, number, and string. TypeScript permits a wide range of types including developer defined class names and type interfaces. These developer defined types, termed user-defined types, can be written within the realm of language naming conventions. The set of user-defined types is boundless and existing bounded type guessing approaches are an imperfect solution. Existing works either under perform in user-defined types or ignore user-defined types altogether. This work leverages a BERT-style pre-trained model, with multi-task learning objectives, to learn how to type user-defined classes and interfaces. Thus we present DIVERSETYPER, a solution that explores the diverse set of user-defined types by uniquely aligning classes and interfaces declarations to the places in which they are used. DIVERSETYPER surpasses all existing works including those that model user-defined types. From 60f5c286544fcaca8bea2f2c5a02fdf7672f4957 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 4 Jul 2022 17:09:54 +0300 Subject: [PATCH 007/114] Add two papers. --- _publications/ahmed2022learning.markdown | 11 +++++++++++ _publications/ziegler2022productivity.markdown | 12 ++++++++++++ 2 files changed, 23 insertions(+) create mode 100644 _publications/ahmed2022learning.markdown create mode 100644 _publications/ziegler2022productivity.markdown diff --git a/_publications/ahmed2022learning.markdown b/_publications/ahmed2022learning.markdown new file mode 100644 index 00000000..eba1aebc --- /dev/null +++ b/_publications/ahmed2022learning.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Learning code summarization from a small and local dataset" +authors: Toufique Ahmed, Premkumar Devanbu +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2206.00804"} +tags: ["Transformer", "summarization"] +--- +Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. 
These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python. diff --git a/_publications/ziegler2022productivity.markdown b/_publications/ziegler2022productivity.markdown new file mode 100644 index 00000000..5cb1d1bb --- /dev/null +++ b/_publications/ziegler2022productivity.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Productivity Assessment of Neural Code Completion" +authors: Albert Ziegler, Eirini Kalliamvakou, Shawn Simister, Ganesh Sittampalam, Alice Li, Andrew Rice, Devon Rifkin, Edward Aftandilian +conference: MAPS +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2205.06537"} + - {name: "Data", url: "https://github.com/wunderalbert/prod-neural-materials"} +tags: ["evaluation", "human evaluation"] +--- +Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers' productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers' perception of productivity. From 7f3a63f1217d650cfd1122f930b8f9f40f98b70c Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 5 Jul 2022 13:50:57 +0300 Subject: [PATCH 008/114] Add human eval of Codex --- _publications/barke2022grounded.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/barke2022grounded.markdown diff --git a/_publications/barke2022grounded.markdown b/_publications/barke2022grounded.markdown new file mode 100644 index 00000000..2af4be2f --- /dev/null +++ b/_publications/barke2022grounded.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Grounded Copilot: How Programmers Interact with Code-Generating Models" +authors: Shraddha Barke, Michael B. James, Nadia Polikarpova +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2206.15000"} +tags: ["human evaluation", "synthesis"] +--- +Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? 
We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants--with a range of prior experience using the assistant--as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants. From 6e9b8dca82155aa7f97df160389bf68f460eea0a Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 11 Jul 2022 10:28:37 +0300 Subject: [PATCH 009/114] Add papers --- _publications/reid2022learning.markdown | 11 +++++++++++ _publications/richter2022can.markdown | 14 ++++++++++++++ 2 files changed, 25 insertions(+) create mode 100644 _publications/reid2022learning.markdown create mode 100644 _publications/richter2022can.markdown diff --git a/_publications/reid2022learning.markdown b/_publications/reid2022learning.markdown new file mode 100644 index 00000000..a33f8eff --- /dev/null +++ b/_publications/reid2022learning.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Learning to Model Editing Processes" +authors: Machel Reid, Graham Neubig +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2205.12374"} +tags: ["Transformer", "edit"] +--- +Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content; iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multistep edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits. diff --git a/_publications/richter2022can.markdown b/_publications/richter2022can.markdown new file mode 100644 index 00000000..d462f424 --- /dev/null +++ b/_publications/richter2022can.markdown @@ -0,0 +1,14 @@ +--- +layout: publication +title: "Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes" +authors: Cedric Richter, Heike Wehrheim +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.00301"} + - {name: "Code", url: "https://github.com/cedricrupb/nbfbaselines"} +tags: ["Transformer", "repair", "defect"] +--- +Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. 
In contrast, artificial bugs -- produced by mutating existing source code -- can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs. + +We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance. From 8199df61982cc966d154e6a5ade95abae4dacf41 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 18 Jul 2022 10:42:03 +0100 Subject: [PATCH 010/114] Add paper --- _publications/zhou2022docoder.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/zhou2022docoder.markdown diff --git a/_publications/zhou2022docoder.markdown b/_publications/zhou2022docoder.markdown new file mode 100644 index 00000000..8e23e65b --- /dev/null +++ b/_publications/zhou2022docoder.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "DocCoder: Generating Code by Retrieving and Reading Docs" +authors: Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao JIang, Graham Neubig +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.05987"} + - {name: "Code and Data", url: "https://github.com/shuyanzhou/doccoder"} +tags: ["Transformer", "search", "code generation"] +--- +Natural-language-to-code models learn to generate a code snippet given a natural language (NL) intent. However, the rapid growth of both publicly available and proprietary libraries and functions makes it impossible to cover all APIs using training examples, as new libraries and functions are introduced daily. Thus, existing models inherently cannot generalize to using unseen functions and libraries merely through incorporating them into the training data. In contrast, when human programmers write programs, they frequently refer to textual resources such as code manuals, documentation, and tutorials, to explore and understand available library functionality. Inspired by this observation, we introduce DocCoder: an approach that explicitly leverages code manuals and documentation by (1) retrieving the relevant documentation given the NL intent, and (2) generating the code based on the NL intent and the retrieved documentation. Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocCoder consistently improves NL-to-code models: DocCoder achieves 11x higher exact match accuracy than strong baselines on a new Bash dataset tldr; on the popular Python CoNaLa benchmark, DocCoder improves over strong baselines by 1.65 BLEU. 
From 93a3f7e1476c6ab5cd0ad8abf0efca3c45654487 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 19 Jul 2022 10:11:06 +0100 Subject: [PATCH 011/114] Add paper --- _publications/szafraniec2022code.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/szafraniec2022code.markdown diff --git a/_publications/szafraniec2022code.markdown b/_publications/szafraniec2022code.markdown new file mode 100644 index 00000000..2f5c4072 --- /dev/null +++ b/_publications/szafraniec2022code.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Code Translation with Compiler Representations" +authors: Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.03578"} +tags: ["Transformer", "migration", "decompilation"] +--- +In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation. From 7013dd123f511c1b853f322af80b0f8b3a073a22 Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 10 Aug 2022 09:53:46 +0100 Subject: [PATCH 012/114] Create fried2022incoder.markdown --- _publications/fried2022incoder.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/fried2022incoder.markdown diff --git a/_publications/fried2022incoder.markdown b/_publications/fried2022incoder.markdown new file mode 100644 index 00000000..9364be5f --- /dev/null +++ b/_publications/fried2022incoder.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "InCoder: A Generative Model for Code Infilling and Synthesis" +authors: Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, Mike Lewis +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2204.05999"} +tags: ["Transformer", "code generation", "naming", "summarization"] +--- +Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). 
InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released at https://sites.google.com/view/incoder-code-models From ce5440d77c24949c7fb07b24c12752ea81c70bcf Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sun, 14 Aug 2022 13:52:25 +0100 Subject: [PATCH 013/114] Add paper --- _publications/bavarian2022efficient.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/bavarian2022efficient.markdown diff --git a/_publications/bavarian2022efficient.markdown b/_publications/bavarian2022efficient.markdown new file mode 100644 index 00000000..ab873f1e --- /dev/null +++ b/_publications/bavarian2022efficient.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Efficient Training of Language Models to Fill in the Middle" +authors: Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.14255"} +tags: ["Transformer", "language model"] +--- +We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research. From fcca29b2a51fc109d0e0e841698ce7cca78fe469 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Thu, 18 Aug 2022 11:08:46 +0100 Subject: [PATCH 014/114] Add paper. 
--- _publications/zhang2022coditt5.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/zhang2022coditt5.markdown diff --git a/_publications/zhang2022coditt5.markdown b/_publications/zhang2022coditt5.markdown new file mode 100644 index 00000000..99e60ac7 --- /dev/null +++ b/_publications/zhang2022coditt5.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "CoditT5: Pretraining for Source Code and Natural Language Editing" +authors: Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, Milos Gligoric +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2208.05446"} +tags: ["Transformer", "edit"] +--- +Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks. From 2f0c2f6f95978c9a23b09e1b5dfa80548a998755 Mon Sep 17 00:00:00 2001 From: Goutham Ramakrishnan Date: Sun, 21 Aug 2022 14:22:36 -0700 Subject: [PATCH 015/114] Update ramakrishnan2020semantic.markdown --- _publications/ramakrishnan2020semantic.markdown | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_publications/ramakrishnan2020semantic.markdown b/_publications/ramakrishnan2020semantic.markdown index e1c14529..9e136d2d 100644 --- a/_publications/ramakrishnan2020semantic.markdown +++ b/_publications/ramakrishnan2020semantic.markdown @@ -1,13 +1,13 @@ --- layout: publication title: "Semantic Robustness of Models of Source Code" -authors: Goutham Ramakrishnan, Jordan Henkel, Zi Wang, Aws Albarghouthi, Somesh Jha, Thomas Reps -conference: -year: 2020 +authors: Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh Jha, Thomas Reps +conference: SANER +year: 2022 additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9825895"} - {name: "ArXiV", url: "https://arxiv.org/abs/2002.03043"} + - {name: "Code", url: "https://github.com/jjhenkel/averloc"} tags: ["adversarial", "naming"] --- -Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem in the context of models of source code, where we want the network to be robust to source-code modifications that preserve code functionality. We define a natural notion of robustness, k-transformation robustness, in which an adversary performs up to k semantics-preserving transformations to an input program. We show how to train robust models using an adversarial training objective inspired by that of Madry et al. (2018) for continuous domains. 
- -We implement an extensible framework for adversarial training over source code, and conduct a thorough evaluation on a number of datasets and two different architectures. Our results show (1) the increase in robustness following adversarial training, (2) the ability of training on weak adversaries to provide robustness to attacks by stronger adversaries, and (3) the shift in attribution focus of adversarially trained models towards semantic vs. syntactic features. +Deep neural networks are vulnerable to adversarial examples-small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope)-however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language- transfer tasks. From 8541f8676b2f329fd02c57e4c03c3fa6c984942c Mon Sep 17 00:00:00 2001 From: Goutham Ramakrishnan Date: Sun, 21 Aug 2022 14:23:29 -0700 Subject: [PATCH 016/114] Update ramakrishnan2020semantic.markdown --- _publications/ramakrishnan2020semantic.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/ramakrishnan2020semantic.markdown b/_publications/ramakrishnan2020semantic.markdown index 9e136d2d..2eca6367 100644 --- a/_publications/ramakrishnan2020semantic.markdown +++ b/_publications/ramakrishnan2020semantic.markdown @@ -10,4 +10,4 @@ additional_links: - {name: "Code", url: "https://github.com/jjhenkel/averloc"} tags: ["adversarial", "naming"] --- -Deep neural networks are vulnerable to adversarial examples-small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. 
We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope)-however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language- transfer tasks. +Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope) - however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language transfer tasks. 
From 4e1f2e7c4ce490ace2f99b94d54788190377ebf1 Mon Sep 17 00:00:00 2001 From: Goutham Ramakrishnan Date: Sun, 21 Aug 2022 14:25:30 -0700 Subject: [PATCH 017/114] Update ramakrishnan2020semantic.markdown --- _publications/ramakrishnan2020semantic.markdown | 1 + 1 file changed, 1 insertion(+) diff --git a/_publications/ramakrishnan2020semantic.markdown b/_publications/ramakrishnan2020semantic.markdown index 2eca6367..f6978565 100644 --- a/_publications/ramakrishnan2020semantic.markdown +++ b/_publications/ramakrishnan2020semantic.markdown @@ -5,6 +5,7 @@ authors: Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh conference: SANER year: 2022 additional_links: + - {name: "PDF", url: "https://pages.cs.wisc.edu/~jjhenkel/papers/saner22-semantic-robustness.pdf"} - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9825895"} - {name: "ArXiV", url: "https://arxiv.org/abs/2002.03043"} - {name: "Code", url: "https://github.com/jjhenkel/averloc"} From f038f103880d67f9564822f6ad2e37ae3ef7e200 Mon Sep 17 00:00:00 2001 From: Goutham Ramakrishnan Date: Sun, 21 Aug 2022 14:33:47 -0700 Subject: [PATCH 018/114] Create ramakrishnan2020backdoors.markdown --- _publications/ramakrishnan2020backdoors.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/ramakrishnan2020backdoors.markdown diff --git a/_publications/ramakrishnan2020backdoors.markdown b/_publications/ramakrishnan2020backdoors.markdown new file mode 100644 index 00000000..f19bef94 --- /dev/null +++ b/_publications/ramakrishnan2020backdoors.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Backdoors in Neural Models of Source Code" +authors: Goutham Ramakrishnan, Aws Albarghouthi +conference: +year: 2020 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2006.06841"} + - {name: "Code", url: "https://github.com/goutham7r/backdoors-for-code"} +tags: ["adversarial"] +--- +Deep neural networks are vulnerable to a range of adversaries. A particularly pernicious class of vulnerabilities are backdoors, where model predictions diverge in the presence of subtle triggers in inputs. An attacker can implant a backdoor by poisoning the training data to yield a desired target prediction on triggered inputs. We study backdoors in the context of deep-learning for source code. (1) We define a range of backdoor classes for source-code tasks and show how to poison a dataset to install such backdoors. (2) We adapt and improve recent algorithms from robust statistics for our setting, showing that backdoors leave a spectral signature in the learned representation of source code, thus enabling detection of poisoned data. (3) We conduct a thorough evaluation on different architectures and languages, showing the ease of injecting backdoors and our ability to eliminate them. 
From f50282e991ad6874fc69d6d9ec5c02b96843ca90 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 22 Aug 2022 08:25:04 +0100 Subject: [PATCH 019/114] Rename ramakrishnan2020semantic.markdown to henkel2020semantic.markdown --- ...akrishnan2020semantic.markdown => henkel2020semantic.markdown} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _publications/{ramakrishnan2020semantic.markdown => henkel2020semantic.markdown} (100%) diff --git a/_publications/ramakrishnan2020semantic.markdown b/_publications/henkel2020semantic.markdown similarity index 100% rename from _publications/ramakrishnan2020semantic.markdown rename to _publications/henkel2020semantic.markdown From 03e7446318fc47ec7f7c243be4d2a96a84ed7f3c Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Fri, 26 Aug 2022 15:24:01 +0300 Subject: [PATCH 020/114] Add paper --- _publications/sarkar2022what.markdown | 15 +++++++++++++++ 1 file changed, 15 insertions(+) create mode 100644 _publications/sarkar2022what.markdown diff --git a/_publications/sarkar2022what.markdown b/_publications/sarkar2022what.markdown new file mode 100644 index 00000000..e8507132 --- /dev/null +++ b/_publications/sarkar2022what.markdown @@ -0,0 +1,15 @@ +--- +layout: publication +title: "What is it like to program with artificial intelligence?" +authors: Advait Sarkar, Andrew D. Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, Ben Zorn +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2208.06213"} +tags: ["human evaluation", "review"] +--- +Large language models, such as OpenAI's codex and Deepmind's AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely-used programming editor extension: GitHub Copilot. + +In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities as well as the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges. + +Finally, we draw upon observations from a user study in which non-expert end user programmers use LLM-assisted tools for solving data tasks in spreadsheets. We discuss the issues that might arise, and open research challenges, in applying large language models to end-user programming, particularly with users who have little or no programming expertise. 
From 5e84dd225c0011f739f29708d4155186a42eeff0 Mon Sep 17 00:00:00 2001 From: Pengyu Nie Date: Fri, 26 Aug 2022 14:34:22 -0500 Subject: [PATCH 021/114] Update panthaplackel2020learning.markdown add a missing author --- _publications/panthaplackel2020learning.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/panthaplackel2020learning.markdown b/_publications/panthaplackel2020learning.markdown index dbc043d7..5fb3b7a2 100644 --- a/_publications/panthaplackel2020learning.markdown +++ b/_publications/panthaplackel2020learning.markdown @@ -1,7 +1,7 @@ --- layout: publication title: "Learning to Update Natural Language Comments Based on Code Changes" -authors: Sheena Panthaplackel, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li +authors: Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li conference: ACL year: 2020 additional_links: From a8e7192e8112fd2c6c4f6197bb042f2bde1c59b8 Mon Sep 17 00:00:00 2001 From: Sean Date: Tue, 23 Aug 2022 10:27:05 +0100 Subject: [PATCH 022/114] Create lherondelle2022topical.markdown --- _publications/lherondelle2022topical.markdown | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) create mode 100644 _publications/lherondelle2022topical.markdown diff --git a/_publications/lherondelle2022topical.markdown b/_publications/lherondelle2022topical.markdown new file mode 100644 index 00000000..52eb73ef --- /dev/null +++ b/_publications/lherondelle2022topical.markdown @@ -0,0 +1,20 @@ +--- +layout: publication +title: "Topical: Learning Repository Embeddings from Source Code using Attention" +authors: Agathe Lherondelle, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran +conference: Arxiv +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/pdf/2208.09495.pdf"} +tags: ["representation", "topic modelling"] +--- +Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode +augments the software developer’s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level +representation of code is sufficient, however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example, +auto-tagging repositories with topics or auto-documentation of repository code etc. Existing methods for computing repository level representations suffer from (a) reliance on natural language +documentation of code (for example, README files) (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical a +deep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the +script level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that +were crawled along with their ground truth topic tags. 
Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines +that naively combine the method-level representations through averaging or concatenation at the task of repository auto-tagging. Furthermore, we show that Topical’s attention mechanism outperforms naive aggregation methods when computing repository-level representations from script-level representation generated +by existing methods. Topical is a lightweight framework for computing repository-level representation of code repositories that scales efficiently with the number of topics and dataset size. From 2c98afc2622cd33abde86cee14335525e3b24124 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 10 Oct 2022 10:26:11 +0100 Subject: [PATCH 023/114] Add two papers --- _publications/key2022speak.markdown | 11 +++++++++++ _publications/nadeem2022codedsi.markdown | 11 +++++++++++ 2 files changed, 22 insertions(+) create mode 100644 _publications/key2022speak.markdown create mode 100644 _publications/nadeem2022codedsi.markdown diff --git a/_publications/key2022speak.markdown b/_publications/key2022speak.markdown new file mode 100644 index 00000000..efc5056e --- /dev/null +++ b/_publications/key2022speak.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "I Speak, You Verify: Toward Trustworthy Neural Program Synthesis" +authors: Darren Key, Wen-Ding Li, Kevin Ellis +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2210.00848"} +tags: ["synthesis"] +--- +We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. We learn to analyze the agreement between programs and predicates to judge both which program is most likely to be correct, and also judge whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct. diff --git a/_publications/nadeem2022codedsi.markdown b/_publications/nadeem2022codedsi.markdown new file mode 100644 index 00000000..224c2e8b --- /dev/null +++ b/_publications/nadeem2022codedsi.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "CodeDSI: Differentiable Code Search" +authors: Usama Nadeem, Noah Ziems, Shaoen Wu +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2210.00328"} +tags: ["search"] +--- +Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation -- neural code search -- is a field of machine learning where a model takes natural language queries as input and, in turn, relevant code samples from a database are returned.
Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, impact of tokenization on docid structure, and dataset sizes on overall code search performance. Our results demonstrate CodeDSI strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes. \ No newline at end of file From 827dfa9f91f1caf3d64bba3851277bc2787c3ee3 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 10 Oct 2022 10:40:25 +0100 Subject: [PATCH 024/114] Add CodeT --- _publications/chen2022codet.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/chen2022codet.markdown diff --git a/_publications/chen2022codet.markdown b/_publications/chen2022codet.markdown new file mode 100644 index 00000000..e7f4a5d2 --- /dev/null +++ b/_publications/chen2022codet.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "CodeT: Code Generation with Generated Tests" +authors: Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.10397"} +tags: ["synthesis"] +--- +Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results. 
From bbb9ab237b86a8329afa866fa73d70841c85f566 Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 12 Oct 2022 11:36:56 +0100 Subject: [PATCH 025/114] Add paper --- _publications/sahu2022learning.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/sahu2022learning.markdown diff --git a/_publications/sahu2022learning.markdown b/_publications/sahu2022learning.markdown new file mode 100644 index 00000000..c80232b7 --- /dev/null +++ b/_publications/sahu2022learning.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Learning to Answer Semantic Queries over Code" +authors: Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, Shirish Shevade +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2209.08372"} +tags: ["static analysis", "Transformer"] +--- +During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code. +We build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code. 
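The extractive question-answering setup described in the CodeQueries abstract above can be tried out, in a much weaker form, with any off-the-shelf span-prediction model by treating the source code as the context. The sketch below is a generic baseline and not the paper's models; the checkpoint name is an assumption, and real semantic queries would need far more than a SQuAD-style model.

```python
# Generic extractive-QA-over-code sketch; a placeholder baseline, not the CodeQueries system.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # assumed checkpoint

code_context = '''
def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return download(url)
        except TimeoutError:
            continue
    raise RuntimeError("giving up")
'''

result = qa(question="What exception causes another attempt?", context=code_context)
print(result["answer"], result["score"])  # the answer is a span copied out of the code text
```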
From c1d2092b868bbd3e5e4f5ad9f9954c64fd40d0e2 Mon Sep 17 00:00:00 2001 From: Miltos Date: Thu, 13 Oct 2022 21:29:29 +0100 Subject: [PATCH 026/114] Add paper --- _publications/chen2021plur.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/chen2021plur.markdown diff --git a/_publications/chen2021plur.markdown b/_publications/chen2021plur.markdown new file mode 100644 index 00000000..645015dc --- /dev/null +++ b/_publications/chen2021plur.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair" +authors: Zimin Chen, Vincent J Hellendoorn, Pascal Lamblin, Petros Maniatis, Pierre-Antoine Manzagol, Daniel Tarlow, Subhodeep Moitra +conference: NeurIPS +year: 2021 +additional_links: + - {name: "NeurIPS Proceedings", url: "https://proceedings.neurips.cc/paper/2021/hash/c2937f3a1b3a177d2408574da0245a19-Abstract.html"} +tags: ["repair"] +--- +Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks.This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions.In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers.Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers.This unification further enables multi-task learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks.The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur). From f80a9ad28fe90e467d3312efb6acec355535742e Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 21 Oct 2022 11:34:48 +0100 Subject: [PATCH 027/114] Add CrystalBLEU --- _publications/eghbali2022crystalbleu.markdown | 34 +++++++++++++++++++ 1 file changed, 34 insertions(+) create mode 100644 _publications/eghbali2022crystalbleu.markdown diff --git a/_publications/eghbali2022crystalbleu.markdown b/_publications/eghbali2022crystalbleu.markdown new file mode 100644 index 00000000..cdc84e98 --- /dev/null +++ b/_publications/eghbali2022crystalbleu.markdown @@ -0,0 +1,34 @@ +--- +layout: publication +title: "CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code" +authors: Aryaz Eghbali, Michael Pradel +conference: ASE +year: 2022 +additional_links: + - {name: "Preprint", url: "https://arxiv.org/abs/xxxx.xxxxxx"} +tags: ["evaluation"] +--- +Recent years have brought a surge of work on predicting pieces +of source code, e.g., for code completion, code migration, program +repair, or translating natural language into code. All this work faces +the challenge of evaluating the quality of a prediction w.r.t. some +oracle, typically in the form of a reference solution. 
A common +evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but +adopted in software engineering because it can be easily computed +on any programming language and enables automated evaluation at +scale. However, a key difference between natural and programming +languages is that in the latter, completely unrelated pieces of code +may have many common n-grams simply because of the syntactic +verbosity and coding conventions of programming languages. We +observe that these trivially shared n-grams hamper the ability of +the metric to distinguish between truly similar code examples and +code examples that are merely written in the same language. This +paper presents CrystalBLEU, an evaluation metric based on BLEU, +that allows for precisely and efficiently measuring the similarity of +code. Our metric preserves the desirable properties of BLEU, such +as being language-agnostic, able to handle incomplete or partially +incorrect code, and efficient, while reducing the noise caused by +trivially shared n-grams. We evaluate CrystalBLEU on two datasets +from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish +similar from dissimilar code examples 1.9–4.5 times more effectively, when compared to the original BLEU score and a previously +proposed variant of BLEU for code. From ed0d3a799e0d5249fcf4add29be62f9c85799453 Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 21 Oct 2022 11:35:05 +0100 Subject: [PATCH 028/114] Fix link --- _publications/eghbali2022crystalbleu.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/eghbali2022crystalbleu.markdown b/_publications/eghbali2022crystalbleu.markdown index cdc84e98..488a5781 100644 --- a/_publications/eghbali2022crystalbleu.markdown +++ b/_publications/eghbali2022crystalbleu.markdown @@ -5,7 +5,7 @@ authors: Aryaz Eghbali, Michael Pradel conference: ASE year: 2022 additional_links: - - {name: "Preprint", url: "https://arxiv.org/abs/xxxx.xxxxxx"} + - {name: "Preprint", url: "https://www.software-lab.org/publications/ase2022_CrystalBLEU.pdf"} tags: ["evaluation"] --- Recent years have brought a surge of work on predicting pieces From a191fcb8c1dacd75bd97cacf01ac69391a396f03 Mon Sep 17 00:00:00 2001 From: Miltos Date: Sun, 30 Oct 2022 07:54:30 +0000 Subject: [PATCH 029/114] Add the stack --- _publications/kocetkov2022stack.markdown | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 _publications/kocetkov2022stack.markdown diff --git a/_publications/kocetkov2022stack.markdown b/_publications/kocetkov2022stack.markdown new file mode 100644 index 00000000..6bb0e716 --- /dev/null +++ b/_publications/kocetkov2022stack.markdown @@ -0,0 +1,23 @@ +--- +layout: publication +title: "The Stack: 3TB of permissively licensed source code" +authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries +conference: +year: 2022 +additional_links: + - {name: "Preprint", url: "https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view"} +tags: ["dataset"] +--- +Large Language Models (LLMs) play an ever-increasing role in the field of +Artificial Intelligence (AI)–not only for natural language processing but also +for code understanding and generation. 
To stimulate open and responsible +research on LLMs for code, we introduce The Stack, a 3.1 TB dataset +consisting of permissively licensed source code in 30 programming languages. +We describe how we collect the full dataset, construct a permissively licensed +subset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that +(1) near-deduplicating the data significantly boosts performance across all +experiments, and (2) it is possible to match previously reported HumanEval +and MBPP performance using only permissively licensed data. We make the +dataset available at https://hf.co/BigCode and give developers the possi- +bility to have their code removed from the dataset by following the instruc- +tions at https://www.bigcode-project.org/docs/about/the-stack/. From 6908daf973a12b9d9b8b83f8144c759678f1c9b7 Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 4 Nov 2022 10:43:39 +0000 Subject: [PATCH 030/114] Add prompting paper --- _publications/doderlein2022piloting.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/doderlein2022piloting.markdown diff --git a/_publications/doderlein2022piloting.markdown b/_publications/doderlein2022piloting.markdown new file mode 100644 index 00000000..cbe23003 --- /dev/null +++ b/_publications/doderlein2022piloting.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic?" +authors: Jean-Baptiste Döderlein, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale +conference: +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2210.14699"} +tags: ["Transformer"] +--- +Language models are promising solutions for tackling increasingly complex problems. In software engineering, they recently attracted attention in code assistants, with programs automatically written in a given programming language from a programming task description in natural language. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, preventing them from being used optimally. In this paper, we investigate the various input parameters of two language models, and conduct a study to understand if variations of these input parameters (e.g. programming task description and the surrounding context, creativity of the language model, number of generated solutions) can have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them over two code assistants (Copilot and Codex) and two benchmarks representing algorithmic problems (HumanEval and LeetCode). Our results showed that varying the input parameters can significantly improve the performance of language models. However, there is a tight dependency when varying the temperature, the prompt and the number of generated solutions, making it potentially hard for developers to properly control the parameters to obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance.
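The input-parameter exploration studied in the Copilot/Codex paper just above can be sketched generically. Nothing below comes from the paper: `generate_solutions` and `passes_tests` are assumed callbacks wrapping whatever code assistant and test harness are available, and the grid values are arbitrary examples of the temperature/sample-count trade-off the abstract describes.

```python
# Hedged sketch of sweeping sampling parameters for a code assistant.
# `generate_solutions(prompt, temperature, n)` and `passes_tests(code)` are supplied by the
# caller; they are stand-ins, not part of any specific vendor API.
from itertools import product

def sweep(prompt, generate_solutions, passes_tests,
          temperatures=(0.0, 0.4, 0.8), sample_counts=(1, 10, 100)):
    results = {}
    for temperature, n in product(temperatures, sample_counts):
        solutions = generate_solutions(prompt, temperature=temperature, n=n)
        # pass@n-style success: did any of the n samples solve the task?
        results[(temperature, n)] = any(passes_tests(code) for code in solutions)
    return results
```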
From ca9a1312311696d173c2f85758210c110ef3080b Mon Sep 17 00:00:00 2001 From: "Naiming Liu (Lucy)" <32887016+lucy66666@users.noreply.github.com> Date: Wed, 9 Nov 2022 14:11:41 -0600 Subject: [PATCH 031/114] Update liu2022open.markdown --- _publications/liu2022open.markdown | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/_publications/liu2022open.markdown b/_publications/liu2022open.markdown index 54d41a3c..1ff11cdb 100644 --- a/_publications/liu2022open.markdown +++ b/_publications/liu2022open.markdown @@ -6,12 +6,7 @@ conference: year: 2022 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2203.03716"} + - {name: "code", url: "https://github.com/lucy66666/OKT"} tags: ["education", "code generation"] --- -Knowledge tracing refers to the problem of estimating each student’s knowledge component/skill mastery level from their past responses to questions in educational applications. -One direct benefit knowledge tracing methods provide is the ability to predict each student’s performance on the future questions. -However, one key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether the responses are correct or incorrect. -Response correctness analysis/prediction is easy to navigate but loses important information, especially for open-ended questions: the exact student responses can potentially provide much more information about their knowledge states than only response correctness. -In this paper, we present our first exploration into open-ended knowledge tracing, i.e., the analysis and prediction of students’ openended responses to questions in the knowledge tracing setup. -We first lay out a generic framework for open-ended knowledge tracing before detailing its application to the domain of computer science education with programming questions. -We define a series of evaluation metrics in this domain and conduct a series of quantitative and qualitative experiments to test the boundaries of open-ended knowledge tracing methods on a real-world student code dataset. +In education applications, knowledge tracing refers to the problem of estimating students' time-varying concept/skill mastery level from their past responses to questions and predicting their future performance. One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction ignores important information on student knowledge contained in the exact content of the responses, especially for open-ended questions. In this paper, we conduct the first exploration into open-ended knowledge tracing (OKT) by studying the new task of predicting students' exact open-ended responses to questions. Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate OKT and demonstrate its promise in educational applications. 
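To give a flavour of the time-varying concept mastery that open-ended knowledge tracing conditions on, here is a toy per-concept tracker. It is a generic exponential-moving-average update invented for illustration, not the OKT model from the paper above.

```python
# Toy mastery tracker (illustration only; not the paper's knowledge-tracing model).
def update_mastery(mastery, concept, correct, rate=0.3):
    """Move the estimate for `concept` toward 1.0 on a correct response and 0.0 otherwise."""
    previous = mastery.get(concept, 0.5)
    mastery[concept] = (1 - rate) * previous + rate * (1.0 if correct else 0.0)
    return mastery

state = {}
for concept, correct in [("loops", True), ("recursion", False), ("loops", True)]:
    update_mastery(state, concept, correct)
print(state)  # roughly {'loops': 0.755, 'recursion': 0.35}
```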
From 8df12d6b114e5d7189d7318b9c4b04da54b38822 Mon Sep 17 00:00:00 2001 From: pkun <181250068@smail.nju.edu.cn> Date: Fri, 18 Nov 2022 13:32:19 +0800 Subject: [PATCH 032/114] add SPT-Code --- _publications/niu2022spt-code.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/niu2022spt-code.markdown diff --git a/_publications/niu2022spt-code.markdown b/_publications/niu2022spt-code.markdown new file mode 100644 index 00000000..26ea593f --- /dev/null +++ b/_publications/niu2022spt-code.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations" +authors: Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, Bin Luo +conference: ICSE +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2201.01549"} + - {name: "code", url: "https://github.com/NougatCA/SPT-Code"} +tags: ["Transformer", "representation"] +--- +Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on manycode-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus onpre-training only the encoder of the Transformer. For generation tasks that areaddressed using models with the encoder-decoder architecture, however, there isno reason why the decoder should be left out during pre-training. Second, manyexisting pre-trained models, including state-of-the-art models such asT5-learning, simply reuse the pre-training tasks designed for naturallanguages. Moreover, to learn the natural language description of source codeneeded eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and theassociated natural language description, which severely limits the amount ofdata for pre-training. To this end, we propose SPT-Code, a sequence-to-sequencepre-trained model for source code. In order to pre-train SPT-Code in asequence-to-sequence manner and address the aforementioned weaknessesassociated with existing pre-training tasks, we introduce three pre-trainingtasks that are specifically designed to enable SPT-Code to learn knowledge ofsource code, the corresponding code structure, as well as a natural languagedescription of the code without relying on any bilingual corpus, and eventuallyexploit these three sources of information when it is applied to downstreamtasks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning. 
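The SPT-Code abstract above names three input sources: the code itself, its structure, and a natural language description. The snippet below only illustrates how such a combined sequence could be assembled with Python's standard `ast` module; the linearization and the `<sep>` tokens are assumptions made for this example, not SPT-Code's actual input format.

```python
# Illustrative assembly of a (code, structure, natural language) sequence.
# The AST traversal and separator tokens are assumptions for this sketch.
import ast

def build_input(code: str, nl_description: str) -> str:
    tree = ast.parse(code)
    # Linearize the AST as a sequence of node-type names.
    structure = " ".join(type(node).__name__ for node in ast.walk(tree))
    return f"{code.strip()} <sep> {structure} <sep> {nl_description}"

example = build_input(
    "def add(a, b):\n    return a + b",
    "Return the sum of two numbers.",
)
print(example)
```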
From 950eeb80d2076fed168e765c306f8e747d367215 Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 18 Nov 2022 14:40:02 +0000 Subject: [PATCH 033/114] Apply suggestions from code review fix spacing --- _publications/niu2022spt-code.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/niu2022spt-code.markdown b/_publications/niu2022spt-code.markdown index 26ea593f..8a42fa41 100644 --- a/_publications/niu2022spt-code.markdown +++ b/_publications/niu2022spt-code.markdown @@ -9,4 +9,4 @@ additional_links: - {name: "code", url: "https://github.com/NougatCA/SPT-Code"} tags: ["Transformer", "representation"] --- -Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on manycode-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus onpre-training only the encoder of the Transformer. For generation tasks that areaddressed using models with the encoder-decoder architecture, however, there isno reason why the decoder should be left out during pre-training. Second, manyexisting pre-trained models, including state-of-the-art models such asT5-learning, simply reuse the pre-training tasks designed for naturallanguages. Moreover, to learn the natural language description of source codeneeded eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and theassociated natural language description, which severely limits the amount ofdata for pre-training. To this end, we propose SPT-Code, a sequence-to-sequencepre-trained model for source code. In order to pre-train SPT-Code in asequence-to-sequence manner and address the aforementioned weaknessesassociated with existing pre-training tasks, we introduce three pre-trainingtasks that are specifically designed to enable SPT-Code to learn knowledge ofsource code, the corresponding code structure, as well as a natural languagedescription of the code without relying on any bilingual corpus, and eventuallyexploit these three sources of information when it is applied to downstreamtasks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning. +Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. 
In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstreamt asks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning. From b50e9ac2503eae7e4e461f5a8aae4dd3d8fed723 Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 18 Nov 2022 20:27:47 +0000 Subject: [PATCH 034/114] Try to fix pipeline --- .github/workflows/deploy.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml index 251abc72..b6e8d907 100644 --- a/.github/workflows/deploy.yml +++ b/.github/workflows/deploy.yml @@ -23,7 +23,7 @@ jobs: architecture: x64 - name: Compute tSNE Embeddings run: | - python -m pip install transformers sklearn numpy + python -m pip install transformers scikit-learn numpy python -m pip install torch==1.10.0+cpu -f https://download.pytorch.org/whl/torch_stable.html python ${{ github.workspace }}/etc/compute_embeddings.py ${{ github.workspace }}/_site/paper-abstracts.json ${{ github.workspace }}/_site/tsne.json - name: Compute topics From b5ff87a900042c2afdf42bf2fffa6598a9b1c3d1 Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 03:53:01 -0600 Subject: [PATCH 035/114] Update and rename rabin2020generalizability.markdown to rabin2021generalizability.markdown --- _publications/rabin2020generalizability.markdown | 11 ----------- _publications/rabin2021generalizability.markdown | 12 ++++++++++++ 2 files changed, 12 insertions(+), 11 deletions(-) delete mode 100644 _publications/rabin2020generalizability.markdown create mode 100644 _publications/rabin2021generalizability.markdown diff --git a/_publications/rabin2020generalizability.markdown b/_publications/rabin2020generalizability.markdown deleted file mode 100644 index 2ec9ad4c..00000000 --- a/_publications/rabin2020generalizability.markdown +++ /dev/null @@ -1,11 +0,0 @@ ---- -layout: publication -title: "On the Generalizability of Neural Program Analyzers with respect to Semantic-Preserving Program Transformations" -authors: Md. Rafiqul Islam Rabin, Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour -conference: -year: 2020 -additional_links: - - {name: "ArXiV", url: "https://arxiv.org/abs/2008.01566"} -tags: ["adversarial", "GNN", "grammar"] ---- -With the prevalence of publicly available source code repositories to train deep neural network models, neural program analyzers can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analyzers. Although such analyzers have been tested on various existing datasets, the extent in which they generalize to unforeseen source code is largely unknown. 
Since it is impossible to test neural program analyzers on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program analyzers with respect to semantic-preserving transformations: a generalizable neural program analyzer should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. More specifically, we compare the results of various neural program analyzers for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and Gated Graph Neural Networks (GGNN), to build nine such neural program analyzers for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program analyzers often fail to generalize their performance. Our results also suggest that neural program analyzers based on data and control dependencies in programs generalize better than neural program analyzers based only on abstract syntax trees. On the positive side, we observe that as the size of training dataset grows and diversifies the generalizability of correct predictions produced by the analyzers can be improved too. diff --git a/_publications/rabin2021generalizability.markdown b/_publications/rabin2021generalizability.markdown new file mode 100644 index 00000000..533d7a62 --- /dev/null +++ b/_publications/rabin2021generalizability.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations" +authors: Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour +conference: IST +year: 2021 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.01566"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/tnpa-generalizability"} +tags: ["adversarial", "generalizability", "robustness", "transformation", "AST", "GNN"] +--- +With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. 
Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement. From b86214947257d7929fc8389a7e70066ad92abdcb Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 04:11:04 -0600 Subject: [PATCH 036/114] Update and rename rabin2021memorization.markdown to rabin2022memorization.markdown --- _publications/rabin2021memorization.markdown | 11 ----------- _publications/rabin2022memorization.markdown | 12 ++++++++++++ 2 files changed, 12 insertions(+), 11 deletions(-) delete mode 100644 _publications/rabin2021memorization.markdown create mode 100644 _publications/rabin2022memorization.markdown diff --git a/_publications/rabin2021memorization.markdown b/_publications/rabin2021memorization.markdown deleted file mode 100644 index 5dd7177a..00000000 --- a/_publications/rabin2021memorization.markdown +++ /dev/null @@ -1,11 +0,0 @@ ---- -layout: publication -title: "Memorization and Generalization in Neural Code Intelligence Models" -authors: Md Rafiqul Islam Rabin, Aftab Hussain, Vincent J. Hellendoorn, Mohammad Amin Alipour -conference: FSE -year: 2021 -additional_links: - - {name: "ArXiV", url: "https://arxiv.org/abs/2106.08704"} -tags: ["summarization"] ---- -Deep Neural Networks (DNN) are increasingly commonly used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, training DNNs means walking a knife's edges, because their large capacity also renders them prone to memorizing data points. While traditionally thought of as an aspect of over-training, recent work suggests that the memorization risk manifests especially strongly when the training datasets are noisy and memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as GitHub, which, due to their sheer size, cannot be manually inspected and evaluated. We evaluate the memorization and generalization tendencies in neural code intelligence models through a case study across several benchmarks and model families by leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset. In addition to reinforcing prior general findings about the extent of memorization in DNNs, our results shed light on the impact of noisy dataset in training. diff --git a/_publications/rabin2022memorization.markdown b/_publications/rabin2022memorization.markdown new file mode 100644 index 00000000..93c254f6 --- /dev/null +++ b/_publications/rabin2022memorization.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Memorization and Generalization in Neural Code Intelligence Models" +authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, Vincent J. 
Hellendoorn +conference: IST +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2106.08704"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/CI-Memorization"} +tags: ["memorization", "generalization", "noise", "capacity", "language model"] +--- +Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers. From 95b25fd94fd81d87ca54e23cb46578c8e08f743e Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 16:35:14 -0600 Subject: [PATCH 037/114] Created rabin2019testing.markdown --- _publications/rabin2019testing.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/rabin2019testing.markdown diff --git a/_publications/rabin2019testing.markdown b/_publications/rabin2019testing.markdown new file mode 100644 index 00000000..84b7c1b0 --- /dev/null +++ b/_publications/rabin2019testing.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Testing Neural Program Analyzers" +authors: Md Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour +conference: ASE (LBR-Track) +year: 2019 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/1908.10711"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/tnpa-framework"} +tags: ["testing", "robustness", "transformation"] +--- +Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. 
In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction. From 7b537b7167c66359782dec499c31cdfaba66d0e3 Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 16:48:48 -0600 Subject: [PATCH 038/114] Created rabin2020demystifying.markdown --- _publications/rabin2020demystifying.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/rabin2020demystifying.markdown diff --git a/_publications/rabin2020demystifying.markdown b/_publications/rabin2020demystifying.markdown new file mode 100644 index 00000000..e11374fc --- /dev/null +++ b/_publications/rabin2020demystifying.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Towards Demystifying Dimensions of Source Code Embeddings" +authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour +conference: RL+SE&PL (Co-located with ESEC/FSE) +year: 2020 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2008.13064"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/handcrafted-embeddings"} +tags: ["understanding", "embeddings", "features"] +--- +Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations. From 8fe037421cf179e3c97d85b3a7dee1fd5c855a45 Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 17:02:28 -0600 Subject: [PATCH 039/114] Create rabin2021understanding.markdown --- _publications/rabin2021understanding.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/rabin2021understanding.markdown diff --git a/_publications/rabin2021understanding.markdown b/_publications/rabin2021understanding.markdown new file mode 100644 index 00000000..83cd5730 --- /dev/null +++ b/_publications/rabin2021understanding.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Understanding Neural Code Intelligence Through Program Simplification" +authors: Md Rafiqul Islam Rabin, Vincent J. 
Hellendoorn, Mohammad Amin Alipour +conference: ESEC/FSE +year: 2021 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2106.03353"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/SIVAND"} +tags: ["understanding", "explainability", "interpretability", "transparency", "simplification", "reduction", "delta debugging", "attention", "features"] +--- +A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior. From 95052cce57b70f1b3cc3b4d46ebdc8949864cb20 Mon Sep 17 00:00:00 2001 From: "M.R.I. Rabin" Date: Sat, 19 Nov 2022 17:10:07 -0600 Subject: [PATCH 040/114] Create rabin2022understanding.markdown --- _publications/rabin2022understanding.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/rabin2022understanding.markdown diff --git a/_publications/rabin2022understanding.markdown b/_publications/rabin2022understanding.markdown new file mode 100644 index 00000000..0edfa88b --- /dev/null +++ b/_publications/rabin2022understanding.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models" +authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour +conference: MAPS +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2205.14374"} + - {name: "code", url: "https://github.com/mdrafiqulrabin/ci-dd-perses"} +tags: ["understanding", "explainability", "interpretability", "transparency", "simplification", "reduction", "delta debugging", "perses", "features", "adversarial"] +--- +Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. 
In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs. From 3e5f875f313d72b20d9fff8d2f17fa1fa121c130 Mon Sep 17 00:00:00 2001 From: mdrafiqulrabin Date: Sun, 20 Nov 2022 15:19:43 -0600 Subject: [PATCH 041/114] fix tags --- _publications/rabin2019testing.markdown | 2 +- _publications/rabin2020demystifying.markdown | 4 ++-- _publications/rabin2021generalizability.markdown | 2 +- _publications/rabin2021understanding.markdown | 2 +- _publications/rabin2022memorization.markdown | 2 +- _publications/rabin2022understanding.markdown | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/_publications/rabin2019testing.markdown b/_publications/rabin2019testing.markdown index 84b7c1b0..60a0bfb5 100644 --- a/_publications/rabin2019testing.markdown +++ b/_publications/rabin2019testing.markdown @@ -7,6 +7,6 @@ year: 2019 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/1908.10711"} - {name: "code", url: "https://github.com/mdrafiqulrabin/tnpa-framework"} -tags: ["testing", "robustness", "transformation"] +tags: ["evaluation", "refactoring"] --- Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction. diff --git a/_publications/rabin2020demystifying.markdown b/_publications/rabin2020demystifying.markdown index e11374fc..89ff6934 100644 --- a/_publications/rabin2020demystifying.markdown +++ b/_publications/rabin2020demystifying.markdown @@ -2,11 +2,11 @@ layout: publication title: "Towards Demystifying Dimensions of Source Code Embeddings" authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour -conference: RL+SE&PL (Co-located with ESEC/FSE) +conference: "RL+SE&PL (Co-located with ESEC/FSE)" year: 2020 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2008.13064"} - {name: "code", url: "https://github.com/mdrafiqulrabin/handcrafted-embeddings"} -tags: ["understanding", "embeddings", "features"] +tags: ["evaluation", "representation", "naming", "interpretability"] --- Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. 
In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations. diff --git a/_publications/rabin2021generalizability.markdown b/_publications/rabin2021generalizability.markdown index 533d7a62..df8f78e0 100644 --- a/_publications/rabin2021generalizability.markdown +++ b/_publications/rabin2021generalizability.markdown @@ -7,6 +7,6 @@ year: 2021 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2008.01566"} - {name: "code", url: "https://github.com/mdrafiqulrabin/tnpa-generalizability"} -tags: ["adversarial", "generalizability", "robustness", "transformation", "AST", "GNN"] +tags: ["evaluation", "adversarial", "generalizability", "refactoring", "summarization"] --- With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement. 
diff --git a/_publications/rabin2021understanding.markdown b/_publications/rabin2021understanding.markdown index 83cd5730..05455697 100644 --- a/_publications/rabin2021understanding.markdown +++ b/_publications/rabin2021understanding.markdown @@ -7,6 +7,6 @@ year: 2021 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2106.03353"} - {name: "code", url: "https://github.com/mdrafiqulrabin/SIVAND"} -tags: ["understanding", "explainability", "interpretability", "transparency", "simplification", "reduction", "delta debugging", "attention", "features"] +tags: ["interpretability", "refactoring", "information extraction"] --- A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior. diff --git a/_publications/rabin2022memorization.markdown b/_publications/rabin2022memorization.markdown index 93c254f6..b75d7827 100644 --- a/_publications/rabin2022memorization.markdown +++ b/_publications/rabin2022memorization.markdown @@ -7,6 +7,6 @@ year: 2022 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2106.08704"} - {name: "code", url: "https://github.com/mdrafiqulrabin/CI-Memorization"} -tags: ["memorization", "generalization", "noise", "capacity", "language model"] +tags: ["evaluation", "memorization", "generalizability", "refactoring", "language model"] --- Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. 
To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers. diff --git a/_publications/rabin2022understanding.markdown b/_publications/rabin2022understanding.markdown index 0edfa88b..d4879a84 100644 --- a/_publications/rabin2022understanding.markdown +++ b/_publications/rabin2022understanding.markdown @@ -7,6 +7,6 @@ year: 2022 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2205.14374"} - {name: "code", url: "https://github.com/mdrafiqulrabin/ci-dd-perses"} -tags: ["understanding", "explainability", "interpretability", "transparency", "simplification", "reduction", "delta debugging", "perses", "features", "adversarial"] +tags: ["interpretability", "refactoring", "adversarial"] --- Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs. 
From 4160bb79147fd58b777256409c98a4d90eb783e7 Mon Sep 17 00:00:00 2001 From: HaochenLi Date: Thu, 8 Dec 2022 10:20:08 +0800 Subject: [PATCH 042/114] Create li2022exploring.markdown --- _publications/li2022exploring.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/li2022exploring.markdown diff --git a/_publications/li2022exploring.markdown b/_publications/li2022exploring.markdown new file mode 100644 index 00000000..f185b730 --- /dev/null +++ b/_publications/li2022exploring.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: Exploring Representation-Level Augmentation for Code Search +authors: Haochen Li, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang, Hongyu Zhang, Yanlin Wang +conference: EMNLP +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2210.12285"} + - {name: "code", url: "https://github.com/Alex-HaochenLi/RACS"} +tags: ["search", "Transformer"] +--- +Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. 
\ No newline at end of file From 0f9546b9ea1880beebc978505c3960273a3fbb63 Mon Sep 17 00:00:00 2001 From: Goutham Ramakrishnan Date: Sun, 18 Dec 2022 21:49:13 -0800 Subject: [PATCH 043/114] Updated conference and year --- _publications/ramakrishnan2020backdoors.markdown | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/_publications/ramakrishnan2020backdoors.markdown b/_publications/ramakrishnan2020backdoors.markdown index f19bef94..35d4d059 100644 --- a/_publications/ramakrishnan2020backdoors.markdown +++ b/_publications/ramakrishnan2020backdoors.markdown @@ -2,9 +2,10 @@ layout: publication title: "Backdoors in Neural Models of Source Code" authors: Goutham Ramakrishnan, Aws Albarghouthi -conference: -year: 2020 +conference: ICPR +year: 2022 additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9956690"} - {name: "ArXiV", url: "https://arxiv.org/abs/2006.06841"} - {name: "Code", url: "https://github.com/goutham7r/backdoors-for-code"} tags: ["adversarial"] From a0e37a83aa81f74895160329586279300a622eef Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 27 Dec 2022 07:47:56 +0200 Subject: [PATCH 044/114] Add missing pubs. --- _publications/chen2022codet.markdown | 2 +- _publications/zlotchevski2022exploring.markdown | 11 +++++++++++ 2 files changed, 12 insertions(+), 1 deletion(-) create mode 100644 _publications/zlotchevski2022exploring.markdown diff --git a/_publications/chen2022codet.markdown b/_publications/chen2022codet.markdown index e7f4a5d2..446c6796 100644 --- a/_publications/chen2022codet.markdown +++ b/_publications/chen2022codet.markdown @@ -6,6 +6,6 @@ conference: year: 2022 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2207.10397"} -tags: ["synthesis"] +tags: ["synthesis", "Transformer", "execution"] --- Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results. 
diff --git a/_publications/zlotchevski2022exploring.markdown b/_publications/zlotchevski2022exploring.markdown new file mode 100644 index 00000000..5bd5d5fc --- /dev/null +++ b/_publications/zlotchevski2022exploring.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Exploring and Evaluating Personalized Models for Code Generation" +authors: Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, Michele Tufano +conference: FSE +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2208.13928"} +tags: ["Transformer"] +--- +Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios. From cceb6a3409dd26d357b7e0418933863df0de7963 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 27 Dec 2022 07:56:57 +0200 Subject: [PATCH 045/114] Add SantaCoder --- _publications/allal2022santacoder.markdown | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) create mode 100644 _publications/allal2022santacoder.markdown diff --git a/_publications/allal2022santacoder.markdown b/_publications/allal2022santacoder.markdown new file mode 100644 index 00000000..f2ba994c --- /dev/null +++ b/_publications/allal2022santacoder.markdown @@ -0,0 +1,19 @@ +--- +layout: publication +title: "SantaCoder: don’t reach for the stars!" 
+authors: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Terry Yue Zhuo, Francesco De Toni, Bernardo Garcia del Rio, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Michael Lappert, Ian Yu, Paulo Villegas, Jia Li, David Lansy, Huu Nguyen, Danish Contractor, Luis Villa, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Arjun Guha, Harm de Vries, Leandro von Werra
+conference:
+year: 2022
+tags: ["Transformer"]
+---
+The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII)
+redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java,
+JavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and
+evaluate the models on MultiPL-E (Cassano et al., 2022), a text2code
+benchmark available in 18 programming languages. We find that more
+aggressive filtering of near-duplicates can further boost performance and,
+surprisingly, that selecting files from repositories with 5+ GitHub stars
+deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and
+CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the
+Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL
+license at https://hf.co/bigcode

From a6f9148aec3c24bb7ee080798a4721e15e89c81a Mon Sep 17 00:00:00 2001
From: Miltos Allamanis
Date: Tue, 27 Dec 2022 07:59:02 +0200
Subject: [PATCH 046/114] Add JEMMA

---
 _publications/karmakar2022jemma.markdown | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 _publications/karmakar2022jemma.markdown

diff --git a/_publications/karmakar2022jemma.markdown b/_publications/karmakar2022jemma.markdown
new file mode 100644
index 00000000..4c270ff7
--- /dev/null
+++ b/_publications/karmakar2022jemma.markdown
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "JEMMA: An Extensible Java Dataset for ML4Code Applications"
+authors: Anjan Karmakar, Miltiadis Allamanis, Romain Robbes
+conference: EMSE
+year: 2022
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2212.09132"}
+tags: ["dataset"]
+---
+Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with. From 4b70e76b408df24927f730f1b1d4c8a6c39468df Mon Sep 17 00:00:00 2001 From: Rajaswa Patil Date: Wed, 18 Jan 2023 14:43:34 +0530 Subject: [PATCH 047/114] Added naik2022probing.markdown and updated contributor affiliation for Rajaswa Patil --- _publications/naik2022probing.markdown | 13 +++++++++++++ index.md | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) create mode 100644 _publications/naik2022probing.markdown diff --git a/_publications/naik2022probing.markdown b/_publications/naik2022probing.markdown new file mode 100644 index 00000000..7945b28b --- /dev/null +++ b/_publications/naik2022probing.markdown @@ -0,0 +1,13 @@ +--- +layout: publication +title: "Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis" +authors: Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths +conference: International Conference on Advanced Data Mining and Applications (ADMA 2022) +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2207.07706"} + - {name: "PDF", url: "https://link.springer.com/chapter/10.1007/978-3-031-22137-8_29"} + - {name: "Code", url: "https://github.com/shounaknaik/Probing-Semantic-Grounding-in-Language-Models-of-Code-with-Representational-Similarity-Analysis"} +tags: ["interpretability", "language model", "evaluation", "Transformer"] +--- +Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a little amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better semantic grounding and sample efficiency during semantic fine-tuning. Finally, our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code. diff --git a/index.md b/index.md index 58fbe2eb..44467cff 100644 --- a/index.md +++ b/index.md @@ -74,4 +74,4 @@ website. 
A comprehensive list can be found [here](https://github.com/ml4code/ml4 * [Uri Alon](http://www.cs.technion.ac.il/~urialon/) Technion, Israel * [Shaked Brody](https://shakedbr.cswp.cs.technion.ac.il/) Technion, Israel * [Nghi D. Q. Bui](https://bdqnghi.github.io/) Singapore Management University, Singapore -* [Rajaswa Patil](https://rajaswa.github.io/) TCS Research, India +* [Rajaswa Patil](https://rajaswa.github.io/) Microsoft PROSE From dd875d48bf9bd7fdb6af2ad5d413eb9a6a270438 Mon Sep 17 00:00:00 2001 From: Rajaswa Patil Date: Tue, 24 Jan 2023 16:26:37 +0530 Subject: [PATCH 048/114] Added patil2022exploring.markdown --- _publications/patil2022exploring.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/patil2022exploring.markdown diff --git a/_publications/patil2022exploring.markdown b/_publications/patil2022exploring.markdown new file mode 100644 index 00000000..be5a7c12 --- /dev/null +++ b/_publications/patil2022exploring.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing" +authors: Rajaswa Patil, Manasi Patwardhan, Shirish Karande, Lovekesh Vig, Gautam Shroff +conference: The 1st Transfer Learning for Natural Language Processing Workshop (TL4NLP 2022) +year: 2022 +additional_links: + - {name: "PDF", url: "https://proceedings.mlr.press/v203/patil23a.html"} + - {name: "Data", url: "https://github.com/ManasiPat/Spider-Gen"} +tags: ["dataset", "evaluation", "Transformer", "benchmark", "generalizability"] +--- +Existing work on generalization in Text-to-SQL semantic parsing has been restricted to a zero-shot cross-domain setting. In this paper, we introduce Spider-Gen: a Text-to-SQL benchmark to develop a paradigm of transfer learning across distinct dimensions of generalization in Text-to-SQL semantic parsing. The Spider-Gen benchmark focuses on few-shot adaption for Cross-domain, Lexical, and Structural generalization of Text-to-SQL models. Through our experiments with the Spider-Gen dataset, we show that Seq2Seq language models struggle to generalize against change in data distribution, lexical changes in database schema, and changes in SQL query complexity. Our experiments also reveal that performing few-shot fine-tuning helps Text-to-SQL models to generalize across these changes. However, such few-shot adaptation comes with a negative effect on the knowledge learnt during training. Hence, we also explore Parameter-efficient Fine-tuning methods to overcome the limitations of Seq2Seq Text-to-SQL models. We release the Spider-Gen dataset publicly to facilitate further research in generalization and transfer learning across various dimensions in Text-to-SQL semantic parsing. From 2fdcb27c13db239ea0405673ecbe16dad9ab366e Mon Sep 17 00:00:00 2001 From: Miltos Date: Sat, 28 Jan 2023 11:46:20 +0000 Subject: [PATCH 049/114] Add Chow et al. 
--- _publications/chow2023beware.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/chow2023beware.markdown diff --git a/_publications/chow2023beware.markdown b/_publications/chow2023beware.markdown new file mode 100644 index 00000000..11440e89 --- /dev/null +++ b/_publications/chow2023beware.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Beware of the Unexpected: Bimodal Taint Analysis" +authors: Yiu Wai Chow, Max Schäfer, Michael Pradel +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2301.10545"} +tags: ["static analysis"] +--- +Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter `name` of an API function `locale` ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter `command` passed to an API function `execaCommand` is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict from natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets. 
From d08f98392b2f1fc5bfdabb02e9dfa958edfe2c4d Mon Sep 17 00:00:00 2001 From: Miltos Date: Sat, 28 Jan 2023 14:49:48 +0000 Subject: [PATCH 050/114] fix --- _publications/chow2023beware.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/chow2023beware.markdown b/_publications/chow2023beware.markdown index 11440e89..dd246b6b 100644 --- a/_publications/chow2023beware.markdown +++ b/_publications/chow2023beware.markdown @@ -2,7 +2,7 @@ layout: publication title: "Beware of the Unexpected: Bimodal Taint Analysis" authors: Yiu Wai Chow, Max Schäfer, Michael Pradel -conference: +conference: ISSTA year: 2023 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2301.10545"} From 94994cf571a9a913c5574fc49576322152781de3 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 30 Jan 2023 10:17:52 +0000 Subject: [PATCH 051/114] Add CodeScore --- _publications/dong2023codescore.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/dong2023codescore.markdown diff --git a/_publications/dong2023codescore.markdown b/_publications/dong2023codescore.markdown new file mode 100644 index 00000000..331c5f9d --- /dev/null +++ b/_publications/dong2023codescore.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "CodeScore: Evaluating Code Generation by Learning Code Execution" +authors: Yihong Dong, Jiazheng Ding, Xue Jiang, Zhuo Li, Ge Li, Zhi Jin +conference: +year: 2023 +additional_links: +# - {name: "ArXiV", url: "https://arxiv.org/abs/2301.09043"} +tags: ["Transformer", "evaluation] +--- +A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1. 
From 9e1bbd2f322d1edefc7823d97eab63cf933da178 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 30 Jan 2023 10:18:03 +0000 Subject: [PATCH 052/114] fix --- _publications/dong2023codescore.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/dong2023codescore.markdown b/_publications/dong2023codescore.markdown index 331c5f9d..d1de46e5 100644 --- a/_publications/dong2023codescore.markdown +++ b/_publications/dong2023codescore.markdown @@ -5,7 +5,7 @@ authors: Yihong Dong, Jiazheng Ding, Xue Jiang, Zhuo Li, Ge Li, Zhi Jin conference: year: 2023 additional_links: -# - {name: "ArXiV", url: "https://arxiv.org/abs/2301.09043"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2301.09043"} tags: ["Transformer", "evaluation] --- A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1. From f01cabb7e5bfd808181c86cfd08a02584ee87df4 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 30 Jan 2023 10:43:46 +0000 Subject: [PATCH 053/114] Fix 2 --- _publications/dong2023codescore.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/dong2023codescore.markdown b/_publications/dong2023codescore.markdown index d1de46e5..f749e0fb 100644 --- a/_publications/dong2023codescore.markdown +++ b/_publications/dong2023codescore.markdown @@ -6,6 +6,6 @@ conference: year: 2023 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2301.09043"} -tags: ["Transformer", "evaluation] +tags: ["Transformer", "evaluation"] --- A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. 
The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1. From f557edf43dd31ce55fa13f26331f623c35f035e3 Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 8 Feb 2023 13:28:46 +0000 Subject: [PATCH 054/114] Add Souza et al. --- _publications/souza2023lexecutor.markdown | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 _publications/souza2023lexecutor.markdown diff --git a/_publications/souza2023lexecutor.markdown b/_publications/souza2023lexecutor.markdown new file mode 100644 index 00000000..9c7b98fb --- /dev/null +++ b/_publications/souza2023lexecutor.markdown @@ -0,0 +1,13 @@ +--- +layout: publication +title: "LExecutor: Learning-Guided Execution" +authors: Beatriz Souza, Michael Pradel +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2302.02343"} + - {name: "Dataset", url: "https://blah/blah"} +tags: ["execution"] +--- +Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%. 
+ From a7c55fd91322fadb982034f1f630206d1db454d9 Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 8 Feb 2023 13:29:37 +0000 Subject: [PATCH 055/114] fix --- _publications/souza2023lexecutor.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/souza2023lexecutor.markdown b/_publications/souza2023lexecutor.markdown index 9c7b98fb..1ad8eb1b 100644 --- a/_publications/souza2023lexecutor.markdown +++ b/_publications/souza2023lexecutor.markdown @@ -6,7 +6,7 @@ conference: year: 2023 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2302.02343"} - - {name: "Dataset", url: "https://blah/blah"} + - {name: "Code", url: "https://github.com/michaelpradel/LExecutor"} tags: ["execution"] --- Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%. From 9a3db21db3e40385f4d53313db9b7636d0bf2de1 Mon Sep 17 00:00:00 2001 From: Miltos Date: Thu, 16 Feb 2023 20:29:52 +0000 Subject: [PATCH 056/114] Add CodeBERTScore --- _publications/zhou2022codebertscore.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/zhou2022codebertscore.markdown diff --git a/_publications/zhou2022codebertscore.markdown b/_publications/zhou2022codebertscore.markdown new file mode 100644 index 00000000..86ea2486 --- /dev/null +++ b/_publications/zhou2022codebertscore.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code" +authors: Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2302.05527"} + - {name: "Code", url: "https://github.com/neulab/code-bert-score"} +tags: ["evaluation", "Transformer"] +--- +Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). 
Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub. From 82b340018dcdaab8bdab433d05cc7f42f3b6e8e4 Mon Sep 17 00:00:00 2001 From: Miltos Date: Tue, 7 Mar 2023 20:20:22 +0000 Subject: [PATCH 057/114] Add paper --- _publications/panthaplackel2022using.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/panthaplackel2022using.markdown diff --git a/_publications/panthaplackel2022using.markdown b/_publications/panthaplackel2022using.markdown new file mode 100644 index 00000000..1597adcc --- /dev/null +++ b/_publications/panthaplackel2022using.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Using Developer Discussions to Guide Fixing Bugs in Software" +authors: Sheena Panthaplackel, Milos Gligoric, Junyi Jessy Li, Raymond J. Mooney +conference: EMNLP +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2211.06335"} +tags: ["Transformer", "repair"] +--- +Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits. 
From bac9255eef2a23b5315827b6d9fec9be0b7b6188 Mon Sep 17 00:00:00 2001
From: Miltos
Date: Thu, 23 Mar 2023 20:48:36 +0000
Subject: [PATCH 058/114] Add TypeT5

---
 _publications/wei2023typet5.markdown | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 _publications/wei2023typet5.markdown

diff --git a/_publications/wei2023typet5.markdown b/_publications/wei2023typet5.markdown
new file mode 100644
index 00000000..03b7262a
--- /dev/null
+++ b/_publications/wei2023typet5.markdown
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "TypeT5: Seq2seq Type Inference using Static Analysis"
+authors: Jiayi Wei, Greg Durrett, Isil Dillig
+conference: ICLR
+year: 2023
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2303.09564"}
+tags: ["types", "Transformer"]
+---
+There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. While prior methods have achieved impressive accuracy when predicting the most common types, they often perform poorly on rare or complex types. In this paper, we present a new type inference method that treats type prediction as a code infilling task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model. We also propose an iterative decoding scheme that incorporates previous type predictions in the model's input context, allowing information exchange between related code elements. Our evaluation shows that the proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors -- while enabling easy user intervention.

From bb2e110f7a215f9c7eaecd72fb0bcd55b99599e3 Mon Sep 17 00:00:00 2001
From: Rishab Sharma
Date: Sun, 26 Mar 2023 01:40:07 -0700
Subject: [PATCH 059/114] add sharma ICPC 2022 papers

---
 _publications/sharma2022an.markdown     | 13 +++++++++++++
 _publications/sharma2022lamner.markdown | 13 +++++++++++++
 2 files changed, 26 insertions(+)
 create mode 100644 _publications/sharma2022an.markdown
 create mode 100644 _publications/sharma2022lamner.markdown

diff --git a/_publications/sharma2022an.markdown b/_publications/sharma2022an.markdown
new file mode 100644
index 00000000..0954a171
--- /dev/null
+++ b/_publications/sharma2022an.markdown
@@ -0,0 +1,13 @@
+---
+layout: publication
+title: "An Exploratory Study on Code Attention in BERT"
+authors: Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard, David Lo
+conference: ICPC
+year: 2022
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2204.10200"}
+  - {name: "code", url: "https://github.com/fardfh-lab/Code-Attention-BERT"}
+tags: ["Transformer", "representation", "language model", "interpretability", "pretraining", "clone"]
+---
+Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve state-of-the-art results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages.
There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers' embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21--24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance. + diff --git a/_publications/sharma2022lamner.markdown b/_publications/sharma2022lamner.markdown new file mode 100644 index 00000000..bc839cea --- /dev/null +++ b/_publications/sharma2022lamner.markdown @@ -0,0 +1,13 @@ +--- +layout: publication +title: "LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition" +authors: Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard +conference: ICPC +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2204.09654"} + - {name: "code", url: "https://github.com/fardfh-lab/LAMNER"} +tags: ["summarization", "documentation", "language model", "types", "representation"] +--- +Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2)The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. 
Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER’s code representation with the baseline models, and the fused models consistently showed improvement over the nonfused models. The human evaluation further shows that LAMNER produces high-quality code comments.
+

From 73d570a5f657ba3590f6890aea25e12c5d7f3470 Mon Sep 17 00:00:00 2001
From: rishab-32 <45006363+rishab-32@users.noreply.github.com>
Date: Sun, 26 Mar 2023 03:42:48 -0700
Subject: [PATCH 060/114] Rename sharma2022an.markdown to sharma2022exploratory.markdown

---
 .../{sharma2022an.markdown => sharma2022exploratory.markdown} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename _publications/{sharma2022an.markdown => sharma2022exploratory.markdown} (100%)

diff --git a/_publications/sharma2022an.markdown b/_publications/sharma2022exploratory.markdown
similarity index 100%
rename from _publications/sharma2022an.markdown
rename to _publications/sharma2022exploratory.markdown

From a65946aedaa63ec6a8120d9461e1dbd674ac7b4c Mon Sep 17 00:00:00 2001
From: "Sergey V. Kovalchuk"
Date: Tue, 4 Apr 2023 22:04:26 +0300
Subject: [PATCH 061/114] Create kovalchuk2022human.markdown

---
 _publications/kovalchuk2022human.markdown | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 _publications/kovalchuk2022human.markdown

diff --git a/_publications/kovalchuk2022human.markdown b/_publications/kovalchuk2022human.markdown
new file mode 100644
index 00000000..7bfd8a0f
--- /dev/null
+++ b/_publications/kovalchuk2022human.markdown
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Human perceiving behavior modeling in evaluation of code generation models"
+authors: S. Kovalchuk, V. Lomshakov, A. Aliev
+conference: GEM
+year: 2022
+additional_links:
+  - {name: "ACLAnthology", url: "https://aclanthology.org/2022.gem-1.24/"}
+tags: ["code generation", "evaluation", "human evaluation"]
+---
+Within this study, we evaluated a series of code generation models based on CodeGen and GPTNeo to compare the metric-based performance and human evaluation. For a deeper analysis of human perceiving within the evaluation procedure we’ve implemented a 5-level Likert scale assessment of the model output using a perceiving model based on the Theory of Planned Behavior (TPB). Through such analysis, we showed an extension of model assessment as well as a deeper understanding of the quality and applicability of generated code for practical question answering. The approach was evaluated with several model settings in order to assess diversity in quality and style of answer. With the TPB-based model, we showed a different level of perceiving the model result, namely personal understanding, agreement level, and readiness to use the particular code. With such analysis, we investigate a series of issues in code generation as natural language generation (NLG) problems observed in a practical context of programming question-answering with code.
\ No newline at end of file

From 555a6c1e64b13b5e4e7f7f34516fe30f8bab6907 Mon Sep 17 00:00:00 2001
From: ist1373
Date: Wed, 12 Apr 2023 16:32:35 -0700
Subject: [PATCH 062/114] Create saberi2023model.markdown

---
 _publications/saberi2023model.markdown | 11 +++++++++++
 1 file changed, 11 insertions(+)
 create mode 100644 _publications/saberi2023model.markdown

diff --git a/_publications/saberi2023model.markdown b/_publications/saberi2023model.markdown
new file mode 100644
index 00000000..a4570414
--- /dev/null
+++ b/_publications/saberi2023model.markdown
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models"
+authors: Iman Saberi, Fateme H. Fard
+conference: MSR
+year: 2023
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/pdf/2303.06233"}
+tags: ["Adapters", "Pre-trained Programming Language", "Code Refinement", "Code Summarization"]
+---
+Pre-trained Programming Language Models (PPLMs) achieved many recent state-of-the-art results for many code-related software engineering tasks. Though some studies use data flow or propose tree-based models that utilize Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code. Still, the input is considered a sequence of tokens. There are two issues; the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already used for pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed work in CodeBERT, building CodeBERTER, and evaluate the performance on two tasks of code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of training parameter budget compared to the fully fine-tuning approach, and the BLEU score of code summarization from 14.75 to 15.90 while reducing 77% of training parameters compared to the fully fine-tuning approach.

From 0c85aaab2332f9f3039a3f1a073b2bb819049c67 Mon Sep 17 00:00:00 2001
From: Miltos
Date: Wed, 19 Apr 2023 16:23:42 +0100
Subject: [PATCH 063/114] Add DiverseVul

---
 _publications/chen2023diversevul.markdown | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 _publications/chen2023diversevul.markdown

diff --git a/_publications/chen2023diversevul.markdown b/_publications/chen2023diversevul.markdown
new file mode 100644
index 00000000..274da617
--- /dev/null
+++ b/_publications/chen2023diversevul.markdown
@@ -0,0 +1,13 @@
+---
+layout: publication
+title: "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection"
+authors: Yizheng Chen, Zhoujie Ding, Xinyun Chen, David Wagner
+conference:
+year: 2023
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2304.00409"}
+tags: ["dataset", "Transformer", "vulnerability"]
+---
+We propose and release a new vulnerable source code dataset.
We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. +Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. +However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance. From 70e0c842623de7fcb8a8a0251f7a3c987be76554 Mon Sep 17 00:00:00 2001 From: Miltos Date: Mon, 24 Apr 2023 13:54:02 +0100 Subject: [PATCH 064/114] Add Ahmed et al. --- _publications/ahmed2033improving.markdown | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 _publications/ahmed2033improving.markdown diff --git a/_publications/ahmed2033improving.markdown b/_publications/ahmed2033improving.markdown new file mode 100644 index 00000000..1f55b183 --- /dev/null +++ b/_publications/ahmed2033improving.markdown @@ -0,0 +1,17 @@ +--- +layout: publication +title: "Improving Few-Shot Prompts with Relevant Static Analysis Products" +authors: Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, Earl T. Barr +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2304.06815"} +tags: ["summarization", "Transformer"] +--- +Large Language Models (LLM) are a new class of computation engines, "programmed" via prompt engineering. We are still learning how to best "program" these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantics facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc. + +One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of "code analysis" and extracting such information, implicitly, while processing code: but are they, really? If they aren't, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task and evaluate whether automatically augmenting an LLM's prompt with semantic facts explicitly, actually helps. 
Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization.
+
+We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.

From d188b54181eee50d1b5232071f679459f594e229 Mon Sep 17 00:00:00 2001
From: Miltos
Date: Mon, 24 Apr 2023 14:17:44 +0100
Subject: [PATCH 065/114] Add Jesse and tag maintenance

---
 _publications/jesse2023large.markdown  | 11 +++++++++++
 _publications/saberi2023model.markdown |  4 ++--
 2 files changed, 13 insertions(+), 2 deletions(-)
 create mode 100644 _publications/jesse2023large.markdown

diff --git a/_publications/jesse2023large.markdown b/_publications/jesse2023large.markdown
new file mode 100644
index 00000000..5c953d22
--- /dev/null
+++ b/_publications/jesse2023large.markdown
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Large Language Models and Simple, Stupid Bugs"
+authors: Kevin Jesse, Toufique Ahmed, Premkumar T. Devanbu, Emily Morgan
+conference:
+year: 2023
+additional_links:
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2303.11455"}
+tags: ["Transformer", "defect"]
+---
+With the advent of powerful neural language models, AI-based systems to assist developers in coding tasks are becoming widely available; Copilot is one such system. Copilot uses Codex, a large language model (LLM), to complete code conditioned on a preceding "prompt". Codex, however, is trained on public GitHub repositories, viz., on code that may include bugs and vulnerabilities. Previous studies [1], [2] show Codex reproduces vulnerabilities seen in training. In this study, we examine how prone Codex is to generate an interesting bug category, single statement bugs, commonly referred to as simple, stupid bugs or SStuBs in the MSR community. We find that Codex and similar LLMs do help avoid some SStuBs, but do produce known, verbatim SStuBs as much as 2x as likely than known, verbatim correct code. We explore the consequences of the Codex generated SStuBs and propose avoidance strategies that suggest the possibility of reducing the production of known, verbatim SStubs, and increase the possibility of producing known, verbatim fixes.
diff --git a/_publications/saberi2023model.markdown b/_publications/saberi2023model.markdown
index a4570414..7dcdc632 100644
--- a/_publications/saberi2023model.markdown
+++ b/_publications/saberi2023model.markdown
@@ -5,7 +5,7 @@ authors: Iman Saberi, Fateme H. Fard
 conference: MSR
 year: 2023
 additional_links:
-  - {name: "ArXiV", url: "https://arxiv.org/pdf/2303.06233"}
-tags: ["Adapters", "Pre-trained Programming Language", "Code Refinement", "Code Summarization"]
+  - {name: "ArXiV", url: "https://arxiv.org/abs/2303.06233"}
+tags: ["Transformer", "repair", "summarization"]
 ---
 Pre-trained Programming Language Models (PPLMs) achieved many recent state-of-the-art results for many code-related software engineering tasks.
Though some studies use data flow or propose tree-based models that utilize Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code. Still, the input is considered a sequence of tokens. There are two issues; the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already used for pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed work in CodeBERT, building CodeBERTER, and evaluate the performance on two tasks of code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of training parameter budget compared to the fully fine-tuning approach, and the BLEU score of code summarization from 14.75 to 15.90 while reducing 77% of training parameters compared to the fully fine-tuning approach. From 2c433a9415477498cfa3bc2f0e643e9091801e2c Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 26 Apr 2023 13:44:24 +0100 Subject: [PATCH 066/114] Add RepoCoder --- _publications/zhang2023repocoder.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/zhang2023repocoder.markdown diff --git a/_publications/zhang2023repocoder.markdown b/_publications/zhang2023repocoder.markdown new file mode 100644 index 00000000..5de5ff42 --- /dev/null +++ b/_publications/zhang2023repocoder.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation" +authors: Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2303.12570"} + - {name: "Code", url: "https://github.com/microsoft/CodeT/tree/main/RepoCoder"} +tags: ["completion", "Transformer", "retrieval"] +--- +The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. 
We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. From 6af6fc2bde7dd0788da9137bef9f05bdb6b35f37 Mon Sep 17 00:00:00 2001 From: Alex Bezzubov Date: Thu, 27 Apr 2023 17:07:46 +0200 Subject: [PATCH 067/114] Add JetBrains completion ranking --- _publications/bibaev2022all.markdown | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 _publications/bibaev2022all.markdown diff --git a/_publications/bibaev2022all.markdown b/_publications/bibaev2022all.markdown new file mode 100644 index 00000000..b1d3ed73 --- /dev/null +++ b/_publications/bibaev2022all.markdown @@ -0,0 +1,18 @@ +--- +layout: publication +title: "All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs" +authors: Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, Timofey Bryksin +conference: ESEC/FSE +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2205.10692"} +tags: ["autocomplete"] +--- +We propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. +We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. +We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. +Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. +Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. +Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. +The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client's side. +Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. 
\ No newline at end of file From 662aa3cee1de42c914554543456da90cce297ec1 Mon Sep 17 00:00:00 2001 From: Miltos Date: Wed, 3 May 2023 10:09:16 +0100 Subject: [PATCH 068/114] Add tracefixer --- _publications/bouzenia2023tracefixer.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/bouzenia2023tracefixer.markdown diff --git a/_publications/bouzenia2023tracefixer.markdown b/_publications/bouzenia2023tracefixer.markdown new file mode 100644 index 00000000..26b08036 --- /dev/null +++ b/_publications/bouzenia2023tracefixer.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "TraceFixer: Execution Trace-Driven Program Repair" +authors: Islem Bouzenia, Yangruibo Ding, Kexin Pei, Baishakhi Ray, Michael Pradel +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2304.12743"} +tags: ["Transformer", "repair", "dynamic"] +--- +When debugging unintended program behavior, developers can often identify the point in the execution where the actual behavior diverges from the desired behavior. For example, a variable may get assigned a wrong value, which then negatively influences the remaining computation. Once a developer identifies such a divergence, how to fix the code so that it provides the desired behavior? This paper presents TraceFixer, a technique for predicting how to edit source code so that it does not diverge from the expected behavior anymore. The key idea is to train a neural program repair model that not only learns from source code edits but also exploits excerpts of runtime traces. The input to the model is a partial execution trace of the incorrect code, which can be obtained automatically through code instrumentation, and the correct state that the program should reach at the divergence point, which the user provides, e.g., in an interactive debugger. Our approach fundamentally differs from current program repair techniques, which share a similar goal but exploit neither execution traces nor information about the desired program state. We evaluate TraceFixer on single-line mistakes in Python code. After training the model on hundreds of thousands of code edits created by a neural model that mimics real-world bugs, we find that exploiting execution traces improves the bug-fixing ability by 13% to 20% (depending on the dataset, within the top-10 predictions) compared to a baseline that learns from source code edits only. Applying TraceFixer to 20 real-world Python bugs shows that the approach successfully fixes 10 of them. 
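An aside on the TraceFixer entry above: its abstract assumes that a partial execution trace of the buggy code can be collected automatically through instrumentation. The sketch below is only an illustration of that idea, not the paper's tooling: it records line numbers and local variables for a single Python function using the standard `sys.settrace` hook, and the `buggy_mean` function with its off-by-one divergence point is a made-up example.

```python
import sys

def capture_trace(fn, *args):
    """Run fn(*args) and record (line number, local variables) at each executed line."""
    trace = []

    def tracer(frame, event, arg):
        # Only record line events that belong to the traced function itself.
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, trace

def buggy_mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) + 1)  # hypothetical bug: the divisor is off by one

result, trace = capture_trace(buggy_mean, [2, 4, 6])
print(result)  # 3.0, although the intended mean is 4.0
for lineno, local_vars in trace:
    print(lineno, local_vars)  # the (line, state) pairs a repair model could consume
```

A trace like this, together with the state the developer expects at the divergence point (here, a divisor of 3 rather than 4), is the kind of input the abstract describes.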
From 0686b5bee87454a8fb2abf1348afdb728ac637ec Mon Sep 17 00:00:00 2001 From: Miltos Date: Thu, 4 May 2023 16:13:18 +0100 Subject: [PATCH 069/114] Add CodeGen2 --- _publications/nijkamp2023codegen2.markdown | 15 +++++++++++++++ 1 file changed, 15 insertions(+) create mode 100644 _publications/nijkamp2023codegen2.markdown diff --git a/_publications/nijkamp2023codegen2.markdown b/_publications/nijkamp2023codegen2.markdown new file mode 100644 index 00000000..ab8f7e4f --- /dev/null +++ b/_publications/nijkamp2023codegen2.markdown @@ -0,0 +1,15 @@ +--- +layout: publication +title: "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages" +authors: Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2305.02309"} +tags: ["Transformer"] +--- +Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. + +In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution of programming and natural languages on model performance is explored. + +We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen2 From 9f2a24ac5df8257f93d6fd1c2a81d3085f7d6daa Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sun, 7 May 2023 20:51:32 +0100 Subject: [PATCH 070/114] Add small utility and paper. --- _publications/add_from_arxiv.py | 60 ++++++++++++++++++++++++++++++ _publications/liu2023code.markdown | 12 ++++++ 2 files changed, 72 insertions(+) create mode 100644 _publications/add_from_arxiv.py create mode 100644 _publications/liu2023code.markdown diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py new file mode 100644 index 00000000..f0a01216 --- /dev/null +++ b/_publications/add_from_arxiv.py @@ -0,0 +1,60 @@ +#!/bin/python3 + +import argparse +import arxiv +import re +import os +import textwrap + + +def _first_non_stopword(title: str) -> str: + for word in re.split("\W", title.lower()): + if word in ("a", "an", "the", "is", "are", "what", "who", "your"): + continue + return word + raise ValueError(f'The title seems to have only stopwords! 
"{title}"') + + +def _author_lastname(author_name: str) -> str: + return author_name.split(" ")[-1].lower() + + +def get_info(paper_id: str, out_dir: str) -> None: + search = arxiv.Search(id_list=[paper_id]) + paper = next(search.results()) + + summary = ( + paper.summary.replace("\n\n", "@@--@@") + .replace("\n", " ") + .replace("@@--@@", "\n\n") + ) + + tmpl = textwrap.dedent( + f""" + --- + layout: publication + title: "{paper.title}" + authors: {", ".join(a.name for a in paper.authors)} + conference: + year: {paper.published.year} + additional_links: + - {{name: "ArXiV", url: "https://arxiv.org/abs/%7Bpaper_id%7D"}} + tags: ["TODO"] + --- + {summary} + """ + ) + + filename = f"{_author_lastname(paper.authors[0].name)}{paper.published.year}{_first_non_stopword(paper.title)}.markdown" + with open(os.path.join(out_dir, filename), "w") as f: + f.write(tmpl) + + print(f'Output at: {filename}') + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("paper_id", help="The id of the paper to retrieve.") + parser.add_argument("out_path", help="The path to output the file.") + args = parser.parse_args() + + get_info(args.paper_id, args.out_path) diff --git a/_publications/liu2023code.markdown b/_publications/liu2023code.markdown new file mode 100644 index 00000000..946018ca --- /dev/null +++ b/_publications/liu2023code.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" +authors: Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2305.01210"} +tags: ["evaluation"] +--- +Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code according to user intent written in natural language. Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, test-cases in these datasets can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis benchmarking framework to rigorously evaluate the functional correctness of LLM-synthesized code. In short, EvalPlus takes in the base evaluation dataset and uses an automatic input generation step to produce and diversify large amounts of new test inputs using both LLM-based and mutation-based input generators to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additionally generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average! Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis but also opens up a new direction to improve programming benchmarks through automated test input generation. 
From edb4eb569981400d345e30c5a402765a2cfbd2e7 Mon Sep 17 00:00:00 2001 From: Alex Bezzubov Date: Sun, 30 Apr 2023 18:07:21 +0200 Subject: [PATCH 071/114] tsne vis: change the model & embeddings Use smaller model that is fast and proived a better quality 'all-MiniLM-L6-v2' from https://www.sbert.net/docs/pretrained_models.html Use title as well as abstract for paper embeddings. Encode & avg. in batches. --- etc/compute_embeddings.py | 29 ++++++++++++++++++++--------- 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/etc/compute_embeddings.py b/etc/compute_embeddings.py index 1e0c8da8..950a8311 100644 --- a/etc/compute_embeddings.py +++ b/etc/compute_embeddings.py @@ -3,6 +3,7 @@ import numpy as np import torch +import torch.nn.functional as F import sklearn.manifold import transformers @@ -13,13 +14,20 @@ def parse_arguments(): parser.add_argument("json", default=False, help="the path the json containing all papers.") parser.add_argument("outpath", default=False, help="the target path of the visualizations papers.") parser.add_argument("--seed", default=0, help="The seed for TSNE.", type=int) + parser.add_argument("--model", default='sentence-transformers/all-MiniLM-L6-v2', help="Name of the HF model") + return parser.parse_args() +def mean_pooling(token_embeddings, attention_mask): + """ Mean Pooling, takes attention mask into account for correct averaging""" + input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() + return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) + if __name__ == "__main__": args = parse_arguments() - tokenizer = transformers.AutoTokenizer.from_pretrained("deepset/sentence_bert") - model = transformers.AutoModel.from_pretrained("deepset/sentence_bert") + tokenizer = transformers.AutoTokenizer.from_pretrained(args.model) + model = transformers.AutoModel.from_pretrained(args.model) model.eval() with open(args.json) as f: @@ -27,16 +35,19 @@ def parse_arguments(): print(f"Num papers: {len(data)}") - all_embeddings = [] + corpus = [] for paper_info in data: - with torch.no_grad(): - token_ids = torch.tensor([tokenizer.encode(paper_info["abstract"])][:512]) - hidden_states, _ = model(token_ids)[-2:] - all_embeddings.append(hidden_states.mean(0).mean(0).numpy()) + corpus.append(tokenizer.sep_token.join([paper_info['title'], paper_info['abstract']])) + + encoded_corpus = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt') + with torch.no_grad(): + hidden_states = model(**encoded_corpus).last_hidden_state + + corpus_embeddings = mean_pooling(hidden_states, encoded_corpus['attention_mask']) + corpus_embeddings = F.normalize(corpus_embeddings, p=2, dim=1) np.random.seed(args.seed) - all_embeddings = np.array(all_embeddings) - out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(all_embeddings) + out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(corpus_embeddings) for i, paper_info in enumerate(data): paper_info['tsne_embedding'] = out[i].tolist() From 19c91fcde07349f755471f88f670c69c7cb21b58 Mon Sep 17 00:00:00 2001 From: Miltos Date: Sun, 14 May 2023 22:20:04 +0100 Subject: [PATCH 072/114] Revert "tsne vis: change the model & embeddings" This reverts commit edb4eb569981400d345e30c5a402765a2cfbd2e7. 
--- etc/compute_embeddings.py | 29 +++++++++-------------------- 1 file changed, 9 insertions(+), 20 deletions(-) diff --git a/etc/compute_embeddings.py b/etc/compute_embeddings.py index 950a8311..1e0c8da8 100644 --- a/etc/compute_embeddings.py +++ b/etc/compute_embeddings.py @@ -3,7 +3,6 @@ import numpy as np import torch -import torch.nn.functional as F import sklearn.manifold import transformers @@ -14,20 +13,13 @@ def parse_arguments(): parser.add_argument("json", default=False, help="the path the json containing all papers.") parser.add_argument("outpath", default=False, help="the target path of the visualizations papers.") parser.add_argument("--seed", default=0, help="The seed for TSNE.", type=int) - parser.add_argument("--model", default='sentence-transformers/all-MiniLM-L6-v2', help="Name of the HF model") - return parser.parse_args() -def mean_pooling(token_embeddings, attention_mask): - """ Mean Pooling, takes attention mask into account for correct averaging""" - input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() - return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) - if __name__ == "__main__": args = parse_arguments() - tokenizer = transformers.AutoTokenizer.from_pretrained(args.model) - model = transformers.AutoModel.from_pretrained(args.model) + tokenizer = transformers.AutoTokenizer.from_pretrained("deepset/sentence_bert") + model = transformers.AutoModel.from_pretrained("deepset/sentence_bert") model.eval() with open(args.json) as f: @@ -35,19 +27,16 @@ def mean_pooling(token_embeddings, attention_mask): print(f"Num papers: {len(data)}") - corpus = [] + all_embeddings = [] for paper_info in data: - corpus.append(tokenizer.sep_token.join([paper_info['title'], paper_info['abstract']])) - - encoded_corpus = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt') - with torch.no_grad(): - hidden_states = model(**encoded_corpus).last_hidden_state - - corpus_embeddings = mean_pooling(hidden_states, encoded_corpus['attention_mask']) - corpus_embeddings = F.normalize(corpus_embeddings, p=2, dim=1) + with torch.no_grad(): + token_ids = torch.tensor([tokenizer.encode(paper_info["abstract"])][:512]) + hidden_states, _ = model(token_ids)[-2:] + all_embeddings.append(hidden_states.mean(0).mean(0).numpy()) np.random.seed(args.seed) - out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(corpus_embeddings) + all_embeddings = np.array(all_embeddings) + out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(all_embeddings) for i, paper_info in enumerate(data): paper_info['tsne_embedding'] = out[i].tolist() From 7852ad60c988c28403c2cbea2c77255803f5af5c Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 16 May 2023 09:09:39 +0100 Subject: [PATCH 073/114] Add papers. 
--- _publications/add_from_arxiv.py | 3 ++- _publications/li2023starcoder.markdown | 12 ++++++++++++ _publications/wang2023codet5.markdown | 12 ++++++++++++ _publications/yin2022natural.markdown | 12 ++++++++++++ 4 files changed, 38 insertions(+), 1 deletion(-) create mode 100644 _publications/li2023starcoder.markdown create mode 100644 _publications/wang2023codet5.markdown create mode 100644 _publications/yin2022natural.markdown diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py index f0a01216..c9cfde73 100644 --- a/_publications/add_from_arxiv.py +++ b/_publications/add_from_arxiv.py @@ -49,7 +49,8 @@ def get_info(paper_id: str, out_dir: str) -> None: with open(os.path.join(out_dir, filename), "w") as f: f.write(tmpl) - print(f'Output at: {filename}') + print(f"Output at: {filename}") + if __name__ == "__main__": parser = argparse.ArgumentParser() diff --git a/_publications/li2023starcoder.markdown b/_publications/li2023starcoder.markdown new file mode 100644 index 00000000..90474f19 --- /dev/null +++ b/_publications/li2023starcoder.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "StarCoder: may the source be with you!" +authors: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2305.06161"} +tags: ["Transformer"] +--- +The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. 
We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license. diff --git a/_publications/wang2023codet5.markdown b/_publications/wang2023codet5.markdown new file mode 100644 index 00000000..1c4abb27 --- /dev/null +++ b/_publications/wang2023codet5.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "CodeT5+: Open Code Large Language Models for Code Understanding and Generation" +authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2305.07922"} +tags: ["Transformer"] +--- +Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.
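Since both the StarCoder and CodeT5+ entries above describe openly released checkpoints, a minimal usage sketch may help readers connect the abstracts to code: the snippet below loads a small CodeT5+ model with Hugging Face `transformers` and continues a code prefix. The checkpoint name `Salesforce/codet5p-220m` and the prompt are assumptions for illustration, not something specified in these patches.

```python
# A minimal sketch, assuming the small publicly released CodeT5+ checkpoint
# "Salesforce/codet5p-220m"; swap in another checkpoint if it is unavailable.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Encoder-decoder generation: feed a code prefix and decode a continuation.
inputs = tokenizer("def print_hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same tokenizer-model-generate pattern applies to the other open models mentioned in these patches; only the checkpoint name and model class change.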
diff --git a/_publications/yin2022natural.markdown b/_publications/yin2022natural.markdown new file mode 100644 index 00000000..bd44ea68 --- /dev/null +++ b/_publications/yin2022natural.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Natural Language to Code Generation in Interactive Data Science Notebooks" +authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton +conference: +year: 2022 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2212.09248"} +tags: ["notebook", "evaluation"] +--- +Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions. From 41523e38f24323bcbc20ccd03f107f8142b69d47 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 16 May 2023 15:44:54 +0100 Subject: [PATCH 074/114] Add Liu et al. (2023) --- _publications/liu2023code.markdown | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/_publications/liu2023code.markdown b/_publications/liu2023code.markdown index 946018ca..1600487d 100644 --- a/_publications/liu2023code.markdown +++ b/_publications/liu2023code.markdown @@ -1,12 +1,12 @@ --- layout: publication -title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" -authors: Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang +title: "Code Execution with Pre-trained Language Models" +authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2305.01210"} -tags: ["evaluation"] +- {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"} +tags: ["execution", "dynamic"] --- -Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code according to user intent written in natural language. Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, test-cases in these datasets can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? 
To answer this, we propose EvalPlus -- a code synthesis benchmarking framework to rigorously evaluate the functional correctness of LLM-synthesized code. In short, EvalPlus takes in the base evaluation dataset and uses an automatic input generation step to produce and diversify large amounts of new test inputs using both LLM-based and mutation-based input generators to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additionally generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average! Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis but also opens up a new direction to improve programming benchmarks through automated test input generation. +Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution. From da84eeba727afaebb014b5a0bd4b0d3596e60d5c Mon Sep 17 00:00:00 2001 From: Alex Bezzubov Date: Sun, 30 Apr 2023 18:07:21 +0200 Subject: [PATCH 075/114] tsne vis: change the model & embeddings Use smaller model that is fast and proived a better quality 'all-MiniLM-L6-v2' from https://www.sbert.net/docs/pretrained_models.html Use title as well as abstract for paper embeddings. Encode & avg. in batches. 
--- etc/compute_embeddings.py | 29 ++++++++++++++++++++--------- 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/etc/compute_embeddings.py b/etc/compute_embeddings.py index 1e0c8da8..950a8311 100644 --- a/etc/compute_embeddings.py +++ b/etc/compute_embeddings.py @@ -3,6 +3,7 @@ import numpy as np import torch +import torch.nn.functional as F import sklearn.manifold import transformers @@ -13,13 +14,20 @@ def parse_arguments(): parser.add_argument("json", default=False, help="the path the json containing all papers.") parser.add_argument("outpath", default=False, help="the target path of the visualizations papers.") parser.add_argument("--seed", default=0, help="The seed for TSNE.", type=int) + parser.add_argument("--model", default='sentence-transformers/all-MiniLM-L6-v2', help="Name of the HF model") + return parser.parse_args() +def mean_pooling(token_embeddings, attention_mask): + """ Mean Pooling, takes attention mask into account for correct averaging""" + input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() + return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) + if __name__ == "__main__": args = parse_arguments() - tokenizer = transformers.AutoTokenizer.from_pretrained("deepset/sentence_bert") - model = transformers.AutoModel.from_pretrained("deepset/sentence_bert") + tokenizer = transformers.AutoTokenizer.from_pretrained(args.model) + model = transformers.AutoModel.from_pretrained(args.model) model.eval() with open(args.json) as f: @@ -27,16 +35,19 @@ def parse_arguments(): print(f"Num papers: {len(data)}") - all_embeddings = [] + corpus = [] for paper_info in data: - with torch.no_grad(): - token_ids = torch.tensor([tokenizer.encode(paper_info["abstract"])][:512]) - hidden_states, _ = model(token_ids)[-2:] - all_embeddings.append(hidden_states.mean(0).mean(0).numpy()) + corpus.append(tokenizer.sep_token.join([paper_info['title'], paper_info['abstract']])) + + encoded_corpus = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt') + with torch.no_grad(): + hidden_states = model(**encoded_corpus).last_hidden_state + + corpus_embeddings = mean_pooling(hidden_states, encoded_corpus['attention_mask']) + corpus_embeddings = F.normalize(corpus_embeddings, p=2, dim=1) np.random.seed(args.seed) - all_embeddings = np.array(all_embeddings) - out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(all_embeddings) + out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(corpus_embeddings) for i, paper_info in enumerate(data): paper_info['tsne_embedding'] = out[i].tolist() From caf2b61729dd6195909c5aab6466958087e95a67 Mon Sep 17 00:00:00 2001 From: Alex Bezzubov Date: Sun, 28 May 2023 15:14:52 +0200 Subject: [PATCH 076/114] tsne vis: batch_size=4 & cli arg for TF Projector format --- etc/compute_embeddings.py | 43 +++++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/etc/compute_embeddings.py b/etc/compute_embeddings.py index 950a8311..43f0ba7c 100644 --- a/etc/compute_embeddings.py +++ b/etc/compute_embeddings.py @@ -1,5 +1,7 @@ import argparse import json +from timeit import default_timer as timer +from datetime import date import numpy as np import torch @@ -14,7 +16,8 @@ def parse_arguments(): parser.add_argument("json", default=False, help="the path the json containing all papers.") parser.add_argument("outpath", default=False, help="the target path of the visualizations 
papers.") parser.add_argument("--seed", default=0, help="The seed for TSNE.", type=int) - parser.add_argument("--model", default='sentence-transformers/all-MiniLM-L6-v2', help="Name of the HF model") + parser.add_argument("--model", default='sentence-transformers/all-MiniLM-L6-v2', help="The name of the HF model") + parser.add_argument("--save_emb", action='https://github.com/store_true', help="Save embeddings in CSV for Tensorboard Projector") return parser.parse_args() @@ -23,9 +26,7 @@ def mean_pooling(token_embeddings, attention_mask): input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) - -if __name__ == "__main__": - args = parse_arguments() +def main(args): tokenizer = transformers.AutoTokenizer.from_pretrained(args.model) model = transformers.AutoModel.from_pretrained(args.model) model.eval() @@ -39,18 +40,38 @@ def mean_pooling(token_embeddings, attention_mask): for paper_info in data: corpus.append(tokenizer.sep_token.join([paper_info['title'], paper_info['abstract']])) - encoded_corpus = tokenizer(corpus, padding=True, truncation=True, return_tensors='pt') - with torch.no_grad(): - hidden_states = model(**encoded_corpus).last_hidden_state - - corpus_embeddings = mean_pooling(hidden_states, encoded_corpus['attention_mask']) - corpus_embeddings = F.normalize(corpus_embeddings, p=2, dim=1) + batch_size = 4 + all_embeddings=[] + start = timer() + for i in range(0, len(corpus), batch_size): + encoded_batch = tokenizer(corpus[i:min(i+batch_size, len(corpus))], padding=True, truncation=True, return_tensors='pt') + with torch.no_grad(): + hidden_state = model(**encoded_batch).last_hidden_state + all_embeddings.append(mean_pooling(hidden_state, encoded_batch['attention_mask'])) + + all_embeddings = torch.cat(all_embeddings, dim=0) + all_embeddings = F.normalize(all_embeddings, p=2, dim=1) + print(f"elapsed {timer()-start:.1f}s") + + if args.save_emb: + filename = f"{args.model.replace('/', '_')}-{date.today().strftime('%d.%m.%y')}" + np.savetxt(f"{filename}-emb.tsv", all_embeddings, delimiter="\t") + import csv + with open(f"{filename}-meta.tsv", 'w', newline='') as csvfile: + w = csv.writer(csvfile, delimiter='\t', quoting=csv.QUOTE_MINIMAL) + w.writerow(["year", "key", "title"]) + for paper in data: + w.writerow([paper["year"], paper["key"], paper["title"]]) np.random.seed(args.seed) - out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(corpus_embeddings) + out = sklearn.manifold.TSNE(n_components=2, metric="cosine").fit_transform(all_embeddings) for i, paper_info in enumerate(data): paper_info['tsne_embedding'] = out[i].tolist() with open(args.outpath, 'w') as f: json.dump(data, f) + +if __name__ == "__main__": + args = parse_arguments() + main(args) From c6c7c030f3bfcb27e07bd7cc2facd9f6e33dfa3e Mon Sep 17 00:00:00 2001 From: "Sergey V. 
Kovalchuk" Date: Fri, 30 Jun 2023 16:11:53 +0300 Subject: [PATCH 077/114] Adding ICCS 2023, ICCQ 2023 --- _publications/kovalchuk2023test.markdown | 11 +++++++++++ _publications/lomshakov2023fine.markdown | 12 ++++++++++++ 2 files changed, 23 insertions(+) create mode 100644 _publications/kovalchuk2023test.markdown create mode 100644 _publications/lomshakov2023fine.markdown diff --git a/_publications/kovalchuk2023test.markdown b/_publications/kovalchuk2023test.markdown new file mode 100644 index 00000000..476f609c --- /dev/null +++ b/_publications/kovalchuk2023test.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: Test-based and metric-based evaluation of code generation models for practical question answering +authors: S. Kovalchuk, D. Fedrushkov, V. Lomshakov, A. Aliev +conference: ICCQ +year: 2023 +additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/10114665"} +tags: ["code generation", "test generation", "natural language generation", "evaluation", "metrics", "natural language processing"] +--- +We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don't pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models. \ No newline at end of file diff --git a/_publications/lomshakov2023fine.markdown b/_publications/lomshakov2023fine.markdown new file mode 100644 index 00000000..b38a2ff2 --- /dev/null +++ b/_publications/lomshakov2023fine.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets +authors: V. Lomshakov, S. Kovalchuk, M. Omelchenko, S. Nikolenko, A. Aliev +conference: ICCS +year: 2023 +additional_links: + - {name: "LNCS", url: "https://link.springer.com/chapter/10.1007/978-3-031-36021-3_15"} + - {name: "Papers with Code ", url: "https://paperswithcode.com/paper/fine-tuning-large-language-models-for"} +tags: ["program synthesis", "question answering", "large language models"] +--- +We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets — CoNaLa and a newly collected dataset based on Stack Overflow — we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. 
We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task. \ No newline at end of file From d58c7b70ec36dcd8cb9d799fc30bcc1750435446 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 12 Jul 2023 07:52:08 +0100 Subject: [PATCH 078/114] Add papers. --- _publications/olausson2023demystifying.markdown | 12 ++++++++++++ _publications/shrivastava2023repofusion.markdown | 12 ++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 _publications/olausson2023demystifying.markdown create mode 100644 _publications/shrivastava2023repofusion.markdown diff --git a/_publications/olausson2023demystifying.markdown b/_publications/olausson2023demystifying.markdown new file mode 100644 index 00000000..08466786 --- /dev/null +++ b/_publications/olausson2023demystifying.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Demystifying GPT Self-Repair for Code Generation" +authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2306.09896"} +tags: ["repair"] +--- +Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains. diff --git a/_publications/shrivastava2023repofusion.markdown b/_publications/shrivastava2023repofusion.markdown new file mode 100644 index 00000000..8cea558a --- /dev/null +++ b/_publications/shrivastava2023repofusion.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "RepoFusion: Training Code Models to Understand Your Repository" +authors: Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2306.10998"} +tags: ["completion"] +--- +Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. 
This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}. From 40de453692d20fd4658b9636bf742f2dabdf8c93 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 15 Jul 2023 20:35:21 +0100 Subject: [PATCH 079/114] Add paper --- _publications/ding2023static.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/ding2023static.markdown diff --git a/_publications/ding2023static.markdown b/_publications/ding2023static.markdown new file mode 100644 index 00000000..a4070318 --- /dev/null +++ b/_publications/ding2023static.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "A Static Evaluation of Code Completion by Large Language Models" +authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2306.03203"} +tags: ["LLM", "static analysis"] +--- +Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. 
Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions. From f6fc942d39ec17aeac06d41080a027aa5023fdda Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Fri, 21 Jul 2023 19:50:15 -0500 Subject: [PATCH 080/114] Create yadavally2023partial.markdown --- _publications/yadavally2023partial.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/yadavally2023partial.markdown diff --git a/_publications/yadavally2023partial.markdown b/_publications/yadavally2023partial.markdown new file mode 100644 index 00000000..ea8a4278 --- /dev/null +++ b/_publications/yadavally2023partial.markdown @@ -0,0 +1,12 @@ +--- + layout: publication + title: "(Partial) Program Dependence Learning" + authors: Aashish Yadavally, Wenbo Wang, Shaohua Wang, Tien N. Nguyen + conference: ICSE + year: 2023 + additional_links: + - {name: "website", url: "https://aashishyadavally.github.io/publication/C5"} + - {name: "code", url: "https://github.com/aashishyadavally/NeuralPDA"} + tags: ["large language models", "program analysis", "static analysis", "tool"] + --- +Code fragments from developer forums often migrate to applications due to the code reuse practice. Owing to the incomplete nature of such programs, analyzing them to early determine the presence of potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-score values for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG*) on the downstream task of method-level vulnerability detection. We discover that the performance of the vulnerability detection tool utilizing PDG* is only 1.1% less than that utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets. From 2829d9060c9f9cb94c91da843f7c471e1993818d Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Fri, 21 Jul 2023 20:03:00 -0500 Subject: [PATCH 081/114] Create wang2023deepvd --- _publications/wang2023deepvd | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/wang2023deepvd diff --git a/_publications/wang2023deepvd b/_publications/wang2023deepvd new file mode 100644 index 00000000..5e797eaf --- /dev/null +++ b/_publications/wang2023deepvd @@ -0,0 +1,12 @@ +--- +layout: publication +title: "DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection" +authors: Wenbo Wang, Tien N. 
Nguyen, Shaohua Wang, Yi Li, Jiyuan Zhang, Aashish Yadavally +conference: ICSE +year: 2023 +additional_links: + - {name: "website", url: "https://aashishyadavally.github.io/publication/C4"} + - {name: "code", url: "https://github.com/deepvd2022/deepvd2022"} +tags: ["vulnerability"] +--- +The advances of machine learning (ML) including deep learning (DL) have enabled several approaches to implicitly learn vulnerable code patterns to automatically detect software vulnerabilities. A recent study showed that despite successes, the existing ML/DL-based vulnerability detection (VD) models are limited in the ability to distinguish between the two classes of vulnerability and benign code. We propose DeepVD, a graph-based neural network VD model that emphasizes on class-separation features between vulnerability and benign code. DeepVD leverages three types of class-separation features at different levels of abstraction: statement types (similar to Part-of-Speech tagging), Post-Dominator Tree (covering regular flows of execution), and Exception Flow Graph (covering the exception and error-handling flows). We conducted several experiments to evaluate DeepVD in a real-world vulnerability dataset of 303 projects with 13,130 vulnerable methods. Our results show that DeepVD relatively improves over the state-of-the-art ML/DL-based VD approaches 13%–29.6% in precision, 15.6%–28.9% in recall, and 16.4%–25.8% in F-score. Our ablation study confirms that our designed features and components help DeepVD achieve high class-separability for vulnerability and benign code. From d18a7f2d09eaf25c9f34a383dbab89b1a94b80bf Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Fri, 21 Jul 2023 20:04:23 -0500 Subject: [PATCH 082/114] Rename wang2023deepvd to wang2023deepvd.markdown --- _publications/{wang2023deepvd => wang2023deepvd.markdown} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename _publications/{wang2023deepvd => wang2023deepvd.markdown} (100%) diff --git a/_publications/wang2023deepvd b/_publications/wang2023deepvd.markdown similarity index 100% rename from _publications/wang2023deepvd rename to _publications/wang2023deepvd.markdown From 99ed29381a99aaf1d70d32ec0f7408671ea8fb29 Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Fri, 21 Jul 2023 20:05:55 -0500 Subject: [PATCH 083/114] Update yadavally2023partial.markdown --- _publications/yadavally2023partial.markdown | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/_publications/yadavally2023partial.markdown b/_publications/yadavally2023partial.markdown index ea8a4278..46ab23b5 100644 --- a/_publications/yadavally2023partial.markdown +++ b/_publications/yadavally2023partial.markdown @@ -1,12 +1,12 @@ --- - layout: publication - title: "(Partial) Program Dependence Learning" - authors: Aashish Yadavally, Wenbo Wang, Shaohua Wang, Tien N. Nguyen - conference: ICSE - year: 2023 - additional_links: - - {name: "website", url: "https://aashishyadavally.github.io/publication/C5"} - - {name: "code", url: "https://github.com/aashishyadavally/NeuralPDA"} - tags: ["large language models", "program analysis", "static analysis", "tool"] - --- +layout: publication +title: "(Partial) Program Dependence Learning" +authors: Aashish Yadavally, Wenbo Wang, Shaohua Wang, Tien N. 
Nguyen +conference: ICSE +year: 2023 +additional_links: + - {name: "website", url: "https://aashishyadavally.github.io/publication/C5"} + - {name: "code", url: "https://github.com/aashishyadavally/NeuralPDA"} +tags: ["large language models", "program analysis", "static analysis", "tool"] +--- Code fragments from developer forums often migrate to applications due to the code reuse practice. Owing to the incomplete nature of such programs, analyzing them to early determine the presence of potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-score values for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG*) on the downstream task of method-level vulnerability detection. We discover that the performance of the vulnerability detection tool utilizing PDG* is only 1.1% less than that utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets. From 2b4c8f1f87280321ce23d081386a1f5687a925c8 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 26 Jul 2023 09:49:23 +0300 Subject: [PATCH 084/114] Add paper. --- _publications/peng2023generative.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/peng2023generative.markdown diff --git a/_publications/peng2023generative.markdown b/_publications/peng2023generative.markdown new file mode 100644 index 00000000..f794b7c1 --- /dev/null +++ b/_publications/peng2023generative.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Generative Type Inference for Python" +authors: Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2307.09163"} +tags: ["types"] +--- +Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. 
TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match. From f7513e991bc1bee242b0b32de0a2c36925dcecf5 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 23 Sep 2023 13:46:42 +0100 Subject: [PATCH 085/114] Add two papers. --- _publications/li2023hitchhiker.markdown | 12 ++++++++++++ _publications/xia2023universal.markdown | 12 ++++++++++++ 2 files changed, 24 insertions(+) create mode 100644 _publications/li2023hitchhiker.markdown create mode 100644 _publications/xia2023universal.markdown diff --git a/_publications/li2023hitchhiker.markdown b/_publications/li2023hitchhiker.markdown new file mode 100644 index 00000000..7d1bb9ba --- /dev/null +++ b/_publications/li2023hitchhiker.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models" +authors: Haonan Li, Yu Hao, Yizhuo Zhai, Zhiyun Qian +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2308.00245"} +tags: ["static analysis"] +--- +Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets.
diff --git a/_publications/xia2023universal.markdown b/_publications/xia2023universal.markdown new file mode 100644 index 00000000..ac8789e1 --- /dev/null +++ b/_publications/xia2023universal.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Universal Fuzzing via Large Language Models" +authors: Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, Lingming Zhang +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2308.04748"} +tags: ["fuzzing"] +--- +Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown. From e2ae08dc447f8672bb559e4dfe05e77121eb46ba Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sat, 23 Sep 2023 13:48:17 +0100 Subject: [PATCH 086/114] Add one more paper --- _publications/muennighoff2023octopack.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/muennighoff2023octopack.markdown diff --git a/_publications/muennighoff2023octopack.markdown b/_publications/muennighoff2023octopack.markdown new file mode 100644 index 00000000..3e5483d7 --- /dev/null +++ b/_publications/muennighoff2023octopack.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "OctoPack: Instruction Tuning Code Large Language Models" +authors: Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2308.07124"} +tags: ["dataset", "instruction tuning"] +--- +Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. 
We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack. From b0f3ae62b8f7c8a5bf045a8606cdc953e6138357 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 16 Oct 2023 12:48:09 +0100 Subject: [PATCH 087/114] Add paper --- _publications/chen2023supersonic.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/chen2023supersonic.markdown diff --git a/_publications/chen2023supersonic.markdown b/_publications/chen2023supersonic.markdown new file mode 100644 index 00000000..33e2ff37 --- /dev/null +++ b/_publications/chen2023supersonic.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Supersonic: Learning to Generate Source Code Optimizations in C/C++" +authors: Zimin Chen, Sen Fang, Martin Monperrus +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2309.14846"} +tags: ["optimization"] +--- +Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4. From 6b4502b5d23200be2d5466e86898b36d4e92a545 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 14 Nov 2023 10:57:39 +0000 Subject: [PATCH 088/114] Add Liu et al. --- _publications/liu2023code.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/liu2023code.markdown b/_publications/liu2023code.markdown index 1600487d..15cf547a 100644 --- a/_publications/liu2023code.markdown +++ b/_publications/liu2023code.markdown @@ -7,6 +7,6 @@ conference: year: 2023 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"} -tags: ["execution", "dynamic"] +tags: ["Transformer", "execution"] --- Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. 
However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution. From 695c6c34717a13131adbc2e4280d6b0e46275aed Mon Sep 17 00:00:00 2001 From: Lakshya A Agrawal Date: Mon, 20 Nov 2023 18:28:14 +0530 Subject: [PATCH 089/114] Create agrawal2023monitor.markdown --- _publications/agrawal2023monitor.markdown | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 _publications/agrawal2023monitor.markdown diff --git a/_publications/agrawal2023monitor.markdown b/_publications/agrawal2023monitor.markdown new file mode 100644 index 00000000..20e2e510 --- /dev/null +++ b/_publications/agrawal2023monitor.markdown @@ -0,0 +1,17 @@ +--- +layout: publication +title: Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context +authors: Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, Sriram Rajamani +conference: NeurIPS +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2306.10763"} + - {name: "NeurIPS website", url: "https://neurips.cc/virtual/2023/poster/70362"} + - {name: "code", url: "https://github.com/microsoft/monitors4codegen"} +tags: ["autocomplete", "benchmark", "code completion", "code generation", "compilation", "completion", "dataset", "evaluation", "language model", "large language models", "program analysis", "static analysis", "tool"] +--- +Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. + +Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. 
+ +We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen. From 345a74548fc89efded650ddb269b69cd43c578d6 Mon Sep 17 00:00:00 2001 From: Haochen Li Date: Wed, 1 Nov 2023 16:53:33 +0800 Subject: [PATCH 090/114] Add one paper --- _publications/li2023rethinking.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/li2023rethinking.markdown diff --git a/_publications/li2023rethinking.markdown b/_publications/li2023rethinking.markdown new file mode 100644 index 00000000..69a64b5a --- /dev/null +++ b/_publications/li2023rethinking.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: Rethinking Negative Pairs in Code Search +authors: Haochen Li, Xin Zhou, Luu Anh Tuan, Chunyan Miao +conference: EMNLP +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2310.08069"} + - {name: "code", url: "https://github.com/Alex-HaochenLi/Soft-InfoNCE"} +tags: ["search", "Transformer", "retrieval", "optimization"] +--- +Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning, InfoNCE is the most widely used loss function due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) The existence of false negative samples in large code corpora due to duplications. 2). The failure to explicitly differentiate between the potential relevance of negative samples. As an example, a bubble sorting algorithm example is less ``negative'' than a file saving function for the quick sorting algorithm query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of proposed loss functions with other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. 
\ No newline at end of file From 0fc24a0f4c4a3bd1088601632fc1bc34c6670b57 Mon Sep 17 00:00:00 2001 From: Haochen Li Date: Wed, 1 Nov 2023 17:50:31 +0800 Subject: [PATCH 091/114] Update _publications/li2023rethinking.markdown Co-authored-by: Alex --- _publications/li2023rethinking.markdown | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_publications/li2023rethinking.markdown b/_publications/li2023rethinking.markdown index 69a64b5a..daa816c0 100644 --- a/_publications/li2023rethinking.markdown +++ b/_publications/li2023rethinking.markdown @@ -7,6 +7,6 @@ year: 2023 additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2310.08069"} - {name: "code", url: "https://github.com/Alex-HaochenLi/Soft-InfoNCE"} -tags: ["search", "Transformer", "retrieval", "optimization"] +tags: ["search", "Transformer", "retrieval", "optimization", "representation"] --- Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning, InfoNCE is the most widely used loss function due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) The existence of false negative samples in large code corpora due to duplications. 2). The failure to explicitly differentiate between the potential relevance of negative samples. As an example, a bubble sorting algorithm example is less ``negative'' than a file saving function for the quick sorting algorithm query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of proposed loss functions with other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. \ No newline at end of file From d1371236598820efe2e4af7787550b157eb932fa Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Sun, 3 Dec 2023 10:52:55 +0000 Subject: [PATCH 092/114] Add paper. --- _publications/li2023think.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/li2023think.markdown diff --git a/_publications/li2023think.markdown b/_publications/li2023think.markdown new file mode 100644 index 00000000..89ab1a41 --- /dev/null +++ b/_publications/li2023think.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation" +authors: Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, Ming Li +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2305.10679"} +tags: ["generation", "Transformer"] +--- +Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. 
Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generating code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce the Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprints for solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers. From 7100c9914af6abea81ed49d8fa5dfe81eb1defca Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Mon, 11 Dec 2023 07:57:45 +0000 Subject: [PATCH 093/114] Add paper. --- _publications/eniser2023automatically.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/eniser2023automatically.markdown diff --git a/_publications/eniser2023automatically.markdown b/_publications/eniser2023automatically.markdown new file mode 100644 index 00000000..584f40a9 --- /dev/null +++ b/_publications/eniser2023automatically.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Automatically Testing Functional Properties of Code Translation Models" +authors: Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis +conference: AAAI +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2309.12813"} +tags: ["translation"] +--- +Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.
From 7d3be870c76658aa6bbd209884e3ffea712fe8ab Mon Sep 17 00:00:00 2001 From: Haochen Li Date: Wed, 17 Jan 2024 09:28:57 +0800 Subject: [PATCH 094/114] Add one paper --- _publications/li2024rewriting.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/li2024rewriting.markdown diff --git a/_publications/li2024rewriting.markdown b/_publications/li2024rewriting.markdown new file mode 100644 index 00000000..4401dfc7 --- /dev/null +++ b/_publications/li2024rewriting.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search +authors: Haochen Li, Xin Zhou, Zhiqi Shen +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2401.04514"} +tags: ["search", "large language models", "metrics"] +--- +In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances. 
\ No newline at end of file From 1e320c7966d78c2f7d6376ea35db74255fb1564e Mon Sep 17 00:00:00 2001 From: Miltos Date: Fri, 19 Jan 2024 17:45:51 +0000 Subject: [PATCH 095/114] Update li2024rewriting.markdown Fix escaping --- _publications/li2024rewriting.markdown | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_publications/li2024rewriting.markdown b/_publications/li2024rewriting.markdown index 4401dfc7..7b98ccd4 100644 --- a/_publications/li2024rewriting.markdown +++ b/_publications/li2024rewriting.markdown @@ -1,6 +1,6 @@ --- layout: publication -title: Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search +title: "Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search" authors: Haochen Li, Xin Zhou, Zhiqi Shen conference: year: 2024 @@ -8,4 +8,4 @@ additional_links: - {name: "ArXiV", url: "https://arxiv.org/abs/2401.04514"} tags: ["search", "large language models", "metrics"] --- -In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances. \ No newline at end of file +In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. 
Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances. From dfc4de443c23a0ae188833093aa6fe8494caa69a Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Tue, 13 Feb 2024 12:23:08 +0000 Subject: [PATCH 096/114] Add forgotten papers. --- _publications/gupta2023grace.markdown | 11 +++++++++++ _publications/mohajer2023skipanalyzer.markdown | 12 ++++++++++++ _publications/silva2023repairllama.markdown | 12 ++++++++++++ 3 files changed, 35 insertions(+) create mode 100644 _publications/gupta2023grace.markdown create mode 100644 _publications/mohajer2023skipanalyzer.markdown create mode 100644 _publications/silva2023repairllama.markdown diff --git a/_publications/gupta2023grace.markdown b/_publications/gupta2023grace.markdown new file mode 100644 index 00000000..4c1f3596 --- /dev/null +++ b/_publications/gupta2023grace.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Grace: Language Models Meet Code Edits" +authors: Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari +conference: FSE +year: 2023 +additional_links: + - {name: "ACM", url: "https://dl.acm.org/doi/abs/10.1145/3611643.3616253"} +tags: ["editing"] +--- +Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively. diff --git a/_publications/mohajer2023skipanalyzer.markdown b/_publications/mohajer2023skipanalyzer.markdown new file mode 100644 index 00000000..858e960a --- /dev/null +++ b/_publications/mohajer2023skipanalyzer.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models" +authors: Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, Song Wang +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2310.18532"} +tags: ["repair"] +--- +We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. 
SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%. diff --git a/_publications/silva2023repairllama.markdown b/_publications/silva2023repairllama.markdown new file mode 100644 index 00000000..8969a41f --- /dev/null +++ b/_publications/silva2023repairllama.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair" +authors: André Silva, Sen Fang, Martin Monperrus +conference: +year: 2023 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2312.15698"} +tags: ["repair"] +--- +Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter' for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines. 
From 180b42fe7b785a96690b5a45e0b0e9295909e3b3 Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Thu, 29 Feb 2024 07:32:56 +0000 Subject: [PATCH 097/114] Add paper --- _publications/ahmed2024studying.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/ahmed2024studying.markdown diff --git a/_publications/ahmed2024studying.markdown b/_publications/ahmed2024studying.markdown new file mode 100644 index 00000000..1677b84c --- /dev/null +++ b/_publications/ahmed2024studying.markdown @@ -0,0 +1,12 @@ + +--- +layout: publication +title: "Studying LLM Performance on Closed- and Open-source Data" +authors: Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty +conference: +year: 2024 +additional_links: +- {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"} +tags: ["Transformers"] +--- +Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS --> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning. 
From 6862da35bb0c92827082c3c409afff8bc5b68598 Mon Sep 17 00:00:00 2001 From: Federico Cichetti Date: Wed, 13 Mar 2024 10:00:10 +0100 Subject: [PATCH 098/114] Added papers --- _publications/barchi2019code.markdown | 12 ++++++++++++ _publications/barchi2020exploration.markdown | 12 ++++++++++++ _publications/barchi2022deep.markdown | 11 +++++++++++ _publications/parisi2022making.markdown | 12 ++++++++++++ _publications/parisi2022source.markdown | 12 ++++++++++++ 5 files changed, 59 insertions(+) create mode 100644 _publications/barchi2019code.markdown create mode 100644 _publications/barchi2020exploration.markdown create mode 100644 _publications/barchi2022deep.markdown create mode 100644 _publications/parisi2022making.markdown create mode 100644 _publications/parisi2022source.markdown diff --git a/_publications/barchi2019code.markdown b/_publications/barchi2019code.markdown new file mode 100644 index 00000000..1c66dc6b --- /dev/null +++ b/_publications/barchi2019code.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR" +authors: Francesco Barchi, Gianvito Urgese, Enrico Macii, Andrea Acquaviva +conference: DAC +year: 2019 +additional_links: + - {name: "ACM", url: "https://dl.acm.org/doi/10.1145/3316781.3317789"} + - {name: "code", url: "https://gitlab.com/ecs-lab/deepllvm"} +tags: ["optimization", "program analysis", "static analysis", "natural language processing"] +--- +Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed for supporting mapping decisions through the analysis of the program source code represented in LLVM assembly language (IR) for exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled in LLVM-IR and processed with our tokenizer capable of filtering less-informative tokens. We tested the network that reaches an accuracy of 85% in distinguishing the best computational unit. diff --git a/_publications/barchi2020exploration.markdown b/_publications/barchi2020exploration.markdown new file mode 100644 index 00000000..bba80a87 --- /dev/null +++ b/_publications/barchi2020exploration.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Exploration of Convolutional Neural Network models for source code classification" +authors: Francesco Barchi, Emanuele Parisi, Gianvito Urgese, Elisa Ficarra, Andrea Acquaviva +journal: Engineering Applications of Artificial Intelligence +year: 2021 +additional_links: + - {name: "ScienceDirect", url: "https://www.sciencedirect.com/science/article/pii/S0952197620303353"} + - {name: "code", url: "https://gitlab.com/ecs-lab/deepllvm"} +tags: ["optimization", "static analysis", "program analysis", "language model"] +--- +The application of Artificial Intelligence is becoming common in many engineering fields. Among them, one of the newest and rapidly evolving is software generation, where AI can be used to automatically optimise the implementation of an algorithm for a given computing platform. In particular, Deep Learning technologies can be used to the decide how to allocate pieces of code to hardware platforms with multiple cores and accelerators, that are common in high performance and edge computing applications. 
In this work, we explore the use of Convolutional Neural Networks (CNN)s to analyse the application source code and decide the best compute unit to minimise the execution time. We demonstrate that CNN models can be successfully applied to source code classification, providing higher accuracy with consistently reduced learning time with respect to state-of-the-art methods. Moreover, we show the robustness of the method with respect to source code pre-processing, compiler options and hyper-parameters selection. diff --git a/_publications/barchi2022deep.markdown b/_publications/barchi2022deep.markdown new file mode 100644 index 00000000..0f508efa --- /dev/null +++ b/_publications/barchi2022deep.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities" +authors: Francesco Barchi, Emanuele Parisi, Andrea Bartolini, Andrea Acquaviva +journal: Journal of Low Power Electronics and Applications +year: 2022 +additional_links: + - {name: "MDPI", url: "https://www.mdpi.com/2079-9268/12/3/37"} +tags: ["optimization", "review"] +--- +To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the most challenging and specific problems is efficiently allocating computational kernels to available hardware resources. In this field, deep learning applied to source code can be a key enabler to face this complexity. However, due to the rapid development of such techniques, it is not easy to understand which of those are suitable and most promising for this class of systems. For this purpose, we discuss recent developments in deep learning for source code analysis, and focus on techniques for kernel mapping on heterogeneous platforms, highlighting recent results, challenges and opportunities for their applications to cyber-physical systems. diff --git a/_publications/parisi2022making.markdown b/_publications/parisi2022making.markdown new file mode 100644 index 00000000..0c1efc18 --- /dev/null +++ b/_publications/parisi2022making.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Making the Most of Scarce Input Data in Deep Learning-Based Source Code Classification for Heterogeneous Device Mapping" +authors: Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Andrea Acquaviva +journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems +year: 2022 +additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9544064"} + - {name: "code", url: "https://gitlab.com/ecs-lab/deepllvm"} +tags: ["optimization", "program analysis", "static analysis", "language model"] +--- +Despite its relatively recent history, deep learning (DL)-based source code analysis is already a cornerstone in machine learning for compiler optimization. When applied to the classification of pieces of code to identify the best computational unit in a heterogeneous Systems-on-Chip, it can be effective in supporting decisions that a programmer has otherwise to take manually. 
Several techniques have been proposed exploiting different networks and input information, prominently sequence-based and graph-based representations, complemented by auxiliary information typically related to payload and device configuration. While the accuracy of DL methods strongly depends on the training and test datasets, so far no exhaustive and statistically meaningful analysis has been done on its impact on the results and on how to effectively extract the available information. This is relevant also considering the scarce availability of source code datasets that can be labeled by profiling on heterogeneous compute units. In this article, we first present such a study, which leads us to devise the contribution of code sequences and auxiliary inputs separately. Starting from this analysis, we then demonstrate that by using the normalization of auxiliary information, it is possible to improve state-of-the-art results in terms of accuracy. Finally, we propose a novel approach exploiting Siamese networks that further improve mapping accuracy by increasing the cardinality of the dataset, thus compensating for its relatively small size. \ No newline at end of file diff --git a/_publications/parisi2022source.markdown b/_publications/parisi2022source.markdown new file mode 100644 index 00000000..91b5d41a --- /dev/null +++ b/_publications/parisi2022source.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers" +authors: Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini, Andrea Acquaviva +conference: DATE +year: 2021 +additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9474085"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2012.06836"} +tags: ["optimization"] +--- +The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation. 
\ No newline at end of file From 59e61a275564baaf0826b002a7bdb99e4ec1a206 Mon Sep 17 00:00:00 2001 From: Federico Cichetti Date: Wed, 13 Mar 2024 13:34:28 +0100 Subject: [PATCH 099/114] Update --- ...i2020exploration.markdown => barchi2021exploration.markdown} | 0 .../{parisi2022source.markdown => parisi2021source.markdown} | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename _publications/{barchi2020exploration.markdown => barchi2021exploration.markdown} (100%) rename _publications/{parisi2022source.markdown => parisi2021source.markdown} (97%) diff --git a/_publications/barchi2020exploration.markdown b/_publications/barchi2021exploration.markdown similarity index 100% rename from _publications/barchi2020exploration.markdown rename to _publications/barchi2021exploration.markdown diff --git a/_publications/parisi2022source.markdown b/_publications/parisi2021source.markdown similarity index 97% rename from _publications/parisi2022source.markdown rename to _publications/parisi2021source.markdown index 91b5d41a..4cff09c3 100644 --- a/_publications/parisi2022source.markdown +++ b/_publications/parisi2021source.markdown @@ -7,6 +7,6 @@ year: 2021 additional_links: - {name: "IEEE", url: "https://ieeexplore.ieee.org/document/9474085"} - {name: "ArXiV", url: "https://arxiv.org/abs/2012.06836"} -tags: ["optimization"] +tags: ["optimization", "program analysis"] --- The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation. 
\ No newline at end of file From 10a5ed993c24f28da70fa74baabe0bec67ad5e6b Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Fri, 15 Mar 2024 18:44:57 +0330 Subject: [PATCH 100/114] Remove empty first line --- _publications/add_from_arxiv.py | 6 +++--- _publications/ahmed2024studying.markdown | 3 +-- _publications/chen2023supersonic.markdown | 3 +-- _publications/ding2023static.markdown | 3 +-- _publications/eniser2023automatically.markdown | 3 +-- _publications/li2023hitchhiker.markdown | 3 +-- _publications/li2023starcoder.markdown | 3 +-- _publications/li2023think.markdown | 3 +-- _publications/liu2023code.markdown | 3 +-- _publications/mohajer2023skipanalyzer.markdown | 3 +-- _publications/muennighoff2023octopack.markdown | 3 +-- _publications/olausson2023demystifying.markdown | 3 +-- _publications/peng2023generative.markdown | 3 +-- _publications/shrivastava2023repofusion.markdown | 3 +-- _publications/silva2023repairllama.markdown | 3 +-- _publications/wang2023codet5.markdown | 3 +-- _publications/xia2023universal.markdown | 3 +-- _publications/yin2022natural.markdown | 3 +-- 18 files changed, 20 insertions(+), 37 deletions(-) diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py index c9cfde73..0d4454a4 100644 --- a/_publications/add_from_arxiv.py +++ b/_publications/add_from_arxiv.py @@ -8,7 +8,7 @@ def _first_non_stopword(title: str) -> str: - for word in re.split("\W", title.lower()): + for word in re.split(r"\W", title.lower()): if word in ("a", "an", "the", "is", "are", "what", "who", "your"): continue return word @@ -30,7 +30,7 @@ def get_info(paper_id: str, out_dir: str) -> None: ) tmpl = textwrap.dedent( - f""" + f"""\ --- layout: publication title: "{paper.title}" @@ -38,7 +38,7 @@ def get_info(paper_id: str, out_dir: str) -> None: conference: year: {paper.published.year} additional_links: - - {{name: "ArXiV", url: "https://arxiv.org/abs/%7Bpaper_id%7D"}} + - {{name: "ArXiV", url: "https://arxiv.org/abs/%7Bpaper_id%7D"}} tags: ["TODO"] --- {summary} diff --git a/_publications/ahmed2024studying.markdown b/_publications/ahmed2024studying.markdown index 1677b84c..2996a1bf 100644 --- a/_publications/ahmed2024studying.markdown +++ b/_publications/ahmed2024studying.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Studying LLM Performance on Closed- and Open-source Data" @@ -6,7 +5,7 @@ authors: Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty conference: year: 2024 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"} tags: ["Transformers"] --- Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. 
We find that performance for C# changes little from OSS --> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning. diff --git a/_publications/chen2023supersonic.markdown b/_publications/chen2023supersonic.markdown index 33e2ff37..053333e2 100644 --- a/_publications/chen2023supersonic.markdown +++ b/_publications/chen2023supersonic.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Supersonic: Learning to Generate Source Code Optimizations in C/C++" @@ -6,7 +5,7 @@ authors: Zimin Chen, Sen Fang, Martin Monperrus conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2309.14846"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2309.14846"} tags: ["optimization"] --- Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4. diff --git a/_publications/ding2023static.markdown b/_publications/ding2023static.markdown index a4070318..9d0c4fc8 100644 --- a/_publications/ding2023static.markdown +++ b/_publications/ding2023static.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "A Static Evaluation of Code Completion by Large Language Models" @@ -6,7 +5,7 @@ authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, X conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2306.03203"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2306.03203"} tags: ["LLM", "static analysis"] --- Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. 
Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions. diff --git a/_publications/eniser2023automatically.markdown b/_publications/eniser2023automatically.markdown index 584f40a9..cc664bbb 100644 --- a/_publications/eniser2023automatically.markdown +++ b/_publications/eniser2023automatically.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Automatically Testing Functional Properties of Code Translation Models" @@ -6,7 +5,7 @@ authors: Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis conference: AAAI year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2309.12813"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2309.12813"} tags: ["translation"] --- Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations. diff --git a/_publications/li2023hitchhiker.markdown b/_publications/li2023hitchhiker.markdown index 7d1bb9ba..eb046f44 100644 --- a/_publications/li2023hitchhiker.markdown +++ b/_publications/li2023hitchhiker.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models" @@ -6,7 +5,7 @@ authors: Haonan Li, Yu Hao, Yizhuo Zhai, Zhiyun Qian conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2308.00245"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2308.00245"} tags: ["static analysis"] --- Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. 
In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets. diff --git a/_publications/li2023starcoder.markdown b/_publications/li2023starcoder.markdown index 90474f19..416b3924 100644 --- a/_publications/li2023starcoder.markdown +++ b/_publications/li2023starcoder.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "StarCoder: may the source be with you!" @@ -6,7 +5,7 @@ authors: Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Ko conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2305.06161"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2305.06161"} tags: ["Transformer"] --- The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license. 
diff --git a/_publications/li2023think.markdown b/_publications/li2023think.markdown index 89ab1a41..441e3d49 100644 --- a/_publications/li2023think.markdown +++ b/_publications/li2023think.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation" @@ -6,7 +5,7 @@ authors: Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, Ming Li conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2305.10679"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2305.10679"} tags: ["generation", "Transformer"] --- Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generate code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprint of solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers. diff --git a/_publications/liu2023code.markdown b/_publications/liu2023code.markdown index 15cf547a..2009fd2d 100644 --- a/_publications/liu2023code.markdown +++ b/_publications/liu2023code.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Code Execution with Pre-trained Language Models" @@ -6,7 +5,7 @@ authors: Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2305.05383"} tags: ["Transformer", "execution"] --- Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution. 
diff --git a/_publications/mohajer2023skipanalyzer.markdown b/_publications/mohajer2023skipanalyzer.markdown index 858e960a..cbf424e7 100644 --- a/_publications/mohajer2023skipanalyzer.markdown +++ b/_publications/mohajer2023skipanalyzer.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models" @@ -6,7 +5,7 @@ authors: Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2310.18532"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2310.18532"} tags: ["repair"] --- We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%. diff --git a/_publications/muennighoff2023octopack.markdown b/_publications/muennighoff2023octopack.markdown index 3e5483d7..718e7c30 100644 --- a/_publications/muennighoff2023octopack.markdown +++ b/_publications/muennighoff2023octopack.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "OctoPack: Instruction Tuning Code Large Language Models" @@ -6,7 +5,7 @@ authors: Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2308.07124"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2308.07124"} tags: ["dataset", "instruction tuning"] --- Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. 
We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack's benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack. diff --git a/_publications/olausson2023demystifying.markdown b/_publications/olausson2023demystifying.markdown index 08466786..8f89853a 100644 --- a/_publications/olausson2023demystifying.markdown +++ b/_publications/olausson2023demystifying.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Demystifying GPT Self-Repair for Code Generation" @@ -6,7 +5,7 @@ authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Ar conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2306.09896"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2306.09896"} tags: ["repair"] --- Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair -- in which the model debugs and fixes mistakes in its own code -- has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4's ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains. diff --git a/_publications/peng2023generative.markdown b/_publications/peng2023generative.markdown index f794b7c1..7238aea7 100644 --- a/_publications/peng2023generative.markdown +++ b/_publications/peng2023generative.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Generative Type Inference for Python" @@ -6,7 +5,7 @@ authors: Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2307.09163"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2307.09163"} tags: ["types"] --- Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. 
However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match. diff --git a/_publications/shrivastava2023repofusion.markdown b/_publications/shrivastava2023repofusion.markdown index 8cea558a..e450ec90 100644 --- a/_publications/shrivastava2023repofusion.markdown +++ b/_publications/shrivastava2023repofusion.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "RepoFusion: Training Code Models to Understand Your Repository" @@ -6,7 +5,7 @@ authors: Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Tor conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2306.10998"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2306.10998"} tags: ["completion"] --- Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. 
We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}. diff --git a/_publications/silva2023repairllama.markdown b/_publications/silva2023repairllama.markdown index 8969a41f..42df7795 100644 --- a/_publications/silva2023repairllama.markdown +++ b/_publications/silva2023repairllama.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair" @@ -6,7 +5,7 @@ authors: André Silva, Sen Fang, Martin Monperrus conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2312.15698"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2312.15698"} tags: ["repair"] --- Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter' for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines. diff --git a/_publications/wang2023codet5.markdown b/_publications/wang2023codet5.markdown index 1c4abb27..a75b04a2 100644 --- a/_publications/wang2023codet5.markdown +++ b/_publications/wang2023codet5.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "CodeT5+: Open Code Large Language Models for Code Understanding and Generation" @@ -6,7 +5,7 @@ authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2305.07922"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2305.07922"} tags: ["Transformer"] --- Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. 
The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+'', a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs. diff --git a/_publications/xia2023universal.markdown b/_publications/xia2023universal.markdown index ac8789e1..0f20b845 100644 --- a/_publications/xia2023universal.markdown +++ b/_publications/xia2023universal.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Universal Fuzzing via Large Language Models" @@ -6,7 +5,7 @@ authors: Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, Ling conference: year: 2023 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2308.04748"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2308.04748"} tags: ["fuzzing"] --- Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. 
We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown. diff --git a/_publications/yin2022natural.markdown b/_publications/yin2022natural.markdown index bd44ea68..da39d6cf 100644 --- a/_publications/yin2022natural.markdown +++ b/_publications/yin2022natural.markdown @@ -1,4 +1,3 @@ - --- layout: publication title: "Natural Language to Code Generation in Interactive Data Science Notebooks" @@ -6,7 +5,7 @@ authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kense conference: year: 2022 additional_links: -- {name: "ArXiV", url: "https://arxiv.org/abs/2212.09248"} + - {name: "ArXiV", url: "https://arxiv.org/abs/2212.09248"} tags: ["notebook", "evaluation"] --- Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions. From 64eeaf1bca9e6762cd6f328c5e8fa57832a9acfe Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Fri, 15 Mar 2024 16:51:30 +0330 Subject: [PATCH 101/114] Add T5APR --- _publications/gharibi2024t5apr.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/gharibi2024t5apr.markdown diff --git a/_publications/gharibi2024t5apr.markdown b/_publications/gharibi2024t5apr.markdown new file mode 100644 index 00000000..7f4cb6be --- /dev/null +++ b/_publications/gharibi2024t5apr.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble" +authors: Reza Gharibi, Mohammad Hadi Sadreddini, Seyed Mostafa Fakhrahmad +journal: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2309.15742"} + - {name: "Code", url: "https://github.com/h4iku/T5APR"} +tags: ["repair", "Transformer"] +--- +Automated program repair (APR) using deep learning techniques has become an important area of research in recent years, aiming to automatically generate bug-fixing patches that can improve software reliability and maintainability. 
However, most existing methods either target a single language or require high computational resources to train multilingual models. In this paper, we propose T5APR, a novel neural program repair approach that provides a unified solution for bug fixing across multiple programming languages. T5APR leverages CodeT5, a powerful pre-trained text-to-text transformer model, and adopts a checkpoint ensemble strategy to improve patch recommendation. We conduct comprehensive evaluations on six well-known benchmarks in four programming languages (Java, Python, C, JavaScript), demonstrating T5APR's competitiveness against state-of-the-art techniques. T5APR correctly fixes 1,985 bugs, including 1,442 bugs that none of the compared techniques has fixed. We further support the effectiveness of our approach by conducting detailed analyses, such as comparing the correct patch ranking among different techniques. The findings of this study demonstrate the potential of T5APR for use in real-world applications and highlight the importance of multilingual approaches in the field of APR. From 27fd3f90a44f4835639733ab582c320384303f5f Mon Sep 17 00:00:00 2001 From: Arno Schneuwly Date: Mon, 18 Mar 2024 13:59:20 +0100 Subject: [PATCH 102/114] Add LLM4Decompile paper --- _publications/tan2024llm4decompile.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/tan2024llm4decompile.markdown diff --git a/_publications/tan2024llm4decompile.markdown b/_publications/tan2024llm4decompile.markdown new file mode 100644 index 00000000..8ea0b686 --- /dev/null +++ b/_publications/tan2024llm4decompile.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "LLM4Decompile: Decompiling Binary Code with Large Language Models" +authors: Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2403.05286"} + - {name: "code", url: "https://github.com/albertan017/LLM4Decompile"} +tags: ["decompilation", "translation", "evaluation", "large language models", "LLM"] +--- +Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. 
Our code, dataset, and models are released at this [https URL](https://github.com/albertan017/LLM4Decompile) From 75accff06fa7f2082c9cf70aac1387c6ffe6aefc Mon Sep 17 00:00:00 2001 From: Arno Schneuwly Date: Tue, 19 Mar 2024 23:05:57 +0100 Subject: [PATCH 103/114] Use /usr/bin/env shebang in arXiv script --- _publications/add_from_arxiv.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) mode change 100644 => 100755 _publications/add_from_arxiv.py diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py old mode 100644 new mode 100755 index 0d4454a4..c8e7caaf --- a/_publications/add_from_arxiv.py +++ b/_publications/add_from_arxiv.py @@ -1,4 +1,4 @@ -#!/bin/python3 +#! /usr/bin/env python3 import argparse import arxiv From 0aeaf2d94bf81f17d679adf5c77aacf05f178ba1 Mon Sep 17 00:00:00 2001 From: Arno Schneuwly Date: Tue, 19 Mar 2024 23:06:29 +0100 Subject: [PATCH 104/114] Add 2024 Casey et al Cybersec Rep Learning Paper --- _publications/casey2024survey.markdown | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 _publications/casey2024survey.markdown diff --git a/_publications/casey2024survey.markdown b/_publications/casey2024survey.markdown new file mode 100644 index 00000000..9e9e2c2f --- /dev/null +++ b/_publications/casey2024survey.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks" +authors: Beatrice Casey, Joanna C. S. Santos, George Perry +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2403.10646"} +tags: ["survey", "cybersecurity", "vulnerability"] +--- +Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall. 
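As an aside on the `add_from_arxiv.py` clean-ups running through this series (the raw-string regex in `_first_non_stopword` and the `/usr/bin/env` shebang above), the newly added files (`gharibi2024t5apr.markdown`, `tan2024llm4decompile.markdown`, `casey2024survey.markdown`) follow a `lastnameYEARfirstword` naming pattern. A rough, illustrative sketch of that rule is below; the helper name and signature are ours for illustration and are not part of the repository script, though the stopword list mirrors `_first_non_stopword`.

```python
import re

# Stopwords skipped when picking the first title word, mirroring _first_non_stopword above.
STOPWORDS = ("a", "an", "the", "is", "are", "what", "who", "your")

def publication_filename(first_author: str, year: int, title: str) -> str:
    """Illustrative helper: builds lastnameYEARfirstword.markdown from paper metadata."""
    lastname = first_author.split()[-1].lower()
    firstword = next(
        word for word in re.split(r"\W", title.lower())
        if word and word not in STOPWORDS
    )
    return f"{lastname}{year}{firstword}.markdown"

# Example: reproduces the file name added in the T5APR patch above.
print(publication_filename("Reza Gharibi", 2024,
                           "T5APR: Empowering Automated Program Repair across Languages"))
# gharibi2024t5apr.markdown
```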
From 78c0ef336da7ff011b818d639f22486a4711934a Mon Sep 17 00:00:00 2001 From: Miltos Allamanis Date: Wed, 3 Apr 2024 11:30:00 +0300 Subject: [PATCH 105/114] Add papers --- _publications/berabi2024deepcode.markdown | 11 +++++++++++ _publications/bouzenia2024repairagent.markdown | 11 +++++++++++ _publications/cassano2023can.markdown | 11 +++++++++++ _publications/guo2024deepseek.markdown | 11 +++++++++++ _publications/tian2024debugbench.markdown | 11 +++++++++++ 5 files changed, 55 insertions(+) create mode 100644 _publications/berabi2024deepcode.markdown create mode 100644 _publications/bouzenia2024repairagent.markdown create mode 100644 _publications/cassano2023can.markdown create mode 100644 _publications/guo2024deepseek.markdown create mode 100644 _publications/tian2024debugbench.markdown diff --git a/_publications/berabi2024deepcode.markdown b/_publications/berabi2024deepcode.markdown new file mode 100644 index 00000000..6f55041d --- /dev/null +++ b/_publications/berabi2024deepcode.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models" +authors: Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2402.13291"} +tags: ["repair", "vulnerability"] +--- +The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix. 
diff --git a/_publications/bouzenia2024repairagent.markdown b/_publications/bouzenia2024repairagent.markdown new file mode 100644 index 00000000..9796ab25 --- /dev/null +++ b/_publications/bouzenia2024repairagent.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair" +authors: Islem Bouzenia, Premkumar Devanbu, Michael Pradel +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2403.17134"} +tags: ["repair"] +--- +Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering. diff --git a/_publications/cassano2023can.markdown b/_publications/cassano2023can.markdown new file mode 100644 index 00000000..37fc1248 --- /dev/null +++ b/_publications/cassano2023can.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions" +authors: Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, Arjun Guha +conference: +year: 2023 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2312.12450"} +tags: ["editing"] +--- +A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. 
Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit. diff --git a/_publications/guo2024deepseek.markdown b/_publications/guo2024deepseek.markdown new file mode 100644 index 00000000..91c16fbe --- /dev/null +++ b/_publications/guo2024deepseek.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence" +authors: Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2401.14196"} +tags: ["Transformers"] +--- +The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use. diff --git a/_publications/tian2024debugbench.markdown b/_publications/tian2024debugbench.markdown new file mode 100644 index 00000000..10dd79a9 --- /dev/null +++ b/_publications/tian2024debugbench.markdown @@ -0,0 +1,11 @@ +--- +layout: publication +title: "DebugBench: Evaluating Debugging Capability of Large Language Models" +authors: Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, Maosong Sun +conference: +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2401.04621"} +tags: ["repair"] +--- +Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. 
We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging. From 782a648a20f762fcd61b8ab87a4e1351800df3d6 Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Fri, 15 Mar 2024 22:37:23 +0330 Subject: [PATCH 106/114] Fix deprecated arxiv method call --- _publications/add_from_arxiv.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/_publications/add_from_arxiv.py b/_publications/add_from_arxiv.py index c8e7caaf..8b69723c 100755 --- a/_publications/add_from_arxiv.py +++ b/_publications/add_from_arxiv.py @@ -20,8 +20,9 @@ def _author_lastname(author_name: str) -> str: def get_info(paper_id: str, out_dir: str) -> None: + client = arxiv.Client() search = arxiv.Search(id_list=[paper_id]) - paper = next(search.results()) + paper = next(client.results(search)) summary = ( paper.summary.replace("\n\n", "@@--@@") From 616ee6ade17f52a22c91b4bc0469c5478d575d60 Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Sun, 31 Mar 2024 17:38:34 -0500 Subject: [PATCH 107/114] Create yadavally2024static-slicing.markdown --- _publications/yadavally2024static-slicing.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/yadavally2024static-slicing.markdown diff --git a/_publications/yadavally2024static-slicing.markdown b/_publications/yadavally2024static-slicing.markdown new file mode 100644 index 00000000..8fff1de0 --- /dev/null +++ b/_publications/yadavally2024static-slicing.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "A Learning-Based Approach to Static Program Slicing" +authors: Aashish Yadavally, Yi Li, Shaohua Wang, Tien N. Nguyen +conference: OOPSLA +year: 2024 +additional_links: + - {name: "website", url: "https://aashishyadavally.github.io/assets/pdf/pub-oopsla2024.pdf"} + - {name: "code", url: "https://github.com/aashishyadavally/ns-slicer"} +tags: ["large language models", "program analysis", "static analysis", "tool"] +--- +Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code. Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variable-statement dependencies within source code. With this knowledge, given a variable at a specific location and a statement in a code snippet, NS-Slicer determines whether the statement belongs to the backward slice or forward slice, respectively. We conducted a series of experiments to evaluate NS-Slicer’s performance. On complete code, it predicts the backward and forward slices with an F1-score of 97.41% and 95.82%, respectively, while achieving an overall F1-score of 96.77%. 
Notably, in 85.20% of the cases, the static program slices predicted by NS-Slicer exactly match entire slices from the oracle. For partial programs, it achieved an F1-score of 96.77%–97.49% for backward slicing, 92.14%–95.40% for forward slicing, and an overall F1-score of 94.66%–96.62%. Furthermore, we demonstrate NS-Slicer’s utility in vulnerability detection (VD), integrating its predicted slices into an automated VD tool. In this setup, the tool detected vulnerabilities in Java code with a high F1-score of 73.38%. We also include the analyses studying NS-Slicer’s promising performance and limitations, providing insights into its understanding of intrinsic code properties such as variable aliasing, leading to better slicing. From 8ec8336fd32690b98899165f66f9070d9e0bd2d3 Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Sun, 31 Mar 2024 17:47:03 -0500 Subject: [PATCH 108/114] Update and rename yadavally2024static-slicing.markdown to yadavally2024learning.markdown --- ...24static-slicing.markdown => yadavally2024learning.markdown} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename _publications/{yadavally2024static-slicing.markdown => yadavally2024learning.markdown} (96%) diff --git a/_publications/yadavally2024static-slicing.markdown b/_publications/yadavally2024learning.markdown similarity index 96% rename from _publications/yadavally2024static-slicing.markdown rename to _publications/yadavally2024learning.markdown index 8fff1de0..3a46067e 100644 --- a/_publications/yadavally2024static-slicing.markdown +++ b/_publications/yadavally2024learning.markdown @@ -7,6 +7,6 @@ year: 2024 additional_links: - {name: "website", url: "https://aashishyadavally.github.io/assets/pdf/pub-oopsla2024.pdf"} - {name: "code", url: "https://github.com/aashishyadavally/ns-slicer"} -tags: ["large language models", "program analysis", "static analysis", "tool"] +tags: ["large language models", "program analysis", "static", "tool"] --- Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code. Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variable-statement dependencies within source code. With this knowledge, given a variable at a specific location and a statement in a code snippet, NS-Slicer determines whether the statement belongs to the backward slice or forward slice, respectively. We conducted a series of experiments to evaluate NS-Slicer’s performance. On complete code, it predicts the backward and forward slices with an F1-score of 97.41% and 95.82%, respectively, while achieving an overall F1-score of 96.77%. Notably, in 85.20% of the cases, the static program slices predicted by NS-Slicer exactly match entire slices from the oracle. For partial programs, it achieved an F1-score of 96.77%–97.49% for backward slicing, 92.14%–95.40% for forward slicing, and an overall F1-score of 94.66%–96.62%. Furthermore, we demonstrate NS-Slicer’s utility in vulnerability detection (VD), integrating its predicted slices into an automated VD tool. In this setup, the tool detected vulnerabilities in Java code with a high F1-score of 73.38%. 
We also include the analyses studying NS-Slicer’s promising performance and limitations, providing insights into its understanding of intrinsic code properties such as variable aliasing, leading to better slicing. From dcc44775b747b2cdfe5b0582cffa0434c8cfe05e Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Sun, 31 Mar 2024 17:42:05 -0500 Subject: [PATCH 109/114] Create yadavally2024dynamic-slicing.markdown --- _publications/yadavally2024dynamic-slicing.markdown | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 _publications/yadavally2024dynamic-slicing.markdown diff --git a/_publications/yadavally2024dynamic-slicing.markdown b/_publications/yadavally2024dynamic-slicing.markdown new file mode 100644 index 00000000..5556bb13 --- /dev/null +++ b/_publications/yadavally2024dynamic-slicing.markdown @@ -0,0 +1,12 @@ +--- +layout: publication +title: "Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning" +authors: Aashish Yadavally, Yi Li, Tien N. Nguyen +conference: FSE +year: 2024 +additional_links: + - {name: "website", url: "https://aashishyadavally.github.io/assets/pdf/pub-fse2024.pdf"} + - {name: "code", url: "https://github.com/aashishyadavally/nd-slicer"} +tags: ["large language models", "program analysis", "dynamic analysis", "tool"] +--- +Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls. 
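For reference on the `Fix deprecated arxiv method call` patch above, a minimal, illustrative sketch of the updated `arxiv` package usage is shown below; the paper ID is only an example value, taken from the T5APR entry earlier in this series.

```python
import arxiv

# Client.results(search) replaces the deprecated Search.results() generator.
client = arxiv.Client()
search = arxiv.Search(id_list=["2309.15742"])  # example arXiv ID (T5APR preprint)
paper = next(client.results(search))

# The same fields the script reads when filling the publication template.
print(paper.title)
print(paper.published.year)
print(paper.summary[:100])
```

These are the fields (`title`, `published.year`, `summary`) that `add_from_arxiv.py` feeds into the generated publication markdown.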
From 3f529da65eed8aa44531a1aa64953ff8f9b834e8 Mon Sep 17 00:00:00 2001 From: Aashish Yadavally Date: Sun, 31 Mar 2024 17:45:32 -0500 Subject: [PATCH 110/114] Update and rename yadavally2024dynamic-slicing.markdown to yadavally2024predictive.markdown --- ...ynamic-slicing.markdown => yadavally2024predictive.markdown} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename _publications/{yadavally2024dynamic-slicing.markdown => yadavally2024predictive.markdown} (95%) diff --git a/_publications/yadavally2024dynamic-slicing.markdown b/_publications/yadavally2024predictive.markdown similarity index 95% rename from _publications/yadavally2024dynamic-slicing.markdown rename to _publications/yadavally2024predictive.markdown index 5556bb13..9f8930b1 100644 --- a/_publications/yadavally2024dynamic-slicing.markdown +++ b/_publications/yadavally2024predictive.markdown @@ -7,6 +7,6 @@ year: 2024 additional_links: - {name: "website", url: "https://aashishyadavally.github.io/assets/pdf/pub-fse2024.pdf"} - {name: "code", url: "https://github.com/aashishyadavally/nd-slicer"} -tags: ["large language models", "program analysis", "dynamic analysis", "tool"] +tags: ["large language models", "program analysis", "dynamic", "tool"] --- Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls. From bb2af2153f2743dcb3c94881dd229d539bfd14c2 Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Sun, 28 Apr 2024 10:50:14 +0330 Subject: [PATCH 111/114] Allow `journal` key in the publication template --- _layouts/publication.html | 2 +- _publications/template | 6 +++--- contributing.markdown | 13 +++++++------ 3 files changed, 11 insertions(+), 10 deletions(-) diff --git a/_layouts/publication.html b/_layouts/publication.html index 4aff5c72..89e8b916 100644 --- a/_layouts/publication.html +++ b/_layouts/publication.html @@ -4,7 +4,7 @@

{{ page.title }}

-{{ page.authors }}. {{ page.conference }} {{ page.year }}
+{{ page.authors }}. {{ page.conference | default: page.journal }} {{ page.year }}

{% for additional_link in page.additional_links %} [{{ additional_link.name }}] diff --git a/_publications/template b/_publications/template index a6d0c379..8e8f760a 100644 --- a/_publications/template +++ b/_publications/template @@ -2,11 +2,11 @@ layout: publication title: "Add title here" authors: FirstName LastName, FirstName LastName -conference: Optional +conference: Optional # OR journal year: 2000 additional_links: - - {name: "ArXiV", url: "https://arxiv.org/abs/xxxx.xxxxxx"} - - {name: "Dataset", url: "https://blah/blah"} + - {name: "ArXiV", url: "https://arxiv.org/abs/xxxx.xxxxxx"} + - {name: "Dataset", url: "https://blah/blah"} tags: ["dataset"] --- Abstract here diff --git a/contributing.markdown b/contributing.markdown index fe22b050..bd906b48 100644 --- a/contributing.markdown +++ b/contributing.markdown @@ -8,21 +8,22 @@ Contributions of new or missing publications are very welcome. Alternative categ ### Adding a publication To add a publication (new or missing), create a file in the `_publications` folder. The name of the file should follow the structure `lastnameYEARfirstword.markdown` where `lastname` is the last name of the first author and `firstword` is the first non-punctuation word of the work's title. Within each file, follow the structure shown in the other files. Once the file is added, the work will appear in the "All Papers" section. -

+
+```yaml
 ---
 layout: publication
 title: The title of the Publication
 authors: F. M. LastName, F. M. LastName, ...
-conference: AbbreviatedNameOfConference
+conference: AbbreviatedNameOfConference  # Or journal: AbbreviatedNameOfJournal
 year: YEAR
 additional_links:
-   - {name: "ArXiV", url: "http://arxiv.org/abs/XXXX.YYYY"}
-   - {name: "website", url: "http://paperwebsite.com/"}
-   - {name: "code", url: "https://github.com/path-to/code"}
+  - {name: "ArXiV", url: "http://arxiv.org/abs/XXXX.YYYY"}
+  - {name: "website", url: "http://paperwebsite.com/"}
+  - {name: "code", url: "https://github.com/path-to/code"}
 tags: ["tag1", "tag2"]
 ---
 Text of abstract goes here.
-
+``` The `additional_links` are optional and arbitrary and they will appear on the page referring to this work. Feel free to add as many additional links as needed. From bb4f18c60ceccc27acf26518eba6ad09308fc8a5 Mon Sep 17 00:00:00 2001 From: SeekingDream <920730325@qq.com> Date: Tue, 13 Aug 2024 17:05:45 +0800 Subject: [PATCH 112/114] add ppm and nnreverse --- _publications/chen2022learning.md | 11 +++++++++++ _publications/chen2024ppm.md | 12 ++++++++++++ 2 files changed, 23 insertions(+) create mode 100644 _publications/chen2022learning.md create mode 100644 _publications/chen2024ppm.md diff --git a/_publications/chen2022learning.md b/_publications/chen2022learning.md new file mode 100644 index 00000000..56f2e380 --- /dev/null +++ b/_publications/chen2022learning.md @@ -0,0 +1,11 @@ +--- +layout: publication +title: "Learning to Reverse DNNs from AI Programs Automatically" +authors: Simin Chen, Hamed Khanpour, Cong Liu, Wei Yang +conference: IJCAI-ECAI 2022 +year: 2022 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/pdf/2205.10364"} +tags: ["Reverse Engineering", "Binary Code"] +--- +With the privatized deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instruction semantics precisely, NNReverse proposes a more fine-grained embedding model to represent the textual and structural semantics of assembly functions. diff --git a/_publications/chen2024ppm.md b/_publications/chen2024ppm.md new file mode 100644 index 00000000..bbd5e083 --- /dev/null +++ b/_publications/chen2024ppm.md @@ -0,0 +1,12 @@ +--- +layout: publication +title: "PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models" +authors: Simin Chen, Xiaoning Feng, Xiaohong Han, Cong Liu, Wei Yang +conference: FSE 2024 +year: 2024 +additional_links: + - {name: "ArXiV", url: "https://arxiv.org/abs/2401.15545"} + - {name: "Code", url: "https://github.com/SeekingDream/PPM"} +tags: ["benchmarking", "evaluation"] +--- +In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs' potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by an IDE, making them ineffective and unrealistic.
In this work, we propose the idea of programming problem merging (PPM) and provide two implementations of this idea. We utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, compared to the baselines. From f36a0e8bd4d75485b0db7dee31c31553336ed0d3 Mon Sep 17 00:00:00 2001 From: Reza Gharibi Date: Tue, 20 Aug 2024 08:05:57 +0330 Subject: [PATCH 113/114] Update NLTK tokenizer's data --- etc/compute_related.py | 2 +- etc/compute_topics.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/etc/compute_related.py b/etc/compute_related.py index 17d09fb7..36f3bc2c 100644 --- a/etc/compute_related.py +++ b/etc/compute_related.py @@ -6,7 +6,7 @@ nltk.download('stopwords') nltk.download('wordnet') -nltk.download('punkt') +nltk.download('punkt_tab') from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer diff --git a/etc/compute_topics.py b/etc/compute_topics.py index a219a61f..0bba7ade 100644 --- a/etc/compute_topics.py +++ b/etc/compute_topics.py @@ -5,7 +5,7 @@ nltk.download('omw-1.4') nltk.download('stopwords') nltk.download('wordnet') -nltk.download('punkt') +nltk.download('punkt_tab') from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer From 44e20ba023fc5d95ebdd0c067f3a61045947eebd Mon Sep 17 00:00:00 2001 From: williambrach Date: Fri, 7 Feb 2025 13:02:33 +0100 Subject: [PATCH 114/114] adding publication --- _publications/brach2024can.markdown | 13 +++++++++++++ 1 file changed, 13 insertions(+) create mode 100644 _publications/brach2024can.markdown diff --git a/_publications/brach2024can.markdown b/_publications/brach2024can.markdown new file mode 100644 index 00000000..99b25d3e --- /dev/null +++ b/_publications/brach2024can.markdown @@ -0,0 +1,13 @@ +--- +layout: publication +title: Can Large Language Model Detect Plagiarism in Source Code? +authors: William Brach, Kristián Košťál, Michal Ries +conference: FLLM +year: 2024 +additional_links: + - {name: "IEEE", url: "https://ieeexplore.ieee.org/abstract/document/10852497"} + - {name: "website", url: "https://www.researchgate.net/profile/Kristian-Kostal/publication/386176004_Can_Large_Language_Model_Detect_Plagiarism_in_Source_Code/links/67479110a7fbc259f1935bcb/Can-Large-Language-Model-Detect-Plagiarism-in-Source-Code.pdf"} + - {name: "code", url: "https://github.com/fiit-ba/llm-plagiarism-check"} +tags: ["code similarity", "large language models", "LLM","plagiarism detection", "natural language processing"] +--- +The issue of code plagiarism represents a significant challenge in the academic environment. This study examines the potential of large language models (LLMs) in improving the detection of code plagiarism. The performance of several LLMs, including GPT-4o, GPT3.5 Turbo, LLaMA 3, and CodeLlama, is evaluated in comparison to conventional tools, such as JPlag, across a range of levels of code plagiarism. The findings of our study illustrate that state-of-the-art LLMs are able to outperform traditional methods, particularly in the detection of sophisticated forms of plagiarism. GPT-4o exhibited the highest overall accuracy (78.70%) and an F1 score of 86.97%.
It is important to note that open-source models, such as LLaMA 3 (accuracy 71.53%, F1 score 82.75%), demonstrated the ability to detect the most complex forms of plagiarism with the same accuracy as GPT-4o. While these results demonstrate the promising potential of LLMs in code similarity analysis, it is also evident that higher false positive rates may be an inherent limitation, emphasizing the need for human oversight. This study contributes valuable insights into the application of AI in maintaining code integrity and academic honesty, paving the way for more effective, interpretable, and fair plagiarism detection systems in software development education and practice. \ No newline at end of file
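A quick usage note on the `journal` support added by the layout and template changes above: an entry that appeared in a journal rather than a conference can set `journal:` in its front matter, and the updated layout renders it through the `{{ page.conference | default: page.journal }}` fallback. The sketch below is a hypothetical entry that follows the repository's own template; every value (title, authors, venue, link, tags) is a placeholder, not a file from these patches.

```yaml
---
layout: publication
title: "Add title here"
authors: FirstName LastName, FirstName LastName
journal: AbbreviatedNameOfJournal  # used in place of `conference:`
year: 2000
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/xxxx.xxxxxx"}
tags: ["tag1", "tag2"]
---
Abstract here
```

Because the layout falls back via Liquid's `default` filter, existing entries that only define `conference:` continue to render exactly as before.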