
Commit 146465e

Merge branch 'ml4code:source' into source

2 parents: 68ce931 + f36a0e8


48 files changed: +461, -26 lines

_layouts/publication.html

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 
 <div class="page">
   <h1 class="page-title">{{ page.title }}</h1>
-  <h5>{{ page.authors }}. {{ page.conference }} {{ page.year }}</h5>
+  <h5>{{ page.authors }}. {{ page.conference | default: page.journal }} {{ page.year }}</h5>
 <p>
 {% for additional_link in page.additional_links %}
   [<a href="{{ additional_link.url }}" target="_blank">{{ additional_link.name }}</a>]
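
For context: Liquid's default filter substitutes its argument when its input is nil or empty, so publication entries that set journal: instead of conference: (as the Barchi journal papers added below do) still get a venue rendered. A minimal Python sketch of the same fallback logic, purely illustrative (the venue helper and page dict are stand-ins, not part of the repo):

    def venue(page: dict) -> str:
        # Mirrors {{ page.conference | default: page.journal }}: fall back
        # to the journal when no conference is set.
        return page.get("conference") or page.get("journal") or ""

    assert venue({"conference": "NeurIPS"}) == "NeurIPS"
    assert venue({"journal": "Engineering Applications of Artificial Intelligence"}) == "Engineering Applications of Artificial Intelligence"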

_publications/add_from_arxiv.py

File mode changed: 100644 → 100755
Lines changed: 6 additions & 5 deletions

@@ -1,4 +1,4 @@
-#!/bin/python3
+#! /usr/bin/env python3
 
 import argparse
 import arxiv
@@ -8,7 +8,7 @@
 
 
 def _first_non_stopword(title: str) -> str:
-    for word in re.split("\W", title.lower()):
+    for word in re.split(r"\W", title.lower()):
         if word in ("a", "an", "the", "is", "are", "what", "who", "your"):
             continue
         return word
@@ -20,8 +20,9 @@ def _author_lastname(author_name: str) -> str:
 
 
 def get_info(paper_id: str, out_dir: str) -> None:
+    client = arxiv.Client()
     search = arxiv.Search(id_list=[paper_id])
-    paper = next(search.results())
+    paper = next(client.results(search))
 
     summary = (
         paper.summary.replace("\n\n", "@@--@@")
@@ -30,15 +31,15 @@ def get_info(paper_id: str, out_dir: str) -> None:
     )
 
     tmpl = textwrap.dedent(
-        f"""
+        f"""\
        ---
        layout: publication
        title: "{paper.title}"
        authors: {", ".join(a.name for a in paper.authors)}
        conference:
        year: {paper.published.year}
        additional_links:
-        - {{name: "ArXiV", url: "https://arxiv.org/abs/%7Bpaper_id%7D"}}
+        - {{name: "ArXiV", url: "https://arxiv.org/abs/%7Bpaper_id%7D"}}
        tags: ["TODO"]
        ---
        {summary}
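
A few notes on these changes. The mode flip to 100755 together with the env-based shebang makes the script directly executable as ./add_from_arxiv.py regardless of where python3 is installed; the raw string r"\W" avoids Python's invalid-escape warning for "\W" (a SyntaxWarning as of Python 3.12); and the results() change tracks the newer arxiv package API, which deprecates Search.results() in favor of Client.results(search). A minimal sketch of the migrated fetch pattern, assuming a current arxiv package (the paper id is only an example):

    import arxiv

    client = arxiv.Client()
    search = arxiv.Search(id_list=["2306.10763"])
    paper = next(client.results(search))  # same Result object as before
    print(paper.title, paper.published.year)

Adding the backslash after the opening f""" also matters: without it the template literal begins with a newline, which textwrap.dedent keeps, so every generated file would start with a blank line ahead of its YAML front matter. A standalone illustration:

    import textwrap

    with_newline = textwrap.dedent("""
        ---
        layout: publication
        """)
    without_newline = textwrap.dedent("""\
        ---
        layout: publication
        """)
    assert with_newline.startswith("\n")      # stray leading blank line
    assert without_newline.startswith("---")  # front matter starts immediately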
Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@
+---
+layout: publication
+title: Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context
+authors: Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, Sriram Rajamani
+conference: NeurIPS
+year: 2023
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/abs/2306.10763"}
+- {name: "NeurIPS website", url: "https://neurips.cc/virtual/2023/poster/70362"}
+- {name: "code", url: "https://github.com/microsoft/monitors4codegen"}
+tags: ["autocomplete", "benchmark", "code completion", "code generation", "compilation", "completion", "dataset", "evaluation", "language model", "large language models", "program analysis", "static analysis", "tool"]
+---
+Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating.
+
+Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model.
+
+We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.
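
The decoding mechanism the abstract describes, masking out next tokens that the static-analysis monitor rejects, can be pictured with a deliberately tiny, hypothetical sketch (this is not the paper's implementation, which lives at https://github.com/microsoft/monitors4codegen):

    def monitored_step(logits: list[float], allowed_token_ids: set[int]) -> int:
        # Greedy decoding restricted to the tokens a (hypothetical) monitor
        # permits, e.g. identifiers that keep an object dereference
        # type-consistent in the current repository context.
        return max(allowed_token_ids, key=lambda i: logits[i])

    # Toy usage: a 6-token vocabulary; the monitor permits only tokens 1 and 4.
    scores = [2.0, 1.5, 3.0, 0.2, 1.0, 2.5]
    assert monitored_step(scores, {1, 4}) == 1  # best-scoring permitted token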
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Studying LLM Performance on Closed- and Open-source Data"
+authors: Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty
+conference:
+year: 2024
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/abs/2402.15100"}
+tags: ["Transformers"]
+---
+Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS --> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning.
Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+layout: publication
+title: "Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR"
+authors: Francesco Barchi, Gianvito Urgese, Enrico Macii, Andrea Acquaviva
+conference: DAC
+year: 2019
+additional_links:
+- {name: "ACM", url: "https://dl.acm.org/doi/10.1145/3316781.3317789"}
+- {name: "code", url: "https://gitlab.com/ecs-lab/deepllvm"}
+tags: ["optimization", "program analysis", "static analysis", "natural language processing"]
+---
+Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed for supporting mapping decisions through the analysis of the program source code represented in LLVM assembly language (IR) for exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled in LLVM-IR and processed with our tokenizer capable of filtering less-informative tokens. We tested the network that reaches an accuracy of 85% in distinguishing the best computational unit.
Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+---
+layout: publication
+title: "Exploration of Convolutional Neural Network models for source code classification"
+authors: Francesco Barchi, Emanuele Parisi, Gianvito Urgese, Elisa Ficarra, Andrea Acquaviva
+journal: Engineering Applications of Artificial Intelligence
+year: 2021
+additional_links:
+- {name: "ScienceDirect", url: "https://www.sciencedirect.com/science/article/pii/S0952197620303353"}
+- {name: "code", url: "https://gitlab.com/ecs-lab/deepllvm"}
+tags: ["optimization", "static analysis", "program analysis", "language model"]
+---
+The application of Artificial Intelligence is becoming common in many engineering fields. Among them, one of the newest and rapidly evolving is software generation, where AI can be used to automatically optimise the implementation of an algorithm for a given computing platform. In particular, Deep Learning technologies can be used to decide how to allocate pieces of code to hardware platforms with multiple cores and accelerators, which are common in high performance and edge computing applications. In this work, we explore the use of Convolutional Neural Networks (CNNs) to analyse the application source code and decide the best compute unit to minimise the execution time. We demonstrate that CNN models can be successfully applied to source code classification, providing higher accuracy with consistently reduced learning time with respect to state-of-the-art methods. Moreover, we show the robustness of the method with respect to source code pre-processing, compiler options and hyper-parameters selection.
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities"
+authors: Francesco Barchi, Emanuele Parisi, Andrea Bartolini, Andrea Acquaviva
+journal: Journal of Low Power Electronics and Applications
+year: 2022
+additional_links:
+- {name: "MDPI", url: "https://www.mdpi.com/2079-9268/12/3/37"}
+tags: ["optimization", "review"]
+---
+To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the most challenging and specific problems is efficiently allocating computational kernels to available hardware resources. In this field, deep learning applied to source code can be a key enabler to face this complexity. However, due to the rapid development of such techniques, it is not easy to understand which of those are suitable and most promising for this class of systems. For this purpose, we discuss recent developments in deep learning for source code analysis, and focus on techniques for kernel mapping on heterogeneous platforms, highlighting recent results, challenges and opportunities for their applications to cyber-physical systems.
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models"
+authors: Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev
+conference:
+year: 2024
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/abs/2402.13291"}
+tags: ["repair", "vulnerability"]
+---
+The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving the code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix.
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair"
+authors: Islem Bouzenia, Premkumar Devanbu, Michael Pradel
+conference:
+year: 2024
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/abs/2403.17134"}
+tags: ["repair"]
+---
+Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks"
+authors: Beatrice Casey, Joanna C. S. Santos, George Perry
+conference:
+year: 2024
+additional_links:
+- {name: "ArXiV", url: "https://arxiv.org/abs/2403.10646"}
+tags: ["survey", "cybersecurity", "vulnerability"]
+---
+Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what's not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.
