Commit 2bd610a (1 parent: a455e9c)

Add some related publications.

7 files changed: +125 −0 lines changed
Lines changed: 24 additions & 0 deletions

---
layout: publication
title: "Path-Based Function Embedding and its Application to Specification Mining"
authors: D. DeFreez, A. V. Thakur, C. Rubio-González
conference: ICSE
year: 2018
bibkey: defreez2018path
---
Identifying the relationships among program elements is useful for program understanding, debugging, and analysis. One such relationship is synonymy. Function synonyms are functions that play a similar role in code, e.g. functions that perform initialization for different device drivers, or functions that implement different symmetric-key encryption schemes. Function synonyms are not necessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or functional equivalence cannot be used to identify them. This paper presents `func2vec`, an algorithm that maps each function to a vector in a vector space such that function synonyms are grouped together. We compute the function embedding by training a neural network on sentences generated from random walks over an encoding of the program as a labeled pushdown system (ℓ-PDS). We demonstrate that `func2vec` is effective at identifying function synonyms in the Linux kernel. Furthermore, we show how function synonyms enable mining error-handling specifications with high support in Linux file systems and drivers.
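The random-walk step of this pipeline can be sketched in a few lines. Everything below is illustrative, not from the paper: the toy call graph, function names, and walk parameters are made up, and the real system walks an ℓ-PDS encoding with interprocedural paths rather than a plain graph.

```python
import random

# Toy static call graph standing in for the l-PDS encoding.
# All function names and edges here are illustrative.
GRAPH = {
    "ath5k_init":   ["alloc_dev", "register_irq"],
    "rtl_init":     ["alloc_dev", "register_irq"],
    "alloc_dev":    ["kmalloc"],
    "register_irq": ["request_irq"],
    "kmalloc":      [],
    "request_irq":  [],
}

def random_walks(graph, walks_per_node=10, walk_len=5, seed=0):
    """Generate 'sentences' of function names via random walks."""
    rng = random.Random(seed)
    sentences = []
    for start in graph:
        for _ in range(walks_per_node):
            node, walk = start, [start]
            for _ in range(walk_len - 1):
                succs = graph[node]
                if not succs:
                    break
                node = rng.choice(succs)
                walk.append(node)
            sentences.append(walk)
    return sentences

sentences = random_walks(GRAPH)
# Each sentence is a path through the graph. A word2vec-style model
# trained on these sentences would place ath5k_init and rtl_init
# (function synonyms) near each other, because they occur in
# near-identical walk contexts.
```

Feeding such sentences to an off-the-shelf skip-gram trainer is the embedding step; the grouping of synonyms then falls out of nearest-neighbor queries in the resulting vector space.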

_publications/gu2018deep.markdown

Lines changed: 14 additions & 0 deletions

---
layout: publication
title: "Deep code search"
authors: X. Gu, H. Zhang, S. Kim
conference: ICSE
year: 2018
bibkey: gu2018deep
---
To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.
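Once code and queries live in a shared vector space, the retrieval step reduces to nearest-neighbor search by cosine similarity. A minimal sketch, assuming trained encoders already exist: the snippets and low-dimensional vectors below are made up, whereas the real model produces high-dimensional embeddings from neural code/description encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy joint embedding space; in CODEnn these vectors come from the
# trained code and description encoders. Snippets and coordinates
# here are hypothetical.
code_vectors = {
    "read file to string": [0.9, 0.1, 0.0],
    "parse json response": [0.1, 0.9, 0.1],
    "open tcp socket":     [0.0, 0.2, 0.9],
}

def search(query_vec, index, top_k=1):
    """Return the top_k snippets closest to the query embedding."""
    ranked = sorted(index, key=lambda k: cosine(query_vec, index[k]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embedding of "how do I read a file into a string".
query = [0.8, 0.2, 0.1]
print(search(query, code_vectors))  # → ['read file to string']
```

The point of the joint embedding is that this lookup works even when the query and the snippet share no tokens, which is exactly where textual-similarity retrieval fails.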
Lines changed: 18 additions & 0 deletions

---
layout: publication
title: "Exploring the Naturalness of Buggy Code with Recurrent Neural Network"
authors: J. Lanchantin, J. Gao
conference:
year: 2018
bibkey: lanchantin2018exploring
---
Statistical language models are powerful tools which have been used for many tasks within natural language processing. Recently, they have been used for other sequential data such as source code. (Ray et al., 2015) showed that it is possible to train an n-gram source code language model, and use it to predict buggy lines in code by determining “unnatural” lines via entropy with respect to the language model. In this work, we propose using a more advanced language modeling technique, Long Short-Term Memory recurrent neural networks, to model source code and classify buggy lines based on entropy. We show that our method slightly outperforms an n-gram model in the buggy line classification task using AUC.
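The entropy-based scoring idea can be illustrated with a toy bigram model. The corpus, tokenization, and add-one smoothing below are illustrative stand-ins: Ray et al. use cache-augmented n-gram models over real projects, and this paper replaces the n-gram model with an LSTM.

```python
import math
from collections import Counter

# Tiny token-level corpus standing in for a real codebase.
corpus = [
    ["if", "(", "ptr", "==", "NULL", ")", "return", ";"],
    ["if", "(", "err", "<", "0", ")", "return", "err", ";"],
    ["ptr", "=", "malloc", "(", "size", ")", ";"],
]
unigrams = Counter(t for line in corpus for t in line)
bigrams = Counter()
for line in corpus:
    for a, b in zip(line, line[1:]):
        bigrams[(a, b)] += 1
vocab = len(unigrams)

def line_entropy(tokens):
    """Average negative log2 probability per bigram (add-one smoothed)."""
    logp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
        logp += -math.log2(p)
    return logp / max(len(tokens) - 1, 1)

natural = line_entropy(["if", "(", "ptr", "==", "NULL", ")", "return", ";"])
unnatural = line_entropy(["return", "malloc", "NULL", "if", "=="])
# The second line's bigrams never occur in the corpus, so its entropy
# is higher: high entropy flags a line as "unnatural", hence a bug
# candidate under this scoring scheme.
```

An LSTM replaces the bigram probability table with a learned conditional distribution over the next token, but the per-line entropy ranking works the same way.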
Lines changed: 11 additions & 0 deletions

---
layout: publication
title: "Deep Learning to Detect Redundant Method Comments"
authors: A. Louis, S. K. Dash, E. T. Barr, C. Sutton
conference:
year: 2018
bibkey: louis2018deep
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1806.04616"}
---
Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment's natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments.
Lines changed: 17 additions & 0 deletions

---
layout: publication
title: "Building Language Models for Text with Named Entities"
authors: M.R. Parvez, S. Chakraborty, B. Ray, KW Chang
conference: ACL
year: 2018
bibkey: parvez2018building
---
Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging for a language model as they appear less frequently in the training corpus. In this paper, we propose a novel and effective approach to building a discriminative language model which can learn the entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming code, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 22.06% on code generation than the state-of-the-art language models.
Lines changed: 29 additions & 0 deletions

---
layout: publication
title: "Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities"
authors: M. White, M. Tufano, M. Martínez, M. Monperrus, D. Poshyvanyk
conference:
year: 2017
bibkey: white2017sorting
---
In the field of automated program repair, the redundancy assumption claims large programs contain the seeds of their own repair. However, most redundancy-based program repair techniques do not reason about the repair ingredients—the code that is reused to craft a patch. We aim to reason about the repair ingredients by using code similarities to prioritize and transform statements in a codebase for patch generation. Our approach, DeepRepair, relies on deep learning to reason about code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity to suspicious elements (i.e., code elements that contain suspicious statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined these new search strategies for patch generation with respect to effectiveness from the viewpoint of a software maintainer. Our comparative experiments were executed on six open-source Java projects including 374 buggy program revisions and consisted of 19,949 trials spanning 2,616 days of computation time. DeepRepair's search strategy using code similarities generally found compilable ingredients faster than the baseline, jGenProg, but this improvement neither yielded test-adequate patches in fewer attempts (on average) nor found significantly more patches than the baseline. Although the patch counts were not statistically different, there were notable differences between the nature of DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot be found by existing redundancy-based repair techniques.
Lines changed: 12 additions & 0 deletions

---
layout: publication
title: "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow"
authors: P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig
conference: MSR
year: 2018
bibkey: yin2018mining
additional_links:
  - {name: "data", url: "https://conala-corpus.github.io/"}
---
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
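The two feature families feeding the classifier can be approximated crudely. The features, helper names, and threshold below are hypothetical stand-ins: the paper uses richer hand-crafted structural features and a trained neural correspondence model, not this token-overlap proxy.

```python
import ast
import re

def features(nl, code):
    """Two toy quality signals for a candidate NL-code pair.

    Hypothetical stand-ins for the paper's hand-crafted structural
    features and neural correspondence features.
    """
    try:
        # Structural signal: does the snippet parse as standalone Python?
        ast.parse(code)
        parses = 1.0
    except SyntaxError:
        parses = 0.0
    # Correspondence proxy: fraction of NL tokens echoed in the code.
    nl_toks = set(re.findall(r"\w+", nl.lower()))
    code_toks = set(re.findall(r"\w+", code.lower()))
    overlap = len(nl_toks & code_toks) / max(len(nl_toks), 1)
    return parses, overlap

def looks_aligned(nl, code, min_overlap=0.2):
    """Threshold rule standing in for the trained quality classifier."""
    parses, overlap = features(nl, code)
    return parses == 1.0 and overlap >= min_overlap

good = looks_aligned("sort the items in reverse order",
                     "result = sorted(items, reverse=True)")
bad = looks_aligned("sort the items in reverse order",
                    "import os; os.path.join(")  # truncated snippet
```

In the paper these signals are inputs to a learned classifier rather than a fixed threshold, which is what lets the method transfer across programming languages.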
