Commit 2bd610a (1 parent: a455e9c)

Add some related publications.

7 files changed: +125 −0 lines changed
Lines changed: 24 additions & 0 deletions

---
layout: publication
title: "Path-Based Function Embedding and its Application to Specification Mining"
authors: D. DeFreez, A. V. Thakur, C. Rubio-González
conference: ICSE
year: 2018
bibkey: defreez2018path
---
Identifying the relationships among program elements is useful for program understanding, debugging, and analysis. One such relationship is synonymy. Function synonyms are functions that play a similar role in code, e.g. functions that perform initialization for different device drivers, or functions that implement different symmetric-key encryption schemes. Function synonyms are not necessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or functional equivalence cannot be used to identify them. This paper presents `func2vec`, an algorithm that maps each function to a vector in a vector space such that function synonyms are grouped together. We compute the function embedding by training a neural network on sentences generated from random walks over an encoding of the program as a labeled pushdown system (ℓ-PDS). We demonstrate that `func2vec` is effective at identifying function synonyms in the Linux kernel. Furthermore, we show how function synonyms enable mining error-handling specifications with high support in Linux file systems and drivers.
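The random-walk step of this pipeline can be sketched in a few lines. Everything below is illustrative, not from the paper: the toy call graph, function names, and walk parameters are made up, and the real system walks an ℓ-PDS encoding with interprocedural paths rather than a plain graph.

```python
import random

# Toy static call graph standing in for the l-PDS encoding.
# All function names and edges here are illustrative.
GRAPH = {
    "ath5k_init":   ["alloc_dev", "register_irq"],
    "rtl_init":     ["alloc_dev", "register_irq"],
    "alloc_dev":    ["kmalloc"],
    "register_irq": ["request_irq"],
    "kmalloc":      [],
    "request_irq":  [],
}

def random_walks(graph, walks_per_node=10, walk_len=5, seed=0):
    """Generate 'sentences' of function names via random walks."""
    rng = random.Random(seed)
    sentences = []
    for start in graph:
        for _ in range(walks_per_node):
            node, walk = start, [start]
            for _ in range(walk_len - 1):
                succs = graph[node]
                if not succs:
                    break
                node = rng.choice(succs)
                walk.append(node)
            sentences.append(walk)
    return sentences

sentences = random_walks(GRAPH)
# Each sentence is a path through the graph. A word2vec-style model
# trained on these sentences would place ath5k_init and rtl_init
# (function synonyms) near each other, because they occur in
# near-identical walk contexts.
```

Feeding such sentences to an off-the-shelf skip-gram trainer is the embedding step; the grouping of synonyms then falls out of nearest-neighbor queries in the resulting vector space.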

_publications/gu2018deep.markdown

Lines changed: 14 additions & 0 deletions

---
layout: publication
title: "Deep code search"
authors: X. Gu, H. Zhang, S. Kim
conference: ICSE
year: 2018
bibkey: gu2018deep
---
To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.
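Once code and queries live in a shared vector space, the retrieval step reduces to nearest-neighbor search by cosine similarity. A minimal sketch, assuming trained encoders already exist: the snippets and low-dimensional vectors below are made up, whereas the real model produces high-dimensional embeddings from neural code/description encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy joint embedding space; in CODEnn these vectors come from the
# trained code and description encoders. Snippets and coordinates
# here are hypothetical.
code_vectors = {
    "read file to string": [0.9, 0.1, 0.0],
    "parse json response": [0.1, 0.9, 0.1],
    "open tcp socket":     [0.0, 0.2, 0.9],
}

def search(query_vec, index, top_k=1):
    """Return the top_k snippets closest to the query embedding."""
    ranked = sorted(index, key=lambda k: cosine(query_vec, index[k]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical embedding of "how do I read a file into a string".
query = [0.8, 0.2, 0.1]
print(search(query, code_vectors))  # → ['read file to string']
```

The point of the joint embedding is that this lookup works even when the query and the snippet share no tokens, which is exactly where textual-similarity retrieval fails.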
Lines changed: 18 additions & 0 deletions

---
layout: publication
title: "Exploring the Naturalness of Buggy Code with Recurrent Neural Network"
authors: J. Lanchantin, J. Gao
conference:
year: 2018
bibkey: lanchantin2018exploring
---
Statistical language models are powerful tools which have been used for many tasks within natural language processing. Recently, they have been used for other sequential data such as source code. (Ray et al., 2015) showed that it is possible to train an n-gram source code language model, and use it to predict buggy lines in code by determining “unnatural” lines via entropy with respect to the language model. In this work, we propose using a more advanced language modeling technique, Long Short-Term Memory recurrent neural networks, to model source code and classify buggy lines based on entropy. We show that our method slightly outperforms an n-gram model in the buggy line classification task using AUC.
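The entropy-based scoring idea can be illustrated with a toy bigram model. The corpus, tokenization, and add-one smoothing below are illustrative stand-ins: Ray et al. use cache-augmented n-gram models over real projects, and this paper replaces the n-gram model with an LSTM.

```python
import math
from collections import Counter

# Tiny token-level corpus standing in for a real codebase.
corpus = [
    ["if", "(", "ptr", "==", "NULL", ")", "return", ";"],
    ["if", "(", "err", "<", "0", ")", "return", "err", ";"],
    ["ptr", "=", "malloc", "(", "size", ")", ";"],
]
unigrams = Counter(t for line in corpus for t in line)
bigrams = Counter()
for line in corpus:
    for a, b in zip(line, line[1:]):
        bigrams[(a, b)] += 1
vocab = len(unigrams)

def line_entropy(tokens):
    """Average negative log2 probability per bigram (add-one smoothed)."""
    logp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + vocab)
        logp += -math.log2(p)
    return logp / max(len(tokens) - 1, 1)

natural = line_entropy(["if", "(", "ptr", "==", "NULL", ")", "return", ";"])
unnatural = line_entropy(["return", "malloc", "NULL", "if", "=="])
# The second line's bigrams never occur in the corpus, so its entropy
# is higher: high entropy flags a line as "unnatural", hence a bug
# candidate under this scoring scheme.
```

An LSTM replaces the bigram probability table with a learned conditional distribution over the next token, but the per-line entropy ranking works the same way.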
Lines changed: 11 additions & 0 deletions

---
layout: publication
title: "Deep Learning to Detect Redundant Method Comments"
authors: A. Louis, S. K. Dash, E. T. Barr, C. Sutton
conference:
year: 2018
bibkey: louis2018deep
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1806.04616"}
---
Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment's natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments.
Lines changed: 17 additions & 0 deletions

---
layout: publication
title: "Building Language Models for Text with Named Entities"
authors: M.R. Parvez, S. Chakraborty, B. Ray, KW Chang
conference: ACL
year: 2018
bibkey: parvez2018building
---
Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging for a language model as they appear less frequently in the training corpus. In this paper, we propose a novel and effective approach to building a discriminative language model which can learn the entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming code, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 22.06% on code generation than the state-of-the-art language models.
Lines changed: 29 additions & 0 deletions

---
layout: publication
title: "Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities"
authors: M. White, M. Tufano, M. Martínez, M. Monperrus, D. Poshyvanyk
conference:
year: 2017
bibkey: white2017sorting
---
In the field of automated program repair, the redundancy assumption claims large programs contain the seeds of their own repair. However, most redundancy-based program repair techniques do not reason about the repair ingredients—the code that is reused to craft a patch. We aim to reason about the repair ingredients by using code similarities to prioritize and transform statements in a codebase for patch generation. Our approach, DeepRepair, relies on deep learning to reason about code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity to suspicious elements (i.e., code elements that contain suspicious statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined these new search strategies for patch generation with respect to effectiveness from the viewpoint of a software maintainer. Our comparative experiments were executed on six open-source Java projects including 374 buggy program revisions and consisted of 19,949 trials spanning 2,616 days of computation time. DeepRepair's search strategy using code similarities generally found compilable ingredients faster than the baseline, jGenProg, but this improvement neither yielded test-adequate patches in fewer attempts (on average) nor found significantly more patches than the baseline. Although the patch counts were not statistically different, there were notable differences between the nature of DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot be found by existing redundancy-based repair techniques.
Lines changed: 12 additions & 0 deletions

---
layout: publication
title: "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow"
authors: P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig
conference: MSR
year: 2018
bibkey: yin2018mining
additional_links:
  - {name: "data", url: "https://conala-corpus.github.io/"}
---
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
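The two feature families feeding the classifier can be approximated crudely. The features, helper names, and threshold below are hypothetical stand-ins: the paper uses richer hand-crafted structural features and a trained neural correspondence model, not this token-overlap proxy.

```python
import ast
import re

def features(nl, code):
    """Two toy quality signals for a candidate NL-code pair.

    Hypothetical stand-ins for the paper's hand-crafted structural
    features and neural correspondence features.
    """
    try:
        # Structural signal: does the snippet parse as standalone Python?
        ast.parse(code)
        parses = 1.0
    except SyntaxError:
        parses = 0.0
    # Correspondence proxy: fraction of NL tokens echoed in the code.
    nl_toks = set(re.findall(r"\w+", nl.lower()))
    code_toks = set(re.findall(r"\w+", code.lower()))
    overlap = len(nl_toks & code_toks) / max(len(nl_toks), 1)
    return parses, overlap

def looks_aligned(nl, code, min_overlap=0.2):
    """Threshold rule standing in for the trained quality classifier."""
    parses, overlap = features(nl, code)
    return parses == 1.0 and overlap >= min_overlap

good = looks_aligned("sort the items in reverse order",
                     "result = sorted(items, reverse=True)")
bad = looks_aligned("sort the items in reverse order",
                    "import os; os.path.join(")  # truncated snippet
```

In the paper these signals are inputs to a learned classifier rather than a fixed threshold, which is what lets the method transfer across programming languages.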
