
Commit 4f2dab3

Miltos Allamanis committed
Add bunch of papers.
1 parent 15b4b36 commit 4f2dab3

File tree

4 files changed: 48 additions & 0 deletions
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "Learning Lenient Parsing & Typing via Indirect Supervision"
authors: T. Ahmed, V. Hellendoorn, P. Devanbu
conference:
year: 2019
bibkey: ahmed2019learning
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1910.05879"}
tags: ["types"]
---
Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common on StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or the students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this would make it more amenable to use within IDEs and allow early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse & type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of fragments of code and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity transformer models to overcome both fragmentation and corruption. With this novel approach, we can achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors.
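
To make the error-seeding step described above concrete, here is a minimal Python sketch of how (corrupted input, correct target) training pairs might be generated from well-formed fragments. The corruption rules, function names, and fragments are illustrative assumptions, not the paper's actual procedure.

```python
# Illustrative sketch (not the paper's code): seed simple StackOverflow/student-style
# errors into correct fragments, keeping the original fragment as the training target.
import random

def corrupt(fragment: str, rng: random.Random) -> str:
    """Return a corrupted copy of a (token-separated) code fragment."""
    tokens = fragment.split()
    if len(tokens) < 2:
        return fragment
    choice = rng.choice(["drop_token", "drop_braces", "swap_adjacent"])
    if choice == "drop_token":
        del tokens[rng.randrange(len(tokens))]
    elif choice == "drop_braces":
        tokens = [t for t in tokens if t not in ("{", "}")] or tokens
    else:  # swap two adjacent tokens
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

def make_training_pairs(fragments, seed=0):
    """Pair each randomly corrupted fragment with its uncorrupted original."""
    rng = random.Random(seed)
    return [(corrupt(f, rng), f) for f in fragments]

if __name__ == "__main__":
    frags = ["int x = foo ( y ) ;", "if ( a > b ) { return a ; }"]
    for noisy, clean in make_training_pairs(frags):
        print(noisy, "->", clean)
```

In the actual approach the target would be tree fragments and type annotations rather than the original text; the uncorrupted fragment merely stands in for that supervision signal here.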
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "Deep Transfer Learning for Source Code Modeling"
authors: Y. Hussain, Z. Huang, Y. Zhou, S. Wang
conference:
year: 2019
bibkey: hussain2019deep
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1910.05493"}
tags: ["pretraining"]
---
In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry. A challenging issue with these approaches is that they require training from scratch for each new, related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning can transfer the knowledge learned in solving one problem to another related problem. First, we present two recurrent neural network-based models (RNN and GRU) for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained (RNN and GRU) models are used as feature extractors. Then, the extracted features are combined by an attention learner for different downstream tasks. The attention learner leverages the learned knowledge of the pre-trained models and fine-tunes them for a specific downstream task. We evaluate the proposed approach with extensive experiments on the source code suggestion task. The results indicate that the proposed approach outperforms state-of-the-art models in terms of accuracy, precision, recall, and F-measure, without training the models from scratch.
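
As a rough illustration of the pre-train / fine-tune recipe sketched above, the following PyTorch snippet freezes a (hypothetically pre-trained) GRU encoder and trains only a small attention head on top for a downstream task such as code suggestion. The layer sizes, the attention form, and all names are assumptions for illustration, not the authors' exact architecture.

```python
# Illustrative sketch: reuse a pre-trained GRU as a frozen feature extractor
# and fine-tune only a lightweight attention "learner" for the downstream task.
import torch
import torch.nn as nn

class PretrainedGRUEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, tokens):                        # tokens: (batch, seq)
        return self.gru(self.embed(tokens))[0]        # (batch, seq, hidden)

class AttentionHead(nn.Module):
    """Attention learner fine-tuned on top of frozen pre-trained features."""
    def __init__(self, hidden=256, vocab_size=10000):
        super().__init__()
        self.score = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats):                         # feats: (batch, seq, hidden)
        weights = torch.softmax(self.score(feats), dim=1)
        context = (weights * feats).sum(dim=1)        # (batch, hidden)
        return self.out(context)                      # next-token logits

encoder = PretrainedGRUEncoder()        # in practice, loaded from pre-training
for p in encoder.parameters():          # freeze: use purely as a feature extractor
    p.requires_grad = False
head = AttentionHead()
logits = head(encoder(torch.randint(0, 10000, (4, 20))))
print(logits.shape)                     # torch.Size([4, 10000])
```

During fine-tuning, only `head.parameters()` would be passed to the optimizer, so the downstream task reuses the pre-trained features rather than training from scratch.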
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "Evaluating Semantic Representations of Source Code"
authors: Y. Wainakh, M. Rauf, M. Pradel
conference:
year: 2019
bibkey: wainakh2019evaluating
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1910.05177"}
tags: ["representation"]
---
Learned representations of source code enable various software developer tools, e.g., to detect bugs or to predict program properties. At the core of code representations often are word embeddings of identifier names in source code, because identifiers account for the majority of source code vocabulary and convey important semantic information. Unfortunately, there currently is no generally accepted way of evaluating the quality of word embeddings of identifiers, and current evaluations are biased toward specific downstream tasks. This paper presents IdBench, the first benchmark for evaluating to what extent word embeddings of identifiers represent semantic relatedness and similarity. The benchmark is based on thousands of ratings gathered by surveying 500 software developers. We use IdBench to evaluate state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions, as these are often used in current developer tools. Our results show that the effectiveness of embeddings varies significantly across different embedding techniques and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing embedding provides a satisfactory representation of semantic similarities, e.g., because embeddings consider identifiers with opposing meanings as similar, which may lead to fatal mistakes in downstream developer tools. IdBench provides a gold standard to guide the development of novel embeddings that address the current limitations.
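
As an illustration of the kind of evaluation such a benchmark supports, the sketch below scores identifier pairs by embedding cosine similarity and correlates those scores with developer ratings via Spearman's rank correlation. The embeddings and ratings are toy placeholders, not IdBench data.

```python
# Illustrative sketch: compare model similarity scores for identifier pairs
# against human relatedness ratings using Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embedding table: identifier -> vector (e.g. from FastText or a code-specific model).
emb = {
    "fileName": np.array([0.9, 0.1, 0.2]),
    "filePath": np.array([0.8, 0.2, 0.1]),
    "maxValue": np.array([0.1, 0.9, 0.3]),
    "minValue": np.array([0.2, 0.8, 0.4]),
}

# Toy developer ratings of relatedness for identifier pairs (scale 0..1).
pairs = [("fileName", "filePath", 0.90), ("maxValue", "minValue", 0.70),
         ("fileName", "maxValue", 0.10), ("filePath", "minValue", 0.15)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [r for _, _, r in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with developer ratings: {rho:.2f}")
```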

_publications/wei2019code.markdown

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
layout: publication
title: "Code Generation as a Dual Task of Code Summarization"
authors: B. Wei, G. Li, X. Xia, Z. Fu, Z. Jin
conference: NeurIPS
year: 2019
bibkey: wei2019code
additional_links:
  - {name: "ArXiV", url: "https://arxiv.org/abs/1910.05923"}
tags: ["generation", "summarization"]
---
Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches have been proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG which has not been exploited in previous work. In this paper, we apply the relation between the two tasks to improve the performance of both. Specifically, exploiting the duality between the two tasks, we propose a dual training framework that trains them simultaneously. In this framework, we consider the dualities of probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework improves the performance of both the CS and CG tasks over baselines.
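
A minimal sketch of what a probability-duality regularizer of this kind could look like is shown below: the two tasks' losses are combined with a penalty on the gap log P(code) + log P(summary|code) − log P(summary) − log P(code|summary). The function signature, the marginal language-model terms, and `lambda_dual` are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative dual-training objective (not the authors' code): joint CS + CG loss
# plus a regularizer that penalizes violations of the probabilistic duality.
import torch

def dual_loss(logp_summary_given_code,   # per-example log-probs from the CS model
              logp_code_given_summary,   # per-example log-probs from the CG model
              logp_code, logp_summary,   # per-example log-probs from marginal LMs
              lambda_dual=0.1):
    ce_cs = -logp_summary_given_code.mean()        # code summarization loss
    ce_cg = -logp_code_given_summary.mean()        # code generation loss
    gap = (logp_code + logp_summary_given_code
           - logp_summary - logp_code_given_summary)
    duality_reg = (gap ** 2).mean()                # how far the two factorizations disagree
    return ce_cs + ce_cg + lambda_dual * duality_reg

# Toy batch of per-example log-probabilities.
batch = 8
loss = dual_loss(torch.randn(batch) - 5, torch.randn(batch) - 5,
                 torch.randn(batch) - 3, torch.randn(batch) - 3)
print(float(loss))
```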
