Commit 5b4bc67

Author: Miltos Allamanis
Add pubs
1 parent b5fb382

2 files changed, 35 additions and 0 deletions

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+---
+layout: publication
+title: "Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection"
+authors: L. Bush, A. Andrzejak
+conference: SANER
+year: 2019
+bibkey: bush2019learning
+additional_links:
+- {name: "TR", url: "https://pvs.ifi.uni-heidelberg.de/fileadmin/papers/2019/Buech-Andrzejak-SANER2019.pdf"}
+---
+Code clone detection remains a crucial challenge in
+maintaining software projects. Many classic approaches rely on
+handcrafted aggregation schemes, while recent work uses supervised or unsupervised learning.
+In this work, we study several aspects of aggregation schemes for code clone detection
+based on supervised learning. To this aim, we implement an AST-based
+Recursive Neural Network. Firstly, our ablation study shows the influence of model
+choices and hyperparameters. We introduce error scaling as a way
+to effectively and efficiently address the class imbalance problem
+arising in code clone detection. Secondly, we study the influence of
+pretrained embeddings representing nodes in ASTs. We show that simply averaging all node vectors of
+a given AST yields a strong baseline aggregation scheme. Further, learned AST aggregation
+schemes greatly benefit from pretrained node embeddings. Finally, we show the importance of carefully
+separating training and test data by clone clusters, to reliably measure generalization
+of models learned with supervision.
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Import2vec - Learning Embeddings for Software Libraries"
+authors: B. Theeten, F. Vandeputte, T. Van Cutsem
+conference: MSR
+year: 2019
+bibkey: theeten2019import2vec
+---
+We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning.
+
+We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages ("library vectors"). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveal that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).
