aminmf
diff --git a/_publications/bush2019learning.markdown b/_publications/bush2019learning.markdown
diff --git a/_publications/theeten2019import2vec.markdown b/_publications/theeten2019import2vec.markdown
@@ -0,0 +1,24 @@
+---
+layout: publication
+title: "Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection"
+authors: L. Büch, A. Andrzejak
+conference: SANER
+year: 2019
+bibkey: bush2019learning
+additional_links:
+   - {name: "TR", url: "https://pvs.ifi.uni-heidelberg.de/fileadmin/papers/2019/Buech-Andrzejak-SANER2019.pdf"}
+---
+Code clone detection remains a crucial challenge in
+maintaining software projects. Many classic approaches rely on
+handcrafted aggregation schemes, while recent work uses supervised or unsupervised learning.
+In this work, we study several aspects of aggregation schemes for code clone detection
+based on supervised learning. To this aim, we implement an AST-based
+Recursive Neural Network. Firstly, our ablation study shows the influence of model
+choices and hyperparameters. We introduce error scaling as a way
+to effectively and efficiently address the class imbalance problem
+arising in code clone detection. Secondly, we study the influence of
+pretrained embeddings representing nodes in ASTs. We show that simply averaging all node vectors of
+a given AST yields a strong baseline aggregation scheme. Further, learned AST aggregation
+schemes greatly benefit from pretrained node embeddings. Finally, we show the importance of carefully
+separating training and test data by clone clusters, to reliably measure generalization
+of models learned with supervision.
@@ -0,0 +1,11 @@
+---
+layout: publication
+title: "Import2vec - Learning Embeddings for Software Libraries"
+authors: B. Theeten, F. Vandeputte, T. Van Cutsem
+conference: MSR
+year: 2019
+bibkey: theeten2019import2vec
+---
+We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning.
+
+We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages ("library vectors"). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveal that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins, and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript), and data science toolkits (in Python).