Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Showing publications 1-15 of 10,631
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable guidance on their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
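As a rough illustration of what execution-based migration evaluation involves, the sketch below checks whether a migrated project still builds and passes its tests under JDK 17. The choice of Maven, the directory layout, and all function names are assumptions for illustration, not FreshBrew's actual harness.

```python
import subprocess
from pathlib import Path

def builds_on_jdk17(repo: Path, jdk17_home: str) -> bool:
    """Return True if the (already migrated) Maven project compiles and
    its tests pass when built under JDK 17 (hypothetical success criterion)."""
    env = {"JAVA_HOME": jdk17_home, "PATH": f"{jdk17_home}/bin:/usr/bin:/bin"}
    result = subprocess.run(
        ["mvn", "-q", "-B", "verify"],  # -B: non-interactive batch mode
        cwd=repo, env=env, capture_output=True, timeout=1800,
    )
    return result.returncode == 0

def migration_success_rate(repos: list[Path], jdk17_home: str) -> float:
    """Fraction of benchmark repositories that verify cleanly after migration."""
    return sum(builds_on_jdk17(r, jdk17_home) for r in repos) / len(repos)
```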
    Global earthquake detection and warning using Android phones
    Marc Stogaitis
    Youngmin Cho
    Richard Allen
    Boone Spooner
    Patrick Robertson
    Micah Berman
    Greg Wimpey
    Robert Bosch
    Nivetha Thiruverahan
    Steve Malkos
    Alexei Barski
    Science, 389 (2025), pp. 254-259
    Earthquake early-warning systems are increasingly being deployed as a strategy to reduce losses in earthquakes, but the regional seismic networks they require do not exist in many earthquake-prone countries. We use the global Android smartphone network to develop an earthquake detection capability, an alert delivery system, and a user feedback framework. Over 3 years of operation, the system detected an average of 312 earthquakes per month with magnitudes from M 1.9 to M 7.8 in Türkiye. Alerts were delivered in 98 countries for earthquakes with M ≥4.5, corresponding to ~60 events and 18 million alerts per month. User feedback shows that 85% of people receiving an alert felt shaking, and 36, 28, and 23% received the alert before, during, and after shaking, respectively. We show how smartphone-based earthquake detection algorithms can be implemented at scale and improved through postevent analysis.
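The paper's detector itself is not reproduced here. As a hedged stand-in, the sketch below implements the classic short-term-average/long-term-average (STA/LTA) trigger from seismology on a generic accelerometer magnitude trace, a standard heuristic for the kind of on-device shaking detection the abstract describes; all parameters are illustrative.

```python
import numpy as np

def sta_lta_trigger(accel: np.ndarray, fs: float,
                    sta_s: float = 0.5, lta_s: float = 10.0,
                    threshold: float = 4.0) -> np.ndarray:
    """Classic STA/LTA trigger on an acceleration-magnitude trace.

    accel: 1-D array of acceleration magnitudes (m/s^2), gravity removed.
    fs: sample rate in Hz. Returns a boolean array marking triggered samples.
    """
    n_sta, n_lta = int(sta_s * fs), int(lta_s * fs)
    power = accel ** 2
    sta = np.convolve(power, np.ones(n_sta) / n_sta, mode="same")  # short-term average
    lta = np.convolve(power, np.ones(n_lta) / n_lta, mode="same")  # long-term average
    ratio = sta / np.maximum(lta, 1e-12)  # avoid division by zero on quiet traces
    return ratio > threshold
```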
    Improving Informally Romanized Language Identification
    Adrian Benton
    Christo Kirov
    Proceedings of EMNLP (2025) (to appear)
    The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts – Hindi and Urdu, for example – highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
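The general recipe (synthesize romanized training text with spelling variation, then train a linear classifier on character n-grams) can be sketched as below. The variation table and seed sentences are toy placeholders, not the paper's transliteration method.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder: real systems romanize native-script text via weighted
# transliteration rules; here we simply inject spelling variation directly.
VARIANTS = {"aa": ["a", "aa"], "ee": ["i", "ee"], "oo": ["u", "oo"], "w": ["w", "v"]}

def add_spelling_variation(word: str) -> str:
    for canon, options in VARIANTS.items():
        if canon in word:
            word = word.replace(canon, random.choice(options))
    return word

def synthesize(samples: list[tuple[str, str]], n_copies: int = 5):
    texts, labels = [], []
    for text, lang in samples:
        for _ in range(n_copies):
            texts.append(" ".join(add_spelling_variation(w) for w in text.split()))
            labels.append(lang)
    return texts, labels

seed = [("main kal aa raha hoon", "hin"), ("mein kal wapas aa raha hun", "urd")]
X, y = synthesize(seed)
# Linear classifier over character 1-4 gram TF-IDF features.
clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
```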
    We introduce efficient differentially private (DP) algorithms for several linear algebraic tasks, including solving linear equalities over arbitrary fields, linear inequalities over the reals, and computing affine spans and convex hulls. As an application, we obtain efficient DP algorithms for learning halfspaces and affine subspaces. Our algorithms addressing equalities are strongly polynomial, whereas those addressing inequalities are weakly polynomial. Furthermore, this distinction is inevitable: no DP algorithm for linear programming can be strongly polynomial-time efficient.
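For readers unfamiliar with the complexity distinction this abstract draws, the standard definitions (general background, not specific to this paper) are:

```latex
\begin{itemize}
  \item \textbf{Strongly polynomial:} the number of arithmetic operations is
        bounded by a polynomial in the number of input numbers $n$ alone,
        independent of their bit-length $L$, and the space used is polynomial
        in the input size.
  \item \textbf{Weakly polynomial:} the running time is polynomial in both
        $n$ and the bit-length $L$ of the input numbers.
\end{itemize}
```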
    Initially conceived as a way to explain memory sharing in romantic couples, the concept of transactive memory systems (TMS) has been adopted by organizational psychology, information management, and other fields of study to examine team performance in corporate settings. While findings highlight a clear advantage for human teams with TMS, it is not evident whether AI-human teams could also develop such a psychological dynamic. This paper considers AI-human interaction through the lens of TMS and identifies potential opportunities for improvement in this area.
    We give a new privacy amplification analysis for truncated Poisson sampling, a Poisson sampling variant that truncates a batch if it exceeds a given maximum batch size.
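A minimal sketch of the sampling primitive as the abstract describes it: Poisson-subsample each example independently, then truncate if the batch exceeds the maximum size. The uniform-random truncation rule is an assumption for illustration.

```python
import numpy as np

def truncated_poisson_sample(n: int, q: float, max_batch: int,
                             rng: np.random.Generator) -> np.ndarray:
    """Poisson sampling with truncation: each of n examples is included
    independently with probability q; if more than max_batch are selected,
    keep a uniformly random subset of size max_batch (assumed rule)."""
    selected = np.flatnonzero(rng.random(n) < q)
    if selected.size > max_batch:
        selected = rng.choice(selected, size=max_batch, replace=False)
    return selected

rng = np.random.default_rng(0)
batch = truncated_poisson_sample(n=10_000, q=0.01, max_batch=96, rng=rng)
```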
    Simulation-Based Inference: A Practical Guide
    Michael Deistler
    Jan Boelts
    Peter Steinbach
    Guy Moss
    Thomas Moreau
    Manuel Gloeckler
    Pedro L. C. Rodriguez
    Julia Linhart
    Janne K. Lappalainen
    Benjamin Kurt Miller
    Pedro J. Goncalves
    Cornelius Schröder
    Jakob H. Macke
    arXiv (2025)
    A central challenge in many areas of science and engineering is to identify model parameters that are consistent with empirical data and prior knowledge. Bayesian inference offers a principled framework for this task, but can be computationally prohibitive when models are defined by stochastic simulators. Simulation-Based Inference (SBI) provides a suite of methods to overcome this limitation and has enabled scientific discoveries in fields such as particle physics, astrophysics and neuroscience. The core idea of SBI is to train neural networks on data generated by a simulator, without requiring access to likelihood evaluations. Once trained, the neural network can rapidly perform inference on empirical observations without requiring additional optimization or simulations. In this tutorial, we provide a practical guide for practitioners aiming to apply SBI methods. We outline a structured SBI workflow and offer practical guidelines and diagnostic tools for every stage of the process: from setting up the simulator and prior, choosing the SBI method and neural network architecture, training the inference model, to validating results and interpreting the inferred parameters. We illustrate these steps through examples from astrophysics, psychophysics, and neuroscience. This tutorial empowers researchers to apply state-of-the-art SBI methods, facilitating efficient parameter inference for scientific discovery.
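To make the simulate-then-infer idea concrete, here is the simplest SBI-adjacent procedure, rejection ABC, on a toy Gaussian simulator. The neural SBI methods the tutorial covers replace the rejection step with a learned conditional density estimator, but the core loop (sample parameters from the prior, simulate, compare to the observation) is the same; the simulator and prior here are invented for illustration.

```python
import numpy as np

def simulator(theta: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Toy stochastic simulator: the observation is theta plus Gaussian noise."""
    return theta + rng.normal(scale=0.5, size=theta.shape)

def rejection_abc(x_obs: float, n_sims: int = 100_000,
                  eps: float = 0.05) -> np.ndarray:
    """Simplest simulation-based inference: keep prior draws whose simulated
    output lands within eps of the observation. Neural SBI methods replace
    this rejection step with a trained density estimator."""
    rng = np.random.default_rng(0)
    theta = rng.uniform(-3.0, 3.0, size=n_sims)   # draws from a uniform prior
    x_sim = simulator(theta, rng)
    return theta[np.abs(x_sim - x_obs) < eps]     # approximate posterior samples

posterior_samples = rejection_abc(x_obs=1.2)
```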
    A personal health large language model for sleep and fitness coaching
    Anastasiya Belyaeva
    Zhun Yang
    Nick Furlotte
    Chace Lee
    Erik Schenck
    Yojan Patel
    Jian Cui
    Logan Schneider
    Robby Bryant
    Ryan Gomes
    Allen Jiang
    Roy Lee
    Javier Perez
    Jamie Rogers
    Cathy Speed
    Shyam Tailor
    Megan Walker
    Jeffrey Yu
    Tim Althoff
    Conor Heneghan
    Mark Malhotra
    Shwetak Patel
    Shravya Shetty
    Jiening Zhan
    Daniel McDuff
    Nature Medicine (2025)
    Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to effectively contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research.
    Heterogeneous graph neural networks for species distribution modeling
    Christine Kaeser-Chen
    Keith Anderson
    Michelangelo Conserva
    Elise Kleeman
    Maxim Neumann
    Matt Overlan
    Millie Chapman
    Drew Purves
    arXiv (2025)
    Species distribution models (SDMs) are necessary for measuring and predicting occurrences and habitat suitability of species and their relationship with environmental factors. We introduce a novel presence-only SDM with graph neural networks (GNN). In our model, species and locations are treated as two distinct node sets, and the learning task is predicting detection records as the edges that connect locations to species. Using GNN for SDM allows us to model fine-grained interactions between species and the environment. We evaluate the potential of this methodology on the six-region dataset compiled by the National Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For each of the regions, the heterogeneous GNN model is comparable to or outperforms previously-benchmarked single-species SDMs as well as a feed-forward neural network baseline model.
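A deliberately simplified stand-in that keeps only the bipartite core of this idea: embed the two node sets and train edge scores on observed detections against random negatives, which reduces to matrix factorization. The paper's heterogeneous GNN additionally propagates environmental features via message passing; all sizes and data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_locations, d = 50, 200, 16

# Detection records are edges of a bipartite graph linking species to locations.
pos = np.unique(rng.integers([n_species, n_locations], size=(500, 2)), axis=0)

S = rng.normal(scale=0.1, size=(n_species, d))    # species embeddings
L = rng.normal(scale=0.1, size=(n_locations, d))  # location embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for step in range(20_000):
    s_p, l_p = pos[rng.integers(len(pos))]         # observed detection (positive edge)
    s_n, l_n = rng.integers(n_species), rng.integers(n_locations)  # random negative
    for (s, l), y in (((s_p, l_p), 1.0), ((s_n, l_n), 0.0)):
        grad = sigmoid(S[s] @ L[l]) - y            # logistic-loss gradient wrt the score
        S[s], L[l] = S[s] - lr * grad * L[l], L[l] - lr * grad * S[s]

suitability = sigmoid(S @ L.T)   # predicted species-by-location detection probabilities
```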
    The integration of vector search into databases, driven by advancements in embedding models, semantic search, and Retrieval-Augmented Generation (RAG), enables powerful combined querying of structured and unstructured data. This paper focuses on filtered vector search (FVS), a core operation where relational predicates restrict the dataset before or during the vector similarity search (top-k). While approximate near neighbor (ANN) indices are commonly used to accelerate vector search by trading latency for recall, the addition of filters complicates performance optimization and makes achieving stable, declarative recall guarantees challenging. Filters alter the effective dataset size and distribution, impacting the search effort required. We discuss the primary FVS execution strategies – pre-filtering, post-filtering, and inline-filtering – whose efficiencies depend on factors like filter selectivity, cardinality, and data correlation. We review existing approaches that modify index structures and search algorithms (e.g., iterative post-filtering, filter-aware index traversal) to enhance FVS performance. This tutorial provides a comprehensive overview of filtered vector search, discussing its use cases, classifying current solutions and their trade-offs, and highlighting crucial research challenges and future directions for developing efficient and accurate FVS systems.
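The pre- versus post-filtering trade-off can be shown with brute-force stand-ins (real systems apply these strategies over ANN indices rather than exact scans; the over-fetch factor is an illustrative heuristic):

```python
import numpy as np

def top_k(q, X, ids, k):
    """Exact top-k by dot-product score over the given candidate rows."""
    scores = X[ids] @ q
    return ids[np.argsort(-scores)[:k]]

def pre_filter_search(q, X, passes_filter, k):
    """Pre-filtering: apply the relational predicate first, then search only
    the surviving vectors. Exact recall, but costly when the filter is broad."""
    return top_k(q, X, np.flatnonzero(passes_filter), k)

def post_filter_search(q, X, passes_filter, k, overfetch=4):
    """Post-filtering: search everything, over-fetching k*overfetch results,
    then drop those failing the predicate. May return fewer than k hits when
    the filter is selective -- the recall-stability problem the paper notes."""
    rough = top_k(q, X, np.arange(len(X)), k * overfetch)
    return rough[passes_filter[rough]][:k]

rng = np.random.default_rng(0)
X, q = rng.normal(size=(10_000, 64)), rng.normal(size=64)
mask = rng.random(10_000) < 0.1   # predicate keeps ~10% of rows
print(pre_filter_search(q, X, mask, 10), post_filter_search(q, X, mask, 10))
```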
    ZAPBench: A Benchmark for Whole-Brain Activity Prediction in Zebrafish
    Alexander Immer
    Alex Bo-Yuan Chen
    Mariela D. Petkova
    Nirmala A. Iyer
    Luuk Willem Hesselink
    Aparna Dev
    Gudrun Ihrke
    Woohyun Park
    Alyson Petruncio
    Aubrey Weigel
    Wyatt Korff
    Florian Engert
    Jeff W. Lichtman
    Misha B. Ahrens
    International Conference on Learning Representations (ICLR) (2025)
    Data-driven benchmarks have led to significant progress in key scientific modeling domains including weather and structural biology. Here, we present the Zebrafish Activity Prediction Benchmark (ZAPBench), which quantitatively measures progress on the problem of predicting cellular-resolution neural activity throughout an entire vertebrate brain. The benchmark is based on a novel dataset containing 4D light-sheet microscopy recordings of more than 70,000 neurons in a larval zebrafish brain, along with motion stabilized and voxel-level cell segmentations of these data that facilitate development of a variety of forecasting methods. Initial results from a selection of time series and volumetric video modeling approaches achieve better performance than naive baseline methods, but also show room for further improvement. The specific brain used in the activity recording is also undergoing synaptic-level anatomical mapping, which will enable future integration of detailed structural information into ZAP forecasting methods.
    A multi-target DMRG-X algorithm for circuit-QED system modeling
    Guifre Vidal
    Sofia Garcia
    Aaron Szasz
    Alice Pagano
    Agustin Di Paolo
    2025
    Obtaining accurate representations of the eigenstates of an array of coupled superconducting qubits is a crucial step in the design of circuit quantum electrodynamics (circuit-QED)-based quantum processors. However, exact diagonalization of the device Hamiltonian is challenging for system sizes beyond tens of qubits. Here, we employ a tensor network method based on the density matrix renormalization group (DMRG) algorithm, DMRG-X, to efficiently obtain localized eigenstates of a 2D transmon array without the need to first compute lower-energy states. We also introduce MTDMRG-X, a new algorithm that combines DMRG-X with multi-target DMRG to efficiently compute excited states even in regimes with strong eigenstate hybridization. We showcase the use of these methods for the analysis of long-range couplings in a multi-transmon Hamiltonian including qubits and couplers, and we discuss eigenstate localization. These developments facilitate the design and parameter optimization of large-scale superconducting quantum processors.
    In modern datasets, where single records can have multiple owners, enforcing user-level differential privacy requires capping each user's total contribution. This "contribution bounding" becomes a significant combinatorial challenge. Existing sequential algorithms for this task are computationally intensive and do not scale to the massive datasets prevalent today. To address this scalability bottleneck, we propose a novel and efficient distributed algorithm. Our approach models the complex ownership structure as a hypergraph, where users are vertices and records are hyperedges. The algorithm proceeds in rounds, allowing users to propose records in parallel. A record is added to the final dataset only if all its owners unanimously agree, thereby ensuring that no user's predefined contribution limit is violated. This method aims to maximize the size of the resulting dataset for high utility while providing a practical, scalable solution for implementing user-level privacy in large, real-world systems. View details
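A sequential sketch of the unanimity rule the abstract describes, with records as hyperedges over user vertices. The distributed version runs in parallel rounds of proposals; here rounds are emulated by a fixed iteration order, which is a simplification rather than the paper's protocol.

```python
from collections import defaultdict

def bound_contributions(records: dict[str, set[str]], limit: int) -> set[str]:
    """Greedy contribution bounding: keep a record only if every owner still
    has remaining budget, then charge all owners one contribution each."""
    remaining = defaultdict(lambda: limit)   # per-user remaining contribution budget
    kept = set()
    for rec_id, owners in records.items():
        # Unanimity check: all owners must be able to absorb one more record.
        if all(remaining[u] > 0 for u in owners):
            kept.add(rec_id)
            for u in owners:
                remaining[u] -= 1
    return kept

records = {"r1": {"alice", "bob"}, "r2": {"alice"},
           "r3": {"bob", "carol"}, "r4": {"alice", "carol"}}
print(bound_contributions(records, limit=1))   # {'r1'}: later records hit spent budgets
```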
    Zero-Shot Offline Styled Text Image Generation, but Make It Autoregressive
    Vittorio Pippi
    Fabio Quattrini
    Silvia Cascianelli
    Rita Cucchiara
    2025
    Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users' handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
    GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
    Diganta Misra
    Nizar Islah
    Brice Rauby
    Zihan Wang
    Justine Gehring
    Antonio Orvieto
    Muawiz Chaudhary
    Eilif Muller
    Irina Rish
    Samira Ebrahimi Kahou
    Massimo Caccia
    2025
    The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods.
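A hypothetical harness showing what execution-based, version-conditioned evaluation involves: install the exact pinned library version into an isolated environment, then run the problem's unit tests against the generated code. File names, the POSIX venv layout, and the pytest choice are assumptions, not GitChameleon's actual tooling.

```python
import subprocess
import tempfile
import venv
from pathlib import Path

def eval_solution(solution_code: str, test_code: str, pinned: str) -> bool:
    """Run a candidate solution against its unit tests inside a fresh venv
    pinned to the library version the problem is conditioned on,
    e.g. pinned='numpy==1.21.0' (illustrative)."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        venv.create(env_dir, with_pip=True)
        py = env_dir / "bin" / "python"   # POSIX layout assumed
        subprocess.run([py, "-m", "pip", "install", "-q", pinned, "pytest"],
                       check=True)
        (Path(tmp) / "solution.py").write_text(solution_code)
        (Path(tmp) / "test_solution.py").write_text(test_code)
        result = subprocess.run([py, "-m", "pytest", "-q", "test_solution.py"],
                                cwd=tmp, capture_output=True)
        return result.returncode == 0     # success = all unit tests pass
```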