Benjamin Minixhofer

b{lastname}@gmail.com

Hi there! I am a PhD student at the Language Technology Lab of the University of Cambridge and an incoming intern at Ai2. I do research in Natural Language Processing. Right now, I am especially interested in multilinguality, tokenization and language emergence. I am also interested in Rust as a language for writing fast, correct research code. I obtained a BSc in Artificial Intelligence from Johannes Kepler University Linz in 2023. Previously, I interned at Google DeepMind, Cohere, H2O.ai and Huawei Noah’s Ark Lab in London. I started out by being active on Kaggle.

News

Nov 29, 2024 Talk about «The Past, Present and Future of Tokenization» at the NLIP Seminar in Cambridge. This talk was based on an Invited Lecture at the University of Göttingen in early November. Slides.
Nov 27, 2024 Attended the ELLIS NLP Workshop at Dagstuhl. Some nice photos.
Sep 25, 2024 Zero-Shot Tokenizer Transfer is accepted at NeurIPS 2024. See you in Vancouver! 🇨🇦
Jul 24, 2024 I presented Zero-Shot Tokenizer Transfer at Google DeepMind and Mozilla. Slides.

Selected Publications

  1. Cross-Tokenizer Distillation via Approximate Likelihood Matching
    Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti
    arXiv preprint arXiv:2503.20083, Mar 2025
  2. Zero-Shot Tokenizer Transfer
    Benjamin Minixhofer, Edoardo Ponti, and Ivan Vulić
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Dec 2024
  3. CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models
    Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
  4. Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation
    Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023
  5. WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
    Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz
    In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul 2022