🎧 Vibe Coding + Knowledge Graphs = 10x Cheaper

In this issue:

  1. Repository-level software engineering
  2. Chain-of-Tools for better tool calling
  3. The most complete AI model to date


For those of you who enjoy the LinkedIn version of LLM Watch: that's great, and it's why I publish it here. However, there's a better way to read my newsletter: click here for the full experience.

You don't have to use the website or the app itself if you don't want to; you can simply subscribe there and get all of my updates straight to your e-mail, including additional content that I only publish over there.


1. Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs

Watching: KGCompass (paper)

What problem does it solve? Large codebases are hard for AI to fix! When a bug report comes in, current LLM-based systems struggle to figure out which specific piece of code needs fixing among thousands of files and tens of thousands of functions. This paper identifies three key challenges: First, semantic ambiguity - when identical function names have different meanings in different contexts, LLMs get confused. Second, limited structural understanding - LLMs don't naturally connect issue descriptions with relevant code locations across the repository. Third, lack of interpretability - most AI repair systems work like black boxes, with no clear explanation for their decisions. Only 32% of bugs in their benchmark explicitly mention where the problem is, making bug localization a critical bottleneck in automated repair.

How does it solve the problem? The researchers created KGCompass, which introduces a clever "repository-aware knowledge graph" that acts like a map connecting issue reports to code locations. Instead of treating code and documentation as separate text chunks, their system builds a network of relationships between repository artifacts (issues, pull requests) and code entities (files, classes, functions). This graph can trace multi-hop connections, following chains like "issue → pull request → file → function" to identify the most likely bug locations. Using this knowledge graph, KGCompass narrows down the search space from thousands of functions to just 20 highly relevant candidates. Then, when generating patches, the system provides LLMs with these "entity paths" as context, helping them understand not just the code but the relationships between different repository components.
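To make the idea more concrete, here is a minimal sketch of what such a repository-aware graph and its multi-hop lookup could look like in Python (using networkx). This is not the authors' implementation: the node names, edge types, and the simple hop-count ranking are illustrative assumptions, and KGCompass additionally uses semantic and LLM-based scoring to pick its top 20 candidates.

```python
# Minimal sketch of a repository-aware knowledge graph, assuming made-up entities.
import networkx as nx

G = nx.DiGraph()

# Repository artifacts and code entities become nodes...
G.add_node("issue#1234", kind="issue", text="TypeError when parsing empty config")
G.add_node("PR#1180", kind="pull_request")
G.add_node("config/loader.py", kind="file")
G.add_node("load_config", kind="function")
G.add_node("parse_section", kind="function")

# ...and their relationships become edges (mentions, modifies, contains, calls).
G.add_edge("issue#1234", "PR#1180", relation="mentions")
G.add_edge("PR#1180", "config/loader.py", relation="modifies")
G.add_edge("config/loader.py", "load_config", relation="contains")
G.add_edge("load_config", "parse_section", relation="calls")

def candidate_functions(graph, issue, max_hops=4, top_k=20):
    """Follow multi-hop paths (issue -> PR -> file -> function) and rank
    function nodes by distance as a stand-in for a real relevance score."""
    lengths = nx.single_source_shortest_path_length(graph, issue, cutoff=max_hops)
    funcs = [(n, d) for n, d in lengths.items()
             if graph.nodes[n].get("kind") == "function"]
    funcs.sort(key=lambda x: x[1])
    return funcs[:top_k]

for func, hops in candidate_functions(G, "issue#1234"):
    # The entity path is what gets handed to the LLM as extra context for patching.
    path = nx.shortest_path(G, "issue#1234", func)
    print(func, f"({hops} hops):", " -> ".join(path))
```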

What are the key findings? KGCompass achieved state-of-the-art repair performance (45.67%) and function-level localization accuracy (51.33%) on the SWE-bench-Lite benchmark, while costing only $0.20 per repair - about 10x cheaper than competing approaches. Their most striking finding was that 69.7% of successfully fixed bugs required multi-hop traversals through the knowledge graph, revealing why traditional LLMs struggle with this task - they can't easily model these indirect relationships. The system significantly reduced the search space while maintaining high coverage of ground truth bug locations (84.3% file-level, 58.8% function-level). KGCompass uniquely fixed 19 bugs that no other open-source approach could handle and 11 bugs that even commercial solutions missed. Impressively, the approach worked well with different LLMs, including open-source models like Deepseek V3 and Qwen2.5 Max.

Why does it matter? By creating a structured representation of the connections between issues and code, KGCompass addresses a fundamental limitation in current approaches that rely solely on text understanding. The 10x cost reduction makes automated repair much more practical for real-world development teams, and the knowledge graph's interpretable nature increases trust by providing clear reasoning paths. Since the approach is language-agnostic and can be incrementally updated as code changes, it's highly adaptable to various programming languages and development workflows. Perhaps most importantly, by showing how properly structured knowledge can compensate for LLM limitations, this research opens the door to more accessible and effective software maintenance tools that can work with a wide range of models, not just the most expensive proprietary ones.

Caveat: There's no repository yet and it's unclear to me whether they'll release KGCompass to the public. However, if this approach proves as promising as it sounds, someone will surely replicate it very soon.


2. Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

Watching: Chain-of-Tools (paper/code)

What problem does it solve? Current tool learning methods for Large Language Models (LLMs) face significant limitations. Fine-tuning approaches like ToolLLM restrict models to using only tools seen during training, while in-context learning methods like HuggingGPT become inefficient when dealing with many tools. This creates a challenging dilemma: either sacrifice flexibility (can't use new tools) or sacrifice efficiency (performance degrades with more tools). Real-world applications need LLMs that can efficiently reason with a massive, ever-growing toolkit, including tools they've never seen before, while maintaining their powerful reasoning abilities.

How does it solve the problem? Chain-of-Tools (CoTools) is a new method that keeps the foundation LLM completely frozen while adding specialized modules for tool integration. Their approach has three key components: 1) A Tool Judge that determines when to call tools during text generation by analyzing the hidden states of the LLM, 2) A Tool Retriever that selects appropriate tools by matching query vectors with tool vectors derived from tool descriptions, and 3) A Tool Calling component that handles parameter filling and execution. This design allows the model to maintain its original reasoning capabilities while efficiently incorporating new tools through their descriptions without requiring additional training.
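For intuition, here is a rough PyTorch sketch of what such add-on modules around a frozen LLM could look like. The layer sizes, the 0.5 threshold, and the toy tensors are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch of the CoTools idea: a frozen LLM plus small trainable modules.
import torch
import torch.nn as nn

HIDDEN = 4096  # hidden size of the frozen foundation LLM (assumed)

class ToolJudge(nn.Module):
    """Reads the LLM's hidden state at the current token and decides
    whether a tool should be called at this point in generation."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, hidden_state):                 # (batch, hidden)
        return torch.sigmoid(self.scorer(hidden_state)).squeeze(-1)

class ToolRetriever(nn.Module):
    """Projects the query hidden state and tool-description embeddings into a
    shared space and ranks tools by similarity. Because tools are represented
    only by their descriptions, unseen tools need no retraining."""
    def __init__(self, hidden=HIDDEN, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(hidden, dim)
        self.tool_proj = nn.Linear(hidden, dim)

    def forward(self, query_state, tool_desc_embeds):  # (batch, hidden), (num_tools, hidden)
        q = nn.functional.normalize(self.query_proj(query_state), dim=-1)
        t = nn.functional.normalize(self.tool_proj(tool_desc_embeds), dim=-1)
        return q @ t.T                                  # (batch, num_tools) similarity scores

# Toy usage with random tensors standing in for real LLM hidden states.
judge, retriever = ToolJudge(), ToolRetriever()
hidden_state = torch.randn(1, HIDDEN)                   # hidden state at the current decoding step
tool_embeds = torch.randn(1836, HIDDEN)                 # one embedding per tool description

if judge(hidden_state).item() > 0.5:                    # "should a tool be called here?"
    scores = retriever(hidden_state, tool_embeds)
    best_tool = scores.argmax(dim=-1).item()            # selected tool; parameter filling follows
```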

What are the key findings? CoTools outperforms baseline methods across multiple benchmarks, with particularly impressive results on the newly created SimpleToolQuestions dataset containing 1,836 tools (999 seen and 837 unseen). The method maintained high performance even with hundreds or thousands of tools, showing 10-20% better accuracy than the ToolkenGPT baseline. For unseen tools specifically, CoTools achieved 10.4% top-1 accuracy and 33.7% top-5 accuracy, while the baseline couldn't handle unseen tools at all (0% accuracy). The authors also identified specific dimensions in the LLM's hidden states that play crucial roles in tool selection, enhancing model interpretability.

Why does it matter? By enabling efficient use of massive tool libraries including unseen tools, CoTools bridges a critical gap between theoretical capabilities and practical applications. This approach allows for "plug-and-play" tool integration without compromising reasoning abilities or requiring expensive retraining. As new tools emerge daily in real-world scenarios, the ability to dynamically incorporate them is invaluable. Additionally, the insights into which hidden state dimensions influence tool selection improve our understanding of how LLMs make decisions, potentially leading to further improvements in tool learning systems.


3. Qwen2.5-Omni Technical Report

Watching: Qwen2.5-Omni (paper/code)

What problem does it solve? Current AI systems typically specialize in either understanding or generating specific modalities (text, images, audio, or video), but struggle to seamlessly integrate multiple input and output formats in real-time. This paper addresses the challenge of creating a unified multimodal system that can simultaneously perceive diverse inputs (text, images, audio, and video) while generating both text and speech responses in a streaming manner. The researchers identified three key challenges: synchronizing temporal aspects of multimodal inputs (especially audio-video alignment), preventing interference between different output modalities, and designing architectures that support real-time understanding and response with minimal latency.

How does it solve the problem? The authors introduced a novel architecture called "Thinker-Talker," reminiscent of how humans use different organs to produce various signals simultaneously while coordinating them through the same neural networks. The "Thinker" functions like a brain, handling perception and text generation, while the "Talker" operates like a mouth, converting high-level representations into speech. To synchronize audio and video inputs, they developed TMRoPE (Time-aligned Multimodal RoPE), a position embedding approach that explicitly incorporates temporal information to align the modalities. For streaming capabilities, they modified encoders to support block-wise processing and implemented a sliding-window attention mechanism in the speech generation component, significantly reducing initial latency and enabling real-time interactions.
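Here is a rough sketch of the time-alignment idea, assuming a fixed time unit, frame rates, and block size for illustration. The real TMRoPE also splits the position embedding into temporal, height, and width components for visual tokens, so treat this purely as intuition.

```python
# Sketch of time-aligned positions and block-wise streaming (assumed values throughout).

TIME_UNIT_MS = 40       # one temporal position step (assumption)
BLOCK_MS = 2000         # process inputs in 2-second blocks for streaming (assumption)

def temporal_position(t_ms):
    """Map an absolute timestamp (ms) to a temporal position ID shared across modalities."""
    return int(t_ms // TIME_UNIT_MS)

# Audio tokens every 40 ms and video frames at 2 fps over two seconds of input.
audio = [("audio", i * 40) for i in range(50)]
video = [("video", i * 500) for i in range(4)]

# Tokens from both modalities are interleaved in time order, each carrying a
# time-aligned position ID, so a video frame and the audio heard with it line up.
tokens = sorted(audio + video, key=lambda tok: tok[1])
positions = [temporal_position(t) for _, t in tokens]

# Streaming: only the tokens belonging to the current block are fed to the encoders.
current_block = [tok for tok in tokens if tok[1] < BLOCK_MS]
```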

What are the key findings? Qwen2.5-Omni demonstrates performance comparable to similarly sized single-modality models in their respective domains while excelling at multimodal tasks. It achieved state-of-the-art results on multimodal benchmarks like OmniBench (56.13%) and performed strongly on vision tasks like MMMU and MMBench. Notably, its performance on speech instruction following nearly matches its text input capabilities, closing a gap that existed in previous systems. For speech generation, it achieved impressive word error rates (1.42%, 2.33%, and 6.54% on different test sets), outperforming many specialized streaming and non-streaming speech synthesis systems in both robustness and naturalness.

Why does it matter? This represents a significant step toward more general AI systems that can interact with the world in ways similar to humans - perceiving multiple streams of information simultaneously and responding through different channels. By unifying multiple modalities in a single model with streaming capabilities, Qwen2.5-Omni bridges the gap between specialized systems and demonstrates that we can build AI that processes information more holistically without compromising performance. This approach could fundamentally improve human-computer interaction by enabling more natural, responsive, and context-aware AI assistants that can smoothly transition between different modalities. The streaming architecture also makes these capabilities practical for real-world applications where latency is crucial.


Papers of the Week:

  1. Current and Future Use of Large Language Models for Knowledge Work
  2. WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
  3. REALM: A Dataset of Real-World LLM Use Cases
  4. ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
  5. MARS: Memory-Enhanced Agents with Reflective Self-improvement
  6. Enhancing the Robustness of LLM-Generated Code: Empirical Study and Framework
  7. MemInsight: Autonomous Memory Augmentation for LLM Agents
  8. ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition


👍 If you enjoyed this article, give it a like and share it with your peers.

