Trending Papers

Submitted by

Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI · Nov 27, 2025

130

GitHub 4.39k arXiv Page

Submitted by

Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI · Published on Nov 27, 2025

16

GitHub 4.41k arXiv Page

Submitted by

akhaliq

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

WizardCoder, a Code LLM fine-tuned with complex instructions using Evol-Instruct, outperforms other open-source and closed LLMs on several code generation benchmarks.

Microsoft · Published on Jun 14, 2023

GitHub 9.47k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

21

GitHub 25.3k arXiv Page

Submitted by

taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Published on Nov 20, 2025

GitHub 5.17k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

101

GitHub 65.8k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Published on Dec 28, 2024

GitHub 26.1k arXiv Page

Submitted by

daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training LLMs in various agents, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

8 authors · Published on Aug 5, 2025

120

GitHub 9.36k arXiv Page

Submitted by

dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Published on Oct 26, 2025

10

GitHub 17.2k arXiv Page

Submitted by

thomagram

STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

STARFlow-V, a normalizing flow-based video generator, offers end-to-end learning, robust causal prediction, and high-quality video generation with practical sampling efficiency.

Apple · Published on Nov 25, 2025

29

GitHub 335 arXiv Page

Submitted by

taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

AI at Meta · Published on Nov 20, 2025

GitHub 4.4k arXiv Page

Submitted by

taesiri

HunyuanOCR Technical Report

HunyuanOCR, a lightweight Vision-Language Model, achieves state-of-the-art performance in OCR tasks through a unified end-to-end architecture combining Vision Transformer and lightweight LLM, supported by data-driven and RL strategies.

Tencent Hunyuan · Published on Nov 24, 2025

19

GitHub 1.06k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Published on Sep 26, 2025

134

Submitted by

KaituoFeng

OneThinker: All-in-one Reasoning Model for Image and Video

OneThinker, an all-in-one multimodal reasoning model, unifies image and video understanding across various tasks using RL and demonstrates strong performance and knowledge transfer.

14 authors · Published on Dec 2, 2025

17

GitHub 29 arXiv Page

Submitted by

taesiri

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld-0 is a unified world model framework that integrates video generation and 3D modeling to produce high-quality, diverse, and physically plausible VLA data, enabling strong real-world performance in embodied AI without real-world training.

25 authors · Published on Nov 25, 2025

GitHub 439 arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors · Published on Sep 27, 2024

32

Submitted by

Sakits

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

VLASH is an asynchronous inference framework for Vision-Language-Action models that achieves high speed and low latency without sacrificing accuracy, enabling precise robotic tasks like ping-pong and whack-a-mole.

MIT HAN Lab · Published on Nov 30, 2025

GitHub 111 arXiv Page

Submitted by

probejie

CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

CLaRa enhances retrieval-augmented generation by introducing unified embedding-based compression and joint optimization, achieving state-of-the-art performance in QA benchmarks.

Apple · Published on Nov 24, 2025

11

GitHub 224 arXiv Page

Submitted by

jiaruz2

Latent Collaboration in Multi-Agent Systems

LatentMAS enables efficient, lossless collaboration among LLM agents in latent space, improving performance and reducing computational costs compared to text-based methods.

Princeton-AI · Published on Nov 25, 2025

GitHub 409 arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

7

GitHub 20.8k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Published on Mar 20, 2024

173

GitHub 63.5k arXiv Page

Submitted by

fengerhu

MobiAgent: A Systematic Framework for Customizable Mobile Agents

MobiAgent, a comprehensive mobile agent system, achieves state-of-the-art performance in real-world mobile scenarios through its MobiMind-series models, AgentRR framework, and MobiFlow benchmarking suite, while also reducing data annotation costs.

10 authors · Published on Aug 30, 2025

GitHub 1.2k arXiv Page

Submitted by

Dubhe-zmc

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

ViSAudio, an end-to-end framework using conditional flow matching, generates high-quality binaural audio from silent video, providing spatial immersion and consistency across various acoustic conditions.

Zhejiang University · Published on Dec 2, 2025

20

Submitted by

Forceless

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

PPTAgent, a two-stage approach, improves presentation generation by analyzing reference presentations and ensuring structural and content consistency, outperforming traditional methods across content, design, and coherence.

9 authors · Published on Jan 7, 2025

GitHub 2.62k arXiv Page

Submitted by

lz1001

General Agentic Memory Via Deep Research

GAM, a novel framework that employs JIT compilation principles, improves memory efficiency and task completion by leveraging a lightweight memorizer and researcher in conjunction with reinforcement learning.

Beijing Academy of Artificial Intelligence · Published on Nov 23, 2025

152

GitHub 662 arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Published on Oct 23, 2024

GitHub 50.5k arXiv Page

Submitted by

AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance Seed · Published on Nov 13, 2025

91

GitHub 3.18k arXiv Page

Submitted by

zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Published on Oct 19, 2025

GitHub 2.8k arXiv Page

Submitted by

haodongli

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

A two-stage deterministic framework, Lotus-2, leverages diffusion models' world priors for high-quality geometric inference, achieving state-of-the-art results in monocular depth estimation and competitive surface normal prediction with limited training data.

4 authors · Published on Nov 30, 2025

12

GitHub 113 arXiv Page

Submitted by

FayeHongfeiZhang

DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

DualCamCtrl is a diffusion model for camera-controlled video generation that uses a dual-branch framework and Semantic Guided Mutual Alignment to improve consistency and disentangle appearance and geometry modeling.

9 authors · Published on Nov 28, 2025

Submitted by

UglyToilet

MemOS: A Memory OS for AI System

MemOS, a memory operating system for Large Language Models, addresses memory management challenges by unifying plaintext, activation-based, and parameter-level memories, enabling efficient storage, retrieval, and continual learning.

39 authors · Published on Jul 4, 2025

156

GitHub 3.31k arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Published on Feb 8, 2025

GitHub 16.2k arXiv Page

Submitted by

Zilence006

Vidi: Large Multimodal Models for Video Understanding and Editing

Vidi, a family of Large Multimodal Models, excels in temporal retrieval for video editing by processing long, multimodal video content and outperforming proprietary models on the VUE-TR benchmark.

22 authors · Published on Apr 22, 2025

GitHub 442 arXiv Page

Submitted by

Vfrz

Deep Research: A Systematic Survey

Deep Research systems integrate LLMs with external tools to enhance problem-solving capabilities, involving query planning, information acquisition, memory management, and answer generation.

26 authors · Published on Nov 24, 2025

48

GitHub 156 arXiv Page

Submitted by

taesiri

Code2Video: A Code-centric Paradigm for Educational Video Generation

Code2Video generates educational videos using a code-centric agent framework, improving coherence and interpretability compared to direct code generation.

Show Lab · Published on Oct 1, 2025

33

GitHub 1.25k arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Published on Apr 28, 2025

34

GitHub 43.8k arXiv Page

Submitted by

shizhediao

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

A small orchestrator trained with the ToolOrchestra method uses reinforcement learning to coordinate various intelligent tools, achieving higher accuracy and efficiency than larger models on complex tasks such as Humanity's Last Exam.

NVIDIA · Published on Nov 26, 2025

80

GitHub 247 arXiv Page

Submitted by

CaraJ

Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Echo-4o-Image, a synthetic dataset generated by GPT-4o, enhances image generation models by addressing rare scenarios and providing clean supervision, leading to improved performance and transferability.

12 authors · Published on Aug 13, 2025

25

GitHub 340 arXiv Page

Submitted by

nielsr

Back to Basics: Let Denoising Generative Models Denoise

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Massachusetts Institute of Technology · Published on Nov 17, 2025

61

GitHub 1.58k arXiv Page

Submitted by

JiaaqiLiu

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Agent0-VL, a self-evolving vision-language agent, incorporates tool usage into both reasoning and self-evaluation, enabling continual improvement through evidence-grounded analysis and reinforcement learning.

University of North Carolina at Chapel Hill · Published on Nov 25, 2025

46

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

A simple head-specific sigmoid gate applied after Scaled Dot-Product Attention improves performance, stability, and scaling in large models, mitigating 'attention sink' and enhancing long-context extrapolation.

13 authors · Published on May 10, 2025

GitHub 380 arXiv Page

Submitted by

YuWangX

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX, a modular multi-agent memory system, enhances language models' memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

2 authors · Published on Jul 10, 2025

79

GitHub 3.5k arXiv Page

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors · Published on Apr 14, 2025

304

arXiv Page

Submitted by

richardxp888

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Agent0, a self-evolving framework utilizing multi-step co-evolution and tool integration, enhances LLM reasoning capabilities without human-curated data.

University of North Carolina at Chapel Hill · Published on Nov 20, 2025

103

Submitted by

taesiri

AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

The proposed AnyTalker framework generates high-quality multi-person talking videos by extending Diffusion Transformer with identity-aware attention, leveraging single-person videos for training, and using a specialized dataset for evaluation.

15 authors · Published on Nov 28, 2025

GitHub 150 arXiv Page

Submitted by

Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

GigaAI · Published on Oct 22, 2025

47

GitHub 317 arXiv Page

Submitted by

SteveZeyuZhang

EvoVLA: Self-Evolving Vision-Language-Action Model

EvoVLA, a self-supervised VLA framework, enhances long-horizon robotic manipulation by addressing stage hallucination through triplet contrastive learning, pose-based exploration, and long-horizon memory, achieving improved success rates and sample efficiency on both simulated and real-world tasks.

Peking University · Published on Nov 20, 2025

4

GitHub 115 arXiv Page

Submitted by

mao1207

SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

SimWorld, a new Unreal Engine 5-based simulator, enables the development and evaluation of LLM/VLM agents in realistic, real-world-like settings with diverse physical and social reasoning scenarios.

23 authors · Published on Nov 30, 2025

23

GitHub 149 arXiv Page

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Published on Jun 28, 2020

3

GitHub 95.6k arXiv Page

MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

A large-scale dataset and extensive comparative analysis are provided for multilingual speech translation in the medical domain, enhancing communication and healthcare efficiency.

13 authors · Published on Apr 4, 2025

2

GitHub 116 arXiv Page
