Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
TL;DR: We present Playmate2, a framework for generating high-quality audio-driven videos that effectively tackles two key challenges: temporal coherence in long sequences and multi-character animation. To the best of our knowledge, this is the first training-free approach that enables audio-driven animation for three or more characters without requiring additional data or model modifications.
2025/11/21: 🔥🔥🔥 We release the weights and inference code of Playmate2!
2025/11/10: 🎉🎉🎉 Our paper has been accepted and will be presented at AAAI 2026. We plan to release the inference code and model weights for both Playmate and Playmate2 in the coming weeks. Stay tuned, and thank you for your patience!
2025/10/15: 🚀🚀🚀 Our paper is now public on arXiv.
[Demo video gallery: multi-person dialogue and singing examples, including multi_persons_09-multiperson_30.mp4, female_song_01-female_55.mp4, and others.]
Explore more examples.
conda create -n playmate2 python=3.10
conda activate playmate2
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -U xformers==0.0.29 --index-url https://download.pytorch.org/whl/cu124
pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1 --no-build-isolation
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
# or
sudo yum install ffmpeg ffmpeg-devel
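A quick, optional sanity check (our suggestion, not part of the official setup) confirms that PyTorch sees the GPU and that the compiled attention extensions import cleanly:
# Verify the GPU is visible and that flash-attn and xformers load.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn, xformers; print(flash_attn.__version__, xformers.__version__)"
ffmpeg -version | head -n 1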
| Models | Download Link | Save Path |
|---|---|---|
| Wan2.1-I2V-14B-720P | Huggingface | pretrained_weights/Wan2.1-I2V-14B-720P |
| chinese-wav2vec2-base | Huggingface | pretrained_weights/chinese-wav2vec2-base |
| VideoLLaMA3-7B | Huggingface | pretrained_weights/VideoLLaMA3-7B |
| Our Pretrained Model | Huggingface | pretrained_weights/playmate2 |
Download models using huggingface-cli:
mkdir pretrained_weights
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./pretrained_weights/Wan2.1-I2V-14B-720P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./pretrained_weights/chinese-wav2vec2-base
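# fetch model.safetensors from the PR branch (refs/pr/1) of the wav2vec2 repo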
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./pretrained_weights/chinese-wav2vec2-base
huggingface-cli download DAMO-NLP-SG/VideoLLaMA3-7B --local-dir ./pretrained_weights/VideoLLaMA3-7B
huggingface-cli download PlaymateAI/Playmate2 --local-dir ./pretrained_weights/playmate2
We recommend using an A100 or higher GPU for inference.
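Optionally, you can verify that all four checkpoints landed in the save paths from the table above; the loop below is a convenience check we suggest, and it only tests that each directory exists and is non-empty:
# Check each save path from the table above.
for d in Wan2.1-I2V-14B-720P chinese-wav2vec2-base VideoLLaMA3-7B playmate2; do
  [ -d "./pretrained_weights/$d" ] && [ -n "$(ls -A ./pretrained_weights/$d)" ] \
    && echo "OK: $d" || echo "MISSING or empty: $d"
done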
- One person
# --gpu_num: 1 (single GPU) or 3 (multiple GPUs)
python inference.py \
--gpu_num 1 \
--image_path examples/images/01.png \
--audio_path examples/audios/01.wav \
--prompt_path examples/prompts/01.txt \
--output_path examples/outputs/01.mp4 \
--max_size 1280 \
--id_num 1
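To run several single-person examples back to back, the command loops naturally. A minimal sketch, assuming hypothetical ids 01-03 each have a matching image, audio, and prompt file (only 01 is shown above; adjust the list to what examples/ actually contains):
# Hypothetical batch run; edit the id list to match your files.
for id in 01 02 03; do
  python inference.py \
    --gpu_num 1 \
    --image_path examples/images/${id}.png \
    --audio_path examples/audios/${id}.wav \
    --prompt_path examples/prompts/${id}.txt \
    --output_path examples/outputs/${id}.mp4 \
    --max_size 1280 \
    --id_num 1
done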
- Multiple persons
# N denotes the number of persons
# --gpu_num: 1 (single GPU) or 3+N-1 (multiple GPUs)
python inference.py \
--gpu_num 1 \
--image_path examples/images/04.png \
--audio_path examples/audios/04 \
--mask_path examples/masks/04 \
--prompt_path examples/prompts/04.txt \
--output_path examples/outputs/04.mp4 \
--max_size 1280 \
--id_num 3
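Note that --audio_path and --mask_path point to directories here rather than single files. The layout sketched below is our assumption about how the per-person inputs are organized; names like person_1.wav are hypothetical, so check the bundled examples/ folder for the actual convention:
# Inspect the example inputs shipped with the repo.
ls examples/audios/04 examples/masks/04
# Assumed contents (hypothetical names):
#   examples/audios/04/  one audio track per person, e.g. person_1.wav ... person_3.wav
#   examples/masks/04/   one mask image per person marking their region in the frame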
If you find our work useful for your research, please consider citing the paper:
@article{ma2025playmate2,
title={Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback},
author={Ma, Xingpei and Huang, Shenneng and Cai, Jiaran and Guan, Yuansheng and Zheng, Shen and Zhao, Hanfeng and Zhang, Qiang and Zhang, Shunsi},
journal={arXiv preprint arXiv:2510.12089},
year={2025}
}
