Cong Wang1*, Zexuan Deng1*, Zhiwei Jiang1†, Yafeng Yin1†, Fei Shen2, Zifeng Cheng1, Shiping Ge1, Shiwei Gan1, Qing Gu1
1 Nanjing University,
2 National University of Singapore
* Equal contribution | † Corresponding authors
NeurIPS 2025 (Spotlight)
```
.
├── configs/                         # Configuration files
├── metrics/                         # Evaluation metrics
├── models/                          # Model architectures
├── pipelines/                       # Data processing pipelines
├── scripts/                         # Utility scripts
├── signdatasets/                    # Dataset handling
├── train.sh                         # Training script
├── train_stage_1.py                 # Stage I training (single frame)
├── train_stage_2.py                 # Stage I training (Temporal-Attention layer)
├── train_compress_vq_multicond.py   # Stage II training
├── train_multihead_t2vqpgpt.py      # Stage III training
└── utils.py                         # Utility functions
```

The system is trained in three main stages:
- **Stage I: Sign Video Diffusion Model**. Training files: `train_stage_1.py`, `train_stage_2.py`
- **Stage II: FSQ Autoencoder**. Training file: `train_compress_vq_multicond.py` (see the FSQ sketch below)
- **Stage III: Multi-Condition Token Translator**. Training file: `train_multihead_t2vqpgpt.py`
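For intuition on the Stage II quantizer, here is a minimal, self-contained sketch of finite scalar quantization (FSQ): each latent channel is bounded, rounded to a small fixed number of levels, and gradients are passed straight through. The level counts and tensor shapes below are illustrative only and do not reflect the repository's actual configuration (see `configs/vq/` for that).

```python
import torch

def fsq_quantize(z, levels=(8, 8, 8, 5, 5, 5)):
    """Finite scalar quantization sketch (levels are illustrative only).

    Bounds each latent channel, rounds it to a fixed integer grid, and uses a
    straight-through estimator so gradients bypass the rounding.
    """
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half          # each channel now lies in [-half, half]
    quantized = torch.round(bounded)        # snap to the integer grid
    # forward pass uses the quantized values; backward acts as if no rounding happened
    return bounded + (quantized - bounded).detach()

z = torch.randn(2, 16, 6)                   # (batch, tokens, channels), illustrative shape
codes = fsq_quantize(z)
```

Unlike a classic VQ-VAE codebook, FSQ needs no learned codebook or commitment loss, which keeps training of the autoencoder simple.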
Install the dependencies:

```bash
pip install -r requirements.txt
```

Pretrained models:
- RWTH-T Models: [huggingface]
- How2Sign Models: coming soon
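To fetch the released RWTH-T checkpoints programmatically, something like the following works; the `repo_id` below is a placeholder, substitute the repository id behind the link above.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace it with the Hugging Face repo linked above.
snapshot_download(
    repo_id="<org-or-user>/<rwth-t-sign-models>",
    local_dir="checkpoints/RWTH-T",
)
```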
Stage I (Sign Video Diffusion Model):

```bash
# Single frame training
accelerate launch \
    --config_file accelerate_config.yaml \
    --num_processes 2 --gpu_ids "0,1" \
    train_stage_1.py --config "configs/stage1/stage_1_multicond_RWTH.yaml"

# Temporal-Attention Layer training
accelerate launch \
    --config_file accelerate_config.yaml \
    --num_processes 2 --gpu_ids "0,1" \
    train_stage_2.py --config "configs/stage2/stage_2_RWTH.yaml"
```
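The second Stage I run trains the Temporal-Attention layer on top of the single-frame model. A generic sketch of that pattern, freezing everything except the temporal modules, is shown below; the module naming is hypothetical, and `train_stage_2.py` may select parameters differently.

```python
import torch.nn as nn

def freeze_all_but_temporal(unet: nn.Module, keyword: str = "temporal"):
    """Freeze every parameter except those in temporal-attention modules.

    The `keyword` match is a guess at how the temporal layers are named;
    check the model definitions under models/ for the actual names.
    """
    trainable = []
    for name, param in unet.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable  # pass only these parameters to the optimizer
```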
Stage II (FSQ Autoencoder):

```bash
accelerate launch \
    --config_file accelerate_config.yaml \
    --num_processes 2 --gpu_ids "0,1" \
    train_compress_vq_multicond.py \
    --config "configs/vq/vq_multicond_RWTH_compress.yaml"
```
Stage III (Multi-Condition Token Translator):

```bash
accelerate launch \
    --config_file accelerate_config_bf16.yaml \
    --num_processes 2 --gpu_ids "0,1" \
    train_multihead_t2vqpgpt.py \
    --config "configs/gpt/multihead_t2vqpgpt_RWTH.yaml"
```
Extract the compressed VQ pose latents for each split (shown here for the train split; run the same command for the val and test splits):

```bash
python get_compress_vq_pose_latent.py \
    --config /path/to/config_train.yaml \
    --output_dir /path/to/output/train_processed_videos/
```
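If you prefer to process all three splits in one go, a small wrapper like the following works; the config and output paths are placeholders to adapt to your setup.

```python
import subprocess

# Placeholder paths: point these at your actual config files and output root.
for split in ("train", "val", "test"):
    subprocess.run(
        [
            "python", "get_compress_vq_pose_latent.py",
            "--config", f"/path/to/config_{split}.yaml",
            "--output_dir", f"/path/to/output/{split}_processed_videos/",
        ],
        check=True,
    )
```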
```bash
python eval_compress_vq_video.py \
    --config /path/to/config_test.yaml \
    --input_dir /path/to/test_processed_videos \
    --video_base_path /path/to/original_videos \
    --pose_size 12   # 12 for RWTH, 64 for How2Sign
```

Evaluation scripts:
- `eval_multihead_t2vqpgpt.py`: evaluates the token translator
- `eval_compress_video_from_origin.py`: evaluates video compression
- `eval_compress_vq_video.py`: evaluates quantized video compression
- `combined_t2s_eval.py`: combined text-to-sign evaluation
RWTH-T Examples

| Example 1 | Example 2 | Example 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
How2Sign Examples

| Example 1 | Example 2 | Example 3 |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
Located in `scripts/RWTH-T/`:
1. `1_make_video.py`: create videos
2. `2_check_video.py`: validate videos
3. `3_process_annotation.py`: process annotations
Located in `scripts/how2sign/`:
1. `1_create_json.py`: create metadata
2. `2_clip_videos.py`: clip videos
3. `3_check_clip_videos.py`: validate clips
4. `4_crop_and_resize_videos.py`: crop and resize (see the sketch after this list)
5. `5_create_final_json.py`: final dataset metadata
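For reference, the crop-and-resize step usually amounts to center-cropping each frame to a square and resizing it to a fixed resolution. The sketch below illustrates that with OpenCV; it is a generic example, not the actual logic of `4_crop_and_resize_videos.py`.

```python
import cv2

def center_crop_and_resize(in_path: str, out_path: str, size: int = 256):
    """Center-crop every frame to a square and resize it to `size` x `size`."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        side = min(h, w)
        y0, x0 = (h - side) // 2, (w - side) // 2
        frame = cv2.resize(frame[y0:y0 + side, x0:x0 + side], (size, size))
        if writer is None:
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (size, size))
        writer.write(frame)
    cap.release()
    if writer is not None:
        writer.release()
```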
Other preprocessing:
- `scripts/hamer/`: HAMER dataset processing
- `scripts/sk/`: SK (DWPose) dataset processing
If you find this work useful, please cite:
```bibtex
@article{wang2025advanced,
  title={Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization},
  author={Wang, Cong and Deng, Zexuan and Jiang, Zhiwei and Shen, Fei and Yin, Yafeng and Gan, Shiwei and Cheng, Zifeng and Ge, Shiping and Gu, Qing},
  journal={arXiv preprint arXiv:2506.15980},
  year={2025}
}
```