
Commit f359de0

add pretrain dataset (#2803)
1 parent 0cd7b01 commit f359de0

26 files changed (+1272, -127 lines)

examples/README.md

Lines changed: 102 additions & 9 deletions
@@ -28,9 +28,102 @@ save_to_hf: false
 ```
 
 
-## 1. Fine-tuning
+## 1. Pre-training
 
-### 1.1 Data Preparation
+### 1.1. Data Preparation
+
+#### 1.1.1. Online data stream
+
+The supported pre-training data format is a json file with one dictionary per line; each dictionary contains the following field:
+
+- `text` : `str, List(str)`, the pre-training text.
+
+Sample data:
+
+```text
+{"text": ["一个需要连续输入值的分类问题的示例是房屋价格预测。房屋的价格通常基于诸如平方英尺、位置、卧室和浴室数量以及像后院或车库等功能这样的因素定价。为了准确预测房屋价格,这些标准必须作为连续输入值输入到分类模型中。"]}
+...
+```
+
+#### 1.1.2. Offline data stream
+
+Alternatively, an offline binary pre-training data stream can be used, which is more memory-efficient. It is built as follows:
+
+Download a text dataset, for example https://modelscope.cn/datasets/BazingaLyn/mini_pretrain_dataset
+
+The format must be jsonl, with each line as in BazingaLyn/mini_pretrain_dataset/pretrain_hq_v7.jsonl:
+```text
+{"text": "番茄炒蛋\n材料:\n鸡蛋3个、番茄1个、油、盐、糖、水淀粉\n做法:..."}
+{"text": "请描述一下如何正确规划个人理财。正确规划个人理财需要以下几个步骤..."}
+{"text": "请输入一段描述有关海洋保护的情景对话。Person A: 哇,这个海滩真..."}
+{"text": "鉴别两种不同类型的葡萄酒。鉴别葡萄酒的方法因其类型和品种而异,下..."}
+```
+
+Run `examples/tools/create_pretraining_data.py`; the generated data will be saved in the current directory as `./pretrain_data.bin` and `./pretrain_data.idx`:
+```bash
+python -u examples/tools/create_pretraining_data.py \
+--model_name_or_path "/path/to/your/Qwen3-0.6B-base" \
+--data_format "JSON" \
+--input_path "/path/to/your/BazingaLyn/mini_pretrain_dataset/pretrain_hq_v7.jsonl" \
+--append_eos \
+--output_prefix "./pretrain_data" \
+--workers 1 \
+--log_interval 10000 \
+--data_impl "mmap"
+```
+
+- Parameter description
+
+| Parameter | Type | Description |
+|--------------------|------------|-----------------|
+| `--model_name_or_path` | string | Path to the model |
+| `--data_format` | string | Input file format; currently only JSON is supported |
+| `--input_path` | string | Path to the input json file |
+| `--append_eos` | store_true | Whether to append an eos token at the end of each document |
+| `--output_prefix` | str | Prefix of the output files |
+| `--workers` | int | Number of worker processes |
+| `--log_interval` | int | Logging interval |
+| `--data_impl` | str | Dataset implementation; defaults to mmap, lazy is also supported |
+
+### 1.2. Full-parameter PT
+
+Pre-training requires specifying `stage: PT` in the config file.
+
+Online data stream
+```bash
+# Single GPU
+paddleformers-cli train ./config/pt/full.yaml
+# Multi GPU
+paddleformers-cli train ./config/pt/full_tp_pp.yaml
+```
+
+Offline data stream
+
+In the config file:
+
+`input_dir` specifies the dataset prefix; for example, for the dataset `data-1-part0.bin`, set `input_dir: "1.0 ./data-1-part0"`, where `1.0` is the data mixing ratio;
+
+`split` sets the train/eval split ratio, e.g. `split: "998,2"`, where `train` is the training set and `eval` is the evaluation set;
+
+`dataset_type` must be set to `pretrain`, e.g. `dataset_type: "pretrain"`.
+
+```bash
+paddleformers-cli train ./config/pt/full_offline_data.yaml
+```
+
+### 1.3. LoRA PT
+
+Reference commands for launching LoRA PT:
+```bash
+# Single GPU
+paddleformers-cli train ./config/pt/lora.yaml
+# Multi GPU
+paddleformers-cli train ./config/pt/lora_tp_pp.yaml
+```
+
+## 2. Fine-tuning
+
+### 2.1 Data Preparation
 
 The supported fine-tuning data format is a json file with one dictionary per line; each dictionary contains the following fields:
 
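Note: the online pre-training format added above requires each jsonl line to be a JSON object whose `text` field is a string or a list of strings. A quick sanity check before building the offline dataset can look like the following — a minimal sketch, not part of this commit, using only the Python standard library (the file name is an example):

```python
import json
import sys

def validate_pretrain_jsonl(path: str) -> int:
    """Count records whose "text" field is a str or a list of str,
    matching the pre-training format described above."""
    n_ok = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)  # raises ValueError on malformed JSON
            text = record.get("text") if isinstance(record, dict) else None
            ok = isinstance(text, str) or (
                isinstance(text, list) and text and all(isinstance(t, str) for t in text)
            )
            if not ok:
                sys.exit(f"line {lineno}: 'text' must be str or List[str]")
            n_ok += 1
    return n_ok

if __name__ == "__main__":
    # e.g. python validate_pretrain_jsonl.py pretrain_hq_v7.jsonl
    print(f"{validate_pretrain_jsonl(sys.argv[1])} valid records")
```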
@@ -51,7 +144,7 @@ wget https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz
 mkdir -p data/sft && tar -xf alpaca_demo.gz -C data/sft/ --strip-components=1
 ```
 
-### 1.2 Full-parameter SFT
+### 2.2 Full-parameter SFT
 
 Single GPU
 ```bash
@@ -63,17 +156,17 @@ python -u run_finetune.py ./config/sft/full.yaml
 python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" run_finetune.py ./config/sft/full_tp_pp.yaml
 ```
 
-### 1.3 LoRA SFT
+### 2.3 LoRA SFT
 
 Reference command for launching LoRA SFT:
 ```bash
 python -u run_finetune.py ./config/sft/lora.yaml
 ```
 
 
-## 2. Alignment
+## 3. Alignment
 
-### 2.1 Data Preparation
+### 3.1 Data Preparation
 
 The supported fine-tuning data format is a json file with one dictionary per line; each dictionary contains the following fields:
 
@@ -105,7 +198,7 @@ wget https://bj.bcebos.com/paddlenlp/datasets/examples/ultrafeedback_binarized.t
 mkdir -p data/dpo && tar -zxf ultrafeedback_binarized.tar.gz -C data/dpo/ --strip-components=1
 ```
 
-### 2.2 Full-parameter DPO
+### 3.2 Full-parameter DPO
 
 Single GPU
 ```bash
@@ -117,15 +210,15 @@ python -u ./alignment/dpo/run_dpo.py ./config/dpo/full.yaml
 python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/dpo/run_dpo.py ./config/dpo/full_tp_pp.yaml
 ```
 
-### 2.3 LoRA DPO
+### 3.3 LoRA DPO
 
 Reference command for launching LoRA DPO:
 ```bash
 python -u ./alignment/dpo/run_dpo.py ./config/dpo/lora.yaml
 ```
 
 
-## 3. LoRA Parameter Merging
+## 4. LoRA Parameter Merging
 
 After training a model with LoRA, we provide the script `tools/mergekit.py` to merge the LoRA parameters into the main model weights for convenient inference.
 
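Note: for the offline data stream documented above, `input_dir` pairs a mixing ratio with a dataset prefix, and that prefix must resolve to matching `.bin`/`.idx` files produced by `create_pretraining_data.py`. A helper along these lines — a sketch, not part of this commit; the space-separated multi-dataset form is an assumption beyond the single-dataset example in the README — can assemble and check the value before it goes into `full_offline_data.yaml`:

```python
from pathlib import Path

def build_input_dir(parts: list[tuple[float, str]]) -> str:
    """Compose the `input_dir` value from (ratio, prefix) pairs and
    verify that each prefix has both a .bin and an .idx file."""
    chunks = []
    for ratio, prefix in parts:
        for suffix in (".bin", ".idx"):
            if not Path(prefix + suffix).is_file():
                raise FileNotFoundError(f"missing {prefix + suffix}")
        chunks.append(f"{ratio} {prefix}")
    # Assumption: multiple datasets are listed space-separated, e.g.
    # "1.0 ./data-1-part0 0.5 ./data-2-part0"; the README only shows
    # the single-dataset form.
    return " ".join(chunks)

if __name__ == "__main__":
    # Matches the --output_prefix "./pretrain_data" example above.
    print(f'input_dir: "{build_input_dir([(1.0, "./pretrain_data")])}"')
```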
paddleformers/cli/README.md renamed to examples/cli_usage.md

Lines changed: 21 additions & 12 deletions
@@ -22,8 +22,8 @@ Expected output:
 ```
 ------------------------------------------------------------
 | Usage: |
-| paddleformers-cli train -h: model finetuning |
-| paddleformers-cli export -h: model export |
+| paddleformers-cli train : model finetuning |
+| paddleformers-cli export : model export |
 | paddleformers-cli help: show helping info |
 ------------------------------------------------------------
 ```
@@ -60,33 +60,42 @@ Examples using **Qwen/Qwen3-0.6B-Base** model:
 ## 1.1. Chat
 To be added
 
-## 1.2. Model Fine-tuning
+## 1.2. Model Pre-training
 
-### 1.2.1. SFT & LoRA Fine-tuning
+```bash
+# Example 1: PT-Full using online dataset
+paddleformers-cli train examples/config/pt/full.yaml
+# Example 2: PT-Full using offline dataset
+paddleformers-cli train examples/config/pt/full_offline_data.yaml
+```
+
+## 1.3. Model Fine-tuning
+
+### 1.3.1. SFT & LoRA Fine-tuning
 ```bash
 # Example 1: SFT
-paddleformers-cli train examples/config/sft_lora.yaml
+paddleformers-cli train examples/config/sft/lora.yaml
 # Example 2: SFT-Full
-paddleformers-cli train examples/config/sft_full.yaml
+paddleformers-cli train examples/config/sft/full.yaml
 ```
 
-### 1.2.2. DPO & LoRA Fine-tuning
+### 1.3.2. DPO & LoRA Fine-tuning
 ```bash
 # Example 1: 8K seq length, DPO
-paddleformers-cli train examples/config/dpo_full.yaml
+paddleformers-cli train examples/config/dpo/full.yaml
 # Example 2: 8K seq length, DPO-LoRA
-paddleformers-cli train examples/config/dpo_lora.yaml
+paddleformers-cli train examples/config/dpo/lora.yaml
 ```
 
-## 1.3 Model Eval
+## 1.4. Model Eval
 To be added
 
-## 1.4. Model Export
+## 1.5. Model Export
 ```bash
 paddleformers-cli export examples/config/run_export.yaml
 ```
 
-## 1.5. Multi-Node Training
+## 1.6. Multi-Node Training
 ```bash
 NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
 ```

examples/cli_usage_zh.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+# Command Line Interface
+
+## Overview
+
+The CLI (Command Line Interface) provides terminal-based interaction with the program: through parameterized configuration it runs model training, inference, and evaluation tasks efficiently and flexibly.
+
+## Quick Start
+
+**Installation**
+
+Run the following in the PaddleFormers root directory:
+```bash
+python -m pip install -e .
+```
+
+Verify the installation:
+```bash
+paddleformers-cli help
+```
+
+Expected output:
+```
+------------------------------------------------------------
+| Usage: |
+| paddleformers-cli train : model finetuning |
+| paddleformers-cli export : model export |
+| paddleformers-cli help: show helping info |
+------------------------------------------------------------
+```
+
+**GPU configuration**
+
+By default, the CLI uses all available GPUs.
+To restrict it to specific GPUs, set CUDA_VISIBLE_DEVICES before running the CLI:
+
+```bash
+# Single GPU
+export CUDA_VISIBLE_DEVICES=0
+# Multi GPUs
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+# Single XPU
+export XPU_VISIBLE_DEVICES=0
+# Multi XPUs
+export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+# Single NPU
+export ASCEND_RT_VISIBLE_DEVICES=0
+# Multi NPUs
+export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+```
+
+* Note: in the `Chat` module, the number of GPUs configured via CUDA_VISIBLE_DEVICES should equal `tensor_parallel_degree` in the config.
+Alternatively, you can leave CUDA_VISIBLE_DEVICES unset.
+
+# 1. CLI Usage
+
+Examples using the **Qwen/Qwen3-0.6B-Base** model:
+
+## 1.1. Chat
+To be added
+
+## 1.2. Model Pre-training
+
+```bash
+# Example 1: PT-Full using online dataset
+paddleformers-cli train examples/config/pt/full.yaml
+# Example 2: PT-Full using offline dataset
+paddleformers-cli train examples/config/pt/full_offline_data.yaml
+```
+
+## 1.3. Model Fine-tuning
+
+### 1.3.1. SFT & LoRA Fine-tuning
+```bash
+# Example 1: SFT
+paddleformers-cli train examples/config/sft/lora.yaml
+# Example 2: SFT-Full
+paddleformers-cli train examples/config/sft/full.yaml
+```
+
+### 1.3.2. DPO & LoRA Fine-tuning
+```bash
+# Example 1: 8K seq length, DPO
+paddleformers-cli train examples/config/dpo/full.yaml
+# Example 2: 8K seq length, DPO-LoRA
+paddleformers-cli train examples/config/dpo/lora.yaml
+```
+
+## 1.4. Model Eval
+To be added
+
+## 1.5. Model Export
+```bash
+paddleformers-cli export examples/config/run_export.yaml
+```
+
+## 1.6. Multi-Node Training
+```bash
+NNODES={num_nodes} MASTER_ADDR={your_master_addr} MASTER_PORT={your_master_port} CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 paddleformers-cli train examples/config/sft_full.yaml
+```
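Note: the GPU configuration section above controls device selection purely through environment variables. Before launching `paddleformers-cli`, visibility can be confirmed with a short check like this — a sketch, not part of this commit, assuming a working PaddlePaddle install and its `paddle.device` utilities:

```python
import os

import paddle  # assumes PaddlePaddle is installed

def report_visible_devices() -> None:
    """Print the device-selection variables and what Paddle can see."""
    for var in ("CUDA_VISIBLE_DEVICES", "XPU_VISIBLE_DEVICES", "ASCEND_RT_VISIBLE_DEVICES"):
        print(f"{var}={os.environ.get(var, '<unset>')}")
    print("CUDA devices visible to Paddle:", paddle.device.cuda.device_count())
    print("default device:", paddle.device.get_device())

if __name__ == "__main__":
    # Run with the same environment you will use for the CLI, e.g.
    #   CUDA_VISIBLE_DEVICES=0,1 python check_devices.py
    report_visible_devices()
```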

examples/config/dpo/full.yaml

Lines changed: 0 additions & 2 deletions
@@ -6,7 +6,6 @@ train_dataset_prob: "1.0"
 eval_dataset_path: ./data/dpo/dev.jsonl
 eval_dataset_prob: "1.0"
 max_seq_len: 8192
-num_samples_each_epoch: 6000000
 packing: false
 mix_strategy: concat
 
@@ -28,7 +27,6 @@ max_steps: -1
 eval_steps: 100
 evaluation_strategy: steps
 save_steps: 100
-save_total_limit: 1
 save_strategy: steps
 logging_steps: 1
 gradient_accumulation_steps: 4

examples/config/dpo/full_function_call.yaml

Lines changed: 0 additions & 2 deletions
@@ -6,7 +6,6 @@ train_dataset_prob: "1.0"
 eval_dataset_path: ./data/fc/function-call-eval.jsonl
 eval_dataset_prob: "1.0"
 max_seq_len: 8192
-num_samples_each_epoch: 6000000
 packing: false
 mix_strategy: concat
 
@@ -30,7 +29,6 @@ max_steps: -1
 eval_steps: 100
 evaluation_strategy: steps
 save_steps: 100
-save_total_limit: 1
 save_strategy: steps
 logging_steps: 1
 gradient_accumulation_steps: 4

examples/config/dpo/full_tp_pp.yaml

Lines changed: 0 additions & 1 deletion
@@ -28,7 +28,6 @@ max_steps: -1
 eval_steps: 100
 evaluation_strategy: steps
 save_steps: 100
-save_total_limit: 1
 save_strategy: steps
 logging_steps: 1
 gradient_accumulation_steps: 4

examples/config/dpo/lora.yaml

Lines changed: 0 additions & 2 deletions
@@ -6,7 +6,6 @@ train_dataset_prob: "1.0"
 eval_dataset_path: ./data/dpo/dev.jsonl
 eval_dataset_prob: "1.0"
 max_seq_len: 8192
-num_samples_each_epoch: 6000000
 packing: false
 mix_strategy: concat
 
@@ -30,7 +29,6 @@ max_steps: -1
 eval_steps: 100
 evaluation_strategy: steps
 save_steps: 100
-save_total_limit: 1
 save_strategy: steps
 logging_steps: 1
 gradient_accumulation_steps: 4

examples/config/dpo/lora_tp_pp.yaml

Lines changed: 0 additions & 2 deletions
@@ -6,7 +6,6 @@ train_dataset_prob: "1.0"
 eval_dataset_path: ./data/dpo/dev.jsonl
 eval_dataset_prob: "1.0"
 max_seq_len: 8192
-num_samples_each_epoch: 6000000
 packing: true
 mix_strategy: concat
 
@@ -30,7 +29,6 @@ max_steps: -1
 eval_steps: 100
 evaluation_strategy: steps
 save_steps: 100
-save_total_limit: 1
 save_strategy: steps
 logging_steps: 1
 gradient_accumulation_steps: 4
