LLM Beginner's Trilogy, Part 1: From Zero to One, Hands-On from Model Deployment to Evaluation

Original article · last modified 2026-03-24 19:52:39

This content is licensed under CC 4.0 BY-SA.

First published 2026-03-20 00:59:01

The steps below assume a Mac; on Windows, substitute the equivalent commands where they differ.

1. Install Python first

2. Create a Python virtual environment

# Create the virtual environment
python3 -m venv venv
# Activate it; all remaining shell commands must be run inside the virtual environment
source venv/bin/activate

When you are completely finished, exit the virtual environment:

deactivate
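
If you are not sure whether the virtual environment is actually active, a rough check from inside Python is to compare `sys.prefix` with `sys.base_prefix`. The helper name `in_virtualenv` is made up for this sketch.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this prints False after `source venv/bin/activate`, the `python` on your PATH is probably not the one inside `venv`.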

3. Install the evaluation framework

# Download the evaluation framework
git clone https://github.com/EleutherAI/lm-evaluation-harness
# Install it in editable mode
cd lm-evaluation-harness
pip install -e .

4. Download a small model

You can download a small model directly from https://huggingface.co/ to your local machine, or download it from code.

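Model name	Description
`gpt2`	The base GPT-2 model; very small, well suited to a first pass through the evaluation pipeline
`EleutherAI/pythia-160m`	A small model with about 160M parameters; fast to train and evaluate
`stabilityai/stablelm-2-1_6b`	A mid-sized open model (about 1.6B parameters) with a good balance of quality and speed; can run locally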

Taking gpt2 as the example:

# First install transformers
pip install transformers
# Next install torch
pip install torch
# Then install accelerate
pip install accelerate

# Once everything is installed, run the following command to verify
python -c "import torch; import transformers; import accelerate; print('All good!')"
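
When the one-liner above fails, it only reports the first missing import. A small sketch like the following lists every missing package in one go; `missing_packages` is a hypothetical helper, and it only checks that a package can be found, not that it imports cleanly.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    required = ["torch", "transformers", "accelerate"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All good!")
```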

Download the gpt2 model from Python code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # could also be "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

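This code automatically downloads the model weights into the local cache. Recent versions of transformers use ~/.cache/huggingface/hub by default; older versions used ~/.cache/huggingface/transformers.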
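
To see where the files actually landed, the sketch below guesses the cache directory from the environment. It assumes a recent huggingface_hub, where the cache defaults to ~/.cache/huggingface/hub and can be moved with the HF_HOME or HF_HUB_CACHE variables; it ignores XDG_CACHE_HOME for simplicity, and `hf_hub_cache_dir` is a made-up helper name.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None) -> Path:
    """Best-effort guess at the Hugging Face hub cache directory."""
    env = os.environ if env is None else env
    # An explicit cache override wins.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache is the "hub" folder under HF_HOME,
    # which itself defaults to ~/.cache/huggingface.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


if __name__ == "__main__":
    cache = hf_hub_cache_dir()
    print("cache directory:", cache)
    if cache.is_dir():
        # Cached repositories are stored as folders such as models--gpt2.
        for entry in sorted(cache.glob("models--*")):
            print(" ", entry.name)
```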

Common problem 1: the code fails because PyTorch is not installed

Install command (CPU build):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

For the GPU build (CUDA 11.8 wheels, which require an NVIDIA GPU and so do not apply to a Mac):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

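If you are unsure about your GPU or driver setup, start with the CPU build. It is enough to run small models and get some practice.

Verify the installation

Run in Python: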

import torch
print(torch.__version__)
print(torch.cuda.is_available())

The output looks something like:

2.1.0
False

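This means PyTorch installed successfully. The False on the second line only says that no CUDA GPU was detected; the CPU works, and a GPU is optional.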
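
Since this walkthrough runs on a Mac, it can also be worth checking for Apple's MPS backend. The sketch below picks a device string in the order CUDA, MPS, CPU. The function names are made up, and whether a given evaluation runs reliably on `mps` is an assumption you would need to verify yourself; everything in this article uses `cpu`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available accelerators; fall back to CPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query.
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print("suggested device:", detect_device())
```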

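Common problem 2: the code times out because of network problems; a mirror inside China is the best fix

import os
# Point downloads at a mirror to speed them up; set this before importing transformers
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"

# Let transformers manage the cache automatically; do not specify a path by hand
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print("Model loaded successfully!")
print(f"Parameter count: {sum(p.numel() for p in model.parameters()):,}")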

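When loading finishes, the script prints the success message followed by the model's parameter count.

If loading still fails, simply download the model files by hand from the Hugging Face model page. The core files are the model config, the tokenizer files, and the weights file.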
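
As an alternative to clicking through the web page, the download can be scripted with `snapshot_download` from huggingface_hub. This is only a sketch: the mirror URL is the one used earlier in this article, the `allow_patterns` list is an assumption about which files a GPT-2 checkpoint needs, and `download_kwargs` / `download_model` are hypothetical helper names.

```python
import os

# Assumption: route requests through the mirror used earlier in this article.
# This has to be set before huggingface_hub is imported.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")


def download_kwargs(repo_id: str, target_dir: str) -> dict:
    """Build the arguments passed to snapshot_download (hypothetical helper)."""
    return {
        "repo_id": repo_id,
        "local_dir": target_dir,
        # Skip alternative weight formats to save bandwidth (assumed pattern list).
        "allow_patterns": ["*.json", "*.txt", "*.safetensors"],
    }


def download_model(repo_id: str, target_dir: str) -> str:
    """Download a model repository snapshot into target_dir and return its path."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(**download_kwargs(repo_id, target_dir))
```

For example, `download_model("gpt2", os.path.expanduser("~/models/gpt2"))` would place the files in a folder you can later pass as a local model path.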

5. Run the evaluation command

List the available evaluation tasks:

lm-eval ls tasks

Evaluate the model's basic capabilities

For example, to score GPT-2 on the HellaSwag benchmark:

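lm_eval --model hf --model_args pretrained=gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json

Note: if the command fails with a connection error, see Common problem 3 below and run against a local copy of the dataset.

Parameter reference:
--model hf: use the HuggingFace model backend
--model_args pretrained=gpt2: the model name; a local path also works
--tasks hellaswag: name of the evaluation task
--device cpu: with a GPU, this can be set to cuda:0
--batch_size 4: number of samples per batch
--output_path results.json: where to write the evaluation results as JSON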
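
The same parameters map onto the harness's Python entry point, `lm_eval.simple_evaluate`. The sketch below mirrors the command above; `format_model_args` and `run_hellaswag` are made-up helper names, and the `limit` argument is included only as a way to try a small subset first.

```python
def format_model_args(**kwargs) -> str:
    """Join keyword arguments into the comma-separated string lm-eval expects."""
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


def run_hellaswag(pretrained: str = "gpt2", limit=None):
    """Evaluate one model on HellaSwag through the Python API."""
    import lm_eval  # imported lazily so the helper above works without it

    return lm_eval.simple_evaluate(
        model="hf",                                           # --model hf
        model_args=format_model_args(pretrained=pretrained),  # --model_args
        tasks=["hellaswag"],                                  # --tasks
        device="cpu",                                         # --device
        batch_size=4,                                         # --batch_size
        limit=limit,  # e.g. 100 for a quick smoke test on a subset
    )
```

Calling `run_hellaswag(limit=100)["results"]` should return a dictionary keyed by task name, in the same shape as the JSON results in this article. Scores from a limited run are only a smoke test, not a benchmark number.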

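When the evaluation finishes (roughly 5 to 10 minutes; the CPU run logged below took about 16 minutes), you will see something like: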

{"results": {
    "hellaswag_local": {
      "name": "hellaswag_local",
      "alias": "hellaswag_local",
      "sample_len": 10042,
      "acc,none": 0.2891854212308305,
      "acc_stderr,none": 0.004524575892953094,
      "acc_norm,none": 0.31139215295757816,
      "acc_norm_stderr,none": 0.004621163476949437
    }
  }
}

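This means GPT-2's accuracy on HellaSwag is about 28.92%, only slightly above the 25% expected from random guessing among four endings.

- acc,none → accuracy, 28.92%
- acc_stderr,none → standard error of the accuracy, 0.45% (the number after ± in the results table)
- acc_norm,none → length-normalized accuracy, 31.14%
- acc_norm_stderr,none → standard error of the normalized accuracy, 0.46%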
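
To get a feel for where the standard error comes from, here is a small worked calculation. It assumes the harness treats each example as a 0/1 score and uses the sample (n - 1) variance; under that assumption the formula reproduces the reported acc_stderr to several decimal places. The 1.96 factor for a 95% interval is the usual normal approximation, not something the harness prints.

```python
import math


def accuracy_stderr(accuracy: float, n_samples: int) -> float:
    """Standard error of a mean of 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_samples - 1))


def confidence_interval(accuracy: float, stderr: float, z: float = 1.96):
    """Approximate 95% interval under a normal approximation."""
    return accuracy - z * stderr, accuracy + z * stderr


acc = 0.2891854212308305  # "acc,none" from the results above
n = 10042                 # "sample_len" from the results above

stderr = accuracy_stderr(acc, n)
low, high = confidence_interval(acc, stderr)
print(f"stderr = {stderr:.4f}")
print(f"95% interval = [{low:.4f}, {high:.4f}]")
print(f"chance level for 4 choices = {1 / 4:.2f}")
```

The lower end of the interval stays above 0.25, so the score is better than chance, though not by much.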

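The same numbers also appear in the process log file eval_output.log and in the console log output.

You can also evaluate several tasks in one run. Example: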

lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks hellaswag,mmlu \
  --device cpu \
  --batch_size 4 \
  --output_path full_results.json

Here is the run log (from the local-dataset run described under Common problem 3):

2026-03-20:14:26:02 INFO     [_cli.run:377] Including path: /Users/hongshao/dataset/tasks
2026-03-20:14:26:02 INFO     [_cli.run:378] Selected Tasks: ['hellaswag_local']
2026-03-20:14:26:03 INFO     [evaluator:213] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-20:14:26:03 INFO     [evaluator:238] Initializing hf model, with arguments: {'pretrained': '/Users/hongshao/models/gpt2'}
2026-03-20:14:26:05 INFO     [models.huggingface:256] Using device 'cpu'
2026-03-20:14:26:05 INFO     [models.huggingface:518] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}

Loading weights:   0%|          | 0/148 [00:00<?, ?it/s]
Loading weights: 100%|██████████| 148/148 [00:00<00:00, 66519.18it/s]
2026-03-20:14:26:06 INFO     [evaluator_utils:446] Selected tasks:
2026-03-20:14:26:06 INFO     [evaluator_utils:480] Task: hellaswag_local (/Users/hongshao/dataset/tasks/hellaswag_local.yaml)
2026-03-20:14:26:06 INFO     [api.task:312] Building contexts for hellaswag_local on rank 0...

  0%|          | 0/10042 [00:00<?, ?it/s]
  3%|▎         | 296/10042 [00:00<00:08, 1216.45it/s]
  7%|▋         | 727/10042 [00:00<00:03, 2359.78it/s]
 12%|█▏        | 1181/10042 [00:00<00:02, 3112.42it/s]
--------------------------- middle omitted ---------------------------
Running loglikelihood requests: 100%|█████████▉| 40164/40168 [16:02<00:00, 90.43it/s]
Running loglikelihood requests: 100%|██████████| 40168/40168 [16:02<00:00, 41.73it/s]
fatal: not a git repository (or any of the parent directories): .git
2026-03-20:14:42:21 INFO     [loggers.evaluation_tracker:247] Saving results aggregated
hf ({'pretrained': '/Users/hongshao/models/gpt2'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 4
|     Tasks     |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag_local|      1|none  |     0|acc     |↑  |0.2892|±  |0.0045|
|               |       |none  |     0|acc_norm|↑  |0.3114|±  |0.0046|

Common problem 1: ModuleNotFoundError: No module named 'accelerate'

Run this inside the virtual environment:

pip install accelerate

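Common problem 2: httpx.ConnectTimeout: [Errno 60] Operation timed out

Because the evaluation loads the model over the network, it is exposed to network problems. The fix is to download the gpt2 model to local disk and then change how the model is loaded so that it reads from the local directory: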

from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "/Users/hongshao/models/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
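
Before pointing anything at the local folder, a quick sanity check can save a confusing error later. The sketch below only looks for file names; the lists are assumptions for a GPT-2 style checkpoint (other models may ship different files), and `check_model_dir` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names for a GPT-2 style checkpoint; other models may differ.
REQUIRED_FILES = ("config.json",)
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def check_model_dir(model_dir: str) -> list:
    """Return a list of problems found in a local model directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    # Any one of the known weight formats is enough.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("no weights file (" + " or ".join(WEIGHT_FILES) + ")")
    return problems
```

An empty list from `check_model_dir("/Users/hongshao/models/gpt2")` means the basics are in place; tokenizer files are not checked here.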

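The evaluation command should likewise point at the local model:

# This also loads the model from local disk, which avoids the unstable connection
lm_eval --model hf --model_args pretrained=/Users/hongshao/models/gpt2 --tasks hellaswag --device cpu --batch_size 4 --output_path results.json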

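Common problem 3: 'timed out' thrown while requesting HEAD https://huggingface.co/datasets/Rowan/hellaswag/resolve/main/README.md
Retrying in 1s [Retry 1/5].

Cause: the model has already loaded, but lm-evaluation-harness is still trying to download the benchmark dataset from the HuggingFace Hub. The hellaswag dataset is not stored locally by default and has to be fetched over the network, so an unstable or blocked connection produces the timeout.

Solution:

1. Open the HellaSwag dataset page: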

https://huggingface.co/datasets/Rowan/hellaswag

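2. Click Files and versions and download the files into a local directory under /Users/hongshao/dataset/ (the configs below use /Users/hongshao/dataset/hellaswag).

From here the evaluation has to go through a custom task definition, because lm-evaluation-harness has no CLI flag for loading a benchmark dataset from a local path.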

3. Handle the field differences:

Fields in the original hellaswag dataset:
  {
      "activity_label": "Removing ice from car",
      "ctx_a": "Then, the man writes over the snow...",
      "ctx_b": "then",
      "endings": ["option1", "option2", "option3", "option4"],
      "label": "3"  # string type
  }

  Fields the lm-eval task config expects:
  {
      "query": "Removing ice from car: Then, the man writes...",  # must be concatenated
      "choices": ["option1", "option2", "option3", "option4"],
      "gold": 3  # must be an integer
  }

4. Run the evaluation script

4.1) Create a local YAML config file, /Users/hongshao/dataset/tasks/hellaswag_local.yaml

  task: hellaswag_local
  dataset_path: /Users/hongshao/dataset/hellaswag
  dataset_name: null
  output_type: multiple_choice
  training_split: null
  validation_split: validation
  test_split: null
  process_docs: !function utils.process_docs
  doc_to_text: "{{query}}"
  doc_to_target: "{{gold}}"
  doc_to_choice: "choices"
  metric_list:
    - metric: acc
      aggregation: mean
      higher_is_better: true
    - metric: acc_norm
      aggregation: mean
      higher_is_better: true
  metadata:
    version: 1.0

4.2) Create a local utils function file (/Users/hongshao/dataset/tasks/utils.py). The same thing can also be done in pure YAML; see the supplementary notes further down.

  import re

  def preprocess(text):
      text = text.strip()
      text = text.replace(" [title]", ". ")
      text = re.sub("\\[.*?\\]", "", text)
      text = text.replace("  ", " ")
      return text

  def process_docs(dataset):
      def _process_doc(doc):
          ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
          label = doc.get("label", "0")
          try:
              gold = int(label)
          except (ValueError, TypeError):
              gold = 0
          out_doc = {
              "query": preprocess(doc["activity_label"] + ": " + ctx),
              "choices": [preprocess(ending) for ending in doc["endings"]],
              "gold": gold,
          }
          return out_doc
      return dataset.map(_process_doc)

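The process_docs function does three things:
1. Field concatenation: joins activity_label + ctx_a + ctx_b into a complete query
2. Type conversion: turns label from the string "3" into the integer 3
3. Text cleanup: preprocess strips extra spaces and bracketed markup left over in the text, such as [title]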
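
To see these three steps on real data without loading the whole dataset, the sketch below applies the same logic to a single record, the first validation example shown at the end of this article. It is a standalone simplification: `convert` is a made-up name, and it drops the try/except fallback from utils.py because this record has a valid label.

```python
import re


def preprocess(text: str) -> str:
    """Same cleanup steps as the utils.py shown above."""
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub(r"\[.*?\]", "", text)
    return text.replace("  ", " ")


def convert(doc: dict) -> dict:
    """Map one raw HellaSwag record to query/choices/gold."""
    ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
    return {
        "query": preprocess(doc["activity_label"] + ": " + ctx),
        "choices": [preprocess(ending) for ending in doc["endings"]],
        "gold": int(doc["label"]),
    }


sample = {
    "activity_label": "Roof shingle removal",
    "ctx_a": "A man is sitting on a roof.",
    "ctx_b": "he",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",
}

converted = convert(sample)
print(converted["query"])
print(converted["gold"], converted["choices"][converted["gold"]])
```

The model is then scored on how likely it finds each of the four choices as a continuation of the query.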

Run inside the virtual environment:

HF_ENDPOINT=https://hf-mirror.com lm-eval run \
    --model hf \
    --model_args pretrained=/Users/hongshao/models/gpt2 \
    --tasks hellaswag_local \
    --include_path /Users/hongshao/dataset/tasks \
    --device cpu \
    --batch_size 4 \
    --output_path /Users/hongshao/results.json

From here, just sit back and wait for the results.
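
Once the run finishes, the saved JSON can be turned into a short readable summary. The sketch below works on an inline sample in the same shape as the results in this article; `summarize` is a made-up helper, and the exact file name the harness writes may vary by version, so loading from disk is left to you.

```python
import json

# Inline sample in the same shape as the results shown in this article.
SAMPLE = """
{"results": {"hellaswag_local": {
    "alias": "hellaswag_local",
    "acc,none": 0.2891854212308305,
    "acc_stderr,none": 0.004524575892953094,
    "acc_norm,none": 0.31139215295757816,
    "acc_norm_stderr,none": 0.004621163476949437
}}}
"""


def summarize(results: dict) -> list:
    """Return one line per metric, e.g. 'task acc: 28.92% ± 0.45%'."""
    lines = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            name, _, filter_name = key.partition(",")
            # Skip non-metric entries and the stderr entries themselves.
            if not filter_name or name.endswith("_stderr") or not isinstance(value, float):
                continue
            stderr = metrics.get(f"{name}_stderr,{filter_name}")
            line = f"{task} {name}: {value:.2%}"
            if isinstance(stderr, float):
                line += f" ± {stderr:.2%}"
            lines.append(line)
    return lines


for line in summarize(json.loads(SAMPLE)):
    print(line)
```

For a real run, replace `json.loads(SAMPLE)` with `json.load(...)` on the results file written under the output path.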

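Supplementary notes:

YAML only, without creating a Python utils.py file (this variant skips the preprocess text cleanup and reports acc only):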

  task: hellaswag_simple
  dataset_path: /Users/hongshao/dataset/hellaswag
  dataset_name: null
  output_type: multiple_choice
  validation_split: validation
  doc_to_text: "{{activity_label}}: {{ctx_a}} {{ctx_b | capitalize}}"
  doc_to_target: "{{label | int}}"
  doc_to_choice: "{{endings}}"
  metric_list:
    - metric: acc
      aggregation: mean
      higher_is_better: true
  metadata:
    version: 1.0

Ways to inspect the contents of a .parquet file:

1) Use Python + pandas (simplest)

  source venv/bin/activate
  python -c "
  import pandas as pd
  df = pd.read_parquet('/Users/hongshao/dataset/hellaswag/data/validation-00000-of-00001.parquet')
  print(df.head(2))  # print the first 2 rows
  print(df.columns)  # print the column names
  print(df.shape)    # print the shape
  "

2) Use the datasets library directly (it is installed as a dependency of lm-eval)

  source venv/bin/activate
  python -c "
  from datasets import load_dataset
  ds = load_dataset('/Users/hongshao/dataset/hellaswag', split='validation')
  print(ds.features)  # inspect the fields
  print(ds[0])        # inspect the first record
  "

Output:

     === Field definitions ===
     {'ind': Value('int32'), 'activity_label': Value('string'), 'ctx_a': Value('string'), 'ctx_b': Value('string'), 'ctx': Value('string'), 'endings': List(Value('string')), 'source_id': Value('string'), 'split': Value('string'), 'split_type': Value('string'),
     'label': Value('string')}

     === First record ===
     ind: 24
     activity_label: Roof shingle removal
     ctx_a: A man is sitting on a roof.
     ctx_b: he
     ctx: A man is sitting on a roof. he
     endings: ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', "is holding a rubik's cube.", 'starts pulling up roofing on a roof.']
     source_id: activitynet~v_-JhWjGDPHMY
     split: val
     split_type: indomain
     label: 3
