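Below are three PyTorch inference examples for a locally deployed DeepSeek large language model, covering different scenarios and optimization strategies:
Example 1: Basic inference (full model load)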
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Configuration
model_path = "/path/to/deepseek-model"  # local model directory
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # placement is handled here, so no extra .to(device) on the model
)

# Inference function
def generate_text(prompt, max_length=100):
    # max_length counts the prompt tokens as well as the generated ones
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # passes input_ids together with attention_mask
            max_length=max_length,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage example (prompt: "The future direction of artificial intelligence is")
print(generate_text("人工智能的未来发展方向是"))
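Notes:
- The whole model has to fit in available GPU memory
- device_map="auto" distributes the model across multiple GPUs automatically
- Best suited to setups with ample GPU memory (e.g. A100 80GB)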
Example 2: Sharded model loading (large-model optimization)
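from threading import Thread

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    TextIteratorStreamer,
)

model_path = "/path/to/deepseek-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Option A: from_pretrained reads the sharded checkpoint and places the layers itself
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload",  # directory for weights that spill out of GPU/CPU memory
    offload_state_dict=True,
    low_cpu_mem_usage=True,
)

# Option B (use instead of Option A, not after it): build an empty model,
# then load and dispatch the sharded checkpoint manually
# config = AutoConfig.from_pretrained(model_path)
# with init_empty_weights():
#     model = AutoModelForCausalLM.from_config(config)
# model = load_checkpoint_and_dispatch(
#     model,
#     model_path,
#     device_map="auto",
#     offload_folder="offload",
#     no_split_module_classes=["DeepseekBlock"],  # change to match the actual model structure
# )

def stream_generate(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # The streamer yields decoded text chunks while generate() runs in a background thread
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for text in streamer:
        yield text
    thread.join()

# Streamed output (prompt: "The advantages of quantum computing include")
for chunk in stream_generate("量子计算的优势包括"):
    print(chunk, end="", flush=True)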
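Notes:
- Uses the sharded loading provided by the HuggingFace Accelerate library
- Allows different layers to be placed on different devices
- Requires the saved model to include the model.safetensors.index.json shard index file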
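Before attempting a sharded load, it can help to confirm that the index file and every shard it names are actually present. The sketch below assumes the usual layout of a HuggingFace shard index (a `weight_map` from tensor name to shard file, plus optional `metadata.total_size`); the helper name `summarize_shards` is made up for this illustration and is not part of any library.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_shards(model_dir, index_name="model.safetensors.index.json"):
    """Read the shard index in model_dir and report which shard files it expects."""
    model_dir = Path(model_dir)
    with open(model_dir / index_name, encoding="utf-8") as f:
        index = json.load(f)

    # weight_map maps each tensor name to the shard file that stores it
    weight_map = index["weight_map"]
    tensors_per_shard = Counter(weight_map.values())

    # Shards listed in the index but absent from the directory
    missing = sorted(
        shard for shard in tensors_per_shard if not (model_dir / shard).exists()
    )
    return {
        "total_size_bytes": index.get("metadata", {}).get("total_size"),
        "tensors_per_shard": dict(tensors_per_shard),
        "missing_shards": missing,
    }


# Usage (the path is a placeholder, as in the examples above):
# report = summarize_shards("/path/to/deepseek-model")
# print(report["missing_shards"])
```

An empty `missing_shards` list only shows that the files exist; it does not verify their contents.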
Example 3: 8-bit quantized inference (GPU memory optimization)
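import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/path/to/deepseek-model"

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=["lm_head"],  # keep the output head unquantized
)
# 4-bit alternative (needs a bitsandbytes build with 4-bit support)
# quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for batched generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def batch_inference(prompts, batch_size=4):
    results = []
    # Process the prompts in chunks of batch_size
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=50,
                num_beams=3,
            )
        results.extend(
            tokenizer.decode(out, skip_special_tokens=True) for out in outputs
        )
    return results

# Batch inference example
results = batch_inference([
    "解释机器学习中的注意力机制:",  # "Explain the attention mechanism in machine learning:"
    "写一首关于AI的诗:",  # "Write a poem about AI:"
    "翻译成英文:深度学习需要大量数据",  # "Translate into English: deep learning needs large amounts of data"
])
for res in results:
    print("------\n" + res)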
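Notes:
- 8-bit quantization roughly halves the memory taken by the weights compared with FP16
- 4-bit quantization needs a GPU and CUDA setup that bitsandbytes supports; Ampere or newer is a safe choice
- Quantization costs a small amount of accuracy (on the order of a 1-3% drop, depending on the model and task)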
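The memory figures above follow from simple bytes-per-parameter arithmetic. The sketch below is a rough estimate of weight memory only: it ignores activations, the KV cache, quantization metadata, and modules left unquantized (such as `lm_head`), so real usage will be somewhat higher. The 7-billion parameter count is a hypothetical size chosen for illustration.

```python
# Approximate storage cost per parameter for each weight format
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def savings_vs_fp16(dtype):
    """Fraction of weight memory saved relative to float16."""
    return 1 - BYTES_PER_PARAM[dtype] / BYTES_PER_PARAM["float16"]


# Hypothetical 7B-parameter model
num_params = 7e9
for dtype in ("float16", "int8", "int4"):
    gib = weight_memory_gib(num_params, dtype)
    print(f"{dtype:>8}: {gib:6.2f} GiB ({savings_vs_fp16(dtype):.0%} saved vs float16)")
```

By this estimate, 8-bit weights save 50% and 4-bit weights save 75% relative to FP16.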
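Key considerations:
- Model compatibility:
  - Confirm the model format (HF format or PyTorch .bin)
  - Check the auto_map entry in config.json
- Dependency versions (quote each requirement so the shell does not treat > as a redirect):
  pip install "transformers>=4.35.0" "accelerate>=0.25.0" "bitsandbytes>=0.41.0"
- Performance monitoring:
  # Add profiling
  with torch.profiler.profile(
      activities=[torch.profiler.ProfilerActivity.CUDA]
  ) as prof:
      generate_text("测试输入")
  print(prof.key_averages().table(sort_by="cuda_time_total"))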
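The minimum versions listed above can also be checked from Python before loading a model. This is a minimal sketch using only the standard library; the helper names `as_tuple` and `check_versions` are made up for this example, and the version parsing is deliberately simple (it compares only the first three numeric components).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the pip command above
MINIMUMS = {"transformers": "4.35.0", "accelerate": "0.25.0", "bitsandbytes": "0.41.0"}


def as_tuple(ver):
    """Turn '4.35.0' or '2.1.0+cu121' into a comparable 3-tuple of integers."""
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    parts = parts[:3]
    return tuple(parts + [0] * (3 - len(parts)))


def check_versions(minimums=MINIMUMS, lookup=version):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


# Usage: an empty dict means every requirement is satisfied
# print(check_versions())
```

The `lookup` parameter exists only so the check can be exercised without the packages installed.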
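Choose an approach based on your actual hardware:
- Single GPU with limited memory: use quantization (Example 3)
- Multiple GPUs with ample memory: use sharded loading (Example 2)
- Rapid prototyping: use the basic approach (Example 1)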
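The selection rule above can be written down as a small helper. This is only a sketch: the function name `choose_example`, the 0.8 headroom factor (reserving memory for activations and the KV cache), and the idea of comparing FP16 weight size against GPU memory are assumptions made for illustration, not thresholds stated in this article.

```python
def choose_example(weight_gib, gpu_mem_gib, num_gpus=1, headroom=0.8):
    """Suggest one of the three examples from FP16 weight size and GPU memory.

    weight_gib   -- estimated FP16 weight size in GiB
    gpu_mem_gib  -- memory of a single GPU in GiB
    num_gpus     -- number of identical GPUs available
    headroom     -- fraction of GPU memory assumed usable for weights (assumption)
    """
    usable_per_gpu = gpu_mem_gib * headroom
    if weight_gib <= usable_per_gpu:
        return "example 1: basic loading"
    if num_gpus > 1 and weight_gib <= usable_per_gpu * num_gpus:
        return "example 2: sharded loading"
    return "example 3: quantized loading"


# Hypothetical setups
print(choose_example(13, 80))              # small model on one large GPU
print(choose_example(130, 80, num_gpus=4)) # large model spread over four GPUs
print(choose_example(13, 8))               # small GPU, weights do not fit in FP16
```

Quantization and sharding can also be combined when even the quantized weights exceed a single GPU.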