Deploying Xinference with Docker and Calling It from Dify (How to Add a Rerank Model to a Dify Environment)

Original post. Last modified 2025-06-23 13:08:01.

This content is licensed under CC 4.0 BY-SA.

Tags: #docker #containers #Dify #Xinference

First published 2025-06-11 21:04:22.

1.Why Xinference

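Xorbits Inference (Xinference) is a powerful, full-featured, open-source distributed inference framework designed for large-scale model inference. It supports large language models (LLMs), multimodal models, speech recognition models, and other model types. It fills a gap left by Ollama, which focuses on LLMs and cannot deploy models such as rerankers.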

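Key features of Xinference:

One-click model deployment: greatly simplifies deploying large language models, multimodal models, and speech recognition models.
Built-in state-of-the-art models: supports one-click download and deployment of many leading open-source models, rerank models in particular.
Heterogeneous hardware support: can run inference on both CPU and GPU, improving cluster throughput and reducing latency.
Flexible APIs: offers several interfaces, including RPC and a RESTful API, and is compatible with the OpenAI protocol, which makes integration with existing systems easy.
Distributed architecture: supports deployment across devices and servers, allows high-concurrency inference, and simplifies scaling out and in.
Third-party integration: works with popular open-source tools such as Dify for quickly building AI applications.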

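For example, Dify natively supports the model providers listed below. Xinference covers large language models, embedding models, and rerank models, which Ollama does not: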

Provider	LLM	Text Embedding	Rerank	Speech to text	TTS
OpenAI	✔️(🛠️)(👓)	✔️		✔️	✔️
Anthropic	✔️(🛠️)
Azure OpenAI	✔️(🛠️)(👓)	✔️		✔️	✔️
Gemini	✔️
Google Cloud	✔️(👓)	✔️
Nvidia API Catalog	✔️	✔️	✔️
Nvidia NIM	✔️
Nvidia Triton Inference Server	✔️
AWS Bedrock	✔️	✔️
OpenRouter	✔️
Cohere	✔️	✔️	✔️
together.ai	✔️
Ollama	✔️	✔️
Mistral AI	✔️
groqcloud	✔️
Replicate	✔️	✔️
Hugging Face	✔️	✔️
Xorbits inference	✔️	✔️	✔️	✔️	✔️
Zhipu AI	✔️(🛠️)(👓)	✔️
Baichuan	✔️	✔️
iFlytek Spark	✔️
Minimax	✔️(🛠️)	✔️
Tongyi Qianwen	✔️	✔️			✔️
ERNIE Bot (Wenxin Yiyan)	✔️	✔️
Moonshot AI	✔️(🛠️)
Tencent Cloud				✔️
StepFun	✔️
Volcano Engine	✔️	✔️
01.AI	✔️
360 Zhinao	✔️
Azure AI Studio	✔️		✔️
deepseek	✔️(🛠️)
Tencent Hunyuan	✔️
SILICONFLOW	✔️	✔️
Jina AI		✔️	✔️
ChatGLM	✔️
Xinference	✔️(🛠️)(👓)	✔️	✔️
OpenLLM	✔️	✔️
LocalAI	✔️	✔️	✔️	✔️
OpenAI API-Compatible	✔️	✔️		✔️
PerfXCloud	✔️	✔️
Lepton AI	✔️
novita.ai	✔️
Amazon Sagemaker	✔️	✔️	✔️
Text Embedding Inference		✔️	✔️
GPUStack	✔️(🔧️)(👓)	✔️	✔️	✔️	✔️

Here (🛠️) indicates Function Calling support and (👓) indicates vision capability.

In practice, Xinference supports even more model types:

Feature	Xinference	FastChat	OpenLLM	RayLLM
OpenAI-Compatible RESTful API	✅	✅	✅	✅
vLLM Integrations	✅	✅	✅	✅
More Inference Engines (GGML, TensorRT)	✅	❌	✅	✅
More Platforms (CPU, Metal)	✅	✅	❌	❌
Multi-node Cluster Deployment	✅	❌	❌	✅
Image Models (Text-to-Image)	✅	✅	❌	❌
Text Embedding Models	✅	❌	❌	❌
Multimodal Models	✅	❌	❌	❌
Audio Models	✅	❌	❌	❌
More OpenAI Functionalities (Function Calling)	✅	❌	❌	❌

2.Docker Desktop

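First install Docker Desktop. It needs a backend engine, and Windows Subsystem for Linux (WSL) is the recommended choice.

WSL 2: on Windows, Docker can use WSL 2 as the backend engine for running Linux containers. WSL 2 is a lightweight virtual machine environment that uses a real Linux kernel maintained and optimized by Microsoft, which gives better system compatibility and performance. It is currently the recommended way to run Docker on Windows.

If WSL has never been installed on the PC, first run the following in PowerShell: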

wsl --install

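Running wsl --install did not install anything; it only printed the WSL help text. Running wsl --list --online to list the available distributions then returned error code 0x80072ee7. A quick search suggested this is "usually caused by network problems or a Microsoft Store connection failure." In fact the Microsoft Store was reachable, so the workaround was to download Windows Subsystem for Linux directly from the Microsoft Store.

Once WSL is installed, download Docker Desktop from the Store and run it. Then pull the Xinference image (this requires a proxy or VPN to get past network restrictions).

Alternatively, pull the latest Xinference release from the Alibaba Cloud public image registry: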

docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest

The image can now also be pulled directly:

docker pull xprobe/xinference

The resulting Docker image is fairly large (over 40 GB on some machines and close to 27 GB on others, for reasons unknown).

3.Setup Xinference

On Windows, run the following in PowerShell:

docker run -d `
    --name xinference `
    -v /xinference/data/.xinference:/root/.xinference `
    -v /xinference/data/.cache/huggingface:/root/.cache/huggingface `
    -v /xinference/data/.cache/modelscope:/root/.cache/modelscope `
    -v /xinference/log:/workspace/xinference/logs `
    -e XINFERENCE_HOME=/xinference `
    -p 9998:9997 `
    --gpus all `
    xprobe/xinference:latest `
    xinference-local -H 0.0.0.0 --log-level debug

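Parameter details

Parameter / option	Description
`-d`	Runs the container in detached (background) mode so it does not block the terminal.
`--name xinference`	Names the container `xinference`, which makes later management (stop, restart) easier.
`-v host_path:container_path`	Mounts a volume for data persistence.
`-e XINFERENCE_HOME=/xinference`	Sets the Xinference home directory to `/xinference`, used for configuration files and caches.
`-p 9998:9997`	Publishes container port 9997 on host port 9998.
`--gpus all`	Makes all host GPUs available to the container.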

Volume mounts

/xinference/data/.xinference:/root/.xinference
Persists Xinference configuration files (such as model metadata and service configuration).
/xinference/data/.cache/huggingface:/root/.cache/huggingface
Caches model files downloaded from Hugging Face so they are not downloaded again.
/xinference/data/.cache/modelscope:/root/.cache/modelscope
Caches model files downloaded from ModelScope so they are not downloaded again.
/xinference/log:/workspace/xinference/logs
Persists Xinference log files for later analysis.

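Production deployment
1. Naming the container and mounting data volumes ensures that configuration and model caches survive a service restart.
2. Persisted logs make troubleshooting easier, especially when model loading or inference fails.
Model cache optimization
1. With the Hugging Face and ModelScope cache directories mounted, models that have already been downloaded are not downloaded again even if the container is rebuilt, which saves bandwidth and time.
Management tips
- Stop the service: docker stop xinference
- Start the service: docker start xinference
- View logs: docker logs xinference
- Enter the container: docker exec -it xinference bash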
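
Besides docker logs, one quick way to confirm that the service is reachable is to ask it which models it is currently serving. The sketch below is a minimal illustration, not part of the original setup. It assumes the server exposes an OpenAI-style `GET /v1/models` endpoint that returns `{"data": [{"id": ...}]}`, and that the host port is 9998 as in the `-p 9998:9997` mapping. The function names are hypothetical helpers.

```python
import json
import urllib.request


def models_endpoint(base_url: str) -> str:
    """Return the model-listing URL for a Xinference server (assumed path: /v1/models)."""
    return base_url.rstrip("/") + "/v1/models"


def running_model_uids(payload: dict) -> list:
    """Extract model UIDs from a /v1/models response body.

    Assumes the OpenAI-style shape {"data": [{"id": ...}, ...]}.
    """
    return [item["id"] for item in payload.get("data", [])]


def list_running_models(base_url: str, timeout: float = 5.0) -> list:
    """Query the server and return the UIDs of the models it is serving."""
    with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
        return running_model_uids(json.load(resp))
```

From the Docker host, calling `list_running_models("http://localhost:9998")` should return an empty list right after startup and include a model's UID once that model has been launched. A connection error usually means the container is not running or the port mapping differs.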

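The command given in the official Xinference documentation is shown below. It starts Xinference inside the container, maps container port 9997 to host port 9998, and sets the log level to DEBUG. Any required environment variables can be passed as well: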

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug

For details, see the "Docker Image" page of the Xinference documentation.

4.Rerank Model

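Open the management page on port 9998; the interface language can be switched to Chinese.

Download a rerank model

bge-reranker-v2-m3: a lightweight multilingual model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is built on bge-m3, supports more than 100 languages, is simple to deploy, and is fast at inference. It directly outputs a relevance score for a query and a document, and performs well on benchmarks such as BEIR and CMTEB. It suits multilingual retrieval scenarios such as cross-language information retrieval, and it also handles long text (reranking inputs of up to 8192 tokens), for example contract or paper retrieval. Pairing it with BGE-M3 is recommended for building a "retrieve, rerank, generate" pipeline.

Note: when downloading models, prefer the ModelScope source.

Once launched successfully, the model shows up as running (screenshot in the original post).

In the same way, an LLM can also be run locally (screenshot in the original post).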
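
To check the rerank model outside the web UI, it can be called over the REST API. The sketch below is an illustration under stated assumptions rather than the author's method. It assumes a `POST /v1/rerank` endpoint that accepts `model`, `query`, `documents`, and an optional `top_n`, and that returns `{"results": [{"index": ..., "relevance_score": ...}]}`. The helper names are hypothetical.

```python
import json
import urllib.request


def build_rerank_payload(model_uid, query, documents, top_n=None):
    """Build the JSON body for a rerank request (field names are assumptions)."""
    payload = {"model": model_uid, "query": query, "documents": list(documents)}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def order_by_relevance(documents, response):
    """Pair each returned result with its document text, highest score first."""
    results = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


def rerank(base_url, model_uid, query, documents, top_n=None, timeout=30.0):
    """Send a rerank request and return (document, score) pairs, best first."""
    body = json.dumps(
        build_rerank_payload(model_uid, query, documents, top_n)
    ).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return order_by_relevance(documents, json.load(resp))
```

A call such as `rerank("http://localhost:9998", "bge-reranker-v2-m3", "What is Dify?", docs)` run on the Docker host should put the most relevant entries of `docs` first. The model UID must match the one shown in the management page.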

5.Work in Dify

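The next step is to configure Dify to call the rerank model deployed in Xinference.

The required fields are the model name, the server URL, and the model UID. (In Dify 1.4 the server URL is displayed as a masked *** password field. The reason is unclear, and it is most likely a bug.)

Model name: bge-reranker-v2-m3 (copy it from the Xinference management page)
Server URL: http://host.docker.internal:9997 (the port must be the host port that the container publishes; with the -p 9998:9997 mapping from section 3 this becomes http://host.docker.internal:9998)
Model UID: bge-reranker-v2-m3

After that, the rerank model can be used in Dify knowledge bases.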
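
A common stumbling block is the port in the server URL. When Dify itself runs in Docker, it reaches Xinference through the host, so the URL needs the host-side port of the `-p` mapping, not the container-side one. The sketch below illustrates that rule under the assumption that Dify runs in Docker on the same machine and that `host.docker.internal` resolves to the host. Both function names are hypothetical, and only the "host:container" and "ip:host:container" forms of `-p` are handled.

```python
def host_port(publish_spec: str) -> int:
    """Return the host-side port from a docker `-p` value such as "9998:9997"."""
    parts = publish_spec.split(":")
    if len(parts) < 2:
        raise ValueError("expected 'host:container' or 'ip:host:container'")
    # The host port is always the second-to-last field in both supported forms.
    return int(parts[-2])


def dify_server_url(publish_spec: str, host: str = "host.docker.internal") -> str:
    """Build the server URL to enter in Dify for a given `-p` mapping."""
    return f"http://{host}:{host_port(publish_spec)}"
```

With the mapping used in section 3, `dify_server_url("9998:9997")` gives `http://host.docker.internal:9998`. A container published with `-p 9997:9997` would instead be reached on port 9997.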
