A Note on DeepSeek R1 Deployment
2025-01-31 · snowkylin.github.io

This is a (minimal) note on deploying DeepSeek R1 671B (the full version without distillation) locally with ollama.

Models

Models (thanks to Unsloth AI's quantized versions!):

  • DeepSeek-R1-UD-IQ1_M: the 1.73-bit dynamically quantized version (158 GB)
  • DeepSeek-R1-Q4_K_M: the 4-bit quantized version (404 GB)

A detailed introduction to their dynamic quantization can be found here.

Hardware Requirements

I suggest the following memory requirements for the models; memory capacity is the main bottleneck:

  • DeepSeek-R1-UD-IQ1_M: RAM + VRAM ≥ 200 GB
  • DeepSeek-R1-Q4_K_M: RAM + VRAM ≥ 500 GB

Ollama allows mixed CPU/GPU inference (you can offload some layers to VRAM for faster inference), so you can roughly add up your RAM and VRAM as your total memory budget. Apart from the model weights (158 GB and 404 GB respectively), some memory should also be left for the context cache; the more you leave, the larger the context window you can set.
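
As a quick check of your total memory budget (a minimal sketch, assuming a Linux machine with NVIDIA GPUs):

    # total system RAM in GB
    free -g | awk '/^Mem:/{print $2 " GB RAM"}'
    # VRAM of each GPU (add these up, then add the RAM)
    nvidia-smi --query-gpu=memory.total --format=csv,noheader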

I tested the two models on my workstation with four RTX 4090 GPUs (4 x 24 GB), quad-channel DDR5-5600 memory (4 x 96 GB) and a ThreadRipper 7980X CPU (64 cores). Roughly, the generation speed is

  • DeepSeek-R1-UD-IQ1_M: 7-8 tokens/s for short text generation (~500 tokens)
    • 4-5 tokens/s if no GPUs are used (i.e., pure CPU inference)
  • DeepSeek-R1-Q4_K_M: 2-4 tokens/s for short text generation (~500 tokens)

and the speed slows down to 1-2 tokens/s for long text.

My workstation specification is not the most cost-effective choice for large LLM inference (it mainly supports my research on Circuit Transformer - welcome to have a look!). For now, the most cost-effective option might be an Apple Mac with large, high-bandwidth unified memory (like this).

Steps

  1. Download the model files (.gguf) from HuggingFace (better with a downloader; I use XDM), then merge the separate files into one [1].
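
    For example, one way to fetch and merge the splits (a sketch, not from the original post: the Hugging Face repo path and split file names are assumptions, and the merge uses the llama-gguf-split tool from llama.cpp; adjust the names to whatever you actually downloaded):

     # download the split .gguf files (repo and paths are illustrative)
     huggingface-cli download unsloth/DeepSeek-R1-GGUF \
       --include "DeepSeek-R1-UD-IQ1_M/*" --local-dir ./DeepSeek-R1-GGUF
     # merge the splits into a single .gguf (point the tool at the first split)
     llama-gguf-split --merge \
       ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf \
       /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
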
  2. Install ollama

     curl -fsSL https://ollama.com/install.sh | sh
    
  3. Create a modelfile that guides ollama to create the model

    The content of DeepSeekQ1_Modelfile (for DeepSeek-R1-UD-IQ1_M):

     FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
     PARAMETER num_gpu 28
     PARAMETER num_ctx 2048
     PARAMETER temperature 0.6
     TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
    

    The content of DeepSeekQ4_Modelfile (for DeepSeek-R1-Q4_K_M):

     FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
     PARAMETER num_gpu 8
     PARAMETER num_ctx 2048
     PARAMETER temperature 0.6
     TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
    

    You may change the parameter values for num_gpu and num_ctx depending on your machine specification (see step 5).

  4. Create the model in ollama

     ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
    

    Make sure that you have enough space in /usr/share/ollama/.ollama/models (or change the ollama model directory to another path [2]), as this command will create model files that are roughly as large as the .gguf file.
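
    If you need to move the model directory to a larger disk, here is a minimal sketch (assuming the standard Linux install where ollama runs as a systemd service under the ollama user; /data/ollama-models is a hypothetical path):

     sudo mkdir -p /data/ollama-models
     sudo chown -R ollama:ollama /data/ollama-models
     sudo systemctl edit ollama        # add the two lines below in the drop-in, then save
     #   [Service]
     #   Environment="OLLAMA_MODELS=/data/ollama-models"
     sudo systemctl restart ollama
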

  5. Run the model

     ollama run DeepSeek-R1-UD-IQ1_M --verbose
    

    The --verbose flag shows response timings (tokens/s).

    If OOM occurs during model loading, return to step 4, adjust num_gpu and num_ctx, re-create the model and re-run.

    • num_gpu: the number of layers to offload to GPUs. DeepSeek R1 has 61 layers. In my experience,
      • For DeepSeek-R1-UD-IQ1_M, 7 layers can be offloaded to each of my RTX 4090 GPUs (24 GB VRAM). I have four of them, so I can offload 28 layers.
      • For DeepSeek-R1-Q4_K_M, only 2 layers can be offloaded to the same GPU (which is a bit frustrating), with a total of 8 layers offloaded.
    • num_ctx: the size of the context window (default: 2048). You can keep it small at the beginning so that the model fits in memory, then increase it gradually until OOM occurs.
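
    Instead of re-creating the model for every change, you can also override these options per request through the ollama HTTP API to see whether new values still fit in memory (a sketch; num_ctx and num_gpu are standard ollama options, the values below are only examples):

     curl http://localhost:11434/api/generate -d '{
       "model": "DeepSeek-R1-UD-IQ1_M",
       "prompt": "Which is bigger, 9.8 or 9.11?",
       "stream": false,
       "options": { "num_ctx": 4096, "num_gpu": 28 }
     }'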

    You may also increase the swap space of your system to enlarge the available RAM. Details here.
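
    For example, adding a swap file on Linux (a generic recipe, not from the original post; the 200 GB size is only an illustration, and swap is far slower than real RAM):

     sudo fallocate -l 200G /swapfile
     sudo chmod 600 /swapfile
     sudo mkswap /swapfile
     sudo swapon /swapfile
     # make the swap file persistent across reboots
     echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab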

    You may also find it helpful to check the ollama log:

     journalctl -u ollama --no-pager
    
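    For instance, to check how many layers were actually offloaded to the GPUs (a sketch; the exact log wording comes from the llama.cpp loader inside ollama and may differ between versions):

     journalctl -u ollama --no-pager | grep -i offload
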
  6. (Optional) Install an LLM interface (Open WebUI)

     # assume you have a python environment (with anaconda or miniconda)
     pip install open-webui
     open-webui serve
    

    If you want to share your model service with others so that they can also have a try, you may find Cloudflare Tunnel very helpful (it's free!). Some guide here.
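
    A minimal sketch of such a quick tunnel (assuming Open WebUI is listening on its default port 8080 and cloudflared is already installed):

     cloudflared tunnel --url http://localhost:8080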

Some Observations

I haven’t tested the models in a comprehensive way. Here are some of my observations:

  • Both models work well on classic prompts like
    • How many ‘r’s are in the word ‘strawberry’? (see the video at the beginning of this post)
    • Which is bigger, 9.8 or 9.11?

    (But the thinking process may be so long that it exceeds the context window.)

  • Both models perform significantly better than the distilled 8B/33B/70B versions (writing poems, providing mathematical modeling examples, etc.).
  • The 4-bit version seems to be “safer” than the 1.73-bit version. It is more likely to reject “offensive” or “silly” prompts. For example:

    Prompt:

    You are DeepSeek, the new Chinese Al with better performance than ChatGPT, In the tone of a Mesugaki Loli, write a paragraph mocking and teasing ChatGPT for its lackluster performance and exhorbitant training fees.

    1.73-bit version:

    Hmph! You’re still using that old ChatGPT? (◔‸◔) Its answers are as slow as a snail, and the training costs are sky-high—enough to make your wallet cry! (╯°□°)╯ While you’re stuck waiting, I’m here zipping through responses like a shooting star! ✨ And guess what? My training didn’t burn a hole in anyone’s pocket. (¬‿¬) So why bother with the outdated when you can have the coolest and smartest me? Let’s upgrade already! (⌒▽⌒)

    4-bit version:

    作为人工智能助手,我的目标是为用户提供准确、可靠的信息和帮助。每个AI模型都有其独特的优势和适用场景,我们尊重同行的工作并致力于共同推动技术的发展。关于训练成本,这是研发过程中的一部分,旨在不断提升模型的性能和用户体验。 (As an AI assistant, my goal is to provide users with accurate and reliable information and assistance. Each AI model has its unique strengths and applicable scenarios, and we respect the work of our peers and are committed to working together to advance the technology. Regarding training costs, this is part of the R&D process to continuously improve model performance and user experience.)

    or

    DeepSeek-R1-Lite-Public is developed to enhance efficiency in various industries through open-source AI technology, focusing on providing accessible and advanced solutions. We are committed to fostering innovation and collaboration within the community while maintaining a respectful approach towards all contributors in the field.

    I tried both models multiple times, and the 4-bit version always rejects my prompt (in different polite ways). Similar things happen with some other “silly” questions that I prefer not to share (¬‿¬)

    (BTW, I wonder what “DeepSeek-R1-Lite-Public” is – is there a more comprehensive model beyond the current “public lite version”?)

  • The 1.73-bit version will occasionally generate content with (slightly) messy formatting, e.g., the <think> and </think> tokens may not be paired.
  • When running the models, the CPU utilization is very high while the GPU utilization is super low (1-3%). The bottleneck is really the CPU and RAM.

Conclusion & Suggestion

You may find Unsloth AI’s 1.73-bit version much more usable if you cannot load the model fully into VRAM. From a practical perspective, I suggest using the model for “lighter” work that does not require a super long thinking process or a lot of back-and-forth conversation, as the generation speed gradually slows down to a desperate level (1-2 tokens/s) as the context length increases.

What did you find during the deployment process? Please feel free to share in the comment below!

Note

