# DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

## Introduction

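Today, we are introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times.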
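To put these headline figures in perspective, the short calculation below derives a few ratios from them. It is only a back-of-the-envelope sketch: the input numbers are the ones quoted in this introduction, while the variable names and the derived quantities are illustrative and not part of the original report.

```python
# Figures quoted in the introduction (parameters in billions).
total_params_b = 236.0
activated_params_b = 21.0
kv_cache_reduction = 0.933      # KV cache reduced by 93.3% vs. DeepSeek 67B
training_cost_saving = 0.425    # training cost reduced by 42.5% vs. DeepSeek 67B

# Derived quantities (illustrative only).
activated_fraction = activated_params_b / total_params_b
kv_cache_remaining = 1.0 - kv_cache_reduction
kv_cache_shrink_factor = 1.0 / kv_cache_remaining
training_cost_remaining = 1.0 - training_cost_saving

print(f"share of parameters active per token: {activated_fraction:.1%}")
print(f"KV cache relative to DeepSeek 67B:    {kv_cache_remaining:.1%}")
print(f"KV cache shrink factor:               {kv_cache_shrink_factor:.1f}x")
print(f"training cost relative to 67B model:  {training_cost_remaining:.1%}")
```

In other words, under these figures fewer than one in ten parameters is active for any given token, and the KV cache is roughly fifteen times smaller than that of the dense 67B model.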

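We pretrained DeepSeek-V2 on a diverse, high-quality corpus comprising 8.1 trillion tokens. This comprehensive pretraining was followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unleash the model's capabilities. The evaluation results validate the effectiveness of our approach: DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.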

## News

- 2024.05.16: We released DeepSeek-V2-Lite.
- 2024.05.06: We released DeepSeek-V2.

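## Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Download |
| :--- | :---: | :---: | :---: | :---: |
| DeepSeek-V2-Lite | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2-Lite-Chat (SFT) | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2 | 236B | 21B | 128k | 🤗 HuggingFace |
| DeepSeek-V2-Chat (RL) | 236B | 21B | 128k | 🤗 HuggingFace |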

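Due to the constraints of HuggingFace, the open-source code currently runs slower on GPUs with HuggingFace than our internal codebase does. To facilitate efficient execution of our model, we offer a dedicated vLLM solution that optimizes the model's runtime performance.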

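## Evaluation Results

### Base Model

#### Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | LLaMA3 70B | Mixtral 8x22B | DeepSeek-V1 (Dense-67B) | DeepSeek-V2 (MoE-236B) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMLU | English | 78.9 | 77.6 | 71.3 | 78.5 |
| BBH | English | 81.0 | 78.9 | 68.7 | 78.9 |
| C-Eval | Chinese | 67.5 | 58.6 | 66.1 | 81.7 |
| CMMLU | Chinese | 69.3 | 60.0 | 70.8 | 84.0 |
| HumanEval | Code | 48.2 | 53.1 | 45.1 | 48.8 |
| MBPP | Code | 68.6 | 64.2 | 57.4 | 66.6 |
| GSM8K | Math | 83.0 | 80.3 | 63.4 | 79.2 |
| Math | Math | 42.2 | 42.5 | 18.7 | 43.6 |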

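#### Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
| :--- | :---: | :---: | :---: | :---: |
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| Math | Math | 3.3 | 4.3 | 17.1 |

For more evaluation details, such as few-shot settings and prompts, please check our paper.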

#### Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

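### Chat Model

#### Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | QWen1.5 72B Chat | Mixtral 8x22B | LLaMA3 70B Instruct | DeepSeek-V1 Chat (SFT) | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MMLU | English | 76.2 | 77.8 | 80.3 | 71.1 | 78.4 | 77.8 |
| BBH | English | 65.9 | 78.4 | 80.1 | 71.7 | 81.3 | 79.7 |
| C-Eval | Chinese | 82.2 | 60.0 | 67.9 | 65.2 | 80.9 | 78.0 |
| CMMLU | Chinese | 82.9 | 61.0 | 70.7 | 67.8 | 82.4 | 81.6 |
| HumanEval | Code | 68.9 | 75.0 | 76.2 | 73.8 | 76.8 | 81.1 |
| MBPP | Code | 52.2 | 64.4 | 69.8 | 61.4 | 70.4 | 72.0 |
| LiveCodeBench (0901-0401) | Code | 18.8 | 25.0 | 30.5 | 18.3 | 28.7 | 32.5 |
| GSM8K | Math | 81.9 | 87.9 | 93.2 | 84.1 | 90.8 | 92.2 |
| Math | Math | 40.6 | 49.8 | 48.5 | 32.6 | 52.7 | 53.9 |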

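#### Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
| :--- | :---: | :---: | :---: | :---: |
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| Math | Math | 14.7 | 15.2 | 27.9 |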

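#### English Open-Ended Generation Evaluation

We evaluate our model on AlpacaEval 2.0 and MTBench, showing the competitive performance of DeepSeek-V2-Chat-RL on English conversation generation.

#### Chinese Open-Ended Generation Evaluation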

Alignbench (https://arxiv.org/abs/2311.18743)

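| Model | Open/Closed Source | Total Score | Chinese Reasoning | Chinese Language |
| :--- | :---: | :---: | :---: | :---: |
| gpt-4-1106-preview | Closed source | 8.01 | 7.73 | 8.29 |
| DeepSeek-V2 Chat (RL) | Open source | 7.91 | 7.45 | 8.36 |
| erniebot-4.0-202404 (文心一言) | Closed source | 7.89 | 7.61 | 8.17 |
| DeepSeek-V2 Chat (SFT) | Open source | 7.74 | 7.30 | 8.17 |
| gpt-4-0613 | Closed source | 7.53 | 7.47 | 7.59 |
| erniebot-4.0-202312 (文心一言) | Closed source | 7.36 | 6.84 | 7.88 |
| moonshot-v1-32k-202404 (月之暗面) | Closed source | 7.22 | 6.42 | 8.02 |
| Qwen1.5-72B-Chat (通义千问) | Open source | 7.19 | 6.45 | 7.93 |
| DeepSeek-67B-Chat | Open source | 6.43 | 5.75 | 7.11 |
| Yi-34B-Chat (零一万物) | Open source | 6.12 | 4.86 | 7.38 |
| gpt-3.5-turbo-0613 | Closed source | 6.08 | 5.35 | 6.71 |
| DeepSeek-V2-Lite 16B Chat | Open source | 6.01 | 4.71 | 7.32 |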

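#### Coding Benchmarks

We evaluate our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. As illustrated, DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that surpasses several other sophisticated models. This performance highlights the model's effectiveness in tackling live coding tasks.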
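For readers unfamiliar with the Pass@1 metric, the sketch below shows the widely used unbiased pass@k estimator for a single problem. This README does not state how many samples were drawn per problem or which exact formula was used, so the function name, arguments, and example numbers here are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every possible set of k samples contains at least one passing sample.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
```

For k = 1 the estimate reduces to the fraction of correct samples, c / n. A benchmark-level score is then the mean of the per-problem estimates.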

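## Model Architecture

DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:

- For attention, we design MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference.
- For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower costs.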
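The idea of low-rank key-value joint compression can be illustrated with a toy sketch. The code below is not the DeepSeek-V2 implementation: the dimensions, weight names, and random initialization are all hypothetical, and the full design described in the paper has further components (for example, its handling of positional encoding) that are omitted here. The sketch only shows why caching one small latent vector per token, instead of full keys and values, shrinks the cache.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes, NOT the DeepSeek-V2 configuration.
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8
seq_len = 5
rng = random.Random(0)

w_down_kv = rand_matrix(d_model, d_latent, rng)          # joint down-projection for K and V
w_up_k = rand_matrix(d_latent, n_heads * d_head, rng)    # up-projection to keys
w_up_v = rand_matrix(d_latent, n_heads * d_head, rng)    # up-projection to values

hidden = rand_matrix(seq_len, d_model, rng)              # one hidden state per token

# Only this compressed latent is stored in the cache: seq_len x d_latent.
latent_cache = matmul(hidden, w_down_kv)

# Keys and values are reconstructed from the latent when attention is computed.
keys = matmul(latent_cache, w_up_k)
values = matmul(latent_cache, w_up_v)

standard_cache_per_token = 2 * n_heads * d_head          # full K and V per layer
latent_cache_per_token = d_latent                        # compressed latent per layer
print("cached values per token (standard):", standard_cache_per_token)
print("cached values per token (latent):  ", latent_cache_per_token)
```

With these toy sizes the per-token cache shrinks by a factor of 16. The actual savings depend on the real model dimensions, which this sketch does not reproduce.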

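## Chat Website

You can chat with DeepSeek-V2 on DeepSeek's official website: chat.deepseek.com

## API Platform

We also provide an OpenAI-compatible API on the DeepSeek Platform: platform.deepseek.com. Sign up to receive millions of free tokens. You can also pay as you go at an unbeatable price.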

## How to Run Locally

To run inference with DeepSeek-V2 in BF16 format, 8 GPUs with 80GB of memory each are required.
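A rough estimate shows where this requirement comes from. The calculation below assumes 2 bytes per parameter for BF16 and decimal gigabytes, and it ignores activations, the KV cache, and framework overhead, so it is a lower bound on real usage rather than a measured figure. The 75GB per-GPU limit is the `max_memory` value used in the Transformers examples in this README.

```python
total_params = 236e9            # total parameters of DeepSeek-V2
bytes_per_param_bf16 = 2        # BF16 stores each parameter in 16 bits
gpu_count = 8
gpu_memory_gb = 80
max_memory_per_gpu_gb = 75      # per-GPU limit used in the Transformers examples

weights_gb = total_params * bytes_per_param_bf16 / 1e9
total_gpu_memory_gb = gpu_count * gpu_memory_gb
usable_memory_gb = gpu_count * max_memory_per_gpu_gb
headroom_gb = total_gpu_memory_gb - weights_gb

print(f"weights alone:        {weights_gb:.0f} GB")
print(f"8 x 80GB GPUs:        {total_gpu_memory_gb} GB")
print(f"usable at 75GB each:  {usable_memory_gb} GB")
print(f"left after weights:   {headroom_gb:.0f} GB")
```

The weights alone take about 472 GB, which exceeds what seven 80GB GPUs could hold once a per-GPU margin is reserved, so eight GPUs are needed to leave room for activations and the KV cache.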

### Inference with Huggingface's Transformers

You can directly employ Huggingface's Transformers for model inference.

#### Text Completion

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
# Unpack the tokenizer output so that input_ids and attention_mask are passed as keyword arguments
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Chat Completion

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
# Apply the chat template and append the prompt for the assistant's turn
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

The complete chat template can be found in `tokenizer_config.json`, located in the HuggingFace model repository.

An example of the chat template is as follows:

<｜begin▁of▁sentence｜>User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:

You can also add an optional system message:

<｜begin▁of▁sentence｜>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<｜end▁of▁sentence｜>User: {user_message_2}

Assistant:
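To make the layout of the template concrete, the sketch below assembles the same prompt string by hand from a list of messages. It is only an illustration of the two examples shown above: the helper name `render_chat_prompt` is hypothetical, and the authoritative template remains the one in `tokenizer_config.json`, which `tokenizer.apply_chat_template` applies for you.

```python
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_chat_prompt(messages, add_generation_prompt=True):
    """Render messages following the example layout shown in this README."""
    parts = [BOS]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(f"{content}\n\n")
        elif role == "user":
            parts.append(f"User: {content}\n\n")
        elif role == "assistant":
            # Each completed assistant turn is closed with the end-of-sentence token.
            parts.append(f"Assistant: {content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave the prompt open so the model writes the next assistant turn.
        parts.append("Assistant:")
    return "".join(parts)


example_messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
prompt = render_chat_prompt(example_messages)
print(prompt)
```

In practice, prefer `tokenizer.apply_chat_template`, since it stays in sync with the template shipped with the model.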

### Inference with SGLang (recommended)

SGLang currently supports MLA optimizations, FP8 (W8A8), FP8 KV Cache, and Torch Compile, offering the best latency and throughput among open-source frameworks. Here are some example commands to launch an OpenAI API-compatible server:

```bash
# BF16, tensor parallelism = 8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Chat --tp 8 --trust-remote-code

# BF16, with torch.compile (compilation can take several minutes)
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Lite-Chat --trust-remote-code --enable-torch-compile

# FP8, tensor parallelism = 8, FP8 KV cache
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Chat --tp 8 --trust-remote-code --quant fp8 --kv-cache-dtype fp8_e5m2
```

After launching the server, you can query it with the OpenAI API:

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
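Because the server speaks the OpenAI chat-completions protocol, the same query can be expressed as a plain HTTP request. The sketch below only builds the request with the Python standard library and does not send it; the helper name `build_chat_request` is hypothetical, and the base URL is the one from the example above and assumes a server is already running on that address.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="default",
                       temperature=0, max_tokens=64, api_key="EMPTY"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:30000/v1",
    [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
)
print(request.get_method(), request.full_url)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

The commented-out lines show where the generated text sits in an OpenAI-style response, namely under `choices[0].message.content`.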

### Inference with vLLM (recommended)

To use vLLM for model inference, please merge this Pull Request into your vLLM codebase: vllm-project/vllm#4650.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Three independent single-turn conversations, processed as one batch
messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

# Apply the chat template to each conversation and get token IDs
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

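### LangChain Support

Since our API is compatible with OpenAI, you can easily use it in langchain. Here is an example:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key="<your-deepseek-api-key>",  # placeholder: replace with your own API key
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)
```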

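## License

This code repository is licensed under the MIT License. The use of DeepSeek-V2 Base/Chat models is subject to the Model License. The DeepSeek-V2 series (including Base and Chat) supports commercial use.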

## Citation

```bibtex
@misc{deepseekv2,
    title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
    author={DeepSeek-AI},
    year={2024},
    eprint={2405.04434},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

## Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.
