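LLaMA Factory: The Complete Guide to a One-Stop LLM Fine-Tuning Platform

Project positioning: LLaMA Factory is a unified framework for efficient fine-tuning of large models. Its repository tagline is "Unified Efficient Fine-Tuning of 100+ LLMs & VLMs", and the accompanying paper was published at ACL 2024. The project has more than 48,000 GitHub stars and is used by companies such as Amazon, NVIDIA, and Alibaba Cloud.

1. Project Overview

1.1 Core Value

LLaMA Factory aims to make fine-tuning a large model as simple as chatting:

🎯 Zero-code operation: fine-tune models through the CLI or Web UI without writing code
🌐 Broad model coverage: supports 100+ mainstream LLMs and vision-language models
⚡ Efficient and low-cost: supports several quantization techniques; per the table in 1.2, 4-bit QLoRA on a 70B model needs about 48GB, which fits on two 24GB consumer GPUs
🔧 Many methods: integrates 10+ fine-tuning methods (LoRA, QLoRA, PPO, DPO, and more)
📊 Monitoring-friendly: built-in LlamaBoard plus TensorBoard, Wandb, and other experiment trackers

1.2 Hardware Requirements

Method	Precision	7B	14B	30B	70B
Full-parameter (bf16/fp16 mixed precision)	32-bit	120GB	240GB	600GB	1200GB
Full-parameter (pure_bf16)	16-bit	60GB	120GB	300GB	600GB
Freeze/LoRA/GaLore/OFT	16-bit	16GB	32GB	64GB	160GB
QLoRA / QOFT	8-bit	10GB	20GB	40GB	80GB
QLoRA / QOFT	4-bit	6GB	12GB	24GB	48GB
QLoRA / QOFT	2-bit	4GB	8GB	16GB	24GB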
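
As a rough planning aid, the table above can be turned into a lookup. The sketch below copies the table's estimates and returns the methods that fit a given memory budget. The function name and the idea of treating several GPUs as one pooled budget are assumptions made for illustration; real usage also depends on sequence length, batch size, and sharding overhead.

```python
# Peak-memory estimates in GB, copied from the hardware table above.
MEMORY_TABLE_GB = {
    "full (bf16/fp16 AMP)": {7: 120, 14: 240, 30: 600, 70: 1200},
    "full (pure_bf16)": {7: 60, 14: 120, 30: 300, 70: 600},
    "freeze/lora/galore/oft": {7: 16, 14: 32, 30: 64, 70: 160},
    "qlora 8-bit": {7: 10, 14: 20, 30: 40, 70: 80},
    "qlora 4-bit": {7: 6, 14: 12, 30: 24, 70: 48},
    "qlora 2-bit": {7: 4, 14: 8, 30: 16, 70: 24},
}


def methods_that_fit(model_size_b: int, total_vram_gb: float) -> list[str]:
    """Return the methods whose table estimate fits into the pooled VRAM budget.

    Hypothetical helper: pooling VRAM across GPUs is a simplification.
    """
    return [
        method
        for method, row in MEMORY_TABLE_GB.items()
        if model_size_b in row and row[model_size_b] <= total_vram_gb
    ]


# One 24GB card with a 7B model, then two 24GB cards with a 70B model.
print(methods_that_fit(7, 24))
print(methods_that_fit(70, 48))
```

Under these assumptions, a 70B model on two 24GB cards is only within reach of the 4-bit and 2-bit QLoRA rows.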

1.3 System Requirements

Required dependencies:

Dependency	Minimum	Recommended
Python	3.11	>= 3.11
torch	2.0.0	2.6.0
transformers	4.49.0	4.50.0
datasets	2.16.0	3.2.0
accelerate	0.34.0	1.2.1
peft	0.14.0	0.15.1
trl	0.8.6	0.9.6

Optional dependencies:

Dependency	Minimum	Recommended
CUDA	11.6	12.2
deepspeed	0.10.0	0.16.4
bitsandbytes	0.39.0	0.43.1
vllm	0.4.3	0.8.2
flash-attn	2.5.6	2.7.2

1.4 Project Information

Property	Value
GitHub	https://github.com/hiyouga/LlamaFactory
Paper	arXiv:2403.13372
Documentation	https://llamafactory.readthedocs.io/
Online demo	https://www.llamafactory.com.cn/
Discord	https://discord.gg/rKfvV9r9FK
License	Apache-2.0

Docker images:

# CUDA
docker pull hiyouga/llamafactory:latest

# NPU
docker pull hiyouga/llamafactory:latest-npu-a2

# ROCm
docker pull hiyouga/llamafactory:latest-rocm

2. Supported Models

2.1 LLaMA Family

Model	Parameters	Template
LLaMA	7B/13B/33B/65B	-
LLaMA 2	7B/13B/70B	llama2
LLaMA 3/3.3	1B/3B/8B/70B	llama3
LLaMA 4	109B/402B	llama4
LLaMA 3.2 Vision	11B/90B	mllama

2.2 Qwen Family

Model	Parameters	Template
Qwen2 (Code/Math/MoE/QwQ)	0.5B~110B	qwen
Qwen3 (MoE/Instruct/Thinking/Next)	0.6B~235B	qwen3/qwen3_nothink
Qwen3.5	0.8B~397B	qwen3_5
Qwen2-Audio	7B	qwen2_audio
Qwen2.5-Omni	3B/7B	qwen2_omni
Qwen3-Omni	30B	qwen3_omni
Qwen2-VL / Qwen2.5-VL / QVQ	2B~72B	qwen2_vl
Qwen3-VL	2B~235B	qwen3_vl

Template note: qwen3_nothink is for use without chain-of-thought (thinking) output; qwen3 is the standard chat template.

2.3 DeepSeek Family

Model	Parameters	Template
DeepSeek (LLM/Code/MoE)	7B/16B/67B/236B	deepseek
DeepSeek 3-3.2	236B/671B	deepseek3
DeepSeek R1 (Distill)	1.5B~671B	deepseekr1

2.4 GLM Family

Model	Parameters	Template
GLM-4 / GLM-4-0414 / GLM-Z1	9B/32B	glm4/glmz1
GLM-4.5 / GLM-4.5(6)V	9B/106B/355B	glm4_moe/glm4_5v

2.5 Gemma Family

Model	Parameters	Template
Gemma / Gemma 2 / CodeGemma	2B/7B/9B/27B	gemma/gemma2
Gemma 3 / Gemma 3n	270M~27B	gemma3/gemma3n

2.6 InternLM Family

Model	Parameters	Template
InternLM 2-3	7B/8B/20B	intern2
InternVL 2.5-3.5	1B~241B	intern_vl
Intern-S1-mini	8B	intern_s1

2.7 Other Mainstream Models

Model	Parameters	Template
Mistral / Mixtral	7B/8x7B/8x22B	mistral
LLaVA-1.5	7B/13B	llava
LLaVA-NeXT	7B~110B	llava_next
LLaVA-NeXT-Video	7B/34B	llava_next_video
Phi-3 / Phi-3.5	4B/14B	phi
Phi-4-mini / Phi-4	3.8B/14B	phi4_mini/phi4
MiniCPM 4	0.5B/8B	cpm4
MiniCPM-o / MiniCPM-V 4.5	8B/9B	minicpm_o/minicpm_v
GPT-OSS	20B/120B	gpt_oss
Granite 3-4	1B~8B	granite3/granite4
Falcon	0.5B~180B	falcon/falcon_h1
BLOOM	560M~176B	-

2.8 Day-0 Support Commitment

Support timing	Models
Day 0	Qwen3, Qwen2.5-VL, Gemma 3, GLM-4.1V, InternLM 3, MiniCPM-o-2.6
Day 1	Llama 3, GLM-4, Mistral Small, PaliGemma2, Llama 4

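3. Fine-Tuning Methods in Detail

3.1 Training Method Matrix

Approach	Full	Freeze	LoRA	QLoRA	OFT	QOFT
Pre-Training	✅	✅	✅	✅	✅	✅
Supervised Fine-Tuning (SFT)	✅	✅	✅	✅	✅	✅
Reward Modeling	✅	✅	✅	✅	✅	✅
PPO Training	✅	✅	✅	✅	✅	✅
DPO Training	✅	✅	✅	✅	✅	✅
KTO Training	✅	✅	✅	✅	✅	✅
ORPO Training	✅	✅	✅	✅	✅	✅
SimPO Training	✅	✅	✅	✅	✅	✅

3.2 Full-Parameter Fine-Tuning (Full-tuning)

Updates all model parameters; suited to settings with ample compute.

3.3 Freeze-tuning

Freezes most parameters and trains only a few layers, which greatly reduces GPU memory use.

3.4 LoRA Fine-Tuning

Low-rank adapters: the base weights stay frozen and only a small number of added parameters are trained. This is the default choice for general use.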
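
To make "only a small number of added parameters" concrete, here is a minimal pure-Python sketch of the LoRA update, assuming the usual formulation W' = W + (alpha / r) · B · A with B initialised to zero. The helper names and the tiny matrix sizes are illustrative and are not taken from LLaMA Factory's code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(A, B, alpha, rank):
    """Low-rank weight update: (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


random.seed(0)
d_out, d_in, rank, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, small random init
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init

# With B = 0 the update is zero, so training starts exactly at the base model.
delta = lora_delta(A, B, alpha, rank)
merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Parameter comparison for a 4096 x 4096 projection with rank 8.
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Merging an adapter for export is the same arithmetic: the trained delta is added to W once, after which the adapter is no longer needed at inference time.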

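3.5 QLoRA Fine-Tuning

4-bit quantization of the base model combined with LoRA, which makes large-model training feasible on consumer GPUs.

Supported quantization methods: AWQ, GPTQ, LLM.int8, AQLM, HQQ, EETQ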
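
The memory saving comes from storing each weight as a small integer code plus one scale per block. The sketch below shows the idea with simple symmetric absmax quantization to 4 bits. It is a deliberately simplified stand-in: NF4 as used by bitsandbytes relies on a non-uniform 16-value codebook rather than evenly spaced levels, and the function names here are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    absmax = max(abs(v) for v in values) or 1.0     # avoid division by zero for an all-zero block
    scale = absmax / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float weights from codes and the block scale."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.26]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}  max abs error={max_err:.4f}")
```

The rounding error per weight is bounded by half a quantization step, which is why the LoRA adapters trained on top stay in higher precision.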

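3.6 Preference Learning Algorithms

DPO (Direct Preference Optimization) - optimizes the policy directly on chosen/rejected pairs, without training a separate reward model

KTO (Kahneman-Tversky Optimization) - a loss motivated by Kahneman and Tversky's prospect theory; it learns from per-sample desirable/undesirable labels rather than pairs

ORPO (Odds Ratio Preference Optimization) - adds an odds-ratio preference term to the SFT loss

SimPO (Simple Preference Optimization) - a simplified preference objective that does not need a reference model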
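
For reference, the standard DPO objective for one preference pair can be sketched as below. It takes summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; `beta` plays the role of `pref_beta` in the configs later in this guide. This is the textbook sigmoid form of the loss, not LLaMA Factory's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * margin)) for a single preference pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x) = log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy already prefers the chosen answer more than the reference does: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's.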

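3.7 Advanced Optimization Algorithms

Algorithm	Description	How to enable
GaLore	Gradient low-rank projection to reduce memory	`use_galore: true` (rank via `galore_rank: 128`)
BAdam	Memory-efficient full-parameter training	`use_badam: true`
DoRA	Weight-decomposed LoRA	`use_dora: true`
LoRA+	Separate, larger learning rate for the LoRA B matrices	`loraplus_lr_ratio: 16.0`
PiSSA	Principal singular values and singular vectors adaptation	`pissa_init: true`
Muon	Optimizer	`use_muon: true`
Adam-mini	Memory-efficient optimizer	`use_adam_mini: true`
LongLoRA	Long-context extension	`shift_attn: true`
OFT	Orthogonal fine-tuning	`oft_alpha: 1`
APOLLO	Memory-efficient optimizer	`use_apollo: true`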

4. Installation Guide

4.1 Install from Source (Recommended)

git clone --depth 1 https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
pip install -r requirements/metrics.txt -r requirements/deepspeed.txt

4.2 Using Docker

CUDA:

docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest

Ascend NPU:

docker pull hiyouga/llamafactory:latest-npu-a2
docker pull hiyouga/llamafactory:latest-npu-a3

AMD ROCm:

cd docker/docker-rocm/
docker compose up -d
docker compose exec llamafactory bash

4.3 Using uv (Recommended)

uv run llamafactory-cli webui

4.4 Notes for Windows

Install PyTorch:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Install bitsandbytes:

pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl

5. Quick Start

5.1 Data Preparation

SFT data format:

[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"},
      {"role": "assistant", "content": "Hello! How can I help you?"}
    ]
  }
]

DPO data format (the prompt in "messages" ends with a user turn; "chosen" and "rejected" are the two candidate replies to it):

[
  {
    "messages": [
      {"role": "user", "content": "Question"}
    ],
    "chosen": {"role": "assistant", "content": "Preferred answer"},
    "rejected": {"role": "assistant", "content": "Rejected answer"}
  }
]

Pre-training data format:

[
  {"text": "This is the first pre-training passage."},
  {"text": "This is the second pre-training passage."}
]
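
Malformed records are a common cause of dataset loading failures, so it can help to check a file before training. The sketch below validates records against the three shapes shown above. The function names are hypothetical, and the rules (SFT ends with an assistant turn, a preference prompt ends with a user turn) are assumptions drawn from the examples rather than LLaMA Factory's own loader.

```python
VALID_ROLES = {"system", "user", "assistant"}


def _is_turn(turn, role=None):
    """True if `turn` is a {"role", "content"} dict, optionally with a specific role."""
    return (
        isinstance(turn, dict)
        and turn.get("role") in VALID_ROLES
        and isinstance(turn.get("content"), str)
        and (role is None or turn["role"] == role)
    )


def validate_sft(record):
    """SFT: a list of turns whose last turn is the assistant's answer."""
    msgs = record.get("messages")
    return (
        isinstance(msgs, list)
        and len(msgs) >= 2
        and all(_is_turn(m) for m in msgs)
        and msgs[-1]["role"] == "assistant"
    )


def validate_dpo(record):
    """Preference pair: prompt ends with a user turn, plus chosen/rejected replies."""
    msgs = record.get("messages")
    return (
        isinstance(msgs, list)
        and len(msgs) >= 1
        and all(_is_turn(m) for m in msgs)
        and msgs[-1]["role"] == "user"
        and _is_turn(record.get("chosen"), "assistant")
        and _is_turn(record.get("rejected"), "assistant")
    )


def validate_pretrain(record):
    """Pre-training: a non-empty "text" string."""
    return isinstance(record.get("text"), str) and bool(record["text"].strip())


sft_ok = {"messages": [{"role": "user", "content": "Hello!"},
                       {"role": "assistant", "content": "Hi!"}]}
dpo_ok = {"messages": [{"role": "user", "content": "Question"}],
          "chosen": {"role": "assistant", "content": "Preferred answer"},
          "rejected": {"role": "assistant", "content": "Rejected answer"}}
print(validate_sft(sft_ok), validate_dpo(dpo_ok), validate_pretrain({"text": "Some text."}))
```

To check a whole file, load it with `json.load` and apply the matching validator to each record.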

5.2 Fine-Tune in Three Commands

# 1. Fine-tune
llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

# 2. Chat with the fine-tuned model
llamafactory-cli chat examples/inference/qwen3_lora_sft.yaml

# 3. Merge and export
llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml

5.3 Using the Web UI (Recommended for Beginners)

llamafactory-cli webui

5.4 Deploying as an API

API_PORT=8000 llamafactory-cli api examples/inference/qwen3.yaml infer_backend=vllm vllm_enforce_eager=true
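
Once the server is running, it can be queried like any OpenAI-style chat endpoint. The sketch below uses only the standard library. It assumes the service listens on localhost at the port given by `API_PORT` and exposes `/v1/chat/completions`; the model name `qwen3` is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model="qwen3"):
    """Build a POST request for an OpenAI-style chat completion (model name is a placeholder)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the API server from the command above already running:
#   print(chat("Hello!"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.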

6. Training Configuration in Detail

6.1 Basic Parameters

model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

dataset: identity,alpaca_en_demo
dataset_dir: data
max_samples: 1000
val_size: 0.1

output_dir: ./output/qwen3_lora
logging_steps: 10
save_steps: 500
eval_steps: 500

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
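
These values interact: the effective batch size is the per-device batch times the accumulation steps times the number of GPUs, and the warmup length follows from the total step count. The sketch below works through that arithmetic. The function is hypothetical, and the sample count of 2000 (two datasets capped at `max_samples: 1000` each) is an assumed upper bound; the trainer's exact rounding may differ slightly.

```python
import math


def training_schedule(num_samples, val_size, per_device_batch, grad_accum,
                      num_gpus, epochs, warmup_ratio):
    """Approximate optimizer-step counts implied by a training config."""
    train_samples = num_samples - int(num_samples * val_size)
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    total_steps = math.ceil(steps_per_epoch * epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "train_samples": train_samples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the config above on a single GPU, assuming 2000 samples in total.
schedule = training_schedule(
    num_samples=2000, val_size=0.1, per_device_batch=1, grad_accum=8,
    num_gpus=1, epochs=3.0, warmup_ratio=0.1,
)
print(schedule)
```

With these numbers, `save_steps: 500` and `eval_steps: 500` each trigger only once in the run, so shorter intervals may be preferable for small datasets.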

6.2 LoRA Parameters

lora_rank: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target: all

use_dora: true
quantization_bit: 4
bf16: true

6.3 Reinforcement Learning and Preference Parameters (PPO/DPO)

reward_model: path/to/reward_model
ppo_epochs: 4
gamma: 1.0
lam: 0.95
pref_beta: 0.1

7. Acceleration and Optimization

7.1 FlashAttention

flash_attn: fa2

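Suitable for RTX 4090, A100, and H100 GPUs.

7.2 Unsloth

use_unsloth: true

Long-sequence training (56k context) fits within 24GB of GPU memory, running at 117% of the speed and 50% of the memory of FlashAttention-2.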

7.3 Liger Kernel

enable_liger_kernel: true

An efficient set of training kernels open-sourced by LinkedIn.

7.4 Distributed Training

FORCE_TORCHRUN=1 NPROC_PER_NODE=8 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

8. Experiment Monitoring

8.1 LlamaBoard

The built-in web visualization interface, available automatically once training starts.

8.2 TensorBoard

tensorboard --logdir output/qwen3_lora

8.3 Wandb

report_to: wandb
run_name: test_run

8.4 SwanLab (a China-developed tool)

use_swanlab: true
swanlab_run_name: test_run

9. Built-in Datasets

9.1 Pre-training Datasets

Wiki Demo / FineWeb / Wikipedia / The Stack / SkyPile

9.2 SFT Datasets

English: Stanford Alpaca / LIMA / UltraChat / CodeAlpaca / WizardLM

Chinese: BELLE 2M / Firefly 1.1M / WebNovel / Ruozhiba

Multimodal and tool calling: LLaVA mixed / Glaive Function Calling

9.3 Preference Datasets

DPO mixed / UltraFeedback / HH-RLHF / RLHF-V

10. Cloud Deployment

Google Colab: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9
Alibaba Cloud PAI-DSW: https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
Online demo: https://www.llamafactory.com.cn/
HuggingFace Space: https://huggingface.co/spaces/hiyouga/LLaMA-Board

11. Application Cases

📄 Legal: DISC-LawLLM (reported 35% accuracy improvement)
🏥 Medical: Sunsimiao / CareGPT
🎮 Role-playing: GPT-OSS for Role-Playing
🔍 Visual question answering: LLaVA-Med / InternVL

12. Citation

@inproceedings{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}

13. Complete Configuration Examples

13.1 Full-Parameter Fine-Tuning

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: full

dataset: identity,alpaca_en_demo
dataset_dir: data
max_samples: 1000
val_size: 0.1

output_dir: ./output/qwen3_full_sft
logging_steps: 10
save_steps: 500
eval_steps: 500

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

deepspeed: examples/deepspeed/ds_z3_config.json

13.2 Freeze-Tuning

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: freeze
freeze_trainable_layers: 2
freeze_trainable_modules: all

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 3.0

13.3 LoRA Fine-Tuning

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

lora_rank: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target: all

use_dora: true

per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

13.4 QLoRA 4-bit Fine-Tuning

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

quantization_bit: 4
quantization_type: nf4
double_quantization: true
bf16: true

lora_rank: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target: all

per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
num_train_epochs: 3.0

13.5 DPO Preference Alignment

stage: dpo
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

dataset: dpo_en_demo  # must be a pairwise preference dataset; SFT-only sets such as identity have no chosen/rejected fields
dataset_dir: data

lora_rank: 8
lora_alpha: 16
lora_target: all

pref_beta: 0.1
pref_loss: sigmoid

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 3.0

13.6 ORPO Preference Alignment

# ORPO runs under the dpo stage and is selected through pref_loss
stage: dpo
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

dataset: dpo_en_demo
dataset_dir: data

lora_rank: 8
lora_alpha: 16
lora_target: all

pref_beta: 0.1
pref_loss: orpo

learning_rate: 1.0e-4
num_train_epochs: 3.0

13.7 KTO Preference Alignment

stage: kto
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

dataset: kto_en_demo
dataset_dir: data

pref_beta: 0.1
kto_chosen_weight: 1.0
kto_rejected_weight: 1.0

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 3.0

13.8 PPO Reinforcement Learning

stage: ppo
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
reward_model: path/to/reward_model
template: qwen3
finetuning_type: lora

lora_rank: 8
lora_alpha: 16
lora_target: all

gamma: 1.0
lam: 0.95
ppo_epochs: 4
learning_rate: 1.0e-5
clip_range: 0.2

13.9 Multimodal Fine-Tuning (Qwen2.5-VL)

stage: sft
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
trust_remote_code: true
template: qwen2_vl
finetuning_type: lora

dataset: llava_en_zh_300k
dataset_dir: data
val_size: 0.1

image_max_pixels: 262144

lora_rank: 16
lora_alpha: 32
lora_target: all

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0

13.10 Long-Context Fine-Tuning

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

cutoff_len: 8192

shift_attn: true

lora_rank: 16
lora_alpha: 32
lora_target: all

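14. DeepSpeed Distributed Training Configuration

14.1 DeepSpeed ZeRO-3 Configuration

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {"enabled": "auto"},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e5,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}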

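14.2 DeepSpeed ZeRO-2 Configuration

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_bucket_size": 5e8
  }
}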

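14.3 DeepSpeed Launch Commands

# Single node, multiple GPUs (the YAML must point to a DeepSpeed config, e.g. deepspeed: examples/deepspeed/ds_z3_config.json)
FORCE_TORCHRUN=1 NPROC_PER_NODE=8 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

# Multiple nodes: run on every node with its own NODE_RANK; MASTER_ADDR is the rank-0 node's address (placeholder value shown)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml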

15. FSDP Distributed Training Configuration

15.1 FSDP Configuration File

stage: sft
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

# FSDP options as accepted by Hugging Face TrainingArguments; the shard count follows the number of launched processes
fsdp: full_shard auto_wrap
fsdp_config:
  backward_prefetch: backward_pre
  forward_prefetch: false
  limit_all_gathers: true

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 5.0e-5

15.2 FSDP Launch Command

FORCE_TORCHRUN=1 NPROC_PER_NODE=4 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/qwen3_lora_sft.yaml

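16. Training Parameters in Detail

16.1 Model Parameters

Parameter	Type	Default	Description
model_name_or_path	str	required	Model name or local path
trust_remote_code	bool	False	Whether to trust remote code
template	str	required	Chat template name; must match the model
finetuning_type	str	lora	Fine-tuning type: full / freeze / lora
reward_model	str	None	Reward model path (used for PPO training)
quantization_bit	int	None	Bit width for on-the-fly quantization: 4 / 8 with bitsandbytes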

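16.2 LoRA Parameters

Parameter	Type	Default	Description
lora_rank	int	8	LoRA rank; 4 to 64 is a typical range
lora_alpha	int	None (falls back to lora_rank × 2)	LoRA scaling factor; an alpha/rank ratio of 1 to 2 is recommended
lora_dropout	float	0.0	Dropout applied in LoRA layers
lora_target	str	all	Target modules: all / q_proj / k_proj / v_proj, etc.
use_dora	bool	False	Whether to use DoRA
use_loftq	bool	False	Whether to use LoftQ initialization
pissa_init	bool	False	Whether to use PiSSA initialization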

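16.3 Training Parameters

Parameter	Type	Default	Description
per_device_train_batch_size	int	8	Training batch size per device (the examples in this guide set 1 or 2)
gradient_accumulation_steps	int	1	Gradient accumulation steps (the examples in this guide set 4 to 16)
learning_rate	float	5e-5	Learning rate
num_train_epochs	float	3.0	Number of training epochs
max_grad_norm	float	1.0	Gradient clipping norm
weight_decay	float	0.0	Weight decay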

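16.4 Data Parameters

Parameter	Type	Default	Description
dataset	str	required	Dataset name(s); separate multiple names with commas
dataset_dir	str	data	Data directory
max_samples	int	None	Maximum number of samples taken from each dataset
val_size	float	0.0	Validation split ratio
cutoff_len	int	2048	Truncation length in tokens
streaming	bool	False	Whether to stream the dataset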

16.5 Acceleration Parameters

Parameter	Type	Default	Description
flash_attn	str	auto	Attention implementation: auto / disabled / sdpa / fa2
use_unsloth	bool	False	Whether to use Unsloth acceleration
enable_liger_kernel	bool	False	Whether to use Liger Kernel
shift_attn	bool	False	Whether to enable LongLoRA's S²-Attn
neftune_noise_alpha	float	None	NEFTune noise alpha

17. Inference and Export

17.1 Merging and Exporting LoRA Weights

llamafactory-cli export examples/merge_lora/qwen3_lora_sft.yaml

model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
adapter_name_or_path: ./output/qwen3_lora_sft
template: qwen3
finetuning_type: lora

export_dir: ./output/qwen3_lora_sft/merged
export_size: 4
export_device: cpu
export_legacy_format: false

17.2 API Service Deployment

# Accelerate with vLLM
API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml infer_backend=vllm vllm_enforce_eager=true

# Accelerate with SGLang
API_PORT=8000 llamafactory-cli api examples/inference/qwen3_lora_sft.yaml infer_backend=sglang

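17.3 GGUF Quantized Export

GGUF is the model format used by llama.cpp and is well suited to CPU inference and quantized deployment. The export config below writes a merged checkpoint in Hugging Face format; converting that checkpoint to GGUF and quantizing it (for example to Q4_K_M) is then done with llama.cpp's own conversion and quantization tools. In LLaMA Factory, export_quantization_bit takes an integer bit width rather than a GGUF type name, so it is omitted here.

model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
adapter_name_or_path: ./output/qwen3_lora_sft
template: qwen3
finetuning_type: lora

export_dir: ./output/qwen3_lora_sft/gguf
export_size: 4
export_device: cpu
export_legacy_format: false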

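Common GGUF quantization types:

Quantization level	Compression	Typical use
Q2_K	Highest	Extreme compression
Q3_K_M	High	Memory-constrained devices
Q4_K_M	Medium	Recommended default
Q5_K_M	Medium-low	Quality first
Q6_K	Low	Quality first
Q8_0	Lowest	Near-lossless 8-bit baseline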

17.4 Ollama Modelfile Export

Ollama can load GGUF models directly, so an exported model can be deployed locally.

model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
adapter_name_or_path: ./output/qwen3_lora_sft
template: qwen3
finetuning_type: lora

export_dir: ./output/qwen3_lora_sft/ollama
export_size: 4
export_legacy_format: false

After exporting:

ollama create qwen3-sft -f ./output/qwen3_lora_sft/ollama/Modelfile
ollama run qwen3-sft

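18. KTransformers Extreme Optimization

KTransformers supports fine-tuning very large models on a minimal amount of hardware.

18.1 Measured Results

Model	Parameters	Conventional GPU requirement	KTransformers requirement
DeepSeek-V3	671B	8×A100 80G	2×RTX 4090 + CPU
Qwen3-235B	235B	8×A100 80G	2×RTX 4090 + CPU

18.2 Core Principles

MoE operator fusion
CPU-GPU cooperation
Sparse attention
Mixed-precision quantization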

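19. EasyR1 Reinforcement Learning Framework

EasyR1 is an efficient, scalable multimodal RL training framework from the LLaMA Factory team, with support for GRPO (Group Relative Policy Optimization) training.

19.1 Core Features

GRPO training: an RL algorithm proposed by DeepSeek that drops PPO's separate value (critic) model, which makes training cheaper than PPO
Multimodal support: RL training for both text-only and vision-language models
Unified framework: integrates the full SFT, reward model, and RLHF pipeline

19.2 GRPO vs PPO

Aspect	PPO	GRPO
Value (critic) model	Trained alongside the policy	Not needed; the baseline is computed from a group of sampled responses
Compute cost	High	Medium
Training stability	Medium	High
Typical use	Complex preference tasks	Tasks with simple, verifiable rewards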
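
The "group-relative" baseline in the table can be sketched in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group. The sketch below assumes the commonly described mean/standard-deviation normalization; the function name, the use of the population standard deviation, and the `eps` value are illustrative choices, not EasyR1's code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect) by a rule-based checker.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every sample gets the same reward, there is no learning signal for that prompt.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline comes from the group itself, no value network has to be trained or kept in memory.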

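19.3 GRPO Configuration Example

# Illustrative only: EasyR1 is a separate project with its own config schema, so verify these key names against its repository
stage: grpo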
model_name_or_path: Qwen/Qwen3-4B-Instruct
trust_remote_code: true
template: qwen3
finetuning_type: lora

grpo_beta: 0.1
grpo_num_generate_samples: 16

lora_rank: 8
lora_alpha: 16
lora_target: all

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0

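20. Troubleshooting Guide

20.1 Common Problems and Solutions

Problem	Cause	Solution
CUDA out of memory	Insufficient GPU memory	Reduce batch_size, enable QLoRA, use gradient_checkpointing
Model download fails	Network issues	Set USE_MODELSCOPE_HUB=1
Dataset fails to load	Format error	Check that the JSON matches the expected data format
LoRA has no effect	Adapter not specified	Make sure adapter_name_or_path is set correctly
Training loss does not decrease	Unsuitable learning rate	Try a lower learning rate (e.g. 1e-5)
Exploding gradients	Learning rate too high	Lower the learning rate and tighten clipping with a smaller max_grad_norm
Garbled model output	Template mismatch	Use the same template for training and inference
DeepSpeed errors	Environment issue	pip install deepspeed --upgrade
FlashAttention errors	Unsupported hardware	Set flash_attn: disabled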

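20.2 GPU Memory Optimization Tips

Enable gradient checkpointing: gradient_checkpointing: true
Enable CPU offloading
Use a lower quantization precision: quantization_bit: 4
Reduce the sequence length: cutoff_len: 1024
Use gradient accumulation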

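21. Best Practices

21.1 Data Preparation Best Practices

Data quality matters more than data quantity
Keep the data format consistent
Avoid cross-sample contamination when packing sequences: use neat_packing: true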
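
Contamination in packing arises when several short samples share one training sequence and tokens attend across sample boundaries. A minimal sketch of the remedy is a block-diagonal causal mask, shown below in plain Python. It illustrates the general technique under that assumption and is not LLaMA Factory's implementation; the function name is hypothetical.

```python
def packed_attention_mask(seq_lengths):
    """Causal mask for packed samples: a token sees only earlier tokens of its own sample."""
    total = sum(seq_lengths)
    # Segment id of every token position, e.g. [2, 1] -> [0, 0, 1].
    segment = [i for i, n in enumerate(seq_lengths) for _ in range(n)]
    return [
        [1 if segment[q] == segment[k] and k <= q else 0 for k in range(total)]
        for q in range(total)
    ]


# Two samples of length 2 and 1 packed into one sequence of length 3.
for row in packed_attention_mask([2, 1]):
    print(row)
```

The third token belongs to the second sample, so its row has zeros for the first two positions even though they come earlier in the packed sequence.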

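21.2 Training Best Practices

Learning rate: 1e-4 to 5e-4 for LoRA, 1e-5 to 5e-5 for full-parameter training
LoRA rank: rank 4 to 8 for simple tasks, rank 16 to 64 for complex tasks
Training epochs: usually 1 to 3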

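22. Industry Application Cases

Domain	Model	Reported result
Legal	DISC-LawLLM	35% accuracy improvement
Medical	Sunsimiao	Markedly better diagnostic suggestions
Finance	FinLLM	40% improvement in question-answering accuracy
Education	EduLLM	50% improvement in student satisfaction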

23. Related Resources

Resource	Link
GitHub	https://github.com/hiyouga/LlamaFactory
Paper	https://arxiv.org/abs/2403.13372
Documentation	https://llamafactory.readthedocs.io/
Blog	https://blog.llamafactory.net/
Online demo	https://www.llamafactory.com.cn/
Google Colab	https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9

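24. Summary and Outlook

LLaMA Factory's core strengths: broad model coverage, a rich set of integrated methods, low hardware requirements, a gentle learning curve, production-grade stability, and continuous iteration.

This document was compiled by 小时 AI from the official README and documentation, and is current as of March 2026. Feedback on omissions or updates is welcome.

Original work by 时空, licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.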