Downloading LLMs with Hugging Face and Running Them Locally

Running the model locally and deploying it server-side (deploying vLLM with Docker and running the deepseek-r1:1.5b model on GPU: installing the CUDA toolkit, vLLM, and Open WebUI)

Running Locally

Browse Hugging Face for a suitable small model, open its detail page, and copy the model name to start the download (here a fairly small model, Qwen2-0.5B-Instruct, is used).

1. Downloading the model

Hugging Face dependencies (PyTorch with CUDA 12.4 wheels; the transformers library is also required by the code below):

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
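
Before downloading anything, it is worth confirming that the CUDA build of PyTorch can actually see the GPU; a minimal check (not part of the original walkthrough):

import torch

# Should print the installed version, True, and the GPU name
# if the cu124 wheels match the local NVIDIA driver.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))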

Code to download the model:

from transformers import AutoModelForCausalLM, AutoTokenizer
import os

model_name = "Qwen/Qwen2-0.5B-Instruct"
cache_dir = "../local_models/Qwen/Qwen2-0.5B-Instruct"

os.makedirs(cache_dir, exist_ok=True)

# Download the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # pick an appropriate precision for the hardware
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # needed if the model ships custom code
)

# Save to the local directory
tokenizer.save_pretrained(cache_dir)
model.save_pretrained(cache_dir)

print(f"Model {model_name} downloaded and cached in {cache_dir}")

Run the script:

python hf_llm_download.py
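
If you only need the repository files on disk and do not want to instantiate the model first, huggingface_hub offers snapshot_download as an alternative; a minimal sketch (huggingface_hub is installed alongside transformers):

from huggingface_hub import snapshot_download

# Fetches the raw repo files (weights, tokenizer, config) straight
# into the target directory without loading the model into memory.
snapshot_download(
    repo_id="Qwen/Qwen2-0.5B-Instruct",
    local_dir="../local_models/Qwen/Qwen2-0.5B-Instruct",
)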

2. Local inference test

Grab the sample code from the model card on Hugging Face, point the model path at the local copy, and add a print at the end to test it:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto
model_location = "../local_models/Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_location, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_location)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("================== LLM Response ==================")
print(response)
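
For interactive testing it is often nicer to see tokens as they are produced; a sketch using transformers' TextStreamer on top of the model and tokenizer loaded above (not part of the original example):

from transformers import TextStreamer

# Prints decoded tokens to stdout as generation proceeds,
# skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)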

Server-side Deployment

1. NVIDIA support

  1. nvidia-driver: the NVIDIA driver provides the basic hardware support;
  2. NVIDIA Container Toolkit: ordinary containers cannot access the GPU; the NVIDIA Container Toolkit lets Docker containers use the host's NVIDIA driver and GPU resources. Official site: NVIDIA Container Toolkit
# List the GPU devices and the available drivers
> sudo ubuntu-drivers devices

# Install a specific driver
> sudo apt install nvidia-driver-525

# Reboot, then verify the NVIDIA driver
> sudo reboot
> nvidia-smi

# Install the Container Toolkit
> sudo apt-get update
> sudo apt-get install -y nvidia-docker2

# Restart the docker service and verify the installation
> sudo systemctl restart docker
> sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi

# Output:
Tue Jun 17 18:42:16 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.8     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P0              24W /  80W |      0MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

2. vLLM deployment

You can follow the deployment instructions shown on Hugging Face directly:

  1. Pull the images
docker pull vllm/vllm-openai
docker pull ghcr.io/open-webui/open-webui:cuda
  2. Start the container
# Deploy with docker on Linux:
sudo docker run --runtime nvidia --gpus all \
	--name vllm_container \
	-v ~/model:/root/.cache/huggingface \
	--env "HUGGING_FACE_HUB_TOKEN=<your_hf_token>" \
	--env "HF_ENDPOINT=https://hf-mirror.com" \
	-p 9407:9407 \
	--ipc=host \
	vllm/vllm-openai:latest \
	--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
	--port 9407 \
	--api-key fnos_deepseek_r1_api_key \
	--served-model-name deepseek-r1 \
	--dtype=float16 \
	--disable-nvtx \
	--attention-backend xformers \
	--disable-triton

sudo docker run --gpus "device=0" --ipc=host -d \
  --name vllm \
  -p 9407:9407 \
  -v /home/robinverse/model/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:/models/DeepSeek-R1-Distill-Qwen-1.5B \
  vllm/vllm-openai:v0.7.1 \
  --model=/models/DeepSeek-R1-Distill-Qwen-1.5B \
  --served-model-name=deepseek-r1-1.5b \
  --port=9407 \
  --max-model-len=8912 \
  --max-num-seqs=32 \
  --gpu-memory-utilization=0.90 \
  --dtype=float16 \
  --tensor-parallel-size 1
  • HUGGING_FACE_HUB_TOKEN: used to download the model;
  • --runtime nvidia --gpus all: run on the NVIDIA GPU;
  • --model /model: location of the model files;
  • --api-key: the key required for external access; if you use the web UI this must be set (it plays the same role as an API key issued by a commercial LLM platform); on an internal network it is not strictly necessary;
  • --served-model-name deepseek-r1: when calling vLLM you have to pick a model, and the model parameter in the request carries exactly this name;
  3. Check the container status and the NVIDIA GPU usage
docker ps
# GPU usage
nvidia-smi
Also worth trying: docker pull vllm/vllm-openai:v0.8.5
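
Besides docker ps and nvidia-smi, a lightweight sanity check is to hit the server's /v1/models endpoint; a sketch using the requests library (the port and key again follow the commands above):

import requests

# Lists the models the vLLM server is currently serving; expect HTTP 200
# and the --served-model-name in the response.
r = requests.get(
    "http://localhost:9407/v1/models",
    headers={"Authorization": "Bearer fnos_deepseek_r1_api_key"},
)
print(r.status_code, r.json())
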
  4. Deploy the web UI with docker
sudo docker run -d --name open-webui --network host --gpus all \
  -e OPENAI_API_BASE_URL=http://localhost:9407/v1 \
  -e OPENAI_API_KEYS=fnos_deepseek_r1 \
  -e USE_CUDA_DOCKER=true \
  dyrnq/open-webui