Running a model locally and deploying it on a server (vLLM in Docker, running the deepseek-r1:1.5b model on GPU: installing cuda-toolkit, vllm, open-webui)
Running locally
Pick a suitable small model on Hugging Face, open its detail page, and copy the model name directly; that is all you need to start the download (a fairly small model, Qwen2-0.5B-Instruct, is used here).
1. Model download
Hugging Face / PyTorch dependencies:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install transformers
Code to download the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

model_name = "Qwen/Qwen2-0.5B-Instruct"
cache_dir = "../local_models/Qwen/Qwen2-0.5B-Instruct"
os.makedirs(cache_dir, exist_ok=True)

# Download the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",      # pick a suitable precision for the hardware
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # needed if the model ships custom code
)

# Save a local copy
tokenizer.save_pretrained(cache_dir)
model.save_pretrained(cache_dir)
print(f"Model {model_name} downloaded and cached in {cache_dir}")
- A fairly small model is used for the test here: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
- Different models may pull in extra dependencies; if the script fails with a missing-package error, install the package manually and run it again.
Run the script:
python hf_llm_download.py
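As an alternative to loading the weights through transformers, the snapshot can also be fetched with the huggingface_hub helper (a minimal sketch; the local_dir below is an assumption that mirrors the layout used above):

from huggingface_hub import snapshot_download

# Download every file of the repo into the same local directory used above
snapshot_download(
    repo_id="Qwen/Qwen2-0.5B-Instruct",
    local_dir="../local_models/Qwen/Qwen2-0.5B-Instruct",
)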
2. Local inference test
Take the sample code straight from the Hugging Face model card, point it at the local model path, and add a print at the end to test it:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto
model_location = "../local_models/Qwen/Qwen2-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_location, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_location)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("================== LLM Response ==================")
print(response)
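For a quicker smoke test, recent transformers versions also accept chat messages directly in the text-generation pipeline; a sketch under that assumption (output indexing may differ between versions):

from transformers import pipeline

model_location = "../local_models/Qwen/Qwen2-0.5B-Instruct"

# The pipeline applies the chat template and handles tokenization internally
pipe = pipeline("text-generation", model=model_location, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
out = pipe(messages, max_new_tokens=512)
# The returned conversation contains the assistant reply as the last message
print(out[0]["generated_text"][-1]["content"])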
Server-side deployment
1. NVIDIA support
- nvidia-driver: the NVIDIA driver provides the basic hardware support;
- NVIDIA Container Toolkit: a plain container cannot access the GPU; the NVIDIA Container Toolkit is what lets Docker containers use the host's NVIDIA driver and GPU resources. Official docs: cuda toolkit
# List the devices and the drivers available for them
> sudo ubuntu-drivers devices
# Install a specific driver
> sudo apt install nvidia-driver-525
# Reboot and verify the NVIDIA driver
> sudo reboot
> nvidia-smi

# Install the Container Toolkit
> sudo apt-get update
> sudo apt-get install -y nvidia-docker2
# Restart the docker service and verify the installation
> sudo systemctl restart docker
> sudo docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi

# Output:
Tue Jun 17 18:42:16 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.8     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P0              24W /  80W |      0MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
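Before moving on to containers, it can be worth confirming that CUDA is also visible from Python on the host (a quick sanity check, assuming the torch install from the local-run section):

import torch

# True only if the driver and a CUDA-enabled torch build are both in place
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))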
2. vLLM deployment
The instructions on Hugging Face can be followed directly:
- Pull the images
docker pull vllm/vllm-openai
docker pull ghcr.io/open-webui/open-webui:cuda
- Start the container
# Deploy with docker on Linux, letting vLLM download the model from the Hub:
sudo docker run --runtime nvidia --gpus all \
    --name vllm_container \
    -v ~/model:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=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" \
    --env "HF_ENDPOINT=https://hf-mirror.com" \
    -p 9407:9407 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --port 9407 \
    --api-key fnos_deepseek_r1_api_key \
    --served-model-name deepseek-r1 \
    --dtype=float16 \
    --disable-nvtx \
    --attention-backend xformers \
    --disable-triton

# Or serve a model that has already been downloaded to the host:
sudo docker run --gpus "device=0" --ipc=host -d \
    --name vllm \
    -p 9407:9407 \
    -v /home/robinverse/model/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:/models/DeepSeek-R1-Distill-Qwen-1.5B \
    vllm/vllm-openai:v0.7.1 \
    --model=/models/DeepSeek-R1-Distill-Qwen-1.5B \
    --served-model-name=deepseek-r1-1.5b \
    --port 9407 \
    --max-model-len=8912 \
    --max-num-seqs=32 \
    --gpu-memory-utilization=0.90 \
    --dtype=float16 \
    --tensor-parallel-size 1
- HUGGING_FACE_HUB_TOKEN: used to download the model;
- --runtime nvidia --gpus all: run on the NVIDIA GPU;
- --model: the location of the model to serve, either a Hub repo id or a path inside the container;
- --api-key: the key that external clients must present; it has to be set if you use the web-ui (it plays the same role as an api-key issued by a commercial LLM platform), but matters little on a private network;
- --served-model-name deepseek-r1: the name clients select when calling vLLM; the model parameter in a request must carry exactly this name (see the client sketch at the end of this section).
- Check the container status and NVIDIA GPU usage
docker ps
# GPU usage
nvidia-smi
- Deploy the web UI with docker
sudo docker run -d --name open-webui --network host --gpus all \
    -e OPENAI_API_BASE_URL=http://localhost:9407/v1 \
    -e OPENAI_API_KEYS=fnos_deepseek_r1_api_key \
    -e USE_CUDA_DOCKER=true \
    dyrnq/open-webui
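Once the vLLM container is running, the OpenAI-compatible endpoint can also be exercised directly from Python without the web UI (a minimal sketch; the port, api_key and model name follow the docker run examples above):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:9407/v1",
    api_key="fnos_deepseek_r1_api_key",  # must match the --api-key passed to vLLM
)

# List the served models; each id should equal a --served-model-name
for m in client.models.list():
    print(m.id)

# Simple chat completion against the served model
resp = client.chat.completions.create(
    model="deepseek-r1",  # or "deepseek-r1-1.5b" for the second container above
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)

If the container was started without --api-key, vLLM does not check the key and any non-empty string can be passed as api_key.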