DeepSeek 1.5B:
VRAM (FP16): ~2.6 GB
Recommended GPUs: NVIDIA RTX 3060 (12GB) or Tesla T4 (16GB) for edge deployment.
Optimization: Supports 4/8-bit quantization for memory reduction.
DeepSeek-LLM 7B:
VRAM (FP16): ~14–16 GB
VRAM (4-bit): ~4 GB
Recommended GPUs: RTX 3090 (24GB), RTX 4090 (24GB), or NVIDIA A10 (24GB).
Throughput: RTX 4090 achieves ~82.6 TFLOPS of FP16 shader throughput (Tensor Core FP16 is several times higher).
DeepSeek V2 16B:
VRAM (FP16): ~30–37 GB
VRAM (4-bit): ~8–9 GB
Recommended GPUs: NVIDIA RTX A6000 (48GB) or dual RTX 3090 (2× 24GB).
DeepSeek-R1 32B/70B:
VRAM (FP16): ~70–154 GB
Recommended GPUs: NVIDIA A100 (80GB) or H100 (80GB) in multi-GPU setups.
Optimization: 4-bit quantization reduces VRAM by 50–75%.
DeepSeek-V2 236B (MoE):
VRAM (FP16): ~20–543 GB. The MoE design activates only ~21B of the 236B parameters per token, so the compute load (and, with quantization or offloading, the resident working set) is far smaller than the full weight footprint.
Recommended GPUs: RTX 4090 (24GB) with quantization or 8× H100 (80GB) for full FP16 precision.
DeepSeek 67B:
VRAM (FP16): ~140 GB
Recommended GPUs: 4× A100-80GB GPUs with NVLink.
Optimization: 4-bit quantization allows single-GPU deployment (e.g., H100 80GB).
DeepSeek V3 671B:
VRAM (FP16): ~1.2–1.5 TB
Recommended GPUs: 16× H100 (80GB) or 10× H200 (141GB) with tensor parallelism.
Throughput: H200 delivers roughly 990 TFLOPS of dense FP16 Tensor Core compute.
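The per-model VRAM figures above all follow from the same arithmetic: parameter count × bytes per parameter, plus roughly 10–20% overhead for the KV cache, activations, and framework buffers. A minimal sketch (weights only; the numbers are approximations, not vendor specs):

```python
# Back-of-the-envelope weight-memory estimator for the figures above.
# Real deployments also need room for the KV cache, activations, and
# framework overhead, so actual VRAM use runs ~10-20% higher.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate GB of VRAM needed just to hold the weights."""
    return params_billions * BYTES_PER_PARAM[precision]

for name, params in [("DeepSeek 1.5B", 1.5), ("DeepSeek-LLM 7B", 7.0),
                     ("DeepSeek 67B", 67.0), ("DeepSeek V3 671B", 671.0)]:
    print(f"{name}: ~{weight_vram_gb(params):.0f} GB FP16, "
          f"~{weight_vram_gb(params, 'int4'):.0f} GB 4-bit")
```

This reproduces the pattern above: 7B lands at ~14 GB in FP16 and ~3.5 GB at 4-bit, and 671B at ~1.34 TB in FP16 before overhead.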
Quantization:
4-bit quantization reduces VRAM by 75% (e.g., 671B model drops from 1.5 TB to ~386 GB).
FP8/INT8 formats balance precision and memory efficiency.
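As a concrete example of the 4-bit path, here is a hedged sketch using Hugging Face transformers with bitsandbytes; it assumes the published deepseek-ai/deepseek-llm-7b-base checkpoint and a CUDA GPU, and the exact savings depend on the quantization scheme:

```python
# Sketch: load DeepSeek-LLM 7B with 4-bit (NF4) weight quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 weights in VRAM
    bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-llm-7b-base",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
```

With a config like this, the 7B weights fit in roughly the ~4 GB quoted above.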
Model Parallelism:
Split large models across multiple GPUs (e.g., 671B requires 16× H100).
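One way to do this in practice is vLLM's tensor parallelism, sketched below; the model ID and the 8-GPU node are assumptions, and a 671B-class checkpoint additionally needs multi-node parallelism, which this snippet does not cover:

```python
# Sketch: shard a large DeepSeek checkpoint across one multi-GPU node.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/deepseek-llm-67b-base",  # assumed model ID
    tensor_parallel_size=8,  # split each weight matrix across 8 GPUs
    dtype="float16",
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```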
Batch Size Reduction:
Smaller batches lower activation memory (e.g., for 236B models, batch size ≤4).
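To see why, note that activation memory scales roughly linearly with batch size. A crude illustration (the layer count and hidden size below are placeholders, not the real 236B architecture):

```python
# Rough proxy: keep ~2 [batch, seq, hidden] FP16 tensors per layer.
def activation_gb(batch, seq_len, hidden, layers, bytes_per_elem=2):
    return batch * seq_len * hidden * layers * 2 * bytes_per_elem / 1e9

for batch in (1, 4, 16):
    print(f"batch={batch}: ~{activation_gb(batch, 4096, 5120, 60):.0f} GB")
```

With these placeholder dimensions, going from batch 16 to batch 1 cuts activation memory from ~80 GB to ~5 GB.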
Checkpointing:
Trade computation time for memory by recomputing activations during the backward pass instead of storing them, as sketched below.
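In PyTorch this is activation checkpointing; a minimal sketch on a toy stack of layers (the segment count and layer sizes are arbitrary):

```python
# Sketch: recompute activations during backward instead of storing them.
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(4, 1024, requires_grad=True)

# Split the 8 layers into 2 segments; only segment-boundary activations
# are kept, and everything in between is recomputed on the backward pass.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```

In Hugging Face transformers the same trade-off is a one-liner: model.gradient_checkpointing_enable().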
Power/Cooling: Large multi-GPU setups draw substantial power; e.g., 10× RTX A6000 at 300 W each is ~3 kW per rack for the GPUs alone, before CPUs and cooling.
Edge Deployment: Lightweight models (e.g., the 1.5B model above) run on low-power GPUs such as the Tesla T4.
This list provides a quick reference for hardware requirements and optimization strategies for DeepSeek models.
Nvidia asserts that its RTX 4090 is nearly 50% faster than AMD's RX 7900 XTX in DeepSeek AI benchmarks.
Waiting for my RTX 5090.