As far as I know, the weights are about 160 GB, plus 9.6 GB of KV cache for the max 1M-token context window, plus ~5 GB of overhead = ~175 GB of VRAM.
But vLLM and other sources say "To use the full 1M context, you need 4x A100 80G" -- that's 320 GB of VRAM?? Am I missing something??
Sources:
- https://lushbinary.com/blog/deepseek-v4-self-hosting-guide-vllm-hardware-deployment/?hl=en-GB
- vLLM deployment blog post
The 9.6 GB figure is also sourced from the vLLM blog page, and the official model page says the KV cache takes ~10% of what 3.2 used to take.
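For clarity, here's the arithmetic I'm doing, as a quick sketch (all figures are my own estimates from the sources above, not measured numbers):

```python
# Back-of-the-envelope VRAM math (figures from the blog posts above,
# not measured -- treat them as estimates).
weights_gb = 160.0   # model weights
kv_cache_gb = 9.6    # KV cache for the full 1M-token window (per the vLLM blog)
overhead_gb = 5.0    # rough runtime overhead

my_total = weights_gb + kv_cache_gb + overhead_gb
print(round(my_total, 1))   # my estimate: 174.6 GB

# What vLLM's recommendation adds up to:
vllm_total = 4 * 80         # 4x A100 80G
print(vllm_total)           # 320 GB

# The gap my question is about:
print(round(vllm_total - my_total, 1))
```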