Hey all,
we're a mid-sized company (~70 people) currently planning to move a lot of our workloads on-prem instead of relying on cloud APIs. The goal for now is to run small to mid-sized models in the ~30B range, e.g. Qwen3.6 or Gemma4.
Use cases:
- Internal Chatbot (email, assistants, maybe some RAG)
- ~30 software devs, not yet doing agentic coding
- ML training (PyTorch, CNNs, ViTs)
- Some raytracing
We’ve got a server with 10 PCIe slots and are considering:
Option A (NVIDIA):
- 2× RTX 6000 Pro (as a starting point)
- ~192 GB VRAM total for 19k€
Option B (AMD):
- 10× Radeon AI Pro R9700
- ~320 GB VRAM total for ~15k€
Main concerns:
- Multi-GPU scaling (2 big vs 10 small)
- AMD vs NVIDIA for mixed workloads (esp. rendering, PyTorch training)
- Scaling options in the future
- We're currently using llama.cpp, but from what I've read here, vLLM would be a better fit for our multi-user use case. How does vLLM behave when splitting a model across many GPUs?
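For context on the last point, my understanding from the vLLM docs is that it splits models across GPUs via tensor parallelism and optionally pipeline parallelism. A rough sketch of how the two options might be launched (the model name is a placeholder, not what we'd necessarily run):

```shell
# Option A: 2x RTX 6000 Pro -- tensor parallelism across both cards
vllm serve some-org/some-30b-model --tensor-parallel-size 2

# Option B: 10x R9700 -- the tensor-parallel size generally has to
# divide the model's attention-head count, so 10-way TP often isn't
# valid; combining tensor and pipeline parallelism is one workaround:
vllm serve some-org/some-30b-model \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 5
```

If that's roughly right, it seems like the 10-GPU option adds parallelism-layout constraints that the 2-GPU option avoids, but I'd appreciate corrections from anyone running vLLM on many cards.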
What would you pick for a team setup like this?