Edit: The consensus seems to be that the budget I suggested is not enough for my somewhat ambitious project. I’d like to reshape the question for the upcoming comments: what’s the minimum budget to achieve my goal, and with which GPU configuration?
Hello,
I’m trying to figure out a realistic on-prem setup for a small team (approx. 20–30 developers) to use a local coding/agent model (thinking of something like Kimi K2.5 or GLM 5.1).
I guess my constraints are:
- everything has to stay on-prem
- VRAM is important, but memory bandwidth and low latency are essential
- decent UX is important (not expecting instant responses, obviously, but I also don’t want it to feel laggy or constantly queued)
My initial pick was a cluster of 4 DGX Sparks connected through a switch, but I read a few articles about heat and latency issues that steered me away from it. A cluster of Mac Studios was my second option, but given how difficult it is to get your hands on a couple of 512GB Macs nowadays, I don't think it's a viable option either. On top of that, it's not tailored for batch processing (vllm-mlx is still rudimentary in that regard).
I rambled a lot, but I guess my question is: what’s the best hardware + model + serving setup that $30k can buy that actually feels “comfortable” for 20–30 devs using it in parallel?
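For anyone who wants to sanity-check numbers, here is the rough back-of-envelope math behind the question, as a minimal sketch. Every input is a placeholder assumption, not a measurement: the parameter count, the quantization width, the share of devs active at once, and the per-user token rate are all made up for illustration. It only counts weight memory, so KV cache and activations come on top of that.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes).

    KV cache and activations are NOT included and add to this figure.
    """
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def required_throughput_tps(devs: int, active_fraction: float,
                            tokens_per_sec_per_user: float) -> float:
    """Aggregate decode throughput (tokens/s) needed so that every
    concurrently active user still gets the target per-user rate."""
    concurrent_users = devs * active_fraction
    return concurrent_users * tokens_per_sec_per_user


# Hypothetical inputs, chosen only to show the arithmetic:
# a 1000B-parameter model quantized to 4 bits per weight ...
weights_gb = weight_memory_gb(total_params_billion=1000, bits_per_weight=4)
# ... and 30 devs, half of them generating at once, each wanting 20 tokens/s.
needed_tps = required_throughput_tps(devs=30, active_fraction=0.5,
                                     tokens_per_sec_per_user=20)

print(f"weights alone: {weights_gb:.0f} GB")
print(f"aggregate decode throughput needed: {needed_tps:.0f} tokens/s")
```

If your real numbers for concurrency or acceptable tokens/s are very different from these placeholders, I'd love to hear them, since that's the part I'm least sure about.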
If anyone is running something similar:
- what did you end up with?
- what bottleneck surprised you?
- anything you’d do differently?
Appreciate any feedback... I'm trying to avoid building something that looks good on paper but feels sluggish in real use.
Cheers.