DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Just got a DGX Spark set up today and I'm starting to configure it for local LLM inference.

Plan is to run:

• vLLM
• PyTorch
• Hugging Face models

as a local API backend for an application I’m building (education / analytics use case, trying to keep everything local/private).
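For context, the rough shape I have in mind is vLLM's OpenAI-compatible server (started with `vllm serve <model>`) with the app talking to it over HTTP. A minimal sketch of the client side, assuming the default port 8000 (model name, port, and the `ask` helper are placeholders, not something I've benchmarked on the Spark yet):

```python
import json
import urllib.request

# vLLM's OpenAI-compatible chat endpoint (default local port assumed)
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON payload for vLLM's OpenAI-compatible chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # low temperature for the analytics use case
    }

def ask(model: str, prompt: str) -> str:
    """POST a chat request to the local vLLM server and return the reply text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Nothing fancy, just keeping everything on localhost so no data leaves the box.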

I’ve mostly been working with cloud GPUs up to now, so this is my first time running something like this fully on-prem.

A few things I’m curious about:

• Which models are people running efficiently on this hardware?
• Any tuning tips for vLLM on unified-memory systems like this?
• How does real-world throughput compare to expectations?

Would appreciate any insights from people running similar setups.

submitted by /u/dalemusser
