Strix Halo Clustering (Hardware Setup Discussion)

Cross-post from the Strix Halo sub, but I think the fine folks here also have some wisdom, maybe on the model side:

Hey there!

I recently got into the local hardware game with a Strix Halo (Bosgame M5); since I bought it, the hardware has gone up in price by some 10-20% in two weeks.

I'm now thinking that it would be good to buy another one and cluster the two nodes to run bigger models before prices go up further.

I am an enterprise user working on sensitive code, so locally hosting the model is the only way I can use LLMs in my field of work.

Does anybody have experience with clustering tools for running models across multiple nodes?

The real motivation behind this approach is that I would have 256 GB of RAM rather than 128 GB. Based on reading some Bartowski quants on Hugging Face, the models I would be able to run are:

128 GB:

- Minimax 2.7 at a high q3 quant with small context

- q1/q2 of GLM 4.7 (NOT Flash)

- q3-ish Qwen 3.5 ~400B

Meanwhile with two systems, potentially:

256 GB:

- Minimax 2.7 at q4 with decent context

- q4 of GLM 4.7

- q1/q2 of GLM 5.1 (maybe higher with some REAP version)

- q4 of Qwen 3.5 ~400B

Yes, I get it: Qwen 3.6 27B is good, Gemma is good, but for real agentic work and actually getting things done, I was not that happy with the models in the ~32-64 GB range.

What I want to find out is:

1) What methods can you use for clustering?

1.1) I have seen people use Thunderbolt networking, which would be a nice option, but the protocol has fairly high latency due to wrapping each data packet in the Thunderbolt layer, and as far as I understand, there is still no option for RDMA over Thunderbolt on Strix Halo as there is with Mac Studios.

1.2) I have also seen people use M.2 NVMe-to-networking/OCuLink adapters. This is a feasible approach, but I would need to run a high-speed network card at each of the Strix Halos.

1.2.1) Would 50 Gig networking be good enough for the interconnect? Can I do 100 Gig, e.g. over those NVIDIA DGX Spark connectors?

1.2.2) What speed is achievable, and what's the latency? (I know it's limited by the M.2 slot to something like PCIe Gen 4 x4 speeds.) Is it slower in reality?
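For what it's worth, a back-of-envelope estimate can frame the bandwidth side of 1.2.1/1.2.2. The layer count and hidden size below are hypothetical placeholders (not numbers for any specific model), and the traffic model (two fp16 all-reduces of one hidden-state vector per layer per generated token, as in tensor parallelism) is a simplification:

```shell
#!/bin/sh
# Rough ceiling on decode tokens/sec imposed by the link alone, for
# tensor parallelism over 2 nodes. All model numbers are made up.
LAYERS=60        # hypothetical transformer layer count
HIDDEN=6144     # hypothetical hidden dimension
BYTES=2          # fp16 activations

# ~2 all-reduces per layer (after attention and after the MLP), each
# moving roughly one hidden-state vector per generated token.
PER_TOKEN=$((2 * LAYERS * HIDDEN * BYTES))

LINK_BPS=$((50 * 1000 * 1000 * 1000))   # 50 Gbit/s link
TOKENS_CEIL=$((LINK_BPS / 8 / PER_TOKEN))

echo "bytes moved per token: $PER_TOKEN"
echo "interconnect-only ceiling: $TOKENS_CEIL tokens/s"
```

If even a crude model like this puts the bandwidth-only ceiling in the thousands of tokens/s, then per-message round-trip latency, not raw throughput, is likely the binding constraint, which matches the Thunderbolt latency concern above.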

1.3) Have I missed any additional options?

2) What clustering techniques would work well?

2.1) I know tensor parallelism across two machines is nice for prefill acceleration (and the Strix Halo would benefit from higher prefill speed to process the long contexts in agentic coding workloads). How is the software stack for this? I know of the vLLM Strix Halo toolboxes; is it painful to install / has it been tried?
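In case it helps the discussion, this is roughly what a two-node vLLM tensor-parallel launch looks like in the generic (Ray-based) recipe. Untested on Strix Halo: the IPs and model path are placeholders, and the ROCm wheels/env vars for this hardware will differ from a stock install:

```shell
# Untested sketch: two-node tensor parallelism with vLLM over a Ray
# cluster. Placeholders: 192.168.10.1 (node A) and the model path.

# node A (head of the Ray cluster):
ray start --head --port=6379

# node B (join the cluster):
ray start --address=192.168.10.1:6379

# back on node A, shard the model across both nodes:
vllm serve /models/some-model --tensor-parallel-size 2
```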

2.2) Pipeline parallelism: does it offer any generation-speed advantage in tokens/sec? I would prefer something decently fast for my work.
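One stack I've seen people use for this kind of layer-split (pipeline-style) setup is llama.cpp's RPC backend. A sketch, with placeholder IPs/paths; the exact flags should be checked against your llama.cpp build:

```shell
# Untested sketch: llama.cpp RPC backend, splitting model layers
# across two machines. 192.168.10.2 is a placeholder for node B.

# node B: expose its memory/compute over the network
./rpc-server -H 0.0.0.0 -p 50052

# node A: run the model, offloading part of it to node B
./llama-server -m /models/some-model.gguf --rpc 192.168.10.2:50052 -ngl 99
```

Note that a pure layer split mostly buys capacity rather than speed: during decode the nodes largely take turns, so single-stream tokens/sec is not expected to improve much over one box.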

2.3) Would something like Exo work on the Strix Halo? I've only seen people use it with Mac clusters, and I'm under the impression that it's Mac-specific.

3) To be clearer about my background: I am an embedded engineer, so I am OK with hacky solutions as long as someone else has done it before and left at least some documentation. I just figured out how to train my own models on Strix Halo using PyTorch; it was a mess, but I managed with some configuration. What were your experiences? Is there another solution you can recommend? Distributed compute?
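On the distributed-training side, the generic two-node PyTorch launch looks roughly like this. `train.py` is a placeholder for your own DDP script, and the IP is hypothetical; the `gloo` backend works over plain TCP, so no RDMA is needed (whether this behaves on Strix Halo's ROCm stack is exactly what I'd like to hear about):

```shell
# Untested sketch: two-node data-parallel training with torchrun.
# 192.168.10.1 is node A's (rank 0's) address, a placeholder.

# node A (rank 0):
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
         --master_addr=192.168.10.1 --master_port=29500 train.py

# node B (rank 1):
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
         --master_addr=192.168.10.1 --master_port=29500 train.py
```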

Would love to hear everyone's experience. Even if you just got a setup like this running, I would love to jump on a quick call or something (I'm on the LocalLLaMA Discord, btw), so just PM me and let's find a time. All responses welcome!

submitted by /u/Thanks-Suitable
