Hello everyone. Over the last couple of months I have been assembling a local AI setup for personal use, and I thought I would write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback. My setup is nowhere near as advanced as many of the professional rigs posted here, but these are my specs:

So far I have mainly been using it to run Qwen 3.6 27B at Q8 on the two cards together. I experimented around a bit, but overall I landed on running my models with llama.cpp on the Vulkan backend (rough launch sketch at the end of the post). To get it out of the way: I am aware of the connectivity limitations of this system, especially for the third GPU, which would run at a measly 4x Gen 4 lanes. That would likely be a significant bottleneck if I ran a single model distributed over all of my GPUs. I would love to eventually upgrade to something like a Threadripper platform, or use a PCIe fabric card to connect the GPUs more directly (something like the LR-Link card recently shown on the Level1Techs channel), but due to the high cost that will have to wait.

I am working on a hobby research project in the programming languages area, so access to some less common knowledge is generally very helpful. AFAIK there isn't really anything stronger than the 27B for me to run locally at the moment. Eventually, with 96GB of VRAM, I could run something bigger, but the PCIe limitations would hurt overall performance in that scenario. I was therefore considering running two or three agents locally, with a smarter overseer like K2.6 via API. For certain tasks that are smaller in scope, or where lower speed is acceptable, I could also consider some CPU inference, since I have a decent chunk of system RAM to utilize as well.

Generally, the idea I was considering was building some form of harness that allows for semi-autonomous research and development within the scope of my project. A potential deployment could consist of a number of agentic developers/testers/thinkers running separately, for example as Q6 quants of the 27B, so that each one gets its own GPU (see the per-GPU sketch at the end). Depending on the workload, it would be nice for the "overseer" to dynamically deploy whichever agents and models fit the current workload; for certain tasks we might want to pause development and run one big model across all GPUs together, to benefit from its larger knowledge. Because of the complex and specific nature of the project, it touches on niche CS areas that models like the 27B are aware of but not necessarily well optimized for, so I think one key aspect would be letting the agents access internet search and bigger cloud models when necessary.

Overall, the part that interests me most, and that I currently know the least about, is how to effectively engineer a harness to manage this hardware and this project. I could definitely spend some time just (vibe) coding something to fit my specific needs, but I do not think my setup is conceptually anything new. I am aware that solutions like LangGraph and CrewAI exist, although I am unsure which would fit my use case best and be extensible enough for my needs.

I would be very curious to hear about other people's experiences and thoughts on this hardware setup and potential deployments on it. If you read through all of that, thank you very much, and sorry for the chaotic writing style. Cheers.
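For the curious, here is roughly the launch I mean for the shared-model case. It's a minimal sketch: the model path, context size, and even 1,1 split are placeholders for my own setup, and the flags are the standard llama-server ones.

```
# Serve one GGUF split layer-wise across the two main GPUs (Vulkan build).
# Model path, context size, and split ratio are placeholders.
llama-server -m ./models/qwen-27b-q8_0.gguf \
  -ngl 99 --split-mode layer --tensor-split 1,1 \
  -c 16384 --port 8080
```

I went with layer splitting rather than row splitting because it keeps inter-GPU traffic lower, which seems to matter given the weak PCIe links here.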
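And the multi-agent case: one server per GPU, one port per agent. Using GGML_VK_VISIBLE_DEVICES to pin each instance to a device is my understanding of how the Vulkan backend selects devices, so treat that part as an assumption.

```
# One llama-server per GPU, one port per agent (Q6 quants so each fits on one card).
GGML_VK_VISIBLE_DEVICES=0 llama-server -m ./models/qwen-27b-q6_k.gguf -ngl 99 --port 8081 &
GGML_VK_VISIBLE_DEVICES=1 llama-server -m ./models/qwen-27b-q6_k.gguf -ngl 99 --port 8082 &
GGML_VK_VISIBLE_DEVICES=2 llama-server -m ./models/qwen-27b-q6_k.gguf -ngl 99 --port 8083 &
```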
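Finally, the overseer idea in Python, talking to those servers through llama-server's OpenAI-compatible endpoints. Everything here (the cloud endpoint, model names, role names, and the routing prompt) is a hypothetical placeholder sketching the pattern, not a finished harness:

```python
# Minimal overseer sketch: a big cloud model decomposes a task, routes
# subtasks to local per-GPU agents, then synthesizes their outputs.
# Endpoints, model names, and the routing prompt are placeholders.
from openai import OpenAI

LOCAL_AGENTS = {
    "dev":     OpenAI(base_url="http://localhost:8081/v1", api_key="none"),
    "tester":  OpenAI(base_url="http://localhost:8082/v1", api_key="none"),
    "thinker": OpenAI(base_url="http://localhost:8083/v1", api_key="none"),
}
# The smarter cloud overseer (provider URL and key are placeholders).
overseer = OpenAI(base_url="https://api.example.com/v1", api_key="...")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def run_task(task: str) -> str:
    # 1. Overseer decomposes the task into role-tagged subtasks.
    plan = ask(overseer, "k2-placeholder",
               "Split this task into subtasks, one per line, each prefixed "
               f"with dev:/tester:/thinker: -- {task}")
    results = []
    for line in plan.splitlines():
        role, _, subtask = line.partition(":")
        client = LOCAL_AGENTS.get(role.strip())
        if client is None:
            continue  # skip lines that are not routable subtasks
        # 2. Each subtask runs on its own dedicated-GPU llama-server.
        results.append(ask(client, "local", subtask.strip()))
    # 3. Overseer synthesizes the agents' outputs into one answer.
    return ask(overseer, "k2-placeholder",
               f"Task: {task}\nAgent outputs:\n" + "\n".join(results))
```

My understanding is that this decompose/route/synthesize loop is essentially the pattern frameworks like LangGraph and CrewAI formalize, hence my question about whether to adopt one of those rather than grow my own.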