Basically what I'm doing here is trying to validate whether it's a reasonable idea to pick up a couple of V100s (either SXM2 modules on PCIe adapters or straight PCIe cards) for running this model, or models like it, for codegen and other mostly-text workloads. A pair of these runs about $1200 for 64GB of VRAM, versus roughly $1100 for 24GB from a 3090. My sense is that with 64GB you're simply not going to run out of room for context with a setup like this, with the model at INT8 and the KV cache unquantized, at any remotely reasonable context length.
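To sanity-check that, here's a rough back-of-the-envelope sketch (Python) of what an unquantized f16 KV cache costs per token of context. The layer/head/dim values below are placeholders, not the actual config for this model, so you'd plug in the numbers llama.cpp prints when it loads the GGUF:

```python
# Rough f16 (unquantized) KV-cache size estimate.
# NOTE: n_layers / n_kv_heads / head_dim are made-up placeholders;
# substitute the real values from the GGUF metadata (llama.cpp prints
# them at model load time).
n_layers       = 48    # placeholder
n_kv_heads     = 8     # placeholder (GQA KV heads, not attention heads)
head_dim       = 128   # placeholder
bytes_per_elem = 2     # f16

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
for ctx in (4096, 16384, 65536, 131072, 262144):
    gib = ctx * bytes_per_token / 2**30
    print(f"{ctx:>7} tokens -> {gib:6.2f} GiB KV cache")
```

With the Q8_0 weights at ~26.6 GiB, whatever is left of the 64GB after weights and compute buffers is the budget for that cache.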
One thing I'm not sure about, though, is why pp takes a dive at 64K context in this series of benchmarks. I'm just wondering if there are obvious things I'm forgetting to do here. TIA.
4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 4096,16384,65536

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB

| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d4096 | 797.25 ± 3.55 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d4096 | 31.16 ± 0.40 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d16384 | 702.58 ± 8.55 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d16384 | 30.27 ± 0.36 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d65536 | 473.34 ± 2.69 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d65536 | 26.71 ± 0.29 |

build: 2496f9c14 (9049)

4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 200000

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB

| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d200000 | 267.16 ± 0.29 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d200000 | 18.53 ± 0.14 |

build: 2496f9c14 (9049)

4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 128000

ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB

| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d128000 | 352.66 ± 0.61 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d128000 | 23.06 ± 0.23 |

build: 2496f9c14 (9049)
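For reference, here's the pp2048 throughput from the runs above pulled into one place (quick Python sketch; the numbers are copied verbatim from the tables):

```python
# pp2048 throughput (t/s) at each depth, copied from the llama-bench runs above.
pp = {4096: 797.25, 16384: 702.58, 65536: 473.34, 128000: 352.66, 200000: 267.16}
base = pp[4096]
for depth, tps in sorted(pp.items()):
    print(f"d={depth:>6}: {tps:7.2f} t/s  ({tps / base:.0%} of the d=4096 rate)")
```

Laid out that way it looks like a steady slide rather than a discrete cliff at 64K, which is part of why I'm unsure whether this is just attention cost growing with depth or something I've misconfigured.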