I invested quite a bit of time and it wasn't easy, but I can finally run models like MiniMax 2.7 Q4 using CUDA and ROCm at the same time, bypassing Vulkan.

load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB

The main advantage is prefill speed.

On Windows:

rmdir /s /q build
cmake -B build -G Ninja ^
  -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
  -DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^
  -DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^
  -DGGML_HIP=ON ^
  -DGGML_CUDA=ON ^
  -DGGML_BACKEND_DL=ON ^
  -DGGML_CPU_ALL_VARIANTS=ON ^
  -DGGML_AVX_VNNI=OFF ^
  -DGGML_AVX512=OFF ^
  -DGGML_AVX512_VBMI=OFF ^
  -DGGML_AVX512_VNNI=OFF ^
  -DGGML_AVX512_BF16=OFF ^
  -DGGML_AMX_TILE=OFF ^
  -DGGML_AMX_INT8=OFF ^
  -DGGML_AMX_BF16=OFF ^
  -DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^
  -DCMAKE_CUDA_ARCHITECTURES="120" ^
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j

Unfortunately, the -DGGML_CPU_ALL_VARIANTS=ON flag causes many compilation errors, and I had to edit the CPU variant list, for example:

notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt

and comment out the offending variant line:

# ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)

With a Ryzen 5950X this works fine.

Then:

set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%

llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" ^
  --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja ^
  --fit on --flash-attn on -sm layer --n-cpu-moe 0 ^
  --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1

Done.
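
For anyone wondering why CUDA and ROCm can coexist here: -DGGML_BACKEND_DL=ON builds each backend as a separate DLL that llama.cpp loads at runtime, so the CUDA and HIP runtimes never have to be linked into the same binary. A quick sanity check after the build (assuming the default Ninja output layout under build\bin) is to list the backend DLLs:

dir build\bin\ggml-*.dll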
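
Before launching the server you can also confirm that both GPUs are visible to llama.cpp (assuming your build is recent enough to have --list-devices; the ROCm bin directory must already be on PATH so the HIP DLLs resolve):

llama-server.exe --list-devices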
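
With -sm layer the layers are distributed across the two cards automatically, which is where the roughly 2:1 CUDA0/ROCm0 split in the log above comes from. If you want to force a different balance, --tensor-split takes per-device proportions, e.g. appending this to the command above (a sketch; I believe the proportions follow the device order shown by --list-devices, CUDA0 first here):

--tensor-split 2,1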
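
Once it's running, the server exposes the OpenAI-compatible API. A minimal smoke test from cmd, assuming the default port 8080 since --port isn't set above:

curl http://127.0.0.1:8080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Say hello\"}], \"max_tokens\": 32}"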