Running Qwen3.6 MTP GGUF on AMD AI MAX 395 with llama.cpp ROCm

June 1, 2026 - 3 minutes read - 454 words

Background

MTP support was recently merged into llama.cpp through the following pull request:

After the merge, I wanted to test MTP models on my mini-PC powered by the AMD AI MAX 395. I tried several approaches, including manually building llama.cpp and using Unsloth GGUF models directly. However, despite multiple attempts, I could not get a stable working setup.

I also searched through GitHub issues and asked several AI assistants, including ChatGPT, Gemini, and DeepSeek. Unfortunately, none of the suggested solutions worked reliably in my environment.

As a last resort, I tried Claude. Claude recommended using the pre-built ROCm binaries from lemonade-sdk/llamacpp-rocm instead of continuing to debug the manual build process. That recommendation worked.

This post documents my setup and the token generation speed I observed when running:

Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL

on AMD AI MAX 395.

Environment

Hardware:

Mini-PC with AMD AI MAX 395

Software:

Ubuntu with ROCm support Pre-built llama.cpp ROCm binary for gfx1151 GGUF models from Unsloth Hugging Face CLI

Setup

The following commands download and install the pre-built llama.cpp ROCm binary from lemonade-sdk/llamacpp-rocm.

cd /tmp

wget https://github.com/lemonade-sdk/llamacpp-rocm/releases/download/b1282/llama-b1282-ubuntu-rocm-gfx1151-x64.zip

mkdir llama
mv llama-b1282-ubuntu-rocm-gfx1151-x64.zip llama

cd llama
unzip llama-b1282-ubuntu-rocm-gfx1151-x64.zip
rm llama-b1282-ubuntu-rocm-gfx1151-x64.zip
cd ..
mv llama ~/.local/bin

hf download unsloth/Qwen3.6-35B-A3B-MTP-GGUF Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf

Running the Models

I tested several models, all quantized with UD-Q4_K_XL.

unsloth/Qwen3.5-27B-GGUF
unsloth/Qwen3.5-4B-GGUF
unsloth/Qwen3.5-9B-GGUF
unsloth/Qwen3.5-9B-MTP-GGUF
unsloth/Qwen3.6-27B-GGUF
unsloth/Qwen3.6-27B-MTP-GGUF
unsloth/Qwen3.6-35B-A3B-MTP-GGUF

# command to serve mtp models
llama-server  -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL    \
        -ngl 99 -c 131072 -fa on -np 1     \
        --spec-type draft-mtp --spec-draft-n-max 2

Test Prompt

For consistency, I used the same prompt "mindmap of dspy in mermaid format" across the tested models:

Among the models tested, Qwen3.6-35B-A3B-MTP-GGUF produced the best result. The 4B model entered an infinite loop during testing and was not usable for this prompt.

Performance Result

Below is the token generation speed observed when using Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL.

The generation speed was generally around 60–66 tokens per second, which is quite impressive for this setup.

Notes

A few observations from this experiment:

Manually building llama.cpp for AMD AI MAX 395 was not straightforward in my environment. The pre-built lemonade-sdk/llamacpp-rocm binary worked immediately. Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL gave the best result among the models I tested. The 4B model did not behave correctly for my test prompt and entered an infinite loop. MTP with --spec-type draft-mtp and --spec-draft-n-max 2 worked successfully with the 35B MTP model.

Conclusion

After several unsuccessful attempts with manual builds and different setup methods, the most reliable solution was to use the pre-built ROCm-enabled llama.cpp binary from lemonade-sdk/llamacpp-rocm.

For AMD AI MAX 395 users who want to test MTP models with llama.cpp, this may be the fastest way to get started.

In my test, Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL achieved around 60–66 tokens per second, making it a strong option for local inference experiments on this hardware.