Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant
3 points by freakynit 1 hour ago | 2 comments
https://w418ufqpha7gzj-80.proxy.runpod.net

Started this for myself, but since I'm not using it continuously, I'm sharing it:

Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance).

Up to 5 parallel requests are supported, with the full context available. Please don't misuse it; there are no safety guards in place.

Open until the spot instance is reclaimed, or 4 hours max.

And yes, there is no request logging (I don't even know how to do that with llama-server).
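If anyone does want logging without touching llama-server, one approach is a tiny reverse proxy in front of it. A minimal stdlib sketch, assuming llama-server listens locally on port 8080 and the proxy is what gets exposed; the ports and the log format here are made up for illustration:

```python
# Minimal request-logging reverse proxy for llama-server (a sketch,
# not production code). Assumes llama-server is on 127.0.0.1:8080;
# both port numbers and the log format are illustrative.
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://127.0.0.1:8080"

def log_line(client: str, path: str, n_bytes: int) -> str:
    """Format one log entry: timestamp, client IP, path, body size."""
    ts = time.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{ts} {client} {path} {n_bytes}B"

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        print(log_line(self.client_address[0], self.path, len(body)))
        req = urllib.request.Request(
            UPSTREAM + self.path, data=body,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 80), LoggingProxy).serve_forever()
```

Note this buffers the whole response, so it would break streaming (SSE) completions as written.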

Prompt processing and generation speeds: 900 t/s and 60 t/s at 8K context; 450 t/s and 30 t/s at 100K context.
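For a feel of what those numbers mean end to end, a quick back-of-envelope; the 500-token output length is just an illustrative assumption:

```python
# Back-of-envelope latency from the throughput numbers above.
# Prefill and generation t/s are the figures quoted in the post;
# the prompt/output sizes are illustrative.
def request_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Total time = prompt processing + token generation."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# 8K context figures (900 t/s prefill, 60 t/s generation)
print(round(request_seconds(8_000, 500, 900, 60), 1))    # ~17.2 s
# 100K context figures (450 t/s prefill, 30 t/s generation)
print(round(request_seconds(100_000, 500, 450, 30), 1))  # ~238.9 s
```

So at full context, prefill alone dominates: a 100K-token prompt takes well over three minutes before the first output token.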

Command used:

    ./build/bin/llama-server \
      -m ../Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
      --alias 'Qwen3-6-35B-A3B-turbo' \
      --ctx-size 262144 \
      --no-mmproj \
      --host 0.0.0.0 \
      --port 80 \
      --jinja \
      --flash-attn on \
      --cache-type-k turbo3 \
      --cache-type-v turbo3 \
      --reasoning off \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --parallel 5 \
      --cont-batching \
      --threads 16 \
      --threads-batch 16
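For anyone trying it: llama-server exposes an OpenAI-compatible API, so a plain chat-completions POST works. A minimal stdlib Python sketch; the base URL is the one from the post (likely dead by the time you read this), the model name matches the --alias above, and the sampling fields just mirror the server's defaults:

```python
# Minimal client for the shared endpoint. llama-server serves an
# OpenAI-compatible /v1/chat/completions route; the URL below is the
# one from the post and may no longer be live.
import json
import urllib.request

BASE_URL = "https://w418ufqpha7gzj-80.proxy.runpod.net"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat-completions payload matching the server's setup."""
    return {
        "model": "Qwen3-6-35B-A3B-turbo",  # the --alias from above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
    }

def chat(prompt: str) -> str:
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Say hello in one sentence."))
```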
Thanks!



Also on a 3090, using Q4_K_XL with a reduced max context size: at 100K prompt length I get 2520 t/s prompt processing and 68 t/s generation:

  llama-server \
    --model /mnt/ubuntu/models/llama-cpp-qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --ctx-size 150000 \
    --n-gpu-layers 99 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --parallel 3 \
    --kv-unified \
    --ctx-checkpoints 32 \
    --checkpoint-every-n-tokens 8192 \
    --checkpoint-min-tokens 64 \
    --flash-attn on \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --reasoning on \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20
I was wondering whether TurboQuant is worth the effort right now, but speed-wise I'm not seeing it yet.

checkpoint-min-tokens comes from a local patch of mine, so that small background tasks don't wreck my checkpoint cache.
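On the q8_0 cache types: a rough sketch of what KV-cache quantization saves. The layer/head dimensions are placeholders, not this model's real shape, and llama.cpp's q8_0 works out to about 1.0625 bytes/element (32 int8 values plus a 2-byte scale per block):

```python
# Rough KV-cache size estimate to see what --cache-type-k/v q8_0 buys.
# The model dimensions below are assumed placeholders, not the real
# numbers for this model.
def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """K and V: 2 tensors of ctx x n_kv_heads x head_dim per layer."""
    total = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem
    return total / 2**30

# Assumed dims: 48 layers, 4 KV heads, head_dim 128, 150K context.
print(round(kv_cache_gib(150_000, 48, 4, 128, 2.0), 2))     # f16
print(round(kv_cache_gib(150_000, 48, 4, 128, 1.0625), 2))  # q8_0
```

Whatever the true dimensions, q8_0 roughly halves the cache versus f16, which is what makes a 150K context fit alongside the weights on 24 GB.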


Update: spot terminated


