https://w418ufqpha7gzj-80.proxy.runpod.net
Started this for myself, but since I'm not using it continuously, I'm sharing it:
Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance).
Supports 5 parallel requests with the full context available (please don't misuse it; there are no safety guards in place).
Open for as long as the spot instance lasts, or a maximum of 4 hours.
And yes, there is no request logging (I don't even know how to enable it with llama-server).
Prompt processing and generation speeds: ~900 t/s and ~60 t/s at 8K context; ~450 t/s and ~30 t/s at 100K context.
Command used:
./build/bin/llama-server \
-m ../Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
--alias 'Qwen3-6-35B-A3B-turbo' \
--ctx-size 262144 \
--no-mmproj \
--host 0.0.0.0 \
--port 80 \
--jinja \
--flash-attn on \
--cache-type-k turbo3 \
--cache-type-v turbo3 \
--reasoning off \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--parallel 5 \
--cont-batching \
--threads 16 \
--threads-batch 16
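For anyone who wants to try it: llama-server exposes an OpenAI-compatible API, so a chat request can be sent with plain curl. A minimal sketch (the URL is the temporary proxy above and dies with the instance; the prompt and `max_tokens` value are just placeholders):

```shell
# Minimal chat-completions request against the shared llama-server instance.
# "model" matches the --alias set in the launch command above.
curl https://w418ufqpha7gzj-80.proxy.runpod.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-6-35B-A3B-turbo",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 128
      }'
```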
Thanks..
checkpoint-min-tokens is a local patch I have so that small background tasks don't wreck my checkpoint cache.