I tried a new 8B local LLM, and its design might be the biggest shift since DeepSeek R1
… This happens because Zyphra trained on MI300X, validated on MI300X, and sized the kernel for it. I patched the sampler to take the single-block radix-sort path on ROCm and bypass the 1024-thread merge variant entirely, though that was a rather naive move on my part. …
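To make the shape of that patch concrete, here is a minimal sketch of backend-keyed dispatch. It is not the actual sampler code: `pick_sort_path`, the `SortPath` names, the `backend` strings, and the `force_single_block_on_rocm` flag are all hypothetical, and I am assuming the merge variant is the default path when the patch is off.

```python
from enum import Enum


class SortPath(Enum):
    # Hypothetical labels for the two sort paths discussed above.
    SINGLE_BLOCK_RADIX = "single_block_radix"  # whole sort inside one thread block
    MERGE_1024 = "merge_1024"                  # 1024-thread merge variant


def pick_sort_path(backend: str, force_single_block_on_rocm: bool = True) -> SortPath:
    """Pick a sort path for the sampler (illustrative dispatch only).

    `backend` is assumed to be a plain string such as "cuda" or "rocm".
    """
    # The patch: on ROCm, always take the single-block radix-sort path.
    if backend == "rocm" and force_single_block_on_rocm:
        return SortPath.SINGLE_BLOCK_RADIX
    # Assumed default when the patch does not apply.
    return SortPath.MERGE_1024


if __name__ == "__main__":
    for backend in ("cuda", "rocm"):
        print(backend, "->", pick_sort_path(backend).value)
```

Note that a check keyed on the backend alone applies to every ROCm device, MI300X included, so it cannot tell the hardware the kernel was sized for apart from anything else running ROCm.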