I tried a new 8B local LLM, and its design might be the biggest shift since DeepSeek R1
…I patched the sampler to take the single-block radix-sort path on ROCm and bypass the 1024-thread merge variant entirely, though it was a rather naive move on my part…