High-VRAM GPUs aren't the future of local AI — unified memory and Mixture of Experts models are
… There's also a routing step that costs you a little per token, and the irregular token-by-token routing hurts memory locality. A single user will feel that more than a server would, because with one request at a time you can't amortize weight reads across a big batch. …
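
To put a rough number on the batching point, here is a toy sketch that counts how many expert weight blocks get read per token in a single MoE layer. Everything in it is an assumption for illustration: the function name `expert_reads_per_token` is made up, the 64-expert / top-4 configuration is arbitrary, and routing is uniform random, which a real learned router is not.

```python
import random


def expert_reads_per_token(num_experts, top_k, batch_size, steps=200, seed=0):
    """Average number of distinct expert weight blocks read per token.

    Toy model of one MoE layer: each token picks top_k experts, and an
    expert's weights are read once per step no matter how many tokens
    in the batch were routed to it.
    """
    rng = random.Random(seed)
    total_reads = 0
    for _ in range(steps):
        touched = set()
        for _ in range(batch_size):
            # Assumption: uniform random routing stands in for a learned router.
            touched.update(rng.sample(range(num_experts), top_k))
        total_reads += len(touched)
    return total_reads / (steps * batch_size)


# Hypothetical configuration: 64 experts, 4 active per token.
single = expert_reads_per_token(num_experts=64, top_k=4, batch_size=1)
batched = expert_reads_per_token(num_experts=64, top_k=4, batch_size=64)

print(f"batch size 1:  {single:.2f} expert reads per token")
print(f"batch size 64: {batched:.2f} expert reads per token")
```

With one token per step, every token pays for all of its own expert reads. With a large batch, many tokens land on the same experts and share a single read, so the per-token cost drops. The sketch ignores caching, attention weights, and everything else about a real memory hierarchy, so treat it as an intuition aid and not as a performance model.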