BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
… With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. …
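
To make the two ingredients named above concrete, here is a minimal pure-Python sketch of a binary expert mask trained with a straight-through estimator plus a sparsity-style auxiliary penalty. The excerpt does not give BEAM's actual formulation, so everything here is an assumption for illustration: the function names (`hard_mask`, `ste_grad`, `sparsity_penalty`), the 0.5 threshold, the sigmoid surrogate gradient, and the mean-probability form of the penalty are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used as the soft relaxation of the binary gate."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_mask(logits):
    """Forward pass: binarize each expert gate (1.0 = expert active).

    Assumption: an expert is active when its gate probability exceeds 0.5.
    """
    return [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]


def ste_grad(logits, grad_wrt_mask):
    """Backward pass with a straight-through estimator.

    The hard step has zero gradient almost everywhere, so the backward
    pass pretends the forward pass was sigmoid(z) and returns
    grad * s * (1 - s). This surrogate is an assumption, not BEAM's
    confirmed choice.
    """
    out = []
    for z, g in zip(logits, grad_wrt_mask):
        s = sigmoid(z)
        out.append(g * s * (1.0 - s))
    return out


def sparsity_penalty(logits):
    """Hypothetical auxiliary regularizer: mean gate probability.

    Minimizing it pushes gates toward 0, so fewer experts stay active.
    """
    return sum(sigmoid(z) for z in logits) / len(logits)


# Example: per-token gate logits for four experts (made-up values).
logits = [2.0, -1.0, 0.5, -3.0]
mask = hard_mask(logits)                          # experts 0 and 2 are active
grads = ste_grad(logits, [1.0, 1.0, 1.0, 1.0])    # nonzero for every expert
penalty = sparsity_penalty(logits)                # value strictly between 0 and 1
```

Under these assumptions, inactive experts still receive a gradient signal through the surrogate, which is what lets the number of active experts change during end-to-end training, while the penalty term trades off sparsity against the task loss.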