Meta Superintelligence - Leadership Compute, Talent, and Data
… Here, expert choice routing struggles, as the experts can initially only choose from 1 token x batch size per layer, so each expert is given a very small set of tokens compared to when it was trained (an example training run would have 8k seqlen x 16 batch size = 128k tokens per pass). …
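
The gap between the two token pools can be made concrete with a rough back-of-the-envelope sketch. The training figures (8k seqlen, batch size 16) come from the example above; the expert count of 64, the decode batch size of 16, and the capacity factor of 1.0 are illustrative assumptions, not numbers from any particular model, and the helper names are hypothetical.

```python
def routing_pool(seq_len: int, batch_size: int) -> int:
    """Number of tokens an expert can choose from in one pass of a layer."""
    return seq_len * batch_size


def avg_tokens_per_expert(pool_tokens: int, num_experts: int,
                          capacity_factor: float = 1.0) -> float:
    """Average tokens each expert picks if the pool is split evenly.

    Under expert choice routing each expert selects its top-k tokens,
    with k roughly pool_tokens * capacity_factor / num_experts.
    """
    return pool_tokens * capacity_factor / num_experts


NUM_EXPERTS = 64          # assumption, for illustration only
DECODE_BATCH_SIZE = 16    # assumption, for illustration only

# Training: 8k seqlen x 16 batch size, as in the example run above.
train_pool = routing_pool(seq_len=8 * 1024, batch_size=16)

# Decode: only 1 new token per sequence is available per layer.
decode_pool = routing_pool(seq_len=1, batch_size=DECODE_BATCH_SIZE)

train_k = avg_tokens_per_expert(train_pool, NUM_EXPERTS)
decode_k = avg_tokens_per_expert(decode_pool, NUM_EXPERTS)

print(f"training pool: {train_pool} tokens, ~{train_k:.0f} per expert")
print(f"decode pool:   {decode_pool} tokens, ~{decode_k:.2f} per expert")
```

Under these assumptions each expert picks from a pool of 131,072 tokens during training but only 16 during decode, so the average per expert falls below one token and the top-k selection has almost nothing to rank.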