nice that it's open weight, but comparing dense vs MoE models while only looking at total params is pretty unfair. if you look at active params instead of total params it's a different story (rough math in the sketch below):
- GLM 4.6 (32B active): 74% fewer
- MiniMax M2 (10B active): 92% fewer
- K2 Thinking (32B active): 74% fewer
- V3.2 (37B active): 70% fewer
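a quick back-of-envelope sketch of where those percentages come from; the ~123B dense baseline here is an assumption back-solved from the numbers above, not something stated in the original chart:

```python
# Compare each MoE model's *active* params against a dense model's total params.
# DENSE_TOTAL_B is a hypothetical baseline (back-solved for illustration);
# the active-param counts are the ones listed above.
DENSE_TOTAL_B = 123  # assumed dense model size, in billions of params

moe_active_b = {
    "GLM 4.6": 32,
    "MiniMax M2": 10,
    "K2 Thinking": 32,
    "V3.2": 37,
}

for name, active in moe_active_b.items():
    fewer = (1 - active / DENSE_TOTAL_B) * 100
    print(f"{name}: {active}B active -> {fewer:.0f}% fewer params than the {DENSE_TOTAL_B}B dense baseline")
```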
size (whether total or active!) is not the right metric here anyway; we should have the same graph with speed on vLLM / SGLang
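if anyone wants to make that graph, here is a minimal sketch of measuring generation throughput with vLLM's offline API (the model id and prompt are placeholders, swap in whichever checkpoints you want to compare):

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model id; replace with the checkpoint under test.
llm = LLM(model="your-org/your-model", tensor_parallel_size=1)
params = SamplingParams(max_tokens=512, temperature=0.0)

# A small batch of identical prompts is enough for a rough tok/s number.
prompts = ["Explain mixture-of-experts routing in two sentences."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```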