So incredibly based that DeepSeek won best paper at ACL with NSA, then, it seems, was dissatisfied with results at scale, figured out a better architecture that can use full attention models to boot, published that and shared weights. We often suspect Google does the opposite
Loading thread detail
Fetching the original tweets from X for a clean reading view.
Hang tight—this usually only takes a few seconds.
