I really don't like that in the first layers X_t should be the representation of token t and gradually becomes that of token t+1 in the last layer. It makes absolutely no sense, it is objectively repugnant.
Loading thread detail
Fetching the original tweets from X for a clean reading view.
Hang tight—this usually only takes a few seconds.