RT @prompt_Tunes: (1/n) the way i like to think about this is : in post-norm the gradient must pass through norm fn i.e for out = norm(x +…
Loading thread detail
Fetching the original tweets from X for a clean reading view.
Hang tight—this usually only takes a few seconds.