Thread Easy — Explore
Browse tweet threads, newest first

#16 - an investigation into the causal mechanism of attention sinks
link - https://t.co/7Nb21sY76l
nice read by @HeMuyu0327 
>RoPE imposes a frequency-dependent ordering on positions. the sink forms at the position whose composite rotation angles are extremal in that ordering.

>continuity matters because neighboring positions produce correlated q/k directions, which makes the extremal position’s value vector repeatedly injected across layers. discontinuity breaks this reinforcement.

>the author shows causal dependence by logit-patching the sink’s attention contribution. the sink vanishes because the iterative QK reinforcement loop is broken.

>this means sinks aren’t semantic artifacts, they’re positional attractors produced by RoPE’s geometric structure and the transformer’s iterative update dynamics.
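to make the RoPE claim concrete, here is a toy numpy sketch (mine, not the author's code) under the standard RoPE parameterization: position 0 gets zero rotation in every frequency band, i.e. it is the extremal position in the ordering, and adjacent positions get nearly identical rotations, which is the continuity point above.

```python
# toy sketch, not the author's code: standard RoPE angles p * theta_i,
# with theta_i = base^(-2i/d). position 0 has zero rotation in every band
# (the extremal point of the ordering), and neighbouring positions get
# nearly identical rotations (the "continuity" the thread mentions).
import numpy as np

def rope_angles(num_pos: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Composite rotation angles, shape (num_pos, head_dim // 2)."""
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)  # theta_i per channel pair
    return np.outer(np.arange(num_pos), freqs)               # p * theta_i

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate each (even, odd) channel pair of x by the per-position angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

num_pos, head_dim = 64, 32
angles = rope_angles(num_pos, head_dim)
print("extremal (smallest total rotation) position:", angles.sum(axis=1).argmin())  # -> 0

# the same key vector placed at adjacent positions stays highly correlated
rng = np.random.default_rng(0)
k = apply_rope(np.tile(rng.standard_normal(head_dim), (num_pos, 1)), angles)
k /= np.linalg.norm(k, axis=1, keepdims=True)
print("mean cosine sim of adjacent rotated keys:", float((k[:-1] * k[1:]).sum(axis=1).mean()))
```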



tokenbender
Mon Nov 17 19:06:31
#15 - Cautious Weight Decay
link - https://t.co/5KyC4dUbah

> weight decay does not care about the direction that the optimizer update wants.
> for SGD, AdamW, Lion, Muon, etc., decoupled weight decay is equivalent to optimizing some regularized or constrained version of the loss, not the original loss.

so can we get the good effects of weight decay (stability, regularization) without biasing the optimizer toward a different objective?
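one concrete reading of that question (my sketch, not necessarily the paper's exact rule): mask the decay so it only fires on coordinates where it does not fight the base optimizer's proposed update.

```python
# hedged sketch: apply decoupled weight decay only on coordinates where the
# decay direction does not oppose the base optimizer's update. the mask rule
# here (sign(update) == sign(param)) is an illustrative guess, not a verbatim
# reproduction of the paper's algorithm.
import torch

@torch.no_grad()
def cautious_decay_step(param: torch.Tensor, update: torch.Tensor,
                        lr: float, weight_decay: float) -> None:
    """param <- param - lr * update, with weight decay masked to aligned coords.

    `update` is whatever the base optimizer (SGD / AdamW / Lion / Muon ...)
    proposes for this step; the masked decay term never pushes against it.
    """
    # movement is -lr*update, decay direction is -param: they agree when signs match
    mask = (torch.sign(update) == torch.sign(param)).to(param.dtype)
    param.add_(update, alpha=-lr)                        # usual optimizer step
    param.add_(mask * param, alpha=-lr * weight_decay)   # masked decoupled decay

# usage on a toy parameter
w = torch.randn(4)
u = torch.randn(4)  # stand-in for e.g. Adam's m_t / sqrt(v_t)
cautious_decay_step(w, u, lr=1e-2, weight_decay=0.1)
```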



tokenbender
Mon Nov 17 18:51:29
#9 - The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
link - https://t.co/wlkpXHz4sf

the paper shows that if you use actual Gauss-Newton curvature instead of the watered-down approximations everyone uses, you can train LLMs dramatically faster: full GN cuts the number of training steps by about 5.4× compared to SOAP and 16× compared to muon.

the authors neither offer theoretical guarantees for this claim nor test it at scale (models of only ~150M params).
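for intuition, here is what "actual Gauss-Newton curvature" means on a toy least-squares problem (my sketch, nowhere near LLM scale): the preconditioner is the full J^T H J, not a diagonal or Kronecker-factored stand-in.

```python
# minimal sketch (toy scale, not an LLM): a full Gauss-Newton step uses the
# exact J^T H J curvature of the loss w.r.t. the model outputs, instead of
# the approximations that practical optimizers rely on.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 8))          # inputs
w_true = rng.standard_normal(8)
y = X @ w_true + 0.05 * rng.standard_normal(128)

w = np.zeros(8)
for step in range(5):
    resid = X @ w - y                      # model output minus target
    J = X                                  # Jacobian of outputs w.r.t. w (linear model)
    g = J.T @ resid                        # gradient of 0.5 * ||Xw - y||^2
    GN = J.T @ J                           # full Gauss-Newton curvature (output Hessian = I)
    w -= np.linalg.solve(GN + 1e-6 * np.eye(8), g)   # damped GN step
    print(f"step {step}: loss = {0.5 * np.mean(resid**2):.6f}")
```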



tokenbender
Mon Nov 17 17:59:58
i am not sure if throwing myself in a month long dump for filtering and sharing with minimal details is more aligned with will to live or a desire to chill myself.



tokenbender
Mon Nov 17 16:34:06
dinner break.
after that if I've the will to live i would continue this bookmark dump of everything i found interesting in the last month.



tokenbender
Mon Nov 17 16:31:26
#8 - Defeating the Training-Inference Mismatch via FP16
link - https://t.co/rFKo8w36nc

continuing on the same topic as #6 and #7, think of this as a mandatory read to catch up on the discussion on this topic.

previous works tried to patch this training-inference mismatch issue with importance sampling hacks or heavy engineering to better align kernels. it sort of helps but:
>costs extra compute (extra forward passes)
>doesn’t truly fix that you’re optimizing one and deploying another
>can still be unstable

so the paper’s thesis is: the real villain is bf16. use fp16.
i had lots of fun making several memes on this on my twt.
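a quick way to see the precision argument (my illustration, not the paper's experiments): bf16 keeps only 8 bits of significand precision vs fp16's 11 (implicit bit included), so when training and inference kernels compute the "same" logits with a tiny numerical discrepancy, the rounded values drift further apart in bf16.

```python
# hedged illustration, not the paper's experiments: round the "same" logits,
# produced by two kernels with a tiny numerical discrepancy, to fp16 and bf16.
# bf16's coarser significand leaves a larger train/inference gap after rounding.
import torch

torch.manual_seed(0)
logits = torch.randn(100_000, dtype=torch.float64) * 5.0
other = logits * (1 + 1e-7)   # stand-in for the inference kernel's version

def roundtrip(x: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    return x.to(dtype).to(torch.float64)

for dtype in (torch.float16, torch.bfloat16):
    gap = (roundtrip(logits, dtype) - roundtrip(other, dtype)).abs().mean()
    print(dtype, "mean |train - inference| gap after rounding:", gap.item())
```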



tokenbender
Mon Nov 17 16:07:34