making models learn • eXperiments lab • memes and training lores


#16 - an investigation into the causal mechanism of attention sinks
link - https://t.co/7Nb21sY76l
nice read by @HeMuyu0327
> RoPE imposes a frequency-dependent ordering on positions; the sink forms at the position whose composite rotation angles are extremal in that ordering.
> continuity matters: neighboring positions produce correlated q/k directions, which makes the extremal position's value vector repeatedly injected across layers. discontinuity breaks this reinforcement.
> the author shows causal dependence by logit-patching the sink's attention contribution: the sink vanishes because the iterative QK reinforcement loop is broken.
> this means sinks aren't semantic artifacts; they're positional attractors produced by RoPE's geometric structure and the transformer's iterative update dynamics.
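a minimal sketch of the geometric picture (my toy illustration, not code from the linked post): RoPE rotates each 2D query/key sub-pair at position p by angle p·θᵢ with θᵢ = base^(−2i/d). position 0 gets zero rotation in every frequency band, so its composite rotation is extremal, matching where sinks are typically observed:

```python
import numpy as np

# RoPE rotation angles: angle[p, i] = p * theta_i, theta_i = base^(-2i/d)
d, base, n_pos = 8, 10000.0, 16
freqs = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies theta_i
angles = np.outer(np.arange(n_pos), freqs)  # (n_pos, d/2) rotation angles

# crude "alignment" score: for a fixed q/k direction, how well each
# position's rotated key still aligns across all frequency bands.
# position 0 (zero rotation everywhere) sits at the extremum.
composite = np.cos(angles).sum(axis=1)
print(composite.argmax())  # -> 0: the extremal position
```

this is only the positional-geometry part of the argument; the repeated value-vector injection across layers is what the post's logit-patching experiment probes.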


i am not sure if throwing myself into a month-long dump of filtering and sharing with minimal details is more aligned with a will to live or a desire to chill.


#9 - The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
link - https://t.co/wlkpXHz4sf
the paper shows that if you use the actual Gauss-Newton curvature instead of the watered-down approximations everyone uses, you can train LLMs dramatically faster: full GN cuts the number of training steps by about 5.4× compared to SOAP and 16× compared to Muon.
caveat: the paper neither offers theoretical guarantees for this claim nor tests it at scale (only 150M params).
