Thread Easy — Explore
Browse tweet threads, newest first

#16 - an investigation into the causal mechanism of attention sinks
link - https://t.co/7Nb21sY76l
nice read by @HeMuyu0327 
>RoPE imposes a frequency-dependent ordering on positions. the sink forms at the position whose composite rotation angles are extremal in that ordering.

>continuity matters because neighboring positions produce correlated q/k directions, which makes the extremal position’s value vector repeatedly injected across layers. discontinuity breaks this reinforcement.

>the author shows causal dependence by logit-patching the sink’s attention contribution. the sink vanishes because the iterative QK reinforcement loop is broken.

>this means sinks aren’t semantic artifacts, they’re positional attractors produced by RoPE’s geometric structure and the transformer’s iterative update dynamics.
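to make the RoPE claim concrete, here is a toy numpy sketch (mine, not the author's code) under the standard RoPE parameterization: position 0 gets zero rotation in every frequency band, i.e. it is the extremal position in the ordering, and adjacent positions get nearly identical rotations, which is the continuity point above.

```python
# toy sketch, not the author's code: standard RoPE angles p * theta_i,
# with theta_i = base^(-2i/d). position 0 has zero rotation in every band
# (the extremal point of the ordering), and neighbouring positions get
# nearly identical rotations (the "continuity" the thread mentions).
import numpy as np

def rope_angles(num_pos: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Composite rotation angles, shape (num_pos, head_dim // 2)."""
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)  # theta_i per channel pair
    return np.outer(np.arange(num_pos), freqs)               # p * theta_i

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate each (even, odd) channel pair of x by the per-position angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

num_pos, head_dim = 64, 32
angles = rope_angles(num_pos, head_dim)
print("extremal (smallest total rotation) position:", angles.sum(axis=1).argmin())  # -> 0

# the same key vector placed at adjacent positions stays highly correlated
rng = np.random.default_rng(0)
k = apply_rope(np.tile(rng.standard_normal(head_dim), (num_pos, 1)), angles)
k /= np.linalg.norm(k, axis=1, keepdims=True)
print("mean cosine sim of adjacent rotated keys:", float((k[:-1] * k[1:]).sum(axis=1).mean()))
```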



tokenbender
Mon Nov 17 19:06:31
#15 - Cautious Weight Decay
link - https://t.co/5KyC4dUbah

> weight decay does not care about the direction that the optimizer update wants.
> for SGD, AdamW, Lion, Muon, etc., decoupled weight decay is equivalent to optimizing some regularized or constrained version of the loss, not the original loss.

so can we get the good effects of weight decay (stability, regularization) without biasing the optimizer toward a different objective?
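one concrete reading of that question (my sketch, not necessarily the paper's exact rule): mask the decay so it only fires on coordinates where it does not fight the base optimizer's proposed update.

```python
# hedged sketch: apply decoupled weight decay only on coordinates where the
# decay direction does not oppose the base optimizer's update. the mask rule
# here (sign(update) == sign(param)) is an illustrative guess, not a verbatim
# reproduction of the paper's algorithm.
import torch

@torch.no_grad()
def cautious_decay_step(param: torch.Tensor, update: torch.Tensor,
                        lr: float, weight_decay: float) -> None:
    """param <- param - lr * update, with weight decay masked to aligned coords.

    `update` is whatever the base optimizer (SGD / AdamW / Lion / Muon ...)
    proposes for this step; the masked decay term never pushes against it.
    """
    # movement is -lr*update, decay direction is -param: they agree when signs match
    mask = (torch.sign(update) == torch.sign(param)).to(param.dtype)
    param.add_(update, alpha=-lr)                        # usual optimizer step
    param.add_(mask * param, alpha=-lr * weight_decay)   # masked decoupled decay

# usage on a toy parameter
w = torch.randn(4)
u = torch.randn(4)  # stand-in for e.g. Adam's m_t / sqrt(v_t)
cautious_decay_step(w, u, lr=1e-2, weight_decay=0.1)
```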



tokenbender
Mon Nov 17 18:51:29
#9 - The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
link - https://t.co/wlkpXHz4sf

the paper shows that if you use actual Gauss-Newton curvature instead of the watered-down approximations everyone uses, you can train LLMs dramatically faster: full GN cuts the number of training steps by about 5.4× compared to SOAP and 16× compared to muon.

the authors neither offer theoretical guarantees for this claim nor test it at scale (models of only ~150M params).
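for intuition, here is what "actual Gauss-Newton curvature" means on a toy least-squares problem (my sketch, nowhere near LLM scale): the preconditioner is the full J^T H J, not a diagonal or Kronecker-factored stand-in.

```python
# minimal sketch (toy scale, not an LLM): a full Gauss-Newton step uses the
# exact J^T H J curvature of the loss w.r.t. the model outputs, instead of
# the approximations that practical optimizers rely on.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 8))          # inputs
w_true = rng.standard_normal(8)
y = X @ w_true + 0.05 * rng.standard_normal(128)

w = np.zeros(8)
for step in range(5):
    resid = X @ w - y                      # model output minus target
    J = X                                  # Jacobian of outputs w.r.t. w (linear model)
    g = J.T @ resid                        # gradient of 0.5 * ||Xw - y||^2
    GN = J.T @ J                           # full Gauss-Newton curvature (output Hessian = I)
    w -= np.linalg.solve(GN + 1e-6 * np.eye(8), g)   # damped GN step
    print(f"step {step}: loss = {0.5 * np.mean(resid**2):.6f}")
```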



tokenbender
Mon Nov 17 17:59:58
i am not sure if throwing myself in a month long dump for filtering and sharing with minimal details is more aligned with will to live or a desire to chill myself.



tokenbender
Mon Nov 17 16:34:06
dinner break.
after that if I've the will to live i would continue this bookmark dump of everything i found interesting in the last month.



tokenbender
Mon Nov 17 16:31:26
#8 - Defeating the Training-Inference Mismatch via FP16
link - https://t.co/rFKo8w36nc

continuing on the same topic as #6 and #7, think of this as a mandatory read to catch up on the discussion on this topic.

previous works tried to patch this training-inference mismatch issue with importance sampling hacks or heavy engineering to better align kernels. it sort of helps but:
>costs extra compute (extra forward passes)
>doesn’t truly fix that you’re optimizing one and deploying another
>can still be unstable

so the paper’s thesis is: the real villain is bf16. use fp16.
i had lots of fun making several memes on this on my twt.
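a quick way to see the precision argument (my illustration, not the paper's experiments): bf16 keeps only 8 bits of significand precision vs fp16's 11 (implicit bit included), so when training and inference kernels compute the "same" logits with a tiny numerical discrepancy, the rounded values drift further apart in bf16.

```python
# hedged illustration, not the paper's experiments: round the "same" logits,
# produced by two kernels with a tiny numerical discrepancy, to fp16 and bf16.
# bf16's coarser significand leaves a larger train/inference gap after rounding.
import torch

torch.manual_seed(0)
logits = torch.randn(100_000, dtype=torch.float64) * 5.0
other = logits * (1 + 1e-7)   # stand-in for the inference kernel's version

def roundtrip(x: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    return x.to(dtype).to(torch.float64)

for dtype in (torch.float16, torch.bfloat16):
    gap = (roundtrip(logits, dtype) - roundtrip(other, dtype)).abs().mean()
    print(dtype, "mean |train - inference| gap after rounding:", gap.item())
```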



tokenbender
Mon Nov 17 16:07:34