Explore


Any time I've trained a transformer from scratch on webtext, the loss curve looks like this. The first drop makes sense, but why the second one?

Gemini is telling me nonsense.

Architecture: same as GPT-2, except SwiGLU, RoPE, and untied embeddings.
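For concreteness, here is a minimal PyTorch sketch of the kind of SwiGLU feed-forward block that replaces the GPT-2 MLP in a setup like this; the class and dimension names are illustrative, not taken from the author's code.

```python
# Minimal sketch (not the author's code): a SwiGLU feed-forward block of the
# kind that replaces the GELU MLP in a GPT-2-style transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gating projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # swish(gate) * value, then project back down to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```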

Training:
Muon + Adam
linear warmup (up to 500 steps; sketched below)
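A rough sketch of the warmup part of that schedule, assuming PyTorch and plain Adam; the Muon half of the optimizer setup is omitted, and the thread does not say what the learning rate does after step 500, so it is simply held constant here.

```python
# Rough sketch (assumptions: PyTorch, Adam only; the thread pairs Muon with
# Adam and does not specify the post-warmup schedule, so the LR multiplier is
# held at 1.0 after the 500-step linear ramp).
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

warmup_steps = 500

def warmup_lambda(step: int) -> float:
    # linear ramp from ~0 to the base LR over the first 500 steps, then hold
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_lambda)

for step in range(1_000):
    # loss.backward() would go here in real training
    optimizer.step()
    scheduler.step()
```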

My best guess is the induction-head-formation meme, but my understanding is that this happens quite late, after several thousand training steps or around a billion tokens or something, and I have 100k tokens per batch.
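The step/token conversion behind that timing worry is easy to make explicit; the thresholds below are just the ballpark figures named in the thread, not measured induction-head onsets.

```python
# Step/token conversion for the timing argument above. The thresholds are the
# ballpark figures the thread mentions, not measurements.
tokens_per_batch = 100_000

for total_tokens in (250_000_000, 1_000_000_000):  # "several thousand steps" worth, up to "a billion tokens"
    steps = total_tokens // tokens_per_batch
    print(f"{total_tokens:,} tokens -> {steps:,} optimizer steps at 100k tokens/batch")
```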

Do any transformer-training people know why this happens?


Interests: AI (Safety), meditation, philosophy, mathematics, algorithms. If I say something you disagree with, please DM or quote-tweet. I love to argue!

William Wale
Mon Dec 08 10:46:32