i think this is beautiful - doing ViT from raw pixels means you have to jointly train everything - this poor model must independently solve MNIST, and THEN/ALSO learn to be a perfect calculator in its weights. then keep going... constrained only by the data you give it. it's why @percyliang's concept of "foundation models" in 2021 was so disruptive/sacrilegious during the Google-vs-OpenAI sprint to GPT: instead of 1000 different small models, each specialized for its own task, concentrate all that budget/data/resources in one supermodel with the capacity to model all 1000 tasks. along the way you get 1) transfer learning, 2) capabilities you never explicitly trained for, and 3) emergent abilities that only unlock past a given parameter count, depth, or data exposure.
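To make the "jointly train everything from raw pixels" point concrete, here is a minimal sketch (my own illustration, not the model from the thread): a tiny ViT that is shown two raw MNIST images and supervised only on the *sum* of the digits, never on per-digit labels. To hit the arithmetic target it must implicitly solve MNIST along the way. Assumes PyTorch and torchvision; the architecture and hyperparameters are illustrative, not tuned.

```python
# Hypothetical sketch: a tiny ViT trained end-to-end to add two MNIST
# digits from raw pixels. Supervision is the sum only (0..18), so digit
# recognition must emerge as an internal intermediate representation.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

PATCH = 7   # 28x28 image -> 4x4 = 16 non-overlapping 7x7 patches
D = 64      # embedding width (illustrative)

class SumViT(nn.Module):
    def __init__(self):
        super().__init__()
        # patchify + embed raw pixels with a strided conv
        self.patch_embed = nn.Conv2d(1, D, kernel_size=PATCH, stride=PATCH)
        # learned positions for 2 images * 16 patches + 1 CLS token
        self.pos = nn.Parameter(torch.zeros(1, 2 * 16 + 1, D))
        self.cls = nn.Parameter(torch.zeros(1, 1, D))
        layer = nn.TransformerEncoderLayer(D, nhead=4, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D, 19)   # possible sums 0..18

    def forward(self, a, b):           # a, b: (B, 1, 28, 28) raw pixels
        tok = lambda x: self.patch_embed(x).flatten(2).transpose(1, 2)
        x = torch.cat([self.cls.expand(a.size(0), -1, -1), tok(a), tok(b)], 1)
        x = self.encoder(x + self.pos)
        return self.head(x[:, 0])      # predict the sum from the CLS token

def train(steps=2000, batch=128, device="cpu"):
    ds = datasets.MNIST(".", train=True, download=True,
                        transform=transforms.ToTensor())
    # draw 2*batch images, pair them up, supervise on the digit sums
    dl = DataLoader(ds, batch_size=2 * batch, shuffle=True, drop_last=True)
    model = SumViT().to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    step = 0
    while step < steps:
        for imgs, labels in dl:
            a, b = imgs[:batch].to(device), imgs[batch:].to(device)
            target = (labels[:batch] + labels[batch:]).to(device)
            loss = nn.functional.cross_entropy(model(a, b), target)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % 200 == 0:
                print(f"step {step}  loss {loss.item():.3f}")
            if step >= steps:
                break
    return model

if __name__ == "__main__":
    train()
```

The design choice that matters here: nothing in the loss names "digit classification". If the loss goes down, the network had to invent perception and addition jointly inside one set of weights - a toy version of the foundation-model bet that one model with enough capacity absorbs many tasks at once.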
