but this feels like it has to be right in the end!

1. unsupervised learning on everything to understand the world: 1st-person video, 3rd-person video, car cameras, 2D animation, CCTV, instructional videos, text, images, any and all robotics data, etc.
2. that should transfer downstream to a model you finetune with teleoperation data. your robotics model uses its deep latent understanding of what a coffee mug really is and what it's used for to interpret your human demonstrations. also, finetuning in a motor-control / action head shouldn't be hard here if that data wasn't in pretraining (see the sketch after this list).
3. a bit of real-world on-policy RL with your model deployed in the wild (or some in sim / in the lab) is what you need to seal the deal.
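step 2 is the most code-shaped part of the recipe: bolt a randomly initialized action head onto the pretrained encoder and behavior-clone the teleoperation demos. a minimal PyTorch sketch under stated assumptions — the encoder interface, dimensions, module names, and the MSE behavior-cloning loss are all illustrative stand-ins, not any particular lab's method:

```python
import torch
import torch.nn as nn

# Sketch of stage 2: pretrained encoder + new action head, finetuned
# on teleop demos via behavior cloning. Everything here is a
# hypothetical stand-in for whatever stage-1 model you actually have.

class PolicyWithActionHead(nn.Module):
    def __init__(self, backbone: nn.Module, latent_dim: int, action_dim: int):
        super().__init__()
        self.backbone = backbone          # pretrained video/vision encoder (stage 1)
        self.action_head = nn.Sequential( # new head, randomly initialized
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),   # e.g. joint velocities or end-effector deltas
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        z = self.backbone(obs)            # latent "what is this scene" representation
        return self.action_head(z)

def finetune_step(policy, optimizer, obs, expert_action):
    """One behavior-cloning step on (observation, expert action) pairs."""
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, expert_action)  # simple BC regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a stand-in encoder; a real setup would load stage-1 weights.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
policy = PolicyWithActionHead(backbone, latent_dim=512, action_dim=7)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)
obs = torch.randn(8, 3, 64, 64)   # batch of camera frames
act = torch.randn(8, 7)           # matching teleop actions
print(finetune_step(policy, optimizer, obs, act))
```

whether you freeze the backbone or finetune it end-to-end is a design choice; the point of the recipe is that the head training is cheap because the hard perceptual work was already done in stage 1.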