but this feels like it has to be right in the end!

1. unsupervised learning on everything to understand the world: 1st person, 3rd person, car cameras, 2d animation, cctv, instructional videos, text, images, any and all robotics data, etc.
2. that should transfer downstream to a model you finetune with teleoperation data. your robotics model uses its deep latent understanding of what a coffee mug really is and what it is used for to interpret your human demonstrations. also, finetuning in a motor-control / action head shouldn't be hard here if that data wasn't in pretraining (see the sketch after this list).
3. a bit of real-world on-policy RL with your model deployed in the wild (or some in sim / in the lab) is what you need to seal the deal.
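to make steps 2 and 3 concrete, here is a minimal PyTorch-style sketch: a small action head attached to a pretrained backbone, finetuned on teleop demos via behavior cloning, then nudged with a few on-policy policy-gradient steps. everything here is a hypothetical stand-in — the `PretrainedBackbone`, dimensions, synthetic batches, and dummy reward are placeholders for the real pretrained world model and real-world rollouts.

```python
# sketch of steps 2-3, under the assumptions above: all classes, dims,
# data, and the reward function are hypothetical stand-ins.
import torch
import torch.nn as nn

class PretrainedBackbone(nn.Module):
    """stand-in for the big model from step 1 (pretrained on video/text/etc.)."""
    def __init__(self, obs_dim=512, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
    def forward(self, obs):
        return self.encoder(obs)

class ActionHead(nn.Module):
    """small new head mapping latents to motor commands (step 2)."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.GELU(),
            nn.Linear(128, action_dim),
        )
    def forward(self, latent):
        return self.net(latent)

backbone = PretrainedBackbone()  # imagine loading pretrained weights here
head = ActionHead()
params = list(backbone.parameters()) + list(head.parameters())

# step 2: behavior cloning on teleoperation data (synthetic stand-in batch)
bc_opt = torch.optim.AdamW(params, lr=1e-4)
obs = torch.randn(64, 512)             # observations from human demos
expert_actions = torch.randn(64, 7)    # teleoperated action labels
for _ in range(10):
    pred = head(backbone(obs))
    loss = nn.functional.mse_loss(pred, expert_actions)
    bc_opt.zero_grad(); loss.backward(); bc_opt.step()

# step 3: a little on-policy RL (REINFORCE over a Gaussian policy);
# the reward below is a dummy standing in for real-world / sim feedback
log_std = nn.Parameter(torch.zeros(7))
rl_opt = torch.optim.AdamW(params + [log_std], lr=1e-5)
for _ in range(5):
    obs = torch.randn(32, 512)                     # on-policy observations
    dist = torch.distributions.Normal(head(backbone(obs)), log_std.exp())
    actions = dist.sample()
    reward = -(actions - 0.5).pow(2).mean(dim=-1)  # dummy task reward
    advantage = reward - reward.mean()             # simple baseline
    loss = -(dist.log_prob(actions).sum(-1) * advantage).mean()
    rl_opt.zero_grad(); loss.backward(); rl_opt.step()
```

the point of the sketch is just the shape of the recipe: the heavy lifting lives in the pretrained backbone, the action head is cheap to bolt on and finetune, and the RL stage only has to adjust an already-competent policy rather than learn from scratch.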