We used VLMs to turn robot videos into subtasks at 19x lower cost than humans

We have spent the past few weeks carefully annotating videos and experimenting with vision-language models (VLMs) for subtask annotation. This type of annotation is especially important for long-horizon tasks, since robots need a more granular learning signal than high-level instructions like “clean your room.” We ran 50+ experiments, created a new, diverse benchmark for this type of annotation, and built a pipeline that is 19x cheaper than human annotation. It works well as a first labeling pass, which speeds up human annotation and makes it substantially cheaper. The blog post is here: https://macrodata.co/blog/annotating-robot-video-subtasks

Submitted by /u/Other_Housing8453
