The transformer in CoTracker is designed to iteratively update estimates of several trajectories jointly and can be applied to very long videos in a sliding-window manner, for which the authors developed an unrolled training procedure. CoTracker outperforms other point tracking methods in both efficiency and accuracy, demonstrating its potential for video motion prediction.
Key takeaways:
- The paper introduces CoTracker, an architecture that tracks multiple points throughout an entire video, improving upon methods that track points individually.
- CoTracker is based on a transformer network that models correlations between different points over time using specialised attention layers.
- The transformer is designed to iteratively update an estimate of several trajectories and can be applied to very long videos in a sliding-window manner.
- CoTracker performs favorably against other point tracking methods in terms of efficiency and accuracy.
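The sliding-window idea above can be sketched in a few lines. This is a toy illustration, not CoTracker's actual implementation: `refine_tracks` is a hypothetical stand-in for the iterative transformer update (here it simply moves estimates toward synthetic per-frame targets), and the window/stride mechanics are simplified. The key points it shows are that each window is initialised from the previous window's overlapping estimates, and that the update is applied iteratively within each window.

```python
import numpy as np

def refine_tracks(targets, tracks, num_iters=4):
    # Hypothetical stand-in for CoTracker's transformer: each iteration
    # nudges every trajectory estimate toward the per-frame targets.
    for _ in range(num_iters):
        tracks = tracks + 0.5 * (targets - tracks)
    return tracks

def track_video(targets, init_points, window=8, stride=4):
    # Slide a fixed-size window over the video. Consecutive windows
    # overlap by (window - stride) frames, so each new window starts
    # from the previous window's refined estimates on the overlap.
    T = targets.shape[0]
    # Initialise every frame's estimate from the query points: (T, N, 2).
    tracks = np.tile(init_points, (T, 1, 1)).astype(float)
    start = 0
    while start < T:
        end = min(start + window, T)
        tracks[start:end] = refine_tracks(targets[start:end],
                                          tracks[start:end])
        if end == T:
            break
        start += stride
    return tracks
```

Because windows overlap, frames in the overlap region are refined more than once, which is what lets estimates propagate smoothly across window boundaries in very long videos.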