Tracking-by-attention (TBA) performs online multi-object tracking by representing each unique object with an embedding, or "query", that is propagated across frames. These object queries are passed between time steps and used to detect objects in the scene. A query that outputs high-confidence predictions in multiple frames is considered to be associated with the same object across those frames.
This approach has shown promising results in the 2D and vision-based tracking domains. However, the performance of TBA methods in the 3D LiDAR tracking domain has yet to match that of tracking-by-detection (TBD) methods. JDT3D explores this performance gap, proposing a novel LiDAR-based tracking method that addresses the limitations of existing LiDAR-based TBA methods.
JDT3D introduces a novel data augmentation method called Track Sampling Augmentation (TSA) to enrich the supervision signals for the model. TSA injects the same objects consistently over multiple LiDAR frames, so the added supervision remains temporally consistent. This lets the model learn from a more diverse set of object configurations and improves its robustness to occlusions and object interactions.
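The core idea of injecting one object consistently across a sequence can be sketched as follows. This is a minimal illustration, not the JDT3D implementation: the `Frame` and `SampledTrack` types, the `inject_track` function, and the box format are all hypothetical, and practical concerns such as collision checks against existing objects are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


@dataclass
class Frame:
    """Hypothetical container for one LiDAR sweep and its labels."""
    points: List[Point] = field(default_factory=list)
    boxes: List[dict] = field(default_factory=list)


@dataclass
class SampledTrack:
    """Hypothetical object track sampled from a database, one entry per frame."""
    track_id: int
    per_frame_points: List[List[Point]]
    per_frame_centers: List[Point]


def inject_track(frames: List[Frame], track: SampledTrack) -> List[Frame]:
    """Return copies of `frames` with the same track added to every frame."""
    if len(track.per_frame_points) != len(frames):
        raise ValueError("track must cover every frame in the sequence")
    augmented = []
    for frame, pts, center in zip(
        frames, track.per_frame_points, track.per_frame_centers
    ):
        # The same track_id is used in every frame, so the injected
        # object forms a temporally consistent track label.
        label = {"track_id": track.track_id, "center": center}
        augmented.append(
            Frame(points=frame.points + pts, boxes=frame.boxes + [label])
        )
    return augmented


# Two-frame sequence with one background point each.
frames = [Frame(points=[(0.0, 0.0, 0.0)]), Frame(points=[(0.1, 0.0, 0.0)])]

# An injected object that moves 1 m along x between the two frames.
track = SampledTrack(
    track_id=7,
    per_frame_points=[
        [(5.0, 1.0, 0.0), (5.2, 1.0, 0.0)],
        [(6.0, 1.0, 0.0), (6.2, 1.0, 0.0)],
    ],
    per_frame_centers=[(5.1, 1.0, 0.0), (6.1, 1.0, 0.0)],
)

augmented = inject_track(frames, track)
injected_ids = [f.boxes[-1]["track_id"] for f in augmented]
```

Because the injected points and the label share one `track_id` in every frame, the augmented sequence adds a full track rather than a set of unrelated per-frame objects.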
Unlike previous methods, which use ground truth matching to decide which queries are passed to the next frame during training, JDT3D applies a confidence threshold for query propagation during both training and inference, so the query passing criterion is the same in both settings. Only high-confidence queries are propagated across frames, which naturally creates instances of false positive and false negative track queries and improves the overall tracking performance.
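The difference between the two propagation criteria can be illustrated with a small sketch. The `Query` type, both function names, and the threshold value of 0.4 are assumptions made for illustration; they are not taken from the JDT3D code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    """Hypothetical object query after decoding one frame."""
    track_id: int
    score: float      # predicted confidence in the current frame
    matched_gt: bool  # whether ground truth matching assigned it an object


def propagate_by_gt(queries: List[Query]) -> List[Query]:
    """Training-only criterion: pass on queries matched to ground truth."""
    return [q for q in queries if q.matched_gt]


def propagate_by_confidence(queries: List[Query], threshold: float) -> List[Query]:
    """Criterion usable in training and inference: pass on confident queries."""
    return [q for q in queries if q.score >= threshold]


queries = [
    Query(track_id=0, score=0.91, matched_gt=True),
    Query(track_id=1, score=0.12, matched_gt=True),   # real object, low score
    Query(track_id=2, score=0.55, matched_gt=False),  # no object, high score
    Query(track_id=3, score=0.05, matched_gt=False),
]

gt_ids = [q.track_id for q in propagate_by_gt(queries)]

kept = propagate_by_confidence(queries, threshold=0.4)  # assumed threshold
kept_ids = [q.track_id for q in kept]

# Confident queries with no matched object are false positive track queries.
false_positives = [q.track_id for q in kept if not q.matched_gt]
# Matched queries that fall below the threshold are false negatives.
false_negatives = [
    q.track_id for q in queries if q.matched_gt and q.track_id not in kept_ids
]
```

With ground truth matching, queries 0 and 1 are propagated and the model never sees an imperfect set of track queries. With the confidence threshold, query 2 is carried forward as a false positive and query 1 is dropped as a false negative, which are the cases the model also has to handle at inference time.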
@article{cheongJDT3DAddressingGaps2024,
author = {Cheong, Brian and Zhou, Jiachen and Waslander, Steven},
title = {JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention},
journal = {ECCV},
year = {2024},
}