DCFNet: Discriminant Correlation Filters Network for Visual Tracking
Abstract:
Background: Real-time visual trackers based on CNNs (convolutional neural networks) usually do not perform online network updates in order to maintain high tracking speed, which inevitably limits their adaptability to changes in object appearance. Correlation filter based trackers, in contrast, can update their model parameters online in real time. Integrating correlation filters into CNN-based trackers, so that real-time tracking and online adaptive model updating are achieved simultaneously, is therefore a challenging research problem; exploiting fine-grained image features together with high-level general-purpose semantic embeddings to further improve tracking performance is equally important.
Objective: This work aims to build an online adaptive correlation filter coupled with a CNN, complemented by a high-level general semantic embedding that encodes image geometric and structural information, so that the tracker learns fine-grained image features through online model updates.
Method: We present an end-to-end lightweight network architecture, namely the Discriminant Correlation Filter Network (DCFNet). A differentiable DCF (discriminant correlation filter) layer is incorporated into a Siamese network architecture so that the convolutional features and the correlation filter are learned simultaneously, and the correlation filter can be updated online efficiently. We introduce a joint scale-position space into DCFNet, forming a scale DCFNet that predicts object scale and position simultaneously. We further combine the scale DCFNet with a convolutional-deconvolutional network to learn both high-level embedding space representations and low-level fine-grained representations of images; the adaptability of the fine-grained correlation analysis and the generalization capability of the semantic embedding complement each other for visual tracking. Throughout the entire pipeline, back-propagation is derived in the Fourier frequency domain, preserving the efficiency of the DCF.
Results: Extensive evaluations on the OTB (Object Tracking Benchmark) and VOT (Visual Object Tracking Challenge) benchmark datasets demonstrate that the proposed trackers run fast, at more than 65 frames per second, while maintaining tracking accuracy.
Conclusion: The proposed DCFNet tracker unifies feature representation learning and correlation filter based appearance modeling within an end-to-end learnable framework. DCFNet builds a lightweight feature learning network and models the correlation filter layer efficiently in the Fourier frequency domain. It is extended to a scale DCFNet based on the joint scale-position space, so that feature learning yields accurate predictions of both object scale and position. The scale DCFNet is further extended to a convolutional-deconvolutional DCFNet, which introduces a domain-independent image reconstruction constraint into semantic embedding learning to generate high-level feature representations that preserve image structure, and learns fine-grained context-aware correlation filters for accurate object localization. Evaluations on several benchmarks show that DCFNet, scale DCFNet, and convolutional-deconvolutional DCFNet achieve a good balance between tracking accuracy and speed.
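The differentiable DCF layer admits a compact implementation because the ridge-regression filter has a closed-form solution in the Fourier domain, so automatic differentiation can back-propagate through the FFTs directly. Below is a minimal sketch in PyTorch following the standard linear-kernel multi-channel DCF closed form; the tensor shapes, the Gaussian regression target, and the regularizer `lambda_` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class DCFLayer(nn.Module):
    """Differentiable DCF layer: solve the ridge-regression filter in the
    Fourier domain and correlate it with the search-region features."""

    def __init__(self, lambda_=1e-4):
        super().__init__()
        self.lambda_ = lambda_  # regularization constant (assumed value)

    def forward(self, template_feat, search_feat, gauss_label):
        # template_feat, search_feat: (B, C, H, W) real-valued feature maps
        # gauss_label: (B, 1, H, W) desired Gaussian response on the template
        xf = torch.fft.rfft2(template_feat)   # (B, C, H, W//2+1), complex
        zf = torch.fft.rfft2(search_feat)
        yf = torch.fft.rfft2(gauss_label)

        # Closed-form solution: alpha_hat = y_hat / (sum_c |x_hat_c|^2 + lambda)
        kxx = (xf * xf.conj()).real.sum(dim=1, keepdim=True)
        alphaf = yf / (kxx + self.lambda_)

        # Cross-correlate search features with the template, sum over channels,
        # shape by alpha_hat, and return to the spatial domain.
        kxzf = (zf * xf.conj()).sum(dim=1, keepdim=True)
        return torch.fft.irfft2(alphaf * kxzf, s=template_feat.shape[-2:])
```

At tracking time, DCF-style trackers typically realize the online update by keeping running averages of the Fourier-domain numerator and denominator statistics, so adapting the filter costs only a few element-wise operations per frame.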
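The joint scale-position prediction can then be sketched as an argmax over a small response volume, one response map per scale, with a penalty that discourages abrupt scale changes. The pyramid construction, the `scale_penalty` value, and the helper name below are assumed for illustration, not the paper's settings.

```python
import torch


def joint_scale_position(dcf_layer, template_feat, pyramid_feats, gauss_label,
                         scale_penalty=0.985):
    """Pick the best (scale, row, col) in the joint scale-position space.

    pyramid_feats: list of (1, C, H, W) feature maps, one per scale factor,
    all resampled to the template resolution beforehand.
    """
    best = None
    mid = len(pyramid_feats) // 2  # index of the unchanged scale
    for s, search_feat in enumerate(pyramid_feats):
        resp = dcf_layer(template_feat, search_feat, gauss_label)[0, 0]
        penalty = scale_penalty ** abs(s - mid)
        score = resp.max().item() * penalty
        if best is None or score > best[0]:
            row, col = divmod(int(resp.argmax()), resp.shape[1])
            best = (score, s, row, col)
    return best  # (score, scale_index, row, col)
```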
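The convolutional-deconvolutional branch pairs an encoder, whose output serves as the high-level semantic embedding, with a decoder trained under an image reconstruction constraint so that the embedding retains geometric and structural information. The sketch below uses assumed layer widths and a generic mean-squared reconstruction term; it illustrates the constraint, not the authors' architecture.

```python
import torch
import torch.nn as nn


class ConvDeconv(nn.Module):
    """Encoder-decoder: the encoder output is the semantic embedding, the
    decoder reconstructs the input image to constrain that embedding."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        embedding = self.encoder(x)      # high-level embedding space features
        recon = self.decoder(embedding)  # image reconstruction
        return embedding, recon


# Training would combine the DCF tracking loss on the response map with the
# reconstruction term, e.g. loss = tracking_loss + mu * mse(recon, x),
# where mu is an assumed weighting hyperparameter.
```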