STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification

AAAI 2019 paper

Abstract

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Different from most existing methods, which simply compute representations of video clips using frame-level aggregation (e.g. average pooling), the proposed STA adopts a more effective way of producing robust clip-level feature representations. Concretely, STA fully exploits the discriminative parts of the target person in both the spatial and temporal dimensions, producing a 2-D attention score matrix via inter-frame regularization that measures the importance of spatial parts across different frames. A more robust clip-level feature representation can then be generated by a weighted-sum operation guided by the mined 2-D attention score matrix. In this way, challenging cases for video-based person re-identification such as pose variation and partial occlusion are well handled by STA. We conduct extensive experiments on two large-scale benchmarks, i.e. MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, significantly outperforming the state-of-the-art by a large margin of more than 11.6%.

Problem

Video person re-ID: given a probe video (RGB), rank the videos in the gallery.

Method

The paper proposes a new network, STA, to tackle the video person re-id problem. The contributions it lists are:

1) A weight is assigned to each spatial region, which achieves discriminative part mining and frame selection at the same time.
Compared with the Region-based Quality Estimation Network from AAAI 2018, this improves part-based attention, and it indeed fits better with the part-level features now common in re-id. It might be worth studying how the part features are chosen, e.g. with deformable or local attention, although the feature extraction stage already does part of this work, so a gain is not guaranteed.

2) An inter-frame regularization that restrains the difference between the attention maps of different frames, i.e. it encourages frames to attend to similar regions;

3) A new feature fusion method.

Framework

The overall design is not complicated; the framework figure in the paper gives the method overview.

Method details

Details:

1. Feature extraction:
The stride of the last stage of ResNet-50 is set to 1, which doubles the spatial resolution of the output feature map.
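A minimal sketch of this last-stride trick, assuming torchvision's ResNet-50 (the attribute paths below are torchvision's, not the authors' code):

```python
# Last-stride trick: keep layer4 at stride 1 so the feature map stays larger.
import torchvision.models as models

backbone = models.resnet50(pretrained=True)

# In torchvision's Bottleneck, the downsampling of layer4 happens in the first
# block's 3x3 conv and in its residual (downsample) path; set both to stride 1.
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)

# For a 256x128 input the output feature map is now 16x8 instead of 8x4.
```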

The formulas here are a bit off. The text says to first take the squared norm of the feature vector at each spatial point and then apply L2 normalization over the spatial dimension, but the L1 normalization described afterwards does not match the equation either; the equation only computes the l1 norm of each spatial block.
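For reference, here is my reconstruction of what the text describes, in my own notation (not copied from the paper): $f_n$ is the feature map of frame $n$ and $R_k$ is the $k$-th horizontal region:

$$
g_n(h,w) = \frac{\lVert f_n(h,w,:) \rVert_2^2}{\sqrt{\sum_{h',w'} \lVert f_n(h',w',:) \rVert_2^4}}, \qquad s_{n,k} = \sum_{(h,w) \in R_k} g_n(h,w)
$$

That is, the squared per-point norms are l2-normalized over the spatial grid, and the score of a block is simply their sum (the l1 norm) inside the region.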

Once the attention score of each spatial block is obtained, the scores of the same spatial region are l1-normalized across frames.
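Putting the two normalization steps together, a minimal PyTorch sketch of the 2-D attention matrix (shapes, variable names, and the horizontal region split are my assumptions, not the authors' code):

```python
import torch

N, C, H, W, K = 4, 2048, 16, 8, 4        # frames, channels, height, width, regions
feat = torch.randn(N, C, H, W)           # frame-level feature maps f_n

g = feat.pow(2).sum(dim=1)               # squared l2 norm per spatial point: (N, H, W)
g = g / g.flatten(1).norm(dim=1).view(N, 1, 1)   # l2-normalize over the spatial grid

# split into K horizontal regions and take the l1 norm (sum) inside each block
s = g.view(N, K, H // K, W).sum(dim=(2, 3))      # block scores: (N, K)

# l1-normalize each region's scores across frames -> 2-D attention matrix
a = s / s.sum(dim=0, keepdim=True)               # (N, K)
```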


2. Regularization: the term is applied to the frame-level attention maps; note that it is computed from just two randomly sampled frames of the clip. The formula follows.
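As I read the paper (a reconstruction, not the original image), the term is the Frobenius norm of the difference between the attention maps of two randomly sampled frames $i$ and $j$, added to the loss with a weight $\lambda$:

$$
\mathrm{Reg} = \lVert g_i - g_j \rVert_F = \sqrt{\sum_{h,w} \bigl( g_i(h,w) - g_j(h,w) \bigr)^2}, \qquad L = L_{\mathrm{softmax}} + L_{\mathrm{triplet}} + \lambda\,\mathrm{Reg}
$$

Since Reg is minimized, it pulls the attention maps of different frames toward each other.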

3. Feature fusion

For each spatial region, the block of the frame with the highest score forms the first feature map, and the attention-score-weighted sum of the blocks forms the second; global average pooling and an fc layer then produce the final feature.
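A hedged PyTorch sketch of this fusion step, reusing the assumed shapes from the attention snippet above (details such as the concatenation order are my reading, not reference code):

```python
import torch
import torch.nn as nn

N, C, H, W, K = 4, 2048, 16, 8, 4
feat = torch.randn(N, C, H, W)                        # frame-level feature maps
a = torch.rand(N, K)
a = a / a.sum(dim=0, keepdim=True)                    # stand-in attention matrix

blocks = feat.view(N, C, K, H // K, W)                # per-region blocks

# map 1: for each region, pick the block of the frame with the highest score
idx = a.argmax(dim=0)                                 # best frame per region, (K,)
f1 = torch.stack([blocks[idx[k], :, k] for k in range(K)], dim=1)  # (C, K, h, W)

# map 2: attention-weighted sum of the blocks across frames
f2 = (blocks * a.view(N, 1, K, 1, 1)).sum(dim=0)      # (C, K, h, W)

# concatenate the two maps, global-average-pool, and project with an fc layer
fused = torch.cat([f1, f2], dim=0).view(1, 2 * C, H, W)
pooled = nn.AdaptiveAvgPool2d(1)(fused).flatten(1)    # (1, 2C)
feature = nn.Linear(2 * C, C)(pooled)                 # final clip-level feature
```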

4. Algorithm table

The algorithm table in the paper lays the procedure out more clearly than the equations, though it contains a few small errors.

Results

The ablation study shows that each of the proposed components (STA, Fusion, Reg) improves the results.

