STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification

AAAI 2019 paper

Abstract

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Different from most existing methods, which simply compute representations of video clips using frame-level aggregation (e.g. average pooling), the proposed STA adopts a more effective way of producing robust clip-level feature representations. Concretely, STA fully exploits the discriminative parts of the target person in both the spatial and temporal dimensions, producing a 2-D attention score matrix via inter-frame regularization that measures the importance of spatial parts across different frames. A more robust clip-level feature representation can then be generated by a weighted-sum operation guided by the mined 2-D attention score matrix. In this way, challenging cases for video-based person re-identification such as pose variation and partial occlusion are well handled by STA. We conduct extensive experiments on two large-scale benchmarks, i.e. MARS and DukeMTMC-VideoReID. In particular, the mAP reaches 87.7% on MARS, significantly outperforming the state-of-the-art by a large margin of more than 11.6%.

Problem

Video person re-ID: given a probe video (RGB), rank the videos in the gallery.

Method

The paper proposes a new network, STA, to tackle the video person re-id problem. The contributions it lists are:

1) A weight is assigned to each spatial region, which achieves discriminative part mining and frame selection at the same time.
Compared with the Region-based Quality Estimation Network from AAAI 2018, this improves part-based attention, and it indeed fits better with the part-level features now common in re-id. It might be worth studying how the part features are chosen, e.g. with deformable or local attention, although the feature extraction stage already does part of this work, so a gain is not guaranteed.

2) An inter-frame regularization that restrains the difference between the attention maps of different frames, i.e. it encourages frames to attend to similar regions;

3) A new feature fusion method.

Framework

The overall design is not complicated; the framework figure in the paper gives the method overview.

Method details

Details:

1. Feature extraction:
The stride of the last stage of ResNet-50 is set to 1, which doubles the spatial resolution of the output feature map.
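A minimal sketch of this last-stride trick, assuming torchvision's ResNet-50 (the attribute paths below are torchvision's, not the authors' code):

```python
# Last-stride trick: keep layer4 at stride 1 so the feature map stays larger.
import torchvision.models as models

backbone = models.resnet50(pretrained=True)

# In torchvision's Bottleneck, the downsampling of layer4 happens in the first
# block's 3x3 conv and in its residual (downsample) path; set both to stride 1.
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)

# For a 256x128 input the output feature map is now 16x8 instead of 8x4.
```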

The formulas here are a bit off. The text says to first take the squared norm of the feature vector at each spatial point and then apply L2 normalization over the spatial dimension, but the L1 normalization described afterwards does not match the equation either; the equation only computes the l1 norm of each spatial block.
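For reference, here is my reconstruction of what the text describes, in my own notation (not copied from the paper): $f_n$ is the feature map of frame $n$ and $R_k$ is the $k$-th horizontal region:

$$
g_n(h,w) = \frac{\lVert f_n(h,w,:) \rVert_2^2}{\sqrt{\sum_{h',w'} \lVert f_n(h',w',:) \rVert_2^4}}, \qquad s_{n,k} = \sum_{(h,w) \in R_k} g_n(h,w)
$$

That is, the squared per-point norms are l2-normalized over the spatial grid, and the score of a block is simply their sum (the l1 norm) inside the region.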

Once the attention score of each spatial block is obtained, the scores of the same spatial region are l1-normalized across frames.
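Putting the two normalization steps together, a minimal PyTorch sketch of the 2-D attention matrix (shapes, variable names, and the horizontal region split are my assumptions, not the authors' code):

```python
import torch

N, C, H, W, K = 4, 2048, 16, 8, 4        # frames, channels, height, width, regions
feat = torch.randn(N, C, H, W)           # frame-level feature maps f_n

g = feat.pow(2).sum(dim=1)               # squared l2 norm per spatial point: (N, H, W)
g = g / g.flatten(1).norm(dim=1).view(N, 1, 1)   # l2-normalize over the spatial grid

# split into K horizontal regions and take the l1 norm (sum) inside each block
s = g.view(N, K, H // K, W).sum(dim=(2, 3))      # block scores: (N, K)

# l1-normalize each region's scores across frames -> 2-D attention matrix
a = s / s.sum(dim=0, keepdim=True)               # (N, K)
```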


2. Regularization: the term is applied to the frame-level attention maps; note that it is computed from just two randomly sampled frames of the clip. The formula follows.
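As I read the paper (a reconstruction, not the original image), the term is the Frobenius norm of the difference between the attention maps of two randomly sampled frames $i$ and $j$, added to the loss with a weight $\lambda$:

$$
\mathrm{Reg} = \lVert g_i - g_j \rVert_F = \sqrt{\sum_{h,w} \bigl( g_i(h,w) - g_j(h,w) \bigr)^2}, \qquad L = L_{\mathrm{softmax}} + L_{\mathrm{triplet}} + \lambda\,\mathrm{Reg}
$$

Since Reg is minimized, it pulls the attention maps of different frames toward each other.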

3. Feature fusion

For each spatial region, the block of the frame with the highest score forms the first feature map, and the attention-score-weighted sum of the blocks forms the second; global average pooling and an fc layer then produce the final feature.
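A hedged PyTorch sketch of this fusion step, reusing the assumed shapes from the attention snippet above (details such as the concatenation order are my reading, not reference code):

```python
import torch
import torch.nn as nn

N, C, H, W, K = 4, 2048, 16, 8, 4
feat = torch.randn(N, C, H, W)                        # frame-level feature maps
a = torch.rand(N, K)
a = a / a.sum(dim=0, keepdim=True)                    # stand-in attention matrix

blocks = feat.view(N, C, K, H // K, W)                # per-region blocks

# map 1: for each region, pick the block of the frame with the highest score
idx = a.argmax(dim=0)                                 # best frame per region, (K,)
f1 = torch.stack([blocks[idx[k], :, k] for k in range(K)], dim=1)  # (C, K, h, W)

# map 2: attention-weighted sum of the blocks across frames
f2 = (blocks * a.view(N, 1, K, 1, 1)).sum(dim=0)      # (C, K, h, W)

# concatenate the two maps, global-average-pool, and project with an fc layer
fused = torch.cat([f1, f2], dim=0).view(1, 2 * C, H, W)
pooled = nn.AdaptiveAvgPool2d(1)(fused).flatten(1)    # (1, 2C)
feature = nn.Linear(2 * C, C)(pooled)                 # final clip-level feature
```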

4. Algorithm table

The algorithm table in the paper lays the procedure out more clearly than the equations, though it contains a few small errors.

Results

The ablation study shows that each of the proposed components (STA, Fusion, Reg) improves the results.

