
Yulan Zhao† and Hyo Jong Lee††

Teacher-Student Architecture Based CNN for Action Recognition

Abstract: Convolutional neural networks (CNNs) for action recognition generally use a two-stream architecture with an RGB stream and an optical flow stream. The RGB frame stream captures appearance, while the optical flow stream describes motion. However, computing optical flow is costly and increases the latency of action recognition. The purpose of this study was to evaluate a novel design with two sub-networks: an optical flow sub-network acting as a teacher and an RGB frame sub-network acting as a student. In the training stage, the teacher sub-network extracts optical flow features and transmits them to the student sub-network as a baseline for its training. In the test stage, only the student sub-network is used, which reduces latency because no optical flow needs to be computed. Experimental results show that our network, fed only by the RGB stream, achieves a competitive accuracy of 54.5% on HMDB51, about 1.5 times higher than that of R3D-18.

Keywords: Two-Stream Network, Teacher-Student Architecture, CNN, Optical Flow, Action Recognition

Yulan Zhao†, Hyo Jong Lee††

Teacher-Student Architecture Based CNN for Action Recognition

Abstract: Most state-of-the-art action recognition convolutional networks are based on a two-stream architecture consisting of an RGB stream and an optical flow stream. The RGB frame stream represents appearance characteristics, while the optical flow stream interprets motion characteristics. However, because optical flow is very expensive to compute, it delays action recognition. Inspired by the two-stream network and the teacher-student architecture, we developed a new network design for action recognition. The proposed neural network consists of two sub-networks: an optical flow sub-network acting as a teacher and an RGB frame sub-network acting as a student. In the training stage, optical flow features are extracted to train the teacher sub-network, and the features are then transmitted to the student sub-network as a baseline for training it. In the test stage, only the student network is used, so that latency is reduced by not computing optical flow. Experiments confirmed that the proposed network achieves higher accuracy than the standard two-stream architecture.

Keywords: Two-Stream, Teacher-Student Architecture, CNN, Optical Flow, Action Recognition

1. Introduction

Video action recognition is an important task in computer vision, with applications including automated surveillance, self-driving vehicles, and drone navigation. Convolutional neural networks (CNNs) have become the standard for image classification [1,2]. The two-stream CNN architecture [3-5] has been extremely popular for action recognition; it exploits RGB frames and optical flow as input streams and then combines their features to produce a final result. Many models are based on the two-stream network, such as optical flow guided feature (OFF) [13], hidden two-stream convolutional networks (H-TSCN) [14], and ActionFlowNet [15].

Optical flow captures the action information in a video by computing the displacement of objects between each pair of adjacent frames. In most video datasets, each action lasts from 5 to 30 frames or longer, and as the number and size of frames grow, the computation required to estimate optical flow over all frame pairs grows as well. Consequently, conventional optical flow processing becomes time consuming, increases latency, and limits real-time applications.

The purpose of this study is to design and evaluate a novel CNN for action recognition based on the two-stream network and the teacher-student architecture [16]. Our neural network contains two sub-networks: the optical flow sub-network acts as a teacher and the RGB frame sub-network acts as a student. In the training stage, we train the teacher sub-network on optical flow features and then transmit these features to the student sub-network as a baseline for training it. In the test stage, we use only the student sub-network, which reduces latency by eliminating the optical flow computation.

2. Related Work

CNNs have made significant progress in computer vision tasks such as object classification [2,17] and object detection [18]. With the publication of large video datasets such as UCF101 and HMDB51 [11,12], CNNs have also been applied to action recognition [3,19,20].

2.1 Two-Stream Network

The traditional two-stream network [3] proposed by Simonyan et al. is a 2D CNN model with RGB frame and optical flow sub-networks. It takes video clips as input and decomposes each clip into RGB frames. The stream of RGB frames serves as the spatial component, while the optical flow stream, computed from adjacent RGB frames, serves as the temporal component. The spatial part carries appearance information about the objects, while the temporal part carries movement information. Each stream is implemented by a sub-network, and the results are combined by late fusion. In further research on two-stream networks, Feichtenhofer et al. improved the fusion of the two streams [4]. They initially focused on 2D CNNs but transitioned to 3D CNNs for better spatiotemporal features. Similarly, Diba et al. used C3D to learn motion from optical flow in an end-to-end fashion [5].
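
As an illustration of the late-fusion idea described above, the following sketch averages the class probabilities produced by two independent streams; the module names and the equal averaging weights are assumptions for illustration, not the exact configuration of [3].

    import torch
    import torch.nn as nn

    class TwoStreamLateFusion(nn.Module):
        """Minimal late-fusion sketch: each stream scores the clip independently
        and the class probabilities are averaged (illustrative only)."""
        def __init__(self, spatial_net: nn.Module, temporal_net: nn.Module):
            super().__init__()
            self.spatial_net = spatial_net      # fed with RGB frames
            self.temporal_net = temporal_net    # fed with stacked optical flow

        def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
            p_spatial = self.spatial_net(rgb).softmax(dim=1)
            p_temporal = self.temporal_net(flow).softmax(dim=1)
            return (p_spatial + p_temporal) / 2   # late fusion by averaging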

2.2 Teacher-Student Architecture

There are various designs of teacher-student networks [6-9]. The teacher-student architecture consists of two parallel CNNs, a large model and a small one [6]. Typically, the large model has more nodes and parameters than the small one and usually achieves better results, while the small model processes data quickly. In real-world applications, the large model consumes considerable resources and time, whereas the small network needs fewer resources but on its own only fits limited data. Distillation transfers the knowledge learned by the large model to the small one, allowing the small model to solve the problem quickly and accurately. In this transfer of knowledge, the large model acts as the teacher and the small model as the student. In recent research, Kong et al. proposed learning student networks to tackle the challenge of training students with few data [7], and Bashivan et al. designed a teacher-guided architecture search to gain computational efficiency [8].
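
A minimal sketch of the distillation loss of Hinton et al. [6] is shown below; the temperature T and the mixing weight alpha are illustrative values, not parameters used in this paper, whose own transfer mechanism is the feature-level loss of Section 3.2.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          T: float = 4.0, alpha: float = 0.5) -> torch.Tensor:
        # Soft-target term: KL divergence between temperature-softened
        # teacher and student distributions (scaled by T^2 as in [6]).
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        # Hard-target term: ordinary cross entropy with the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard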

3. Proposed Method

Fig. 1. Architecture of Action Recognition

Fig. 2. M1: The Module of Optical Flow Extraction

Our model is based on the two-stream network and the teacher-student architecture [16], as shown in Fig. 1. We use video clips as the input stream and extract RGB frames to feed the teacher-student network. There are two sub-networks in our architecture: the teacher sub-network for the optical flow branch and the student sub-network for the RGB frame branch. Processing is divided into two stages. In the training stage, the optical flow stream is computed in the teacher sub-network to capture important motion information. This information is used to train the teacher for action recognition, after which its weights are frozen. The knowledge of optical flow is then used to train the student sub-network. In the test stage, only the student sub-network with the RGB frame stream is used, which avoids the optical flow computation and saves resources and time.
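
A minimal sketch of the test-stage path is given below: only the student branch runs, so no optical flow is computed at inference time. The name student stands for module M3 plus its classifier head; it is an assumed placeholder, not an identifier from the authors' code.

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def recognize_action(student: nn.Module, rgb_clip: torch.Tensor) -> torch.Tensor:
        """Test stage: classify a clip from RGB frames only (no optical flow)."""
        student.eval()
        logits = student(rgb_clip)       # (batch, num_classes)
        return logits.argmax(dim=1)      # predicted action class per clip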

3.1 Teacher Sub-network

The teacher sub-network contains two modules: the optical flow extraction module, named M1, and the action recognition module, M2. Module M1 calculates optical flow from the sequential RGB frames, and module M2 extracts features from the optical flow.

In module M1, we do not employ complex networks [21,22] to compute optical flow. Instead, the extraction is based on the brightness constancy assumption between sequential images $I_1$ and $I_2$ under a small movement of the object. The optical flow approximation is formulated as shown in Equation (1).

(1)
$$I_{2}(x, y)=I_{1}(x+\Delta x, y+\Delta y)$$

$I_{1}(x, y)$ denotes the object at location $(x, y)$ in the image at time $t$, $I_{2}(x, y)$ is the object at that location in the image after time $\Delta t$, and $(\Delta x, \Delta y)$ are the object's pixel displacements along the $x$ and $y$ axes, respectively. Equation (1) can be approximated with a Taylor series as shown in Equation (2).

(2)
$$I_{2}=I_{1}+\frac{\partial I}{\partial x} \Delta x+\frac{\partial I}{\partial y} \Delta y$$

Zach et al. proposed the TV-L1 method [10,11] to calculate the optical flow approximately. This total variation method estimates the optical flow by iterative optimization. The tensor $u \in R^{2 \times W \times H}$ holds the $x$- and $y$-direction optical flow for each location in the image. The method first computes the gradient of $I_2$ in both the $x$ and $y$ directions, $\nabla I_{2}$, and initializes the optical flow to $u=0$. With $\rho$ denoting the image residual between $I_{1}$ and $I_{2}$, the iterative optimization updates $u$, $v$, and $p$ as shown in Equations (3)-(6).

(3)
$$u=v+\theta \cdot \operatorname{div}(p)$$

(4)
$$v=u+ \begin{cases}\lambda \theta \nabla I_{2} & \rho<-\lambda \theta\left|\nabla I_{2}\right|^{2} \\ -\lambda \theta \nabla I_{2} & \rho>\lambda \theta\left|\nabla I_{2}\right|^{2} \\ -\rho \frac{\nabla I_{2}}{\left|\nabla I_{2}\right|^{2}} & |\rho| \leq \lambda \theta\left|\nabla I_{2}\right|^{2}\end{cases}$$

(5)
$$p=\frac{p+\frac{\tau}{\theta} \nabla u}{1+\frac{\tau}{\theta}|\nabla u|}$$

(6)
$$\operatorname{div}(p)_{i,j}=\begin{cases} p_{i, j}^{1}-p_{i-1, j}^{1} & \text { if } 1<i<N \\ p_{i, j}^{1} & \text { if } i=1 \\ -p_{i-1, j}^{1} & \text { if } i=N \end{cases} \;+\; \begin{cases}p_{i, j}^{2}-p_{i, j-1}^{2} & \text { if } 1<j<M \\ p_{i, j}^{2} & \text { if } j=1 \\ -p_{i, j-1}^{2} & \text { if } j=M\end{cases}$$

where the hyper-parameter $\theta$ is the weight of the TV-L1 regularization term, $\lambda$ is the weight of the data term, and $\tau$ is the time step with $\tau \leq \frac{1}{8}$. The dual vector field $p$ is used to minimize the energy.

We computed the optical flow using the formulas above via residual networks, as shown in Fig. 2.
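
The following single-scale sketch follows Equations (3)-(6) literally; it omits the image pyramid and the warping step of a full TV-L1 solver [10], so it is only meant to make the update rules concrete. Parameter values are illustrative assumptions.

    import numpy as np

    def tvl1_flow(I1, I2, n_iters=50, lam=0.15, theta=0.3, tau=0.125):
        """Simplified TV-L1 iteration (Equations (3)-(6)); I1, I2: float arrays (H, W).
        Returns u of shape (2, H, W): optical flow along x and y."""
        H, W = I1.shape
        # Forward-difference gradient of I2: Ix along x (columns), Iy along y (rows).
        Ix = np.zeros_like(I2); Ix[:, :-1] = I2[:, 1:] - I2[:, :-1]
        Iy = np.zeros_like(I2); Iy[:-1, :] = I2[1:, :] - I2[:-1, :]
        grad_sq = Ix ** 2 + Iy ** 2 + 1e-12          # |grad I2|^2

        u = np.zeros((2, H, W))                      # flow, initialised to zero
        p = np.zeros((2, 2, H, W))                   # dual fields for u_x and u_y

        def div(pc):
            # Discrete divergence of one dual field pc = (p^1, p^2), Equation (6).
            d = np.zeros((H, W))
            d[:, 0] += pc[0][:, 0]; d[:, 1:] += pc[0][:, 1:] - pc[0][:, :-1]
            d[0, :] += pc[1][0, :]; d[1:, :] += pc[1][1:, :] - pc[1][:-1, :]
            return d

        for _ in range(n_iters):
            # Linearised image residual rho for the current flow (no warping here).
            rho = I2 - I1 + Ix * u[0] + Iy * u[1]
            # Thresholding step, Equation (4).
            v = u.copy()
            low = rho < -lam * theta * grad_sq
            high = rho > lam * theta * grad_sq
            mid = ~(low | high)
            for k, Ik in enumerate((Ix, Iy)):
                v[k][low] += lam * theta * Ik[low]
                v[k][high] -= lam * theta * Ik[high]
                v[k][mid] -= (rho * Ik / grad_sq)[mid]
            # Primal update, Equation (3), and dual update, Equation (5).
            for k in range(2):
                u[k] = v[k] + theta * div(p[k])
                gx = np.zeros((H, W)); gx[:, :-1] = u[k][:, 1:] - u[k][:, :-1]
                gy = np.zeros((H, W)); gy[:-1, :] = u[k][1:, :] - u[k][:-1, :]
                scale = 1.0 + (tau / theta) * np.sqrt(gx ** 2 + gy ** 2)
                p[k][0] = (p[k][0] + (tau / theta) * gx) / scale
                p[k][1] = (p[k][1] + (tau / theta) * gy) / scale
        return u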

Module M2 extracts the features of the optical flow, denoted $F_{\text{of}}$ in Fig. 3. We then trained the teacher sub-network to classify actions from the optical flow stream with a cross-entropy loss between the predicted class labels $P_{\text{of}}$ and the true class labels $T$. Finally, $F_{\text{of}}$ is transmitted to the student sub-network to train it by back-propagation.
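
A sketch of one training step of the teacher branch is shown below; m2 and head stand for module M2 and a classification layer, and the names are assumptions that mirror Fig. 3 rather than identifiers from the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def teacher_step(m2: nn.Module, head: nn.Module, flow: torch.Tensor,
                     labels: torch.Tensor, optimizer: torch.optim.Optimizer):
        """One cross-entropy training step of the optical-flow (teacher) branch."""
        f_of = m2(flow)                               # optical-flow features F_of
        loss = F.cross_entropy(head(f_of), labels)    # P_of vs. true labels T
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return f_of.detach(), loss.item()             # F_of is later handed to the student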

Fig. 3. M2: The Module of Optical Flow Feature Extraction

Fig. 4. The Module of RGB Frames Action Recognition

3.2 Student Sub-network

Module M3 in the student sub-network extracts the RGB frame features $F_{\text{rgb}}$ and receives the optical flow features $F_{\text{of}}$ from the teacher sub-network, as shown in Fig. 4.

We use a mean squared error (MSE) loss between $F_{\text{rgb}}$ and $F_{\text{of}}$ to back-propagate through the network, so that the RGB stream features learn to mimic the optical flow stream features and thereby train the early part of the student sub-network. We then combine a cross-entropy loss between the predicted class $P_{\text{rgb}}$ and the true class $T$ with the MSE term to train the student sub-network as a whole, as shown in Equation (7).

(7)
$$L_{RGB}=\operatorname{CrossEntropy}\left(P_{rgb}, T\right)+\lambda\left\|F_{rgb}-F_{of}\right\|^{2}$$

where $\lambda$ is a scalar weight controlling the influence of the motion feature.
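
Equation (7) translates directly into code as follows; the default value of lam is an assumption for illustration, since the paper does not report the value of $\lambda$ used in the experiments.

    import torch
    import torch.nn.functional as F

    def student_loss(p_rgb_logits: torch.Tensor, labels: torch.Tensor,
                     f_rgb: torch.Tensor, f_of: torch.Tensor,
                     lam: float = 1.0) -> torch.Tensor:
        """L_RGB = CrossEntropy(P_rgb, T) + lam * ||F_rgb - F_of||^2, Equation (7)."""
        ce = F.cross_entropy(p_rgb_logits, labels)
        mse = F.mse_loss(f_rgb, f_of.detach())   # teacher features receive no gradient
        return ce + lam * mse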

4. Experiment

We focused on a popular dataset for action recognition, HMDB51 [11]. It consists of 51 action classes with more than 6,800 videos and three splits for training and testing; each split contains 3,570 training clips and 1,530 test clips. In our experiment, we randomly extracted 25 RGB frames from each video clip as the input stream and cropped each sample to 224 $\times$ 224.
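
The frame sampling just described can be sketched as follows; decode_frames is a hypothetical helper that returns the decoded video as a uint8 tensor of shape (num_frames, H, W, 3), and the output layout assumes a 3D CNN backbone.

    import random
    import torch

    def sample_clip(video_path: str, num_frames: int = 25, crop: int = 224) -> torch.Tensor:
        """Randomly pick 25 RGB frames from a video and take a 224x224 crop."""
        frames = decode_frames(video_path)                 # hypothetical helper: (T, H, W, 3) uint8
        idx = sorted(random.sample(range(frames.shape[0]), num_frames))
        clip = frames[idx].float() / 255.0                 # (25, H, W, 3) in [0, 1]
        _, H, W, _ = clip.shape
        top = random.randint(0, H - crop)                  # random spatial crop
        left = random.randint(0, W - crop)
        clip = clip[:, top:top + crop, left:left + crop, :]
        return clip.permute(3, 0, 1, 2)                    # (3, 25, 224, 224)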

In the action recognition modules, we used SGD optimization with a weight decay of 0.0005, a momentum of 0.9, and an initial learning rate of 0.1. The accuracy of our experiment on HMDB51 was 54.5%. Table 1 lists the accuracies of R3D-18 with a single RGB stream, R3D-18 with a TV-L1 optical flow stream, the two-stream network, and our model.
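
The optimizer settings quoted above correspond to the following configuration; model is a placeholder for whichever sub-network is being trained.

    import torch

    def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
        # SGD with weight decay 0.0005, momentum 0.9, and initial learning rate 0.1.
        return torch.optim.SGD(model.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=0.0005)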

Optical flow frames are shown in Fig. 5. The three columns on the left are RGB frames extracted from video clips, and the two images on the right are the optical flow frames $u$ extracted from the two pairs of adjacent RGB frames.

In the first row, the action class is ‘pour’. There is significant movement in the frames, with an obvious change in the optical flow trajectory in frames (d) and (e). In the second row, the action class is ‘cartwheel’, with the objects being a girl and a cartwheel; the change of movement information can be seen in (d) and (e). In the third row, the class is ‘shake_hands’; the main motion is two people shaking hands, and there is no obvious movement change between frames (b) and (c) in the optical flow image (e). In the fourth row, the main motion is a man hitting a golf ball, with obvious movement changes visible in (d) and (e). In the fifth row, the class is ‘handstand’; the movements and posture of the active man change significantly. In the sixth row, the class is ‘shoot_ball’; the changes in the boy's shooting action and the coach's posture can be seen in (d) and (e).

Table 1. Experimental Results

CNNs              Accuracy (%)
R3D-18 (RGB)      34.5
R3D-18 (TV-L1)    36.5
Two-stream        46.6
Ours              54.5

Fig. 5. Examples of Optical Flow: for every row, image (d) is the optical flow between frames (a) and (b), while image (e) is the optical flow between frames (b) and (c).

5. Conclusion

We introduced an architecture for action recognition based on a teacher-student neural network. The student model takes only video clips as input but is able to extract both appearance and motion information. The model works by training a sub-network to minimize the loss between its features and the features of the optical flow stream, combined with a cross-entropy loss for action recognition. Our architecture achieved higher accuracy on the HMDB51 benchmark than other popular methods. The proposed method remains sensitive to lighting changes and camera motion between frames. In future studies, we will address these weaknesses, extend the evaluation to other video datasets such as UCF101 [12], and improve the network architecture to obtain better performance.

Biography

Yulan Zhao

https://orcid.org/0000-0002-9469-5119

e-mail : zhaoyulan27@naver.com

She received an M.S. degree in computer science from Northeast Electric Power Univ. in 2009. She is currently pursuing a Ph.D. degree in the Department of Computer Science and Engineering at Jeonbuk National Univ. Her research interests are computer vision, image processing, artificial intelligence, and action recognition.

Biography

Hyo Jong Lee

https://orcid.org/0000-0003-2581-5268

e-mail : hlee@jbnu.ac.kr

He received a Ph.D. degree in computer science from the University of Utah in 1991. He has been a professor at Jeonbuk National Univ. since 1991. His research interests include computer graphics, image processing, parallel processing, and artificial intelligence.

References

  • 1 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the Computer Vision and Pattern Recognition, 2015, pp. 1-9. doi: 10.1109/cvpr.2015.7298594
  • 2 J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, "CNN-RNN: A unified framework for multi-label image classification," in Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 2285-2294. doi: 10.1109/cvpr.2016.251
  • 3 K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the Neural Information Processing Systems, 2014, pp. 568-576. https://papers.nips.cc/paper/2014/hash/00ec53c4682d36f5c4359f4ae7bd7ba1-Abstract.html
  • 4 C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 1933-1941. doi: 10.1109/CVPR.2016.213
  • 5 A. Diba, A. Pazandeh, and L. V. Gool, "Efficient two-stream motion and appearance 3D CNNs for video classification," arXiv preprint arXiv:1608.08851, 2016. doi: 10.48550/arXiv.1608.08851
  • 6 G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning Workshop, 2014. doi: 10.48550/arXiv.1503.02531
  • 7 S. Kong, T. Guo, S. You, and C. Xu, "Learning student networks with few data," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 4, pp. 4469-4476, 2020. doi: 10.1609/aaai.v34i04.5874
  • 8 P. Bashivan, M. Tensen, and J. J. DiCarlo, "Teacher guided architecture search," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5320-5329. doi: 10.1109/iccv.2019.00542
  • 9 D. Shah, V. Trivedi, V. Sheth, A. Shah, and U. Chauhan, Information Processing in Agriculture. doi: 10.1016/j.inpa.2021.06.001
  • 10 C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in DAGM 2007: Pattern Recognition, vol. 4713, pp. 214-223, 2007. doi: 10.1007/978-3-540-74936-3_22
  • 11 H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2556-2563. doi: 10.1109/ICCV.2011.6126543
  • 12 K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012. doi: 10.48550/arXiv.1212.0402
  • 13 S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, "Optical flow guided feature: A fast and robust motion representation for video action recognition," in Proceedings of the Computer Vision and Pattern Recognition, 2018, pp. 1-9. doi: 10.48550/arXiv.1711.11152
  • 14 Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann, "Hidden two-stream convolutional networks for action recognition," arXiv preprint arXiv:1704.00389, 2017. doi: 10.1007/978-3-030-20893-6_23
  • 15 J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis, "ActionFlowNet: Learning motion representation for action recognition," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1616-1624. doi: 10.1109/WACV.2018.00179
  • 16 Y. Zhao and H. Lee, "FTSnet: A simple convolutional neural networks for action recognition," in Proceedings of the Annual Conference of KIPS (ACK), 2021, pp. 878-879. doi: 10.3745/PKIPS.y2021m11a.878
  • 17 K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the Computer Vision and Pattern Recognition, 2016, pp. 770-778. doi: 10.1109/cvpr.2016.90
  • 18 S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim, "Robust facial landmark detection via recurrent attentive-refinement networks," in Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 57-72. doi: 10.1007/978-3-319-46448-0_4
  • 19 Z. Wang, Q. She, and A. Smolic, "ACTION-Net: Multipath excitation for action recognition," in Proceedings of the Computer Vision and Pattern Recognition, 2021, pp. 13214-13223. doi: 10.1109/cvpr46437.2021.01301
  • 20 L. Wang, Z. Tong, B. Ji, and G. Wu, "TDN: Temporal difference networks for efficient action recognition," in Proceedings of the Computer Vision and Pattern Recognition, 2021, pp. 1895-1904. https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_TDN_Temporal_Difference_Networks_for_Efficient_Action_Recognition_CVPR_2021_paper.pdf
  • 21 T. Hui, X. Tang, and C. C. Loy, "A lightweight optical flow CNN - Revisiting data fidelity and regularization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 8, pp. 2555-2569, 2021. doi: 10.48550/arXiv.1903.07414
  • 22 K. Luo, C. Wang, S. Liu, H. Fan, J. Wang, and J. Sun, "UPFlow: Upsampling pyramid for unsupervised optical flow learning," in Proceedings of the Computer Vision and Pattern Recognition, 2021, pp. 1045-1054. doi: 10.1109/cvpr46437.2021.00110

