Teacher-Student Architecture Based CNN for Action Recognition

Yulan Zhao† and Hyo Jong Lee††

Abstract: Convolutional neural networks (CNNs) for action recognition generally use a two-stream architecture with an RGB stream and an optical flow stream. The RGB frame stream captures appearance, while the optical flow stream interprets motion. However, the standard use of optical flow is computationally costly and increases the latency of action recognition. The purpose of this study was to evaluate a novel design that places two sub-networks in one neural network: the optical flow sub-network is assigned as a teacher and the RGB frame sub-network as a student. In the training stage, the teacher sub-network extracts optical flow features and transmits them to the student sub-network as a baseline for its training. In the test stage, only the student sub-network operates, reducing latency because no optical flow is computed. Experimental results show that our network, fed only by the RGB stream, achieves a competitive accuracy of 54.5% on HMDB51, which is 1.5 times higher than that of R3D-18.

Keywords: Two-Stream Network, Teacher-Student Architecture, CNN, Optical Flow, Action Recognition

1. Introduction

Video action recognition is an important task in computer vision, with applications including automated surveillance, self-driving vehicles, and drone navigation. Convolutional neural networks (CNNs) have become the standard for image classification [1,2]. Two-stream CNN architectures [3-5] have been extremely popular for action recognition, exploiting RGB frames and optical flow as input streams and then combining their features to produce a final result. Many models are based on the two-stream network, such as optical flow guided feature (OFF) [13], hidden two-stream convolutional networks (H-TSCN) [14], and ActionFlowNet [15].

The optical flow computation for the motion information in video is realized by calculating the displacement of objects between each pair of adjacent frames. In most video datasets, each action lasts from 5 to 30 frames or longer. As the number of frames increases, the computational cost of optical flow grows accordingly. Consequently, conventional optical flow computation becomes time consuming, increasing latency and limiting real-time application.

The purpose of this study is to design and evaluate a novel CNN for action recognition based on a two-stream network with a teacher-student architecture [16]. There are two sub-networks in our neural network: the optical flow sub-network as a teacher and the RGB frame sub-network as a student. In the training stage, we trained the teacher sub-network on optical flow features and then transmitted those features to the student sub-network as a baseline for its training.
In the test stage, we used only the student sub-network, reducing latency by eliminating the optical flow computation.

2. Related Work

There has been significant progress with CNNs in computer vision tasks such as object classification [2,17] and object detection [18]. With the publication of large video datasets such as UCF101 and HMDB51 [11,12], CNNs have also been applied to action recognition [3,19,20].

2.1 Two-Stream Network

The traditional two-stream network [3] proposed by Simonyan et al. is a 2D CNN model with RGB frame and optical flow sub-networks. It takes video clips as the input stream and decomposes each clip into RGB frames. A stream of RGB frames then serves as the spatial component, while a stream of optical flow computed from adjacent RGB frames serves as the temporal component. The spatial part carries appearance information about the objects, while the temporal part carries the movement information. Each stream is implemented by a sub-network, and the results are combined by late fusion. In further research on two-stream networks, Feichtenhofer et al. improved the fusion of the two streams [4]. They initially focused on 2D CNNs but transitioned to 3D CNNs for improved spatiotemporal features. Similarly, Diba et al. used C3D to learn motion from optical flow in an end-to-end manner [5].

2.2 Teacher-Student Architecture

There are various designs of teacher-student networks [6-9]. The teacher-student architecture consists of two parallel CNNs, a large model and a small one [6]. Traditionally, the large model has more nodes and parameters than the small one and often produces better results, but in real-world applications it consumes substantial resources and time. The small model needs fewer resources and delivers results quickly, but by itself it fits only small datasets. Distillation can transfer the knowledge learned by the large model to the small one, so that the small model can approach the problem with fast and accurate results. In this knowledge transfer, the large model acts as a teacher and the small model as a student. In recent research, Kong et al. tackled the challenge of learning student networks with few data [7], and Bashivan et al. designed a teacher-guided architecture search to gain more computational efficiency [8].

3. Proposed Method

Our model is based on the two-stream network and the teacher-student architecture [16], as shown in Fig. 1. We used video clips as the input stream and extracted RGB frames to feed the teacher-student network. There are two sub-networks in our architecture: the teacher sub-network for the optical flow branch and the student sub-network for the RGB frame branch. We divided the processing of the network into two stages. In the training stage, we calculated the optical flow stream in the teacher sub-network to obtain important motion information, used that information to train the teacher for action recognition, and then froze its finalized weights. We then used the knowledge of optical flow to train the student sub-network. In the test stage, we used only the student sub-network with the RGB frame stream, avoiding the optical flow computation to save resources and time.
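The two-stage procedure can be summarized in code. The following PyTorch sketch is purely illustrative: the SubNetwork backbone, its layer sizes, and flow_loader are hypothetical stand-ins, since the paper does not specify the sub-networks at this level of detail; the teacher consumes two-channel flow fields and the student three-channel RGB clips.

```python
import torch
import torch.nn as nn

class SubNetwork(nn.Module):
    # Hypothetical 3D-CNN backbone standing in for the teacher or
    # student branch; the layer sizes here are illustrative only.
    def __init__(self, in_channels, num_classes=51):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)   # intermediate feature (F_of or F_rgb)
        return f, self.classifier(f)      # feature and class prediction

teacher = SubNetwork(in_channels=2)       # optical flow input: (u_x, u_y)
student = SubNetwork(in_channels=3)       # RGB frame input

# Training stage, step 1: train the teacher on optical flow with cross
# entropy, then freeze its finalized weights.
ce = nn.CrossEntropyLoss()
opt_t = torch.optim.SGD(teacher.parameters(), lr=0.1,
                        momentum=0.9, weight_decay=5e-4)
flow_loader = []  # placeholder: iterable of (flow_clip, label) batches
for flow, labels in flow_loader:
    _, p_of = teacher(flow)
    opt_t.zero_grad()
    ce(p_of, labels).backward()
    opt_t.step()
for w in teacher.parameters():
    w.requires_grad = False               # teacher is frozen for step 2
```

The student's own training step, which consumes the frozen teacher's features, is sketched after Equation (7) in Section 3.2.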
3.1 Teacher Sub-network

There are two modules in the teacher sub-network: the optical flow extraction module, named M1, and the action recognition module, M2. Module M1 calculates optical flow from the sequential RGB frames, and module M2 extracts the features of the optical flow.

In module M1, we do not employ complex networks [21,22] to compute optical flow. Instead, its extraction is based on the brightness constancy assumption between sequential images [TeX:] $$I_{1}$$ and [TeX:] $$I_{2}$$ under small motion of the object. The optical flow approximation is formulated as shown in Equation (1).

(1)[TeX:] $$I_{1}(x, y)=I_{2}(x+\Delta x, y+\Delta y)$$
[TeX:] $$I_{1}(x, y)$$ denotes the object at location [TeX:] $$(x, y)$$ in the image at time [TeX:] $$t$$. [TeX:] $$I_{2}(x, y)$$ is the object at that location in the image after time [TeX:] $$\Delta t$$, and [TeX:] $$(\Delta x, \Delta y)$$ is the object's spatial pixel displacement along the [TeX:] $$x$$ and [TeX:] $$y$$ axes, respectively. Equation (1) can be approximated with a first-order Taylor series as shown in Equation (2).

(2)[TeX:] $$I_{2}(x+\Delta x, y+\Delta y) \approx I_{2}(x, y)+\frac{\partial I_{2}}{\partial x} \Delta x+\frac{\partial I_{2}}{\partial y} \Delta y$$
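As a concrete illustration of this linearization, the short NumPy sketch below evaluates the first-order brightness-constancy residual for a candidate flow. The function name and the toy pattern are our own, chosen only to show that a correct displacement makes the residual small.

```python
import numpy as np

def linearized_residual(I1, I2, u, v):
    # rho = I2 + u * dI2/dx + v * dI2/dy - I1, the first-order Taylor
    # approximation of the brightness constancy Equation (1).
    Iy, Ix = np.gradient(I2)   # gradients along rows (y) and columns (x)
    return I2 + u * Ix + v * Iy - I1

# Toy check: a smooth pattern shifted right by one pixel is explained far
# better by a flow of u = 1 than by u = 0 (much smaller mean residual).
x = np.linspace(0.0, 2.0 * np.pi, 64)
I1 = np.tile(np.sin(x), (64, 1))
I2 = np.tile(np.sin(x - x[1]), (64, 1))          # shifted by one pixel step
u1, u0 = np.ones_like(I1), np.zeros_like(I1)
print(np.abs(linearized_residual(I1, I2, u0, 0)).mean())   # larger
print(np.abs(linearized_residual(I1, I2, u1, 0)).mean())   # much smaller
```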
Zach et al. proposed the [TeX:] $$\text{TV-L}^{1}$$ method [10,11] to calculate the optical flow approximately. This total variational method estimates the optical flow by iterative optimization. The tensor [TeX:] $$u \in R^{2 \times W \times H}$$ holds the [TeX:] $$x$$- and [TeX:] $$y$$-directional optical flow for each location in the image. The method first computes the gradient of the second image in both [TeX:] $$x$$ and [TeX:] $$y$$ directions, [TeX:] $$\nabla I_{2}$$, and the initial optical flow is set to [TeX:] $$u=0$$. With [TeX:] $$\rho$$ denoting the image residual between [TeX:] $$I_{1}$$ and [TeX:] $$I_{2}$$, the iterative optimization updates [TeX:] $$u$$, [TeX:] $$v$$, and [TeX:] $$p$$ as shown in Equations (3)-(6).

(3)[TeX:] $$u=v+\theta \operatorname{div}(p)$$
(4)[TeX:] $$v=u+ \begin{cases}\lambda \theta \nabla I_{2} & \text { if } \rho<-\lambda \theta\left|\nabla I_{2}\right|^{2} \\ -\lambda \theta \nabla I_{2} & \text { if } \rho>\lambda \theta\left|\nabla I_{2}\right|^{2} \\ -\rho \frac{\nabla I_{2}}{\left|\nabla I_{2}\right|^{2}} & \text { if }|\rho| \leq \lambda \theta\left|\nabla I_{2}\right|^{2}\end{cases}$$
(5)[TeX:] $$p=\frac{p+\frac{\tau}{\theta} \nabla u}{1+\frac{\tau}{\theta}|\nabla u|}$$

(6)[TeX:] $$(\operatorname{div}(p))_{i, j}= \begin{cases}p_{i, j}^{1}-p_{i-1, j}^{1} & \text { if } 1<i<N \\ p_{i, j}^{1} & \text { if } i=1 \\ -p_{i-1, j}^{1} & \text { if } i=N\end{cases}+ \begin{cases}p_{i, j}^{2}-p_{i, j-1}^{2} & \text { if } 1<j<M \\ p_{i, j}^{2} & \text { if } j=1 \\ -p_{i, j-1}^{2} & \text { if } j=M\end{cases}$$

where the hyper-parameter [TeX:] $$\theta$$ is the weight of the [TeX:] $$\text{TV-L}^{1}$$ regularization term, [TeX:] $$\lambda$$ is the weight of the data term, and [TeX:] $$\tau$$ is the time step with [TeX:] $$\tau \leq \frac{1}{8}$$. The dual vector field [TeX:] $$p$$ is used to minimize the energy. We computed the optical flow with the formulas above via residual networks, as shown in Fig. 2.
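The iteration in Equations (3)-(6) can be made concrete with a single-scale NumPy sketch. This is a didactic simplification rather than module M1 itself: practical TV-L1 solvers also warp I2 by the current flow estimate, run a coarse-to-fine pyramid, and use matched forward/backward finite differences, whereas here np.gradient (central differences) is used for image gradients and the default λ, θ, τ values are only illustrative.

```python
import numpy as np

def tv_l1_flow(I1, I2, lam=0.15, theta=0.3, tau=0.125, n_iters=100):
    # Single-scale sketch of the TV-L1 updates in Equations (3)-(6).
    u = np.zeros((2,) + I1.shape)        # flow (x, y components), u = 0 initially
    p = np.zeros((2, 2) + I1.shape)      # dual field p for each flow channel
    gy, gx = np.gradient(I2)
    grad = np.stack([gx, gy])            # nabla I2
    grad_sq = (grad ** 2).sum(0) + 1e-12

    def div(q):                          # Equation (6): backward divergence
        # q[0] is the p^1 (x) component, differenced along columns;
        # q[1] is the p^2 (y) component, differenced along rows.
        d = np.zeros_like(q[0])
        d[:, 0] = q[0][:, 0]                         # i = 1 case
        d[:, 1:-1] = q[0][:, 1:-1] - q[0][:, :-2]    # 1 < i < N
        d[:, -1] = -q[0][:, -2]                      # i = N case
        d[0, :] += q[1][0, :]                        # j = 1 case
        d[1:-1, :] += q[1][1:-1, :] - q[1][:-2, :]   # 1 < j < M
        d[-1, :] += -q[1][-2, :]                     # j = M case
        return d

    thr = lam * theta * grad_sq
    for _ in range(n_iters):
        rho = I2 + (grad * u).sum(0) - I1            # residual, linearized at u0 = 0
        # Equation (4): closed-form thresholding step for v
        v = u - rho * grad / grad_sq                 # |rho| <= thr case
        v = np.where(rho < -thr, u + lam * theta * grad, v)
        v = np.where(rho > thr, u - lam * theta * grad, v)
        # Equation (3): u = v + theta * div(p)
        u = v + theta * np.stack([div(p[0]), div(p[1])])
        # Equation (5): update the dual variables p
        for k in range(2):
            uy, ux = np.gradient(u[k])
            g = np.stack([ux, uy])
            norm = np.sqrt((g ** 2).sum(0))
            p[k] = (p[k] + (tau / theta) * g) / (1 + (tau / theta) * norm)
    return u
```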
Module M2 extracts the features of the optical flow, denoted [TeX:] $$F_{\text{of}}$$ in Fig. 3. We then trained the teacher sub-network to classify actions from the optical flow stream with a cross-entropy loss between the predicted class labels [TeX:] $$P_{\text{of}}$$ and the true class labels [TeX:] $$T$$. Finally, [TeX:] $$F_{\text{of}}$$ is transmitted to the student sub-network, which is trained by back propagation.

3.2 Student Sub-network

Module M3 in the student sub-network extracts the features of the RGB frames, [TeX:] $$F_{\text{rgb}}$$, and receives the optical flow features [TeX:] $$F_{\text{of}}$$ from the teacher sub-network, as shown in Fig. 4. We used a mean squared error (MSE) loss between [TeX:] $$F_{\text{rgb}}$$ and [TeX:] $$F_{\text{of}}$$ to back-propagate through the network, so that the RGB stream features can mimic the optical flow stream features and train the early part of the student sub-network. We then combined a cross-entropy loss between the predicted class [TeX:] $$P_{\text{rgb}}$$ and the true class [TeX:] $$T$$ with the MSE term to train the student sub-network as a whole, as shown in Equation (7).

(7)[TeX:] $$L_{RGB}=\text{CrossEntropy}\left(P_{rgb}, T\right)+\lambda\left\|F_{rgb}-F_{of}\right\|^{2}$$

where [TeX:] $$\lambda$$ is a scalar weight controlling the influence of the motion feature.
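A minimal sketch of the student's training step under Equation (7), reusing the hypothetical teacher and student modules from the sketch in Section 3. Here paired_loader, the dummy test clip, and the value of λ are assumptions rather than values from the paper, and nn.MSELoss (a mean) stands in for the squared norm up to a constant factor.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()                        # stands in for ||F_rgb - F_of||^2
ce = nn.CrossEntropyLoss()
lam = 50.0                                # illustrative weight, not from the paper

paired_loader = []  # placeholder: iterable of (rgb_clip, flow_clip, label) batches
opt_s = torch.optim.SGD(student.parameters(), lr=0.1,
                        momentum=0.9, weight_decay=5e-4)
for rgb, flow, labels in paired_loader:
    with torch.no_grad():                 # frozen teacher supplies F_of
        f_of, _ = teacher(flow)
    f_rgb, p_rgb = student(rgb)
    # Equation (7): L_RGB = CrossEntropy(P_rgb, T) + lambda * ||F_rgb - F_of||^2
    loss = ce(p_rgb, labels) + lam * mse(f_rgb, f_of)
    opt_s.zero_grad(); loss.backward(); opt_s.step()

# Test stage: only the student runs, so no optical flow is computed.
student.eval()
with torch.no_grad():
    test_clip = torch.randn(1, 3, 8, 112, 112)   # dummy RGB clip (N, C, T, H, W)
    _, prediction = student(test_clip)
```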
4. Experiment

We focused on a popular dataset for action recognition, HMDB51 [11]. It consists of 51 action classes with more than 6,800 videos and three splits for training and testing; there are 3,570 clips in the training set and 1,530 clips in the test set. In our experiment, we randomly extracted 25 RGB frames from each video clip as the input stream and cropped each sampled frame to 224×224. In the action recognition modules, we used the SGD optimization method with a weight decay of 0.0005, a momentum of 0.9, and an initial learning rate of 0.1. The accuracy of our experiment on HMDB51 was 54.5%. Table 1 compares this result with R3D-18 using a single RGB stream, [TeX:] $$\text{TV-L}^{1}$$, and two-stream network modules.

The optical flow frames are shown in Fig. 5. The three columns on the left, (a)-(c), are RGB frames extracted from video clips, and the two images on the right, (d) and (e), are the optical flow frames [TeX:] $$u$$ extracted from adjacent RGB frames. In the first row, the action class is 'pour'; there is significant movement in the frames, with an obvious change in the optical flow trajectory in frames (d) and (e). In the second row, the action class is 'cartwheel', with the objects including a girl and a cartwheel; the change in movement information is visible in (d) and (e). In the third row, the class is 'shake_hands'; the main motion is two people shaking hands, and optical flow image (e) shows no obvious movement change between frames (b) and (c). In the fourth row, the main motion is a man hitting a golf ball, with obvious movement changes in (d) and (e). In the fifth row, the class is 'handstand'; the movements and posture of the active man change significantly. In the sixth row, the class is 'shoot_ball'; the changes in the boy's shooting action and the coach's posture can be seen in (d) and (e).

5. Conclusion

We introduced an architecture for action recognition based on a teacher-student neural network. The student model took only video clips as an input stream but was able to extract both appearance and motion information. The model was trained by minimizing the loss between its features and the features of the optical flow stream, combined with a cross-entropy loss for action recognition. Our architecture achieved higher accuracy on the HMDB51 benchmark than other popular methods. The proposed method remains sensitive to lighting changes and camera motion between frames. In future studies, we will address these weaknesses, expand to other video datasets such as UCF101 [12], and improve the network architecture for better performance.

Biography

Yulan Zhao
https://orcid.org/0000-0002-9469-5119
e-mail: zhaoyulan27@naver.com
She received an M.S. degree in computer science from Northeast Electric Power Univ. in 2009. She is currently pursuing a Ph.D. degree in the Department of Computer Science and Engineering at Jeonbuk National Univ. Her research interests are computer vision, image processing, artificial intelligence, and action recognition.

Biography

Hyo Jong Lee
https://orcid.org/0000-0003-2581-5268
e-mail: hlee@jbnu.ac.kr
He received a Ph.D. degree in computer science from the University of Utah in 1991. He has been a professor at Jeonbuk National Univ. since 1991. His research interests include computer graphics, image processing, parallel processing, and artificial intelligence.

References