Human Action Recognition in First-Person Videos Using Verb-Object Pairs


Gokce Z., Pehlivan Tort S.

27th Signal Processing and Communications Applications Conference (SIU 2019), Sivas, Turkey, 24 - 26 April 2019

  • Publication Type: Conference Paper / Full Text
  • DOI: 10.1109/siu.2019.8806562
  • City: Sivas
  • Country: Turkey
  • Keywords: action recognition, first-person video, deep learning
  • TED University Affiliated: Yes

Abstract

© 2019 IEEE. Human action recognition is important for distinguishing the rich variety of human activities in first-person videos. While egocentric action recognition has improved, the space of action categories is large, and labeling training data for every category is impractical. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Specifically, we use a 3D convolutional neural network, C3D, for the verb stream to model video-based features, and an object detection model, YOLO, for the noun stream to model the objects that the person interacts with. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes and show that our model is comparable to state-of-the-art action recognition models.
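As a rough illustration of the verb-noun fusion idea described in the abstract (not the authors' exact scheme), the sketch below combines verb-stream and noun-stream probabilities multiplicatively over the (verb, noun) pairs that define each action class. All identifiers (`fuse_verb_noun_scores`, `verb_probs`, `noun_probs`, `action_pairs`) and the placeholder pairing are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fuse_verb_noun_scores(verb_probs, noun_probs, action_pairs):
    """Fuse verb-stream and noun-stream probabilities into action scores.

    verb_probs:   (V,) softmax scores from the verb stream (e.g. C3D)
    noun_probs:   (N,) object scores from the noun stream (e.g. YOLO)
    action_pairs: list of (verb_idx, noun_idx) tuples, one per action class

    Each action class is scored by the product of its verb and noun
    probabilities (one simple late-fusion choice among several possible).
    """
    scores = np.array([verb_probs[v] * noun_probs[n] for v, n in action_pairs])
    return scores / scores.sum()  # normalize to a distribution over actions

# Hypothetical usage with 106 action classes built from a placeholder
# pairing of verb and noun indices (the real dataset defines these pairs).
rng = np.random.default_rng(0)
verb_probs = rng.dirichlet(np.ones(19))   # assumed number of verb classes
noun_probs = rng.dirichlet(np.ones(51))   # assumed number of noun classes
action_pairs = [(i % 19, i % 51) for i in range(106)]
action_scores = fuse_verb_noun_scores(verb_probs, noun_probs, action_pairs)
print("predicted action class:", int(action_scores.argmax()))
```

A product of independently normalized stream scores is only one fusion strategy; a weighted sum of log-probabilities or a learned combination would slot into the same interface.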