Human Action Recognition in First-Person Videos Using Verb-Object Pairs


Gokce Z., Pehlivan Tort S.

27th Signal Processing and Communications Applications Conference (SIU 2019), Sivas, Turkey, 24 - 26 April 2019

  • Publication Type: Conference Paper / Full Text
  • DOI: 10.1109/siu.2019.8806562
  • City: Sivas
  • Country: Turkey
  • Keywords: action recognition, first-person video, deep learning
  • TED University Affiliated: Yes

Abstract

© 2019 IEEE. Human action recognition is important for distinguishing the rich variety of human activities in first-person videos. While egocentric action recognition has improved, the space of action categories is large, and labeling training data for every category is impractical. In this work, we decompose action models into verb and noun model pairs and propose a method to combine them with a simple fusion strategy. Specifically, we use a 3D convolutional neural network, C3D, for the verb stream to model video-based features, and an object detection model, YOLO, for the noun stream to model the objects that the person interacts with. We present experiments on the recently introduced large-scale EGTEA Gaze+ dataset with 106 action classes and show that our model is comparable to state-of-the-art action recognition models.
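As a rough illustration of the verb-noun fusion idea described in the abstract (not the authors' exact scheme), the sketch below combines verb-stream and noun-stream probabilities multiplicatively over the (verb, noun) pairs that define each action class. All identifiers (`fuse_verb_noun_scores`, `verb_probs`, `noun_probs`, `action_pairs`) and the placeholder pairing are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fuse_verb_noun_scores(verb_probs, noun_probs, action_pairs):
    """Fuse verb-stream and noun-stream probabilities into action scores.

    verb_probs:   (V,) softmax scores from the verb stream (e.g. C3D)
    noun_probs:   (N,) object scores from the noun stream (e.g. YOLO)
    action_pairs: list of (verb_idx, noun_idx) tuples, one per action class

    Each action class is scored by the product of its verb and noun
    probabilities (one simple late-fusion choice among several possible).
    """
    scores = np.array([verb_probs[v] * noun_probs[n] for v, n in action_pairs])
    return scores / scores.sum()  # normalize to a distribution over actions

# Hypothetical usage with 106 action classes built from a placeholder
# pairing of verb and noun indices (the real dataset defines these pairs).
rng = np.random.default_rng(0)
verb_probs = rng.dirichlet(np.ones(19))   # assumed number of verb classes
noun_probs = rng.dirichlet(np.ones(51))   # assumed number of noun classes
action_pairs = [(i % 19, i % 51) for i in range(106)]
action_scores = fuse_verb_noun_scores(verb_probs, noun_probs, action_pairs)
print("predicted action class:", int(action_scores.argmax()))
```

A product of independently normalized stream scores is only one fusion strategy; a weighted sum of log-probabilities or a learned combination would slot into the same interface.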