2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	IVMSP-23.6
Paper Title	AGENT-ENVIRONMENT NETWORK FOR TEMPORAL ACTION PROPOSAL GENERATION
Authors	Viet-Khoa Vo-Ho, Hoang-Ngan Le, Kashu Kamazaki, University of Arkansas, United States; Akihiro Sugimoto, National Institute of Informatics, Japan; Minh-Triet Tran, University of Science, Vietnam National University - Ho Chi Minh City, Vietnam
Session	IVMSP-23: Applications 1
Location	Gather.Town
Session Time	Thursday, 10 June, 15:30 - 16:15
Presentation Time	Thursday, 10 June, 15:30 - 16:15
Presentation	Poster
Topic	Image, Video, and Multidimensional Signal Processing: [IVARS] Image & Video Analysis, Synthesis, and Retrieval
Abstract	Temporal action proposal generation is an essential and challenging task that aims to localize temporal intervals containing human actions in untrimmed videos. Most existing approaches cannot follow the human cognitive process of understanding video context because they lack an attention mechanism that expresses the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition of an action as a human, known as an agent, interacting with the environment and performing an action that affects it, we propose a contextual Agent-Environment Network (AEN). The proposed contextual AEN involves (i) an agent pathway, operating at a local level to indicate which humans/agents are acting, and (ii) an environment pathway, operating at a global level to indicate how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method consistently outperforms state-of-the-art methods regardless of the backbone network employed.