Institution: ISTI-CNR
Tuning Neural ODE Networks to Increase Adversarial Robustness in Image Forensics
MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection
On the Generalization of Deep Learning Models in Video Deepfake Detection
Rhythmic and Psycholinguistic Features for Authorship Tasks in the Spanish Parliament: Evaluation and Analysis
Binary quantification and dataset shift: an experimental investigation
Regularization-Based Methods for Ordinal Quantification
Induced Permutations for Approximate Metric Search
VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024
Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition
Interactive multimodal video search: an extended post-evaluation for the VBS 2022 competition
The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation
Scalable bio-inspired training of Deep Neural Networks with FastHebb
VISIONE for newbies: an easier-to-use video retrieval system
Will VISIONE Remain Competitive in Lifelog Image Search?
Detecting Images Generated by Diffusers
A deep learning-based pipeline for whitefly pest abundance estimation on chromotropic sticky traps
Report on the 3rd International Workshop on Learning to Quantify (LQ 2023)
Vec2Doc: transforming dense vectors into sparse representations for efficient information retrieval
Development of a Realistic Crowd Simulation Environment for Fine-grained Validation of People Tracking Methods
CrowdSim2: an Open Synthetic Benchmark for Object Detectors
A Spatio-Temporal Attentive Network for Video-Based Crowd Counting
Learning to Detect Fallen People in Virtual Worlds
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at: https://github.com/mesnico/text-to-motion-retrieval.
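To make the divided space-time attention idea concrete, here is a minimal PyTorch sketch of one such block over a skeleton sequence: attention is first applied across joints within each frame (space), then across frames for each joint (time). This is an illustrative reconstruction under stated assumptions (input shape, pre-norm residuals, layer sizes), not the authors' MoT implementation; see the linked repository for the actual model.

```python
# Illustrative sketch of divided space-time attention over a skeleton
# sequence of shape (batch B, frames T, joints J, feature dim D).
# Not the authors' MoT code; hyperparameters are assumptions.
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, J, D = x.shape
        # Spatial attention: attend across joints within each frame.
        s = x.reshape(B * T, J, D)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]
        x = s.reshape(B, T, J, D)
        # Temporal attention: attend across frames for each joint.
        t = x.permute(0, 2, 1, 3).reshape(B * J, T, D)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm)[0]
        return t.reshape(B, J, T, D).permute(0, 2, 1, 3)

# Usage: 2 sequences of 30 frames, 21 joints, 64-dim joint features.
# block = DividedSpaceTimeBlock(dim=64, num_heads=4)
# out = block(torch.randn(2, 30, 21, 64))  # same shape as the input
```

Factorizing attention this way keeps the cost at O(J^2 + T^2) per token instead of O((J*T)^2) for joint full attention, which is the usual motivation for the divided scheme.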
Escaping local minima in deep reinforcement learning for video summarization
State-of-the-art deep neural unsupervised video summarization methods mostly fall under the adversarial reconstruction framework, which employs a Generative Adversarial Network (GAN) structure and Long Short-Term Memory (LSTM) auto-encoders during its training stage. The typical result is a selector LSTM that sequentially receives video frame representations and outputs corresponding scalar importance factors, which are then used to select key-frames. This basic approach has been augmented with an additional Deep Reinforcement Learning (DRL) agent, trained using the Discriminator's output as a reward, which learns to optimize the selector's outputs. However, local minima are a well-known problem in DRL. Thus, this paper presents a novel regularizer for escaping local loss minima, in order to improve unsupervised key-frame extraction. It is an additive loss term, employed during a second training phase, which rewards the difference of the neural agent's parameters from those of a previously found good solution. It thereby encourages the training process to explore the parameter space more aggressively in order to discover a better local loss minimum. Evaluation on two public datasets shows considerable gains over the baseline and improvements over the state-of-the-art.
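The regularizer described above admits a very compact formulation: during the second training phase, subtract (i.e. reward) a measure of distance between the agent's current parameters and a snapshot of the previously found good solution. The following is a minimal sketch assuming a PyTorch agent; the coefficient lambda_reg and the squared-L2 distance are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Minimal sketch of a parameter-divergence regularizer for a second
# training phase. Illustrative only: lambda_reg and the squared-L2
# distance are assumed choices, not the paper's exact formulation.
import torch

def regularized_loss(base_loss: torch.Tensor,
                     model: torch.nn.Module,
                     anchor_params: list,
                     lambda_reg: float = 1e-3) -> torch.Tensor:
    """Reward divergence from a previously found good solution by
    subtracting the squared L2 distance between the current parameters
    and the frozen anchor snapshot."""
    divergence = sum(((p - a) ** 2).sum()
                     for p, a in zip(model.parameters(), anchor_params))
    return base_loss - lambda_reg * divergence

# Usage: snapshot the good solution once, then apply in phase two.
# anchor = [p.detach().clone() for p in model.parameters()]
# loss = regularized_loss(drl_loss, model, anchor)
# loss.backward()
```

Because the divergence term enters with a negative sign, gradient descent on the combined loss pushes the parameters away from the anchor, trading off exploration of the parameter space against the original DRL objective via lambda_reg.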