Alan F. Smeaton; Alba Seco de Herrera; Bogdan Ionescu; Claire-Hélène Demarty; Faiyaz Doctor; Graham Healy; Lorin Sweeney; Mihai Gabriel Constantin; Rukiye Savran Kiziltepe;
Dublin City University; InterDigital Paris; University of Essex; University Politehnica of Bucharest
Using a collection of publicly available links to short-form video clips with an average duration of 6 seconds each, 1,275 users manually annotated each video multiple times to indicate both long-term and short-term memorability of the videos. The annotations were gathered as part of an online memory game and measured a participant’s ability to recall having seen the video previously when shown a collection of videos. The recognition tasks were performed on videos seen within the previous few minutes for short-term memorability and within the previous 24 to 72 hours for long-term memorability. Data includes the reaction times for each recognition of each video. Associated with each video are text descriptions (captions) as well as a collection of image-level features applied to 3 frames extracted from each video (start, middle and end). Video-level features are also provided. The dataset was used in the Video Memorability task as part of the MediaEval benchmark in 2020.
Open Access
Journal article
Data in Brief
Daniel Gatica-Perez; Mario Parra
Idiap Research Institute
In this study, we evaluated the feasibility of using zero-shot classification models for activity recognition in a Digital Sommelier. Our experiment involved preprocessing video data by extracting frames and categorizing user activities related to a wine-tasting scenario. Image classification models demonstrated high accuracy, nearing 90%, in distinguishing between “engaged” and “disengaged” states. However, video classification models showed lower performance in classifying user activities such as “observing wine”, “smelling wine” and “sipping wine”, with an average accuracy of around 50%, due to the interdependent nature of the activities. Despite these challenges, our findings highlight the potential of zero-shot classification models in enhancing virtual assistants’ ability to recognize and respond to user activities.
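For illustration, a minimal sketch of zero-shot frame classification with a CLIP model (via the Hugging Face Transformers library) is given below. The checkpoint name and label prompts are assumptions for demonstration, not the exact configuration used in the study.

```python
# Minimal sketch: zero-shot classification of an extracted video frame with CLIP.
# The checkpoint and the label prompts are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a person engaged with the assistant", "a person looking away, disengaged"]
frame = Image.open("frame_0001.jpg")  # hypothetical frame extracted from the video

inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```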
Open Access
Publication
N/A
Mathias-Felipe de Lima-Santos; Wilson Ceron
Universidade Federal de São Paulo; University of Amsterdam;
The information landscape has undergone significant transformations with the widespread adoption of the internet and online social networks. This has led to both positive and negative consequences. On the positive side, information can now spread quickly and reach a vast audience. Social media platforms have played a crucial role in fostering a culture of participation by motivating people to actively create and share content. However, there are also drawbacks. Social media platforms employ algorithms that restrict the diversity of content users are exposed to, leading to the reinforcement of pre-existing beliefs, commonly referred to as “echo chambers”.
Open Access
Book section
Mapping Lies in the Global Media Sphere - Routledge
Alex Gomez-Villa; Bartłomiej Twardowski; Joost van de Weijer; Marco Buzzelli; Simone Zini
Autonomous University of Barcelona; University of Florence; University of Milano-Bicocca
Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation – which we call Planckian Jitter – that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets.
In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations. Official code available at: https://github.com/TheZino/PlanckianJitter
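For illustration only, a toy sketch of illuminant-based channel scaling is shown below. The white-point gains are coarse assumptions and do not reproduce the actual Planckian Jitter augmentation; see the official code linked above for the real implementation.

```python
# Toy illustration: re-illuminate an image with an approximate blackbody white point
# instead of applying random, physically implausible channel jitter.
# The RGB gains are assumed, illustrative values only.
import random
import numpy as np

WHITE_POINTS = {
    3000: (1.00, 0.71, 0.42),   # warm, tungsten-like (assumed values)
    5000: (1.00, 0.89, 0.76),
    6500: (1.00, 0.97, 1.00),   # roughly daylight
    10000: (0.78, 0.84, 1.00),  # cool, blue-sky illumination
}

def planckian_like_jitter(img: np.ndarray) -> np.ndarray:
    """img: float32 array in [0, 1], shape (H, W, 3), assumed linear RGB."""
    gains = np.array(random.choice(list(WHITE_POINTS.values())), dtype=np.float32)
    return np.clip(img * gains, 0.0, 1.0)
```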
Open Access
Conference paper
N/A
Hanna Lukashevich; Jakob Abeßer; Joachim Bös; Sascha Grollmisch; Sebastian Stober
Fraunhofer IDMT; Otto-von-Guericke University Magdeburg
Music classification algorithms use signal processing and machine learning approaches to extract and enrich metadata for audio recordings in music archives. Common tasks include music genre classification, where each song is assigned a single label (such as Rock, Pop, or Jazz), and musical instrument classification. Since music metadata can be ambiguous, classification algorithms cannot always achieve fully accurate predictions. Therefore, our focus extends beyond the correctly estimated class labels to include realistic confidence values for each potential genre or instrument label. In practice, many state-of-the-art classification algorithms based on deep neural networks exhibit overconfident predictions, complicating the interpretation of the final output values. In this work, we examine whether the issue of overconfident predictions and, consequently, non-representative confidence values is also relevant to music genre classification and musical instrument classification.
Moreover, we describe techniques to mitigate this behavior and assess the impact of deep ensembles and temperature scaling in generating more realistic confidence outputs, which can be directly employed in real-world music tagging applications.
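As an illustration of one of the calibration techniques mentioned, a minimal, generic sketch of temperature scaling is given below; it is not the authors' code.

```python
# Temperature scaling (Guo et al., 2017): fit one scalar T on held-out validation
# logits so that softmax(logits / T) yields better-calibrated confidences.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    log_t = torch.zeros(1, requires_grad=True)            # optimize log T to keep T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())

# At inference time: calibrated_probs = softmax(test_logits / T)
```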
Open Access
Conference paper Publication
Audio Mostly Conference
Alejandro Moreo; Berta Chulvi; Paolo Rosso; Silvia Corbara
ISTI-CNR; Scuola Normale Superiore; Universitat Politècnica de València
Among the many tasks of the authorship field, Authorship Identification aims at uncovering the author of a document, while Author Profiling focuses on the analysis of personal characteristics of the author(s), such as gender, age, etc. Methods devised for such tasks typically focus on the style of the writing, and are expected not to make inferences grounded on the topics that certain authors tend to write about. In this paper, we present a series of experiments evaluating the use of topic-agnostic feature sets for Authorship Identification and Author Profiling tasks in Spanish political language. In particular, we propose to employ features based on rhythmic and psycholinguistic patterns, obtained via different approaches of text masking that we use to actively mask the underlying topic. We feed these feature sets to an SVM learner, and show that they lead to results that are comparable to those obtained by a BETO transformer, when the latter is trained on the original text, i.e., potentially learning from topical information. Moreover, we further investigate the results for the different authors, showing that variations in performance are partially explainable.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Andrea Ciamarra; Federico Becattini; Lorenzo Seidenari;
University of Florence;
Forecasting motion and spatial positions of objects is of fundamental importance, especially in safety-critical settings such as autonomous driving. In this work, we address the issue by forecasting two different modalities that carry complementary information, namely optical flow and depth. To this end we propose FLODCAST, a flow and depth forecasting model that leverages a multitask recurrent architecture, trained to jointly forecast both modalities at once. We stress the importance of training using flows and depth maps together, demonstrating that both tasks improve when the model is informed of the other modality. We train the proposed model to also perform predictions for several timesteps in the future. This provides better supervision and leads to more precise predictions, retaining the capability of the model to yield outputs autoregressively for any future time horizon. We test our model on the challenging Cityscapes dataset, obtaining state-of-the-art results for both flow and depth forecasting. Thanks to the high quality of the generated flows, we also report benefits on the downstream task of segmentation forecasting, injecting our predictions in a flow-based mask-warping framework.
Open Access
Journal article
Pattern Recognition Letters
Christos Koutlis; Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Ioannis Sarridis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
In this paper, we introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. This is achieved through a curriculum learning-based training scheme that considers the distillation difficulty of each layer and the critical learning periods when the information flow paths are established. This procedure can lead to a student model that is better prepared to learn from the teacher. To ensure the applicability of InDistill across a wide range of teacher-student pairs, we also incorporate a pruning operation when there is a discrepancy in the width of the teacher and student layers. This pruning operation reduces the width of the teacher’s intermediate layers to match those of the student, allowing direct distillation without the need for an encoding stage. The proposed method is extensively evaluated using various pairs of teacher-student architectures on CIFAR-10, CIFAR-100, and ImageNet datasets, showcasing that preserving the information flow paths consistently increases the performance of the baseline KD approaches in both classification and retrieval settings.
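A hedged sketch of the width-matching idea, reducing a teacher's intermediate feature map so it can be compared directly with a narrower student layer, is given below. The channel-selection criterion shown (mean L2 magnitude) is a placeholder, not necessarily the pruning criterion defined by InDistill.

```python
# Sketch: prune teacher channels to the student's width, then distill directly (MSE),
# avoiding an extra encoding stage. Channel selection by magnitude is an assumption.
import torch
import torch.nn.functional as F

def match_width(teacher_feat: torch.Tensor, student_channels: int) -> torch.Tensor:
    """teacher_feat: (B, C_t, H, W) with C_t >= student_channels."""
    importance = teacher_feat.flatten(2).norm(dim=2).mean(dim=0)   # per-channel score
    keep = importance.topk(student_channels).indices
    return teacher_feat[:, keep]

def intermediate_distill_loss(teacher_feat, student_feat):
    # Assumes teacher and student feature maps share the same spatial resolution.
    pruned = match_width(teacher_feat, student_feat.shape[1])
    return F.mse_loss(student_feat, pruned.detach())
```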
Open Access
Conference paper
N/A
Christos Papaioannidis; Ioanna Valsamara; Ioannis Pitas;
Aristotle University of Thessaloniki;
Recently, multi-agent systems that facilitate knowledge sharing among Deep Neural Network (DNN) agents have gained increasing attention. This paper explores the dynamics of multi-agent systems that support Teacher-Student DNN interactions, where knowledge is distilled from Teachers to Students. Within such systems, selecting the most compatible Teacher for a given task is far from trivial and can lead to low-quality decisions. Hence, the need arises for accurate domain knowledge evaluation. In that context, we propose including an Out-of-Distribution (OOD) detection module in each DNN agent to enable effective agent expertise evaluation and precise identification of suitable Teachers. This setup allows Student agents to distill knowledge from the most knowledgeable Teachers within a specific domain, ensuring optimal system performance. To effectively utilize OOD detection in this context, we address key challenges such as determining the minimum data cardinality required to ensure optimal performance and reliable inferences of the OOD detectors.
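As a purely illustrative sketch, one simple score an agent's OOD module could expose is the maximum softmax probability; the paper does not specify this particular detector, and it is shown here only to indicate the kind of signal a Student could query when ranking candidate Teachers.

```python
# Maximum softmax probability (MSP) as a simple OOD score (a Hendrycks & Gimpel-style
# baseline; an assumed stand-in, not the detector used in the paper).
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_in_distribution_score(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """One confidence score per sample; low values suggest out-of-distribution input."""
    logits = model(x)
    return F.softmax(logits, dim=-1).max(dim=-1).values

# A Student could average this score over a probe batch from the target domain and
# prefer the Teacher whose detector reports the highest in-distribution confidence.
```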
Open Access
Paper Publication Research article
N/A
Christos Papaioannidis; Ioanna Valsamara; Ioannis Pitas;
Aristotle University of Thessaloniki;
In today’s data-driven world, the exponential growth of data across various sectors presents unique opportunities and challenges. In this paper, we propose a novel method tailored to enhance the efficiency of Deep Neural Networks (DNNs) in managing these vast data amounts. The primary challenge addressed is the ability of DNNs to provide inferences on the minimal amount of data without sacrificing their quality, a significant concern given the vast scales involved in big data analytics. Our approach emphasizes DNN inference efficiency and reliability, enabling DNNs to deliver accurate inferences while substantially reducing computational complexity. This study explores the increasingly attractive deployment of DNNs for complex tasks, focusing on determining the minimal amount of data necessary to ensure optimal network performance and reliable inference outputs, improving the applicability of DNNs across various big data environments.
Open Access
Paper Publication Research article
N/A
Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki; University of Antwerp
In the realm of machine learning systems, achieving consensus among networking nodes is a fundamental yet challenging task. This paper presents Proof of Quality Inference (PoQI), a novel consensus protocol designed to integrate deep learning inference under the basic format of the Practical Byzantine Fault Tolerant (P-BFT) algorithm. PoQI is applied to Deep Neural Networks (DNNs) to infer the quality and authenticity of produced estimations by evaluating the trustworthiness of the DNN node’s decisions. In this manner, PoQI enables DNN inference nodes to reach a consensus on a common DNN inference history in a fully decentralized fashion, rather than relying on a centralized inference decision-making process. Through P-BFT adoption, our method ensures Byzantine fault tolerance, permitting DNN nodes to reach an agreement on inference validity swiftly and efficiently. We demonstrate the efficacy of PoQI through theoretical analysis and empirical evaluations, highlighting its potential to forge trust among unreliable DNN nodes.
Open Access
Paper Preprint Publication
N/A
Antidio Viguria; Francisco Pérez-Grau; Ioannis Pitas; Marco Montes-Grova; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Novel view synthesis is the task of generating new images that render an object or scene from a different viewpoint than the one given. It aims to create new views of a specific subject starting from a number of pictures taken from known points of view. The novel view synthesis problem can be approached in two different ways: as interpolation between two known images, or as extrapolation from one image or a subset of images. In this work, the extrapolation problem is addressed, taking advantage of the fact that the trajectories we want the capturing camera to execute can be pre-calculated from a series of known shot types. Based on this and on autoregressive Transformers, we present an end-to-end tool for novel-view synthesis from previously unvisited points of view for aerial cinematography robots.
Open Access
Paper Preprint Publication
N/A
Anestis Christidis; Christos Papaioannidis; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Human gesture recognition is a very important tool in human-computer or human-robot interaction. In many cases, such algorithms may need to be executed on systems with limited computational capabilities, due to size or weight constraints, introducing restrictions that can impede gesture recognition performance. This paper proposes a gesture recognition method that is based on a very simple and lightweight Deep Neural Network (DNN) architecture, suitable for embedded execution. In order to achieve increased accuracy without a large computational/memory overhead, the proposed method utilizes as input both full 2D human body skeletons and image patches extracted from regions of interest (e.g., around human arms) in each video frame. These two input types are processed in parallel by separate modules and the corresponding features are fused before being exploited for gesture recognition. Reliance on 2D skeleton sequences allows the utilization of a lightweight DNN architecture, while the image patches convey rich semantic information that enhances gesture recognition performance. This approach is unlike existing similar methods, which only exploit skeleton sequences. Experimental evaluation indeed shows increased recognition accuracy, indicating that the proposed method offers a reliable solution for human gesture recognition on embedded systems.
Open Access
Paper Preprint Publication
N/A
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR;
Recent work on sample efficient training of Deep Neural Networks (DNNs) proposed a semi-supervised methodology based on biologically inspired Hebbian learning, combined with traditional backprop-based training. Promising results were achieved on various computer vision benchmarks, in scenarios of scarce labeled data availability. However, current Hebbian learning solutions can hardly address large-scale scenarios due to their demanding computational cost. In order to tackle this limitation, in this contribution, we investigate a novel solution, named FastHebb (FH), based on the reformulation of Hebbian learning rules in terms of matrix multiplications, which can be executed more efficiently on GPU. Starting from Soft-Winner-Takes-All (SWTA) and Hebbian Principal Component Analysis (HPCA) learning rules, we formulate their improved FH versions: SWTA-FH and HPCA-FH. We experimentally show that the proposed approach accelerates training speed up to 70 times, allowing us to gracefully scale Hebbian learning experiments on large datasets and network architectures such as ImageNet and VGG.
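A minimal sketch of the general reformulation idea, expressing a soft-winner-takes-all Hebbian update as batched matrix multiplications, is given below; the exact SWTA-FH and HPCA-FH update rules differ and are defined in the paper.

```python
# Soft-WTA Hebbian step written as matrix products over a whole mini-batch, so it
# maps well to GPU execution. Illustrative only; not the FastHebb rules themselves.
import torch

def swta_hebbian_step(W: torch.Tensor, x: torch.Tensor, lr: float = 0.01,
                      temperature: float = 0.1) -> torch.Tensor:
    """W: (units, features) weights; x: (batch, features) inputs."""
    y = torch.softmax(x @ W.t() / temperature, dim=1)      # soft competition among units
    # Each unit moves toward the inputs it (softly) won, with an Oja-like decay term.
    delta = y.t() @ x - y.sum(dim=0).unsqueeze(1) * W
    return W + lr * delta
```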
Open Access
Journal article
Carlos Santiago; Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi;
Institute of Information Science and Technologies; Instituto Superior Técnico;
This work addresses the challenge of video violence detection in data-scarce scenarios, focusing on bridging the domain gap that often hinders the performance of deep learning models when applied to unseen domains. We present a novel unsupervised domain adaptation (UDA) scheme designed to effectively mitigate this gap by combining supervised learning in the train (source) domain with unlabeled test (target) data. We employ single-image classification and multiple instance learning (MIL) to select frames with the highest classification scores, and, upon this, we exploit UDA techniques to adapt the model to unlabeled target domains. We perform an extensive experimental evaluation, using general-context data as the source domain and target domain datasets collected in specific environments, such as violent/non-violent actions in hockey matches and public transport. The results demonstrate that our UDA pipeline substantially enhances model performances, improving their generalization capabilities in novel scenarios without requiring additional labeled data.
Open Access
Journal article
SN Computer Science
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari; Lucile Sassatelli; Quentin Guimard
Institut Universitaire de France; Université Côte d'Azur; University of Florence;
Prediction of head movements in immersive media is key to designing efficient streaming systems able to focus the bandwidth budget on visible areas of the content. However, most of the numerous proposals made to predict user head motion in 360° images and videos do not explicitly consider a prominent characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. To our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible, lightweight stochastic prediction model compatible with sequence-to-sequence neural architectures. Experimental results on 4 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 41% on prediction horizons up to 5 sec., at lower computational and memory costs. To understand how the learned features account for the motion uncertainty, we analyze the structure of the learned latent space and connect it with the physical properties of the trajectories. We also introduce a method to estimate the likelihood of each generated trajectory, enabling the integration of DVMS in a streaming system. We hence deploy an extensive evaluation of the interest of our DVMS proposal for a streaming system. To do so, we first introduce a new Python-based 360° streaming simulator that we make available to the community. On real-world user, video, and networking data, we show that predicting multiple trajectories yields higher fairness between the traces, the gains for 20 to 30% of the users reaching up to 10% in visual quality for the best number K of trajectories to generate.
Open Access
Journal article
ACM Transactions on Multimedia Computing, Communications, and Applications
Claudio Gennaro; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
Permutation-based Indexing (PBI) approaches have been proven to be particularly effective for conducting large-scale approximate metric searching. These methods rely on the idea of transforming the original metric objects into permutation representations, which can be efficiently indexed using data structures such as inverted files.
The standard conceptualization of permutation associated with a metric object involves only the use of object distances and their relative orders from a set of anchors called pivots. In this paper, we generalized this definition in order to enlarge the class of permutation representations that can be used by PBI approaches. In particular, we introduced the concept of permutation induced by a space transformation and a sorting function, and we investigated which properties these transformations should possess to produce permutations that are effective for metric search. Furthermore, as a practical outcome, we defined a new type of permutation representation that is calculated using distances from pairs of pivots. This proposed technique allowed us to produce longer permutations than traditional ones for the same number of object pivot distance calculations. The advantage lies in the fact that when longer permutations are employed, the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase.
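For illustration, a minimal sketch of the classic pivot-based permutation representation and a cheap permutation comparison is given below; the pivot-pair generalization proposed in the paper is not reproduced here.

```python
# Classic construction: represent an object by the ranking of pivots sorted by their
# distance to it, then compare objects via a distance between permutations.
import numpy as np

def permutation_representation(obj: np.ndarray, pivots: np.ndarray) -> np.ndarray:
    """Rank of each pivot by its distance to the object (closest pivot gets rank 0)."""
    dists = np.linalg.norm(pivots - obj, axis=1)
    order = np.argsort(dists)                  # pivot ids sorted by closeness
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(pivots))      # position of each pivot in that ordering
    return ranks

def spearman_footrule(p: np.ndarray, q: np.ndarray) -> int:
    """Cheap surrogate distance between two permutation representations."""
    return int(np.abs(p - q).sum())
```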
Open Access
Journal article
N/A
Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;
Queen Mary University of London;
Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods’ dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state-of-the-art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes.
Open Access
Conference paper
European Conference on Computer Vision
Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;
Queen Mary University of London;
Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such contexts, how to learn in the presence of noisy labels has received more and more attention. Addressing this relatively intricate problem to attain competitive results predominantly involves designing mechanisms that select samples that are expected to have reliable annotations. However, these methods typically involve multiple off-the-shelf techniques, resulting in intricate structures. Furthermore, they frequently make implicit or explicit assumptions about the noise modes/ratios within the dataset. Such assumptions can compromise model robustness and limit its performance under varying noise conditions. Unlike these methods, in this work, we propose an efficient and effective framework with minimal hyperparameters that achieves SOTA results in various benchmarks. Specifically, we design an efficient and concise training framework consisting of a subset expansion module responsible for exploring non-selected samples and a model training module to further reduce the impact of noise, called NoiseBox . Moreover, diverging from common sample selection methods based on the “small loss” mechanism, we introduce a novel sample selection method based on the neighbouring relationships and label consistency in the feature space. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as Red Mini-ImageNet, WebVision, Clothing1M and ANIMAL-10N.
Open Access
Journal article
N/A
Andrea Esuli; Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato;
ISTI-CNR;
Open Access
Publication
N/A
Jean De Meyere; Noémie Krack
KU Leuven
Recently, the British police launched its first investigation into a case of virtual “rape” in the metaverse. This paper delves into the complex considerations that user safety and content moderation could pose through the prism of the recently adopted Digital Services Act (DSA). We first explore the current state of platforms operating metaverses. Metaverses are similar to current online platforms yet are differentiated by the use of XR technologies. Despite the low number of users on such platforms, specific issues related to the metaverse, such as the rise of disinformation or virtual sex crimes, have already been reported. This paper considers the following research questions: What legal challenges do specific metaverse platforms present in terms of user safety, and how does the DSA address these challenges? Attention will be brought to the impact of relevant obligations for user safety in metaverses. We continue our analysis by addressing the lack of risk assessment obligations for platforms operating metaverses, as they currently do not meet the threshold to be bound by these obligations under the DSA. We conclude with recommendations for policymakers on how to tackle the challenges posed by increased risks in the metaverse.
Open Access
Conference paper
International Congress towards a responsible development of Metaverse
Christos Koutlis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
The recently developed and publicly available synthetic image generation methods and services make it possible to create extremely realistic imagery on demand, raising great risks for the integrity and safety of online information. State-of-the-art Synthetic Image Detection (SID) research has led to strong evidence on the advantages of feature extraction from foundation models. However, such extracted features mostly encapsulate high-level visual semantics instead of fine-grained details, which are more important for the SID task. On the contrary, shallow layers encode low-level visual information. In this work, we leverage the image representations extracted by intermediate Transformer blocks of CLIP’s image-encoder via a lightweight network that maps them to a learnable forgery-aware vector space capable of generalizing exceptionally well. We also employ a trainable module to incorporate the importance of each Transformer block to the final prediction. Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement. Notably, the best performing models require just a single epoch for training (~8 minutes). Code available at https://github.com/mever-team/rine.
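A hedged sketch of the ingredient the method builds on, collecting CLS tokens from intermediate Transformer blocks of CLIP's image encoder, is given below; the learnable forgery-aware mapping and the block-importance module are in the linked repository, and the checkpoint name here is only an example.

```python
# Collect per-block CLS tokens from CLIP's vision encoder; a small trainable head
# (not shown) would map these to a real/synthetic prediction.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")  # example checkpoint
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def intermediate_cls_features(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    out = encoder(**inputs, output_hidden_states=True)
    # hidden_states: one (1, seq_len, dim) tensor per block; keep each block's CLS token.
    cls_per_block = [h[:, 0] for h in out.hidden_states[1:]]   # skip the embedding output
    return torch.stack(cls_per_block, dim=1)                   # (1, num_blocks, dim)
```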
Open Access
Conference paper
European Conference on Computer Vision
Ioannis Kompatsiaris; John Violos; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
This paper discusses four facets of the Knowledge Distillation (KD) process for Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) architectures, particularly when executed on edge devices with constrained processing capabilities. First, we conduct a comparative analysis of the KD process between CNNs and ViT architectures, aiming to elucidate the feasibility and efficacy of employing different architectural configurations for the teacher and student, while assessing their performance and efficiency. Second, we explore the impact of varying the size of the student model on accuracy and inference speed, while maintaining a constant KD duration. Third, we examine the effects of employing higher resolution images on the accuracy, memory footprint and computational workload. Last, we examine the performance improvements obtained by fine-tuning the student model after KD to specific downstream tasks. Through empirical evaluations and analyses, this research provides AI practitioners with insights into optimal strategies for maximizing the effectiveness of the KD process on edge devices.
Open Access
Conference paper
Signal Processing and Communication
Hannes Fassold;
Joanneum Research;
The detection of shot boundaries (hardcuts and short dissolves), sampling structure (progressive / interlaced / pulldown) and dynamic keyframes in a video are fundamental video analysis tasks which have to be done before any further high-level analysis tasks. We present a novel algorithm which performs all these analysis tasks in a unified way, by utilizing a combination of inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. The algorithm runs four times faster than real-time due to sparse and selective calculation of these measures.
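As an illustration of one of the measures mentioned, a sketch of a normalized cross-correlation score between consecutive downscaled frames is given below; the actual algorithm combines several motion-field and correlation measures and is considerably more involved.

```python
# Normalized cross-correlation (NCC) between consecutive frames: a sharp drop in this
# score is one possible indicator of a hardcut. Thresholds here are assumptions.
import cv2
import numpy as np

def ncc(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    a = cv2.resize(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), (64, 36)).astype(np.float32)
    b = cv2.resize(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), (64, 36)).astype(np.float32)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-8
    return float((a * b).sum() / denom)

# Example heuristic: flag a hardcut candidate when ncc(prev, curr) < 0.5 while the
# neighbouring frame pairs remain highly correlated.
```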
Open Access
Publication
Conference on Imaging, Signal Processing and Communication
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
VISIONE is a versatile video retrieval system supporting diverse search functionalities, including free-text, similarity, and temporal searches. Its recent success in securing first place in the 2024 Video Browser Showdown (VBS) highlights its effectiveness.
Originally designed for analyzing, indexing, and searching diverse video content, VISIONE can also be adapted to images from lifelog cameras thanks to its reliance on frame-based representations and retrieval mechanisms.
In this paper, we present an overview of VISIONE’s core characteristics and the adjustments made to accommodate lifelog images. These adjustments primarily focus on enhancing result visualization within the GUI, such as grouping images by date or hour to align with lifelog dataset imagery. It’s important to note that while the GUI has been updated, the core search engine and visual content analysis components remain unchanged from the version presented at VBS 2024. Specifically, metadata such as local time, GPS coordinates, and concepts associated with images are not indexed or utilized in the system. Instead, the system relies solely on the visual content of the images, with date and time information extracted from their filenames, which are utilized exclusively within the GUI for visualization purposes.
Our objective is to evaluate the system’s performance within the Lifelog Search Challenge, emphasizing reliance on visual content analysis without additional metadata.
Open Access
Conference paper
Bergman Clement; Frédéric Precioso; Julie Tores; Léa Andolfi; Lucile Sassatelli; Magali Guaresi; Sarah Lecossais; Thierry Devars; Victor Ecrement; Virginie Julliard; Wu Hui-Yin
Inria; Institut Universitaire de France; Sorbonne Université; Université Côte d'Azur; Université Sorbonne Paris Nord
In film gender studies, the concept of “male gaze” refers to the way the characters are portrayed on-screen as objects of desire rather than subjects. In this article, we introduce a novel video-interpretation task to detect character objectification in films. The purpose is to reveal and quantify the usage of complex temporal patterns operated in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset, made of 1914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models, showing the feasibility of the task and where the challenges remain with concept bottleneck models. Our new dataset and code are made available to the community.
Open Access
Conference paper Publication
IEEE Conference on Computer Vision and Pattern Recognition
Tobias Blanke;
University of Amsterdam;
Archives have long been a key concern of academic debates about truth, memory, recording and power and are important sites for social sciences and humanities research. This has been the case for traditional archives, but these debates have accelerated with the digital transformation of archives. The proliferation of digital tools and the fast-growing increase in digital materials have created very large digitised and born-digital archives. This article investigates how new digital archives continue existing archival practices while at the same time discontinuing them. We present novel methodologies and tools for changing memory and power relations in digital archives through new ways of reassembling marginalised, non-canonical entities in digital archives. Reassembling digital archives can take advantage of the materiality and the algorithmic processuality of digital collections and reshape them to inscribe lost voices and previously ignored differences. Digital archives are not fixed and are changed with new research and political questions and are only identified through new questions. The article presents six distinct techniques and strategies to reassemble digital archives and renders these according to three different types of new digital archives. We consider both the extension of archives towards evidence that is otherwise thrown away as well as the provision of new intensive, non-discriminatory viewpoints on existing collections.
Open Access
Journal article
N/A
Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Caldelli; Symeon Papadopoulos
CEA; CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest
Front matter of the proceedings of the 3rd ACM International Workshop on Multimedia AI against Disinformation, held in Phuket (Thailand) on June 10th, 2024.
The full proceedings are available online at https://dl.acm.org/doi/proceedings/10.1145/3643491.
Open Access
Book section
ACM Association for Computing Machinery
Adrian Popescu; Bogdan Ionescu; Cristian Stanciu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Roberto Caldelli; Symeon Papadopoulos
CEA; CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Mercatorum University; University Politehnica of Bucharest
Synthetic media generation and manipulation have seen rapid advancements in recent years, making it increasingly easy to create multimedia content that is indistinguishable to the human observer. Moreover, generated content can be used maliciously by individuals and organizations in order to spread disinformation, posing a significant threat to society and democracy. Hence, there is an urgent need for AI tools geared towards facilitating a timely and effective media verification process. The MAD’24 workshop seeks to bring together people with diverse backgrounds who are dedicated to combating disinformation in multimedia through the means of AI, by fostering an environment for exploring innovative ideas and sharing experiences. The research areas of interest encompass the identification of manipulated or generated content, along with the investigation of the dissemination of disinformation and its societal repercussions. Recognizing the significance of multimedia, the workshop emphasizes the joint analysis of various modalities within content, as verification can be improved by aggregating multiple forms of content.
Open Access
Conference paper
ACM on Multimedia Retrieval
Evlampios Apostolidis; Konstantinos Tsigos; Spyridon Baxevanakis; Symeon Papadopoulos; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this paper we propose a new framework for evaluating the performance of explanation methods on the decisions of a deepfake detector. This framework assesses the ability of an explanation method to spot the regions of a fake image with the biggest influence on the decision of the deepfake detector, by examining the extent to which these regions can be modified through a set of adversarial attacks, in order to flip the detector’s prediction or reduce its initial prediction; we anticipate a larger drop in deepfake detection accuracy and prediction, for methods that spot these regions more accurately. Based on this framework, we conduct a comparative study using a state-of-the-art model for deepfake detection that has been trained on the FaceForensics++ dataset, and five explanation methods from the literature. The findings of our quantitative and qualitative evaluations document the advanced performance of the LIME explanation method against the other compared ones, and indicate this method as the most appropriate for explaining the decisions of the utilized deepfake detector.
Open Access
Conference paper
ACM on Multimedia Retrieval
Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.
Open Access
Conference paper
Davide Alessandro Coccomini; Fabrizio Falchi; Giorgios Kordopatis-Zilos; Giuseppe Amato; Roberto Caldelli; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; ISTI-CNR;
In this paper, we present MINTIME, a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at: https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection
Open Access
Journal article
N/A
Cathal Gurrin; Fabio Carrara; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Minh-Triet Tran; Nico Hezel; Nicola Messina; Rahel Arnold; Sebastian Lubos; Stefanos Vrochidis; Thao-Nhu Nguyen; Werner Bailer; Xingham Li; Zhixin Ma;
CERTH - Center for Research and Technology Hellas; Charles University; Dublin City University; HTW Berlin; Institute of Information Science and Technologies ISTI-CNR; Joanneum Research; University of Basel; University of Klagenfurt; University of Zurich; Vietnam National University; Wuhan University
This paper conducts a thorough examination of the 12th Video Browser Showdown (VBS) competition, a well-established international benchmarking campaign for interactive video search systems.
The annual VBS competition has witnessed a steep rise in the popularity of multimodal embedding-based approaches in interactive video retrieval. Most of the thirteen systems participating in VBS 2023 utilized a CLIP-based cross-modal search model, allowing the specification of free-form text queries to search visual content. This shared emphasis on joint embedding models contributed to balanced performance across various teams. However, the distinguishing factors of the top-performing teams included the adept combination of multiple models and search modes, along with the capabilities of interactive interfaces to facilitate and refine the search process.
Our work provides an overview of the state-of-the-art approaches employed by the participating systems and conducts a thorough analysis of their search logs, which record user interactions and results of their queries for each task. Our comprehensive examination of the VBS competition offers assessments of the effectiveness of the retrieval models, browsing efficiency, and user query patterns. Additionally, it provides valuable insights into the evolving landscape of interactive video retrieval and its future challenges.
Open Access
Journal article
IEEE Access
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
Being able to express broad families of equivariant or invariant attributed graph functions is a popular measuring stick of whether graph neural networks should be employed in practical applications. However, it is equally important to find deep local minima of losses (i.e., produce outputs with much smaller loss values compared to other minima), even when architectures cannot express global minima. In this work we introduce the architectural property of attracting optimization trajectories to local minima as a means of achieving smaller loss values. We take first steps in satisfying this property for losses defined over attributed undirected unweighted graphs with an architecture called universal local attractor (ULA). This refines each dimension of end-to-end-trained node feature embeddings based on graph structure to track the optimization trajectories of losses satisfying some mild conditions. The refined dimensions are then linearly pooled to create predictions. We experiment on 11 tasks, from node classification to clique detection, on which ULA is comparable with or outperforms popular alternatives of similar or greater theoretical expressive power.
Open Access
Publication
N/A
Albin Soutif-Cormerais; Andrew Bagdanov; Joost van de Weijer; Simone Magistri; Tommaso Trinci
Computer Vision Center; University of Florence;
Exemplar-Free Class Incremental Learning (EFCIL) aims to learn from a sequence of tasks without having access to previous task data. In this paper, we consider the challenging Cold Start scenario in which insufficient data is available in the first task to learn a high-quality backbone. This is especially challenging for EFCIL since it requires high plasticity, which results in feature drift which is difficult to compensate for in the exemplar-free setting. To address this problem, we propose a simple and effective approach that consolidates feature representations by regularizing drift in directions highly relevant to previous tasks and employs prototypes to reduce task-recency bias. Our method, called Elastic Feature Consolidation (EFC), exploits a tractable second-order approximation of feature drift based on an Empirical Feature Matrix (EFM). The EFM induces a pseudo-metric in feature space which we use to regularize feature drift in important directions and to update Gaussian prototypes used in a novel asymmetric cross entropy loss which effectively balances prototype rehearsal with data from new tasks. Experimental results on CIFAR-100, Tiny-ImageNet, ImageNet-Subset and ImageNet-1K demonstrate that Elastic Feature Consolidation is better able to learn new tasks by maintaining model plasticity and significantly outperforms the state-of-the-art.
Open Access
Conference paper
N/A
Antonios Liapis; Georgios N. Yannakakis; Marvin Zammit;
University of Malta
The recent advances in language-based generative models have paved the way for the orchestration of multiple generators of different artefact types (text, image, audio, etc.) into one system. Presently, many open-source pre-trained models combine text with other modalities, thus enabling shared vector embeddings to be compared across different generators. Within this context we propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution. Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored for multimodal creative tasks and leverages deep learned models that assess coherence across modalities. MEliTA decouples the artefacts’ modalities and promotes cross-pollination between elites. As a test bed for this algorithm, we generate text descriptions and cover images for a hypothetical video game and assign each artefact a unique modality-specific behavioural characteristic. Results indicate that MEliTA can improve text-to-image mappings within the solution space, compared to a baseline MAP-Elites algorithm that strictly treats each image-text pair as one solution. Our approach represents a significant step forward in multimodal bottom-up orchestration and lays the groundwork for more complex systems coordinating multimodal creative agents in the future.
Open Access
Conference paper
N/A
Hannes Fassold;
Joanneum Research;
Deploying Large Language Models (LLMs) on mobile devices makes all the capabilities of natural language processing available on the device. An important use case of LLMs is question answering, which can provide accurate and contextually relevant answers to a wide array of user queries. We describe how we managed to port state-of-the-art LLMs to mobile devices, enabling them to operate natively on the device. We employ the llama.cpp framework, a flexible and self-contained C++ framework for LLM inference. We selected a 6-bit quantized version of the Orca-Mini-3B model with 3 billion parameters and present the correct prompt format for this model. Experimental results show that LLM inference runs in interactive speed on a Galaxy S21 smartphone and that the model delivers high-quality answers to user queries related to questions from different subjects like politics, geography or history.
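For illustration, a desktop-side sketch using the llama-cpp-python bindings to run a quantized GGUF model is given below; the model file name and the prompt template are assumptions for demonstration, whereas the paper documents the exact Orca-Mini prompt format and the on-device port.

```python
# Run a quantized model through llama.cpp's Python bindings (desktop illustration;
# the paper targets a native on-device build). File name and template are assumed.
from llama_cpp import Llama

llm = Llama(model_path="orca-mini-3b.q6_k.gguf", n_ctx=2048)  # hypothetical local file

prompt = (
    "### System:\nYou are a helpful assistant.\n\n"
    "### User:\nWhat is the capital of Austria?\n\n"
    "### Response:\n"
)
out = llm(prompt, max_tokens=128, temperature=0.2)
print(out["choices"][0]["text"].strip())
```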
Open Access
Conference paper
N/A
Connor Richard; Lucia Vadicamo;
ISTI-CNR; University of St. Andrews
Dimensionality reduction techniques map values from a high dimensional space to one with a lower dimension. The result is a space which requires less physical memory and has a faster distance calculation. These techniques are widely used where required properties of the reduced-dimension space give an acceptable accuracy with respect to the original space. Many such transforms have been described. They have been classified in two main groups: linear and topological. Linear methods such as Principal Component Analysis (PCA) and Random Projection (RP) define matrix-based transforms into a lower dimension of Euclidean space. Topological methods such as Multidimensional Scaling (MDS) attempt to preserve higher-level aspects such as the nearest-neighbour relation, and some may be applied to non-Euclidean spaces. Here, we introduce nSimplex Zen, a novel topological method of reducing dimensionality. Like MDS, it relies only upon pairwise distances measured in the original space. The use of distances, rather than coordinates, allows the technique to be applied to both Euclidean and other Hilbert spaces, including those governed by Cosine, Jensen–Shannon and Quadratic Form distances. We show that in almost all cases, due to geometric properties of high-dimensional spaces, our new technique gives better properties than others, especially with reduction to very low dimensions.
Open Access
Journal article
ACM Transactions on Knowledge Discovery from Data
Bogdan Ionescu; Mihai Gabriel Constantin
University Politehnica of Bucharest
Video memorability is one of the vital aspects of subjective multimedia perception and, as such, is closely and thoroughly studied in the computer vision literature. This paper presents the methods proposed by AIMultimediaLab for the generalization subtask of the 2023 edition of the Predicting Video Memorability task. We explore several methods for augmenting the training process for a video Vision Transformer network, aiming to increase the number of hard-to-predict samples in the training set in order to increase the robustness of the targeted AI model. Starting from our previous works, we analyze several visual features that define “hard-to-predict” samples, and based on these features, we augment the training data of our models to target those specific videos that pose problems for memorability prediction.
Open Access
Conference paper
MediaEval
Claudio Vairo; Fabio Carrara; Jakub Lokoč; Kai Uwe Barthel; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Lucia Vadicamo; Werner Bailer;
HTW Berlin; ISTI-CNR; Joanneum Research; University of Klagenfurt
CLIP-based text-to-image retrieval has proven to be very effective at the interactive video retrieval competition Video Browser Showdown 2022, where all three top-scoring teams had implemented a variant of a CLIP model in their system. Since the performance of these three systems was quite close, this post-evaluation was designed to get better insights on the differences of the systems and compare the CLIP-based text-query retrieval engines by introducing slight modifications to the original competition settings. An extended analysis of the overall results and the retrieval performance of all systems’ functionalities shows that a strong text retrieval model certainly helps, but has to be coupled with extensive browsing capabilities and other query-modalities to consistently solve known-item-search tasks in a large scale video database.
Open Access
Journal article
International Journal of Multimedia Information Retrieval
Alejandro Moreo; Fabrizio Sebastiani; Pablo González
ISTI-CNR; University of Oviedo
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to deal with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift
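For illustration, a minimal sketch of two baseline quantification methods commonly discussed in this line of work, Classify & Count and its adjusted variant, is given below; the paper's own experimental code is in the linked repository.

```python
# Binary quantification baselines: Classify & Count (CC) and Adjusted Classify & Count
# (ACC), the latter correcting CC with the classifier's validation TPR/FPR.
import numpy as np

def classify_and_count(pred_labels: np.ndarray) -> float:
    """Estimated positive prevalence = fraction of items classified as positive."""
    return float(pred_labels.mean())

def adjusted_classify_and_count(pred_labels: np.ndarray, tpr: float, fpr: float) -> float:
    """Correct CC using true/false positive rates estimated on held-out labelled data."""
    cc = classify_and_count(pred_labels)
    denom = max(tpr - fpr, 1e-8)
    return float(np.clip((cc - fpr) / denom, 0.0, 1.0))
```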
Open Access
Journal article
Data Mining and Knowledge Discovery
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London; University of Nottingham
In this paper, we present our framework for neural face/head reenactment whose goal is to transfer the 3D head orientation and expression of a target face to a source face. Previous methods focus on learning embedding networks for identity and head pose/expression disentanglement which proves to be a rather hard task, degrading the quality of the generated images. We take a different approach, bypassing the training of such networks, by using (fine-tuned) pre-trained GANs which have been shown capable of producing high-quality facial images. Because GANs are characterized by weak controllability, the core of our approach is a method to discover which directions in latent GAN space are responsible for controlling head pose and expression variations. We present a simple pipeline to learn such directions with the aid of a 3D shape model which, by construction, inherently captures disentangled directions for head pose, identity, and expression. Moreover, we show that by embedding real images in the GAN latent space, our method can be successfully used for the reenactment of real-world faces. Our method features several favorable properties including using a single source image (one-shot) and enabling cross-person reenactment. Extensive qualitative and quantitative results show that our approach typically produces reenacted faces of notably higher quality than those produced by state-of-the-art methods for the standard benchmarks of VoxCeleb1 & 2.
Open Access
Journal article
Ioannis Patras; Zheng Gao;
Queen Mary University of London;
Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image-level, which overlooks the consistency of local facial representations (i.e., facial regions like eyes, nose, etc). In this work, we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations, Facial Region Awareness (FRA). Specifically, we explicitly enforce the consistency of facial regions by matching the local facial representations across views, which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings, which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps, we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly, using ResNet as the unified backbone for various tasks, our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks.
Open Access
Conference paper
IEEE Conference on Computer Vision and Pattern Recognition
Aleksandra Kuczerawy; Lidia Dutkiewicz; Noémie Krack; Peggy Valcke
KU Leuven
This chapter discusses how AI technologies permeate the media sector. It sketches opportunities and benefits of the use of AI in media content gathering and production, in media content distribution, in fact-checking and content moderation. The chapter then zooms in on ethical and legal risks raised by AI-driven media applications: lack of data availability, poor data quality and bias in training datasets, lack of transparency, risks for the right to freedom of expression, threats to media freedom and pluralism online, and threats to media independence. Finally, the chapter introduces the relevant elements of the EU legal framework which aim to mitigate these risks, such as the Digital Services Act, the European Media Freedom Act proposal and the AI Act proposal.
Open Access
Book section
Cambridge Handbook on the Law, Ethics and Policy of Artificial Intelligence
Konstantinos Gkrispanis; Nikolaos Gkalelis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there’s a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven’t been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
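As a rough illustration of the pruning criterion named above, the sketch below scores each convolutional filter by its total distance to the other filters in the layer (filters closest to the geometric median are treated as most redundant) and zeroes the selected ones in the spirit of Soft Filter Pruning. The pruning ratio and the use of plain Euclidean distances are assumptions made for the example, not the paper's exact procedure.

```python
# Illustrative FPGM-style soft pruning sketch (our reading, not the paper's code).
import torch

def fpgm_soft_prune(conv_weight: torch.Tensor, prune_ratio: float = 0.3) -> torch.Tensor:
    """conv_weight: (out_channels, in_channels, kH, kW). Returns a soft-pruned copy."""
    n_filters = conv_weight.shape[0]
    flat = conv_weight.reshape(n_filters, -1)
    # Sum of distances to all other filters; the smallest sums identify the
    # filters nearest to the geometric median, i.e. the most "replaceable" ones.
    dists = torch.cdist(flat, flat).sum(dim=1)
    n_prune = int(n_filters * prune_ratio)
    prune_idx = torch.argsort(dists)[:n_prune]
    pruned = conv_weight.clone()
    pruned[prune_idx] = 0.0   # soft pruning: zeroed now, still trainable in later epochs
    return pruned

# Toy usage on a random 64-filter layer.
w = torch.randn(64, 32, 3, 3)
w_pruned = fpgm_soft_prune(w, prune_ratio=0.25)
```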
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Evlampios Apostolidis; Konstantinos Apostolidis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a “one-click” video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video’s length and aspect ratio.
Open Access
Conference paper
Conference on Multimedia Modeling
Evlampios Apostolidis; Ioannis Kontostathis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degrees video based on the use of static or moving camera during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible for creating a conventional 2D video containing the salient events in the 360-degrees video. Quantitative evaluations using two datasets for 360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets gives further insights about the functionality of the decision mechanism, shows the pros and cons of each used saliency detection method and demonstrates the advanced performance of the trained summarization method against a more conventional approach.
Open Access
Conference paper
Conference on Multimedia Modeling
Bogdan Ionescu; Hannes Fassold; Mihai Dogariu; Werner Bailer;
Joanneum Research; University Politehnica of Bucharest
Open Access
Conference paper
Conference on Multimedia Modeling
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, drawing objects and colors appearing in the target scenes in a canvas, or images as query examples to search for video keyframes with similar content.
Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, showcases a refined user interface that enhances its usability and updated AI models for more effective video content analysis.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari;
Università degli Studi di Firenze; University of Florence;
Effective modeling of human interactions is of utmost importance when forecasting behaviors such as future trajectories. Each individual, with its motion, influences surrounding agents since everyone obeys unwritten social rules such as collision avoidance or group following. In this paper we model such interactions, which constantly evolve through time, by looking at the problem from an algorithmic point of view, i.e., as a data manipulation task. We present a neural network based on an end-to-end trainable working memory, which acts as an external storage where information about each agent can be continuously written, updated and recalled. We show that our method is capable of learning explainable cause-effect relationships between motions of different agents, obtaining state-of-the-art results on multiple trajectory forecasting datasets.
Open Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Andrea Esuli; Davide Alessandro Coccomini; Fabrizio Falchi; Nicola Messina;
Institute of Information Science and Technologies
With the increasing importance of multimedia and multilingual data in online encyclopedias, novel methods are needed to fill domain gaps and automatically connect different modalities for increased accessibility. For example, Wikipedia is composed of millions of pages written in multiple languages. Images, when present, often lack textual context, thus remaining conceptually floating and harder to find and manage.
In this work, we tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two models powered by the recent Transformer networks able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experiments that the proposed cascaded approach effectively handles a large pool of images and captions while keeping the overall computational complexity bounded at inference time.
With respect to other approaches in the challenge leaderboard, we can achieve remarkable improvements over the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained resources.
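The cascaded scoring described above can be pictured as a generic two-stage ranking loop: a cheap scorer shortlists captions from the full pool, and a more expensive scorer re-ranks only the shortlist. The function names, scorers and list sizes below are placeholders, not the authors' models.

```python
# Conceptual two-stage cascade for image-caption matching (placeholder scorers).
from typing import Callable, List, Tuple

def cascade_rank(image,
                 captions: List[str],
                 fast_score: Callable[[object, str], float],
                 slow_score: Callable[[object, str], float],
                 shortlist: int = 50,
                 top_k: int = 5) -> List[Tuple[str, float]]:
    # Stage 1: cheap relevance scores over the full caption pool.
    coarse = sorted(captions, key=lambda c: fast_score(image, c), reverse=True)[:shortlist]
    # Stage 2: expensive cross-modal scoring only on the shortlist.
    reranked = sorted(((c, slow_score(image, c)) for c in coarse),
                      key=lambda x: x[1], reverse=True)
    return reranked[:top_k]

# Toy usage with dummy scorers, only to show the control flow.
caps = ["a cat", "a dog on grass", "mountain at dusk"]
out = cascade_rank("img", caps, fast_score=lambda i, c: len(c), slow_score=lambda i, c: -len(c))
```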
Open Access
Journal article
Multimedia Tools and Applications
Adrian Popescu; Bertrand Delezoide; Céline Hudelot; David Picard; Eva Feillet; Grégoire Petit; Michael Soumm
CEA; Université Gustave Eiffel; Université Paris-Saclay;
Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum.
The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning.
We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting.
Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Davide Pucci; Federico Becattini;
Università degli Studi di Firenze; University of Florence;
Action understanding is a fundamental computer vision branch for several applications, ranging from surveillance to robotics. Most works deal with localizing and recognizing the action in both time and space, without providing a characterization of its evolution. Recent works have addressed the prediction of action progress, which is an estimate of how far the action has advanced as it is performed. In this paper, we propose to predict action progress using a different modality compared to previous methods: body joints. Human body joints carry very precise information about human poses, which we believe are a much more lightweight and effective way of characterizing actions and therefore their execution. Action progress can in fact be estimated based on an understanding of how key poses follow each other during the development of an activity. We show how an action progress prediction model can exploit body joints and integrate it with modules providing keypoint and action information in order to be run directly from raw pixels. The proposed method is experimentally validated on the Penn Action Dataset.
Open Access
Journal article
MDPI
Alberto Messina; Angelo Bruccoleri; Fulvio Negro; Maurizio Montagnuolo; Roberto Iacoviello;
RAI;
Knowledge about the presence of people in a video is a valuable source of information in many applications, such as video annotation, retrieval and summarisation. The contribution of this paper goes in the direction of demonstrating how AI-based face processing technologies can be profitably used to perform video annotation of television content. To validate our vision, we developed the Face Management Framework (FMF), which implements an end-to-end pipeline for face analysis and content annotation based on few-shot or zero-shot face embedding extraction models. The results of the test campaign of the system show that the key performance indicators that we defined were exceeded by a wide margin, demonstrating how media workflows could greatly benefit from the tool and the efficiency improvements it brings.
Open Access
Conference paper
International Conference on Big Data
Alberto Del Bimbo; Andrea Ciamarra; Federico Becattini; Lorenzo Seidenari; Roberto Caldelli;
Mercatorum University; University of Florence;
The ever-increasing use of synthetically generated content in different sectors of our everyday life, media information above all, poses a strong need for deepfake detection tools in order to avoid the proliferation of altered messages. The process to identify manipulated content, in particular images and videos, is basically performed by looking for the presence of some inconsistencies and/or anomalies specifically due to the fake generation process. Different techniques exist in the scientific literature that exploit diverse ad-hoc features in order to highlight possible modifications. In this paper, we propose to investigate how deepfake creation can impact the characteristics that the whole scene had at the time of the acquisition. In particular, when an image (video) is captured, the overall geometry of the scene (e.g. surfaces) and the acquisition process (e.g. illumination) determine a univocal environment that is directly represented by the image pixel values; all these intrinsic relations are possibly changed by the deepfake generation process. By resorting to the analysis of the characteristics of the surfaces depicted in the image it is possible to obtain a descriptor usable to train a CNN for deepfake detection: we refer to such an approach as SurFake. Experimental results carried out on the FF++ dataset for different kinds of deepfake forgeries and diverse deep learning models confirm that such a feature can be adopted to discriminate between pristine and altered images; furthermore, experiments witness that it can also be combined with visual data to provide a certain improvement in terms of detection accuracy.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera; Pietro Pala
Università degli Studi di Firenze; University of Florence;
Conditional Imitation learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion where the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent in a previously unseen state. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this article we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle along with the representation of the environment as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle and visually explaining the model’s decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.
Open Access
Journal article
IEEE Transactions on Intelligent Vehicles
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
This paper presents a revised version of the VISIONE video retrieval system, which offers a wide range of search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system is designed to ensure scalability using advanced indexing techniques and effectiveness using cutting-edge Artificial Intelligence technology for visual content analysis. VISIONE was the runner-up in the 2023 Video Browser Showdown competition, demonstrating its comprehensive video retrieval capabilities. In this paper, we detail the improvements made to the search and browsing interface to enhance its usability for non-expert users.
A demonstration video of our system with the restyled interface, showcasing its capabilities on over 2,300 hours of diverse video content, is available online at https://youtu.be/srD3TCUkMSg.
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Andrei Cosmin Jitaru; Bogdan Ionescu; Mihai Dogariu; Mihai Gabriel Constantin
University Politehnica of Bucharest
Memorability is a critical aspect of human cognition that has been studied extensively in various fields, including psychology, education, and computer vision. The ability to remember information and experiences over time is essential for learning, decision-making, and creating lasting impressions. While the number of computer vision works that attempt to predict the memorability score of videos has recently seen a significant boost, thanks to several benchmarking tasks and datasets, some questions related to the performance of automated systems on certain types of videos are still largely unexplored. Given this, we are interested in discerning what makes a video sample easy or hard to classify or predict from a memorability standpoint. In this paper, we use a large set of runs, created and submitted by the participants to the MediaEval Predicting Video Memorability task, and, using their results and a set of visual, object, and annotator-based features and analyses, we attempt to find and define common traits that make the memorability scores of videos hard or easy to predict.
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;
IBM Research;
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the unlearning discipline where models are modified to "unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demonstrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for an ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework called FairSISA.
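To fix ideas on the setting discussed above, the following toy sketch aggregates the scores of models trained on disjoint SISA-style shards and then applies a per-group decision threshold as a post-processing step, loosely in the spirit of Hardt et al. (2016). It is a hypothetical illustration, not the FairSISA implementation or its optimal fair predictor.

```python
# Hypothetical sketch: shard-ensemble scoring plus group-wise threshold post-processing.
import numpy as np

def sisa_ensemble_scores(shard_models, x):
    """Average the scores of models trained on disjoint shards."""
    return np.mean([m(x) for m in shard_models], axis=0)

def groupwise_threshold(scores, groups, thresholds):
    """Apply a per-group threshold to the ensemble scores (post-processing step)."""
    return np.array([scores[i] >= thresholds[groups[i]] for i in range(len(scores))])

# Toy usage: two "models" (callables) and two demographic groups.
models = [lambda x: x * 0.9, lambda x: x * 1.1]
x = np.array([0.2, 0.6, 0.8, 0.4])
groups = np.array([0, 0, 1, 1])
scores = sisa_ensemble_scores(models, x)
preds = groupwise_threshold(scores, groups, thresholds={0: 0.5, 1: 0.45})
```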
Open Access
Conference paper
Socially Responsible Language Modelling Research
Elena Cabrio; Mariana Chaves; Pierpaolo Goffredo; Serena Villata
CNRS; Inria; Université Côte d'Azur;
Fallacies are arguments that employ faulty reasoning. Given their persuasive and seemingly valid nature, fallacious arguments are often used in political debates. Employing these misleading arguments in politics can have detrimental consequences for society, since they can lead to inaccurate conclusions and invalid inferences from the public opinion and the policymakers. Automatically detecting and classifying fallacious arguments represents therefore a crucial challenge to limit the spread of misleading or manipulative claims and promote a more informed and healthier political discourse. Our contribution to address this challenging task is twofold. First, we extend the ElecDeb60To16 dataset of U.S. presidential debates annotated with fallacious arguments, by incorporating the most recent Trump-Biden presidential debate. We include updated token level annotations, incorporating argumentative components (i.e., claims and premises), the relations between these components (i.e., support and attack), and six categories of fallacious arguments (i.e., Ad Hominem, Appeal to Authority, Appeal to Emotion, False Cause, Slippery Slope, and Slogans). Second, we perform the twofold task of fallacious argument detection and classification by defining neural network architectures based on Transformers models, combining text, argumentative features, and engineered features. Our results show the advantages of complementing transformer-generated text representations with non-textual features.
Open Access
Conference paper
Conference on Empirical Methods in Natural Language Processing (EMNLP)
Luca Cuccovillo; Milica Gerhardt; Patrick Aichroth;
Fraunhofer IDMT;
In this study we propose a novel approach to audio phylogeny, i.e. the detection of relationships and transformations within a set of near-duplicate audio items, by leveraging a deep neural network for efficiency and extensibility. Unlike existing methods, our approach detects transformations between nodes in one step, and the transformation set can be expanded by retraining the neural network without excessive computational costs. We evaluated our method against the state of the art using a self-created and publicly released dataset, observing a superior performance in reconstructing phylogenetic trees and heightened transformation detection accuracy. Moreover, the ability to detect a wide range of transformations and to extend the transformation set make the approach suitable for various applications.
Open Access
Conference paper
IEEE International Workshop of Information Forensics and Security
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
The News and Media landscape has undergone significant transformations in recent years, driven by the rise of new technologies and the widespread use of social media. This evolution introduces unique challenges for professionals working within this environment (e.g., journalists, content creators, and news authors), with a major one being the efficient sourcing of images that complement article content. In response to this challenge, we developed VIREO, a tool that recommends images based on textual content. In this paper, we take a step toward assessing the practical effectiveness of VIREO's core models in recommending images for real-world articles, with a specific focus on image recommendation efficiency. Our results indicate that VIREO offers a promising solution for professionals seeking to meet the evolving demands of the News and Media landscape while maintaining content quality and engagement.
Open Access
Conference paper
International Conference on Computer and Applications
Angelo Canale; Fabrizio Falchi; Giovanni Benelli; Giuseppe Amato; Luca Ciampi; Luca Incrocci; Stefano Chessa; Valeria Zeni;
ISTI-CNR; University of Pisa
Integrated Pest Management (IPM) is an essential approach used in smart agriculture to manage pest populations and sustainably optimize crop production. One of the cornerstones underlying IPM solutions is pest monitoring, a practice often performed by farm owners by using chromotropic sticky traps placed on insect hot spots to gauge pest population densities. In this paper, we propose a modular, model-agnostic, deep learning-based counting pipeline for estimating the number of insects present in pictures of chromotropic sticky traps, thus reducing the need for manual trap inspections and minimizing human effort. Additionally, our solution generates a set of raw positions of the counted insects and confidence scores expressing their reliability, allowing practitioners to filter out unreliable predictions. We train and assess our technique by exploiting PST – Pest Sticky Traps, a new collection of dot-annotated images that we created for this purpose and publicly release, suitable for counting whiteflies. Experimental evaluation shows that our proposed counting strategy can be a valuable Artificial Intelligence-based tool to help farm owners control pest outbreaks and prevent crop damage effectively. Specifically, our solution achieves an average counting error of approximately 9% with respect to human counts while requiring only a matter of seconds, a large improvement over the time-intensive process of manual human inspection, which often takes hours or even days.
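The confidence-based filtering mentioned above can be illustrated in a few lines of code: the counting model is assumed to return candidate positions with confidence scores, and detections below a threshold are dropped before counting. The data structures and threshold value are assumptions made for the example.

```python
# Illustrative confidence filtering for a detection-and-count pipeline.
from typing import List, Tuple

def count_insects(detections: List[Tuple[float, float, float]],
                  min_confidence: float = 0.5) -> Tuple[int, List[Tuple[float, float]]]:
    """detections: list of (x, y, confidence). Returns the count and kept positions."""
    kept = [(x, y) for (x, y, conf) in detections if conf >= min_confidence]
    return len(kept), kept

# Toy usage: three candidate detections, one below the threshold.
raw = [(10.0, 22.5, 0.93), (48.1, 17.0, 0.41), (75.3, 60.2, 0.88)]
n, positions = count_insects(raw, min_confidence=0.5)   # n == 2
```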
Open Access
Journal article
Ecological Informatics
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CEA; Conservatoire National des Arts et Métiers;
With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performance. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires generating images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images, thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and well-performing few-shot counting models on FSC147 and CARPK.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Patrick Aichroth; Thomas Köllmer; Zühal Kurt
Atilim University; Fraunhofer IDMT;
The paper outlines an explainable knowledge graph-based recommendation system that aims to provide personalized news recommendations and tries to explain why an item is recommended to a particular user. The system leverages a knowledge graph (KG) that models the relationships between items and users' preferences, as well as external knowledge sources such as item features and user profiles. The main objectives of this study are to train a recommendation model that can predict whether a user will click on a news article or not, and then to obtain explainable recommendations for the same purpose. This is achieved in three steps: first, a KG of the MIND dataset is generated based on the users' histories, their click information, and the category and subcategory of the news; second, path-reasoning approaches are utilized to derive explainable paths to the recommended news items; third, the proposed KG-based model is evaluated on the MIND news datasets. Experiments have been conducted using the MIND-demo and MIND-small datasets, which are open-source English news datasets intended for public research. Experimental results indicate that the proposed approach performs better in terms of recommendation explainability, making it a promising basis for developing transparent and interpretable recommendation systems.
Open Access
Conference paper
Conference on Knowledge Discovery
Alberto Messina; Stefano Scotta;
RAI;
In this work, we present an example of how a relatively small Large Language Model (LLM), fine-tuned to perform a simple and well-defined task (assigning titles to news articles), can perform similarly to or even better than huge LLMs which are created to respond to any question. This approach of specializing smaller LLMs on simpler tasks is also interesting because it goes in the direction of making this technology more sustainable and available to a larger number of entities that usually could not use these expensive models, for both economic and data-policy reasons. We also present a couple of examples of how the performance of LLMs can be evaluated when the task is specified as in the example presented in this work.
Open Access
Conference paper
International Conference of the Italian Association for Artificial Intelligence
Albert Gatt; Andrea Pedrotti; Anette Frank; Aykut Erdem; Emre Can Acikgoz; Erkut Erdem; Iacer Calixto; Ilker Kesen; Leticia Parcalabescu; Michele Cafagna; Mustafa Dogan
Hacettepe University; Heidelberg University; Institute of Information Science and Technologies; Koç University; University of Amsterdam; University of Malta; University of Pisa; Utrecht University;
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs’ grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
Open Access
Conference paper
Alberto Del Bimbo; Hondamunige Prasanna Silva; Lorenzo Seidenari;
University of Florence;
This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.
Open Access
Conference paper
N/A
Marius Gavrilescu;
Technical University of Iasi
The identification of important structures from volume data is a challenging problem in information visualization due to the complexity and amount of detail found in volume data sets. In particular, medical imaging devices generate scans which contain a significant amount of important anatomical structures, some of which are hidden, occluded or otherwise difficult to highlight. Conventional density and gradient-based classification methods fail to uncover such structures, thereby creating the necessity for more elaborate visualization methods and the involvement of multiple visual criteria in order to generate quality representations of the volume data. We propose a volume visualization approach which extends the conventional rendering pipeline by incorporating visibility-based quality criteria into the color and opacity mapping process. Our method consists in using two stacked transfer functions which handle visual mappings: one based on the density domain of the data set, and the other on a custom metric which quantifies the visibility of volumetric structures. We show that this arrangement allows the generation of improved representations of meaningful hidden structures from medical CT data, while constituting a reliable means of identifying volumetric details not representable using traditional approaches.
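A minimal sketch of the stacked-transfer-function idea described above, under the assumption that both the density value and the visibility metric are available per voxel in [0, 1]: the final opacity is the density-based opacity modulated by a second mapping driven by visibility. The specific transfer functions below are placeholders, not the paper's mappings.

```python
# Illustrative stacking of a density-based and a visibility-based transfer function.
import numpy as np

def stacked_opacity(density: np.ndarray,
                    visibility: np.ndarray,
                    density_tf,
                    visibility_tf) -> np.ndarray:
    """density, visibility: arrays of per-voxel values in [0, 1]."""
    alpha_density = density_tf(density)            # classic density-domain transfer function
    alpha_visibility = visibility_tf(visibility)   # boosts structures the metric deems hidden
    return np.clip(alpha_density * alpha_visibility, 0.0, 1.0)

# Toy usage: a ramp on density, a boost for low-visibility voxels.
density = np.random.rand(8, 8, 8)
visibility = np.random.rand(8, 8, 8)
alpha = stacked_opacity(density, visibility,
                        density_tf=lambda d: np.clip((d - 0.3) / 0.4, 0.0, 1.0),
                        visibility_tf=lambda v: 0.5 + 0.5 * (1.0 - v))
```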
Open Access
Conference paper
E-Health and Bioengineering Conference 2023
Evlampios Apostolidis; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
In this paper we present our study on the use of attention for explaining video summarization. We build on a recent work that formulates the task, called XAI-SUM, and we extend it by: a) taking into account two additional network architectures and b) introducing two novel explanation signals that relate to the entropy and diversity of attention weights. In total, we examine the effectiveness of seven types of explanation, using three state-of-the-art attention-based network architectures (CA-SUM, VASNet, SUM-GDA) and two datasets (SumMe, TVSum) for video summarization. The conducted evaluations show that the inherent attention weights are more suitable for explaining network architectures which integrate mechanisms for estimating attentive diversity (SUM-GDA) and uniqueness (CA-SUM). The explanation of simpler architectures (VASNet) can benefit from taking into account estimates about the strength of the input vectors, while another option is to consider the entropy of attention weights.
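As an illustration of the two additional explanation signals, the snippet below computes a per-frame entropy and a simple diversity measure from a row-stochastic attention matrix; the exact formulations used in the paper may differ, so this is only an assumption-labelled sketch.

```python
# Illustrative entropy and diversity signals from a (T, T) attention matrix.
import numpy as np

def attention_entropy(attn: np.ndarray) -> np.ndarray:
    """attn: (T, T) row-stochastic attention weights. Returns entropy per frame (row)."""
    p = np.clip(attn, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def attention_diversity(attn: np.ndarray) -> np.ndarray:
    """Average cosine distance of each frame's attention row from all other rows."""
    norm = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    cos = norm @ norm.T
    T = attn.shape[0]
    return (1.0 - cos).sum(axis=1) / (T - 1)

# Toy usage on a random row-stochastic matrix for 5 frames.
a = np.random.rand(5, 5)
a = a / a.sum(axis=1, keepdims=True)
entropy, diversity = attention_entropy(a), attention_diversity(a)
```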
Open Access
Conference paper
ACM Multimedia
Alberto Del Bimbo; Lorenzo Berlincioni; Marco Bertini; Stefano Berretti
Università degli Studi di Firenze; University of Florence;
Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (personal avatar representation, LiDAR in autonomous or assisted driving). In many cases, such a volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. Our model consists of a specifically designed Graph Convolutional Network that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. We present a different way to sample dense point clouds with the intent to make these modules work in synergy to provide each node with enough features about its neighbourhood in order to later on generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model is capable of obtaining comparable results in terms of quality of the reconstruction, while using a substantially lower number of parameters (≈ 300 KB), making our solution deployable in edge computing devices.
Open Access
Conference paper
N/A
Artem Yaroshchuk; Christoforos Papastergiopoulos; Dimitrios Tzovaras; Konstantinos Votis; Luca Cuccovillo; Patrick Aichroth;
CERTH - Center for Research and Technology Hellas; Fraunhofer IDMT;
This paper introduces a multilingual, multispeaker dataset composed of synthetic and natural speech, designed to foster research and benchmarking in synthetic speech detection. The dataset encompasses 18,993 audio utterances synthesized from text, along with their corresponding natural equivalents, representing approximately 17 hours of synthetic audio data. The dataset features synthetic speech generated by 156 voices spanning three languages, namely English, German, and Spanish, with a balanced gender representation. It targets state-of-the-art synthesis methods, and has been released with a license allowing seamless extension and redistribution by the research community.
Open Access
Conference paper
IEEE International Workshop of Information Forensics and Security
Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
The rapid development of deep learning and artificial intelligence has transformed our approach to solving scientific problems across various domains, including computer vision, natural language processing, and automatic content generation. Information retrieval (IR) has also experienced significant advancements, with natural language understanding and multimodal content analysis enabling accurate information retrieval. However, the widespread adoption of neural networks has also influenced the focus of IR problem-solving, which nowadays predominantly relies on evaluating the similarity of dense vectors derived from the latent spaces of deep neural networks. Nevertheless, the challenges of conducting similarity searches on large-scale databases with billions of vectors persist. Traditional IR approaches use inverted indices and vector space models, which work well with sparse vectors. In this paper, we propose Vec2Doc, a novel method that converts dense vectors into sparse integer vectors, allowing for the use of inverted indices. Preliminary experimental evaluation shows a promising solution for large-scale vector-based IR problems.
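One way to picture the dense-to-sparse conversion described above is sketched below as a hypothetical surrogate (not necessarily the Vec2Doc algorithm): each embedding dimension is mapped to a signed "term" and its magnitude is quantized to an integer term frequency, so that a conventional inverted index can ingest the result.

```python
# Hypothetical dense-to-sparse conversion: signed terms with integer frequencies.
import numpy as np

def dense_to_sparse_terms(vec: np.ndarray, scale: int = 100) -> dict:
    """Return {term_id: integer weight}, dropping near-zero components."""
    terms = {}
    for i, v in enumerate(vec):
        tf = int(round(abs(v) * scale))
        if tf > 0:
            term_id = 2 * i + (1 if v < 0 else 0)   # separate ids for +/- signs
            terms[term_id] = tf
    return terms

# Toy usage:
emb = np.array([0.31, -0.02, 0.0, -0.77])
doc = dense_to_sparse_terms(emb)   # {0: 31, 3: 2, 7: 77}
```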
Open Access
Conference paper
International Conference on Similarity Search and Applications
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
This chapter focuses on explainable video summarization, a technology that could significantly advance the content production workflow of Media organizations. It starts by presenting the current state of the art in the fields of deep-learning-based video summarization and explainable video analysis and understanding. Following, it focuses on video summarization methods that rely on the use of attention mechanisms and reports on previous works that investigated the use of attention for explaining the outcomes of deep neural networks. Subsequently, it briefly describes a state-of-the-art attention-based architecture for unsupervised video summarization and discusses a recent work that examines the use of various attention-based signals for explaining the outcomes of video summarization. Finally, it provides recommendations about future research directions.
Open Access
Book section
Encyclopedia of Information Science and Technology
Daniel Gatica-Perez; Sina Sajadmanesh
Idiap Research Institute
Graph Neural Networks (GNNs) have become a popular tool for learning on graphs, but their widespread use has raised privacy concerns, as graph data can contain personal or sensitive information. Differentially private GNN models have been recently proposed to preserve privacy while still allowing for effective learning over graph-structured datasets. However, achieving an ideal balance between accuracy and privacy in GNNs remains challenging due to the intrinsic structural connectivity of graphs. In this paper, we propose a new differentially private GNN called ProGAP that uses a progressive training scheme to improve such accuracy-privacy trade-offs. Combined with the aggregation perturbation technique to ensure differential privacy, ProGAP splits a GNN into a sequence of overlapping submodels that are trained progressively, expanding from the first submodel to the complete model. Specifically, each submodel is trained over the privately aggregated node embeddings learned and cached by the previous submodels, leading to an increased expressive power compared to previous approaches while limiting the incurred privacy costs. We formally prove that ProGAP ensures edge-level and node-level privacy guarantees for both training and inference stages, and evaluate its performance on benchmark graph datasets. Experimental results demonstrate that ProGAP can achieve up to 5-10% higher accuracy than existing state-of-the-art differentially private GNNs. Our code is available at https://github.com/sisaman/ProGAP.
Open Access
Publication
N/A
Alejandro Moreo; Fabrizio Sebastiani; Martin Senz; Mirko Bunse;
ISTI-CNR; University of Dortmund;
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We make three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other's developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
Open Access
Publication
Data Mining and Knowledge Discovery
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Florin Leon; Marius Gavrilescu; Sabina-Adriana Floria;
Technical University of Iasi
Representing relevant information from volume data sets is a problem often faced in visualization. Generating meaningful images from highly-complex volume data sets is a challenging, tedious task requiring specialized knowledge of the distribution and properties of the data. Traditionally, this task has been carried out manually via specialized user interfaces. We propose a volume visualization pipeline which facilitates the automatic generation of high-quality images from volume data sets. Our method involves a direct volume renderer which generates images from volume data based on visual mappings provided by a transfer function. Central to our approach is a quality-focused descriptor which exploits the properties of the distribution of gradient orientations of an alpha-bounded surface within the volume. This feature is useful for determining transfer functions that result in the rendering of corresponding images depicting various details from the volume. We show that by using this feature as an optimization objective, the generation of high quality images can be automated. Using simple genetic algorithms, we can automatically generate sets of images illustrating coherent, easily-distinguishable and high-quality surfaces of relevant structures from volume data.
Open Access
Conference paper
International Conference on System Theory
Anastasios Gkagkas; Davide Alessandro Coccomini; Gylfi Þór Guðmundsson; Jakub Lokoč; Jiaxin Wu; Nick Pantelidis; Nicola Messina; Rahel Arnold; Silvan Heller; Vera Benz; Werner Bailer;
CERTH - Center for Research and Technology Hellas; Charles University; City University of Hong Kong; ISTI-CNR; Joanneum Research; Reykjavik University; University of Basel;
Different task interpretations are a highly undesired element in interactive video retrieval evaluations. When a participating team focuses partially on a wrong goal, the evaluation results might become partially misleading. In this paper, we propose a process for refining known-item and open-set type queries, and preparing the assessors that judge the correctness of submissions to open-set queries. Our findings from recent years reveal that a proper methodology can lead to objective query quality improvements and subjective participant satisfaction with query clarity.
Open Access
Conference paper
Conference on Multimedia Retrieval
David Renaudie; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet; Mike Thomsen;
Massive Entertainment - Ubisoft; University of Malta
This paper introduces a large scale multimodal corpus collected for the purpose of analysing and predicting player engagement in commercial-standard games. The corpus is solicited from 25 players of the action role-playing game Tom Clancy’s The Division 2, who annotated their level of engagement using a time-continuous annotation tool. The cleaned and processed corpus presented in this paper consists of nearly 20 hours of annotated gameplay videos accompanied by logged gamepad actions. We report preliminary results on predicting long-term player engagement based on in-game footage and game controller actions using Convolutional Neural Network architectures. Results obtained suggest we can predict the player engagement with up to accuracy on average ( at best) when we fuse information from the game footage and the player’s controller input. Our findings validate the hypothesis that long-term (i.e. 1 hour of play) engagement can be predicted efficiently solely from pixels and gamepad actions.
Open Access
Paper
Conference on Multimodal Interaction
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
This paper presents a new reinforcement-based method for video thumbnail selection (called RL-DiVTS), that relies on estimates of the aesthetic quality, representativeness and visual diversity of a small set of selected frames, made with the help of tailored reward functions. The proposed method integrates a novel diversity-aware Frame Picking mechanism that performs a sequential frame selection and applies a reweighting process to demote frames that are visually-similar to the already selected ones. Experiments on two benchmark datasets (OVP and YouTube), using the top-3 matching evaluation protocol, show the competitiveness of RL-DiVTS against other SoA video thumbnail selection and summarization approaches from the literature.
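The diversity-aware picking-and-reweighting step can be sketched as follows: frames are selected greedily by score, and after each pick the scores of visually similar frames are demoted. The similarity measure, penalty and frame descriptors below are illustrative assumptions rather than the RL-DiVTS reward design.

```python
# Illustrative diversity-aware sequential frame picking with score reweighting.
import numpy as np

def pick_diverse_frames(scores: np.ndarray, features: np.ndarray,
                        k: int = 3, penalty: float = 0.5) -> list:
    """scores: (N,) frame scores; features: (N, D) frame descriptors."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    scores = scores.astype(float).copy()
    selected = []
    for _ in range(k):
        idx = int(np.argmax(scores))
        selected.append(idx)
        sim = feats @ feats[idx]                                  # cosine similarity to the picked frame
        scores = scores * (1.0 - penalty * np.clip(sim, 0.0, 1.0))  # demote visually similar frames
        scores[idx] = -np.inf                                     # never pick the same frame twice
    return selected

# Toy usage: pick 3 diverse frames out of 10 random candidates.
sel = pick_diverse_frames(np.random.rand(10), np.random.rand(10, 16), k=3)
```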
Open Access
Paper
IEEE International Conference on Image Processing
Alberto Del Bimbo; Lorenzo Seidenari; Luca Cultrera;
University of Florence;
Out-of-Distribution (OOD) detection is a crucial challenge in computer vision, especially when deploying machine learning models in the real world. In this paper, we propose a novel OOD detection method leveraging Visual Attention Heatmaps from a Vision Transformer (ViT) classifier. Our approach involves training a Convolutional Autoencoder to reconstruct attention heatmaps produced by a ViT classifier, enabling accurate image reconstruction and effective OOD detection. Moreover, our method does not require additional labels during training, ensuring efficiency and ease of implementation. We validate our approach on a standard OOD benchmark using CIFAR10 and CIFAR100. To test OOD in a real-world setting we also collected a novel dataset: WildCapture. Our new dataset comprises more than 60k wild animal shots, from 15 different wildlife species, taken via phototraps in varying lighting conditions. The dataset is fully annotated with animal bounding boxes and species.
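A minimal sketch of the detection rule implied above, assuming attention heatmaps are available as single-channel images: an autoencoder trained on in-distribution heatmaps will reconstruct out-of-distribution ones poorly, so the reconstruction error can serve as the OOD score. The tiny autoencoder and threshold below are stand-ins, not the paper's architecture.

```python
# Illustrative heatmap autoencoder and reconstruction-error OOD score.
import torch
import torch.nn as nn

class HeatmapAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

def ood_score(ae: HeatmapAE, heatmap: torch.Tensor) -> float:
    """heatmap: (1, 1, H, W) attention map from the ViT classifier."""
    with torch.no_grad():
        recon = ae(heatmap)
    return torch.mean((recon - heatmap) ** 2).item()   # higher error -> more likely OOD

# Toy usage with an untrained autoencoder and a random 32x32 heatmap.
score = ood_score(HeatmapAE(), torch.rand(1, 1, 32, 32))
is_ood = score > 0.05   # in practice the threshold is chosen on a validation set
```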
Open Access
Conference paper
N/A
Cristian-Nicolae Butincu; Florin Leon; Lavinia-Eugenia Ferariu; Marius Gavrilescu;
Technical University of Iasi
This report describes our research and documentation efforts in searching and analyzing the related literature for existing applications of evolutionary algorithms for quality-oriented optimization. We present our findings in terms of multiple relevant results from the related state of the art. We mainly divide the results into two broad categories: classic single- and multi-objective optimization, and quality-diversity (QD) methods. While we mostly focus on evolutionary optimization applied in visualization and image-processing, we also present some results from other fields which we considered relevant. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Fabio Carrara; Fabrizio Falchi; Maurizio Tesconi;
ISTI-CNR; University of Pisa
Trends and opinion mining in social media increasingly focus on novel interactions involving visual media, like images and short videos, in addition to text.
In this work, we tackle the problem of visual sentiment analysis of social media images — specifically, the prediction of image sentiment polarity. While previous work relied on manually labeled training sets, we propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm; starting from scraped multimodal (text + images) data, we train a student model on the visual modality based on the outputs of a textual teacher model that analyses the sentiment of the corresponding textual modality.
We applied our method to randomly collected images crawled from Twitter over three months and produced, after automatic cleaning, a weakly-labeled dataset of ∼1.5 million images. Despite exploiting noisy labeled samples, our training pipeline produces classifiers showing strong generalization capabilities and outperforming the current state of the art on five manually labeled benchmarks for image sentiment polarity prediction.
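The cross-modal distillation recipe can be summarized in a single training step: a textual teacher provides sentiment probabilities for each tweet, and a visual student is fitted to reproduce them from the paired image alone. The models and the KL-divergence loss below are illustrative stand-ins for the authors' pipeline, not its actual components.

```python
# Illustrative cross-modal distillation step: visual student mimics textual teacher.
import torch
import torch.nn as nn

def distillation_step(student: nn.Module,
                      teacher_probs: torch.Tensor,   # (B, num_classes) from the text model
                      images: torch.Tensor,          # (B, 3, H, W) paired images
                      optimizer: torch.optim.Optimizer) -> float:
    student.train()
    log_probs = torch.log_softmax(student(images), dim=1)
    loss = nn.functional.kl_div(log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a tiny student and fake teacher outputs (negative/neutral/positive).
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
teacher = torch.softmax(torch.randn(4, 3), dim=1)
loss = distillation_step(student, teacher, torch.randn(4, 3, 32, 32), opt)
```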
Open Access
Publication
ECAI - European Conference on Artificial Intelligence
Ali Najm; Antonios Liapis; Despina Michael-Grigoriou; Emmanouil Xylakis; Georgios N. Yannakakis;
Cyprus University of Technology; University of Malta
Open Access
Conference paper
N/A
Ioanna Valsamara; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Andreas Sochopoulos; Evangelos Charalampakis; Ioannis Mademlis; Ioannis Pitas; Sotirios Papadopoulos
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Dimitrios Papaioannou; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Anestis Kaimakamadis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Emmanouil Krasanakis; Ioanna Valsamara; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CNAM; Université Paris-Saclay;
The latent space of GANs contains rich semantics reflecting the training data. Different methods propose to learn edits in latent space corresponding to semantic attributes, thus allowing to modify generated images. Most supervised methods rely on the guidance of classifiers to produce such edits. However, classifiers can lead to out-of-distribution regions and be fooled by adversarial samples. We propose an alternative formulation based on the Wasserstein loss that avoids such problems, while maintaining performance on-par with classifier-based approaches. We demonstrate the effectiveness of our method on two datasets (digits and faces) using StyleGAN2.
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González
Consiglio Nazionale delle Ricerche; University of Oviedo
Open Access
Book
N/A
Florin Leon; Marius Gavrilescu;
Technical University of Iasi
The study of hurricanes through information visualization and visual analysis is useful for tracking and understanding the behavior and impact of such hazardous natural phenomena. Images obtained from data commonly acquired through meteorological radar provide scientists with a visual representation of the storm’s characteristics, such as its location, size, and intensity. Such information is useful for forecasting, decision making in disaster management and environmental and human health risk assessment. Visual representations of such phenomena can help emergency responders and policymakers make informed decisions about evacuations, disaster response, and resource allocation. In this context, we propose an automated means of generating representations from complex 3D datasets obtained from meteorological radar scans of regions affected by hurricanes, illustrating the geometry and spatial features of such phenomena.
Open Access
Conference paper
International Conference on Environmental Engineering and Management
Ioannis Mademlis; Ioannis Pitas; Michail Kaseris
Aristotle University of Thessaloniki;
Open Access
Preprint
N/A
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Journal article
Journal of the Audio Engineering Society
Hannes Fassold;
Joanneum Research;
Manifold learning is an emerging research domain of machine learning. In this work, we give an introduction into manifold learning and how it is employed for important application fields in multimedia.
Open Access
Conference paper
Conference on Video and Signal Processing
Claudio Gennaro; Fabrizio Falchi; Gaetano Emanuele Valenti; Giuseppe Amato; Luca Ciampi; Nicola Messina;
ISTI-CNR; University of Pisa
Open Access
Conference paper
Conference on Image Analysis and Processing
Antonino Furnari; Claudio Gennaro; Fabrizio Falchi; Giovanni Maria Farinella; Nicola Messina;
ISTI-CNR; University of Catania;
Open Access
Journal article
Conference on Image Analysis and Processing
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Open Access
Conference paper
International Conference on Digital Audio Effects
Bruno Lepri; Linchao Bao; Marco de Nadai; Nicu Sebe; Yahui Liu; Yajing Chen;
FBK; Tencent AI Lab; University of Trento;
Closed Access
Journal article
IEEE Transactions on Multimedia
Marius Gavrilescu;
Technical University of Iasi
Objective quality assessment in volume visualization is a crucial process aimed at quantifying the quality of rendered volumetric images or animations using measurable metrics and algorithms. This approach is essential to ensure that the visualizations accurately represent the underlying data and meet specific quality standards. The assessment of quality in computer graphics, visualization and image processing is a complex task, particularly due to the number of scenarios, use cases and problems encountered in the aforementioned fields, and also due to the subjective nature of quality. To this end, we search for methods, algorithms and metrics that can be used by an optimizer to search for rendering parameters such that the resulting images adhere to our formulations on what constitutes quality. At the same time, similar metrics can be exploited such that the space of possible parameters can be more thoroughly explored, resulting in populations of images exhibiting diverse content. This document presents our findings in terms of approaches that constitute good candidates for quality and diversity criteria, to be used as objectives and/or for defining feature spaces when automatically generating images from volume data. This report was originally submitted as documentation for the deliverables of the VolEvol project.
Open Access
Report
N/A
Antonios Liapis; Chintan Trivedi; Emmanouil Xylakis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas; Matthew Barthet
University of Malta
Open Access
Conference paper
Conference on Affective Computing and Intelligent Interaction Workshops and Demos
Christina Katsini; George E. Raptis; Vasilis Theodorou;
Human Opsis
In a fast-changing media ecosystem, professionals and enterprises in the News and Media industry face new challenges that they must address to maximize their productivity and improve their services. The rise of alternative news sources such as social media, which has become the leading news source especially among young people, has led to emerging requirements in the News and Media industry. A core requirement is publishing articles as fast as possible on various platforms, combining visual and textual content. Accompanying news with images raises readers' interest and improves engagement and recall. Therefore, News and Media professionals must adapt their publication strategies to meet this requirement and the media consumers' expectations. However, selecting the appropriate images is a time-consuming, manual task. In this direction, we propose VIREO, which addresses this challenge by providing professionals (e.g., journalists) with an integrated digital solution that automatically recommends a collection of images that could accompany an article. To achieve this, VIREO implements text and image analysis and matching processes leveraging AI techniques in real time. VIREO aims to benefit both professionals (e.g., journalists), by suggesting appealing images that accompany the textual content of their articles and help create compelling stories, and media consumers (e.g., readers), by delivering an enhanced reading experience, engagement, and recall.
Open Access
Conference paper
Human-Computer Interaction
Ambrish Rawat; Gabriele Picco; Giulio Zizzo; Myles Foley; Taesung Lee; Yufang Hou;
IBM Research; Imperial College London;
The wide applicability and adaptability of large language models (LLMs) have enabled their rapid adoption. While pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance. However, this leads to issues over the violation of model licenses, model theft, and copyright infringement. Moreover, recent advances show that generative technology is capable of producing harmful content, which exacerbates the problems of accountability within model supply chains. Thus, we need a method to investigate how a model was trained or a piece of text was generated, and what their source pre-trained model was. In this paper we take a first step toward addressing this open problem by tracing back the origin of a given fine-tuned LLM to its corresponding pre-trained base model. We consider different knowledge levels and attribution strategies, and find that we are able to trace back to the original base model with an AUC of 0.804.
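As an illustration of the provenance-attribution setting described above (not the authors' method), the following Python sketch attributes a fine-tuned model to the candidate base model whose next-token predictions agree with it most often on a set of probe prompts; the logits are random placeholders standing in for real model outputs.

import numpy as np

def top1_agreement(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Fraction of probe positions where both models predict the same token."""
    return float(np.mean(logits_a.argmax(axis=-1) == logits_b.argmax(axis=-1)))

def attribute_base_model(finetuned_logits, base_logits_by_name):
    # Score every candidate base model and return the best match.
    scores = {name: top1_agreement(finetuned_logits, logits)
              for name, logits in base_logits_by_name.items()}
    return max(scores, key=scores.get), scores

# Toy example with random logits for three hypothetical candidate bases.
rng = np.random.default_rng(0)
vocab, probes = 100, 200
candidates = {f"base-{i}": rng.normal(size=(probes, vocab)) for i in range(3)}
finetuned = candidates["base-1"] + 0.5 * rng.normal(size=(probes, vocab))
print(attribute_base_model(finetuned, candidates))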
Open Access
Conference paper
N/A
Ioannis Patras; Zengqun Zhao;
Queen Mary University of London;
Open Access
Conference paper
N/A
Aaron Duane; Cathal Gurrin; Florian Spiess; Jakub Lokoč; Klaus Schoeffmann; Konstantin Schall; Ladislav Peška; Loris Sauter; Luca Rossetto; Lucia Vadicamo; Nicola Messina; Omar Shahbaz Khan; Stefanos Vrochidis; Stelios Andreadis; Thao-Nhu Nguyen; Werner Bailer; Zhixin Ma;
CERTH - Center for Research and Technology Hellas; Charles University; Dublin City University; HTW Berlin; ISTI-CNR; IT University of Copenhagen; Joanneum Research; Klagenfurt University; Singapore Management University; University of Basel; University of Copenhagen; University of Zurich;
This paper presents the findings of the eleventh Video Browser Showdown competition, where sixteen teams competed in known-item and ad-hoc search tasks. Many of the teams utilized state-of-the-art video retrieval approaches that demonstrated high effectiveness in challenging search scenarios. In the paper, a broad survey of all utilized approaches is presented in connection with an analysis of the performance of participating teams. Specifically, high-level performance indicators are presented with overall statistics, together with an in-depth analysis of the performance of selected tools implementing result set logging. The analysis reveals evidence that the CLIP model represents a versatile tool for cross-modal video retrieval when combined with interactive search capabilities. Furthermore, the analysis investigates the effect of different users and text query properties on the performance in search tasks. Last but not least, lessons learned from search task preparation are presented, and a new direction for ad-hoc search tasks at the Video Browser Showdown is introduced.
Open Access
Journal article
N/A
Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
This work examines the problem of increasing the robustness of deep neural network-based image classification systems to adversarial attacks, without changing the neural architecture or employing adversarial examples in the learning process. We attribute their well-known lack of robustness to the geometric properties of the deep neural network embedding space, derived from standard optimization options, which allow minor changes in the intermediate activation values to trigger dramatic changes to the decision values in the final layer. To counteract this effect, we explore optimization criteria that supervise the distribution of the intermediate embedding spaces on a class-specific basis, by introducing and leveraging one-class classification objectives. The proposed learning procedure compares favorably to recently proposed training schemes for adversarial robustness in black-box adversarial attack settings.
Open Access
Conference paper
N/A
Alexandros Zamichos; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Adversarial attacks in image classification are optimization problems that estimate the minimum perturbation required for a single input image, so the neural network misclassifies it. Universal adversarial perturbations are adversarial attacks that target a whole dataset, estimated by e.g., accumulating the perturbations for each image using standard adversarial attacks. This work treats the universal adversarial perturbation as a problem of transformation estimation. As such, we propose to learn an iterative transformation that maps “clean” images to a “perturbed” domain, by exploiting adversarial attacks. Our experiments show that the proposed formulation leads to easy generation of the adversarial perturbation, while it introduces less noise in the perturbed images, when compared to the state-of-the-art. Finally, this formulation allows us to explore additional properties, notably reversibility of the transformation and attainability of the transformation by using dataset samples.
Open Access
Conference paper
N/A
Ioannis Pitas; Stefania Altini; Vasileios Mygdalis
Aristotle University of Thessaloniki;
Different adversarial attack methods have been proposed in the literature, mainly focusing on attack efficiency and visual quality, e.g., similarity with the non-adversarial examples. These properties enable the use of adversarial attacks for privacy protection against automated classification systems, while maintaining utility for human users. In this paradigm, when privacy restrictions are lifted, access to the original data should be restored for all stakeholders. This paper addresses exactly this problem. Existing adversarial attack methods cannot reconstruct the original data from the adversarial ones, leading to significant storage overhead for all privacy applications. To solve this issue, we propose AdvRevGAN, a novel Neural Network architecture that generates reversible adversarial examples. We evaluate our approach in classification problems, where we examine the case where adversarial attacks are constructed by a neural network, while the original images are reconstructed using the reverse transformation from the adversarial examples. We show that adversarial attacks using this approach maintain and even increase their efficiency, while the classification accuracy of the model on the reconstructed data can be almost fully restored.
Open Access
Conference paper
N/A
Daniel Aláez; Ioannis Pitas; Jesús Villadangos; Vasileios Mygdalis
Aristotle University of Thessaloniki; University of Navarre;
In recent years, the field of automated aerial cinematography has seen a significant increase in demand for real-time 3D target geopositioning for motion and shot planning. To this end, many of the existing cinematography plans require the use of complex sensors that need to be equipped on the subject or rely on external motion systems. This work addresses this problem by combining monocular visual target detection and tracking with a simple ground intersection model. Under the assumption that the targets to be filmed typically stand on the ground, 3D target localization is achieved by estimating the direction and the norm of the look-at vector. The proposed algorithm employs an error estimation model that accounts for the error in detecting the bounding box, the height estimation errors, and the uncertainties of the pitch and yaw angles. This algorithm has been fully implemented in a heavy-lifting aerial cinematography hexacopter, and its performance has been evaluated through experimental flights. Results show that typical errors are within 5 meters of absolute distance and 3 degrees of angular error for distances to the target of around 100 meters.
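The ground-intersection idea can be sketched as follows (a minimal Python illustration with assumed camera conventions and toy intrinsics, not the authors' implementation or error model): the look-at ray of a detected bounding box is intersected with a flat ground plane to recover the target position and the norm of the look-at vector.

import numpy as np

def target_ground_position(h, pitch_deg, yaw_deg, u, v, fx, fy, cx, cy):
    """Intersect the camera-to-target ray with the flat ground plane z = 0.

    h:         camera altitude above ground (metres).
    pitch_deg: camera tilt below the horizon (positive = looking down).
    yaw_deg:   heading, measured clockwise from north.
    (u, v):    bounding-box centre in pixels; fx, fy, cx, cy: intrinsics.
    """
    # Angular offsets of the detection with respect to the optical axis.
    d_yaw = np.degrees(np.arctan2(u - cx, fx))
    d_pitch = np.degrees(np.arctan2(v - cy, fy))   # image v grows downwards
    pitch = np.radians(pitch_deg + d_pitch)
    yaw = np.radians(yaw_deg + d_yaw)
    if pitch <= 0:
        raise ValueError("Look-at ray does not hit the ground plane")
    slant_range = h / np.sin(pitch)                # norm of the look-at vector
    ground_range = h / np.tan(pitch)
    east, north = ground_range * np.sin(yaw), ground_range * np.cos(yaw)
    return np.array([east, north, 0.0]), slant_range   # position relative to the camera

print(target_ground_position(h=100.0, pitch_deg=30.0, yaw_deg=45.0,
                             u=800, v=400, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0))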
Open Access
Conference paper
N/A
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London;
Open Access
Conference paper
N/A
Adrian Popescu; Armelle Brun; Evan Dufraisse; Jérôme Deshayes-Chossart; Julien Tourille;
Université de Lorraine; Université Paris-Saclay;
Target-dependent sentiment classification (TSC) enables a fine-grained automatic analysis of sentiments expressed in texts. Sentiment expression varies depending on the domain, and it is necessary to create domain-specific datasets. While socially important, TSC in the news domain remains relatively understudied. We introduce MAD-TSC, the first multilingual aligned dataset designed for TSC in news. MAD-TSC differs substantially from existing resources. First, it includes aligned examples in eight languages to facilitate a comparison of performance for individual languages, and a direct comparison of human and machine translation. Second, the dataset is sampled from a diversified parallel news corpus, and is diversified in terms of news sources and geographic spread of entities. Finally, MAD-TSC is more challenging than existing datasets because its samples are more complex. We exemplify the use of MAD-TSC with comprehensive monolingual and multilingual experiments. The latter show that machine translations can successfully replace manual ones, and that performance for all included languages can match that of English by automatically translating test examples.
Open Access
Conference paper
Conference on Computational Linguistics
Daniele Ugo Leonzio; Luca Cuccovillo; Marco Marcon; Paolo Bolettieri; Patrick Aichroth; Stefano Tubaro;
Fraunhofer IDMT; Politecnico di Milano;
In recent years, the multimedia forensic community has put a great effort into developing solutions to assess the integrity and authenticity of multimedia objects, focusing especially on manipulations applied by means of advanced deep learning techniques. However, in addition to complex forgeries such as deepfakes, very simple yet effective manipulation techniques that do not involve any state-of-the-art editing tools still exist and prove dangerous. This is the case of audio splicing for speech signals, i.e., concatenating and combining multiple speech segments obtained from different recordings of a person in order to cast a new fake speech. Indeed, by simply adding a few words to an existing speech we can completely alter its meaning. In this work, we address the overlooked problem of detection and localization of audio splicing from different models of acquisition devices. Our goal is to determine whether an audio track under analysis is pristine, or whether it has been manipulated by splicing one or multiple segments obtained from different device models. Moreover, if a recording is detected as spliced, we identify where the modification has been introduced in the temporal dimension. The proposed method is based on a Convolutional Neural Network (CNN) that extracts model-specific features from the audio recording. After extracting the features, we determine whether there has been a manipulation through a clustering algorithm. Finally, we identify the point where the modification has been introduced through a distance-measuring technique. The proposed method makes it possible to detect and localize multiple splicing points within a recording.
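The clustering-plus-localization step can be illustrated with a simplified Python sketch; hand-crafted spectral statistics stand in for the paper's CNN-derived, device-specific features, and the window length and cluster count are arbitrary assumptions.

import numpy as np
from sklearn.cluster import KMeans

def window_features(signal, sr, win=0.5):
    """Crude per-window spectral features (mean log magnitude in coarse bands)."""
    hop = int(win * sr)
    feats = []
    for start in range(0, len(signal) - hop, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + hop]))
        bands = np.array_split(np.log1p(spec), 8)
        feats.append([b.mean() for b in bands])
    return np.array(feats)

def locate_splices(signal, sr, n_sources=2, win=0.5):
    feats = window_features(signal, sr, win)
    labels = KMeans(n_clusters=n_sources, n_init=10, random_state=0).fit_predict(feats)
    # A candidate splice point is wherever consecutive windows change cluster.
    change_idx = np.where(np.diff(labels) != 0)[0] + 1
    return [idx * win for idx in change_idx]   # splice times in seconds

# Toy example: two "recordings" with different noise colour concatenated.
sr = 8000
rng = np.random.default_rng(1)
a = rng.normal(size=5 * sr)
b = np.convolve(rng.normal(size=5 * sr), np.ones(20) / 20, mode="same")  # low-passed noise
print(locate_splices(np.concatenate([a, b]), sr))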
Open Access
Journal article
Multimedia FORensics in the WILD
Alberto Del Bimbo; Federico Becattini; Lorenzo Seidenari; Luca Cultrera;
University of Florence;
Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of the utmost importance to develop models that can offer a certain degree of explainability in order to be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism in order to provide visually explainable decisions by design. We propose different variations of the method, relying on end-to-end trainable region proposal functions that generate regions of interest to be weighed by an attention module. We show that visual attention can improve driving capabilities and, at the same time, provide explainable decisions.
Open Access
Journal article
N/A
Nicu Sebe; Wei Wang; Yue Song
Beijing Jiaotong University; University of Trento;
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Fang Li; Jing Wang; Jun Zhang; Wengjing Li; Zhongcheng Wu; Zhun Zhong
Chinese Academy of Sciences; University of Trento;
Closed Access
Journal article
IEEE Transactions on Intelligent Transportation Systems
Andy Keller; Max Welling; Nicu Sebe; Yue Song
University of Amsterdam; University of Trento;
Open Access
Conference paper
International Conference on Machine Learning
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
Open Access
Journal article
N/A
Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;
Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample’s augmentations are considered hard positives, and supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art in various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196 datasets. Code and annotations are available at https://github.com/MrChenFeng/MaskCon_CVPR2023.
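A simplified sketch of the coarse-label masking idea follows (shapes, temperatures and the omission of the self-augmentation positive are illustrative simplifications, not the official implementation).

import torch
import torch.nn.functional as F

def maskcon_style_loss(z_query, z_keys, coarse_q, coarse_k, t_self=0.1, t_soft=0.05):
    """z_query: (B, D) anchor embeddings; z_keys: (K, D) other samples (e.g. a memory bank);
    coarse_q / coarse_k: their coarse labels."""
    z_query = F.normalize(z_query, dim=1)
    z_keys = F.normalize(z_keys, dim=1)
    sim = z_query @ z_keys.t()                          # (B, K) cosine similarities
    same_coarse = (coarse_q[:, None] == coarse_k[None, :]).float()
    # Soft targets: within the same coarse class, weight keys by similarity;
    # keys from other coarse classes are masked out (weight 0).
    weights = torch.softmax(sim / t_soft, dim=1) * same_coarse
    targets = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-8)
    log_probs = F.log_softmax(sim / t_self, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# Toy usage with random embeddings and coarse labels.
B, K, D = 8, 64, 32
loss = maskcon_style_loss(torch.randn(B, D), torch.randn(K, D),
                          torch.randint(0, 4, (B,)), torch.randint(0, 4, (K,)))
print(loss.item())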
Open Access
Conference paper
N/A
Christos Tzelepis; Giorgios Kordopatis-Zilos; Giorgios Tolias; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Queen Mary University of London;
We introduce S2VS, a video similarity learning approach with self-supervision. Self-Supervised Learning (SSL) is typically used to train deep models on a proxy task so as to have strong transferability on target tasks after fine-tuning. Here, in contrast to prior work, SSL is used to perform video similarity learning and address multiple retrieval and detection tasks at once with no use of labeled data. This is achieved by learning via instance-discrimination with task-tailored augmentations and the widely used InfoNCE loss together with an additional loss operating jointly on self-similarity and hard-negative similarity. We benchmark our method on tasks where video relevance is defined with varying granularity, ranging from video copies to videos depicting the same incident or event. We learn a single universal model that achieves state-of-the-art performance on all tasks, surpassing previously proposed methods that use labeled data. The code and pretrained models are publicly available at: https://github.com/gkordo/s2vs.
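The instance-discrimination InfoNCE objective at the core of such self-supervised similarity learning can be sketched as follows (the additional loss on self-similarity and hard-negative similarity described in the abstract is omitted; embeddings and the temperature are placeholders).

import torch
import torch.nn.functional as F

def info_nce(view_a, view_b, temperature=0.07):
    """view_a / view_b: (B, D) embeddings of two augmented views of the same videos.
    Each video's other view is its positive; all other videos in the batch are negatives."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

print(info_nce(torch.randn(16, 128), torch.randn(16, 128)).item())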
Open Access
Conference paper
N/A
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Open Access
Conference paper
N/A
Alberto Del Bimbo; Andrea Leonardo; Chiara Albisani; Federico Becattini; Lisa Cresti; Lorenzo Berlincioni; Luca Cultrera; Sara Picchioni
Università degli Studi di Firenze; University of Florence;
Recently, event cameras have shown large applicability in several computer vision fields, especially concerning tasks that require high temporal resolution. In this work, we investigate the usage of such kind of data for emotion recognition by presenting NEFER, a dataset for Neuromorphic Event-based Facial Expression Recognition. NEFER is composed of paired RGB and event videos representing human faces labeled with the respective emotions and also annotated with face bounding boxes and facial landmarks. We detail the data acquisition process as well as provide a baseline method for RGB and event data. The collected data captures subtle micro-expressions, which are hard to spot with RGB data, yet emerge in the event domain. We report double the recognition accuracy for the event-based approach, proving the effectiveness of a neuromorphic approach for analyzing fast and hardly detectable expressions and the emotions they conceal.
Open Access
Conference paper
Computer Vision Foundation
Alessandro Betti; Frédéric Precioso; Gabriele Ciravegna; Kevin Mottin; Marco Gori
Politecnico di Torino; Université Côte d'Azur; University of Siena
The deployment of Deep Learning (DL) models is still precluded in those contexts where the amount of supervised data is limited. To address this issue, active learning strategies aim at minimizing the amount of labelled data required to train a DL model. Most active learning strategies are based on uncertain sample selection, and are often even restricted to samples lying close to the decision boundary. These techniques are theoretically sound, but an understanding of the selected samples based on their content is not straightforward, further driving non-experts to consider DL as a black box. For the first time, here we propose to take into consideration common domain knowledge and enable non-expert users to train a model with fewer samples. In our Knowledge-driven Active Learning (KAL) framework, rule-based knowledge is converted into logic constraints and their violation is checked as a natural guide for sample selection. We show that even simple relationships among data and output classes offer a way to spot predictions for which the model needs supervision. We empirically show that KAL (i) outperforms many active learning strategies, particularly in those contexts where domain knowledge is rich, (ii) discovers data distributions lying far from the initial training data, (iii) ensures domain experts that the provided knowledge is acquired by the model, (iv) is suitable for regression and object recognition tasks, unlike uncertainty-based strategies, and (v) has a low computational demand.
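A toy sketch of knowledge-driven sample selection follows (the rule below, that two particular classes should not co-occur, is invented for illustration and is not from the paper).

import numpy as np

def violation_scores(probs, incompatible_pairs):
    """probs: (N, C) predicted class probabilities (multi-label setting);
    incompatible_pairs: list of (i, j) class indices that should not co-occur.
    The violation of 'not (i and j)' is measured as probs[:, i] * probs[:, j]."""
    scores = np.zeros(probs.shape[0])
    for i, j in incompatible_pairs:
        scores += probs[:, i] * probs[:, j]
    return scores

def select_for_labelling(probs, incompatible_pairs, budget):
    scores = violation_scores(probs, incompatible_pairs)
    return np.argsort(scores)[::-1][:budget]   # indices of the worst rule violators

rng = np.random.default_rng(2)
pool_probs = rng.uniform(size=(1000, 5))       # stand-in for model predictions on the unlabelled pool
print(select_for_labelling(pool_probs, incompatible_pairs=[(0, 3)], budget=10))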
Open Access
Conference paper
N/A
Fabrizio Falchi; Jan Sedmidubsky; Nicola Messina; Tomás Rebok;
ISTI-CNR; Masaryk University;
Open Access
Conference paper
N/A
Giorgios Kordopatis-Zilos; Ioannis Kompatsiaris; Pantelis Dogoulis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
New advancements for the detection of synthetic images are critical for fighting disinformation, as the capabilities of generative AI models continuously evolve and can lead to hyper-realistic synthetic imagery at unprecedented scale and speed. In this paper, we focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images — highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. We demonstrate the effectiveness of the proposed approach by conducting experiments with generated images from two seminal architectures, StyleGAN2 and Latent Diffusion, and using three different concepts for each, so as to measure the cross-concept generalization ability. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
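In its simplest form, the quality-based sampling premise reduces to ranking generated images by an external quality score and keeping only the top fraction for detector training (the scores below are random placeholders for a quality-estimation model).

import numpy as np

def select_by_quality(image_ids, quality_scores, keep_fraction=0.5):
    """Return the ids of the most realistic synthetic images according to the quality scores."""
    order = np.argsort(quality_scores)[::-1]
    n_keep = int(len(image_ids) * keep_fraction)
    return [image_ids[i] for i in order[:n_keep]]

rng = np.random.default_rng(3)
ids = [f"gen_{i:05d}.png" for i in range(10_000)]
scores = rng.uniform(size=len(ids))            # stand-in for a probabilistic quality model
train_set = select_by_quality(ids, scores, keep_fraction=0.3)
print(len(train_set), train_set[:3])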
Open Access
Conference paper
N/A
Adrian Popescu; Bogdan Ionescu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest
With recent advancements in synthetic media manipulation and generation, verifying multimedia content posted online has become increasingly difficult. Additionally, the malicious exploitation of AI technologies by actors to disseminate disinformation on social media, and more generally the Web, at an alarming pace poses significant threats to society and democracy. Therefore, the development of AI-powered tools that facilitate media verification is urgently needed. The MAD ’23 workshop aims to bring together individuals working on the wider topic of detecting disinformation in multimedia to exchange their experiences and discuss innovative ideas, attracting people with varying backgrounds and expertise. The research areas of interest include identifying manipulated and synthetic content in multimedia, as well as examining the dissemination of disinformation and its impact on society. The multimedia aspect is very important since content most often contains a mix of modalities and their joint analysis can boost the performance of verification methods.
Open Access
Conference paper
Conference on Multimedia Retrieval
Adrian Popescu; Bogdan Ionescu; Giorgios Kordopatis-Zilos; Luca Cuccovillo; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Czech Technical University in Prague; Fraunhofer IDMT; Université Paris-Saclay; University Politehnica of Bucharest
Front matter of the proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation, held in Thessaloniki (Greece) on June 12th, 2023. The full proceedings are available online at https://doi.org/10.1145/3591106.
Open Access
Book section
International Workshop on Multimedia AI against Disinformation
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
VISIONE is a large-scale video retrieval system that integrates multiple search functionalities, including free text search, spatial color and object search, visual and semantic similarity search, and temporal search. The system leverages cutting-edge AI technology for visual analysis and advanced indexing techniques to ensure scalability. As demonstrated by its runner-up position in the 2023 Video Browser Showdown competition, VISIONE effectively integrates these capabilities to provide a comprehensive video retrieval solution. A system demo is available online, showcasing its capabilities on over 2300 hours of diverse video content (V3C1+V3C2 dataset) and 12 hours of highly redundant content (Marine dataset). The demo can be accessed at https://visione.isti.cnr.it.
Open Access
Conference paper
Conference on Multimedia Retrieval
Lucile Sassatelli; Quentin Guimard
Université Côte d'Azur;
Adaptive bitrate (ABR) algorithms are used in streaming media to adjust video or audio quality based on the viewer’s network conditions to provide a smooth playback experience. With the rise of virtual reality (VR) headsets, 360° video streaming is growing rapidly and requires efficient ABR strategies that also adapt the video quality to the user’s head position. However, research in this field is often difficult to compare due to a lack of reproducible simulations. To address this problem, we provide SMART360, a 360° streaming simulation environment to compare motion prediction and adaptive bitrate strategies. We provide sample inputs and baseline algorithms along with the simulator, as well as examples of results and visualizations that can be obtained with SMART360. The code and data are made publicly available.
Open Access
Conference paper
ACM Multimedia Systems Conference
Juanjuan Weng; Nicu Sebe; Shaozi Li; Zhiming Luo; Zhun Zhong
University of Trento; Xiamen University
Open Access
Journal article
IEEE Transactions on Information Forensics and Security
Hao Tang;
ETH Zurich; Tencent AI Lab; University of Oregon;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Nan Pu; Nicu Sebe; Zhun Zhong
University of Trento;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Bin Ren; Nicu Sebe; Rita Cucchiara; Wei Bi; Wei Wang; Yahui Liu; Yue Song
Beijing Jiaotong University; Tencent AI Lab; University of Modena and Reggio Emilia; University of Trento;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Boyu Wang; Charles Ling; Nicu Sebe; Wei Wang; Weijie Wang; Xi Chen; Zhun Zhong
Huawei Noah's Ark Lab; University of Trento; Western University;
Open Access
Conference paper
Computer Vision and Pattern Recognition
Antonios Liapis; Edoardo Tibuzzi; Georgios N. Yannakakis; Jeg Dudley; Joel Hilmersson; Konstantinos Sfikas;
AKT II; University of Malta
Computer-aided optimization algorithms in structural engineering have historically focused on the structural performance of generated forms, often resulting in the selection of a single ‘optimal’ solution. However, diversity of generated solutions is desirable when those solutions are shown to a human user to choose from. Quality-Diversity (QD) search is an emerging field of Evolutionary Computation which can automate the exploration of the solution space in engineering problems. QD algorithms, such as MAP-Elites, operate by maintaining and expanding an archive of diverse solutions, optimising for quality in local niches of a multidimensional design space. The generated archive of solutions can help engineers gain a better overview of the solution space, illuminating which designs are possible and their trade-offs. In this paper we apply Quality Diversity search to the problem of designing shell structures. Since the design of shell structures comes with physical constraints, we leverage a constrained optimization variant of the MAP-Elites algorithm, FI-MAP-Elites. We implement our proposed methodology within the Rhino/Grasshopper environment and use the Karamba Finite Element Analysis solver for all structural engineering calculations. We test our method on case studies of parametric models of shell structures that feature varying complexity. Our experiments investigate the algorithm’s ability to illuminate the solution space and generate feasible and high-quality solutions.
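A generic MAP-Elites-style loop with a feasibility check gives a minimal sketch of constrained Quality-Diversity search (toy fitness, behaviour descriptor and constraint; the paper instead evaluates shell structures with the Karamba finite-element solver and uses the FI-MAP-Elites variant rather than simply discarding infeasible offspring).

import numpy as np

rng = np.random.default_rng(4)
BINS = 10

def fitness(x):
    return -np.sum((x - 0.5) ** 2)           # quality to maximise (toy)

def behaviour_cell(x):
    # Discretise a 2D behaviour descriptor into an archive cell index.
    return (min(int(x[0] * BINS), BINS - 1), min(int(x[1] * BINS), BINS - 1))

def feasible(x):
    return x.sum() <= 3.0                    # stand-in for engineering constraints

archive = {}                                 # behaviour cell -> (fitness, solution)
for _ in range(20_000):
    if archive:
        parent = list(archive.values())[rng.integers(len(archive))][1]
        child = np.clip(parent + rng.normal(0, 0.1, size=parent.shape), 0, 1)
    else:
        child = rng.uniform(size=5)
    if not feasible(child):
        continue                             # simplest possible constraint handling
    key, f = behaviour_cell(child), fitness(child)
    if key not in archive or f > archive[key][0]:
        archive[key] = (f, child)            # keep the best solution per niche

print(len(archive), "niches filled; best fitness:", max(v[0] for v in archive.values()))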
Open Access
Conference paper
N/A
Christos Tzelepis; Ioannis Panagakis; Ioannis Patras; James Oldfield; Mihalis Nicolaou;
Cyprus Institute; National and Kapodistrian University of Athens; Queen Mary University of London;
Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP’s joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. What’s more, we show the proposed model additionally facilitates learning subspaces corresponding to specific visual appearances (e.g. artists’ painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists’ styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
Open Access
Conference paper
Conference on Neural Information Processing Systems
Johan Oomen; Philo van Kemenade; Rasa Bocyte
Netherlands Institute for Sound & Vision
Segments of audiovisual content are constantly being reinterpreted as they are reused and repeated in new contexts. Framing analysis can reveal patterns and biases in the way content is being recontextualised in the media to shape public discourse. In the AI4Media project, the Netherlands Institute for Sound & Vision has been investigating how AI-based tools could support humanities scholars in performing framing analysis across large-scale audiovisual collections. This short paper describes a demo of the Partial Audio Matching (PAM) functionality designed for this purpose. It describes how PAM has been integrated into the CLARIAH Media Suite – a virtual research space for humanities scholars that enables the exploration and analysis of audiovisual collections.
Open Access
Report
N/A
Adrian Popescu; Hugo Schindler; Jérôme Deshayes-Chossart; Van-Khoa Nguyen
Université Paris-Saclay; University of Geneva;
Online social networks use AI techniques to automatically infer profiles from users’ shared data. However, these inferences and their effects remain, to a large extent, opaque to the users themselves. We propose a method which raises user awareness about the potential use of their profiles in impactful situations, such as searching for a job or an accommodation. These situations illustrate usage contexts that users might not have anticipated when deciding to share their data. User photographic profiles are described by automatic object detections in profile photos, and associated object ratings in situations. Human ratings of the profiles per situation are also available for training. These data are represented as graph structures which are fed into graph neural networks in order to learn how to automatically rate them. An adaptation of the learning procedure per situation is proposed since the same profile is likely to be interpreted differently, depending on the context. Automatic profile ratings are compared to one another in order to inform individual users of their standing with respect to others. Our method is evaluated on a public dataset, and consistently outperforms competitive baselines. An ablation study gives insights about the role of its main components.
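A highly simplified sketch of rating a profile represented as a graph of detected objects, using one round of neighbourhood aggregation, is given below (random placeholder weights and features; the paper trains situation-specific graph neural networks).

import numpy as np

def profile_score(node_feats, adjacency, w_msg, w_out):
    """node_feats: (N, D) per-object features; adjacency: (N, N) 0/1 matrix;
    w_msg: (D, H) message weights; w_out: (H,) readout weights."""
    deg = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbour_mean = (adjacency @ node_feats) / deg        # one message-passing step
    hidden = np.tanh((node_feats + neighbour_mean) @ w_msg)
    return float(hidden.mean(axis=0) @ w_out)              # graph-level readout = profile rating

rng = np.random.default_rng(5)
n_objects, d, h = 6, 8, 16
score = profile_score(rng.normal(size=(n_objects, d)),
                      (rng.uniform(size=(n_objects, n_objects)) > 0.5).astype(float),
                      rng.normal(size=(d, h)), rng.normal(size=h))
print(score)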
Open Access
Conference paper
Conference on Multimedia Retrieval
Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;
ISTI-CNR;
Open Access
Conference paper
Conference on Image Analysis and Processing
Hristiana Krasteva; Irina Temnikova; Ivo Dzhumerov; Ruslana Margova; Tsvetelina Stefanova; Veneta Kireva;
GATE Institute;
Automatically detecting disinformation is an important Natural Language Processing (NLP) task whose results can assist journalists and the general public. The European Commission defines “disinformation” as “false or misleading content that is spread with an intention to deceive”. Deception and thus disinformation can be identified by the presence of (psycho)linguistic markers, but some lower-resourced languages (e.g. Bulgarian) lack sufficient linguistic and psycholinguistic research on this topic, lists of such markers and suitable datasets. This article introduces the first ever resources for studying and detecting deception and disinformation in Bulgarian (some of which can be adapted to other languages). The resources can benefit linguists, psycholinguists and NLP researchers, are accessible on Zenodo (subject to legal conditions) and include: 1) an extended hierarchical classification of linguistic markers signalling deception; 2) lists of Bulgarian expressions for recognizing some of the linguistic markers; 3) four large Bulgarian social media datasets on topics related to deception, not fact-checked, but automatically annotated with the markers; 4) Python scripts to automatically collect, clean, anonymize, and annotate new Bulgarian texts. The datasets can be used to build machine learning methods or study potential deception. The article describes the methods of collecting and processing the datasets and linguistic markers, and presents some statistics.
Open Access
Conference paper
Language & Technology Conference
Hao Tang; Nicu Sebe; Philip Torr;
ETH Zurich; University of Oxford; University of Trento;
Closed Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Dan Xu; Guolei Sun; Hao Tang; Luc van Gool; Nicu Sebe; Radu Timofte; Xiaojuan Qi;
ETH Zurich; HKUST; University of Hong Kong; University of Trento; University of Wurzburg;
We propose a novel edge guided generative adversarial network with contrastive learning (ECGAN) for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). 3) Existing semantic image synthesis methods focus on modeling “local” semantic information from a single input semantic layout. However, they ignore “global” semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information. To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods.
Open Access
Conference paper
International Conference on Learning Representations
Bin Ren; Hao Tang; Nicu Sebe; Wei Wang; Xia Li; Yiming Wang;
Beijing Jiaotong University; ETH Zurich; FBK; University of Trento;
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets. The source code is available at https://github.com/Amazingren/PI-Trans.
Closed Access
Conference paper
Speech and Signal Processing
Carlos Santiago; Claudio Gennaro; Giuseppe Amato; João Paulo Costeira; Luca Ciampi;
Instituto Superior Técnico; ISTI-CNR;
Video violence detection is a subset of human action recognition aiming to detect violent behaviors in trimmed video clips. Current Computer Vision solutions based on Deep Learning approaches provide impressive results. However, their success relies on large collections of labeled datasets for supervised learning to guarantee that they generalize well to diverse testing scenarios. Although plentiful annotated data may be available for some pre-specified domains, manual annotation is unfeasible for every ad-hoc target domain or task. As a result, in many real-world applications, there is a domain shift between the distributions of the train (source) and test (target) domains, causing a significant drop in performance at inference time. To tackle this problem, we propose an Unsupervised Domain Adaptation scheme for video violence detection based on single-image classification that mitigates the domain gap between the two domains. We conduct experiments considering, as the labeled source domain, datasets containing violent/non-violent clips in general contexts and, as the target domain, a collection of videos specific to detecting violent actions in public transport, showing that our proposed solution can improve the performance of the considered models.
Open Access
Conference paper
Conference on Image Processing and Vision Engineering
Antonios Liapis; David Melhart; Georgios N. Yannakakis; Paris Mavromoustakos-Blom; Pieter Spronck; Sander Bakkes;
Tilburg University; University of Malta; Utrecht University;
Games are designed to elicit strong emotions during game play, especially when players are competing against each other. Artificial Intelligence applied to predict a player’s emotions has mainly been tested on single-player experiences in low-stakes settings and short-term interactions. How do players experience and manifest affect in high-stakes competitions, and which modalities can capture this? This paper reports a first experiment in this line of research, using a competition of the video game Hearthstone where both competing players’ game play and facial expressions were recorded over the course of the entire match, which could span up to 41 minutes. Using two experts’ annotations of tension using a continuous video affect annotation tool, we attempt to predict tension from the webcam footage of the players alone. Treating both the input and the tension output in a relative fashion, our best models reach 66.3% average accuracy (up to 79.2% at the best fold) in the challenging leave-one-participant-out cross-validation task. This initial experiment shows a way forward for affect annotation in games “in the wild” in high-stakes, real-world competitive settings.
Open Access
Conference paper
Conference on the Foundations of Digital Games
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;
University of Malta
How can we reliably transfer affect models trained in controlled laboratory conditions (in-vitro) to uncontrolled real-world settings (in-vivo)? The information gap between in-vitro and in-vivo applications defines a core challenge of affective computing. This gap is caused by limitations related to affect sensing, including intrusiveness, hardware malfunctions and availability of sensors. As a response to these limitations, we introduce the concept of privileged information for operating affect models in real-world scenarios (in the wild). Privileged information enables affect models to be trained across multiple modalities available in a lab and to ignore, without significant performance drops, those modalities that are not available when they operate in the wild. Our approach is tested in two multimodal affect databases, one of which is designed for testing models of affect in the wild. By training our affect models using all modalities and then using solely raw footage frames for testing the models, we reach the performance of models that fuse all available modalities for both training and testing. The results are robust across both classification and regression affect modeling tasks, which are dominant paradigms in affective computing. Our findings make a decisive step towards realizing affect interaction in the wild.
Open Access
Journal article
IEEE Transactions on Affective Computing
Alberto Del Bimbo; Leonardo Galteri; Lorenzo Agnolucci; Marco Bertini;
University of Florence;
In recent years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates.
Open Access
Journal article
IEEE Transactions on Multimedia
Georgios Tzimiropoulos; Ioannis Maniadis Metaxas; Ioannis Patras;
Queen Mary University of London;
Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods, is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings are necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
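The notion of penalising insufficient diversity between two clusterings can be sketched as follows (the similarity measure and threshold are illustrative choices, not the paper's exact loss).

import torch

def diversity_penalty(assign_a, assign_b, max_similarity=0.7):
    """assign_a, assign_b: (N, K) soft cluster assignments (rows sum to 1)."""
    co = assign_a.t() @ assign_b                               # (K, K) soft co-occurrence counts
    co = co / co.sum(dim=1, keepdim=True).clamp_min(1e-8)      # row-normalise per cluster of the first clustering
    # For each cluster of the first clustering: share of its members that land in its
    # best-matching cluster of the second; equals 1.0 when the clusterings coincide.
    similarity = co.max(dim=1).values.mean()
    return torch.relu(similarity - max_similarity)             # zero once the clusterings are diverse enough

a = torch.softmax(torch.randn(256, 10), dim=1)
b = torch.softmax(torch.randn(256, 10), dim=1)
print(diversity_penalty(a, b).item())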
Open Access
Conference paper
N/A
Dan Xu; Hao Tang; Hong Liu; Nicu Sebe; Philip Torr;
ETH Zurich; Hong Kong University of Science and Technology; Peking University; University of Oxford; University of Trento;
State-of-the-art methods in the image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce visual artifacts, being able to translate low-level information but not high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative parts between the source and target domains, thus making the generated images low quality. In this article, we propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks with eight public datasets, demonstrating that the proposed method is effective to generate sharper and more realistic images compared with existing competitive models. The code is available at https://github.com/Ha0Tang/AttentionGAN.
Closed Access
Journal article
IEEE Transactions on Neural Networks and Learning Systems
Hong Liu; Nicu Sebe; Shin'ichi Satoh; Zhun Zhong
National Institute of Informatics, Tokyo; University of Trento;
Overfitting in adversarial training has attracted the interest of researchers in the artificial intelligence and machine learning community in recent years. To address this issue, in this paper we begin by evaluating the defense performance of several calibration methods on various robust models. Our analysis and experiments reveal two intriguing properties: 1) a well-calibrated robust model exhibits decreased confidence; 2) there is a trade-off between the confidence assigned to natural and to adversarial images. These new properties offer a straightforward insight into designing a simple but effective regularization, called Self-Residual-Calibration (SRC). The proposed SRC calculates the absolute residual between adversarial and natural logit features corresponding to the ground-truth labels. Furthermore, we utilize the pinball loss to minimize the quantile residual between them, resulting in more robust regularization. Extensive experiments indicate that SRC can effectively mitigate the overfitting problem while improving the robustness of state-of-the-art models. Importantly, SRC is complementary to various regularization methods. When combined with them, we are capable of achieving top-rank performance on the AutoAttack benchmark leaderboard.
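A minimal sketch of a pinball-loss regulariser on the residual between adversarial and natural logits at the ground-truth class is given below (the quantile level and tensor shapes are illustrative assumptions).

import torch

def pinball_loss(residual, tau=0.9):
    """Quantile (pinball) loss for quantile level tau."""
    return torch.mean(torch.maximum(tau * residual, (tau - 1.0) * residual))

def residual_calibration(logits_nat, logits_adv, targets, tau=0.9):
    # Absolute residual between adversarial and natural logits at the true class.
    idx = torch.arange(targets.size(0))
    residual = (logits_adv[idx, targets] - logits_nat[idx, targets]).abs()
    return pinball_loss(residual, tau)

nat, adv = torch.randn(32, 10), torch.randn(32, 10)
y = torch.randint(0, 10, (32,))
print(residual_calibration(nat, adv, y).item())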
Closed Access
Journal article
Artificial Intelligence
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;
University of Malta
This paper introduces a user-driven evolutionary algorithm based on Quality Diversity (QD) search. During a design session, the user iteratively selects among presented alternatives and their selections affect the upcoming results. We implement a variation of the MAP-Elites algorithm where the presented alternatives are sampled from a small region (window) of the behavioral space. After a user selection, the window is centered on the selected individual’s behavior characterization, evolution selects parents from within this window to produce offspring, and new alternatives are sampled. Essentially we define an adaptive system of local QD search, where the user’s selections guide the search towards specific regions of the behavioral space. The system is tested on the generation of architectural layouts, a constrained optimization task, leveraging QD search through a two-archive approach.
Open Access
Conference paper
Genetic and Evolutionary Computation Conference
Alejandro Moreo; Fabrizio Sebastiani; Mirko Bunse; Pablo González
ISTI-CNR; University of Applied Sciences and Art Dortmund; University of Oviedo
Open Access
Journal article
SIGKDD Explorations
Hristiana Nikolaeva; Irina Temnikova; Ivo Dzhumerov; Silvia Gargova;
GATE Institute; Plovdiv University;
Automatic Language Identification (LI) is a widely addressed task, but not all users (for example linguists) have the means or interest to develop their own tool or to train the existing ones with their own data. There are several off-the-shelf LI tools, but for some languages, it is unclear which tool is the best for specific types of text. This article presents a comparison of the performance of several off-the-shelf language identification tools on Bulgarian social media data. The LI tools are tested on a multilingual Twitter dataset (composed of 2966 tweets) and an existing Bulgarian Twitter dataset on the topic of fake content detection of 3350 tweets. The article presents the manual annotation procedure of the first dataset, a discussion of the decisions of the two annotators, and the results from testing the 7 off-the-shelf LI tools on both datasets. Our findings show that the tool that is the easiest for users with no programming skills achieves the highest F1-Score on Bulgarian social media data, while other tools have very useful functionalities for Bulgarian social media texts.
Open Access
Conference paper
Conference on Computational Linguistics
Fabio Carrara; Giuseppe Amato; Jan Sedmidubsky;
ISTI-CNR; Masaryk University;
Recent progress in pose-estimation methods enables the extraction of sufficiently precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted in the form of a continuous skeleton sequence without any information about semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation that is learned using the Variational AutoEncoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups up to 64x on common skeleton sequence datasets.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Pernici; Matteo Bruni; Niccolò Biondi
University of Florence;
Compatible features enable the direct comparison of old and new learned features, allowing them to be used interchangeably over time. In visual search systems, this eliminates the need to extract new features from the gallery-set when the representation model is upgraded with novel data. This is of great value in real applications, as re-indexing the gallery-set can be computationally expensive when the gallery-set is large, or even infeasible due to privacy or other concerns of the application. In this paper, we propose CoReS, a new training procedure to learn representations that are compatible with those previously learned, grounded on the stationarity of the features as provided by fixed classifiers based on polytopes. With this solution, classes are maximally separated in the representation space and maintain their spatial configuration stationary as new classes are added, so that there is no need to learn any mappings between representations nor to impose pairwise training with the previously learned model. We demonstrate that our training procedure largely outperforms the current state of the art and is particularly effective in the case of multiple upgrades of the training set, which is the typical case in real applications.
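One simple way to realise a fixed, polytope-based classifier is to freeze the class prototypes at the vertices of a regular simplex, as in the sketch below (the construction and dimensions are illustrative, not the paper's exact setup).

import torch
import torch.nn as nn

class FixedSimplexClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        # C centred, normalised one-hot vectors are the vertices of a regular simplex
        # lying in a (C-1)-dimensional subspace of R^C: equidistant, maximally separated.
        vertices = torch.eye(num_classes) - 1.0 / num_classes
        vertices = vertices / vertices.norm(dim=1, keepdim=True)
        self.register_buffer("prototypes", vertices)                   # fixed, never trained
        self.project = nn.Linear(feat_dim, num_classes, bias=False)    # trainable projection head

    def forward(self, features):
        z = torch.nn.functional.normalize(self.project(features), dim=1)
        return z @ self.prototypes.t()                                 # cosine-style logits

clf = FixedSimplexClassifier(feat_dim=128, num_classes=10)
print(clf(torch.randn(4, 128)).shape)   # torch.Size([4, 10])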
Open Access
Journal article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Roberto Caldelli
ISTI-CNR; Mercatorum University;
The increasing use of deep learning techniques to manipulate images and videos, commonly referred to as “deepfakes,” is making it more and more challenging to differentiate between real and fake content. While various deepfake detection systems have been developed, they often struggle to detect deepfakes in real-world situations. In particular, these methods are often unable to effectively distinguish images or videos when these are modified using novel techniques which have not been used in the training set. In this study, we carry out an analysis of different deep learning architectures in an attempt to understand which is better able to generalize the concept of deepfake. According to our results, Convolutional Neural Networks (CNNs) seem to be more capable of storing specific anomalies and thus excel in cases of datasets with a limited number of elements and manipulation methodologies. The Vision Transformer, conversely, is more effective when trained with more varied datasets, achieving more outstanding generalization capabilities than the other methods analysed. Finally, the Swin Transformer appears to be a good alternative for using an attention-based method in a more limited data regime and performs very well in cross-dataset scenarios. All the analyzed architectures seem to look at deepfakes in different ways, but since generalization capability is essential in a real-world environment, the experiments carried out suggest that attention-based architectures provide superior performance.
Open Access
Journal article
N/A
Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;
ETH Zurich; Terminus AI Lab; University of Oxford; University of Trento;
We present a novel bipartite graph reasoning Generative Adversarial Network (BiGraphGAN) for two challenging tasks: person pose and facial image synthesis. The proposed graph generator consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed bipartite graph reasoning (BGR) block aims to reason about the long-range cross relations between the source and target pose in a bipartite graph, which mitigates some of the challenges caused by pose deformation. Moreover, we propose a new interaction-and-aggregation (IA) block to effectively update and enhance the feature representation capability of both a person’s shape and appearance in an interactive way. To further capture the change in pose of each part more precisely, we propose a novel part-aware bipartite graph reasoning (PBGR) block to decompose the task of reasoning the global structure transformation with a bipartite graph into learning different local transformations for different semantic body/face parts. Experiments on two challenging generation tasks with three public datasets demonstrate the effectiveness of the proposed methods in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/BiGraphGAN.
Closed Access
Journal article
International Journal of Computer Vision
Bin Ren; Hao Tang; Lei Ding; Nicu Sebe; Paolo Rota; Songsong Wu;
ETH Zurich; Guangdong University of Petrochemical Technology; University of Trento;
Video processing and analysis have become an urgent task since a huge amount of videos (e.g., YouTube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods have not balanced performance and efficiency well. To tackle this problem, this work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) and Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a generic and powerful framework with two advantages compared with previous works: it can calculate the number of key frames automatically, and it can preserve the temporal information of the video. Thus it improves the efficiency of video classification. Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance. Moreover, a weight fusion strategy of different input networks is presented to boost the performance. By optimizing both video classification and key frame extraction simultaneously, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51 and UCF101) and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
Closed Access
Journal article
ACM Transactions on Multimedia Computing, Communications, and Applications
Christos Tzelepis; Ioannis Patras; Nicu Sebe; Simone Barattin;
Queen Mary University of London; University of Trento;
This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset remains useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimizes the images’ latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure that the identity is at a desired distance from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL’s deep feature space). We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst, crucially, better preserving the facial attributes.
Open Access
Conference paper
N/A
Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszeck; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;
Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology
Data scarcity has become one of the main obstacles to developing supervised models based on Artificial Intelligence in Computer Vision. Indeed, Deep Learning-based models systematically struggle when applied in new scenarios never seen during training and may not be adequately tested in non-ordinary yet crucial real-world situations. This paper presents and publicly releases CrowdSim2, a new synthetic collection of images suitable for people and vehicle detection gathered from a simulator based on the Unity graphical engine. It consists of thousands of images gathered from various synthetic scenarios resembling the real world, where we varied some factors of interest, such as the weather conditions and the number of objects in the scenes. The labels are automatically collected and consist of bounding boxes that precisely localize objects belonging to the two object classes, leaving out humans from the annotation pipeline. We exploited this new benchmark as a testing ground for some state-of-the-art detectors, showing that our simulated scenarios can be a valuable tool for measuring their performances in a controlled environment.
Open Access
Conference paper
N/A
Adam Cygan; Agnieszka Szczesna; Bartosz Bizón; Dominik Golba; Elzbieta Macioszeck; Luca Ciampi; Michal Cogiel; Michal Staniszewski; Nicola Messina; Pawel Foszner;
Blees; ISTI-CNR; QSystem.pro; Silesian University of Technology
Generally, crowd datasets can be collected or generated from real or synthetic sources. Real data is generated by using infrastructure-based sensors (such as static cameras or other sensors). The use of simulation tools can significantly reduce the time required to generate scenario-specific crowd datasets, facilitate data-driven research, and subsequently help build functional machine learning models. The main goal of this work was to develop an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment, weather conditions, traffic, and the movement and models of individual agents. Finally, three tracking methods were used to validate the generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari;
Università degli Studi di Firenze; University of Florence;
In this paper we address the problem of trajectory prediction, focusing on memory-based models. Such methods are trained to collect a set of useful samples that can be retrieved and used at test time to condition predictions. We propose Explainable Sparse Attention (ESA), a module that can be seamlessly plugged into several existing memory-based state-of-the-art predictors. ESA generates a sparse attention in memory, thus selecting a small subset of memory entries that are relevant for the observed trajectory. This enables an explanation of the model’s predictions with reference to previously observed training samples. Furthermore, we demonstrate significant improvements on three trajectory prediction datasets.
Open Access
Conference paper
N/A
Adrien Depeursinge; Davide Calvaresi; Henning Müller; John O. Prior; José Pereira Amorim; Katerina Yordanova; Lidia Dutklewicz; Lode Lauwaert; Mara Graziani; Mor Vered; Pedro Henriques Abreu; Rahul Nair; Tobias Blanke; Valeria Pulignano; Vincent Andrearczy; Wessel Reijers;
European University Institute; Faculty of Social Science of Leuven; IPO - Porto Research Centre; Lausanne University Hospital; University of Amsterdam; University of Applied Sciences of Western Switzerland; University of Coimbra; University of Geneva;
Since its emergence in the 1960s, Artificial Intelligence (AI) has grown to conquer many technology products and their fields of application. Machine learning, as a major part of the current AI solutions, can learn from the data and through experience to reach high performance on various tasks. This growing success of AI algorithms has led to a need for interpretability to understand opaque models such as deep neural networks. Various requirements have been raised from different domains, together with numerous tools to debug, justify outcomes, and establish the safety, fairness and reliability of the models. This variety of tasks has led to inconsistencies in the terminology with, for instance, terms such as interpretable, explainable and transparent being often used interchangeably in methodology papers. These words, however, convey different meanings and are “weighted” differently across domains, for example in the technical and social sciences. In this paper, we propose an overarching terminology of interpretability of AI systems that can be referred to by the technical developers as much as by the social sciences community to pursue clarity and efficiency in the definition of regulations for ethical and reliable AI development. We show how our taxonomy and definition of interpretable AI differ from the ones in previous research and how they apply with high versatility to several domains and use cases, proposing a highly needed standard for the communication among interdisciplinary areas of AI.
Open Access
Journal article
N/A
Adrian Popescu; David Picard; Grégoire Petit; Hugo Schindler;
Université Gustave Eiffel; Université Paris-Saclay;
Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases. FeTrIL code is available at https://github.com/GregoirePetit/FeTrIL.
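The pseudo-features generator described above reduces to a simple translation of feature vectors; a minimal sketch of that step, with hypothetical variable names, could look as follows.

    import numpy as np

    def pseudo_features(new_feats, new_centroid, past_centroid):
        # new_feats: (N, D) features of current-class samples from the fixed extractor
        # new_centroid / past_centroid: stored (D,) mean features of the two classes
        # shift the new-class features so their mean coincides with the past-class centroid
        return new_feats - new_centroid + past_centroid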
Open Access
Conference paper
N/A
Christos Tzelepis; Ioannis Pitas; James Oldfield; Mihalis Nicolaou; Yannis Panagakis
Cyprus Institute; Queen Mary University of London; University of Athens
Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at https://github.com/james-oldfield/PandA.
Open Access
Conference paper
International Conference on Learning Representations
Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhiming Luo; Zhun Zhong
National University of Singapore; University of Trento; Xiamen University
In this work, we introduce a new concept, named source-free open compound domain adaptation (SF-OCDA), and study it in semantic segmentation. SF-OCDA is more challenging than the traditional domain adaptation but it is more practical. It jointly considers (1) the issues of data privacy and data storage and (2) the scenario of multiple target domains and unseen open domains. In SF-OCDA, only the source pre-trained model and the target data are available to learn the target model. The model is evaluated on the samples from the target and unseen open domains. To solve this problem, we present an effective framework by separating the training process into two stages: (1) pre-training a generalized source model and (2) adapting a target model with self-supervised learning. In our framework, we propose the Cross-Patch Style Swap (CPSS) to diversify samples with various patch styles in the feature-level, which can benefit the training of both stages. First, CPSS can significantly improve the generalization ability of the source model, providing more accurate pseudo-labels for the latter stage. Second, CPSS can reduce the influence of noisy pseudo-labels and also avoid the model overfitting to the target domain during self-supervised learning, consistently boosting the performance on the target and open domains. Experiments demonstrate that our method produces state-of-the-art results on the C-Driving dataset. Furthermore, our model also achieves the leading performance on CityScapes for domain generalization.
Open Access
Journal article
IEEE Transactions on Circuits and Systems for Video Technology
Hao Tang; Mengyi Zhao; Nicu Sebe; Wei Wang; Yue Song
ETH Zurich; University of Trento;
Modern saliency detection models are based on the encoder-decoder framework and they use different strategies to fuse the multi-level features between the encoder and decoder to boost representation power. Motivated by recent work in implicit modelling, we propose to introduce an implicit function to simulate the equilibrium state of the feature pyramid at infinite depths. We question the existence of the ideal equilibrium and thus propose a quasi-equilibrium model by taking the first-order derivative into the black-box root solver using Taylor expansion. It models more realistic convergence states and significantly improves the network performance. We also propose a differentiable edge extractor that directly extracts edges from the saliency masks. By optimizing the extracted edges, the generated saliency masks are naturally optimized on contour constraints and the non-deterministic predictions are removed. We evaluate the proposed methodology on five public datasets and extensive experiments show that our method achieves new state-of-the-art performances on six metrics across datasets.
Closed Access
Journal article
IEEE Transactions on Image Processing
Hao Tang; Nicu Sebe; Wei Wang; Yue Song
ETH Zurich; University of Trento;
Salient object detection has been long studied to identify the most visually attractive objects in images/videos. Recently, a growing number of approaches have been proposed, all of which rely on the contour/edge information to improve detection performance. The edge labels are either put into the loss directly or used as extra supervision. The edge and body can also be learned separately and then fused afterward. Both methods either lead to high prediction errors near the edge or cannot be trained in an end-to-end manner. Another problem is that existing methods may fail to detect objects of various sizes due to the lack of efficient and effective feature fusion mechanisms. In this work, we propose to decompose the saliency detection task into two cascaded sub-tasks, i.e., detail modelling and body filling. Specifically, the detail modelling focuses on capturing the object edges by supervision of an explicitly decomposed detail label that consists of the pixels that are nested on the edge and near the edge. Then the body filling learns the body part which will be filled into the detail map to generate a more accurate saliency map. To effectively fuse the features and handle objects at different scales, we have also proposed two novel multi-scale detail attention and body attention blocks for precise detail and body modelling. Experimental results show that our method achieves state-of-the-art performances on six public datasets.
Open Access
Journal article
ACM Multimedia Systems Conference
Cristiano Saltori; Elisa Ricci; Fabio Poiesi; Guofeng Mei; Jian Zhang; Nicu Sebe; Qiang Wu
University of Technology Sydney; University of Trento; Vision Lab Fondazione
Unsupervised learning on 3D point clouds has undergone a rapid evolution, especially thanks to data augmentation-based contrastive methods. However, data augmentation is not ideal as it requires a careful selection of the type of augmentations to perform, which in turn can affect the geometric and semantic information learned by the network during self-training. To overcome this issue, we propose an augmentation-free unsupervised approach for point clouds to learn transferable point-level features via soft clustering, named SoftClu. SoftClu assumes that the points belonging to a cluster should be close to each other in both geometric and feature spaces. This differs from typical contrastive learning, which builds similar representations for a whole point cloud and its augmented versions. We exploit the affiliation of points to their clusters as a proxy to enable self-training through a pseudo-label prediction task. Under the constraint that these pseudo-labels induce the equipartition of the point cloud, we cast SoftClu as an optimal transport problem. We formulate an unsupervised loss to minimize the standard cross-entropy between pseudo-labels and predicted labels. Experiments on downstream applications, such as 3D object classification, part segmentation, and semantic segmentation, show the effectiveness of our framework in outperforming state-of-the-art techniques.
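Equipartition-constrained pseudo-labelling of this kind is commonly solved with a few Sinkhorn-Knopp normalisation steps; the sketch below shows that generic procedure rather than the paper's exact formulation, and the epsilon and iteration count are placeholders.

    import numpy as np

    def equipartition_pseudo_labels(scores, eps=0.05, n_iters=50):
        # scores: (N, K) similarity of each point to each of the K cluster centroids
        Q = np.exp(scores / eps)
        Q /= Q.sum()
        N, K = Q.shape
        for _ in range(n_iters):
            Q /= Q.sum(axis=0, keepdims=True)  # each cluster receives the same total mass
            Q /= K
            Q /= Q.sum(axis=1, keepdims=True)  # each point distributes the same total mass
            Q /= N
        return Q * N  # rows are soft pseudo-label distributions summing to 1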
Open Access
Conference paper
British Machine Vision Conference
Nicu Sebe; Wei Wang; Yue Song
Beijing Jiaotong University; University of Trento;
The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose RankFeat, a simple yet effective post hoc approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. RankFeat achieves state-of-the-art performance and reduces the average false positive rate (FPR95) by 17.90% compared with the previous best method. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
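The post hoc step the abstract describes amounts to subtracting the leading rank-1 component from the high-level feature matrix; a minimal batched PyTorch sketch, with the (B, C, H*W) reshaping assumed on our side, is given below.

    import torch

    def remove_rank1(feature):
        # feature: (B, C, H*W) high-level feature map reshaped into a matrix per sample
        U, S, Vh = torch.linalg.svd(feature, full_matrices=False)
        # rank-1 matrix built from the largest singular value and its singular vectors
        rank1 = S[..., 0, None, None] * (U[..., :, 0:1] @ Vh[..., 0:1, :])
        return feature - rank1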
Open Access
Conference paper
Conference on Neural Information Processing Systems
Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhun Zhong
National University of Singapore; University of Trento;
In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model’s performance and the style features can be well represented by the channel-wise mean and standard deviation of images. Inspired by this, we propose a novel adversarial style augmentation (AdvStyle) approach, which can dynamically generate hard stylized images during training and thus can effectively prevent the model from overfitting on the source domain. Specifically, AdvStyle regards the style feature as a learnable parameter and updates it by adversarial training. The learned adversarial style feature is used to construct an adversarial image for robust model training. AdvStyle is easy to implement and can be readily applied to different models. Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains and show that we can achieve the state of the art. Moreover, AdvStyle can be employed to domain generalized image classification and produces a clear improvement on the considered datasets.
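A rough sketch of the adversarial style update described above, where the channel-wise mean and standard deviation act as learnable style parameters pushed in the loss-ascent direction; the single sign-gradient step, the step size and the function name are our assumptions rather than the paper's exact procedure.

    import torch

    def advstyle_step(images, model, labels, criterion, step=1.0):
        # channel-wise statistics of the input are treated as learnable style parameters
        mu = images.mean(dim=(2, 3), keepdim=True)
        sigma = images.std(dim=(2, 3), keepdim=True) + 1e-6
        adv_mu = mu.clone().detach().requires_grad_(True)
        adv_sigma = sigma.clone().detach().requires_grad_(True)
        normalized = (images - mu) / sigma
        loss = criterion(model(adv_sigma * normalized + adv_mu), labels)
        g_mu, g_sigma = torch.autograd.grad(loss, [adv_mu, adv_sigma])
        # gradient ascent on the style parameters yields a harder stylisation
        adv_mu = adv_mu + step * g_mu.sign()
        adv_sigma = adv_sigma + step * g_sigma.sign()
        return (adv_sigma * normalized + adv_mu).detach()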
Open Access
Conference paper
Conference on Neural Information Processing Systems
Gim Hee Lee; Na Zhao; Nicu Sebe; Yuyang Zhao; Zhun Zhong
National University of Singapore; University of Trento;
In this paper, we study the task of synthetic-to-real domain generalized semantic segmentation, which aims to learn a model that is robust to unseen real-world scenes using only synthetic data. The large domain shift between synthetic and real-world data, including the limited source environmental variations and the large distribution gap between synthetic and real-world data, significantly hinders the model performance on unseen real-world scenes. In this work, we propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework to handle such domain shift. Specifically, SHADE is constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages real-world knowledge to prevent the model from overfitting to synthetic data and thus largely keeps the representation consistent between the synthetic and real-world models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Experiments show that our SHADE yields significant improvement and outperforms state-of-the-art methods by 5.05% and 8.35% on the average mIoU of three real-world datasets on single- and multi-source settings, respectively.
Open Access
Conference paper
European Conference on Computer Vision
Andrea Pilzer; Arno Solin; Elisa Ricci; Juho Kannala; Martin Trapp; Nicu Sebe; Subhankar Roy;
Aalto University; Fondazione Bruno Kessler; NVIDIA; University of Trento;
Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by only using a pre-trained source model. However, the absence of the source data and the domain shift makes the predictions on the target data unreliable. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation. For this, we construct a probabilistic source model by incorporating priors on the network parameters inducing a distribution over the model predictions. Uncertainties are estimated by employing a Laplace approximation and incorporated to identify target data points that do not lie in the source manifold and to down-weight them when maximizing the mutual information on the target data. Unlike recent works, our probabilistic treatment is computationally lightweight, decouples source training and target adaptation, and requires no specialized source training or changes of the model architecture. We show the advantages of uncertainty-guided SFDA over traditional SFDA in the closed-set and open-set settings and provide empirical evidence that our approach is more robust to strong domain shifts even without tuning.
Open Access
Conference paper
N/A
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. One crucial bottleneck limiting its usage is the expensive computation cost, particularly for a mini-batch of matrices in the deep neural networks. In this paper, we propose a QR-based ED method dedicated to the application scenarios of computer vision. Our proposed method performs the ED entirely by batched matrix/vector multiplication, which processes all the matrices simultaneously and thus fully utilizes the power of GPUs. Our technique is based on the explicit QR iterations by Givens rotation with double Wilkinson shifts. With several acceleration techniques, the time complexity of QR iterations is reduced from O(n^5) to O(n^3). The numerical test shows that for small and medium batched matrices (e.g., dim < 32) our method can be much faster than the Pytorch SVD function. Experimental results on visual recognition and image generation demonstrate that our methods also achieve competitive performances.
Open Access
Conference paper
European Conference on Computer Vision
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which could harm the model in the training stability and generalization abilities. In this paper, we systematically study how to improve the covariance conditioning by enforcing orthogonality to the Pre-SVD layer. Existing orthogonal treatments on the weights are first investigated. However, these techniques can improve the conditioning but would hurt the performance. To avoid such a side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization. Moreover, the combinations with orthogonal weight can further boost the performances.
Open Access
Conference paper
European Conference on Computer Vision
Elisa Ricci; Mingxuan Liu; Nicu Sebe; Subhankar Roy; Zhun Zhong
Fondazione Bruno Kessler; University of Trento;
We study the new task of class-incremental Novel Class Discovery (class-iNCD), which refers to the problem of discovering novel categories in an unlabelled data set by leveraging a pre-trained model that has been trained on a labelled data set containing disjoint yet related categories. Apart from discovering novel classes, we also aim at preserving the ability of the model to recognize previously seen base categories. Inspired by rehearsal-based incremental learning methods, in this paper we propose a novel approach for class-iNCD which prevents forgetting of past information about the base classes by jointly exploiting base class feature prototypes and feature-level knowledge distillation. We also propose a self-training clustering strategy that simultaneously clusters novel categories and trains a joint classifier for both the base and novel classes. This makes our method able to operate in a class-incremental setting. Our experiments, conducted on three common benchmarks, demonstrate that our method significantly outperforms state-of-the-art approaches. Code is available at https://github.com/OatmealLiu/class-iNCD.
Open Access
Conference paper
European Conference on Computer Vision
Andrea Esuli; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;
ISTI-CNR;
Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by inter-playing image and text features from the two different processing pipelines, usually using mutual attention mechanisms. However, this invalidates any chance to extract separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to separately reason on the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at https://github.com/mesnico/TERN
Open Access
Conference paper
N/A
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
In this paper, we present the fourth release of VISIONE, a tool for fast and effective video search on a large-scale dataset. It includes several search functionalities like text search, object and color-based search, semantic and visual similarity search, and temporal search. VISIONE uses ad-hoc textual encoding for indexing and searching video content, and it exploits a full-text search engine as search backend. In this new version of the system, we introduced some changes both to the current search techniques and to the user interface.
Open Access
Conference paper
N/A
Hannes Fassold; Werner Bailer;
Joanneum Research;
In order to support common annotation tasks in visual media production and archiving, we propose two datasets which cover the annotation of the bustle of a scene (i.e., populated to unpopulated), the cinematographic type of a shot as well as the time of day and season of a shot. The dataset for bustle and shot type, called People@Places, adds annotations to the Places365 dataset, and the ToDY (time of day/year) dataset adds annotations to the SkyFinder dataset. For both datasets, we provide a toolchain to create automatic annotations, which have been manually verified and corrected for parts of the two datasets. We provide baseline results for these tasks using the EfficientNet-B3 model, pretrained on the Places365 dataset.
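The baseline mentioned above is a standard image classifier fine-tuned for the new annotation tasks; a hypothetical setup using torchvision (ImageNet weights standing in for the Places365 pre-training used in the paper, and num_classes being a placeholder) might look like this.

    import torch
    import torchvision

    num_classes = 4  # placeholder, e.g. the label set of one annotation task
    model = torchvision.models.efficientnet_b3(weights="IMAGENET1K_V1")
    # replace the classification head for the new annotation task
    model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, num_classes)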
Open Access
Conference paper
MultiMedia Modeling
Alberto Del Bimbo; Claudio Ferrari; Daoudi Mohammed; Filippo Principi; Naima Otberdout; Stefano Berretti
University of Florence; University of Parma
Human facial expressions change dynamically, so their recognition / analysis should be conducted by accounting for the temporal evolution of face deformations either in 2D or 3D. While abundant 2D video data do exist, this is not the case in 3D, where few 3D dynamic (4D) datasets have been released for public use. The negative consequence of this scarcity of data is amplified by current deep learning based methods for facial expression analysis, which require large quantities of variegate samples to be effectively trained. With the aim of mitigating such limitations, in this paper we propose a large dataset, named Florence 4D, composed of dynamic sequences of 3D face models, where a combination of synthetic and real identities exhibit an unprecedented variety of 4D facial expressions, with variations that include the classical neutral-apex transition but generalize to expression-to-expression. All these characteristics are not exposed by any of the existing 4D datasets and they cannot even be obtained by combining more than one dataset. We strongly believe that making such a data corpus publicly available to the community will allow designing and experimenting with new applications that were not possible to investigate till now. To show, to some extent, the difficulty of our data in terms of different identities and varying expressions, we also report a baseline experimentation on the proposed dataset.
Open Access
Conference paper
N/A
Adrian Popescu; Céline Hudelot; Eva Feillet; Grégoire Petit; Marina Reyboz
Université Grenoble Alpes; Université Gustave Eiffel; Université Paris-Saclay;
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Hao Tang; Ling Shao; Nicu Sebe; Philip Torr;
ETH Zurich; University of Oxford; University of Trento;
In this paper, we address the task of semantic-guided image generation. One challenge common to most existing image-level generation methods is the difficulty in generating small objects and detailed local textures. To address this, in this work we consider generating images using local context. As such, we design a local class-specific generative network using semantic maps as guidance, which separately constructs and learns subgenerators for different classes, enabling it to capture finer details. To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module. To combine the advantages of both global image-level and local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Lastly, we propose a novel semantic-aware upsampling method, which has a larger receptive field and can take far-away pixels that are semantically related for feature upsampling, enabling it to better preserve semantic consistency for instances with the same semantic labels. Extensive experiments on two image generation tasks show the superior performance of the proposed method. State-of-the-art results are established by large margins on both tasks and on nine challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.
Closed Access
Journal article
IEEE Transaction on Pattern Analysis and Machine Intelligence
Artem Yaroshchuk; Luca Cuccovillo; Malte Baum; Patrick Aichroth;
Fraunhofer IDMT;
In this paper we present a novel approach for environment classification for speech recordings, which does not require the selection of decaying reverberation tails. It is based on a multi-band RT60 analysis of blind channel estimates and achieves an accuracy of up to 93.8% on test recordings derived from the ACE corpus.
Open Access
Conference paper
Transactions on Information Forensics and Security
Adrian Weller; Francesco Giannini; Frédéric Precioso; Gabriele Ciravegna; Giuseppe Marra; Mateja Jamnik; Mateo Espinosa Zarlenga; Michelangelo Diligenti; Pietro Barbiero; Pietro Lio; Stefano Melacci; Zohreh Shams
KU Leuven; Université Côte d'Azur; University of Siena
Deploying AI-powered systems requires trustworthy models supporting effective human interactions, going beyond raw prediction accuracy. Concept bottleneck models promote trustworthiness by conditioning classification tasks on an intermediate level of human-like concepts. This enables human interventions which can correct mispredicted concepts to improve the model’s performance. However, existing concept bottleneck models are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts, particularly in real-world conditions where complete and accurate concept supervisions are scarce. To address this, we propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable high-dimensional concept representations. Our experiments demonstrate that Concept Embedding Models (1) attain better or competitive task accuracy w.r.t. standard neural models without concepts, (2) provide concept representations capturing meaningful semantics including and beyond their ground truth labels, (3) support test-time concept interventions whose effect in test accuracy surpasses that in standard concept bottleneck models, and (4) scale to real-world conditions where complete concept supervisions are scarce.
Open Access
Conference paper
Conference on Neural Information Processing Systems
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London;
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that influenced the most the estimates of a video summarization network, about the frames’ importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network’s input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
Open Access
Conference paper
IEEE International Symposium on Multimedia
Fabio Valerio Massoli; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
In recent years, Quantum Computing witnessed massive improvements in terms of available resources and algorithms development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community’s interest since the late 80s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations, and then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state-of-the-art concerning Quantum Perceptrons and Quantum Neural Networks implementations.
Open Access
Journal article
ACM Computing Surveys
Chen Feng; Ioannis Patras;
Queen Mary University of London;
Self-supervised learning has recently achieved great success in representation learning without human annotations. The dominant method, contrastive learning, is generally based on instance discrimination tasks, i.e., individual samples are treated as independent categories. However, presuming all the samples are different contradicts the natural grouping of similar samples in common visual datasets, e.g., multiple views of the same dog. To bridge the gap, this paper proposes an adaptive method that introduces soft inter-sample relations, namely Adaptive Soft Contrastive Learning (ASCL). More specifically, ASCL transforms the original instance discrimination task into a multi-instance soft discrimination task, and adaptively introduces inter-sample relations. As an effective and concise plug-in module for existing self-supervised learning frameworks, ASCL achieves the best performance on several benchmarks in terms of both performance and efficiency. Code is available at https://github.com/MrChenFeng/ASCL_ICPR2022.
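To make the idea of soft inter-sample relations concrete, the sketch below softens the usual one-hot instance-discrimination target with similarities to other samples in a feature bank; the function name, the mixing weight and the temperature are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def soft_instance_targets(anchor_feats, bank_feats, self_idx, temperature=0.1, mix=0.5):
        # anchor_feats: (B, D) batch features; bank_feats: (N, D) feature bank
        # self_idx: (B,) index of each anchor inside the bank
        anchor_feats = F.normalize(anchor_feats, dim=1)
        bank_feats = F.normalize(bank_feats, dim=1)
        relations = F.softmax(anchor_feats @ bank_feats.t() / temperature, dim=1)
        hard = F.one_hot(self_idx, num_classes=bank_feats.size(0)).float()
        # mix the hard instance label with the soft inter-sample relations
        return (1 - mix) * hard + mix * relations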
Open Access
Conference paper
N/A
Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;
ISTI-CNR; Mercatorum University; National Inter-University Consortium for Telecommunications;
The adoption of deep learning-based solutions practically pervades all the diverse areas of our everyday life, showing improved performances with respect to other classical systems. Since many applications deal with sensitive data and procedures, a strong demand to know the actual reliability of such technologies is always present. This work analyzes the robustness characteristics of a specific kind of deep neural network, the neural ordinary differential equations (N-ODE) network. They seem very interesting for their effectiveness and a peculiar property based on a test-time tunable parameter that permits obtaining a trade-off between accuracy and efficiency. In addition, adjusting such a tolerance parameter grants robustness against adversarial attacks. Notably, it is worth highlighting how decoupling the values of such a tolerance between training and test time can strongly reduce the attack success rate. On this basis, we show how such tolerance can be adopted, during the prediction phase, to improve the robustness of N-ODE to adversarial attacks. In particular, we demonstrate how we can exploit this property to construct an effective detection strategy and increase the chances of identifying adversarial examples in a non-zero knowledge attack scenario. Our experimental evaluation involved two standard image classification benchmarks. This showed that the proposed detection technique provides high rejection of adversarial examples while maintaining most of the pristine samples.
Open Access
Journal article
N/A
Chen Feng; Georgios Tzimiropoulos; Ioannis Patras;
Queen Mary University of London;
Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such a context, how to learn in the presence of noisy labels has received more and more attention. As a relatively complex problem, in order to achieve good results, current approaches often integrate components from several fields, such as supervised learning, semi-supervised learning and transfer learning, resulting in complicated methods. Furthermore, they often make multiple assumptions about the type of noise of the data. This affects the model robustness and limits its performance under different noise conditions. In this paper, we consider a novel problem setting, Learning with Unknown Label Noise (LULN), that is, learning when both the degree and the type of noise are unknown. Under this setting, unlike previous methods that often introduce multiple assumptions and lead to complex solutions, we propose a simple, efficient and robust framework named Sample Selection and Relabelling (SSR), which, with a minimal number of hyperparameters, achieves SOTA results in various conditions. At the heart of our method is a sample selection and relabelling mechanism based on a non-parametric KNN classifier (NPK) $g_q$ and a parametric model classifier (PMC) $g_p$, respectively, to select the clean samples and gradually relabel the noisy samples. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as WebVision, Clothing1M and ANIMAL-10N. Code is available at https://github.com/MrChenFeng/SSR_BMVC2022.
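A rough sketch of such a selection-and-relabelling mechanism is given below: a KNN classifier over features flags samples whose neighbourhood agrees with the given label as clean, while a confident model prediction relabels the rest; the threshold, the value of k and the exact criteria are placeholders and may differ from the paper.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def select_and_relabel(features, noisy_labels, model_probs, k=20, tau=0.8):
        # features: (N, D); noisy_labels: (N,); model_probs: (N, C) softmax outputs of the model classifier
        knn = KNeighborsClassifier(n_neighbors=k).fit(features, noisy_labels)
        knn_pred = knn.predict(features)
        clean_mask = knn_pred == noisy_labels              # neighbourhood agrees with the given label
        confident = model_probs.max(axis=1) > tau
        new_labels = noisy_labels.copy()
        relabel_mask = (~clean_mask) & confident
        new_labels[relabel_mask] = model_probs.argmax(axis=1)[relabel_mask]
        return clean_mask | relabel_mask, new_labels       # training mask and updated labels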
Open Access
Conference paper
N/A
Alberto Del Bimbo; Daniele Mugnai; Federico Pernici; Matteo Bruni; Niccolò Biondi
University of Florence;
In this article, we propose a method to partially mimic natural intelligence for the problem of lifelong learning representations that are compatible. We take the perspective of a learning agent that is interested in recognizing object instances in an open dynamic universe in a way in which any update to its internal feature representation does not render the features in the gallery unusable for visual search. We refer to this learning problem as Compatible Lifelong Learning Representations (CL2R), as it considers compatible representation learning within the lifelong learning paradigm. We identify stationarity as the property that the feature representation is required to hold to achieve compatibility and propose a novel training procedure that encourages local and global stationarity on the learned representation. Due to stationarity, the statistical properties of the learned features do not change over time, making them interoperable with previously learned features. Extensive experiments on standard benchmark datasets show that our CL2R training procedure outperforms alternative baselines and state-of-the-art methods. We also provide novel metrics to specifically evaluate compatible representation learning under catastrophic forgetting in various sequential learning tasks. Code is available at https://github.com/NiccoBiondi/CompatibleLifelongRepresentation.
Open Access
Journal article
ACM Multimedia Systems Conference
Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
A recurring graph analysis task is to rank nodes based on their relevance to overlapping communities of shared metadata attributes (e.g. the interests of social network users). To achieve this, approaches often start with a few example community members and employ graph filters that rank nodes based on their structural proximity to the examples. Choosing between well-known filters typically involves experiments on existing graphs, but their efficacy is known to depend on the structural relations between community members. Therefore, we argue that employed filters should be determined not during algorithm design but at runtime, upon receiving specific graphs and example nodes to process. To do this, we split example nodes into training and validation sets and either perform supervised selection between well-known filters, or account for granular graph dynamics by tuning parameters of the generalized graph filter form with a novel optimization algorithm. Experiments on 27 community node ranking tasks across three real-world networks of various sizes reveal that runtime algorithm selection selects near-best AUC and NDCG among a list of 8 popular alternatives, and that parameter tuning yields similar or improved results in all cases.
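As an illustration of runtime filter selection, the sketch below splits the example nodes into train and validation sets and keeps the personalized-PageRank variant that best ranks the held-out members (AUC); the candidate filters, the 50/50 split and the use of networkx are our own simplifications of the setting described above.

    import random
    import networkx as nx
    from sklearn.metrics import roc_auc_score

    def select_filter(graph, example_nodes, alphas=(0.5, 0.85, 0.99)):
        nodes = list(example_nodes)
        random.shuffle(nodes)
        train, valid = nodes[: len(nodes) // 2], nodes[len(nodes) // 2:]
        best = None
        for alpha in alphas:
            scores = nx.pagerank(graph, alpha=alpha,
                                 personalization={v: 1.0 for v in train})
            rest = [v for v in graph if v not in train]
            y_true = [1 if v in set(valid) else 0 for v in rest]
            y_score = [scores[v] for v in rest]
            auc = roc_auc_score(y_true, y_score)
            if best is None or auc > best[1]:
                best = (alpha, auc)
        return best  # (chosen filter parameter, validation AUC)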
Open Access
Paper
N/A
Italy National Research Council; Silesian University of Technology; University of Pisa
Open Access
Journal article
Sensors
Adrian Tormos; Dario Garcia-Gasulla; Sergio Alvarez-Napagao; Victor Gimenez-Abalos;
Barcelona Supercomputing Center;
In deep learning, transfer learning (TL) has become the de facto approach when dealing with image-related tasks. Visual features learnt for one task have been shown to be reusable for other tasks, improving performance significantly. By reusing deep representations, TL enables the use of deep models in domains with limited data availability, limited computational resources and/or limited access to human experts, domains which include the vast majority of real-life applications. This paper conducts an experimental evaluation of TL, exploring its trade-offs with respect to performance, environmental footprint, human hours and computational requirements. Results highlight the cases where a cheap feature extraction approach is preferable, and the situations where an expensive fine-tuning effort may be worth the added cost. Finally, a set of guidelines on the use of TL is proposed.
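The two regimes compared in the paper, cheap feature extraction versus costly fine-tuning, differ only in whether the pre-trained backbone is frozen; a minimal torchvision sketch of that choice (the backbone and weight tag are our assumptions) is shown below.

    import torch
    import torchvision

    def build_transfer_model(num_classes, fine_tune=False):
        model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        if not fine_tune:
            # feature extraction: reuse the pre-trained representations as-is
            for p in model.parameters():
                p.requires_grad = False
        # the new task head is always trained
        model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
        return model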
Open Access
Conference paper
N/A
Andreas Symeonidis; Emmanouil Krasanakis; Ioannis Kompatsiaris; Symeon Papadopoulos
Aristotle University of Thessaloniki; CERTH - Center for Research and Technology Hellas
We introduce pygrank, an open source Python package to define, run and evaluate node ranking algorithms. We provide object-oriented and extensively unit-tested algorithmic components, such as graph filters, post-processors, measures, benchmarks, and online tuning. Computations can be delegated to numpy, tensorflow, or pytorch backends and fit in back-propagation pipelines. Classes can be combined to define interoperable complex algorithms. Within the context of this paper, we compare the package with related alternatives, describe its architecture, demonstrate its flexibility and ease of use with code examples, and discuss its impact.
Open Access
Journal article
SoftwareX
Fabio Carrara; Fabrizio Falchi; Roberto Caldelli
ISTI-CNR;
Although deep-learning-based solutions are pervading different application sectors, many doubts have arisen about their reliability and, above all, their security against threats that can mislead their decision mechanisms. In this work, we considered a particular kind of deep neural network, the Neural Ordinary Differential Equations (N-ODE) networks, which have shown intrinsic robustness against adversarial samples by properly tuning their tolerance parameter at test time. Their behaviour has never been investigated in image forensics tasks such as distinguishing between an original and an altered image. Following this direction, we demonstrate how tuning the tolerance parameter during the prediction phase can control and increase N-ODE’s robustness versus adversarial attacks. We performed experiments on basic image transformations used to generate tampered data, providing encouraging results in terms of adversarial rejection and preservation of the correct classification of pristine images.
Open Access
Publication
IEEE International Conference on Image Processing
Aldric Ducreux; Auriane Gros; Camille Bauce; Florent Robert; Hui-Yin Wu; Lucile Sassatelli; Marco Wincler; Quentin Guimard
Institut Universitaire de France; Université Côte d'Azur;
While immersive media have been shown to generate more intense emotions, saliency information has been shown to be a key component for the assessment of their quality, owing to the various portions of the sphere (viewports) a user can attend. In this article, we investigate the tri-partite connection between user attention, user emotion and visual content in immersive environments. To do so, we present a new dataset enabling the analysis of different types of saliency, both low-level and high-level, in connection with the user’s state in 360° videos. Head and gaze movements are recorded along with self-reports and continuous physiological measurements of emotions. We then study how the accuracy of saliency estimators in predicting user attention depends on user-reported and physiologically-sensed emotional perceptions. Our results show that high-level saliency better predicts user attention for higher levels of arousal. We discuss how this work serves as a first step to understand and predict user attention and intents in immersive interactive environments.
Open Access
Conference paper
N/A
Claudio Gallicchio; Claudio Gennaro; Davide Bacciu; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR; University of Pisa
Open Access
Conference paper
N/A
Fabrizio Falchi; Giuseppe Amato; Lorenzo Baraldi; Marcella Cornia; Matteo Stefanini; Nicola Messina; Rita Cucchiara;
ISTI-CNR; University of Modena and Reggio Emilia;
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Claudio Gennaro; Gabriele Lagani; Giuseppe Amato; Hannes Fassold;
ISTI-CNR; Joanneum Research; University of Pisa
Open Access
Conference paper
N/A
Ioannis Patras; Niki Maria Foteinopoulou
Queen Mary University of London;
Automated estimation of human affect and mental state faces a number of difficulties, including learning from labels with poor or no temporal resolution, learning from few datasets with little data (often due to confidentiality constraints), and (very) long, in-the-wild videos. For these reasons, deep learning methodologies tend to overfit, that is, arrive at latent representations with poor generalisation performance on the final regression task. To overcome this, in this work, we introduce two complementary contributions. First, we introduce a novel relational loss for multilabel regression and ordinal problems that regularises learning and leads to better generalisation. The proposed loss uses label vector inter-relational information to learn better latent representations by aligning batch label distances to the distances in the latent feature space. Second, we utilise a two-stage attention architecture that estimates a target for each clip by using features from the neighbouring clips as temporal context. We evaluate the proposed methodology on both continuous affect and schizophrenia severity estimation problems, as there are methodological and contextual parallels between the two. Experimental results demonstrate that the proposed methodology outperforms the baselines that are trained using the supervised regression loss, as well as pre-training the network architecture with an unsupervised contrastive loss. In the domain of schizophrenia, the proposed methodology outperforms the previous state-of-the-art by a large margin, achieving a PCC of up to 78%, close to the performance of human experts (85%) and much higher than previous works (an uplift of up to 40%). In the case of affect recognition, we outperform previous vision-based methods in terms of CCC on both the OMG and the AMIGOS datasets. Specifically for AMIGOS, we outperform the previous SoTA CCC for both arousal and valence by 9% and 13% respectively, and on the OMG dataset we outperform previous vision works by up to 5% for both arousal and valence.
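A hedged sketch of the relational-loss idea described above: penalize the mismatch between pairwise distances of the batch label vectors and pairwise distances of the corresponding latent features. The specific distance, normalization and batch size are assumptions, not the paper's exact formulation.

```python
# Sketch of a relational loss aligning batch label distances to latent feature distances.
import torch

def relational_loss(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """features: (B, D) latent vectors; labels: (B, L) multilabel regression targets."""
    d_feat = torch.cdist(features, features, p=2)
    d_label = torch.cdist(labels, labels, p=2)
    # Normalize both distance matrices so only their relational structure is compared.
    d_feat = d_feat / (d_feat.max() + 1e-8)
    d_label = d_label / (d_label.max() + 1e-8)
    return torch.mean((d_feat - d_label) ** 2)

feats = torch.randn(16, 128, requires_grad=True)
labs = torch.rand(16, 3)          # e.g. illustrative valence/arousal/severity targets
loss = relational_loss(feats, labs)
loss.backward()
print(loss.item())
```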
Open Access
Conference paper
N/A
Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
Approximate search for high-dimensional vectors is commonly addressed using dedicated techniques often combined with hardware acceleration provided by GPUs, FPGAs, and other custom in-memory silicon. Despite their effectiveness, harmonizing those optimized solutions with other types of searches often poses technological difficulties. For example, to implement a combined text+image multimodal search, we are forced first to query the index of high-dimensional image descriptors and then filter the results based on the textual query or vice versa. This paper proposes a text surrogate technique to translate real-valued vectors into text and index them with a standard textual search engine such as Elasticsearch or Apache Lucene. This technique allows us to perform approximate kNN searches of high-dimensional vectors alongside classical full-text searches natively on a single textual search engine, enabling multimedia queries without sacrificing scalability. Our proposal exploits a combination of vector quantization and scalar quantization. We compared our approach to the existing literature in this field of research, demonstrating a significant improvement in performance through preliminary experimentation.
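A simplified sketch of the surrogate-text idea, using only scalar quantization rather than the paper's combined vector-plus-scalar scheme: each vector component becomes a textual token repeated in proportion to its quantized magnitude, so a standard term-frequency engine such as Elasticsearch or Lucene can approximate vector similarity with an ordinary inverted index. Token naming and the number of quantization levels are assumptions.

```python
# Simplified surrogate-text encoding (scalar-quantization variant, illustrative only).
import numpy as np

def to_surrogate_text(vector: np.ndarray, levels: int = 8) -> str:
    """Map a non-negative vector to a space-separated surrogate document."""
    quantized = np.floor(vector / (vector.max() + 1e-8) * levels).astype(int)
    tokens = []
    for dim, count in enumerate(quantized):
        tokens.extend([f"f{dim}"] * count)      # token "f<dim>" repeated `count` times
    return " ".join(tokens)

v = np.array([0.9, 0.0, 0.4, 0.1])
print(to_surrogate_text(v))   # repeated "f0" and "f2" tokens, indexable by a full-text engine
```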
Open Access
Journal article
N/A
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras; Stella Bounareli; Vasileios Argyriou;
Kingston University London; Queen Mary University of London;
Open Access
Conference paper
N/A
Aliaksandr Siarohin; Enver Sangineto; Hao Tang; Jichao Zhang; Nicu Sebe; Wei Wang; Zhun Zhong
ETH Zurich; Snap Research; University of Modena and Reggio Emilia; University of Trento;
Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid/semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of great interest for many computer graphics applications. This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations with a photo-realistic, controllable generation. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines. The code is available at https://github.com/zhangqianhui/3DSGAN.
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz;
ISTI-CNR; University of Oviedo
The 2nd International Workshop on Learning to Quantify (LQ 2022 – https://lq-2022.github.io/) was held in Grenoble, FR, on September 23, 2022, as a satellite workshop of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2022). While the 1st edition of the workshop (LQ 2021 – https://cikmlq2021.github.io/, which was instead co-located with the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021)) had to be an entirely online event, LQ 2022 was a hybrid event, with presentations given in presence and with both in-presence and remote attendees.
Open Access
Book
N/A
Cristiano Saltori; Jurandy Almeida; Nicu Sebe; Paolo Rota;
Universidade Federal de São Paulo; University of Trento;
The deep learning revolution happened thanks to the availability of a massive amount of labeled data, which contributed to the development of models with extraordinary inference capabilities. Despite the public availability of large-scale datasets, to address specific requirements it is often necessary to generate a new set of labeled data whose production is often costly and requires specific know-how. In this work, we propose the new problem of low-budget label query, which aims at maximizing the classification performance by selecting a convenient and small set of samples (i.e., low budget) to be manually labeled from an arbitrarily big set of unlabeled data. While a first solution might be the use of pre-trained models with standard selection metrics, i.e. confidence and entropy, we argue that domain shift affects their reliability. We deem that Unsupervised Domain Adaptation (UDA) can be used to reduce domain shift, making selection metrics more reliable and less noisy. Therefore, we first improve a UDA method to better align source and target domains using consistency constraints, reaching performance comparable with the state-of-the-art on several UDA tasks. After adaptation, we conduct an extensive experimental study with commonly used confidence metrics and sampling strategies to achieve low-budget label query on a large variety of publicly available datasets and under different setups.
Closed Access
Journal article
Computer Vision and Image Understanding
Alejandro Moreo; Fabrizio Sebastiani; Martin Senz; Mirko Bunse;
ISTI-CNR;
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabelled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. We here study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms that are proposed by authors from very different research fields, who were unaware of each other’s developments. Third, we propose three OQ algorithms, based on the idea of preventing ordinally implausible estimates through regularization. Our experiments show that these algorithms outperform the existing ones if the ordinal plausibility assumption holds.
Open Access
Conference paper
N/A
Alejandro Moreo; Andrea Pedrotti; Fabrizio Sebastiani;
ISTI-CNR; University of Pisa
Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe Generalized Funnelling (gFun), a generalisation of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation (“view”) of the (monolingual) document. We describe an instance of gFun in which the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated with other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings), word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings), and word-context correlations (as encoded by multilingual BERT). We show that this instance of gFun substantially improves over Fun and over state-of-the-art baselines, by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available.
Open Access
Journal article
ACM Transactions on Information Systems
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani; Gianluca Sperduti
ISTI-CNR;
LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest Y = {y1,…,yn} in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting; this is the first time that an evaluation exercise solely dedicated to quantification is organized. For both the binary setting and the single-label multiclass setting, data were provided to participants both in ready-made vector form and in raw document form. In this overview article we describe the structure of the lab, we report the results obtained by the participants in the four proposed tasks and subtasks, and we comment on the lessons that can be learned from these results.
Open Access
Conference paper
Conference and Labs of the Evaluation Forum
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi; Marco Di Benedetto; Nicola Messina;
ISTI-CNR;
Artificial Intelligence (AI) is increasingly employed to develop public services that make life easier for citizens. In this abstract, we present some research topics and applications pursued by the Artificial Intelligence for Media and Humanities (AIMH) laboratory of ISTI-CNR in Pisa concerning the study and development of AI-based services for Smart Cities, dedicated to interaction with the physical world through the analysis of images gathered from city cameras. Like no other sensing mechanism, networks of city cameras can ‘observe’ the world and simultaneously provide visual data to AI systems to extract relevant information and make/suggest decisions helping to solve many real-world problems. Specifically, we discuss some solutions in the context of smart mobility, parking monitoring, infrastructure management, and surveillance systems.
Open Access
Conference paper
N/A
Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet
University of Malta
This paper proposes a procedural content generator which evolves Minecraft buildings according to an open-ended and intrinsic definition of novelty. To realize this goal we evaluate individuals’ novelty in the latent space using a 3D autoencoder, and alternate between phases of exploration and transformation. During exploration the system evolves multiple populations of CPPNs through CPPN-NEAT and constrained novelty search in the latent space (defined by the current autoencoder). We apply a set of repair and constraint functions to ensure candidates adhere to basic structural rules and constraints during evolution. During transformation, we reshape the boundaries of the latent space to identify new interesting areas of the solution space by retraining the autoencoder with novel content. In this study we evaluate five different approaches for training the autoencoder during transformation and its impact on populations’ quality and diversity during evolution. Our results show that by retraining the autoencoder we can achieve better open-ended complexity compared to a static model, which is further improved when retraining using larger datasets of individuals with diverse complexities.
Open Access
Journal article
IEEE Transactions on Games
Axel Roebel; Lenny Renault; Rémi Mignot;
Sorbonne Université
Recent neural-based synthesis models have achieved impressive results for musical instrument sound generation. In particular, the Differentiable Digital Signal Processing (DDSP) framework enables the usage of spectral modeling analysis and synthesis techniques in fully differentiable architectures. Yet currently, it has only been used for modeling monophonic instruments. Leveraging the interpretability and modularity of this framework, the present work introduces a polyphonic differentiable model for piano sound synthesis, conditioned on Musical Instrument Digital Interface (MIDI) inputs. The model architecture is motivated by high-level acoustic modeling knowledge of the instrument which, in tandem with the sound structure priors inherent to the DDSP components, makes for a lightweight, interpretable and realistic sounding piano model. The proposed model has been evaluated in a listening test, demonstrating improved sound quality compared to a benchmark neural-based piano model, with significantly fewer parameters and even with reduced training data. The same listening test indicates that physical-modeling-based models still achieve better quality, but the differentiability of our lightened approach encourages its usage in other musical tasks dealing with polyphonic audio and symbolic data.
Open Access
Conference paper
International Conference on Digital Audio Effects
Eirini Ntoutsi; Ioannis Kompatsiaris; Simone Fabbrizzi; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas; Freie Universität; Leibniz Universität
Computer Vision (CV) has achieved remarkable results, outperforming humans in several tasks. Nonetheless, it may result in significant discrimination if not handled properly as CV systems highly depend on training datasets and can learn and amplify biases that such datasets may carry. Thus, the problem of understanding and discovering bias in visual datasets is of utmost importance; yet, it has not been studied in a systematic way to date. Hence, this work aims to: (i) describe the different kinds of bias that may manifest in visual datasets; (ii) review the literature on methods for bias discovery and quantification in visual datasets; (iii) discuss existing attempts to collect visual datasets in a bias-aware manner. A key conclusion of our study is that the problem of bias discovery and quantification in visual datasets is still open, and there is room for improvement in terms of both methods and the range of biases that can be addressed. Moreover, there is no such thing as a bias-free dataset, so scientists and practitioners must become aware of the biases in their datasets and make them explicit. To this end, we propose a checklist to spot different types of bias during visual dataset collection.
Open Access
Journal article
Benedetta Tafuri; Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giancarlo Logroscino; Giuseppe Amato; Giuseppe Gigli; Marco Di Benedetto; Roberto De Blasi; Salvatore Nigro;
ISTI-CNR; University of Bari; University of Salento;
Behavioral variant frontotemporal dementia (bvFTD) is a neurodegenerative syndrome whose clinical diagnosis remains a challenging task, especially in the early stage of the disease. Currently, the presence of frontal and anterior temporal lobe atrophies on magnetic resonance imaging (MRI) is part of the diagnostic criteria for bvFTD. However, MRI data processing is usually dependent on the acquisition device and mostly requires human-assisted crafting of feature extraction.
Following the impressive improvements of deep architectures, in this study we report on bvFTD identification using various classes of artificial neural networks, and present the results we achieved on classification accuracy and obliviousness to acquisition devices using extensive hyperparameter search.
In particular, we demonstrate the stability and generalization of different deep networks based on the attention mechanism, where data intra-mixing confers models the ability to identify the disorder even on MRI data in inter-device settings, i.e., on data produced by different acquisition devices and without model fine-tuning, as shown by the very encouraging performance evaluations that reach and exceed 90% on the AuROC and balanced accuracy metrics.
Open Access
Journal article
N/A
Antonio Giganti; Luca Cuccovillo; Paolo Bestagini; Patrick Aichroth; Stefano Tubaro;
Fraunhofer IDMT; Politecnico di Milano;
This work proposes a method for source device identification from speech recordings that applies neural-network-based denoising, to mitigate the impact of counter-forensics attacks using noise injection. The method is evaluated by comparing the impact of denoising on three state-of-the-art features for microphone classification, determining their discriminating power with and without denoising being applied. The proposed framework achieves a significant performance increase for noisy material, and more generally, validates the usefulness of applying denoising prior to device identification for noisy recordings.
Open Access
Conference paper
European Signal Processing Conference
Ahmed Khalifa; Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet
University of Malta
This paper introduces a paradigm shift by viewing the task of affect modeling as a reinforcement learning (RL) process. According to the proposed paradigm, RL agents learn a policy (i.e. affective interaction) by attempting to maximize a set of rewards (i.e. behavioral and affective patterns) via their experience with their environment (i.e. context). Our hypothesis is that RL is an effective paradigm for interweaving affect elicitation and manifestation with behavioral and affective demonstrations. Importantly, our second hypothesis, building on Damasio’s somatic marker hypothesis, is that emotion can be the facilitator of decision-making. We test our hypotheses in a racing game by training Go-Blend agents to model human demonstrations of arousal and behavior; Go-Blend is a modified version of the Go-Explore algorithm which has recently showcased supreme performance in hard exploration tasks. We first vary the arousal-based reward function and observe agents that can effectively display a palette of affect and behavioral patterns according to the specified reward. Then we use arousal-based state selection mechanisms in order to bias the strategies that Go-Blend explores. Our findings suggest that Go-Blend not only is an efficient affect modeling paradigm but, more importantly, affect-driven RL improves exploration and yields higher performing agents, validating Damasio’s hypothesis in the domain of games.
Open Access
Conference paper
International Conference on Affective Computing and Intelligent Interaction Workshops
Ahmed Khalifa; Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet
University of Malta
Using artificial intelligence (AI) to automatically test a game remains a critical challenge for the development of richer and more complex game worlds and for the advancement of AI at large. One of the most promising methods for achieving that long-standing goal is the use of generative AI agents, namely procedural personas, that attempt to imitate particular playing behaviors which are represented as rules, rewards, or human demonstrations. All research efforts for building those generative agents, however, have focused solely on playing behavior which is arguably a narrow perspective of what a player actually does in a game. Motivated by this gap in the existing state of the art, in this paper we extend the notion of behavioral procedural personas to cater for player experience, thus examining generative agents that can both behave and experience their game as humans would. For that purpose, we employ the Go-Explore reinforcement learning paradigm for training human-like procedural personas, and we test our method on behavior and experience demonstrations of more than 100 players of a racing game. Our findings suggest that the generated agents exhibit distinctive play styles and experience responses of the human personas they were designed to imitate. Importantly, it also appears that experience, which is tied to playing behavior, can be a highly informative driver for better behavioral exploration.
Open Access
Conference paper
Conference on the Foundations of Digital Games
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;
University of Malta
Affect modeling is viewed, traditionally, as the process of mapping measurable affect manifestations from multiple modalities of user input to affect labels. That mapping is usually inferred through end-to-end (manifestation-to-affect) machine learning processes. What if, instead, one trains general, subject-invariant representations that consider affect information and then uses such representations to model affect? In this paper we assume that affect labels form an integral part, and not just the training signal, of an affect representation and we explore how the recent paradigm of contrastive learning can be employed to discover general high-level affect-infused representations for the purpose of modeling affect. We introduce three different supervised contrastive learning approaches for training representations that consider affect information. In this initial study we test the proposed methods for arousal prediction in the RECOLA dataset based on user information from multiple modalities. Results demonstrate the representation capacity of contrastive learning and its efficiency in boosting the accuracy of affect models. Beyond their evidenced higher performance compared to end-to-end arousal classification, the resulting representations are general-purpose and subject-agnostic, as training is guided through general affect information available in any multimodal corpus.
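One way to make such an affect-infused contrastive objective concrete is a supervised contrastive loss in which positives are the batch samples sharing the same (discretized) affect label. The sketch below is an illustration under that assumption, not the paper's exact loss; the temperature and the binarization of arousal are arbitrary choices.

```python
# Sketch of a supervised contrastive loss with affect labels defining the positive pairs.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (B, D) embeddings; labels: (B,) discretized affect classes (e.g. high/low arousal)."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                  # (B, B) similarities
    mask_pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                                   # exclude self-pairs
    logits = sim - torch.eye(len(labels)) * 1e9                  # mask self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = mask_pos.sum(dim=1).clamp(min=1)
    return -((mask_pos * log_prob).sum(dim=1) / pos_counts).mean()

feats = torch.randn(8, 64, requires_grad=True)
arousal = torch.randint(0, 2, (8,))              # binarized arousal labels, for illustration
print(supervised_contrastive_loss(feats, arousal))
```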
Open Access
Conference paper
Conference on Multimodal Interaction
Dionysios Karamouzas; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Sentiment analysis in texts, also known as opinion mining, is a significant Natural Language Processing (NLP) task, with many applications in automated social media monitoring, customer feedback processing, e-mail scanning, etc. Despite recent progress due to advances in Deep Neural Networks (DNNs), texts containing figurative language (e.g., sarcasm, irony, metaphors) still pose a challenge to existing methods due to the semantic ambiguities they entail. In this paper, a novel setup of neural knowledge transfer is proposed for DNN-based sentiment analysis of figurative texts. It is employed for distilling knowledge from a pretrained binary recognizer of figurative language into a multiclass sentiment classifier, while the latter is being trained under a multitask setting. Thus, hints about figurativeness implicitly help resolve semantic ambiguities. Evaluation on a relevant public dataset indicates that the proposed method leads to state-of-the-art accuracy.
Open Access
Conference paper
N/A
Hervé Le Borgne; Michel Crucianu; Nicolas Audebert; Perla Doubinsky
CEA; CNAM;
Various controls over the generated data can be extracted from the latent space of a pre-trained GAN, as it implicitly encodes the semantics of the training data. The discovered controls make it possible to vary semantic attributes in the generated images but usually lead to entangled edits that affect multiple attributes at the same time. Supervised approaches typically sample and annotate a collection of latent codes, then train classifiers in the latent space to identify the controls. The data generated by GANs reflect the biases of the original dataset, and so do the resulting semantic controls. We propose to address disentanglement by balancing the semantics of the dataset before training the classifiers. We demonstrate the effectiveness of this approach by extracting disentangled linear directions for face manipulation on state-of-the-art GAN architectures (including StyleGAN2 and StyleGAN3) and two datasets, CelebAHQ and FFHQ. We show that this simple and general approach outperforms state-of-the-art classifier-based methods while avoiding the need for disentanglement-enforcing post-processing.
Open Access
Journal article
Pattern Recognition Letters
Elisa Ricci; Giacomo Zara; Nicu Sebe; Paolo Rota; Thiago Oliveira-Santos; Victor da Costa; Vittorio Murino;
Universidade Federal do Espírito Santo; University of Trento; University of Verona
Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances in spatio-temporal transformers to build a robust source model that better generalises to the target domain. Furthermore, our architecture learns domain invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB->UCF, as well as on Kinetics<->NEC-Drone, which is more challenging. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/UDAVT
Closed Access
Conference paper
International Conference on Pattern Recognition
Alberto Del Bimbo; Brais Bosquet; Daniel Cores; Lorenzo Seidenari; Manuel Mucientes; Victor Brea
University of Florence; University of Santiago de Compostela
Object detection accuracy on small objects, i.e., objects under 32 × 32 pixels, lags behind that of large ones. To address this issue, innovative architectures have been designed and new datasets have been released. Still, the number of small objects in many datasets does not suffice for training. The advent of the generative adversarial networks (GANs) opens up a new data augmentation possibility for training architectures without the costly task of annotating huge datasets for small objects. In this paper, we propose a full pipeline for data augmentation for small object detection which combines a GAN-based object generator with techniques of object segmentation, image inpainting, and image blending to achieve high-quality synthetic data. The main component of our pipeline is DS-GAN, a novel GAN-based architecture that generates realistic small objects from larger ones. Experimental results show that our overall data augmentation method improves the performance of state-of-the-art models up to 11.9% AP@.5 on UAVDT and by 4.7% AP@.5 on iSAID, both for the small objects subset and for a scenario where the number of training instances is limited.
Open Access
Journal article
Pattern Recognition Letters
Ali Shahin Shamsabadi; Aurélien Bellet; Daniel Gatica-Perez; Sina Sajadmanesh
Alan Turing Institute; EPFL; Idiap Research Institute; Inria;
In this paper, we study the problem of learning Graph Neural Networks (GNNs) with Differential Privacy (DP). We propose a novel differentially private GNN based on Aggregation Perturbation (GAP), which adds stochastic noise to the GNN’s aggregation function to statistically obfuscate the presence of a single edge (edge-level privacy) or a single node and all its adjacent edges (node-level privacy). Tailored to the specifics of private learning, GAP’s new architecture is composed of three separate modules: (i) the encoder module, where we learn private node embeddings without relying on the edge information; (ii) the aggregation module, where we compute noisy aggregated node embeddings based on the graph structure; and (iii) the classification module, where we train a neural network on the private aggregations for node classification without further querying the graph edges. GAP’s major advantage over previous approaches is that it can benefit from multi-hop neighborhood aggregations, and guarantees both edge-level and node-level DP not only for training, but also at inference with no additional costs beyond the training’s privacy budget. We analyze GAP’s formal privacy guarantees using Rényi DP and conduct empirical experiments over three real-world graph datasets. We demonstrate that GAP offers significantly better accuracy-privacy trade-offs than state-of-the-art DP-GNN approaches and naive MLP-based baselines.
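The core aggregation-perturbation step can be sketched as follows. This is a simplified illustration of the idea (sum neighbor embeddings of bounded norm, then add Gaussian noise calibrated to that bound), not the full GAP architecture or its privacy accounting; all names and the noise scale are assumptions.

```python
# Minimal sketch of aggregation perturbation: noisy neighbor-sum over row-normalized embeddings.
import torch

def noisy_aggregate(embeddings: torch.Tensor, adjacency: torch.Tensor,
                    noise_std: float) -> torch.Tensor:
    """embeddings: (N, D) node embeddings; adjacency: (N, N) binary adjacency matrix."""
    # Row-normalizing bounds each node's contribution, i.e. the sensitivity of the sum.
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    aggregated = adjacency @ embeddings             # sum of neighbors' embeddings
    return aggregated + noise_std * torch.randn_like(aggregated)

x = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(noisy_aggregate(x, adj, noise_std=1.0).shape)   # torch.Size([5, 16])
```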
Open Access
Conference paper
USENIX Security Symposium
Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi; Marco Di Benedetto
ISTI-CNR;
In many working and recreational activities, there are scenarios where both individual and collective safety have to be constantly checked and properly signaled, as occurring in dangerous workplaces or during pandemic events like the recent COVID-19 disease. From wearing personal protective equipment to filling physical spaces with an adequate number of people, it is clear that a possibly automatic solution would help to check compliance with the established rules. Based on off-the-shelf compact and low-cost hardware, we present a deployed real use-case embedded system capable of perceiving people’s behaviour and aggregations and supervising the application of a set of rules relying on a configurable plug-in framework. Working on indoor and outdoor environments, we show that our implementation of counting people aggregations, measuring their reciprocal physical distances, and checking the proper usage of protective equipment is an effective yet open framework for monitoring human activities in critical conditions.
Open Access
Journal article
N/A
Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
The high popularity of Twitter renders it an excellent tool for political research, while opinion mining through semantic analysis of individual tweets has proven valuable. However, exploiting relevant scientific advances for collective analysis of Twitter messages in order to quantify general public opinion has not been explored. This paper presents such a novel, automated public opinion monitoring mechanism, consisting of a semantic descriptor that relies on Natural Language Processing (NLP) algorithms. A four-dimensional descriptor is first extracted for each tweet independently, quantifying text polarity, offensiveness, bias and figurativeness. Subsequently, it is summarized across multiple tweets, according to a desired aggregation strategy and aggregation target. This can then be exploited in various ways, such as training machine learning models for day-by-day public opinion forecasting. The proposed mechanism is applied to the 2016/2020 US Presidential Elections tweet datasets and the resulting succinct public opinion descriptions are explored as a case study.
Open Access
Journal article
Social Network Analysis and Mining
Ambrish Rawat; Anisa Halimi; Nathalie Baracaldo; Swanand Kadhe;
IBM Research;
With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model forget about some of its training data. We explore the problem of removing any client’s contribution in federated learning (FL). During FL rounds, each client performs local training to learn a model that minimizes the empirical loss on their private data. We propose to perform unlearning at the client (to be erased) by reversing the learning process, i.e., training a model to maximize the local empirical loss. In particular, we formulate the unlearning problem as a constrained maximization problem by restricting to an ℓ2-norm ball around a suitably chosen reference model to help retain some knowledge learnt from the other clients’ data. This allows the client to use projected gradient descent to perform unlearning. The method neither requires global access to the data used for training nor the history of the parameter updates to be stored by the aggregator (server) or any of the clients. Experiments on the MNIST dataset show that the proposed unlearning method is efficient and effective.
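A hedged sketch of the client-side unlearning step described above: one gradient-ascent update on the local empirical loss followed by projection onto an ℓ2-norm ball around a reference model. The learning rate, radius, model and batch are illustrative assumptions, not the paper's configuration.

```python
# Sketch of projected gradient ascent for client-side unlearning in FL.
import torch

def unlearning_step(model: torch.nn.Module, reference: torch.nn.Module,
                    batch, loss_fn, lr: float = 0.01, radius: float = 1.0) -> None:
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.add_(lr * p.grad)                       # gradient *ascent* on the local loss
        # Project the whole parameter vector onto the l2 ball centred at the reference model.
        diff = torch.cat([(p - r).flatten()
                          for p, r in zip(model.parameters(), reference.parameters())])
        norm = diff.norm()
        if norm > radius:
            scale = radius / norm
            for p, r in zip(model.parameters(), reference.parameters()):
                p.copy_(r + scale * (p - r))

# Toy usage with hypothetical models and data.
model = torch.nn.Linear(10, 2)
reference = torch.nn.Linear(10, 2)
batch = (torch.randn(4, 10), torch.randint(0, 2, (4,)))
unlearning_step(model, reference, batch, torch.nn.CrossEntropyLoss())
```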
Open Access
Conference paper
N/A
Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Lorenzo Pasco
ISTI-CNR; University of Pisa
Falling is one of the most common causes of injury in all ages, especially in the elderly, where it is more frequent and severe. For this reason, a tool that can detect a fall in real time can be helpful in ensuring appropriate intervention and avoiding more serious damage. Some approaches available in the literature use sensors, wearable devices, or cameras with special features such as thermal or depth sensors. In this paper, we propose a Computer Vision deep-learning based approach for human fall detection based on largely available standard RGB cameras. A typical limitation of this kind of approach is the lack of generalization to unseen environments. This is due to the error generated during human detection and, more generally, due to the unavailability of large-scale datasets that specialize in fall detection problems with different environments and fall types. In this work, we mitigate these limitations with a general-purpose object detector trained using a virtual world dataset in addition to real-world images. Through extensive experimental evaluation, we verified that by training our models on synthetic images as well, we were able to improve their ability to generalize. Code to reproduce results is available at https://github.com/lorepas/fallen-people-detection.
Open Access
Conference paper
Conference on Content-based Multimedia Indexing
Aurelia Viglione; Elena Putignano; Fabio Carrara; Giulia Sagona; Giuseppe Amato; Leonardo Lupori; Raffaele Mazziotti; Valentino Totaro;
BIO@SNS Lab; IRCCS Stella Maris Foundation; ISTI-CNR; Italy National Research Council;
Cyclin-dependent kinase-like 5 (Cdkl5) deficiency disorder (CDD) is a severe neurodevelopmental condition caused by mutations in the X-linked Cdkl5 gene. CDD is characterized by early-onset seizures in the first month of life, intellectual disability, motor and social impairment. No effective treatment is currently available and medical management is only symptomatic and supportive. Recently, mouse models of Cdkl5 disorder have demonstrated that mice lacking Cdkl5 exhibit autism-like phenotypes, hyperactivity and dysregulations of the arousal system, suggesting the possibility to use these features as translational biomarkers. In this study, we tested Cdkl5 male and female mutant mice in an appetitive operant conditioning chamber to assess cognitive and motor abilities, and performed pupillometry to assess the integrity of the arousal system. Then, we evaluated the performance of artificial intelligence models to classify the genotype of the animals from the behavioral and physiological phenotype. The behavioral results show that CDD mice display impulsivity, together with low levels of cognitive flexibility and perseverative behaviors. We assessed arousal levels by simultaneously recording pupil size and locomotor activity. Pupillometry reveals in CDD mice a smaller pupil size and an impaired response to unexpected stimuli associated with hyperlocomotion, demonstrating a global defect in arousal modulation. Finally, machine learning reveals that both behavioral and pupillometry parameters can be considered good predictors of CDD. Since early diagnosis is essential to evaluate treatment outcomes and pupillary measures can be performed easily, we proposed the monitoring of pupil size as a promising biomarker for CDD.
Open Access
Journal article
N/A
Christos Tzelepis; Giorgos Kordopatis-Zilos; Ioannis Kompatsiaris; Ioannis Patras; Symeon Papadopoulos
In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets — this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
Open Access
Journal article
N/A
Ioannis Kompatsiaris; Nikolaos Giatsoglou; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
Decentralization is emerging as a key feature of the future Internet. However, effective algorithms for search are missing from state-of-the-art decentralized technologies, such as distributed hash tables and blockchain. This is surprising, since decentralized search has been studied extensively in earlier peer-to-peer (P2P) literature. In this work, we adopt a fresh outlook for decentralized search in P2P networks that is inspired by advancements in dense information retrieval and graph signal processing. In particular, we generate latent representations of P2P nodes based on their stored documents and diffuse them to the rest of the network with graph filters, such as personalized PageRank. We then use the diffused representations to guide search queries towards relevant content. Our preliminary approach is successful in locating relevant documents in nearby nodes but the accuracy declines sharply with the number of stored documents, highlighting the need for more sophisticated techniques.
Open Access
Conference paper
Decentralized Internet, Networks, Protocols, and Systems
Antonios Liapis; Chintan Trivedi; Georgios N. Yannakakis; Konstantinos Makantasis;
University of Malta
Having access to accurate game state information is of utmost importance for any artificial intelligence task including game-playing, testing, player modeling, and procedural content generation. Self-Supervised Learning (SSL) techniques have been shown to be capable of inferring accurate game state information from the high-dimensional pixel input of game footage into compressed latent representations. Contrastive Learning is a popular SSL paradigm where the visual understanding of the game’s images comes from contrasting dissimilar and similar game states defined by simple image augmentation methods. In this study, we introduce a new game scene augmentation technique—named GameCLR—that takes advantage of the game-engine to define and synthesize specific, highly-controlled renderings of different game states, thereby boosting contrastive learning performance. We test our GameCLR technique on images of the CARLA driving simulator environment and compare it against the popular SimCLR baseline SSL method. Our results suggest that GameCLR can infer the game’s state information from game footage more accurately compared to the baseline. Our proposed approach allows us to conduct game artificial intelligence research by directly utilizing screen pixels as input.
Open Access
Conference paper
Conference on the Foundations of Digital Games
Alessio Molinari; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
The Saerens-Latinne-Decaestecker (SLD) algorithm is a method whose goal is improving the quality of the posterior probabilities (or simply “posteriors”) returned by a probabilistic classifier in scenarios characterized by prior probability shift (PPS) between the training set and the unlabelled (“test”) set. This is an important task, (a) because posteriors are of the utmost importance in downstream tasks such as, e.g., multiclass classification and cost-sensitive classification, and (b) because PPS is ubiquitous in many applications. In this paper we explore whether using SLD can indeed improve the quality of posteriors returned by a classifier trained via active learning (AL), a class of machine learning (ML) techniques that indeed tend to generate substantial PPS. Specifically, we target AL via relevance sampling (ALvRS) and AL via uncertainty sampling (ALvUS), two AL techniques that are very well known, especially because, due to their low computational cost, they are suitable for application in scenarios characterized by large datasets. We present experimental results obtained on the RCV1-v2 dataset, showing that SLD fails to deliver better-quality posteriors with both ALvRS and ALvUS, thus contradicting previous findings in the literature, and that this is due not to the amount of PPS that these techniques generate, but to how the examples they prioritize for annotation are distributed.
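For reference, the SLD procedure itself is a simple expectation-maximization loop: rescale the classifier's posteriors by the ratio of the current test-prior estimate to the training priors, renormalize per item, and re-estimate the test priors from the rescaled posteriors. The sketch below follows that recipe; the toy posteriors and stopping criterion are illustrative.

```python
# Sketch of the SLD (EM-based) posterior adjustment under prior probability shift.
import numpy as np

def sld(posteriors: np.ndarray, train_priors: np.ndarray,
        max_iter: int = 1000, tol: float = 1e-6):
    """posteriors: (N, C) classifier outputs on unlabelled data; train_priors: (C,)."""
    test_priors = train_priors.copy()
    adjusted = posteriors
    for _ in range(max_iter):
        adjusted = posteriors * (test_priors / train_priors)       # rescale posteriors
        adjusted = adjusted / adjusted.sum(axis=1, keepdims=True)  # renormalize per item
        new_priors = adjusted.mean(axis=0)                         # re-estimate test priors
        if np.abs(new_priors - test_priors).max() < tol:
            break
        test_priors = new_priors
    return adjusted, test_priors

P = np.random.dirichlet([1, 1], size=100)        # fake two-class posteriors, for illustration
print(sld(P, train_priors=np.array([0.5, 0.5]))[1])
```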
Open Access
Conference paper
Information Retrieval Communities in Europe Conference
Enver Sangineto; Hao Tang; Jichao Zhang; Jingjing Chen; Nicu Sebe; Peng Wu; Yan Yan
University of Trento;
This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without the gaze angle and the head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze and head pose information. Solving this problem using an unsupervised method remains an open problem, especially for high-resolution face images in the wild, which are not easy to annotate with gaze and head pose labels. To address this issue, we first create two new portrait datasets: CelebGaze (256 × 256) and high-resolution CelebHQGaze (512 × 512). Second, we formulate the gaze correction task as an image inpainting problem, addressed using a Gaze Correction Module (GCM) and a Gaze Animation Module (GAM). Moreover, we propose an unsupervised training strategy, i.e., Synthesis-As-Training, to learn the correlation between the eye region features and the gaze angle. As a result, we can use the learned latent space for gaze animation with semantic interpolation in this space. Moreover, to alleviate both the memory and the computational costs in the training and the inference stage, we propose a Coarse-to-Fine Module (CFM) integrated with GCM and GAM. Extensive experiments validate the effectiveness of our method for both the gaze correction and the gaze animation tasks in both low and high-resolution face datasets in the wild and demonstrate the superiority of our method with respect to the state of the art.
Open Access
Journal article
IEEE Transactions on Image Processing
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR; University of Padova
Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
Open Access
Working paper
Journal of Artificial Intelligence Research
Evlampios Apostolidis; Georgios Balaouras; Ioannis Patras; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas; Queen Mary University of London
In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, that relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frames’ dependencies and the ability to parallelize the training process of RNN-based network architectures, the developed method relies solely on the use of a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames’ dependencies based on global attention, our method integrates a concentrated attention mechanism that is able to focus on non-overlapping blocks in the main diagonal of the attention matrix, and to enrich the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated frames of the video. In this way, our method makes better estimates about the significance of different parts of the video, and drastically reduces the number of learnable parameters. Experimental evaluations using two benchmarking datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches, and demonstrate its ability to produce video summaries that are very close to the human preferences. An ablation study that focuses on the introduced components, namely the use of concentrated attention in combination with attention-based estimates about the frames’ uniqueness and diversity, shows their relative contributions to the overall summarization performance.
Open Access
Conference paper
N/A
Adrian Popescu; Babacar Sow; Julien Tourille;
Université Clermont Auvergne; Université Paris-Saclay;
Deep neural networks have the capacity to generate textual content which is increasingly difficult to distinguish from that produced by humans. Such content can be used in disinformation campaigns and its detrimental effects are amplified if it spreads on social networks. Here, we study the automatic detection of bot-generated Twitter messages. This task is difficult due to the combination of the strong performance of recent deep language models and the limited length of tweets. In this study, we propose a challenging definition of the problem by making no assumption regarding the bot account, its network or the method used to generate the text. We devise two approaches for bot detection based on pretrained language models and create a new dataset of generated tweets to improve the performance of our classifier on recent text generation algorithms. The obtained results show that the generalization capabilities of the proposed classifier heavily depend on the dataset used to train the model. Interestingly, the two automatic dataset augmentation methods proposed here show promising results. Their introduction leads to consistent performance gains compared to the use of the original dataset alone.
Open Access
Conference paper
N/A
Claudio Gennaro; Davide Alessandro Coccomini; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;
CNIT; ISTI-CNR;
Open Access
Conference paper
Conference on Multimedia Retrieval
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Luca Ciampi;
ISTI-CNR;
This paper presents a novel solution to automatically count vehicles in a parking lot using images captured by smart cameras. Unlike most of the literature on this task, which focuses on the analysis of single images, this paper proposes the use of multiple visual sources to monitor a wider parking area from different perspectives. The proposed multi-camera system is capable of automatically estimating the number of cars present in the entire parking lot directly on board the edge devices. It comprises an on-device deep learning-based detector that locates and counts the vehicles from the captured images and a decentralized geometric-based approach that can analyze the inter-camera shared areas and merge the data acquired by all the devices. We conducted the experimental evaluation on an extended version of the CNRPark-EXT dataset, a collection of images taken from the parking lot on the campus of the National Research Council (CNR) in Pisa, Italy. We show that our system is robust and takes advantage of the redundant information deriving from the different cameras, improving the overall performance without requiring any extra geometrical information of the monitored scene.
Open Access
Journal article
Expert Systems with Applications
Gim Hee Lee; Nicu Sebe; Yuyang Zhao; Zhun Zhong
National University of Singapore; University of Trento;
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In contrast to existing approaches that look at novel class discovery in image classification, we focus on the more challenging semantic segmentation. In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image, which increases the difficulty in using the unlabeled data. To tackle this new setting, we leverage the labeled base data and a saliency model to coarsely cluster novel classes for model training in our basic framework. Additionally, we propose the Entropy-based Uncertainty Modeling and Self-training (EUMS) framework to overcome noisy pseudo-labels, further improving the model performance on the novel classes. Our EUMS utilizes an entropy ranking technique and a dynamic reassignment to distill clean labels, thereby making full use of the noisy data via self-supervised learning. We build the NCDSS benchmark on the PASCAL-5i dataset and COCO-20i dataset. Extensive experiments demonstrate the feasibility of the basic framework (achieving an average mIoU of 49.81% on PASCAL-5i) and the effectiveness of EUMS framework (outperforming the basic framework by 9.28% mIoU on PASCAL-5i).
Open Access
Conference paper
N/A
Andrea Esuli;
ISTI-CNR;
We present the Interactive Classification System (ICS), a web-based application that supports the activity of manual text classification. The application uses machine learning to continuously fit automatic classification models that are in turn used to actively support its users with classification suggestions. The key requirement we have established for the development of ICS is to give its users total freedom of action: they can at any time modify any classification schema and any label assignment, possibly reusing any relevant information from previous activities. We investigate how this requirement challenges the typical scenarios faced in machine learning research, which instead give no active role to humans or place them into very constrained roles, e.g., on-demand labeling in active learning processes, and always assume some degree of batch processing of data. We satisfy the “total freedom” requirement by designing an unobtrusive machine learning model, i.e., the machine learning component of ICS acts as an unobtrusive observer of the users, that never interrupts them, continuously adapts and updates its models in response to their actions, and is always available to perform automatic classifications. Our efficient implementation of the unobtrusive machine learning model combines various machine learning methods and technologies, such as hash-based feature mapping, random indexing, online learning, active learning, and asynchronous processing.
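A sketch of how the hash-based feature mapping and online learning mentioned above can fit together, assuming scikit-learn components rather than the actual ICS codebase: a stateless hashing vectorizer removes the need to refit a vocabulary, and an incrementally trained linear model can absorb every individual labeling action while remaining available for predictions. Class names and example documents are hypothetical.

```python
# Illustrative sketch of hash-based feature mapping plus online learning (assumes scikit-learn >= 1.1).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # stateless: no fitting, no stored vocabulary
model = SGDClassifier(loss="log_loss")             # supports incremental updates via partial_fit

classes = ["sports", "politics"]
# Each manual labeling action immediately refines the model...
model.partial_fit(vectorizer.transform(["the match ended in a draw"]), ["sports"], classes=classes)
model.partial_fit(vectorizer.transform(["parliament passed the new bill"]), ["politics"])
# ...and the model stays available to suggest labels for the next document.
print(model.predict(vectorizer.transform(["the cup final was postponed"])))
```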
Open Access
Journal article
IEEE Access
Lucile Sassatelli; Quentin Guimard
Université Côte d'Azur;
While 360° videos watched in a VR headset are gaining in popularity, it is necessary to lower the required bandwidth to stream these immersive videos and obtain a satisfying quality of experience. Doing so requires predicting the user’s head motion in advance, which has been tackled by a number of recent prediction methods considering the video content and the user’s past motion. However, human motion is a complex process that can depend on many more parameters, including the type of attentional phase the user is currently in, and their emotions, which can be difficult to capture. This is the first article to investigate the effects of user emotions on the predictability of head motion, in connection with video-centric parameters. We formulate and verify hypotheses, and construct a structural equation model of emotion, motion and predictability. We show that the prediction error is higher for higher valence ratings, and that this relationship is mediated by head speed. We also show that the prediction error is lower for higher arousal, but that spatial information moderates the effect of arousal on predictability. This work opens the path to better capture important factors in human motion, to help improve the training process of head motion predictors.
Open Access
Conference paper
International Workshop on Immersive Mixed and Virtual Environment Systems
Aldric Ducreux; Auriane Gros; Camille Bauce; Florent Robert; Hui-Yin Wu; Lucile Sassatelli; Marco Winckler; Quentin Guimard
Université Côte d'Azur;
From a user perspective, immersive content can elicit more intense emotions than flat-screen presentations. From a system perspective, efficient storage and distribution remain challenging, and must consider user attention. Understanding the connection between user attention, user emotions and immersive content is therefore key. In this article, we present a new dataset, PEM360, of user head movements and gaze recordings in 360° videos, along with self-reported emotional ratings of valence and arousal, and continuous physiological measurement of electrodermal activity and heart rate. The stimuli are selected to enable the spatiotemporal analysis of the connection between content, user motion and emotion. We describe and provide a set of software tools to process the various data modalities, and introduce a joint instantaneous visualization of user attention and emotion we name Emotional maps. We exemplify new types of analyses the PEM360 dataset can enable. The entire data and code are made available in a reproducible framework.
Open Access
Conference paper
N/A
Antonios Liapis; Chintan Trivedi; Georgios N. Yannakakis; Konstantinos Makantasis;
University of Malta
Self-supervised learning (SSL) techniques have been widely used to learn compact and informative representations from high-dimensional complex data. In many computer vision tasks, such as image classification, such methods achieve state-of-the-art results that surpass supervised learning approaches. In this paper, we investigate whether SSL methods can be leveraged for the task of learning accurate state representations of games, and if so, to what extent. For this purpose, we collect game footage frames and corresponding sequences of games’ internal state from three different 3D games: VizDoom, the CARLA racing simulator and the Google Research Football Environment. We train an image encoder with three widely used SSL algorithms using solely the raw frames, and then attempt to recover the internal state variables from the learned representations. Our results across all three games showcase significantly higher correlation between SSL representations and the game’s internal state compared to pre-trained baseline models such as ImageNet. Such findings suggest that SSL-based visual encoders can yield general — not tailored to a specific task — yet informative game representations solely from game pixel information. Such representations can, in turn, form the basis for boosting the performance of downstream learning tasks in games, including gameplaying, content generation and player modeling.
Open Access
Conference paper
IEEE Conference on Games
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In the latest version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.
Open Access
Conference paper
N/A
Antonio Giganti; Luca Cuccovillo; Paolo Bestagini; Patrick Aichroth; Stefano Tubaro;
Fraunhofer IDMT; Politecnico di Milano;
In this paper, we propose the use of denoising for microphone classification, to enable its usage for several key application domains that involve noisy conditions. We describe the proposed analysis pipeline and the baseline algorithm for microphone classification, and discuss various denoising approaches which can be applied to it within the time or spectral domain; finally, we determine the best-performing denoising procedure, and evaluate the performance of the overall, integrated approach with several SNR levels of additive input noise. As a result, the proposed method achieves an average accuracy increase of about 25% on denoised content over the reference baseline.
Open Access
Conference paper
International Workshop on Multimedia AI against Disinformation
Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz;
ISTI-CNR; University of Oviedo
The 1st International Workshop on Learning to Quantify (LQ 2021 – https://cikmlq2021.github.io/), organized as a satellite event of the 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), took place on two separate days, November 1 and 5, 2021. Like the main CIKM 2021 conference, the workshop was held entirely online, due to the COVID-19 pandemic. This report presents a summary of each keynote speech and contributed paper presented at this event, and discusses the issues that were raised during the workshop.
Open Access
Journal article
SIGKDD Explorations
Nicu Sebe; Paolo Rota; Petru Soviany; Radu Tudor Ionescu
University of Trento; University Politehnica of Bucharest
Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs. Curriculum learning strategies have been successfully employed in all areas of machine learning, in a wide range of tasks. However, the necessity of finding a way to rank the samples from easy to hard, as well as the right pacing function for introducing more difficult data can limit the usage of the curriculum approaches. In this survey, we show how these limits have been tackled in the literature, and we present different curriculum learning instantiations for various tasks in machine learning. We construct a multi-perspective taxonomy of curriculum learning approaches by hand, considering various classification criteria. We further build a hierarchical tree of curriculum learning methods using an agglomerative clustering algorithm, linking the discovered clusters with our taxonomy. At the end, we provide some interesting directions for future work.
Closed Access
Journal article
International Journal of Computer Vision
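The pacing function discussed in the survey above can be illustrated with a small, hedged sketch: a linear schedule that exposes an increasing fraction of the easiest samples at each epoch. The difficulty scores are assumed to be precomputed (for example, losses of a pretrained model); this is one simple instantiation among the many variants the survey covers.

```python
import numpy as np

def curriculum_subset(difficulty, epoch, total_epochs, start_fraction=0.3):
    """Return indices of the training samples visible at a given epoch.

    difficulty: 1-D array with one precomputed difficulty score per sample
                (lower = easier), e.g. the loss of a pretrained model.
    The visible fraction grows linearly from start_fraction to 1.0
    (a simple linear pacing function).
    """
    fraction = min(1.0, start_fraction + (1.0 - start_fraction) * epoch / total_epochs)
    n_visible = max(1, int(len(difficulty) * fraction))
    easiest_first = np.argsort(difficulty)
    return easiest_first[:n_visible]

# Example: at epoch 0 only the easiest 30% of samples are used,
# and the full dataset is reached by the final epoch.
```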
Carlos Santiago; Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Leonardo Lupori; Luca Ciampi; Raffaele Mazziotti; Tommaso Pizzorusso; Valentino Totaro;
Institute for Systems and Robotics; ISTI-CNR; University of Florence;
Exploiting well-labeled training sets has led deep learning models to astonishing results for counting biological structures in microscopy images. However, dealing with weak multi-rater annotations, i.e., when multiple human raters disagree due to non-trivial patterns, remains a relatively unexplored problem. More reliable labels can be obtained by aggregating and averaging the decisions given by several raters to the same data. Still, the scale of the counting task and the limited budget for labeling prohibit this. As a result, making the most with small quantities of multi-rater data is crucial. To this end, we propose a two-stage counting strategy in a weakly labeled data scenario. First, we detect and count the biological structures; then, in the second step, we refine the predictions, increasing the correlation between the scores assigned to the samples and the raters’ agreement on the annotations. We assess our methodology on a novel dataset comprising fluorescence microscopy images of mice brains containing extracellular matrix aggregates named perineuronal nets. We demonstrate that we significantly enhance counting performance, improving confidence calibration by taking advantage of the redundant information characterizing the small sets of available multi-rater data.
Open Access
Journal article
Medical Image Analysis
Ioannis Mademlis; Ioannis Pitas; Michail Kaseris
Aristotle University of Thessaloniki;
Most unsupervised Deep Neural Networks (DNNs) for video summarization rely on adversarial learning, autoencoding and training without utilizing any ground-truth summary. In several cases, the Convolutional Neural Network (CNN)-derived video frame representations are sequentially fed to a Long Short-Term Memory (LSTM) network, which selects key frames and, during training, attempts to reconstruct the original/full video from the summary, while confusing an adversarially optimized Discriminator. Additionally, regularizers aiming at maximizing the summary’s visual semantic diversity can be employed, such as the Determinantal Point Process (DPP) loss term. In this paper, a novel DPP-based regularizer is proposed that exploits a pretrained DNN-based image captioner in order to additionally enforce maximal keyframe diversity from the perspective of textual semantic content. Thus, the selected key-frames are encouraged to differ not only with regard to what objects they depict, but also with regard to their textual descriptions, which may additionally capture activities, scene context, etc. Empirical evaluation indicates that the proposed regularizer leads to state-of-the-art performance.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Leonardo Galteri; Lorenzo Seidenari; Marco Bertini; Pietro Bongini
University of Florence;
Image quality assessment is often performed with deep networks that are fine-tuned to regress a human-provided quality score for a given image. Such approaches may lack generalization capabilities: while they are highly precise on similar image distributions, they may yield lower correlation on unseen distortions. In particular, they show poor performance when images corrupted by noise, blur or compression have been restored by generative models. As a matter of fact, evaluation of these generative models is often performed by providing anecdotal results to the reader. In the case of image enhancement and restoration, reference images are usually available. Nonetheless, using signal-based metrics often leads to counterintuitive results: highly natural crisp images may obtain worse scores than blurry ones. On the other hand, blind reference image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time-consuming human-based image assessment, semantic computer vision tasks may be exploited instead. In this paper we advocate the use of language generation tasks to evaluate the quality of restored images. We refer to our assessment approach as LANguage-based Blind Image QUality Evaluation (LANBIQUE). We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality, independently of the distortion process that affects the data. Captioning scores are better aligned with human rankings than classic signal-based or no-reference image quality metrics. We show insights on how the corruption of local image structure by artifacts may steer image captions in the wrong direction.
Open Access
Journal article
N/A
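A rough sketch of the captioning-based scoring idea from the abstract above: a caption is generated for the restored image with any off-the-shelf captioner (not shown) and compared against captions of the undistorted reference using a sentence-level BLEU score. This illustrates the general idea only, not the LANBIQUE metric itself; the example captions are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def caption_quality_score(restored_caption, reference_captions):
    """Score a restored image by comparing the caption generated on it
    with captions of the undistorted reference image(s).

    restored_caption: string produced by any image captioning model
                      run on the restored image (model not shown here).
    reference_captions: list of strings for the reference image.
    Returns a BLEU score in [0, 1]; higher = captions better preserved.
    """
    hypothesis = restored_caption.lower().split()
    references = [c.lower().split() for c in reference_captions]
    return sentence_bleu(references, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

# Example with invented captions:
score = caption_quality_score(
    "a dog runs on the beach",
    ["a brown dog running along the beach", "a dog plays near the sea"])
print(score)
```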
Alberto Del Bimbo; Federico Becattini; Francesco Marchetti; Lorenzo Seidenari; Lucile Sassatelli; Quentin Guimard
Università degli Studi di Firenze; Université Côte d'Azur;
Prediction of head movements in immersive media is key to design efficient streaming systems able to focus the bandwidth budget on visible areas of the content. Numerous proposals have therefore been made in the recent years to predict 360° images and videos. However, the performance of these models is limited by a main characteristic of the head motion data: its intrinsic uncertainty. In this article, we present an approach to generate multiple plausible futures of head motion in 360° videos, given a common past trajectory. Our method provides likelihood estimates of every predicted trajectory, enabling direct integration in streaming optimization. To the best of our knowledge, this is the first work that considers the problem of multiple head motion prediction for 360° video streaming. We first quantify this uncertainty from the data. We then introduce our discrete variational multiple sequence (DVMS) learning framework, which builds on deep latent variable models. We design a training procedure to obtain a flexible and lightweight stochastic prediction model compatible with sequence-to-sequence recurrent neural architectures. Experimental results on 3 different datasets show that our method DVMS outperforms competitors adapted from the self-driving domain by up to 37% on prediction horizons up to 5 sec., at lower computational and memory costs. Finally, we design a method to estimate the respective likelihoods of the multiple predicted trajectories, by exploiting the stationarity of the distribution of the prediction error over the latent space. Experimental results on 3 datasets show the quality of these estimates, and how they depend on the video category.
Open Access
Conference paper
ACM Multimedia Systems Conference
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
This repository contains a mapping between the classes of the COCO, LVIS, and Open Images V4 datasets into a unique set of 1460 classes. COCO [Lin et al. 2014] contains 80 classes, LVIS [Gupta et al. 2019] contains 1460 classes, and Open Images V4 [Kuznetsova et al. 2020] contains 601 classes. We built a mapping of these classes using a semi-automatic procedure in order to have a unique final list of 1460 classes. We also generated a hierarchy for each class, using WordNet.
Open Access
Journal article
Journal of Imaging
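A minimal sketch of how a WordNet hypernym chain can be derived for a class label with NLTK, in the spirit of the hierarchy generation mentioned above. Matching a label to its first noun synset is a simplification; the repository's semi-automatic mapping involves manual disambiguation.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def hypernym_chain(label):
    """Return a hypernym chain for a class label, e.g. 'dog' ->
    ['dog', 'canine', 'carnivore', ..., 'entity'].
    Uses the first noun synset as a simplification; real mappings between
    COCO/LVIS/Open Images labels need manual disambiguation.
    """
    synsets = wn.synsets(label.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return [label]
    chain, synset = [], synsets[0]
    while synset is not None:
        chain.append(synset.lemmas()[0].name())
        hypernyms = synset.hypernyms()
        synset = hypernyms[0] if hypernyms else None
    return chain

print(hypernym_chain("dog"))
```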
Alberto Del Bimbo; Andrea Ciamarra; Federico Becattini; Lorenzo Seidenari;
Università degli Studi di Firenze; University of Florence;
For an autonomous vehicle it is essential to observe the ongoing dynamics of a scene and consequently predict imminent future scenarios to ensure safety to itself and others. This can be done using different sensors and modalities. In this paper we investigate the usage of optical flow for predicting future semantic segmentations. To do so we propose a model that forecasts flow fields autoregressively. Such predictions are then used to guide the inference of a learned warping function that moves instance segmentations onto future frames. Results on the Cityscapes dataset demonstrate the effectiveness of optical-flow methods.
Open Access
Conference paper
N/A
Francesco Poldi; Ioannis Kompatsiaris; Lazaros Apostolidis; Olga Papadopoulou; Symeon Papadopoulos; Themistoklis Makedas
CERTH - Center for Research and Technology Hellas; EU DisinfoLab
The proliferation of online news, especially during the “infodemic” that emerged along with the COVID-19 pandemic, has rapidly increased the risk of and, more importantly, the volume of online misinformation. Online Social Networks (OSNs), such as Facebook, Twitter, and YouTube, serve as fertile ground for disseminating misinformation, making the need for tools for analyzing the social web and gaining insights into communities that drive misinformation online vital. We introduce the MeVer NetworkX analysis and visualization tool, which helps users delve into social media conversations, helps users gain insights about how information propagates, and provides intuition about communities formed via interactions. The contributions of our tool lie in easy navigation through a multitude of features that provide helpful insights about the account behaviors and information propagation, provide the support of Twitter, Facebook, and Telegram graphs, and provide the modularity to integrate more platforms. The tool also provides features that highlight suspicious accounts in a graph that a user should investigate further. We collected four Twitter datasets related to COVID-19 disinformation to present the tool’s functionalities and evaluate its effectiveness.
Open Access
Journal article
Future Internet
Johan Oomen; Philo van Kemenade; Rasa Bocyte
Netherlands Institute for Sound & Vision
With the rapid advance of Artificial Intelligence (AI) and the increased availability of digitised and born-digital sources from a wide range of collection owners, researchers can gain new perspectives on large-scale audiovisual collections and study patterns that reach across media and time. But what are the actual requirements that humanities scholars have for the use of such AI-based tooling? This question is what the Netherlands Institute for Sound & Vision (NISV) brought into the European research project AI4Media. Specifically, NISV is investigating how AI tools could open new research possibilities for the users of the CLARIAH Media Suite, a virtual research environment which enables exploration and analysis of distributed audiovisual collections. In this short paper presentation, we will present the requirements gathered from humanities scholars on AI tooling and describe how they are being translated into functional AI tools in the Media Suite.
Open Access
Conference paper
DH Benelux
Hanna Lukashevich; Jakob Abeßer; Sebastian Ribecky
Fraunhofer IDMT;
In the context of music information retrieval, similarity-based approaches are useful for a variety of tasks that benefit from a query-by-example approach. Music, however, naturally decomposes into a set of semantically meaningful factors of variation. Current representation learning strategies pursue the disentanglement of such factors from deep representations, resulting in highly interpretable models. This makes it possible to model the perception of music similarity, which is highly subjective and multi-dimensional. While the focus of prior work is on metadata-driven similarity, we suggest to directly model the human notion of multi-dimensional music similarity. To achieve this, we propose a multi-input deep neural network architecture which simultaneously processes mel-spectrogram, CENS chromagram and tempogram representations in order to extract informative features for different disentangled musical dimensions: genre, mood, instrument, era, tempo, and key. We evaluated the proposed music similarity approach using a triplet prediction task and found that the proposed multi-input architecture outperforms a state-of-the-art method. Furthermore, we present a novel multi-dimensional analysis to evaluate the influence of each disentangled dimension on the perception of music similarity.
Open Access
Publication
N/A
Denis Teyssou; Giorgos Kordopatis-Zilos; Ioannis Kompatsiaris; Ipek B. Schlicht; Killian Levacher; Lazaros Apostolidis; Panagiotis Galopoulos; Spyridon Baxevanakis;
Agence France-Presse; CERTH - Center for Research and Technology Hellas; Deutsche Welle; IBM Research;
Enabled by recent improvements in generation methodologies, DeepFakes have become mainstream due to their increasingly better visual quality, the increase in easy-to-use generation tools and the rapid dissemination through social media. This fact poses a severe threat to our societies with the potential to erode social cohesion and influence our democracies. To mitigate the threat, numerous DeepFake detection schemes have been introduced in the literature but very few provide a web service that can be used in the wild. In this paper, we introduce the MeVer DeepFake detection service, a web service detecting deep learning manipulations in images and video. We present the design and implementation of the proposed processing pipeline that involves a model ensemble scheme, and we endow the service with a model card for transparency. Experimental results show that our service performs robustly on the three benchmark datasets while being vulnerable to Adversarial Attacks. Finally, we outline our experience and lessons learned when deploying a research system into production in the hopes that it will be useful to other academic and industry teams.
Open Access
Conference paper
International Workshop on Multimedia AI against Disinformation
Jeremy Foss; Konstantinos Apostolidis; Lyndon Nixon; Vasileios Mezaris
Birmingham City University; CERTH - Center for Research and Technology Hellas; MODUL Technology;
Open Access
Journal article
ACM Multimedia Systems Conference
Antonios Liapis; Georgios N. Yannakakis; Johannes Pfau; Rainer Malaka;
University of Bremen; University of Malta
Video game testing has become a major investment of time, labor and expense in the game industry. Particularly the balancing of in-game units, characters and classes can cause long-lasting issues that persist years after a game’s launch. While approaches incorporating artificial intelligence have already shown successes in reducing manual effort and enhancing game development processes, most of these draw on heuristic, generalized or optimal behavior routines, while actual low-level decisions from individual players and their resulting playing styles are rarely considered. In this paper, we apply Deep Player Behavior Modeling to turn atomic actions of 213 players from 6 months of single-player instances within the MMORPG Aion into generative models that capture and reproduce particular playing strategies. In a subsequent simulation, the resulting generative agents (replicants) were tested against common NPC opponent types of MMORPGs that iteratively increased in difficulty, respective to the primary factor that constitutes this enemy type (Melee, Ranged, Rogue, Buffer, Debuffer, Healer, Tank or Group). As a result, imbalances between classes as well as strengths and weaknesses regarding particular combat challenges could be identified and regulated automatically.
Open Access
Journal article
IEEE Transactions on Games
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis; Kosmas Pinitas;
University of Malta
Stochastic gradient descent (SGD) is a premium optimization method for training neural networks, especially for learning objectively defined labels such as image objects and events. When a neural network is instead faced with subjectively defined labels, such as human demonstrations or annotations, SGD may struggle to explore the deceptive and noisy loss landscapes caused by the inherent bias and subjectivity of humans. While neural networks are often trained via preference learning algorithms in an effort to eliminate such data noise, the de facto training methods rely on gradient descent. Motivated by the lack of empirical studies on the impact of evolutionary search on the training of preference learners, we introduce the RankNEAT algorithm, which learns to rank through neuroevolution of augmenting topologies. We test the hypothesis that RankNEAT outperforms traditional gradient-based preference learning within the affective computing domain, in particular predicting annotated player arousal from the game footage of three dissimilar games. RankNEAT yields superior performance compared to the gradient-based preference learner (RankNet) in the majority of experiments, since its architecture optimization capacity acts as an efficient feature selection mechanism, thereby eliminating overfitting. Results suggest that RankNEAT is a viable and highly efficient evolutionary alternative to preference learning.
Open Access
Conference paper
Genetic and Evolutionary Computation Conference
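For context, the gradient-based baseline mentioned above (RankNet) optimizes a pairwise preference loss. A minimal PyTorch sketch is shown below; the scoring network and feature dimensionality are illustrative placeholders, and the RankNEAT neuroevolution itself is not reproduced here.

```python
import torch
import torch.nn as nn

class PairwiseRankLoss(nn.Module):
    """RankNet-style loss: P(i preferred over j) = sigmoid(s_i - s_j)."""
    def forward(self, score_i, score_j, target):
        # target = 1.0 if item i is the preferred one, else 0.0
        return nn.functional.binary_cross_entropy_with_logits(
            score_i - score_j, target)

# Minimal usage with a toy scoring network (architecture is illustrative):
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
x_i, x_j = torch.randn(8, 128), torch.randn(8, 128)
target = torch.ones(8, 1)      # assume item i was annotated as more arousing
loss = PairwiseRankLoss()(net(x_i), net(x_j), target)
loss.backward()
```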
Georg Thallinger; Katharina Schell; Verena Krawarik; Victoria Ertelthalner; Werner Bailer;
Joanneum Research;
Tools based on artificial intelligence (AI) are increasingly used in the media industry, addressing a potentially wide range of application areas. Based on a survey involving media professionals and technology providers, we present a taxonomy of application areas of AI in the media industry, including an assessment of the maturity of AI technology for the respective application. As many of these applications require human oversight, either due to insufficient maturity of technology or the need for editorial control, we also propose a classification of automation levels for AI in the media domain, with examples for different stages of the media value chain. Both of these aspects are strongly linked to the role of human users and their interaction with AI technologies. The results suggest that human-AI collaboration in media applications is still an unsolved research question.
Open Access
Conference paper
N/A
Werner Bailer;
Joanneum Research;
Few-shot object detection is useful in order to extend object detection capabilities in media production and archiving applications with specific object classes of interest for a particular organization or production context. While recent approaches for few-shot object detection have advanced the state of the art, they still do not fully meet the requirements of practical workflows, e.g., in media production and archiving. In these applications, annotated samples for novel classes are drawn from different data sources, they differ in numbers and it may be necessary to add a new class quickly to cover the requirements of a specific production. In contrast, current frameworks for few-shot object detection typically assume a static dataset, which is split into the base and novel classes. We propose a toolchain to facilitate training for few-shot object detection, which takes care of data preparation when using heterogeneous training data and setup of training steps. The toolchain also creates annotation files to use combined data sets as new base models, which facilitates class-incremental training. We also integrated the toolchain with an annotation UI.
Open Access
Conference paper
N/A
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
Computing the matrix square root or its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, neither method is computationally efficient enough in either the forward or the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration. Experimental results on de-correlated batch normalization and the second-order vision transformer demonstrate that our methods can also achieve competitive and even slightly better performance. The code is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt.
Open Access
Conference paper
International Conference on Learning Representations
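For reference, the Newton-Schulz iteration that the abstract above cites as a baseline can be sketched in a few lines of PyTorch. This is the baseline only, not the paper's MTP/MPA solvers or its Lyapunov-based backward pass, and it assumes a symmetric positive-definite input.

```python
import torch

def newton_schulz_sqrt(A, num_iters=10):
    """Approximate A^{1/2} and A^{-1/2} for a symmetric positive-definite
    matrix A with the coupled Newton-Schulz iteration (the baseline the
    abstract refers to; the paper's MTP/MPA solvers are not reproduced)."""
    norm = torch.linalg.norm(A)                       # Frobenius norm for pre-scaling
    Y = A / norm
    Z = torch.eye(A.shape[0], dtype=A.dtype)
    I3 = 3.0 * torch.eye(A.shape[0], dtype=A.dtype)
    for _ in range(num_iters):
        T = 0.5 * (I3 - Z @ Y)
        Y, Z = Y @ T, T @ Z
    sqrt_A = Y * torch.sqrt(norm)
    inv_sqrt_A = Z / torch.sqrt(norm)
    return sqrt_A, inv_sqrt_A

A = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
S, S_inv = newton_schulz_sqrt(A)
print(S @ S)        # should be close to A
```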
Alberto Del Bimbo; Claudio Ferrari; Mohamed Daoudi; Naima Otberdout; Stefano Berretti
University of Florence; University of Lille; University of Parma
In this paper, we propose a solution to the task of generating dynamic 3D facial expressions from a neutral 3D face and an expression label. This involves solving two sub-problems: (i) modeling the temporal dynamics of expressions, and (ii) deforming the neutral mesh to obtain the expressive counterpart. We represent the temporal evolution of expressions using the motion of a sparse set of 3D landmarks that we learn to generate by training a manifold-valued GAN (Motion3DGAN). To better encode the expression-induced deformation and disentangle it from the identity information, the generated motion is represented as per-frame displacement from a neutral configuration. To generate the expressive meshes, we train a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to a dense, per-vertex displacement. This allows us to learn how the motion of a sparse set of landmarks influences the deformation of the overall face surface, independently from the identity. Experimental results on the CoMA and D3DFACS datasets show that our solution brings significant improvements with respect to previous solutions in terms of both dynamic expression generation and mesh reconstruction, while retaining good generalization to unseen data. The code and the pretrained model will be made publicly available.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Becattini; Lorenzo Berlincioni; Lorenzo Seidenari;
University of Florence;
Trajectory prediction is an important task, especially in autonomous driving. The ability to forecast the position of other moving agents can yield effective planning, ensuring safety for the autonomous vehicle as well as for the observed entities. In this work we propose a data-driven approach based on Markov Chains to generate synthetic trajectories, which are useful for training a multiple-future trajectory predictor. The advantages are twofold: on the one hand, synthetic samples can be used to augment existing datasets and train more effective predictors; on the other hand, the approach allows us to generate samples with multiple ground truths, corresponding to diverse, equally likely outcomes of the observed trajectory. We define a trajectory prediction model and a loss that explicitly address the multimodality of the problem, and we show that combining synthetic and real data leads to prediction improvements, obtaining state-of-the-art results.
Open Access
Conference paper
N/A
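A minimal sketch of the Markov-chain idea from the abstract above: an empirical transition model is estimated over discretized states (for example, grid cells) and synthetic trajectories are sampled from it. The discretization and toy states are assumptions for illustration; the authors' generation procedure and multimodal loss are not reproduced.

```python
import random
from collections import defaultdict

def build_transition_model(trajectories):
    """trajectories: list of sequences of discretized states (e.g. grid cells).
    Returns a dict mapping each state to the list of observed successor states."""
    transitions = defaultdict(list)
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            transitions[current].append(nxt)
    return transitions

def sample_trajectory(transitions, start, length):
    """Sample a synthetic trajectory by walking the empirical Markov chain."""
    traj = [start]
    for _ in range(length - 1):
        successors = transitions.get(traj[-1])
        if not successors:
            break
        traj.append(random.choice(successors))
    return traj

# Example with toy grid-cell states:
real = [[(0, 0), (0, 1), (1, 1)], [(0, 0), (1, 0), (1, 1)], [(0, 1), (1, 1), (2, 1)]]
model = build_transition_model(real)
print(sample_trajectory(model, (0, 0), 4))
```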
Claudio Gennaro; Fabio Carrara; Fabrizio Falchi; Lorenzo Pasco
ISTI-CNR; University of Pisa
A synthetic dataset for visual fallen people detection comprising images extracted from the highly photo-realistic video game Grand Theft Auto V developed by Rockstar North. Each image is labeled by the game engine, providing bounding boxes and statuses (fallen or non-fallen) of people present in the scene. The dataset comprises 6,071 synthetic images depicting 7,456 fallen and 26,125 non-fallen pedestrian instances in various appearances, camera positions, background scenes, lighting, and occlusion conditions.
Open Access
Paper
N/A
Björn Þór Jónsson; Cathal Gurrin; Jakub Lokoč; Jiaxin Wu; Kai Uwe Barthel; Klaus Schoeffmann; Ladislav Peška; Luca Rossetto; Lucia Vadicamo; Silvan Heller; Stefanos Vrochidis; Werner Bailer;
CERTH - Center for Research and Technology Hellas; Charles University; Dublin City University; HTW Berlin; Joanneum Research; University of Basel; University of Copenhagen; University of Hong Kong;
In the last decade, user-centric video search competitions have facilitated the evolution of interactive video search systems. So far, these competitions focused on a small number of search task categories, with few attempts to change task category configurations. Based on our extensive experience with interactive video search contests, we have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. We further analyse the three task categories considered so far at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.
Open Access
Conference paper
N/A
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
LeQua 2022 is a new lab for the evaluation of methods for “learning to quantify” in textual datasets, i.e., for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. While these predictions could be easily achieved by first classifying all documents via a text classifier and then counting the numbers of documents assigned to the classes, a growing body of literature has shown this approach to be suboptimal, and has proposed better methods. The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting. For each such setting we provide data either in ready-made vector form or in raw document form.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Claudio Ferrari; Federico Becattini; Leonardo Galteri;
University of Florence; University of Parma
Modern image classification approaches often rely on deep neural networks, which have shown pronounced weakness to adversarial examples: images corrupted with specifically designed yet imperceptible noise that causes the network to misclassify. In this paper, we propose a conceptually simple yet robust solution to tackle adversarial attacks on image classification. Our defense works by first applying a JPEG compression with a random quality factor; compression artifacts are subsequently removed by means of a generative model (AR-GAN). The process can be iterated ensuring the image is not degraded and hence the classification not compromised. We train different AR-GANs for different compression factors, so that we can change its parameters dynamically at each iteration depending on the current compression, making the gradient approximation difficult. We experiment our defense against three white-box and two black-box attacks, with a particular focus on the state-of-the-art BPDA attack. Our method does not require any adversarial training, and is independent of both the classifier and the attack. Experiments demonstrate that dynamically changing the AR-GAN parameters is of fundamental importance to obtain significant robustness.
Open Access
Conference paper
N/A
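The defense loop described in the abstract above can be sketched as repeated random-quality JPEG compression followed by artifact removal. In the sketch below the restoration model is a placeholder callable standing in for the paper's AR-GANs, so this shows the control flow rather than the actual defense.

```python
import io
import random
from PIL import Image

def purify(image, restore, num_iters=3, quality_range=(30, 90)):
    """Apply random-quality JPEG compression followed by a restoration model,
    several times, before handing the image to the classifier.

    image:   an RGB PIL.Image
    restore: callable mapping (PIL.Image, quality) -> PIL.Image; a placeholder
             standing in for the paper's compression-artifact-removal networks.
    """
    for _ in range(num_iters):
        quality = random.randint(*quality_range)   # random JPEG quality factor
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        compressed = Image.open(buffer).convert("RGB")
        image = restore(compressed, quality)
    return image

# e.g. purify(img, restore=lambda im, q: im)  # identity "restorer" for testing
```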
Alejandro Moreo; Fabrizio Sebastiani;
Italy National Research Council;
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called “prevalence”) of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of “classify and count” (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimentation carried out in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We here re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. Due to the above-mentioned presence, in the test data, of samples characterised by class prevalence values very different from those of the training set, the results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
Open Access
Journal article
N/A
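To make the "classify and count" discussion above concrete, the sketch below contrasts the naive estimator with the classic adjusted variant (ACC), which corrects prevalence estimates using the classifier's true and false positive rates. This is a textbook illustration, not one of the specific methods re-evaluated in the paper, and the example numbers are invented.

```python
import numpy as np

def classify_and_count(predictions):
    """Naive prevalence estimate: fraction of items classified as positive."""
    return float(np.mean(predictions))

def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the naive estimate using the classifier's true/false positive
    rates estimated on held-out data:  p = (cc - fpr) / (tpr - fpr), clipped to [0, 1]."""
    cc = classify_and_count(predictions)
    if tpr - fpr == 0:
        return cc
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

# Toy example with hypothetical hard predictions from a biased classifier:
preds = np.array([1, 1, 1, 0, 1, 0, 1, 1])
print(classify_and_count(preds))                              # 0.75
print(adjusted_classify_and_count(preds, tpr=0.9, fpr=0.2))   # ~0.79
```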
Adrian Popescu; Bogdan Ionescu; Jérôme Deshayes-Chossart;
Université Paris-Saclay; University Politehnica of Bucharest
Images constitute a large part of the content shared on social networks. Their disclosure is often related to a particular context and users are often unaware of the fact that, depending on their privacy status, images can be accessible to third parties and be used for purposes which were initially unforeseen. For instance, it is common practice for employers to search information about their future employees online. Another example of usage is that of automatic credit scoring based on online data. Most existing approaches which propose feedback about shared data focus on inferring user characteristics and their practical utility is rather limited. We hypothesize that user feedback would be more efficient if conveyed through the real-life effects of data sharing. The objective of the task is to automatically score user photographic profiles in a series of situations with strong impact on her/his life. Four such situations were modeled this year and refer to searching for: (1) a bank loan, (2) an accommodation, (3) a job as waitress/waiter and (4) a job in IT. The inclusion of several situations is interesting in order to make it clear to the end users of the system that the same image will be interpreted differently depending on the context. The final objective of the task is to encourage the development of efficient user feedback, such as the YDSYO Android app.
Open Access
Paper
N/A
Guoying Zhao; Hao Tang; Haoyu Chen; Nicu Sebe; Zitong Yu
ETH Zurich; University of Oulu; University of Trento;
We present a customized 3D mesh Transformer model for the pose transfer task. As 3D pose transfer is essentially a deformation procedure dependent on the given meshes, the intuition of this work is to perceive the geometric inconsistency between the given meshes with the powerful self-attention mechanism. Specifically, we propose a novel geometry-contrastive Transformer that has an efficient 3D structured perceiving ability for the global geometric inconsistencies across the given meshes. Moreover, locally, a simple yet efficient central geodesic contrastive loss is further proposed to improve the regional geometric-inconsistency learning. Finally, we present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task towards unknown spaces. Extensive experimental results prove the efficacy of our approach by showing state-of-the-art quantitative performance on the SMPL-NPT, FAUST and our newly proposed SMG-3D datasets, as well as promising qualitative results on the MGcloth and SMAL datasets. We demonstrate that our method can achieve robust 3D pose transfer and generalize to challenging meshes from unknown spaces in cross-dataset tasks. The code and dataset are made available at https://github.com/mikecheninoulu/CGT.
Open Access
Conference paper
Conference on Artificial Intelligence
Emmanouil Krasanakis; Symeon Papadopoulos
CERTH - Center for Research and Technology Hellas
In this work, we aim to classify nodes of unstructured peer-to-peer networks with communication uncertainty, such as users of decentralized social networks. Graph Neural Networks (GNNs) are known to improve the accuracy of simpler classifiers in centralized settings by leveraging naturally occurring network links, but graph convolutional layers are challenging to implement in decentralized settings when node neighbors are not constantly available. We address this problem by employing decoupled GNNs, where base classifier predictions and errors are diffused through graphs after training. For these, we deploy pre-trained and gossip-trained base classifiers and implement peer-to-peer graph diffusion under communication uncertainty. In particular, we develop an asynchronous decentralized formulation of diffusion that converges to centralized predictions in distribution and linearly with respect to communication rates. We experiment on three real-world graphs with node features and labels and simulate peer-to-peer networks with uniformly random communication frequencies; given a portion of known labels, our decentralized graph diffusion achieves comparable accuracy to centralized GNNs with minimal communication overhead (less than 3% of what gossip training already adds).
Open Access
Journal article
IEEE Access
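A minimal, centralized sketch of the decoupled-GNN diffusion described above: predictions of a base classifier are propagated over the graph after training, in personalized-PageRank style. The asynchronous, gossip-based peer-to-peer formulation that the paper actually develops is not reproduced here; the restart parameter is an illustrative choice.

```python
import numpy as np

def diffuse_predictions(adjacency, base_predictions, restart=0.1, num_iters=50):
    """Propagate base classifier predictions over the graph after training
    (the centralized, synchronous view of decoupled GNN diffusion).

    adjacency:        (N, N) symmetric adjacency matrix.
    base_predictions: (N, C) class scores from any node-level classifier.
    """
    degree = adjacency.sum(axis=1, keepdims=True).astype(float)
    degree[degree == 0] = 1.0
    norm_adj = adjacency / degree                 # row-normalized adjacency
    scores = base_predictions.copy()
    for _ in range(num_iters):
        scores = (1 - restart) * norm_adj @ scores + restart * base_predictions
    return scores
```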
Elisa Ricci; Enrico Fini; Moin Nabi; Nicu Sebe; Victor da Costa
SAP AI Research; University of Trento;
This paper presents solo-learn, a library of self-supervised methods for visual representation learning. Implemented in Python, using PyTorch and PyTorch Lightning, the library fits both research and industry needs by featuring distributed training pipelines with mixed precision, faster data loading via NVIDIA DALI, online linear evaluation for better prototyping, and many additional training tricks. Our goal is to provide an easy-to-use library comprising a large number of Self-supervised Learning (SSL) methods, which can be easily extended and fine-tuned by the community. solo-learn opens up avenues for exploiting large-budget SSL solutions on inexpensive smaller infrastructures and seeks to democratize SSL by making it accessible to all. The source code is available at https://github.com/vturrisi/solo-learn.
Open Access
Journal article
Journal of Machine Learning Research
Bruno Lepri; Enver Sangineto; Marco de Nadai; Nicu Sebe; Wei Bi; Yahui Liu
FBK; Tencent AI Lab; University of Trento;
Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional Neural Networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design must, in VTs, be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose an auxiliary self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data is scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.
Open Access
Conference paper
Conference on Neural Information Processing Systems
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Makantasis;
University of Malta
What if emotion could be captured in a general and subject-agnostic fashion? Is it possible, for instance, to design general-purpose representations that detect affect solely from the pixels and audio of a human-computer interaction video? In this paper we address the above questions by evaluating the capacity of deep learned representations to predict affect by relying only on audiovisual information of videos. We assume that the pixels and audio of an interactive session embed the necessary information required to detect affect. We test our hypothesis in the domain of digital games and evaluate the degree to which deep classifiers and deep preference learning algorithms can learn to predict the arousal of players based only on the video footage of their gameplay. Our results from four dissimilar games suggest that general-purpose representations can be built across games as the arousal models obtain average accuracies as high as 85% using the challenging leave-one-video-out cross-validation scheme. The dissimilar audiovisual characteristics of the tested games showcase the strengths and limitations of the proposed method.
Open Access
Journal article
IEEE Transactions on Affective Computing
Claudio Gennaro; Fabio Carrara; Giuseppe Amato; Luca Ciampi;
ISTI-CNR;
Image-based automatic cell counting is an essential yet challenging task, crucial for the diagnosis of many diseases. Current solutions rely on Convolutional Neural Networks and provide astonishing results. However, their performance is often measured only in terms of counting error, which can mask mistaken estimations; a low counting error can be obtained with a high but equal number of false positives and false negatives. Consequently, it is hard to determine which solution truly performs best. In this work, we investigate three general counting approaches that have been successfully adopted in the literature for counting several different categories of objects. Through an experimental evaluation over three public collections of microscopy images containing marked cells, we assess not only their counting performance compared to several state-of-the-art methods but also their ability to correctly localize the counted cells. We show that commonly adopted counting metrics do not always agree with the localization performance of the tested models, and we thus suggest integrating the proposed evaluation protocol when developing novel cell counting solutions.
Open Access
Conference paper
N/A
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR; University of Pisa
We propose a semi-supervised learning strategy for deep Convolutional Neural Networks (CNNs) in which an unsupervised pre-training stage, performed using biologically inspired Hebbian learning algorithms, is followed by supervised end-to-end backprop fine-tuning. We explored two Hebbian learning rules for the unsupervised pre-training stage: soft-Winner-Takes-All (soft-WTA) and nonlinear Hebbian Principal Component Analysis (HPCA). Our approach was applied in sample efficiency scenarios, where the amount of available labeled training samples is very limited, and unsupervised pre-training is therefore beneficial. We performed experiments on CIFAR10, CIFAR100, and Tiny ImageNet datasets. Our results show that Hebbian outperforms Variational Auto-Encoder (VAE) pre-training in almost all the cases, with HPCA generally performing better than soft-WTA.
Open Access
Conference paper
N/A
Claudio Gennaro; Fabrizio Falchi; Francesco Merola; Marco Di Benedetto
ISTI-CNR;
Self-driving systems have recently received massive attention in both academic and industrial contexts, leading to major improvements in standard navigation scenarios typically identified as well-maintained urban routes. Critical events like road accidents or unexpected obstacles, however, require the execution of specific emergency actions that deviate from ordinary driving behavior and are therefore harder to incorporate in the system. In this context, we propose a system that is specifically built to take control of the vehicle and perform an emergency maneuver in case of a dangerous scenario. The presented architecture is based on a deep reinforcement learning algorithm, trained in a simulated environment and using raw sensory data as input. We evaluate the system's performance on several typical pre-accident scenarios and show promising results, with the vehicle being able to consistently perform an avoidance maneuver to nullify or minimize the incoming damage.
Open Access
Conference paper
N/A
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR; University of Pisa
We explore competitive Hebbian learning strategies to train feature detectors in Convolutional Neural Networks (CNNs), without supervision. We consider variants of the Winner-Takes-All (WTA) strategy explored in previous works, i.e. k-WTA, e-soft-WTA and p-soft-WTA, performing experiments on different object recognition datasets. Results suggest that the Hebbian approaches are effective to train early feature extraction layers, or to re-train higher layers of a pre-trained network, with soft competition generally performing better than other Hebbian approaches explored in this work. Our findings encourage a path of cooperation between neuroscience and computer science towards a deeper investigation of biologically inspired learning principles.
Open Access
Conference paper
N/A
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR; University of Pisa
In this paper, we investigate Hebbian learning strategies applied to Convolutional Neural Network (CNN) training. We consider two unsupervised learning approaches, Hebbian Winner-Takes-All (HWTA), and Hebbian Principal Component Analysis (HPCA). The Hebbian learning rules are used to train the layers of a CNN in order to extract features that are then used for classification, without requiring backpropagation (backprop). Experimental comparisons are made with state-of-the-art unsupervised (but backprop-based) Variational Auto-Encoder (VAE) training. For completeness, we consider two supervised Hebbian learning variants (Supervised Hebbian Classifiers—SHC, and Contrastive Hebbian Learning—CHL), for training the final classification layer, which are compared to Stochastic Gradient Descent training. We also investigate hybrid learning methodologies, where some network layers are trained following the Hebbian approach, and others are trained by backprop. We tested our approaches on MNIST, CIFAR10, and CIFAR100 datasets. Our results suggest that Hebbian learning is generally suitable for training early feature extraction layers, or to retrain higher network layers in fewer training epochs than backprop. Moreover, our experiments show that Hebbian learning outperforms VAE training, with HPCA performing generally better than HWTA.
Open Access
Conference paper
N/A
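A hedged sketch of one Hebbian soft-Winner-Takes-All update of the kind explored in the works above: each neuron moves its weight vector towards the input in proportion to a softmax over the neuron activations. This is a simplified illustration, not the exact HWTA/HPCA rules or the hybrid training pipelines evaluated in the papers; the temperature parameter is an illustrative choice.

```python
import numpy as np

def soft_wta_update(weights, x, learning_rate=0.01, temperature=1.0):
    """One Hebbian soft-Winner-Takes-All step for a layer of linear neurons.

    weights: (num_neurons, input_dim) matrix, one weight vector per neuron.
    x:       (input_dim,) input sample.
    Each neuron moves its weight vector towards the input in proportion to a
    softmax over the neuron activations (soft competition).
    """
    activations = weights @ x                               # (num_neurons,)
    scores = np.exp((activations - activations.max()) / temperature)
    responsibility = scores / scores.sum()                  # soft WTA weights
    weights += learning_rate * responsibility[:, None] * (x[None, :] - weights)
    return weights

# Example: 16 neurons learning from random 64-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)) * 0.1
for _ in range(1000):
    W = soft_wta_update(W, rng.normal(size=64))
```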
Antoine Plumerault; Céline Hudelot; Hervé Le Borgne;
Université Paris-Saclay;
Among the wide variety of image generative models, two models stand out: Variational Auto Encoders (VAE) and Generative Adversarial Networks (GAN). GANs can produce realistic images, but they suffer from mode collapse and do not provide simple ways to get the latent representation of an image. On the other hand, VAEs do not have these problems, but they often generate images that are less realistic than those produced by GANs. In this article, we explain that this lack of realism is partially due to a common underestimation of the dimensionality of the natural image manifold. To solve this issue we introduce a new framework that combines VAE and GAN in a novel and complementary way to produce an auto-encoding model that keeps the properties of VAEs while generating images of GAN quality. We evaluate our approach both qualitatively and quantitatively on five image datasets.
Open Access
Conference paper
N/A
Antonios Liapis; Georgios N. Yannakakis; Theodoros Galanos
University of Malta
This paper introduces a novel method for generating artistic images that express particular affective states. Leveraging state-of-the-art deep learning methods for visual generation (through generative adversarial networks), semantic models from OpenAI, and the annotated dataset of the visual art encyclopedia WikiArt, our AffectGAN model is able to generate images based on specific or broad semantic prompts and intended affective outcomes. A small dataset of 32 images generated by AffectGAN is annotated by 50 participants in terms of the particular emotion they elicit, as well as their quality and novelty. Results show that for most instances the intended emotion used as a prompt for image generation matches the participants’ responses. This small-scale study brings forth a new vision towards blending affective computing with computational creativity, enabling generative systems with intentionality in terms of the emotions they wish their output to elicit.
Open Access
Conference paper
International Conference on Affective Computing and Intelligent Interaction Workshops
Antonios Liapis; Georgios N. Yannakakis; Matthew Barthet
University of Malta
This paper proposes a paradigm shift for affective computing by viewing the affect modeling task as a reinforcement learning process. According to our proposed framework the context (environment) and the actions of an agent define the common representation that interweaves behavior and affect. To realise this framework we build on recent advances in reinforcement learning and use a modified version of the Go-Explore algorithm which has showcased supreme performance in hard exploration tasks. In this initial study, we test our framework in an arcade game by training Go-Explore agents to both play optimally and attempt to mimic human demonstrations of arousal. We vary the degree of importance between optimal play and arousal imitation and create agents that can effectively display a palette of affect and behavioral patterns. Our Go-Explore implementation not only introduces a new paradigm for affect modeling; it empowers believable AI-based game testing by providing agents that can blend and express a multitude of behavioral and affective patterns.
Open Access
Conference paper
International Conference on Affective Computing and Intelligent Interaction Workshops
Alberto Del Bimbo; Leonardo Galteri; Lorenzo Seidenari; Marco Bertini; Pietro Bongini
University of Florence;
Evaluation of generative models, in the visual domain, is often performed by providing anecdotal results to the reader. In the case of image enhancement, reference images are usually available. Nonetheless, using signal-based metrics often leads to counterintuitive results: highly natural crisp images may obtain worse scores than blurry ones. On the other hand, blind reference image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time-consuming human-based image assessment, semantic computer vision tasks may be exploited instead [9, 25, 33]. In this paper we advocate the use of language generation tasks to evaluate the quality of restored images. We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality. Captioning scores are better aligned with human rankings than signal-based metrics or no-reference image quality metrics. We show insights on how the corruption of local image structure by artifacts may steer image captions in the wrong direction.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Federico Vaccaro; Marco Bertini; Tiberio Uricchio
University of Florence;
In this paper, we address the problem of content-based image retrieval (CBIR) by learning image representations based on the activations of a Convolutional Neural Network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelovic et al., CVPR 2016) and bags of local features obtained by splitting the activations, which allows us to reduce the dimensionality of the descriptor and to increase retrieval performance. Training is performed using an improved triplet mining procedure that selects samples based on their difficulty, in order to obtain an effective image representation while reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, which can be effectively used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.
Closed Access
Journal article
N/A
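The difficulty-based triplet mining described in the entry above can be illustrated with a short, self-contained sketch. The batch-hard rule below (hardest positive and hardest negative per anchor) is a common stand-in for the paper's own mining procedure, and the margin value and random data are assumptions.

```python
# Illustrative sketch of difficulty-based (batch-hard) triplet mining in PyTorch.
# The paper's exact mining rule differs; this only shows the general idea of
# ranking candidates by how hard they are for the current embedding.
import torch

def batch_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor, margin: float = 0.2):
    """For each anchor, pick the hardest positive (farthest same-label sample)
    and the hardest negative (closest different-label sample), then compute
    the standard triplet loss."""
    dists = torch.cdist(embeddings, embeddings)            # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # same-label mask
    eye = torch.eye(len(labels), dtype=torch.bool)

    pos_mask = same & ~eye
    neg_mask = ~same

    # Hardest positive: maximum distance among same-label pairs.
    hardest_pos = (dists * pos_mask).max(dim=1).values
    # Hardest negative: minimum distance among different-label pairs.
    neg_dists = dists.masked_fill(~neg_mask, float("inf"))
    hardest_neg = neg_dists.min(dim=1).values

    return torch.clamp(hardest_pos - hardest_neg + margin, min=0.0).mean()

# Usage with random data (for illustration only):
emb = torch.nn.functional.normalize(torch.randn(32, 128), dim=1)
lbl = torch.randint(0, 8, (32,))
loss = batch_hard_triplets(emb, lbl)
```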
Adrian Popescu; Bogdan Ionescu; Jérôme Deshayes-Chossart; Liviu-Daniel Stefan;
Université Paris-Saclay; University Politehnica of Bucharest
Face verification aims to distinguish between genuine and imposter pairs of faces, which include the same or different identities, respectively. The performance reported in recent years gives the impression that the task is practically solved. Here, we revisit the problem and argue that existing evaluation datasets were built using two oversimplifying design choices. First, the usual identity selection to form imposter pairs is not challenging enough because, in practice, verification is needed to detect challenging imposters. Second, the underlying demographics of existing datasets are often insufficient to account for the wide diversity of facial characteristics of people from across the world. To mitigate these limitations, we introduce the FaVCI2D dataset. Imposter pairs are challenging because they include visually similar faces selected from a large pool of demographically diversified identities. The dataset also includes metadata related to gender, country and age to facilitate fine-grained analysis of results. FaVCI2D is generated from freely distributable resources. Experiments with state-of-the-art deep models that provide nearly 100% performance on existing datasets show a significant performance drop for FaVCI2D, confirming our starting hypothesis. Equally important, we analyze legal and ethical challenges which appeared in recent years and hindered the development of face analysis research. We introduce a series of design choices which address these challenges and make the dataset constitution and usage more sustainable and fairer.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Adrian Popescu; Darian Onchis; Eden Belouadah; Habib Slim
IMT Atlantic; Université Paris-Saclay; West University of Timisoara
Incremental learning enables artificial agents to learn from sequential data. While important progress was made by exploiting deep neural networks, incremental learning remains very challenging. This is particularly the case when no memory of past data is allowed and catastrophic forgetting has a strong negative effect. We tackle class-incremental learning without memory by adapting prediction bias correction, a method which makes predictions of past and new classes more comparable. It was proposed when a memory is allowed and cannot be directly used without memory, since samples of past classes are required. We introduce a two-step learning process which allows the transfer of bias correction parameters between reference and target datasets. Bias correction is first optimized offline on reference datasets which have an associated validation memory. The obtained correction parameters are then transferred to target datasets, for which no memory is available. The second contribution is to introduce a finer modeling of bias correction by learning its parameters per incremental state instead of the usual past vs. new class modeling. The proposed dataset knowledge transfer is applicable to any incremental method which works without memory. We test its effectiveness by applying it to four existing methods. Evaluation with four target datasets and different configurations shows consistent improvement, with practically no computational and memory overhead.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
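A minimal sketch of the per-state prediction bias correction idea from the entry above. In the paper the correction parameters are calibrated on reference datasets with a validation memory and then transferred; the (alpha, beta) values below are illustrative, applied to frozen logits.

```python
# Minimal sketch of per-state prediction bias correction on top of frozen logits.
# The (alpha, beta) pairs would, in the paper's setting, be optimized offline on
# reference datasets and transferred; here they are just illustrative parameters.
import torch

def correct_logits(logits: torch.Tensor, state_of_class: torch.Tensor,
                   alpha: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Rescale each class logit with the correction parameters of the incremental
    state in which that class was learned: logit' = alpha[s] * logit + beta[s]."""
    a = alpha[state_of_class]          # per-class gather of the state parameters
    b = beta[state_of_class]
    return logits * a + b

# Example: 10 classes learned over 3 incremental states.
logits = torch.randn(4, 10)                           # batch of raw predictions
state_of_class = torch.tensor([0]*4 + [1]*3 + [2]*3)  # state in which each class was learned
alpha = torch.tensor([0.8, 1.0, 1.2])                 # hypothetical transferred parameters
beta = torch.tensor([0.1, 0.0, -0.1])
corrected = correct_logits(logits, state_of_class, alpha, beta)
```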
Adrian Popescu; Jérôme Deshayes-Chossart; Van-Khoa Nguyen
Université Paris-Saclay;
Social networks give free access to their services in exchange for the right to exploit their users’ data. Data sharing is done in an initial context which is chosen by the users. However, data are used by social networks and third parties in different contexts which are often not transparent. In order to unveil such usages, we propose an approach which focuses on the effects of data sharing in impactful real-life situations. Focus is put on visual content because of its strong influence in shaping online user profiles. The approach relies on three components: (1) a set of visual objects with associated situation impact ratings obtained by crowdsourcing, (2) a corresponding set of object detectors for mining users’ photos and (3) a ground truth dataset made of 500 visual user profiles which are manually rated per situation. These components are combined in LERVUP, a method which learns to rate visual user profiles in each situation. LERVUP exploits a new image descriptor which aggregates object ratings and object detections at user level and an attention mechanism which boosts highly-rated objects to prevent them from being overwhelmed by low-rated ones. Performance is evaluated per situation by measuring the correlation between the automatic ranking of profile ratings and a manual ground truth. Results indicate that LERVUP is effective since a strong correlation of the two rankings is obtained. A practical implementation of the approach in a mobile app which raises user awareness about shared data usage is also discussed.
Open Access
Conference paper
Winter Conference on Applications of Computer Vision
Anna Queralt; Artur Garcia-Saez; Francesc Lordan; Javier Conejero; Rosa M. Badia; Sergio Sanchez-Ramirez; Toni Cortes
Barcelona Supercomputing Center; Universitat Politecnica de Catalunya
With the advent of more powerful Quantum Computers, the need for larger Quantum Simulations has grown. As the amount of resources grows exponentially with the size of the target system, Tensor Networks emerge as an optimal framework with which we represent Quantum States in tensor factorizations. As the extent of a tensor network increases, so does the size of intermediate tensors requiring HPC tools for their manipulation. Simulations of medium-sized circuits cannot fit on local memory, and solutions for distributed contraction of tensors are scarce. In this work we present RosneT, a library for distributed, out-of-core block tensor algebra. We use the PyCOMPSs programming model to transform tensor operations into a collection of tasks handled by the COMPSs runtime, targeting executions in existing and upcoming Exascale supercomputers. We report results validating our approach showing good scalability in simulations of Quantum circuits of up to 53 qubits.
Open Access
Publication
IEEE/ACM Second International Workshop on Quantum Computing Software
Bogdan Ionescu; Mihai Gabriel Constantin
University Politehnica of Bucharest
This paper describes the approach taken by the AI Multimedia Lab team for the MediaEval 2021 Predicting Media Memorability task. Our approach is based on a Vision Transformer-based learning method, which is optimized by filtering the training sets for the two proposed datasets. We attempt to train the methods we propose with video segments that are more representative of the videos they are part of. We test several types of filtering architectures, and submit and test the architectures that best performed in our preliminary studies.
Open Access
Conference paper
N/A
Adrián Pérez-Salinas; Artur Garcia-Saez; Carlos Bravo-Prieto; Diego García-Martín; José I. Latorre; Sergi Ramos-Calderer; Stavros Efthymiou
Barcelona Supercomputing Center; Center for Quantum Technologies; Instituto de Física Teórica; Qilimanjaro Quantum Tech; Quantum Research Centre; Universidad de Barcelona; University of Milano-Bicocca
We present Qibo, a new open-source software for fast evaluation of quantum circuits and adiabatic evolution which takes full advantage of hardware accelerators. The growing interest in quantum computing and the recent developments of quantum hardware devices motivates the development of new advanced computational tools focused on performance and usage simplicity. In this work we introduce a new quantum simulation framework that enables developers to delegate all complicated aspects of hardware or platform implementation to the library so they can focus on the problem and quantum algorithms at hand. This software is designed from scratch with simulation performance, code simplicity and user friendly interface as target goals. It takes advantage of hardware acceleration such as multi-threading CPU, single GPU and multi-GPU devices.
Open Access
Publication
QST
Alberto Baldrati; Alberto Del Bimbo; Marco Bertini; Tiberio Uricchio
University of Florence;
Building on the recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description and provide a combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
Open Access
Conference paper
N/A
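A hedged sketch of the Combiner idea from the entry above, operating on precomputed image and text embeddings (e.g., CLIP-like features). The layer sizes, the residual fusion and the 512-dimensional features are assumptions, not the paper's exact architecture.

```python
# Illustrative Combiner sketch: fuse a reference-image feature with a text feature
# and use the fused vector to rank candidate images by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        fused = self.mlp(torch.cat([img_feat, txt_feat], dim=-1))
        # Residual mix of the two modalities plus the learned fusion (a design assumption).
        return F.normalize(fused + img_feat + txt_feat, dim=-1)

# Retrieval with precomputed features (random tensors stand in for real embeddings):
combiner = Combiner()
query = combiner(torch.randn(1, 512), torch.randn(1, 512))   # reference image + caption
gallery = F.normalize(torch.randn(1000, 512), dim=-1)        # candidate image features
ranking = (query @ gallery.T).argsort(dim=-1, descending=True)
```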
Antonios Liapis; David Melhart; Georgios N. Yannakakis;
University of Malta
To what degree can abstract gameplay metrics capture the player experience in a general fashion within a game genre? In this comprehensive study we address this question across three different videogame genres: racing, shooter, and platformer games. Using high-level gameplay features that feed preference learning models we are able to predict arousal accurately across different games of the same genre in a large-scale dataset of over 1,000 arousal-annotated play sessions. Our genre models predict changes in arousal with up to 74% accuracy on average across all genres and 86% in the best cases. We also examine the feature importance during the modelling process and find that time-related features largely contribute to the performance of both game and genre models. The prominence of these game-agnostic features shows the importance of the temporal dynamics of the play experience in modelling, but also highlights some of the challenges for the future of general affect modelling in games and beyond.
Open Access
Conference paper
IEEE Conference on Games
Antonios Liapis; Chintan Triverdi; Georgios N. Yannakakis;
University of Malta
Representing games through their pixels offers a promising approach for building general-purpose and versatile game models. While games are not merely images, neural network models trained on game pixels often capture differences of the visual style of the image rather than the content of the game. As a result, such models cannot generalize well even within similar games of the same genre. In this paper we build on recent advances in contrastive learning and showcase its benefits for representation learning in games. Learning to contrast images of games not only classifies games in a more efficient manner; it also yields models that separate games in a more meaningful fashion by ignoring the visual style and focusing, instead, on their content. Our results in a large dataset of sports video games containing 100k images across 175 games and 10 game genres suggest that contrastive learning is better suited for learning generalized game representations compared to conventional supervised learning. The findings of this study bring us closer to universal visual encoders for games that can be reused across previously unseen games without requiring retraining or fine-tuning.
Open Access
Conference paper
N/A
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
The aim of LeQua 2022 (the 1st edition of the CLEF “Learning to Quantify” lab) is to allow the comparative evaluation of methods for “learning to quantify” in textual datasets, i.e., methods for training predictors of the relative frequencies of the classes of interest in sets of unlabelled textual documents. These predictors (called “quantifiers”) will be required to issue predictions for several such sets, some of them characterized by class frequencies radically different from the ones of the training set.
Open Access
Paper
N/A
Bogdan Andrei Boteanu; Bogdan Ionescu; Bomi Kim; Claudiu Lamba; Liviu-Daniel Stefan; Mihai Dogariu
Hana Institute of Technology; University Politehnica of Bucharest
Financial markets have always been a point of interest for automated systems. Due to their complex nature, financial algorithms and fintech frameworks require vast amounts of data to accurately respond to market fluctuations. This data availability is tied to the daily market evolution so it is impossible to accelerate its acquisition. In this paper, we discuss several solutions for augmenting financial datasets via synthesizing realistic time-series with the help of generative models. This problem is complex since financial time series present very specific properties, e.g., fat-tail distribution, cross-correlation between different stocks, specific autocorrelation, cluster volatility etc. In particular, we propose solutions for capturing cross-correlations between different stocks and for transitioning from fixed to variable length time-series without resorting to sequence modeling networks, and adapt various network architectures, e.g., fully connected and convolutional GANs, variational autoencoders, and generative moment matching networks. Finally, we tackle the problem of evaluating the quality of synthetic financial time-series. We introduce qualitative and quantitative metrics, along with a portfolio trend prediction framework which validates our generative models’ performance. We carry out experiments on real-world financial data extracted from the US stock market proving the benefits of these techniques.
Open Access
Journal article
N/A
Ioannis Patras; James Oldfield; Markos Georgopoulos; Mihalis Nicolaou; Yannis Panagakis
Cyprus Institute; Imperial College London; Queen Mary University of London; University of Athens
This paper addresses the problem of finding interpretable directions in the latent space of pre-trained Generative Adversarial Networks (GANs) to facilitate controllable image synthesis. Such interpretable directions correspond to transformations that can affect both the style and geometry of the synthetic images. However, existing approaches that utilise linear techniques to find these transformations often fail to provide an intuitive way to separate these two sources of variation. To address this, we propose to a) perform a multilinear decomposition of the tensor of intermediate representations, and b) use a tensor-based regression to map directions found using this decomposition to the latent space. Our scheme allows for both linear edits corresponding to the individual modes of the tensor, and non-linear ones that model the multiplicative interactions between them. We show experimentally that we can utilise the former to better separate style- from geometry-based transformations, and the latter to generate an extended set of possible transformations in comparison to prior works. We demonstrate our approach’s efficacy both quantitatively and qualitatively compared to the current state-of-the-art.
Open Access
Conference paper
N/A
George Voulgaris; Ioannis Mademlis; Ioannis Pitas;
Aristotle University of Thessaloniki;
Synthetic terrain realism is critical in VR applications based on computer graphics (e.g., games, simulations). Although fast procedural algorithms for automated terrain generation do exist, they still require human effort. This paper proposes a novel approach to procedural terrain generation, relying on Generative Adversarial Networks (GANs). The neural model is trained using terrestrial Points-of-Interest (PoIs, described by their geodesic coordinates/altitude) and publicly available corresponding satellite images. After training is complete, the GAN can be employed for deriving realistic terrain images on-the-fly, by merely forwarding through it a rough 2D scatter plot of desired PoIs in image form (so-called “altitude image”). We demonstrate that such a GAN is able to translate this rough, quickly produced sketch into an actual photorealistic terrain image. Additionally, we describe a strategy for enhancing the visual diversity of trained model synthetic output images, by tweaking input altitude image orientation during GAN training. Finally, we perform an objective and a subjective evaluation of the proposed method. Results validate the latter’s ability to rapidly create life-like terrain images from minimal input data.
Open Access
Conference paper
N/A
Daniel Gatica-Perez; Sina Sajadmanesh
Idiap Research Institute
Graph Neural Networks (GNNs) have demonstrated superior performance in learning node representations for various graph inference tasks. However, learning over graph data can raise privacy concerns when nodes represent people or human-related variables that involve sensitive or personal information. While numerous techniques have been proposed for privacy-preserving deep learning over non-relational data, there is less work addressing the privacy issues pertained to applying deep learning algorithms on graphs. In this paper, we study the problem of node data privacy, where graph nodes have potentially sensitive data that is kept private, but they could be beneficial for a central server for training a GNN over the graph. To address this problem, we develop a privacy-preserving, architecture-agnostic GNN learning algorithm with formal privacy guarantees based on Local Differential Privacy (LDP). Specifically, we propose an LDP encoder and an unbiased rectifier, by which the server can communicate with the graph nodes to privately collect their data and approximate the GNN’s first layer. To further reduce the effect of the injected noise, we propose to prepend a simple graph convolution layer, called KProp, which is based on the multi-hop aggregation of the nodes’ features acting as a denoising mechanism. Finally, we propose a robust training framework, in which we benefit from KProp’s denoising capability to increase the accuracy of inference in the presence of noisy labels. Extensive experiments conducted over real-world datasets demonstrate that our method can maintain a satisfying level of accuracy with low privacy loss.
Open Access
Conference paper
N/A
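The KProp denoising step described in the entry above boils down to multi-hop neighbourhood aggregation applied before the GNN proper. The sketch below uses simple mean aggregation for K hops; the paper's exact normalisation and the LDP perturbation pipeline are omitted, so treat this as an illustrative assumption.

```python
# Minimal sketch of a KProp-style multi-hop aggregation used as a denoiser:
# noisy (e.g., locally perturbed) node features are repeatedly averaged over
# neighbourhoods before being fed to the GNN.
import numpy as np

def kprop(features: np.ndarray, adj: np.ndarray, k: int = 4) -> np.ndarray:
    """Apply K steps of mean aggregation over the graph: X <- D^-1 A X."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero for isolated nodes
    for _ in range(k):
        features = (adj @ features) / deg    # average neighbour features
    return features

# Tiny example: 4 nodes on a path graph, 3-dimensional noisy features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
noisy = np.random.randn(4, 3)
denoised = kprop(noisy, adj, k=2)
```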
Dario Zanca; Lucile Sassatelli; Marco Goro; Miguel Rondon; Stefano Melacci
Friedrich-Alexander-Universität; Institut Universitaire de France; Université Côte d'Azur; University of Siena
Immersive environments such as Virtual Reality (VR) are now a main area of interactive digital entertainment. The challenge to design personalized interactive VR systems is specifically to guide and adapt to the user’s attention. Understanding the connection between the visual content and the human attentional process is therefore key. In this article, we investigate this connection by first proposing a new head motion predictor named HeMoG. HeMoG is a white-box model built on physics of rotational motion and gravitation. Second, we compare HeMoG with existing reference Deep Learning models. We show that HeMoG can achieve similar or better performance and provides insights on the inner workings of these black-box models. Third, we study HeMoG parameters in terms of video categories and prediction horizons to gain knowledge on the connection between visual saliency and the head motion process.
Open Access
Conference paper
International Conference on Artificial Intelligence and Virtual Reality
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
QuaPy is an open source framework for Quantification (a.k.a. Supervised Prevalence Estimation) written in Python.
QuaPy roots on the concept of data sample, and provides implementations of most important concepts in quantification literature, such as the most important quantification baselines, many advanced quantification methods, quantification-oriented model selection, many evaluation measures and protocols used for evaluating quantification methods. QuaPy also integrates commonly used datasets and offers visualization tools for facilitating the analysis and interpretation of results.
Open Access
Paper
N/A
Alexandros Metsai; Eleni Adamantidou; Evlampios Apostolidis; Ioannis Patras;
CERTH - Centre for Research and Technology Hellas; Queen Mary University of London
Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms, and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the amount of annotated data and the suitability of evaluation protocols, we indicate potential future research directions.
Open Access
Journal article
Proceedings of the IEEE
Alejandro Moreo; Fabrizio Sebastiani; Juan José del Coz; Pablo González
Consiglio Nazionale delle Ricerche; University of Oviedo
Learning to Quantify (LQ) is the task of training class prevalence estimators via supervised learning. The task of these estimators is to estimate, given an unlabelled set of data items D and a set of classes C = {c_1, ..., c_|C|}, the prevalence (i.e., relative frequency) of each class c_i in D. LQ is interesting in all applications of classification in which the final goal is not determining which class (or classes) individual unlabelled data items belong to, but estimating the distribution of the unlabelled data items across the classes of interest. Example disciplines whose interest in labelling data items is at the aggregate level (rather than at the individual level) are the social sciences, political science, market research, ecological modelling, and epidemiology. While LQ may in principle be solved by classifying each data item in D and counting how many such items have been labelled with c_i, it has been shown that this “classify and count” (CC) method yields suboptimal quantification accuracy. As a result, quantification is now no longer considered a mere byproduct of classification and has evolved as a task of its own. The goal of this workshop is bringing together all researchers interested in methods, algorithms, and evaluation measures and methodologies for LQ, as well as practitioners interested in their practical application to managing large quantities of data.
Open Access
Conference paper
N/A
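The "classify and count" baseline mentioned in the entry above, together with its commonly used adjusted variant, is easy to state in code. The tpr/fpr values in the example are assumptions; in practice they would be estimated on held-out labelled data.

```python
# Sketch of the "classify and count" (CC) baseline and the adjusted variant (ACC)
# for binary quantification.
import numpy as np

def classify_and_count(pred_labels: np.ndarray) -> float:
    """CC: the prevalence estimate is simply the fraction of items predicted positive."""
    return float(pred_labels.mean())

def adjusted_classify_and_count(pred_labels: np.ndarray, tpr: float, fpr: float) -> float:
    """ACC: correct the CC estimate using the classifier's true/false positive rates,
    solving  p_cc = tpr * p + fpr * (1 - p)  for the true prevalence p."""
    p_cc = classify_and_count(pred_labels)
    p = (p_cc - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))

# Example with a biased classifier (tpr=0.9, fpr=0.2) on a sample of predictions:
preds = np.random.binomial(1, 0.45, size=1000)
print(classify_and_count(preds), adjusted_classify_and_count(preds, tpr=0.9, fpr=0.2))
```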
Elisa Ricci; Guanglei Yang; Hao Tang; Mingli Ding; Niculae Sebe
ETH Zurich; Harbin Institute of Technology; University of Trento;
While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
Open Access
Conference paper
International Conference on Computer Vision
Guoying Zhao; Hao Tang; Haoyou Chen; Henglin Shi; Nicu Sebe; Wei Peng
ETH Zurich; University of Oulu; University of Trento;
With the strength of deep generative models, 3D pose transfer regains intensive research interests in recent years. Existing methods mainly rely on a variety of constraints to achieve the pose transfer over 3D meshes, e.g., the need for manually encoding for shape and pose disentanglement. In this paper, we present an unsupervised approach to conduct the pose transfer between any arbitrary given 3D meshes. Specifically, a novel Intrinsic-Extrinsic Preserved Generative Adversarial Network (IEP-GAN) is presented for both intrinsic (i.e., shape) and extrinsic (i.e., pose) information preservation. Extrinsically, we propose a co-occurrence discriminator to capture the structural/pose invariance from distinct Laplacians of the mesh. Meanwhile, intrinsically, a local intrinsic-preserved loss is introduced to preserve the geodesic priors while avoiding heavy computations. At last, we show the possibility of using IEP-GAN to manipulate 3D human meshes in various ways, including pose transfer, identity swapping and pose interpolation with latent code vector arithmetic. The extensive experiments on various 3D datasets of humans, animals and hands qualitatively and quantitatively demonstrate the generality of our approach. Our proposed model produces better results and is substantially more efficient compared to recent state-of-the-art methods. Code is available: https://github.com/mikecheninoulu/Unsupervised_IEPGAN
Open Access
Conference paper
International Conference on Computer Vision
Nicu Sebe; Wei Wang; Yue Song
University of Trento;
Global Covariance Pooling (GCP) aims at exploiting the second-order statistics of the convolutional feature. Its effectiveness has been demonstrated in boosting the classification performance of Convolutional Neural Networks (CNNs). Singular Value Decomposition (SVD) is used in GCP to compute the matrix square root. However, the approximate matrix square root calculated using Newton-Schulz iteration [14] outperforms the accurate one computed via SVD [15]. We empirically analyze the reason behind the performance gap from the perspectives of data precision and gradient smoothness. Various remedies for computing smooth SVD gradients are investigated. Based on our observation and analyses, a hybrid training protocol is proposed for SVD-based GCP meta-layers such that competitive performances can be achieved against Newton-Schulz iteration. Moreover, we propose a new GCP meta-layer that uses SVD in the forward pass, and Padé approximants in the backward propagation to compute the gradients. The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performances on both large-scale and fine-grained datasets.
Open Access
Conference paper
International Conference on Computer Vision
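The Newton-Schulz iteration referenced in the entry above is the standard SVD-free way to approximate the matrix square root used in GCP layers. The sketch below uses the coupled iteration with trace pre-normalisation; the iteration count is an assumption.

```python
# Sketch of the coupled Newton-Schulz iteration for the matrix square root.
import torch

def newton_schulz_sqrt(A: torch.Tensor, num_iters: int = 5) -> torch.Tensor:
    """Approximate the square root of a symmetric positive (semi-)definite matrix."""
    dim = A.shape[0]
    norm = A.trace()
    Y = A / norm                       # normalise so the iteration converges
    Z = torch.eye(dim, dtype=A.dtype)
    I = torch.eye(dim, dtype=A.dtype)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Y * norm.sqrt()             # undo the normalisation

# Quick check against a random SPD (covariance-like) matrix:
X = torch.randn(8, 16)
cov = X @ X.T / 16 + 1e-3 * torch.eye(8)
approx = newton_schulz_sqrt(cov)
print(torch.dist(approx @ approx, cov))   # should be small
```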
Hannes Fassold;
Joanneum Research;
In this work, we propose to progressively increase the training difficulty during learning a neural network model via a novel strategy which we call mini-batch trimming. This strategy makes sure that the optimizer puts its focus in the later training stages on the more difficult samples, which we identify as the ones with the highest loss in the current mini-batch. The strategy is very easy to integrate into an existing training pipeline and does not necessitate a change of the network model. Experiments on several image classification problems show that mini-batch trimming is able to increase the generalization ability (measured via final test error) of the trained model.
Open Access
Conference paper
International Conference on Advances in Signal Processing and Artificial Intelligence
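Mini-batch trimming as described in the entry above amounts to back-propagating only through the hardest samples of each mini-batch in the later stages of training. The keep-fraction schedule and the toy model below are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of mini-batch trimming in PyTorch: in later epochs, back-propagate
# only through the hardest fraction of each mini-batch (the samples with the highest loss).
import torch
import torch.nn as nn

def trimmed_loss(logits: torch.Tensor, targets: torch.Tensor, keep_frac: float) -> torch.Tensor:
    per_sample = nn.functional.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(keep_frac * len(per_sample)))
    hardest, _ = torch.topk(per_sample, k)      # highest-loss samples in the batch
    return hardest.mean()

# Usage inside a toy training loop:
model = nn.Linear(32, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
for epoch in range(10):
    keep = 1.0 if epoch < 5 else 0.5            # trim only in the later training stages
    opt.zero_grad()
    loss = trimmed_loss(model(x), y, keep)
    loss.backward()
    opt.step()
```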
Hannes Fassold;
Joanneum Research;
We present a novel method for detecting speaking persons in video, by extracting facial landmarks with a neural network and analysing these landmarks statistically over time.
Open Access
Conference paper
N/A
Bin Ren; Hao Tang; Niculae Sebe
ETH Zurich; University of Trento;
It is hard to generate an image at target view well for previous cross-view image translation methods that directly adopt a simple encoder-decoder or U-Net structure, especially for drastically different views and severe deformation cases. To ease this problem, we propose a novel two-stage framework with a new Cascaded Cross MLPMixer (CrossMLP) sub-network in the first stage and one refined pixel-level loss in the second stage. In the first stage, the CrossMLP sub-network learns the latent transformation cues between image code and semantic map code via our novel CrossMLP blocks. Then the coarse results are generated progressively under the guidance of those cues. Moreover, in the second stage, we design a refined pixel-level loss that eases the noisy semantic label problem with more reasonable regularization in a more compact fashion for better optimization. Extensive experimental results on Dayton [40] and CVUSA [42] datasets show that our method can generate significantly better results than state-of-the-art methods. The source code and trained models are available at https://github.com/Amazingren/CrossMLP.
Open Access
Conference paper
British Machine Vision Conference
Guoying Zhao; Hao Tang; Haoyou Chen; Niculae Sebe
ETH Zurich; University of Oulu; University of Trento;
We present a novel task, i.e., animating a target 3D object through the motion of a raw driving sequence. In previous works, extra auxiliary correlations between source and target meshes or intermediate factors are inevitable to capture the motions in the driving sequences. Instead, we introduce AniFormer, a novel Transformer-based architecture, that generates animated 3D sequences by directly taking the raw driving sequences and arbitrary same-type target meshes as inputs. Specifically, we customize the Transformer architecture for 3D animation that generates mesh sequences by integrating styles from target meshes and motions from the driving meshes. Besides, instead of the conventional single regression head in the vanilla Transformer, AniFormer generates multiple frames as outputs to preserve the sequential consistency of the generated meshes. To achieve this, we carefully design a pair of regression constraints, i.e., motion and appearance constraints, that can provide strong regularization on the generated mesh sequences. Our AniFormer achieves high-fidelity, realistic, temporally coherent animated results and outperforms compared state-of-the-art methods on benchmarks of diverse categories. Code is available: https://github.com/mikecheninoulu/AniFormer.
Open Access
Conference paper
British Machine Vision Conference
Claudio Gennaro; Giuseppe Amato; Lucia Vadicamo;
ISTI-CNR;
In the domain of approximate metric search, the Permutation-based Indexing (PBI) approaches have been proved to be particularly suitable for dealing with large data collections. These methods employ a permutation-based representation of the data, which can be efficiently indexed using data structures such as inverted files. In the literature, the definition of the permutation of a metric object was derived by reordering the distances of the object to a set of pivots. In this paper, we aim at generalizing this definition in order to enlarge the class of permutations that can be used by PBI approaches. As a practical outcome, we defined a new type of permutation that is calculated using distances from pairs of pivots. The proposed technique permits us to produce longer permutations than traditional ones for the same number of object-pivot distance calculations. The advantage is that the use of inverted files built on permutation prefixes leads to greater efficiency in the search phase when longer permutations are used.
Open Access
Preprint
International Conference on Similarity Search and Applications
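A sketch of the permutation-based representations discussed in the entry above. The classical permutation sorts pivots by their distance to the object; the pair-based variant ranks all pivot pairs, giving a longer code from the same number of object-pivot distances. Ranking pairs by the absolute difference of the two distances is an assumption made for illustration, not necessarily the paper's exact definition.

```python
# Sketch of permutation-based representations for approximate metric search.
import numpy as np
from itertools import combinations

def permutation(obj: np.ndarray, pivots: np.ndarray) -> np.ndarray:
    d = np.linalg.norm(pivots - obj, axis=1)          # n object-pivot distances
    return np.argsort(d)                              # classical permutation (length n)

def pair_permutation(obj: np.ndarray, pivots: np.ndarray) -> np.ndarray:
    d = np.linalg.norm(pivots - obj, axis=1)
    pairs = list(combinations(range(len(pivots)), 2)) # n*(n-1)/2 pivot pairs
    scores = np.array([abs(d[i] - d[j]) for i, j in pairs])
    return np.argsort(scores)                         # longer permutation over pairs

pivots = np.random.randn(8, 16)                       # 8 pivots in a 16-d space
obj = np.random.randn(16)
print(len(permutation(obj, pivots)), len(pair_permutation(obj, pivots)))  # 8 vs 28
```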
Alberto Del Bimbo; Francesco Bongini; Lorenzo Berlincioni; Marco Bertini;
University of Florence;
In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, compositing synthetic 3D objects within real scenes. We show the performance of the proposed system in the context of object detection in thermal videos, a domain where i) training datasets are very limited compared to visible spectrum datasets and ii) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state of the art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques.
Open Access
Conference paper
ACM Multimedia Systems Conference
Christos Tzelepis; Georgios Tzimiropoulos; Ioannis Patras;
Queen Mary University of London;
This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of the state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, and where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, that discovers linear paths, we optimize the trainable parameters of the set of RBFs, so as that images that are generated by codes along different paths, are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled and interpretable changes in the image space than in state-of-the art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at: https://github.com/chi0tzp/WarpedGANSpace.
Open Access
Conference paper
International Conference on Computer Vision
Charalampos Symeonidis; Ioannis Pitas; Sotirios Papadopoulos
Aristotle University of Thessaloniki;
This paper addresses the important problem of leader detection in racing sports videos (e.g., cycling, boating and car racing events), as his/her proper framing is a pivotal issue in racing sports cinematography, where the events have a linear spatial deployment. Over the last few years, as autonomous drone vision and cinematography emerged, new challenges appeared in drone vision. While, until recently, most computer vision methods typically addressed still camera AV footage, drone sports cinematography typically employs moving cameras. In this paper, we solve the problem of leader detection in a group of similarly moving targets in sports videos, e.g. the leader of a sports cyclist group and his/her breakaway during a cycling event. This is very useful in drone sports cinematography, as it is important that the drone camera automatically centers on such a leader. We demonstrate that the novel method described in this paper can effectively solve the problem of leader detection in sports videos.
Open Access
Conference paper
IEEE International Workshop on Multimedia Signal Processing
Christos Tzelepis; Ioannis Patras; Niki Maria Foteinopoulou
Queen Mary University of London;
Continuous affect estimation is a problem where there is an inherent uncertainty and subjectivity in the labels that accompany data samples — typically, datasets use the average of multiple annotations or self-reporting to obtain ground truth labels. In this work, we propose a method for uncertainty-aware continuous affect estimation, that models explicitly the uncertainty of the ground truth label as a uni-variate Gaussian with mean equal to the ground truth label, and unknown variance. For each sample, the proposed neural network estimates not only the value of the target label (valence and arousal in our case), but also the variance. The network is trained with a loss that is defined as the KL-divergence between the estimation (valence/arousal) and the Gaussian around the ground truth. We show that, in two affect recognition problems with real data, the estimated variances are correlated with measures of uncertainty/error in the labels that are extracted by considering multiple annotations of the data.
Open Access
Conference paper
N/A
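A hedged sketch of the uncertainty-aware loss described in the entry above: the network predicts a mean and a (log-)variance per sample, and the loss is the KL divergence to a Gaussian centred on the ground-truth label. Treating the label variance as a fixed hyperparameter sigma_y is a simplifying assumption made here for illustration.

```python
# Sketch of a KL-divergence loss between a predicted Gaussian and a Gaussian
# centred on the ground-truth label (fixed label variance is an assumption).
import torch

def gaussian_kl_loss(pred_mu: torch.Tensor, pred_logvar: torch.Tensor,
                     target: torch.Tensor, sigma_y: float = 0.1) -> torch.Tensor:
    """KL( N(pred_mu, pred_var) || N(target, sigma_y^2) ), averaged over the batch."""
    pred_var = pred_logvar.exp()
    var_y = sigma_y ** 2
    kl = (0.5 * torch.log(torch.tensor(var_y)) - 0.5 * pred_logvar
          + (pred_var + (pred_mu - target) ** 2) / (2.0 * var_y) - 0.5)
    return kl.mean()

# Example: a valence/arousal head outputs (mean, log-variance) per sample.
mu = torch.tensor([0.2, -0.4])
logvar = torch.tensor([-2.0, -1.0])
labels = torch.tensor([0.25, -0.30])
print(gaussian_kl_loss(mu, logvar, labels))
```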
Anna Bobasheva; Fabien Gandon; Frédéric Precioso;
Université Côte d'Azur;
This work combines semantic reasoning and machine learning to create tools that allow curators of the visual art collections to identify and correct the annotations of the artwork as well as to improve the relevance of the content-based search results in these collections. The research is based on the Joconde database maintained by French Ministry of Culture that contains illustrated artwork records from main French public and private museums representing archeological objects, decorative arts, fine arts, historical and scientific documents, etc. The Joconde database includes semantic metadata that describes properties of the artworks and their content. The developed methods create a data pipeline that processes metadata, trains a Convolutional Neural Network image classification model, makes prediction for the entire collection and expands the metadata to be the base for the SPARQL search queries. We developed a set of such queries to identify noise and silence in the human annotations and to search image content with results ranked according to the relevance of the objects quantified by the prediction score provided by the deep learning model. We also developed methods to discover new contextual relationships between the concepts in the metadata by analyzing the contrast between the concepts similarities in the Joconde’s semantic model and other vocabularies and we tried to improve the model prediction scores based on the semantic relations. Our results show that cross-fertilization between symbolic AI and machine learning can indeed provide the tools to address the challenges of the museum curators work describing the artwork pieces and searching for the relevant images.
Open Access
Journal article
ACM Journal on Computing and Cultural Heritage
Ioannis Mademlis; Ioannis Pitas; Michail Kaseris
Aristotle University of Thessaloniki;
Automated unsupervised video summarization by key-frame extraction consists in identifying representative video frames, best abridging a complete input sequence, and temporally ordering them to form a video summary, without relying on manually constructed ground-truth key-frame sets. State-of-the-art unsupervised deep neural approaches consider the desired summary to be a subset of the original sequence, composed of video frames that are sufficient to visually reconstruct the entire input. They typically employ a pre-trained CNN for extracting a vector representation per RGB video frame and a baseline LSTM adversarial learning framework for identifying key-frames. In this paper, to better guide the network towards properly selecting video frames that can faithfully reconstruct the original video, we augment the baseline framework with an additional LSTM autoencoder, which learns in parallel a fixed-length representation of the entire original input sequence. This is exploited during training, where a novel loss term inspired by dictionary learning is added to the network optimization objectives, further biasing key-frame selection towards video frames which are collectively able to recreate the original video. Empirical evaluation on two common public relevant datasets indicates highly favourable results.
Open Access
Conference paper
IEEE International Conference on Image Processing
Alberto Del Bimbo; Federico Vaccaro; Marco Bertini; Tiberio Uricchio
University of Florence;
In this paper, we address the problem of real-time video quality enhancement, considering both frame super-resolution and compression artifact-removal. The first operation increases the sampling resolution of video frames, the second removes visual artifacts such as blurriness, noise, aliasing, or blockiness introduced by lossy compression techniques, such as JPEG encoding for single-images, or H.264/H.265 for video data.
We propose to use SR-UNet, a novel network architecture based on UNet, that has been specialized for fast visual quality improvement (i.e. capable of operating in less than 40ms, to be able to operate on videos at 25FPS). We show how this network can be used in a streaming context where the content is generated live, e.g. in video calls, and how it can be optimized when videos to be streamed are prepared in advance. The network can be used as a final post processing, to optimize the visual appearance of a frame before showing it to the end-user in a video player. Thus, it can be applied without any change to existing video coding and transmission pipelines.
Open Access
Paper
ACM Multimedia Systems Conference
Hao Tang; Nicu Sebe;
University of Trento;
In this paper, we address the task of layout-to-image translation, which aims to translate an input semantic layout to a realistic image. One open challenge widely observed in existing methods is the lack of effective semantic constraints during the image translation process, leading to models that cannot preserve the semantic information and ignore the semantic dependencies within the same object. To address this issue, we propose a novel Double Pooling GAN (DPGAN) for generating photo-realistic and semantically-consistent results from the input layout. We also propose a novel Double Pooling Module (DPM), which consists of the Square-shape Pooling Module (SPM) and the Rectangle-shape Pooling Module (RPM). Specifically, SPM aims to capture short-range semantic dependencies of the input layout with different spatial scales, while RPM aims to capture long-range semantic dependencies from both horizontal and vertical directions. We then effectively fuse both outputs of SPM and RPM to further enlarge the receptive field of our generator. Extensive experiments on five popular datasets show that the proposed DPGAN achieves better results than state-of-the-art methods. Finally, both SPM and RPM are general and can be seamlessly integrated into any GAN-based architectures to strengthen the feature representation. The code is available at https://github.com/Ha0Tang/DPGAN.
Closed Access
Journal article
IEEE Transactions on Image Processing
Alessandro Benedetto; Aurelia Viglione; Giulia Ricci; Giulia Sagona; Giuseppe Amato; Leonardo Lupori; Luca Lo Verde; Raffaele Mazziotti; Tommaso Pizzorusso
IRCCS Stella Maris Foundation; ISTI-CNR; Italy National Research Council; University of Florence; University of Pisa
Pupil dynamics alterations have been found in patients affected by a variety of neuropsychiatric conditions, including autism. Studies in mouse models have used pupillometry for phenotypic assessment and as a proxy for arousal. Both in mice and humans, pupillometry is non-invasive and allows for longitudinal experiments supporting temporal specificity; however, its measurement requires dedicated setups. Here, we introduce a Convolutional Neural Network that performs online pupillometry in both mice and humans in a web app format. This solution dramatically simplifies the usage of the tool for the non-specialist and non-technical operators. Because a modern web browser is the only software requirement, this choice is of great interest given its easy deployment and set-up time reduction. The tested model performances indicate that the tool is sensitive enough to detect both locomotor-induced and stimulus-evoked pupillary changes, and its output is comparable with state-of-the-art commercial devices.
Open Access
Journal article
eNeuro
Been Kim; Jingkuan Song; Niculae Sebe; Qiang Liu; Xiang Wang; Xianglong Liu; Xiao Bai
Beihang University; Google USA; University of Texas; University of Trento;
Deep learning has recently achieved great success in many visual recognition tasks. However, the deep neural networks (DNNs) are often perceived as black-boxes, making their decision less understandable to humans and prohibiting their usage in safety-critical applications. This guest editorial introduces the thirty papers accepted for the Special Issue on Explainable Deep Learning for Efficient and Robust Pattern Recognition. They are grouped into three main categories: explainable deep learning methods, efficient deep learning via model compression and acceleration, as well as robustness and stability in deep learning. For each of the three topics, a survey of the representative works and latest developments is presented, followed by the brief introduction of the accepted papers belonging to this topic. The special issue should be of high relevance to the reader interested in explainable deep learning methods for efficient and robust pattern recognition applications and it helps promote future research directions in this field.
Closed Access
Journal article
International Conference on Pattern Recognition
Alexey Ozerov; Ngoc Q. K. Duong
InterDigital
Deep neural networks (DNNs) have achieved great success in various machine learning tasks. However, most existing powerful DNN models are computationally expensive and memory demanding, hindering their deployment in devices with low memory and computational resources or in applications with strict latency requirements. Thus, several resource-adaptable or flexible approaches were recently proposed that train at the same time a big model and several resource-specific sub-models. In-place knowledge distillation (IPKD) became a popular method to train those models and consists in distilling the knowledge from a larger model (teacher) to all other sub-models (students). In this work a novel generic training method called IPKD with teacher assistant (IPKD-TA) is introduced, where sub-models themselves become teacher assistants teaching smaller sub-models. We evaluated the proposed IPKD-TA training method using two state-of-the-art flexible models (MSDNet and Slimmable MobileNet-V1) with two popular image classification benchmarks (CIFAR-10 and CIFAR-100). Our results demonstrate that the IPKD-TA is on par with the existing state of the art while improving it in most cases.
Open Access
Conference paper
European Signal Processing Conference
Claudio Gennaro; Fabrizio Falchi; Gabriele Lagani; Giuseppe Amato;
ISTI-CNR;
We propose to address the issue of sample efficiency, in Deep Convolutional Neural Networks (DCNN), with a semi-supervised training strategy that combines Hebbian learning with gradient descent: all internal layers (both convolutional and fully connected) are pre-trained using an unsupervised approach based on Hebbian learning, and the last fully connected layer (the classification layer) is trained using Stochastic Gradient Descent (SGD). In fact, as Hebbian learning is an unsupervised learning method, its potential lies in the possibility of training the internal layers of a DCNN without labels. Only the final fully connected layer has to be trained with labeled examples. We performed experiments on various object recognition datasets, in different regimes of sample efficiency, comparing our semi-supervised (Hebbian for internal layers + SGD for the final fully connected layer) approach with end-to-end supervised backprop training, and with semi-supervised learning based on Variational Auto-Encoder (VAE). The results show that, in regimes where the number of available labeled samples is low, our semi-supervised approach outperforms the other approaches in almost all the cases.
Open Access
Journal article
Neural Networks
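The hybrid training strategy in the entry above (unsupervised Hebbian pre-training of internal layers, supervised training of the final layer only) can be illustrated with a toy example. Using Oja's rule as the Hebbian update and a single pre-trained layer are simplifications; the paper applies Hebbian learning to all internal convolutional and fully connected layers.

```python
# Minimal sketch: Hebbian (Oja-style) pre-training of one layer without labels,
# followed by supervised training of only the final classifier on few labelled samples.
import numpy as np

def oja_pretrain(X: np.ndarray, n_units: int, lr: float = 0.01, epochs: int = 5) -> np.ndarray:
    """Unsupervised Hebbian pre-training of a weight matrix W (n_units x n_features)."""
    W = np.random.randn(n_units, X.shape[1]) * 0.01
    for _ in range(epochs):
        for x in X:
            y = W @ x                                              # layer activations
            W += lr * (np.outer(y, x) - (y ** 2)[:, None] * W)     # Oja-style update per unit
    return W

# Features from the Hebbian layer feed a classifier trained with labels (e.g., SGD):
X_unlabelled = np.random.randn(500, 64)
W = oja_pretrain(X_unlabelled, n_units=32)
X_labelled = np.random.randn(20, 64)
features = np.maximum(X_labelled @ W.T, 0.0)   # ReLU features for the final layer
# ...fit any linear classifier on `features` with the 20 labels (omitted for brevity).
```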
Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Nicola Messina;
ISTI-CNR;
This paper describes the system used by the AIMH Team to approach the SemEval Task 6. We propose an approach that relies on an architecture based on the transformer model to process multimodal content (text and images) in memes. Our architecture, called DVTT (Double Visual Textual Transformer), approaches Subtasks 1 and 3 of Task 6 as multi-label classification problems, where the text and/or images of the meme are processed, and the probabilities of the presence of each possible persuasion technique are returned as a result. DVTT uses two complete networks of transformers that work on text and images that are mutually conditioned. One of the two modalities acts as the main one and the second one intervenes to enrich the first one, thus obtaining two distinct ways of operation. The two transformers outputs are merged by averaging the inferred probabilities for each possible label, and the overall network is trained end-to-end with a binary cross-entropy loss.
Open Access
Conference paper
International Workshop on Semantic Evaluation
Niculae Sebe; Paolo Rota; Petru Soviany; Radu Tudor Ionescu
University of Trento;
Training (source) domain bias affects state-of-the-art object detectors, such as Faster R-CNN, when applied to new (target) domains. To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g. by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN. On top of combining Cycle-GAN transformations and self-paced learning in a smart and efficient way, in this paper, we propose a novel self-paced algorithm that learns from easy to hard. Our method is simple and effective, without any overhead during inference. It uses only pseudo-labels for samples taken from the target domain, i.e. the domain adaptation is unsupervised. We conduct experiments on four cross-domain benchmarks, showing better results than the state of the art. We also perform an ablation study demonstrating the utility of each component in our framework. Additionally, we study the applicability of our framework to other object detectors. Furthermore, we compare our difficulty measure with other measures from the related literature, proving that it yields superior results and that it correlates well with the performance metric.
Closed Access
Journal article
N/A
Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
This work addresses the problem of adversarial robustness in deep neural network classification from an optimal class boundary estimation perspective. It is argued that increased model robustness to adversarial attacks can be achieved when the feature learning process is monitored by geometrically-inspired optimization criteria. To this end, we propose to learn hyperspherical class prototypes in the neural feature embedding space, along with training the network parameters. Three concurrent optimization functions for the intermediate hidden layer training data activations are devised, requiring items of the same class to be enclosed by the corresponding class prototype boundaries, to have minimum distance from their class prototype vector (i.e., hypersphere center) and to have maximum distance from the remainder hypersphere centers. Our experiments show that training standard classification model architectures with the proposed objectives, significantly increases their robustness to white-box adversarial attacks, without adverse (if not beneficial) effects to their classification accuracy.
Open Access
Report
N/A
Anastasios Tefas; Ioannis Pitas; Vasileios Mygdalis
Aristotle University of Thessaloniki;
The network output activation values for a given input can be employed to produce a sorted ranking. Adversarial attacks typically generate the least amount of perturbation required to change the classifier label. In that sense, generated adversarial attack perturbation only affects the output in the 1st sorted ranking position. We argue that meaningful information about the adversarial examples, i.e., their original labels, is still encoded in the network output ranking and could potentially be extracted, using rule-based reasoning. To this end, we introduce a novel adversarial attack methodology inspired by the K-anonymity principles, that generates adversarial examples that are not only misclassified by the neural network classifier, but are uniformly spread along K different positions in the output sorted ranking. In order to regulate the introduced perturbation that arises from the strength of the proposed optimization objectives, an additional visual similarity-based loss function is introduced as well, guiding the adversarial examples towards directions maintaining visual similarity according to the same objective metric, such as the CW-SSIM. Experimental results denote that the proposed approach achieves the optimization goals inspired by K-anonymity, while introducing reduced perturbation as well.
Open Access
Report
N/A
Bogdan Ionescu; Liviu-Daniel Stefan; Mihai Gabriel Constantin
University Politehnica of Bucharest
In the context of the ever-growing quantity of multimedia content from social, news and educational platforms, generating meaningful recommendations and ratings now requires a more advanced understanding of their impact on the user, such as their subjective perception. One of the important subjective concepts explored by researchers is visual interestingness. While several definitions of this concept are given in the current literature, in a broader sense, this property attempts to measure the ability of audio-visual data to capture and keep the viewer’s attention for longer periods of time. While many computer vision and machine learning methods have been tested for predicting media interestingness, overall, due to the heavily subjective nature of interestingness, the precision of the results is relatively low. In this chapter, we investigate several methods that address this problem from a different angle. We first review the literature on interestingness prediction and present an overview of the traditional fusion mechanisms, such as statistical fusion, weighted approaches, boosting, random forests or randomized trees. Further, we explore the possibility of employing a stronger, novel deep learning-based system fusion for enhancing the performance. We investigate several types of deep networks for creating the fusion systems, including dense, attention, convolutional and cross-space-fusion networks, while also proposing some input decoration methods that help these networks achieve optimal performance. We present the results, as well as an analysis of the correlation between network structure and overall system performance. Experimental validation is carried out on a publicly available data set and on the systems benchmarked during the 2017 MediaEval Predicting Media Interestingness task.
Open Access
Section
N/A
Alexandr Ermolov; Aliaksandr Siarohin; Enver Sangineto; Niculae Sebe
University of Trento;
Most of the current self-supervised representation learning (SSL) methods are based on the contrastive loss and the instance-discrimination task, where augmented versions of the same image instance (“positives”) are contrasted with instances extracted from other images (“negatives”). For the learning to be effective, many negatives should be compared with a positive pair, which is computationally demanding. In this paper, we propose a different direction and a new loss function for SSL, which is based on the whitening of the latent-space features. The whitening operation has a “scattering” effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point. Our solution does not require asymmetric networks and it is conceptually simple. Moreover, since negatives are not needed, we can extract multiple positive pairs from the same image instance. The source code of the method and of all the experiments is available at: https://github.com/htdt/self-supervised.
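A minimal sketch of the whitening idea, assuming a ZCA-style whitening of the batch and a plain MSE between the two whitened views (the repository linked above is the authoritative implementation):

# Sketch of a whitening-based SSL loss: batch features are whitened so the
# batch scatters over the sphere, then positives are pulled together with MSE.
import torch

def whiten(z, eps=1e-4):
    """ZCA-whiten a (N, D) batch of features."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov + eps * torch.eye(z.shape[1]))
    w = eigvecs @ torch.diag(eigvals.clamp(min=eps).rsqrt()) @ eigvecs.T
    return z @ w

def whitening_mse_loss(z1, z2):
    """z1, z2: two augmented views of the same batch, shape (N, D)."""
    return ((whiten(z1) - whiten(z2)) ** 2).sum(dim=1).mean()

z1, z2 = torch.randn(64, 32), torch.randn(64, 32)
print(whitening_mse_loss(z1, z2))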
Open Access
Conference paper
International Conference on Machine Learning
Antonios Liapis; Georgios N. Yannakakis; Konstantinos Sfikas;
University of Malta
A core challenge of evolutionary search is the need to balance between exploration of the search space and exploitation of highly fit regions. Quality-diversity search has explicitly walked this tightrope between a population’s diversity and its quality. This paper extends a popular quality-diversity search algorithm, MAP-Elites, by treating the selection of parents as a multi-armed bandit problem. Using variations of the upper-confidence bound to select parents from under-explored but potentially rewarding areas of the search space can accelerate the discovery of new regions as well as improve its archive’s total quality. The paper tests an indirect measure of quality for parent selection: the survival rate of a parent’s offspring. Results show that maintaining a balance between exploration and exploitation leads to the most diverse and high-quality set of solutions in three different testbeds.
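The parent-selection rule can be illustrated with a small sketch; the bookkeeping below is a plain UCB1 variant with offspring survival as the reward, written for illustration rather than taken from the paper:

# Illustrative sketch of selecting MAP-Elites parents with an
# upper-confidence-bound rule, using offspring survival as the reward.
import math, random

class UCBParentSelector:
    def __init__(self, c=1.0):
        self.c = c
        self.pulls = {}    # cell -> times selected as parent
        self.reward = {}   # cell -> cumulative offspring survival (0/1)

    def select(self, archive_cells):
        total = sum(self.pulls.get(cell, 0) for cell in archive_cells) + 1
        def ucb(cell):
            n = self.pulls.get(cell, 0)
            if n == 0:
                return float("inf")          # try unvisited cells first
            mean = self.reward.get(cell, 0.0) / n
            return mean + self.c * math.sqrt(math.log(total) / n)
        return max(archive_cells, key=ucb)

    def update(self, cell, offspring_survived):
        self.pulls[cell] = self.pulls.get(cell, 0) + 1
        self.reward[cell] = self.reward.get(cell, 0.0) + float(offspring_survived)

# toy usage with a dummy archive of behaviour-space cells
selector = UCBParentSelector()
cells = [(i, j) for i in range(3) for j in range(3)]
for _ in range(20):
    parent = selector.select(cells)
    survived = random.random() < 0.5         # stand-in for "offspring entered archive"
    selector.update(parent, survived)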
Open Access
Conference paper
Genetic and Evolutionary Computation Conference
Claudio Gennaro; Claudio Vairo; Fabio Carrara; Fabrizio Falchi; Franca Debole; Giuseppe Amato; Lucia Vadicamo; Paolo Bolettieri;
ISTI-CNR;
This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined together to express complex queries and meet users’ needs. The peculiarity of our approach is that we encode all information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval performance of the system, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies from those we tested.
Open Access
Journal article
Journal of Imaging
Claudio Gennaro; Claudio Vairo; Fabrizio Falchi; Giuseppe Amato; Lucia Vadicamo; Nicola Messina; Paolo Bolettieri;
ISTI-CNR;
This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine.
Open Access
Conference paper
MultiMedia Modeling
Adrián Pérez-Salinas; Artur Garcia-Saez; David López-Núñez; José I. Latorre; P. Forn-Díaz
Barcelona Institute of Science and Technology; Barcelona Supercomputing Center; Center for Quantum Technologies; Qilimanjaro Quantum Tech; Quantum Research Centre; Universidad de Barcelona
A single-qubit circuit can approximate any bounded complex function stored in the degrees of freedom defining its quantum gates. The single-qubit approximant presented in this work is operated through a series of gates that take as their parameterization the independent variable of the target function and an additional set of adjustable parameters. The independent variable is re-uploaded in every gate while the parameters are optimized for each target function. The output state of this quantum circuit becomes more accurate as the number of re-uploadings of the independent variable increases, i.e., as more layers of gates parameterized with the independent variable are applied. In this work, we provide two different proofs of this claim related to both Fourier series and the Universal Approximation Theorem for Neural Networks, and we benchmark both methods against their classical counterparts. We further implement a single-qubit approximant in a real superconducting qubit device, demonstrating how the ability to describe a set of functions improves with the depth of the quantum circuit. This work shows the robustness of the re-uploading technique on Quantum Machine Learning.
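A toy numpy simulation of the re-uploading idea, written as a hedged sketch (the layer parameterization and the random-search fit below are illustrative assumptions, not the paper's circuit or training procedure):

# Toy sketch of single-qubit data re-uploading: each layer applies a rotation
# whose angle mixes the input x with trainable parameters; the output is a
# measurement probability used as the approximant value.
import numpy as np

def ry(theta):
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

def circuit_output(x, params):
    """params: array of shape (layers, 2); angle per layer = w*x + b."""
    state = np.array([1.0, 0.0])                 # |0>
    for w, b in params:
        state = ry(w * x + b) @ state            # re-upload x in every layer
    return abs(state[1]) ** 2                    # P(|1>), the approximant value

# fit a target function by naive random search over parameters (illustration only)
target = lambda x: 0.5 * (1 + np.sin(3 * x))
xs = np.linspace(-np.pi, np.pi, 50)
rng = np.random.default_rng(0)
best_err = np.inf
for _ in range(2000):
    p = rng.uniform(-np.pi, np.pi, size=(4, 2))  # 4 re-uploading layers
    err = np.mean([(circuit_output(x, p) - target(x)) ** 2 for x in xs])
    best_err = min(best_err, err)
print("best MSE:", best_err)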
Open Access
Publication
N/A
Georgios Zoumpourlis; Ioannis Patras;
Queen Mary University of London;
In this work we study the problem of emotion recognition under the prism of preference learning. Affective datasets are typically annotated by assigning a single absolute label, i.e. a numerical value that describes the intensity of an emotional attribute, to each sample. Then, the majority of existing works on affect recognition employ sample-wise classification/regression methods to predict affective states, using those annotations. We take a different approach and use a deep network architecture that performs joint training on the tasks of classification/regression of samples and ordinal ranking between pairs of samples. By treating input samples in a pairwise manner, we leverage the auxiliary task of inferring the ordinal relation between their corresponding affective states. Incorporating the ranking objective allows capturing the inherently ordinal structure of emotions and learning the inter-sample relations, resulting in better generalization. Our method is incorporated into existing affect recognition architectures and evaluated on datasets of electroencephalograms (EEG) and images. We show that the approach proposed in this work leads to consistent performance gains when incorporated in classification/regression networks.
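A hedged sketch of the joint objective, assuming an MSE regression term plus a margin ranking term on pairs (the weighting and margin values are illustrative, not the paper's settings):

# Sketch of joint per-sample regression and pairwise ordinal ranking of
# affective intensity for a batch of sample pairs.
import torch
import torch.nn.functional as F

def joint_affect_loss(pred_a, pred_b, label_a, label_b, alpha=0.5, margin=0.1):
    """pred_*, label_*: (N,) predicted and ground-truth intensities for a pair batch."""
    # sample-wise regression term
    l_reg = F.mse_loss(pred_a, label_a) + F.mse_loss(pred_b, label_b)
    # ordinal term: predictions should preserve the order of the labels
    order = torch.sign(label_a - label_b)          # +1, 0 or -1 per pair
    l_rank = F.margin_ranking_loss(pred_a, pred_b, order, margin=margin)
    return l_reg + alpha * l_rank

pa, pb = torch.rand(16, requires_grad=True), torch.rand(16, requires_grad=True)
la, lb = torch.rand(16), torch.rand(16)
joint_affect_loss(pa, pb, la, lb).backward()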
Open Access
Conference paper
International Conference on ACII
Hao Tang; Nicu Sebe;
University of Trento;
We propose a novel and unified Cycle in Cycle Generative Adversarial Network (C2GAN) for generating human faces, hands, bodies, and natural scenes. Our proposed C2GAN is a cross-modal model exploring the joint exploitation of the input image data and guidance data in an interactive manner. C2GAN contains two different generators, i.e., an image-generation generator and a guidance-generation generator. Both generators are mutually connected and trained in an end-to-end fashion and explicitly form three cycled subnets, i.e., one image generation cycle and two guidance generation cycles. Each cycle aims at reconstructing the input domain and simultaneously produces a useful output involved in the generation of another cycle. In this way, the cycles constrain each other implicitly providing complementary information from both image and guidance modalities and bringing an extra supervision gradient across the cycles, facilitating a more robust optimization of the whole model. Extensive results on four guided image-to-image translation subtasks demonstrate that the proposed C2GAN is effective in generating more realistic images compared with state-of-the-art models.
Closed Access
Journal article
IEEE Transactions on Multimedia
Fengxiang Yang; Hong Liu; Nicu Sebe; Shaozi Li; Shin'ichi Satoh; Zheng Wang; Zhiming Luo; Zhun Zhong
National Institute of Informatics of Tokyo; University of Tokyo; University of Trento; Xiamen University
Recent advances in person re-identification (re-ID) have led to impressive retrieval accuracy. However, existing re-ID models are challenged by the adversarial examples crafted by adding quasi-imperceptible perturbations. Moreover, re-ID systems face the domain shift issue that training and testing domains are not consistent. In this study, we argue that learning powerful attackers with high universality that works well on unseen domains is an important step in promoting the robustness of re-ID systems. Therefore, we introduce a novel universal attack algorithm called “MetaAttack” for person re-ID. MetaAttack can mislead re-ID models on unseen domains by a universal adversarial perturbation. Specifically, to capture common patterns across different domains, we propose a meta-learning scheme to seek the universal perturbation via the gradient interaction between meta-train and meta-test formed by two datasets. We also take advantage of a virtual dataset (PersonX), instead of real ones, to conduct meta-test. This scheme not only enables us to learn with more comprehensive variation factors but also mitigates the negative effects caused by biased factors of real datasets. Experiments on three large-scale re-ID datasets demonstrate the effectiveness of our method in attacking re-ID models on unseen domains. Our final visualization results reveal some new properties of existing re-ID systems, which can guide us in designing a more robust re-ID model. Code and supplemental material are available at https://github.com/FlyingRoastDuck/MetaAttack AAAI21.
Open Access
Conference paper
Conference on Artificial Intelligence
Bogdan Ionescu; Cristian Stanciu;
University Politehnica of Bucharest
Generative models have evolved immensely in the last few years. GAN-based video and image generation has become very accessible due to open-source software available to anyone, and that may pose a threat to society. Deepfakes can be used to intimidate or blackmail public figures, or to mislead the public. At the same time, with the rising popularity of deepfakes, detection algorithms have also evolved significantly. The majority of those algorithms focus on images rather than exploring the temporal evolution of the video. In this paper, we explore whether the temporal information of the video can be used to increase the performance of state-of-the-art deepfake detection algorithms. We also investigate whether certain facial regions contain more information about the authenticity of the video by using the entire aligned face as input for our model and by only selecting certain facial regions. We use late fusion to combine those results for increased performance. To validate our solution, we experiment on 2 state-of-the-art datasets, namely FaceForensics++ and CelebDF. The results show that using the temporal dimension can greatly enhance the performance of a deep learning model.
Open Access
Conference paper
N/A
Bruno Lepri; Enver Sangineto; Haoxian Zhang; Linchao Bao; Marco de Nadai; Nicu Sebe; Wei Wang; Yahui Liu; Yajing Chen;
Fondazione Bruno Kessler; Tencent AI Lab; University of Trento;
Image-to-Image (I2I) multi-domain translation models are usually also evaluated using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of the latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the smoothness of the interpolations.
Open Access
Conference paper
N/A
Elisa Ricci; Enrico Fini; Nicu Sebe; Subhankar Roy; Zhiming Luo; Zhun Zhong
Fondazione Bruno Kessler; University of Trento; Xiamen University
In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).
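The two ingredients can be sketched roughly as follows (the neighbour retrieval and the feature interpolation below are simplified assumptions, not the released code):

# Rough sketch of the two NCL ingredients: pseudo-positives from nearest
# neighbours in the unlabeled set, and hard negatives obtained by mixing
# labeled with unlabeled features.
import torch
import torch.nn.functional as F

def pseudo_positives(unlabeled_feats, k=3):
    """Return indices of the k nearest neighbours of each unlabeled feature."""
    z = F.normalize(unlabeled_feats, dim=1)
    sim = z @ z.T
    sim.fill_diagonal_(-1.0)                      # exclude the sample itself
    return sim.topk(k, dim=1).indices             # (N, k)

def mix_hard_negatives(unlabeled_feats, labeled_feats, lam=0.6):
    """Interpolate unlabeled features with randomly drawn labeled features."""
    idx = torch.randint(0, labeled_feats.shape[0], (unlabeled_feats.shape[0],))
    return lam * unlabeled_feats + (1 - lam) * labeled_feats[idx]

u = torch.randn(32, 128)   # features of unlabeled (novel-class) samples
l = torch.randn(64, 128)   # features of labeled (known-class) samples
neighbours = pseudo_positives(u)        # candidates treated as positives
hard_negs = mix_hard_negatives(u, l)    # extra negatives for the contrastive loss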
Open Access
Conference paper
N/A
Linchao Zhu; Nicu Sebe; Shaozi Li; Yi Yang; Zhiming Luo; Zhun Zhong
University of Technology Sydney; University of Trento; Xiamen University
In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes. Existing methods typically first pre-train a model with labeled data, and then identify new classes in unlabeled data via unsupervised clustering. However, the labeled data that provide essential knowledge are often underexplored in the second step. The challenge is that the labeled and unlabeled examples are from non-overlapping classes, which makes it difficult to build a learning relationship between them. In this work, we introduce OpenMix to mix the unlabeled examples from an open set and the labeled examples from known classes, where their non-overlapping labels and pseudo-labels are simultaneously mixed into a joint label distribution. OpenMix dynamically compounds examples in two ways. First, we produce mixed training images by incorporating labeled examples with unlabeled examples. With the benefit of unique prior knowledge in novel class discovery, the generated pseudo-labels will be more credible than the original unlabeled predictions. As a result, OpenMix helps prevent the model from overfitting on unlabeled samples that may be assigned with wrong pseudo-labels. Second, the first mixing step encourages the unlabeled examples with high class probabilities to have considerable accuracy. We introduce these examples as reliable anchors and further integrate them with unlabeled samples. This enables us to generate more combinations in unlabeled examples and exploit finer object relations among the new classes. Experiments on three classification datasets demonstrate the effectiveness of the proposed OpenMix, which is superior to state-of-the-art methods in novel class discovery.
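A simplified sketch of the mixing step, assuming a mixup-style interpolation and a joint label space that concatenates known-class one-hot vectors with new-class pseudo-label distributions (details differ from the official implementation):

# Sketch of the OpenMix idea: mix a labeled image with an unlabeled one and
# mix their label vectors in a joint space over known + new classes.
import torch

def openmix(img_lab, y_lab, img_unlab, p_unlab, n_known, n_new, alpha=1.0):
    """y_lab: (N,) known-class ids; p_unlab: (N, n_new) pseudo-label distributions."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_img = lam * img_lab + (1 - lam) * img_unlab
    # joint label distribution over known + new classes
    onehot = torch.zeros(y_lab.shape[0], n_known + n_new)
    onehot[torch.arange(y_lab.shape[0]), y_lab] = 1.0
    pad_unlab = torch.cat([torch.zeros(p_unlab.shape[0], n_known), p_unlab], dim=1)
    mixed_label = lam * onehot + (1 - lam) * pad_unlab
    return mixed_img, mixed_label

imgs_l, ys = torch.randn(8, 3, 32, 32), torch.randint(0, 5, (8,))
imgs_u = torch.randn(8, 3, 32, 32)
ps = torch.softmax(torch.randn(8, 3), dim=1)        # pseudo-labels over 3 new classes
x_mix, y_mix = openmix(imgs_l, ys, imgs_u, ps, n_known=5, n_new=3)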
Open Access
Conference paper
N/A
Fengxiang Yang; Nicu Sebe; Shaozi Li; Yaojing Lin; Yuanzheng Cai; Zhiming Luo; Zhun Zhong
Minjiang University; Minnan Normal University; University of Trento; Xiamen University
This paper considers the problem of unsupervised person re-identification (re-ID), which aims to learn discriminative models with unlabeled data. One popular method is to obtain pseudo-labels by clustering and use them to optimize the model. Although this kind of approach has shown promising accuracy, it is hampered by 1) noisy labels produced by clustering and 2) feature variations caused by camera shift. The former leads to incorrect optimization and thus hinders the model accuracy. The latter results in assigning the intra-class samples of different cameras to different pseudo-labels, making the model sensitive to camera variations. In this paper, we propose a unified framework to solve both problems. Concretely, we propose a Dynamic and Symmetric Cross Entropy loss (DSCE) to deal with noisy samples and a camera-aware meta-learning algorithm (MetaCam) to adapt to camera shift. DSCE can alleviate the negative effects of noisy samples and accommodate the change of clusters after each clustering step. MetaCam simulates the cross-camera constraint by splitting the training data into meta-train and meta-test based on camera IDs. With the interacted gradient from meta-train and meta-test, the model is enforced to learn camera-invariant features. Extensive experiments on three re-ID benchmarks show the effectiveness and complementarity of the proposed DSCE and MetaCam. Our method outperforms the state-of-the-art methods on both fully unsupervised re-ID and unsupervised domain adaptive re-ID.
Open Access
Conference paper
N/A
Alexey Ozerov; Antonios Liapis; Artur Garcia-Saez; Birgit Gray; Danae Tsabouraki; Daniele Gravina; Filareti Tsalakanidou; Francois Schnitzler; Fulvio Negro; Georgi Kostadinov; Georgios N. Yannakakis; Ioannis Kompatsiaris; Jesse de Vos; Maritini Kalogerini; Maurizio Montagnuolo; Philo van Kemenade; Rémi Mignot; Symeon Papadopoulos; Vasileios Mezaris
Athens Technology Center; Barcelona Supercomputing Center; CERTH - Center for Research and Technology Hellas; Deutsche Welle; Imagga Technologies Lda.; InterDigital; IRCAM; Modl.ai; Netherlands Institute for Sound & Vision; RAI; University of Malta
Artificial Intelligence brings exciting innovations in all aspects of life and creates new opportunities across industry sectors. At the same time, it raises significant questions in terms of trust, ethics, and accountability. This paper offers an introduction to the AI4Media project, which aims to build on recent advances of AI in order to offer innovative tools to the media sector. AI4Media unifies the fragmented landscape of media-related AI technologies by investigating new learning paradigms and distributed AI, exploring issues of AI explainability, robustness and privacy, examining AI techniques for content analysis, and exploiting AI to address major societal challenges. In this paper, we focus on our vision of how such AI technologies can reshape the media sector, by discussing seven industrial use cases that range from combating disinformation in social media and supporting journalists for news story creation, to high quality video production, game design, and artistic co-creation. For each of these use cases, we highlight the present challenges and needs, and explain how they can be efficiently addressed by using innovative AI-driven solutions.
Open Access
Conference paper
N/A
Bogdan Ionescu; Cristian Stanciu; Dan-Ștefan Pârvu; Denisa Ionaşcu; Mihai Gabriel Constantin
University Politehnica of Bucharest
The modern advances of social media platforms and content sharing websites led to the popularization of Internet memes, and today’s Internet landscape contains websites that are predominantly dedicated to meme sharing. While at their inception memes were mostly humorous, this concept evolved and nowadays memes cover a wide variety of subjects, including political and social commentaries. Considering the widespread use of memes and their power of conveying distilled messages, they became an important method for spreading hate speech against individuals or targeted groups. Given the multimodal nature of Internet memes, our proposed approach is also a multimodal one, consisting of two parallel processing branches, one textual and one visual, that are joined in a final classification step, providing prediction results for the samples. We test our approach on the publicly available Memotion 7k dataset and compare our results with the baseline approach developed for the dataset.
Open Access
Conference paper
N/A
Andreas Goulas; Damianos Galanopoulos; Nikolaos Gkalelis; Vasileios Mezaris
CERTH - Center for Research and Technology Hellas
In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph’s adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets. Source code for our ObjectGraphs method is made publicly available at: https://github.com/bmezaris/ObjectGraphs
Open Access
Conference paper
N/A
Aliaksandr Siarohin; Elisa Ricci; Sergey Tulyakov; Stéphane Lathuilière; Willi Menapace;
Fondazione Bruno Kessler; Institut Polytechnique de Paris; Snap Inc.; University of Trento;
This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game. The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input. We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos. We employ an encoder-decoder architecture where the predicted action labels act as bottleneck. The network is constrained to learn a rich action space using, as main driving loss, a reconstruction loss on the generated video. We demonstrate the effectiveness of the proposed approach on several datasets with wide environment variety. Further details, code and examples are available on our project page: willimenapace.github.io/playable-video-generation-website
Open Access
Conference paper
N/A
Dorothea Thomas-Aniola; Georg Thallinger; Gerhard Backfried; Werner Bailer;
HENSOLDT Analytics; Joanneum Research;
Fake news and misinformation are a widespread phenomenon these days, affecting social media as well as alternative and traditional media. In a climate of increasing polarization and perceived societal injustice, the topic of migration is one domain that is frequently the target of fake news, addressing both migrants and citizens in host countries. The problem is inherently a multi-lingual and multi-modal one in that it involves information in an array of languages, material in textual, visual and auditory form, and often involves communication in a language which may be unfamiliar to recipients or of which they may have only basic knowledge. We argue that semi-automatic approaches, empowering users to gain a clearer picture and base their decisions on sound information, are needed to counter the problem of misinformation. In order to deal with the scale of the problem, such approaches involve a variety of technologies from the field of Artificial Intelligence (AI). In this paper we identify a number of challenges related to implementing approaches for the detection of fake news in the context of migration. These include collecting multi-lingual and multi-modal datasets related to the migration domain, and providing explanations of the AI tools used in verification to both media professionals and consumers. Further efforts in truly collaborative AI will be needed.
Open Access
Conference paper
N/A
Antonio Martella; Fabrizio Falchi; Margherita Gambini; Maurizio Tesconi; Tiziano Fagni;
ISTI-CNR; University of Trento;
The recent advances in language modeling significantly improved the generative capabilities of deep neural models: in 2019 OpenAI released GPT-2, a pre-trained language model that can autonomously generate coherent, non-trivial and human-like text samples. Since then, ever more powerful text generative models have been developed. Adversaries can exploit these tremendous generative capabilities to enhance social bots that will have the ability to write plausible deepfake messages, hoping to contaminate public debate. To prevent this, it is crucial to develop deepfake social media messages detection systems. However, to the best of our knowledge, no one has ever addressed the detection of machine-generated texts on social networks like Twitter or Facebook. With the aim of helping the research in this detection field, we collected the first dataset of real deepfake tweets, TweepFake. It is real in the sense that each deepfake tweet was actually posted on Twitter. We collected tweets from a total of 23 bots, imitating 17 human accounts. The bots are based on various generation techniques, i.e., Markov Chains, RNN, RNN+Markov, LSTM, GPT-2. We also randomly selected tweets from the humans imitated by the bots to have an overall balanced dataset of 25,572 tweets (half human and half bot-generated). The dataset is publicly available on Kaggle. Lastly, we evaluated 13 deepfake text detection methods (based on various state-of-the-art approaches) to both demonstrate the challenges that TweepFake poses and create a solid baseline of detection techniques. We hope that TweepFake can offer the opportunity to tackle deepfake detection on social media messages as well.
Open Access
Journal article
PLOS ONE
Tobias Blanke; Tommaso Venturini;
Center for Internet and Society of Paris; University of Amsterdam;
This article shows how a machine can employ a network view to reason about complex social relations of news reliability. Such a network view promises a topic-agnostic perspective that can be a useful hint on reliability trends and their heterogeneous assumptions. In our analysis, we depart from the ever-growing number of papers trying to find machine learning algorithms to predict the reliability of news and focus instead on using machine reasoning to understand the structure of news networks by comparing it with our human judgements. Understanding and representing news networks is not easy, not only because they can be extremely vast but also because they are shaped by several overlapping network dynamics. We present a machine learning approach to analyse what constitutes reliable news from the view of a network. Our aim is to machine-read a network’s understanding of news reliability. To analyse real-life news sites, we used the Décodex dataset to train machine learning models from the structure of the underlying network. We then employ the models to draw conclusions about how the Décodex evaluators came to assess the reliability of news.
Open Access
Journal article
Journal of Computational Social Science
Claudio Gennaro; Fabrizio Falchi; Federico Cremisi; Gabriele Lagani; Giuseppe Amato; Marco Cicchini Guido; Raffaele Mazziotti; Tommaso Pizzorusso
ISTI-CNR;
Previous work has shown that it is possible to train neuronal cultures on Multi-Electrode Arrays (MEAs) to recognize very simple patterns. However, this work mainly focused on demonstrating that it is possible to induce plasticity in cultures, rather than performing a rigorous assessment of their pattern recognition performance. In this paper, we address this gap by developing a methodology that allows us to assess the performance of neuronal cultures on a learning task. Specifically, we propose a digital model of the real cultured neuronal networks; we identify biologically plausible simulation parameters that allow us to reliably reproduce the behavior of real cultures; we use the simulated culture to perform handwritten digit recognition and rigorously evaluate its performance; we also show that it is possible to find improved simulation parameters for the specific task, which can guide the creation of real cultures.
Open Access
Conference paper
Conference on Neural Engineering
Josif Grabocka; Martin Wistuba;
IBM Research; University of Freiburg;
Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to rethink HPO as a few-shot learning problem in which we train a shared deep surrogate model to quickly adapt (with few response evaluations) to the response function of a new task. We propose the use of a deep kernel network for a Gaussian process surrogate that is meta-learned in an end-to-end fashion in order to jointly approximate the response functions of a collection of training data sets. As a result, the novel few-shot optimization of our deep kernel surrogate leads to new state-of-the-art results at HPO compared to several recent methods on diverse metadata sets.
Open Access
Conference paper
N/A
Anne Lambert; Françoise Le Bolzer; Pascal Le Guyadee; Tsiry Mayet;
InterDigital
We introduce Skip-Window, a method to allow recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. Similarly to existing approaches, Skip-Window extends existing RNN cells by adding a mechanism to encourage the model to process fewer inputs. Unlike existing approaches, Skip-Window is able to respect a strict computational budget, making this model more suitable for limited hardware like edge devices. We evaluate this approach on four datasets: a human activity recognition task, sequential MNIST, IMDB, and the adding task. Our results show that Skip-Window is often able to exceed the accuracy of existing approaches for a lower computational cost while strictly limiting said cost.
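The budgeted-skipping idea can be sketched as follows; the relevance score used to pick which timesteps to process is a stand-in for illustration (the actual Skip-Window mechanism learns this decision):

# Illustrative sketch of a per-window processing budget: within each window
# of the sequence, at most `budget` timesteps are fed to the RNN cell.
import torch
import torch.nn as nn

def skip_window_forward(cell, inputs, window=10, budget=3):
    """inputs: (T, D). Process at most `budget` steps per window of length `window`."""
    h = torch.zeros(1, cell.hidden_size)
    for start in range(0, inputs.shape[0], window):
        chunk = inputs[start:start + window]
        scores = chunk.norm(dim=1)                     # stand-in for a learned relevance score
        keep = scores.topk(min(budget, chunk.shape[0])).indices.sort().values
        for t in keep:                                 # only selected steps update the state
            h = cell(chunk[t].unsqueeze(0), h)
    return h

cell = nn.GRUCell(input_size=16, hidden_size=32)
final_state = skip_window_forward(cell, torch.randn(100, 16))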
Open Access
Conference paper
N/A
Hao Tang; Hong Liu; Nicu Sebe; Wei Xiao;
Peking University; University of Science and Technology of Shenzhen; University of Trento;
We present a new deep dictionary learning and coding network (DDLCN) for image-recognition tasks with limited data. The proposed DDLCN has most of the standard deep learning layers (e.g., input/output, pooling, and fully connected), but the fundamental convolutional layers are replaced by our proposed compound dictionary learning and coding layers. The dictionary learning learns an overcomplete dictionary for input training data. At the deep coding layer, a locality constraint is added to guarantee that the activated dictionary bases are close to each other. Then, the activated dictionary atoms are assembled and passed to the compound dictionary learning and coding layers. In this way, the activated atoms in the first layer can be represented by the deeper atoms in the second dictionary. Intuitively, the second dictionary is designed to learn the fine-grained components shared among the input dictionary atoms; thus, a more informative and discriminative low-level representation of the dictionary atoms can be obtained. We empirically compare DDLCN with several leading dictionary learning methods and deep learning models. Experimental results on five popular data sets show that DDLCN achieves competitive results compared with state-of-the-art methods when the training data are limited. Code is available at https://github.com/Ha0Tang/DDLCN.
Closed Access
Journal article
IEEE Transactions on Neural Networks and Learning Systems
Frédéric Precioso; Lucile Sassatelli; Miguel Rondon; Ramon Aparicio-Pardo;
Université Côte d'Azur;
We consider predicting the user’s head motion in 360° videos, with 2 modalities only: the past user’s positions and the video content (not knowing other users’ traces). We make two main contributions. First, we re-examine existing deep-learning approaches for this problem and identify hidden flaws from a thorough root-cause analysis. Second, from the results of this analysis, we design a new proposal establishing state-of-the-art performance.
First, re-assessing the existing methods that use both modalities, we obtain the surprising result that they all perform worse than baselines using the user’s trajectory only. A root-cause analysis of the metrics, datasets and neural architectures shows in particular that (i) the content can inform the prediction for horizons longer than 2 to 3 sec. (existing methods consider shorter horizons), and that (ii) to compete with the baselines, it is necessary to have a recurrent unit dedicated to process the positions, but this is not sufficient.
Second, from a re-examination of the problem supported with the concept of Structural-RNN, we design a new deep neural architecture, named TRACK. TRACK achieves state-of-the-art performance on all considered datasets and prediction horizons, outperforming competitors by up to 20% on focus-type videos and horizons 2-5 seconds.
The entire framework (codes and datasets) is online and received an ACM reproducibility badge https://gitlab.com/miguelfromeror/head-motion-prediction
Open Access
Journal article
N/A
Adrian Clark; Adrian Popescu; Alba Seco de Herrera; Andrei Tauteanu; Antonio Campello; Asma Ben Abacha; Bogdan Ionescu; Christoph Friedrich; Dimitri Fichou; Dina Demner-Fushman; Hassan Moustahdif; Henning Müller; Janadhip Jacutprakart; Jérôme Deshayes-Chossart; Jon Chamberlain; Liviu-Daniel Stefan; Mihai Dogariu; Mihai Gabriel Constantin; Mourad Sarrouti; Obioma Pelka; Paul Brie; Raul Berari; Renaud Péteri; Sadid A. Hasan; Serge Kozlovski; Thomas Oliver; Vassili Kovalev; Vitali Lianchuk; Yashin Dicente Cid;
Abigail Schulz; Belarusian Academy of Sciences; CVS Health; La Rochelle University; National Library of Medicine; NOAA/US IOOS; teleportHQ; United Institute of Informatics Problems; Université Paris-Saclay; University of Applied Sciences and Arts Dortmund; University of Applied Sciences of Western Switzerland; University of Essex; University of Warwick; University Politehnica of Bucharest; Wellcome Trust;
This paper presents the ideas for the 2021 ImageCLEF lab that will be organized as part of the Conference and Labs of the Evaluation Forum—CLEF Labs 2021 in Bucharest, Romania. ImageCLEF is an ongoing evaluation initiative (active since 2003) that promotes the evaluation of technologies for annotation, indexing and retrieval of visual data with the aim of providing information access to large collections of images in various usage scenarios and domains. In 2021, the 19th edition of ImageCLEF will organize four main tasks: (i) a Medical task addressing visual question answering, a concept annotation and a tuberculosis classification task, (ii) a Coral task addressing the annotation and localisation of substrates in coral reef images, (iii) a DrawnUI task addressing the creation of websites from either a drawing or a screenshot by detecting the different elements present on the design and a new (iv) Aware task addressing the prediction of real-life consequences of online photo sharing. The strong participation in 2020, despite the COVID pandemic, with over 115 research groups registering and 40 submitting over 295 runs for the tasks shows an important interest in this benchmarking campaign. We expect the new tasks to attract at least as many researchers for 2021.
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani;
ISTI-CNR;
Learning to quantify (a.k.a. quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that “Classify and Count” (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy. Following this observation, several methods for learning to quantify have been proposed and have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC and its variants, and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a truly quantification-oriented evaluation protocol. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
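For reference, a minimal sketch of Classify and Count and of its adjusted variant (the correction formula below is the standard tpr/fpr adjustment; the toy data are illustrative):

# Sketch of "Classify and Count" (CC) and Adjusted CC: CC estimates prevalence
# by counting predicted positives; ACC corrects the count with the classifier's
# true/false positive rates estimated on held-out data.
import numpy as np

def classify_and_count(pred_labels):
    return np.mean(pred_labels)                     # estimated positive prevalence

def adjusted_classify_and_count(pred_labels, tpr, fpr):
    cc = classify_and_count(pred_labels)
    return np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)

# toy example: a classifier with tpr=0.8, fpr=0.2 on a set that is 70% positive
rng = np.random.default_rng(0)
truth = rng.random(10_000) < 0.7
preds = np.where(truth, rng.random(10_000) < 0.8, rng.random(10_000) < 0.2)
print(classify_and_count(preds), adjusted_classify_and_count(preds, 0.8, 0.2))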
Open Access
Conference paper
N/A
Antonios Liapis; Konstantinos Sfikas;
University of Malta
Competitive board games have provided a rich and diverse testbed for artificial intelligence. This paper contends that collaborative board games pose a different challenge to artificial intelligence as it must balance short-term risk mitigation with long-term winning strategies. Collaborative board games task all players to coordinate their different powers or pool their resources to overcome an escalating challenge posed by the board and a stochastic ruleset. This paper focuses on the exemplary collaborative board game Pandemic and presents a rolling horizon evolutionary algorithm designed specifically for this game. The complex way in which the Pandemic game state changes in a stochastic but predictable way required a number of specially designed forward models, macro-action representations for decision-making, and repair functions for the genetic operations of the evolutionary algorithm. Variants of the algorithm which explore optimistic versus pessimistic game state evaluations, different mutation rates and event horizons are compared against a baseline hierarchical policy agent. Results show that an evolutionary approach via short-horizon rollouts can better account for the future dangers that the board may introduce, and guard against them. Results highlight the types of challenges that collaborative board games pose to artificial intelligence, especially for handling multi-player collaboration interactions.
Open Access
Journal article
IEEE Transactions on Games
Alejandro Moreo; Andrea Pedrotti; Fabrizio Sebastiani;
ISTI-CNR;
Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems where these correlations cannot be leveraged.
We here describe Generalized Funnelling (gFun), a learning ensemble where the metaclassifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We show that gFun improves on Fun by describing experiments on two large, standard multilingual datasets for multi-label text classification.
Open Access
Conference paper
N/A
Alberto Del Bimbo; Irene Amerini; Leonardo Galteri; Roberto Cardelli;
CNIT; University of Florence; University of Rome La Sapienza;
A new phenomenon named Deepfakes constitutes a serious threat in video manipulation. AI-based technologies have provided easy-to-use methods to create extremely realistic videos. On the side of multimedia forensics, being able to identify this kind of fake content becomes ever more crucial. In this work, a new forensic technique able to detect fake and original video sequences is proposed; it is based on the use of CNNs trained to distinguish possible motion dissimilarities in the temporal structure of a video sequence by exploiting optical flow fields. The results obtained highlight comparable performances with the state-of-the-art methods which, in general, only resort to single video frames. Furthermore, the proposed optical flow based detection scheme also provides a superior robustness in the more realistic cross-forgery operative scenario and can even be combined with frame-based approaches to improve their global effectiveness.
Closed Access
Journal article
N/A
Bogdan Ionescu; Claire-Hélène Demarty; Liviu-Daniel Stefan; Mats Sjöberg; Mihai Gabriel Constantin; Ngoc Q. K. Duong
CSC - IT Center for Science; InterDigital; University Politehnica of Bucharest
In this paper, we report on the creation of a publicly available, common evaluation framework for image and video visual interestingness prediction. We propose a robust data set, the Interestingness10k, with 9831 images and more than 4 h of video, interestingness scores determined based on more than 1M pair-wise annotations of 800 trusted annotators, some pre-computed multi-modal descriptors, and 192 system output results as baselines. The data were validated extensively during the 2016–2017 MediaEval benchmark campaigns. We provide an in-depth analysis of the crucial components of visual interestingness prediction algorithms by reviewing the capabilities and the evolution of the MediaEval benchmark systems, as well as of prominent systems from the literature. We discuss overall trends, influence of the employed features and techniques, generalization capabilities and the reliability of results. We also discuss the possibility of going beyond state-of-the-art performance via an automatic, ad-hoc system fusion, and propose a deep MLP-based architecture that outperforms the current state-of-the-art systems by a large margin. Finally, we provide the most important lessons learned and insights gained.
Open Access
Journal article
International Journal of Computer Vision
Alberto Del Bimbo; Claudio Baecchi; Federico Pernici; Matteo Bruni;
Università degli Studi di Firenze;
Neural networks are widely used as a model for classification in a large variety of tasks. Typically, a learnable transformation (i.e., the classifier) is placed at the end of such models returning a value for each class used for classification. This transformation plays an important role in determining how the generated features change during the learning process. In this work, we argue that this transformation not only can be fixed (i.e., set as nontrainable) with no loss of accuracy and with a reduction in memory usage, but it can also be used to learn stationary and maximally separated embeddings. We show that the stationarity of the embedding and its maximal separated representation can be theoretically justified by setting the weights of the fixed classifier to values taken from the coordinate vertices of the three regular polytopes available in R^d, namely, the d-Simplex, the d-Cube, and the d-Orthoplex. These regular polytopes have the maximal amount of symmetry that can be exploited to generate stationary features angularly centered around their corresponding fixed weights. Our approach improves and broadens the concept of a fixed classifier, recently proposed by Hoffer et al., to a larger class of fixed classifier models. Experimental results confirm the theoretical analysis, the generalization capability, the faster convergence, and the improved performance of the proposed method. Code will be publicly available.
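A minimal sketch of the fixed-classifier idea for the d-Simplex case, using a standard centered-basis construction of the simplex vertices (an assumption for illustration; the paper covers all three polytopes):

# Sketch of a final layer with non-trainable weights set to regular-simplex
# vertices: K centered, normalized basis vectors give K maximally separated
# class directions, and only the backbone is trained.
import torch
import torch.nn as nn

def simplex_weights(num_classes):
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)   # K points forming a regular simplex
    return centered / centered.norm(dim=1, keepdim=True)

class FixedSimplexClassifier(nn.Module):
    """Final layer whose weights are frozen simplex vertices."""
    def __init__(self, num_classes):
        super().__init__()
        self.register_buffer("weight", simplex_weights(num_classes))

    def forward(self, feats):               # feats: embeddings of dimension num_classes
        return feats @ self.weight.T        # logits against the fixed class vertices

backbone = nn.Sequential(nn.Linear(128, 10))   # learnable embedding, dim = num classes
head = FixedSimplexClassifier(10)
logits = head(backbone(torch.randn(4, 128)))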
Open Access
Journal article
IEEE Transactions on Neural Networks and Learning Systems
Henning Müller; Mara Graziani; Thomas Lompech; Vincent Andrearczy;
INP-ENSEEIHT; University of Applied Sciences of Western Switzerland; University of Geneva;
Visualization methods for Convolutional Neural Networks (CNNs) are spreading within the medical community to obtain explainable AI (XAI). The sole qualitative assessment of the explanations is subject to a risk of confirmation bias. This paper proposes a methodology for the quantitative evaluation of common visualization approaches for histopathology images, i.e. Class Activation Mapping and Local-Interpretable Model-Agnostic Explanations. In our evaluation, we propose to assess four main points, namely the alignment with clinical factors, the agreement between XAI methods, and the consistency and repeatability of the explanations. To do so, we compare the intersection over union of multiple visualizations of the CNN attention with the semantic annotation of functionally different nuclei types. The experimental results do not show stronger attributions to the multiple nuclei types than those of a randomly initialized CNN. The visualizations hardly agree on salient areas and LIME outputs have particularly unstable repeatability and consistency. The qualitative evaluation alone is thus not sufficient to establish the appropriateness and reliability of the visualization tools. The code is available on GitHub at bit.ly/2K48HKz.
Open Access
Conference paper
N/A
Adrian Popescu; Céline Hudelot; Eden Belouadah; Umang Aggarwal
CEA; Université Paris-Saclay;
Deep learning approaches are successful in a wide range of AI problems and in particular for visual recognition tasks. However, there are still open problems among which is the capacity to handle streams of visual information and the management of class imbalance in datasets. Existing research approaches these two problems separately while they co-occur in real world applications. Here, we study the problem of learning incrementally from imbalanced datasets. We focus on algorithms which have a constant deep model complexity and use a bounded memory to store exemplars of old classes across incremental states. Since memory is bounded, old classes are learned with fewer images than new classes and an imbalance due to incremental learning is added to the initial dataset imbalance. A score prediction bias in favor of new classes appears and we evaluate a comprehensive set of score calibration methods to reduce it. Evaluation is carried out with three datasets, using two dataset imbalance configurations and three bounded memory sizes. Results show that most calibration methods have beneficial effect and that they are most useful for lower bounded memory sizes, which are most interesting in practice. As a secondary contribution, we remove the usual distillation component from the loss function of incremental learning algorithms. We show that simpler vanilla fine tuning is a stronger backbone for imbalanced incremental learning algorithms.
Open Access
Journal article
N/A
Bogdan Ionescu; Liviu-Daniel Stefan; Mihai Gabriel Constantin
University Politehnica of Bucharest
While ensemble systems and late fusion mechanisms have proven their effectiveness by achieving state-of-the-art results in various computer vision tasks, current approaches are not exploiting the power of deep neural networks as their primary ensembling algorithm, but only as inducers, i.e., systems that are used as inputs for the primary ensembling algorithm. In this paper, we propose several deep neural network architectures as ensembling algorithms with various network configurations that use dense and attention layers, an input pre-processing algorithm, and a new type of deep neural network layer denoted the Cross-Space-Fusion layer, that further improves the overall results. Experimental validation is carried out on several data sets from various domains (emotional content classification, medical data captioning) and under various evaluation conditions (two-class regression, binary classification, and multi-label classification), proving the efficiency of DeepFusion.
Open Access
Conference paper
MultiMedia Modeling
Fabio Carrara; Fabrizio Falchi; Giuseppe Amato; Roberto Cardelli;
CNIT Florence; ISTI-CNR;
Deep learned models are now largely adopted in different fields, and they generally provide superior performances with respect to classical signal-based approaches. Notwithstanding this, their actual reliability when working in an unprotected environment is still far from proven. In this work, we consider a novel deep neural network architecture, named Neural Ordinary Differential Equations (N-ODE), that is getting particular attention due to an attractive property–a test-time tunable trade-off between accuracy and efficiency. This paper analyzes the robustness of N-ODE image classifiers when faced with a strong adversarial attack and how its effectiveness changes when varying such a tunable trade-off. We show that adversarial robustness is increased when the networks operate in different tolerance regimes during test time and training time. On this basis, we propose a novel adversarial detection strategy for N-ODE nets based on the randomization of the adaptive ODE solver tolerance. Our evaluation performed on standard image classification benchmarks shows that our detection technique provides high rejection of adversarial examples while maintaining most of the original samples under white-box attacks and zero-knowledge adversaries.
Open Access
Conference paper
Multimedia FORensics in the WILD
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using six popular neural architectures and six widely used and publicly available datasets for multiclass text classification. One further advantage of this method is that it is conceptually simple and straightforward to implement. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings.
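A rough sketch of the idea, using a simple count-based word-class correlation (the paper defines WCEs more carefully; names and normalization below are assumptions):

# Sketch of word-class embeddings: each word gets a vector of its correlation
# with every class, concatenated to its pretrained embedding.
import numpy as np

def word_class_embeddings(doc_term, doc_labels, n_classes):
    """doc_term: (D, V) term counts; doc_labels: (D,) class ids."""
    wce = np.zeros((doc_term.shape[1], n_classes))
    for c in range(n_classes):
        wce[:, c] = doc_term[doc_labels == c].sum(axis=0)   # word mass per class
    wce /= wce.sum(axis=1, keepdims=True) + 1e-12            # normalize per word
    return wce

rng = np.random.default_rng(0)
doc_term = rng.poisson(0.3, size=(100, 50))                  # 100 docs, 50-word vocab
doc_labels = rng.integers(0, 4, size=100)                    # 4 topics
pretrained = rng.normal(size=(50, 300))                      # e.g. GloVe-like vectors
augmented = np.concatenate(
    [pretrained, word_class_embeddings(doc_term, doc_labels, 4)], axis=1)
print(augmented.shape)   # (50, 304): 300 unsupervised dims + 4 supervised dims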
Open Access
Journal article
Data Mining and Knowledge Discovery
Andrea Esuli; Fabrizio Sebastiani;
ISTI-CNR;
We critically re-examine the Saerens-Latinne-Decaestecker (SLD) algorithm, a well-known method for estimating class prior probabilities (“priors”) and adjusting posterior probabilities (“posteriors”) in scenarios characterized by distribution shift, i.e., difference in the distribution of the priors between the training and the unlabelled documents. Given a machine learned classifier and a set of unlabelled documents for which the classifier has returned posterior probabilities and estimates of the prior probabilities, SLD updates them both in an iterative, mutually recursive way, with the goal of making both more accurate; this is of key importance in downstream tasks such as single-label multiclass classification and cost-sensitive text classification. Since its publication, SLD has become the standard algorithm for improving the quality of the posteriors in the presence of distribution shift, and SLD is still considered a top contender when we need to estimate the priors (a task that has become known as “quantification”). However, its real effectiveness in improving the quality of the posteriors has been questioned. We here present the results of systematic experiments conducted on a large, publicly available dataset, across multiple amounts of distribution shift and multiple learners. Our experiments show that SLD improves the quality of the posterior probabilities and of the estimates of the prior probabilities, but only when the number of classes in the classification scheme is very small and the classifier is calibrated. As the number of classes grows, or as we use non-calibrated classifiers, SLD converges more slowly (and often does not converge at all), performance degrades rapidly, and the impact of SLD on the quality of the prior estimates and of the posteriors becomes negative rather than positive.
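For reference, a compact sketch of the SLD iteration as described above (variable names are mine, not from the original paper): posteriors are rescaled by the ratio of current to training priors, renormalized, and averaged to update the prior estimate, until convergence.

# Sketch of the SLD (Saerens-Latinne-Decaestecker) iterative update of priors
# and posteriors under prior-probability shift.
import numpy as np

def sld(posteriors, train_priors, n_iter=100, tol=1e-6):
    """posteriors: (N, C) classifier outputs; train_priors: (C,)."""
    priors = train_priors.copy()
    for _ in range(n_iter):
        # rescale posteriors by the ratio of current to training priors
        scaled = posteriors * (priors / train_priors)
        scaled /= scaled.sum(axis=1, keepdims=True)
        new_priors = scaled.mean(axis=0)        # updated prevalence estimate
        if np.abs(new_priors - priors).max() < tol:
            priors = new_priors
            break
        priors = new_priors
    return priors, scaled

post = np.random.default_rng(0).dirichlet(np.ones(3), size=500)
priors_hat, post_adj = sld(post, train_priors=np.array([1 / 3, 1 / 3, 1 / 3]))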
Open Access
Journal article
ACM Transactions on Information Systems
Adrian Popescu; Eden Belouadah; Ioannis Kanellos
IMT Atlantic; Université Paris-Saclay;
The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural networks to underfit past data when new ones are ingested. A first group of approaches tackles forgetting by increasing deep model capacity to accommodate new knowledge. A second type of approach fixes the deep model size and introduces a mechanism whose objective is to ensure a good compromise between stability and plasticity of the model. While the first type of algorithm has been compared thoroughly, this is not the case for methods which exploit a fixed size model. Here, we focus on the latter, place them in a common conceptual and experimental framework and propose the following contributions: (1) define six desirable properties of incremental learning algorithms and analyze them according to these properties, (2) introduce a unified formalization of the class-incremental learning problem, (3) propose a common evaluation framework which is more thorough than existing ones in terms of number of datasets, size of datasets, size of bounded memory and number of incremental states, (4) investigate the usefulness of herding for past exemplars selection, (5) provide experimental evidence that it is possible to obtain competitive performance without the use of knowledge distillation to tackle catastrophic forgetting and (6) facilitate reproducibility by integrating all tested methods in a common open-source repository. The main experimental finding is that none of the existing algorithms achieves the best results in all evaluated settings. Important differences arise notably if a bounded memory of past classes is allowed or not.
Open Access
Journal article
Neural Networks
Ahmed Khalifa; Georgios N. Yannakakis; Jialin Liu; Julian Togelius; Sam Snodgrass; Sebastian Risi;
IT University of Copenhagen; Modl.ai; New York University; University of Malta; University of Science and Technology of Shenzhen;
Procedural content generation in video games has a long history. Existing procedural content generation methods, such as search-based, solver-based, rule-based and grammar-based methods have been applied to various content types such as levels, maps, character models, and textures. A research field centered on content generation in games has existed for more than a decade. More recently, deep learning has powered a remarkable range of inventions in content production, which are applicable to games. While some cutting-edge deep learning methods are applied on their own, others are applied in combination with more traditional methods, or in an interactive setting. This article surveys the various deep learning methods that have been applied to generate game content directly or indirectly, discusses deep learning methods that could be used for content generation purposes but are rarely used today, and envisages some limitations and potential future directions of deep learning for procedural content generation.
Open Access
Journal article
N/A
Andrea Esuli; Claudio Gennaro; Fabrizio Falchi; Giuseppe Amato; Nicola Messina; Stéphane Marchand-Maillet
ISTI-CNR; University of Geneva;
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.
Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links rule out separately extracting the visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively, on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
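The late-fusion design described above can be illustrated with a minimal scoring function in which region and word features, extracted by fully separate pipelines, only meet at alignment time. This is an indicative sketch of the idea; the pooling scheme and names are ours, not TERAN's exact formulation.

import torch
import torch.nn.functional as F

def late_alignment_score(region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Score an image-sentence pair from independently extracted features.
    region_feats: (R, D) image-region embeddings; word_feats: (W, D) word embeddings.
    Features can be computed (and indexed) offline because no cross-attention
    couples the two pipelines; they only meet here, at scoring time."""
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = w @ r.t()                    # (W, R) word-region cosine similarities
    # each word attends to its best-matching region, then word scores are averaged
    return sim.max(dim=1).values.mean()

# toy usage with random features (36 regions, 12 words, 256-d)
score = late_alignment_score(torch.randn(36, 256), torch.randn(12, 256))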
Open Access
Journal article
N/A
Alba Seco de Herrera; Ana Matran-Fernandez; Bogdan Ionescu; Camilo Fosco; Claire-Hélène Demarty; Graham Healy; Alan Smeaton; Lorin Sweeney; Mihai Gabriel Constantin; Rukiye Savran Kiziltepe; Sebastian Halder
Dublin City University; Massachusetts Institute of Technology Cambridge; University of Essex; University Politehnica of Bucharest
This paper describes the MediaEval 2021 Predicting Media Memorability task, which is in its 4th edition this year, as the prediction of short-term and long-term video memorability remains a challenging task. In 2021, two video datasets are used: first, a subset of the TRECVid 2019 Video-to-Text dataset; second, the Memento10K dataset, in order to provide opportunities to explore cross-dataset generalisation. In addition, an Electroencephalography (EEG)-based prediction pilot subtask is introduced. In this paper, we outline the main aspects of the task and describe the datasets, evaluation metrics, and requirements for participants' submissions.
Open Access
Conference paper
MediaEval
Fengxiang Yang; Nicu Sebe; Yaojing Lin; Yuyang Zhao; Zhiming Luo; Zhun Zhong
Minnan Normal University; University of Trento; Xiamen University
Recent advances in person re-identification (ReID) obtain impressive accuracy in the supervised and unsupervised learning settings. However, most of the existing methods need to train a new model for a new domain by accessing its data. Due to data privacy concerns, the new domain data are not always accessible, limiting the applicability of these methods. In this paper, we study the problem of multi-source domain generalization in ReID, which aims to learn a model that can perform well on unseen domains with only several labeled source domains. To address this problem, we propose the Memory-based Multi-Source Meta-Learning (M3L) framework to train a generalizable model for unseen domains. Specifically, a meta-learning strategy is introduced to simulate the train-test process of domain generalization for learning more generalizable models. To overcome the unstable meta-optimization caused by the parametric classifier, we propose a memory-based identification loss that is non-parametric and harmonizes with meta-learning. We also present a meta batch normalization layer (MetaBN) to diversify meta-test features, further establishing the advantage of meta-learning. Experiments demonstrate that our M3L can effectively enhance the generalization ability of the model for unseen domains and can outperform the state-of-the-art methods on four large-scale ReID datasets.
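The train-test simulation at the heart of meta-learning for domain generalization can be sketched as follows: one source domain is held out as meta-test, the model is adapted on the remaining meta-train data, and the outer update favours parameters whose adaptation also transfers to the held-out domain. The snippet below is a generic first-order sketch with a linear classifier; it does not reproduce the M3L memory-based loss or the MetaBN layer.

import torch
import torch.nn.functional as F

def meta_episode(weights, domains, inner_lr=0.1, outer_lr=0.01):
    """One episode of multi-source meta-learning for a linear classifier.
    domains is a list of (features, labels) pairs; the first plays the role
    of meta-train and the second of meta-test in this simplified sketch."""
    (x_tr, y_tr), (x_te, y_te) = domains[0], domains[1]

    train_loss = F.cross_entropy(x_tr @ weights, y_tr)
    grad, = torch.autograd.grad(train_loss, weights, create_graph=True)
    adapted = weights - inner_lr * grad                # simulated meta-train update

    test_loss = F.cross_entropy(x_te @ adapted, y_te)  # generalization to held-out domain
    total = train_loss + test_loss
    total.backward()
    with torch.no_grad():
        weights -= outer_lr * weights.grad
        weights.grad = None
    return total.item()

# toy usage: two source domains, 64-d features, 10 identities
w = torch.zeros(64, 10, requires_grad=True)
dom_a = (torch.randn(32, 64), torch.randint(0, 10, (32,)))
dom_b = (torch.randn(32, 64), torch.randint(0, 10, (32,)))
meta_episode(w, [dom_a, dom_b])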
Open Access
Conference paper
N/A
Carlos Santiago; Claudio Gennaro; Giuseppe Amato; João Paulo Costeira; Luca Ciampi;
Instituto Superior Técnico; ISTI-CNR;
Convolutional Neural Networks have produced state-of-the-art results for a multitude of computer vision tasks under supervised learning. However, the crux of these methods is the need for a massive amount of labeled data to guarantee that they generalize well to diverse testing scenarios. In many real-world applications, there is indeed a large domain shift between the distributions of the train (source) and test (target) domains, leading to a significant drop in performance at inference time. Unsupervised Domain Adaptation (UDA) is a class of techniques that aims to mitigate this drawback without the need for labeled data in the target domain. This makes it particularly useful for tasks in which acquiring new labeled data is very expensive, such as semantic and instance segmentation. In this work, we propose an end-to-end CNN-based UDA algorithm for traffic density estimation and counting, based on adversarial learning in the output space. Density estimation is one of the tasks requiring per-pixel annotated labels and, therefore, a lot of human effort. We conduct experiments considering different types of domain shifts, and we make publicly available two new datasets for the vehicle counting task that were also used for our tests. One of them, the Grand Traffic Auto dataset, is a synthetic collection of images, obtained using the graphical engine of the Grand Theft Auto video game, automatically annotated with precise per-pixel labels. Experiments show a significant improvement using our UDA algorithm compared to the model's performance without domain adaptation.
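Adversarial learning in the output space, as used above, trains a small discriminator to distinguish source from target density maps while the counting network is updated to make its target predictions indistinguishable from source ones. A minimal PyTorch sketch follows; the discriminator architecture and loss weighting are illustrative assumptions, not the paper's exact setup.

import torch
import torch.nn as nn

# small discriminator operating directly on predicted density maps
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(density_src, density_tgt):
    """density_* are (B, 1, H, W) density maps predicted by the counting network."""
    src_logit = discriminator(density_src.detach())
    tgt_logit = discriminator(density_tgt.detach())
    # discriminator learns to separate source (1) from target (0) predictions
    d_loss = bce(src_logit, torch.ones_like(src_logit)) + \
             bce(tgt_logit, torch.zeros_like(tgt_logit))
    # the counter is updated so that target predictions look "source-like"
    g_loss = bce(discriminator(density_tgt), torch.ones_like(tgt_logit))
    return d_loss, g_loss

d_loss, g_loss = adversarial_losses(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))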
Open Access
Conference paper
N/A
Alejandro Moreo; Fabrizio Sebastiani; Manuel Francisco
ISTI-CNR; University of Granada
Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
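The naive baseline mentioned above, i.e., recasting multi-label quantification as independent binary problems, can be illustrated with a per-label classify-and-count estimator. The following scikit-learn sketch implements that baseline only; it is not one of the proposed multi-label methods.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def naive_multilabel_cc(X_train, Y_train, X_unlabelled):
    """Naive baseline: treat each label as an independent binary quantification
    problem and estimate its prevalence by classify-and-count.
    Y_train is a binary indicator matrix of shape (n_samples, n_labels)."""
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, Y_train)
    predictions = clf.predict(X_unlabelled)          # (n_unlabelled, n_labels)
    return predictions.mean(axis=0)                  # estimated prevalence per label

# toy usage: 3 correlated labels over 200 labelled and 500 unlabelled points
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 5))
Y = (X[:, :3] + rng.normal(scale=0.5, size=(700, 3)) > 0).astype(int)
print(naive_multilabel_cc(X[:200], Y[:200], X[200:]))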
Open Access
Journal article
ACM Transactions on Knowledge Discovery from Data
Chen Feng; Ioannis Patras; Zheng Gao;
Queen Mary University of London;
Open Access
Conference paper
N/A
Claudio Gennaro; Fabrizio Falchi; Luca Ciampi; Marco Avvenuti; Marco Bongiovanni; Nicola Messina;
ISTI-CNR; University of Pisa
Automatic people counting from images has recently drawn attention for urban monitoring in modern Smart Cities due to the ubiquity of surveillance camera networks. Current computer vision techniques rely on deep learning-based algorithms that estimate pedestrian densities in still, individual images. Only a handful of works take advantage of temporal consistency in video sequences. In this work, we propose a spatio-temporal attentive neural network to estimate the number of pedestrians from surveillance videos. By taking advantage of the temporal correlation between consecutive frames, we lower the state-of-the-art count error by 5% and localization error by 7.5% on the widely-used FDST benchmark.
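The general idea of exploiting temporal consistency can be sketched as an attention-weighted fusion of per-frame feature maps before density regression. The module below is an illustrative simplification, not the paper's architecture.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Minimal sketch of temporal attention for video-based counting:
    per-frame feature maps from consecutive frames are combined with
    learned attention weights before density-map regression."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        self.head = nn.Conv2d(channels, 1, kernel_size=1)  # density regression

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) features of T consecutive frames
        b, t, c, h, w = frames.shape
        scores = self.score(frames.flatten(0, 1)).view(b, t, 1, h, w)
        weights = scores.softmax(dim=1)              # attention over the time axis
        fused = (weights * frames).sum(dim=1)        # (B, C, H, W)
        return self.head(fused)                      # predicted density map

density = TemporalAttention(64)(torch.randn(2, 5, 64, 32, 32))
print(density.sum(dim=(1, 2, 3)))  # estimated counts per clip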
Open Access
Conference paper
IEEE Symposium on Computers and Communications