Check out our Software
InDistill enhances the effectiveness of the Knowledge Distillation procedure by leveraging the properties of channel pruning to both reduce the capacity gap between the models and retain the information geometry. Also, this method introduces a curriculum-learning-based scheme for enhancing the effectiveness of transferring knowledge from multiple intermediate layers.
Keywords
pygrank is an open source framework to define, run and evaluate node ranking algorithms. It provides object-oriented and extensively unit-tested algorithmic components, such as graph filters, post-processors, measures, benchmarks, and online tuning. Computations can be delegated to numpy, tensorflow, or pytorch backends and fit in back-propagation pipelines. Classes can be combined to define interoperable complex algorithms.
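For orientation, a minimal usage sketch (assuming a recent pygrank release and networkx for the input graph; the seed dictionary and node names are illustrative):

```python
# Hedged sketch: rank nodes of a toy graph with a normalized PageRank filter.
import networkx as nx
import pygrank as pg

graph = nx.les_miserables_graph()        # any networkx graph works as input
seeds = {"Valjean": 1, "Marius": 1}      # personalization signal

# Combine a graph filter with a post-processor; classes are interoperable.
algorithm = pg.Normalize(pg.PageRank(alpha=0.85))
scores = algorithm(graph, seeds)

top = sorted(graph.nodes(), key=lambda v: scores[v], reverse=True)[:5]
print(top)  # nodes ranked by proximity to the seed set
```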
Keywords
We propose a new hyperbolic-based model for metric learning. At the core of our method is a vision transformer with output embeddings mapped to hyperbolic space. These embeddings are directly optimized using modified pairwise cross-entropy loss.
Keywords
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In NCDSS, we need to distinguish the objects and background, and to handle the existence of multiple classes within an image, which increases the difficulty in using the unlabeled data. To tackle this new setting, we leverage the labeled base data and a saliency model to coarsely cluster novel classes for model training in our basic framework.
Keywords
We propose a simple and novel Unsupervised Domain Adaptation (UDA) approach for video action recognition. Our approach leverages recent advances on spatio-temporal transformers to build a robust source model that better generalises to the target domain. Furthermore, our architecture learns domain invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle.
Keywords
We propose an augmentation-free unsupervised approach for point clouds to learn transferable point-level features via soft clustering, named SoftClu. SoftClu assumes that the points belonging to a cluster should be close to each other in both geometric and feature spaces. We exploit the affiliation of points to their clusters as a proxy to enable self-training through a pseudo-label prediction task. Under the constraint that these pseudo-labels induce the equipartition of the point cloud, we cast SoftClu as an optimal transport problem.
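For illustration, a minimal Sinkhorn-style sketch of the equipartition constraint (not the authors' exact implementation; `logits` is a hypothetical point-to-cluster score matrix):

```python
import numpy as np

def equipartition_pseudo_labels(logits, eps=0.05, n_iter=50):
    """Alternate row/column normalization so every cluster receives equal
    total mass (the equipartition constraint) while rows remain soft
    pseudo-label distributions -- a standard optimal-transport relaxation."""
    Q = np.exp((logits - logits.max()) / eps)  # stable, temperature-scaled
    for _ in range(n_iter):
        Q /= Q.sum(axis=0, keepdims=True)      # equal mass per cluster
        Q /= Q.sum(axis=1, keepdims=True)      # each point sums to one
    return Q

pseudo = equipartition_pseudo_labels(np.random.randn(128, 16))
print(pseudo.sum(axis=0))  # each cluster carries roughly equal total mass
```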
Keywords
"We propose the Style-HAllucinated Dual consistEncy learning (SHADE) framework constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages real-world knowledge to prevent the model from overfitting to synthetic data and thus largely keeps the representation consistent between the synthetic and real-world models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning."
Keywords
We address the new task of class-incremental Novel Class Discovery (class-iNCD), which refers to the problem of discovering novel categories in an unlabelled data set by leveraging a pre-trained model that has been trained on a labelled data set containing disjoint yet related categories. Apart from discovering novel classes, we also aim at preserving the ability of the model to recognize previously seen base categories.
Keywords
This is a PyTorch implementation of Hebbian learning algorithms to train deep convolutional neural networks.
Keywords
Neural network models, trained with the updated version (v2) of the PNN and PV datasets, that are able to count perineuronal nets.
Keywords
FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model.
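The geometric translation at the core of the method reduces to a one-line operation; a hedged sketch with illustrative shapes and variable names:

```python
import numpy as np

def pseudo_features(new_feats, new_centroid, past_centroid):
    """Translate features of a new class so their distribution is re-centred
    on a past-class centroid; only past-class centroids need to be stored."""
    return new_feats - new_centroid + past_centroid

# Example: synthesize pseudo-features for an old class from current data.
feats = np.random.randn(64, 512)                 # fixed-extractor features
pseudo = pseudo_features(feats, feats.mean(axis=0),
                         past_centroid=np.ones(512))
print(np.allclose(pseudo.mean(axis=0), np.ones(512)))  # True
```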
Keywords
We propose the manifold mixing model soup (ManifoldMixMS) algorithm. Instead of simple averaging, it uses a more sophisticated strategy to generate the fused model. Specifically, it partitions a neural network model into several latent space manifolds (which can be individual layers or a collection of layers). Afterwards, from the pool of finetuned models available after hyperparameter tuning, the most promising ones are selected and their latent space manifolds are mixed together individually. The optimal mixing coefficient for each latent space manifold is calculated automatically via invoking an optimization algorithm. The fused model we retrieve with this procedure can be thought of as sort of a "Frankenstein" model, as it integrates (parts of) individual model components from multiple finetuned models into one model.
Keywords
We propose a method to model latent structures with a learned dynamic potential landscape, thereby performing latent traversals as the flow of samples down the landscape's gradient. Inspired by physics, optimal transport, and neuroscience, these potential landscapes are learned as physically realistic partial differential equations, thereby allowing them to flexibly vary over both space and time. To achieve disentanglement, multiple potentials are learned simultaneously, and are constrained by a classifier to be distinct and semantically self-consistent.
Keywords
We propose an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control.
Keywords
We tackle the neural face reenactment task by leveraging the photorealistic image generation and the disentangled properties of a pretrained StyleGAN2, along with a hypernetwork. We present a novel method that performs both faithful identity reconstruction and effective facial image editing by learning to update the weights of a StyleGAN2 generator using a hypernetwork approach. Specifically, our model effectively combines the appearance features of a source image and the facial pose features of a target image to create new facial images that preserve the source identity and convey the target facial pose.
Keywords
We propose a task-agnostic anonymization procedure that directly optimises the images' latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure both that the identity is a desired distance away from the original (via an identity obfuscation loss) and that the facial attributes are preserved (via a novel feature-matching loss in FaRL's deep feature space). The method is capable of anonymizing the identity of the images whilst better preserving the facial attributes.
Keywords
We propose a framework that, using unpaired randomly generated facial images, learns to disentangle the identity characteristics of the face from its pose by incorporating the recently introduced style space S of StyleGAN2, a latent representation space that exhibits remarkable disentanglement properties. By capitalizing on this, we learn to successfully mix a pair of source and target style codes using supervision from a 3D model. The resulting latent code, that is subsequently used for reenactment, consists of latent units corresponding to the facial pose of the target only and of units corresponding to the identity of the source only, leading to notable improvement in the reenactment performance.
Keywords
Distracted driver classification (DDC) plays an important role in ensuring driving safety. This software is related to the article "100-Driver: A Large-scale, Diverse Dataset for Distracted Driver Classification", published in IEEE Trans. on Intelligent Transportation Systems. The code allows investigating practical problems of DDC, including the traditional setting without domain shift and 3 challenging settings (i.e., cross-modality, cross-view, and cross-vehicle) with domain shifts.
Keywords
The yamlres library for retrieving algorithm component combinations from online or local yaml resources to enable distributed development of no-learning schemas.
Keywords
Functional implementation of graph learning algorithms that support fast experimentation with autotuning algorithms and sharing yamlres definitions of autotuned algorithms.
Keywords
We propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing variability corresponding to a specific part of speech, while jointly minimising variability to the rest. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie.
Keywords
Code that implements filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1-norm pruning as a baseline to compare with the proposed approach. The experimental evaluation indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with a small accuracy gain for low pruning rates.
Keywords
Most Quality-Diversity (QD) methods only tackle static tasks that are fixed over time, which is rarely the case in the real world. Unlike noisy environments, where the fitness of an individual changes slightly at every evaluation, dynamic environments simulate tasks where external factors at unknown and irregular intervals alter the performance of the individual with a severity that is unknown a priori. We introduce a novel and generalisable Dynamic QD methodology that aims to keep the archive of past solutions updated in the case of environment changes, and we present a novel characterisation of dynamic environments that can be easily applied to well-known benchmarks, with minor interventions to move them from a static task to a dynamic one. The Dynamic QD intervention is applied on MAP-Elites and CMA-ME.
Keywords
This is an experimentation framework assessing how well graph neural networks (GNN) can minimize various attributed graph functions on the node domain.
- Based on torch geometric.
- Modular architecture definition.
- Implementation of several diffusion-based architectures (GCN, GCNII, APPNP, S2GC, DeepSet on graphs).
- Several benchmarking tasks for the ability to approximate equivariant attributed graph functions.
- Uniform interface that treats multiple graphs as one disconnected graph.
Keywords
A holistic learning framework for Novel Class Discovery (NCD), which adopts contrastive learning to learn discriminative features with both the labeled and unlabeled data. The Neighborhood Contrastive Learning (NCL) framework effectively leverages the local neighborhood in the embedding space, enabling us to take the knowledge from more positive samples and thus improve the clustering accuracy. In addition, we also introduce the Hard Negative Generation (HNG), which leverages the labeled samples to produce informative hard negative samples and brings further advantage to NCL.
Keywords
PyTorch implementation of a Geometry-Contrastive Transformer for Generalized 3D Pose Transfer. The novel GC-Transformer can freely conduct robust pose transfer on LARGE meshes at no cost, which could be a boost to Transformers in 3D fields.
Keywords
A novel unsupervised domain adaptation approach for action recognition from videos, inspired by recent literature on contrastive learning. It comprises a novel two-headed deep architecture that simultaneously adopts cross-entropy and contrastive losses from different network branches to robustly learn a target classifier.
Keywords
PyTorch implementation of AniFormer, a novel Transformer-based architecture, that generates animated 3D sequences by directly taking the raw driving sequences and arbitrary same-type target meshes as inputs. The Transformer architecture is customised for 3D animation that generates mesh sequences by integrating styles from target meshes and motions from the driving meshes. Besides, instead of the conventional single regression head in the vanilla Transformer, AniFormer generates multiple frames as outputs to preserve the sequential consistency of the generated meshes. This is achieved by a pair of regression constraints, i.e., motion and appearance constraints, that can provide strong regularization on the generated mesh sequences.
Keywords
PyTorch implementation of Intrinsic-Extrinsic Preserved Generative Adversarial Network (IEP-GAN) for both intrinsic (i.e., shape) and extrinsic (i.e., pose) information preservation. Extrinsically, a co-occurrence discriminator is used to capture the structural/pose invariance from distinct Laplacians of the mesh. Intrinsically, a local intrinsic-preserved loss is introduced to preserve the geodesic priors while avoiding heavy computations. IEP-GAN can be used to manipulate 3D human meshes in various ways, including pose transfer, identity swapping and pose interpolation with latent code vector arithmetic. The extensive experiments on various 3D datasets of humans, animals and hands demonstrate the generality of this approach.
Keywords
Code for Word-Class Embeddings (WCEs), a form of supervised embeddings especially suited for multiclass text classification. WCEs are meant to be used as extensions (i.e., by concatenation) to pre-trained embeddings (e.g., GloVe or word2vec) in order to improve the performance of neural classifiers.
Keywords
Graph Neural Networks (GNNs) have seen a dramatic increase in popularity thanks to their ability to understand relations between graph nodes. This library aims to provide GNN capabilities to native Java applications, for example, to perform machine learning on Android. It does so by avoiding C-based machine learning libraries, such as TensorFlow Lite, that are often designed with pure performance in mind but which often require specific hardware to run, such as GPUs, and drastically increase the size of deployed applications.
Keywords
A package for implementing and simulating decentralized Graph Neural Network algorithms for the classification of peer-to-peer nodes.
Keywords
A framework for easy experimentation with Graph Neural Network (GNN) architectures by separating them from predictive components.
Keywords
Python implementation of mini-batch trimming, a novel strategy for improving the generalization capability of a trained network model. It is easy to implement and add to a training pipeline and independent of the employed model and optimizer.
Keywords
A wrapper for several SoA adaptive-gradient optimizers (Adam/AdamW/EAdam/AdaBelief/AdaMomentum/AdaFamily), including our novel 'AdaFamily' optimizer, via one API.
Keywords
The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural networks to underfit past data when new data are ingested. The repository includes implementations of several incremental learning techniques including among others LUCIR, iCaRL, BiC, LwF, REMIND, Deep-SLDA, ScaIL, IL2M, DeeSIL, FT, and SIW.
Keywords
CNN-based algorithm for traffic density estimation and counting that can generalize to new data sources for which there are no annotations available. This generalization is achieved by exploiting an Unsupervised Domain Adaptation strategy, whereby a discriminator attached to the output forces similar density distribution in the target and source domains.
Keywords
QuaPy is an open-source framework for quantification (a.k.a. supervised prevalence estimation, or learning to quantify) written in Python. QuaPy provides implementations of the most important aspects of the quantification workflow, such as (baseline and advanced) quantification methods, quantification-oriented model selection mechanisms, evaluation measures, and evaluation protocols used for evaluating quantification methods. QuaPy also makes available commonly used datasets, and offers visualization tools for facilitating the analysis and interpretation of the experimental results. QuaPy is accompanied by rich API documentation and a wiki guide. The software is open-source, and distributed under the BSD-3 license; it is available on GitHub and can be installed via pip.
Keywords
ql4facct is a software for replicating experiments concerning the evaluation of estimators of classifier "fairness". This repository makes available baseline systems used in literature, along with our proposed framework based on quantification. The experiments implemented in this software show, through four different experimental protocols and with the aid of visualization tools, that estimating classifier fairness via quantification yields a clear advantage with respect to the previous state-of-the-art.
Keywords
Novel fixed classifier for incremental learning in which a number of pre-allocated output nodes are subject to the classification loss right from the beginning of the learning phase. Contrary to the standard expanding classifier, this allows: (a) the output nodes of future unseen classes to see negative samples from the very beginning of learning, together with the positive samples that arrive incrementally; and (b) the model to learn features that do not change their geometric configuration as novel classes are incorporated.
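A minimal PyTorch sketch of the pre-allocation idea (dimensions and class counts are illustrative, not taken from the release):

```python
import torch
import torch.nn as nn

# Output nodes for all classes (seen and future) exist from step 0 and take
# part in the softmax loss immediately.
backbone_dim, total_classes = 512, 100     # assumed upper bound on classes
classifier = nn.Linear(backbone_dim, total_classes)

features = torch.randn(32, backbone_dim)   # batch of backbone features
labels = torch.randint(0, 10, (32,))       # only classes 0-9 seen so far
loss = nn.CrossEntropyLoss()(classifier(features), labels)
# Nodes 10-99 already receive negative evidence through the softmax, so the
# feature geometry need not be reorganized when their classes arrive.
```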
Keywords
Discovery of non-linear interpretable paths in GAN latent space in an unsupervised and model-agnostic manner. Non-linear paths are modeled using RBF-based warping functions, optimized in order to be distinguishable from each other. This leads to paths that correspond to an interpretable generation where only a small number of generative factors are affected for each path. A quantitative evaluation protocol for the case of face-generating GANs is also implemented, which can be used to automatically associate the discovered paths with interpretable attributes such as smiling and rotation.
Keywords
A method that offers an intuitive way to find different types of interpretable transformations in a pre-trained GAN. We achieve this by decomposing the generator’s activations in a multilinear manner and regressing back to the latent space.
Keywords
PyTorch implementation of Multi-target Graph Domain Adaptation framework. The framework is pivoted around two key concepts: graph feature aggregation and curriculum learning.
Keywords
PyTorch implementation of the Memory-based Multi-Source MetaLearning (M^3L) framework for multi-source domain generalization (DG) in person ReID. The proposed meta-learning strategy enables the model to simulate the train-test process of DG during training, which can efficiently improve the generalization ability of the model on unseen domains. A memory-based module and MetaBN are also introduced to take full advantage of meta-learning and obtain further improvement.
Keywords
Python code for Generalised Funnelling. Funnelling is a new ensemble method for heterogeneous transfer learning that can be applied to cross-lingual text classification. Funnelling consists of generating a two-tier classification system where all documents, irrespective of language, are classified by the same (second-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by first-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language.
Keywords
A library of self-supervised methods for unsupervised visual representation learning powered by PyTorch Lightning. It aims at providing SotA self-supervised methods in a comparable environment while, at the same time, implementing training tricks. While the library is self-contained, it is possible to use the models outside of solo-learn.
Keywords
This repository hosts the code and data lists for our two learning-based eXplainable AI (XAI) methods called L-CAM-Fm and L-CAM-Img, for deep convolutional neural networks (DCNN) image classifiers. Our methods receive as input an image and a class label and produce as output the image regions that the DCNN has focused on in order to infer this class. Both methods use an attention mechanism (AM), trained end-to-end along with the original (frozen) DCNN, to derive class activation maps (CAMs) from the last convolutional layer’s feature maps (FMs).
Keywords
We introduce a novel universal attack algorithm called "MetaAttack" for person re-ID. MetaAttack can mislead re-ID models on unseen domains by a universal adversarial perturbation. Specifically, to capture common patterns across different domains, we propose a meta-learning scheme to seek the universal perturbation via the gradient interaction between meta-train and meta-test formed by two datasets. We also take advantage of a virtual dataset (PersonX), instead of real ones, to conduct meta-test. This scheme not only enables us to learn with more comprehensive variation factors but also mitigates the negative effects caused by biased factors of real datasets.
Keywords
PyTorch code for our submission: "Logit Margin Matters: Improving Transferable Targeted Adversarial Attack by Logit Calibration". The code is implemented based on the Code of the paper "On Success and Simplicity: A Second Look at Transferable Targeted Attacks" (Zhengyu Zhao, Zhuoran Liu, Martha Larson, NeurIPS 2021).
Keywords
Addressing the problem of removing any client's contribution in federated learning (FL). During FL rounds, each client performs local training to learn a model that minimizes the empirical loss on their private data. We propose to perform unlearning at the client (to be erased) by reversing the learning process, i.e., training a model to maximize the local empirical loss. In particular, we formulate the unlearning problem as a constrained maximization problem by restricting to an l2-norm ball around a suitably chosen reference model to help retain some knowledge learnt from the other clients' data. This allows the client to use projected gradient descent to perform unlearning. The method requires neither global access to the data used for training nor the history of parameter updates to be stored by the aggregator (server) or any of the clients.
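A minimal PyTorch sketch of one such projected-ascent step (illustrative only; `loss_fn`, `batch`, the learning rate and the radius are assumptions, not the released code):

```python
import torch

@torch.no_grad()
def project_to_ball(model, reference, radius):
    """Project the parameter difference onto an l2 ball centred at the
    reference model, retaining knowledge learnt from other clients."""
    diffs = [p - r for p, r in zip(model.parameters(), reference.parameters())]
    norm = torch.sqrt(sum((d ** 2).sum() for d in diffs))
    if norm > radius:
        scale = radius / norm
        for p, r, d in zip(model.parameters(), reference.parameters(), diffs):
            p.copy_(r + scale * d)

def unlearn_step(model, reference, loss_fn, batch, lr=1e-2, radius=1.0):
    """One step of projected gradient *ascent* on the client's local loss."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p in model.parameters():
            p += lr * p.grad          # ascent: maximize the local loss
    project_to_ball(model, reference, radius)
```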
Keywords
This repository contains the code for the paper "Matching Pairs: Attributing Fine-Tuned Models to their Pre-Trained Large Language Models". By casting the LLM attribution as a classification problem, we develop machine learning solutions that link a fine-tuned LLM to its pre-trained base model.
Keywords
Graph Neural Networks (GNNs) have become a popular tool for learning on graphs, but their widespread use raises privacy concerns as graph data can contain personal or sensitive information. Differentially private GNN models have been recently proposed to preserve privacy while still allowing for effective learning over graph-structured datasets. However, achieving an ideal balance between accuracy and privacy in GNNs remains challenging due to the intrinsic structural connectivity of graphs. In this paper, we propose a new differentially private GNN called ProGAP that uses a progressive training scheme to improve such accuracy-privacy trade-offs. Combined with the aggregation perturbation technique to ensure differential privacy, ProGAP splits a GNN into a sequence of overlapping submodels that are trained progressively, expanding from the first submodel to the complete model. Specifically, each submodel is trained over the privately aggregated node embeddings learned and cached by the previous submodels, leading to an increased expressive power compared to previous approaches while limiting the incurred privacy costs. We formally prove that ProGAP ensures edge-level and node-level privacy guarantees for both training and inference stages, and evaluate its performance on benchmark graph datasets. Experimental results demonstrate that ProGAP can achieve up to 5-10% higher accuracy than existing state-of-the-art differentially private GNNs.
Keywords
Breaking down a model into interpretable units allows us to better understand how models store representations. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated features, makes interpreting individual neurons challenging. This has led to the search for meaningful directions, known as concept vectors, in activation space instead of looking at individual neurons. This method is able to disentangle polysemantic neurons into concept vectors consisting of linear combinations of neurons that encapsulate distinct features.
Keywords
Source code to perform concept discovery in the latent space of deep learning models with Singular Value Decomposition.
Keywords
Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness.
Keywords
We tackle the challenge of open-set bias detection in text-to-image generative models by presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. Experiments demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.
Keywords
Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).
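For orientation, a minimal evasion-attack sketch (class names follow ART's documented API; the scikit-learn model and the eps value are illustrative):

```python
# Hedged sketch: craft FGM adversarial examples against a sklearn classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

classifier = SklearnClassifier(model=model)
attack = FastGradientMethod(estimator=classifier, eps=0.2)
X_adv = attack.generate(x=X.astype(np.float32))

print("accuracy on adversarial inputs:", (model.predict(X_adv) == y).mean())
```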
Keywords
Novel training-time attacks resulting in corrupted Deep Generative Models (DGMs) that synthesize regular data under normal operations and designated target outputs for inputs sampled from a trigger distribution. Depending on the control that the adversary has over the random number generation, this imposes various degrees of risk that harmful data may enter the machine learning development pipelines, potentially causing material or reputational damage to the victim organization. The attacks are based on adversarial loss functions that combine the dual objectives of attack stealth and fidelity. Its effectiveness is shown for a variety of DGM architectures like StyleGANs and WaveGANs.
Keywords
Repository with the main tools for computing Regression Concept Vectors.
Keywords
Python implementation of ObjectGraphs, a new approach for video event recognition that exploits the relations among objects within each frame. More specifically, a graph, constructed using the appearance features of the objects, is exploited by the model to recognize the video event. Moreover, using the weighted in-degrees of the graph’s adjacency matrix, the model is able to provide insightful explanations for its decisions.
Keywords
Multi-task and Adversarial CNN Training: Learning Interpretable Pathology Features Improves CNN Generalization
Keywords
Privacy-preserving, architecture-agnostic GNN learning algorithm with formal privacy guarantees based on Local Differential Privacy (LDP). This includes a multidimensional ε-LDP algorithm that allows the server to privately collect node features and estimate the first-layer graph convolution of the GNN using the noisy features. Then, to further decrease the estimation error, we introduce KProp, a simple graph convolution layer that aggregates features from higher-order neighbors, which is prepended to the backbone GNN.
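A hedged sketch of the KProp idea, not the released implementation (dense adjacency and toy shapes for brevity; the real code operates on sparse graphs):

```python
import torch

def kprop(adj, x, k=2):
    """Mean-aggregate (noisy) node features from progressively higher-order
    neighbourhoods before the backbone GNN, reducing LDP estimation error."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    for _ in range(k):
        x = (adj @ x) / deg    # one hop of neighbourhood averaging
    return x

adj = (torch.rand(100, 100) < 0.05).float()  # toy random graph
noisy_features = torch.randn(100, 16)        # e.g., after an LDP mechanism
smoothed = kprop(adj, noisy_features, k=3)
```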
Keywords
Diffprivlib is a general-purpose library developed by IBM for experimenting with, investigating, and developing applications in differential privacy:
- Experiment with differential privacy
- Explore the impact of differential privacy on machine learning accuracy using classification and clustering models
- Build your own differential privacy applications, using an extensive collection of mechanisms
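A minimal sketch of the scikit-learn-style workflow (assuming a recent diffprivlib release; the epsilon value and bounds are illustrative):

```python
# Hedged sketch: train a differentially private naive Bayes classifier.
from diffprivlib.models import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True))

# epsilon is the privacy budget; bounds clip features to a known range so the
# noise scale does not depend on the private data itself.
clf = GaussianNB(epsilon=1.0,
                 bounds=(X_train.min(axis=0), X_train.max(axis=0)))
clf.fit(X_train, y_train)
print("test accuracy under differential privacy:", clf.score(X_test, y_test))
```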
Keywords
A tailored Graph Neural Network architecture with Differential Privacy guarantees for both training and inference.
Keywords
Prototype of the AI4Media Evaluation as a Service platform. This platform is derived from the open-source Codalab EaaS platform and contains specific functions adapted for the AI4Media project, as well as an appropriate use-case scenario. This is a prototype version of the platform and will be updated as the project continues.
Keywords
We propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local level details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first solution that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation).
Keywords
We propose two efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Padé Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration.
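For illustration, a minimal sketch of the Matrix Taylor Polynomial idea (not the authors' implementation; the scaling scheme and polynomial degree are assumptions):

```python
import torch

def sqrtm_mtp(A, degree=16):
    """Matrix Taylor Polynomial (MTP) sketch of the matrix square root.

    Assumes A is symmetric positive definite; A is scaled so the binomial
    series (I - Z)^{1/2} = I - sum_k a_k Z^k converges, then rescaled."""
    n = A.shape[-1]
    I = torch.eye(n, dtype=A.dtype)
    norm = torch.linalg.matrix_norm(A)   # Frobenius norm for scaling
    Z = I - A / norm                     # eigenvalues of Z lie in [0, 1)
    S, a, Zk = I.clone(), 1.0, I.clone()
    for k in range(1, degree + 1):
        a = a * (2 * k - 3) / (2 * k) if k > 1 else 0.5  # binomial coeffs
        Zk = Zk @ Z
        S = S - a * Zk
    return S * norm.sqrt()

A = torch.randn(4, 4); A = A @ A.T + 4 * torch.eye(4)  # SPD test matrix
S = sqrtm_mtp(A)
print(torch.dist(S @ S, A))  # approximation error; shrinks as degree grows
```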
Keywords
Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which could harm the model in the training stability and generalization abilities. We systematically study how to improve the covariance conditioning by enforcing orthogonality to the Pre-SVD layer.
Keywords
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications. One crucial bottleneck limiting its usage is the expensive computation cost, particularly for a mini-batch of matrices in the deep neural networks. We propose a QR-based ED method dedicated to the application scenarios of computer vision. Our proposed method performs the ED entirely by batched matrix/vector multiplication, which processes all the matrices simultaneously and thus fully utilizes the power of GPUs.
Keywords
This software can be used for training a deep learning architecture which estimates frames' importance by integrating a concentrated attention mechanism and utilising information about the frames' uniqueness and diversity. The integrated mechanism is able to focus on non-overlapping blocks in the main diagonal of the attention matrix and make better estimates about the significance of different parts of the video by considering the uniqueness and diversity of the associated frames.
Keywords
This software can be used for studying our method for producing explanations for the outcomes of various attention-based video summarization models, and reproducing the reported experimental results in our papers titled "A Study on the Use of Attention for Explaining Video Summarization" (published in the Proc. of the IEEE Int. Symposium on Multimedia (ISM) 2022) and "Explaining Video Summarization Based on the Focus of Attention" (published in the Proc. of the NarSUM workshop at ACM Multimedia 2023).
Keywords
DivClust is a method for controlling inter-clustering diversity in deep clustering frameworks. It consists of a novel loss that can be incorporated in most modern deep clustering frameworks in a straightforward way during their training, and which allows the user to specify their desired degree of inter-clustering diversity, which is then enforced in the form of an upper bound threshold.
Keywords
Code for the training of a video similarity learning network with self-supervision.
Keywords
Code for the knowledge distillation training of coarse- and fine-grained student networks based on similarities calculated from a teacher and the selector network. Also, the scripts for the training of the selector network are included.
Keywords
We propose an adaptive method that introduces soft inter-sample relations, namely Adaptive Soft Contrastive Learning (ASCL). More specifically, ASCL transforms the original instance discrimination task into a multi-instance soft discrimination task, and adaptively introduces inter-sample relations. As an effective and concise plug-in module for existing self-supervised learning frameworks, ASCL achieves the best performance on several benchmarks in terms of both performance and efficiency.
Keywords
A novel self-supervised framework: cross-context learning between global and hypercolumn features (CGH), that enforces the consistency of instance relations between low- and high-level semantics. Specifically, we stack the intermediate feature maps to construct a hypercolumn representation so that we can measure instance relations using two contexts (hypercolumn and global feature) separately, and then use the relations of one context to guide the learning of the other. This cross-context learning allows the model to learn from the differences between the two contexts.
Keywords
We propose a contrastive learning method, called Masked Contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarsely-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample's augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels.
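A hedged sketch of the masked soft-label idea (not the released code; tensor shapes and the temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_soft_labels(sims, coarse_anchor, coarse_batch, temperature=0.1):
    """Similarity-based soft labels over a batch, masked so that only samples
    sharing the anchor's coarse label can act as (soft) positives."""
    mask = (coarse_batch == coarse_anchor).float()
    weights = F.softmax(sims / temperature, dim=0) * mask
    return weights / weights.sum().clamp(min=1e-12)

sims = torch.randn(32)               # anchor-to-batch similarities
coarse = torch.randint(0, 4, (32,))  # coarse labels of the batch
soft = masked_soft_labels(sims, coarse_anchor=2, coarse_batch=coarse)
```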
Keywords
We propose an efficient and robust framework named Sample Selection and Relabelling (SSR), which minimizes the number of modules and hyperparameters required, and achieves good results in various conditions. At the heart of our method is a sample selection and relabelling mechanism based on a non-parametric KNN classifier and a parametric model classifier, respectively, to select the clean samples and gradually relabel the closed-set noise samples. Without bells and whistles, such as model co-training, self-supervised pretraining, and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods.
Keywords
This software can be used for training a deep learning architecture for video thumbnail selection, which quantifies the representativeness and the aesthetic quality of the selected thumbnails using deterministic reward functions, and integrates a frame picking mechanism that takes frames' diversity into account. After being trained on a collection of videos, RL-DiVTS is capable of selecting a diverse set of representative and aesthetically-pleasing video thumbnails for unseen videos, according to a user-specified number of required thumbnails.
Keywords
Group (or cluster) discrimination has been one of the most successful self-supervised representation learning methods for image data. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods' dependence on large batch sizes and is fully online, and therefore scalable to any dataset.
Keywords
Inspired by the recent success of the Transformer network in computer vision, we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. The weight-sharing both in spatial and depth dimensions regularizes the model, allowing it to learn using far fewer free parameters, using only 28k training samples. A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained.
Keywords
"This software can be used for training a deep learning architecture for video thumbnail selection, which quantifies the representativeness and the aesthetic quality of the selected thumbnails using deterministic reward functions, and integrates a frame picking mechanism that takes frames’ diversity into account. After being trained on a collection of videos, RL-DiVTS's Thumbnail Selector is capable of selecting a diverse set of representative and aesthetically-pleasing video thumbnails for unseen videos, according to a user-specified value about the number of required thumbnails. "
Keywords
We propose a novel GAN architecture for compression artifacts reduction in videoconferencing. In this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission. With this assumption, we can maintain a set of reference keyframes of the person from the higher quality I-frames that are transmitted within the video streams. First, we extract multi-scale features from the compressed and reference frames. Then, these features are combined in a progressive manner with Adaptive Spatial Feature Fusion blocks based on facial landmarks and with Spatial Feature Transform blocks. This allows us to restore the high-frequency details lost after the video compression.
Keywords
We propose LLMaker, a general framework for consistent video game content generation empowered by LLMs, bridging the gap between creative vision and technical execution. We demonstrate LLMaker's application in generating dungeon crawler level layouts, comparing it against alternative LLM-based methods for content generation over multiple tests, testing for consistency of the outputs and elapsed time per request.
Keywords
This repository contains all necessary information for the ViLMA benchmark. It includes data setup, model setup and execution, and evaluation procedures. ViLMA (Video Language Model Assessment) presents a comprehensive benchmark for Video-Language Models (VidLMs) to evaluate their linguistic and temporal grounding capabilities in five dimensions: action counting, situation awareness, change of state, rare actions and spatial relations. ViLMA also includes a two-stage evaluation procedure: (i) a proficiency test (P) that assesses fundamental capabilities deemed essential before solving the five tests, and (ii) a main test (T) that evaluates the model under the five proposed tests; a combined score (P+T) aggregates the two.
Keywords
We tackle the novel task of associating images from Wikipedia pages with the correct caption among a large pool of available ones written in multiple languages, as required by the image-caption matching Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias. We propose a cascade of two image-text matching models based on large pre-trained Transformer models. The first model, called Multi-modal Caption Proposal (MCProp), is based on the common space matching approach and uses XLM-RoBERTa and CLIP as text and image feature extractors, respectively. Being very efficient at inference time, this model is used to quickly propose potentially relevant candidates. The second model, Caption Re-Rank (CRank), is a fine-tuned XLM-RoBERTa pairwise classifier. This model is less efficient but more accurate and is used to re-score and reorder the candidates from the first stage.
Keywords
Python implementation of novel Cycle In Cycle Generative Adversarial Network (C2GAN) for the task of keypoint-guided image generation. The C2GAN is a cross-modal framework exploring joint exploitation of the keypoint and the image data in an interactive manner. C2GAN contains two different types of generators, i.e., keypoint-oriented generator and image-oriented generator. Both of them are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain, but also produces useful output involved in the generation of another cycle. By so doing, the cycles constrain each other implicitly, which provides complementary information from the two different modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network.
Keywords
Source code for the DVMS model and training procedure, as well as pre-trained network weights for reproducibility. This deep learning model allows for multiple trajectory predictions of head movements while experiencing 360° videos with a VR headset. The necessary libraries are bundled in a Docker image but can also be installed separately.
Keywords
Implementation of Fast SR-Net for fast video visual quality and resolution improvement. It comprises a GAN-based training procedure for obtaining a fast neural network that enables better bitrate performances with respect to the H.265 codec for the same quality, or better quality at the same bitrate.
Keywords
Novel framework for Playable Video Generation that is trained in a self-supervised manner on a large dataset of unlabelled videos. We employ an encoder-decoder architecture where the predicted action labels act as bottlenecks. The network is constrained to learn a rich action space using, as the main driving loss, a reconstruction loss on the generated video.
Keywords
This is a fork of Few-shot Object Detection (FsDet) (https://github.com/ucbdrive/few-shot-object-detection), adding an easy-to-use tool for training on custom datasets. We have extended the FsDet framework with a tool that dynamically generates datasets from annotation files and drives the training process. The tool has the following features:
- Determine the base and novel classes from the provided annotations (for the novel classes only a subset may be used for training).
- Determine how many instances are available, and set up the k-shot n-way problem accordingly.
- Prepare model structures for the novel-only and combined base+novel finetuning by adjusting the layer sizes to match the number of classes in the different sets.
- If the number of samples strongly varies, set up multiple training problems to make the best use of the data, and run multiple fine-tuning steps.
Keywords
VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend.
Keywords
Novel Deep Micro-Dictionary Learning and Coding Network (DDLCN). DDLCN has most of the standard deep learning layers (pooling, fully connected, input/output, etc.), but the main difference is that the fundamental convolutional layers are replaced by novel compound dictionary learning and coding layers. The dictionary learning layer learns an over-complete dictionary for the input training data. At the deep coding layer, a locality constraint is added to guarantee that the activated dictionary bases are close to each other. Next, the activated dictionary atoms are assembled together and passed to the next compound dictionary learning and coding layers. In this way, the activated atoms in the first layer can be represented by the deeper atoms in the second dictionary. Intuitively, the second dictionary is designed to learn the fine-grained components which are shared among the input dictionary atoms. In this way, a more informative and discriminative low-level representation of the dictionary atoms can be obtained.
Keywords
The new loss function for self-supervised representation learning (SSL) is based on the whitening of the latent-space features. The whitening operation has a "scattering" effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point. Our solution does not require asymmetric networks and it is conceptually simple. Moreover, since negatives are not needed, we can extract multiple positive pairs from the same image instance.
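A minimal sketch in the spirit of this whitening loss (not the exact released implementation; the eps regularizer and tensor shapes are illustrative):

```python
import torch

def whiten(z, eps=1e-4):
    """Whiten a batch of embeddings via a Cholesky factor of the covariance,
    so the whitened batch has (approximately) identity covariance."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1) + eps * torch.eye(z.shape[1])
    L = torch.linalg.cholesky(cov)
    # Solve L X = z^T so X = L^{-1} z^T; then X^T = z L^{-T} is whitened.
    return torch.linalg.solve_triangular(L, z.T, upper=False).T

def whitening_mse_loss(z1, z2):
    """MSE between whitened views; whitening scatters the batch and prevents
    collapse without requiring negatives or asymmetric networks."""
    return ((whiten(z1) - whiten(z2)) ** 2).sum(dim=1).mean()

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)  # two views' embeddings
print(whitening_mse_loss(z1, z2))
```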
Keywords
A tool to allow Visual Transformers (VTs) to learn spatial relations within an image making the VT training much more robust when training data is scarce. The tool can be used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged into the existing VTs. Our method can improve (sometimes dramatically) the final accuracy of the VTs.
Keywords
As the backward algorithm of SVD is prone to numerical instability, we implement a variety of end-to-end SVD methods by manipulating the backward algorithms in this repository. They include:
- SVD-Padé: use Padé approximants to closely approximate the gradient.
- SVD-Taylor: use the Taylor polynomial to approximate the smooth gradient.
- SVD-PI: use Power Iteration (PI) to approximate the gradients.
- SVD-Newton: use the gradient of the Newton-Schulz iteration.
- SVD-Trunc: set an upper limit of the gradient and apply truncation.
- SVD-TopN: select the Top-N eigenvalues and abandon the rest.
- SVD-Original: ordinary SVD with gradient overflow check.
Keywords
We propose a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations with a photo-realistic, controllable generation.
Keywords
We present a novel bipartite graph reasoning Generative Adversarial Network (BiGraphGAN) for two challenging tasks: person pose and facial image synthesis. The proposed graph generator consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively.
Keywords
We propose a novel edge guided generative adversarial network with contrastive learning (ECGAN) for the challenging semantic image synthesis task.
Keywords
We propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative foreground objects and minimize the change of the background. The attention-guided generators in AttentionGAN are able to produce attention masks, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions.
Keywords
We propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation that requires little manual annotation while being applicable to a wide variety of manipulations.
Keywords
We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation, where we translate an input image into another while respecting an external semantic guidance. The proposed SelectionGAN explicitly utilizes the semantic guidance information and consists of two stages. In the first stage, the input image and the conditional semantic guidance are fed into a cycled semantic-guided generation network to produce initial coarse results. In the second stage, we refine the initial results by using the proposed multi-scale spatial pooling & channel selection module and the multi-channel attention selection module. Moreover, uncertainty maps automatically learned from attention maps are used to guide the pixel loss for better network optimization. Exhaustive experiments on four challenging guided image-to-image translation tasks (face, hand, body, and street view) demonstrate that our SelectionGAN is able to generate significantly better results than the state-of-the-art methods. Meanwhile, the proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis.
Keywords
We propose an implicit style function (ISF) to straightforwardly achieve multi-modal and multi-domain image-to-image translation from pre-trained unconditional generators. The ISF manipulates the semantics of an input latent code so that the image generated from it lies in the desired visual domain.
Keywords
A method for dealing with challenges that arise in the domain of affect and mental health in multi-label regression problems. We propose a two-stage attention architecture that uses features from the clips' neighbourhood to introduce context information in the feature extraction. The architecture is novel in the domain of affect and mental state analysis and leads to smaller training times in comparison to the state of the art. Furthermore, we introduce a novel relational regression loss that aims at learning from the label relationships of the samples during training. The proposed loss uses the distance between label vectors to learn intra-batch latent representation similarities in a supervised manner. The improved latent representations obtained with the addition of the relational regression loss lead to improved regression output, without the use of large datasets.
Keywords
A novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, based on the CLIP image encoder, a temporal model consisting of several Transformer encoders is introduced for extracting temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that is related to the classes (facial expressions) that we are interested in recognising -- those descriptions are generated using large language models, like ChatGPT. This is in contrast to works that use only the class names, and it more accurately captures the relationship between them. Alongside the textual description, we introduce a learnable token which helps the model learn relevant context information for each expression during training.
Keywords
Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues) as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification.
Keywords
We focus on the challenge of generalizing across different concept classes, e.g., when training a detector on human faces and testing on synthetic animal images -- highlighting the ineffectiveness of existing approaches that randomly sample generated images to train their models. By contrast, we propose an approach based on the premise that the robustness of the detector can be enhanced by training it on realistic synthetic images that are selected based on their quality scores according to a probabilistic quality estimation model. Our results show that our quality-based sampling method leads to higher detection performance for nearly all concepts, improving the overall effectiveness of the synthetic image detectors.
Keywords
We propose a novel interaction Transformer (InterFormer) consisting of a Transformer network with both temporal and spatial attention. Specifically, temporal attention captures the temporal dependencies of the motion of both characters and of their interaction, while spatial attention learns the dependencies between the different body parts of each character and those which are part of the interaction. Moreover, we propose using graphs to increase the performance of spatial attention via an interaction distance module that helps focus on nearby joints from both characters. The method is general and can be used to generate more complex and long-term interactions.
Keywords
A novel neural architecture, MultiFusion BERT, for fallacious argument detection and classification, combining text, argumentative features, and engineered features.
Keywords
To address the Synthetic Image Detection task more effectively, we propose the RINE model by leveraging Representations from Intermediate Encoder-blocks of CLIP. Specifically, we collect the image representations provided by intermediate Transformer blocks that carry low-level visual information and project them with learnable linear mappings to a forgery-aware vector space. Additionally, a Trainable Importance Estimator (TIE) module is used to incorporate the impact of each intermediate Transformer block in the final prediction.
Keywords
MINTIME is a video deepfake detection method that effectively captures spatial and temporal inconsistencies in videos that depict multiple individuals and varying face sizes. Unlike previous approaches that either employ simplistic a-posteriori aggregation schemes, i.e., averaging or max operations, or only focus on the largest face in the video, our proposed method learns to accurately detect spatio-temporal inconsistencies across multiple identities in a video through a Spatio-Temporal Transformer combined with a Convolutional Neural Network backbone. This is achieved through an Identity-aware Attention mechanism that applies a masking operation on the face sequence to process each identity independently, which enables effective video-level aggregation. Furthermore, our system incorporates two novel embedding schemes: (i) the Temporal Coherent Positional Embedding, which encodes the temporal information of the face sequences of each identity, and (ii) the Size Embedding, which captures the relative sizes of the faces to the video frames. MINTIME achieves state-of-the-art performance on the ForgeryNet dataset, with a remarkable improvement of up to 14% AUC in videos containing multiple people. Moreover, it demonstrates very robust generalization capabilities in cross-forgery and cross-dataset settings.
Keywords
Novel two-stage framework with a new Cascaded Cross MLP-Mixer (CrossMLP) sub-network in the first stage and one refined pixel-level loss in the second stage. In the first stage, the CrossMLP sub-network learns the latent transformation cues between image code and semantic map code via our novel CrossMLP blocks. Then, the coarse results are generated progressively under the guidance of those cues. Moreover, in the second stage, we use a refined pixel-level loss that eases the noisy semantic label problem with more reasonable regularization in a more compact fashion for better optimization.
Keywords
DeepFusion source code. This code corresponds to a DNN-based late fusion approach, that uses a custom number of inducers as inputs and outputs a new result, according to late fusion schemes.
Keywords
Social networks give free access to their services in exchange for the right to exploit their users' data. Data sharing is done in an initial context which is chosen by the users. However, data are used by social networks and third parties in different contexts which are often not transparent. In order to unveil such usages, we propose an approach that focuses on the effects of data sharing in impactful real-life situations. Focus is put on visual content because of its strong influence in shaping online user profiles. The approach relies on three components: (1) a set of visual objects with associated situation impact ratings obtained by crowdsourcing, (2) a corresponding set of object detectors for mining users' photos and (3) a ground truth dataset made of 500 visual user profiles which are manually rated per situation. These components are combined in LERVUP, a method which learns to rate visual user profiles in each situation. LERVUP exploits a new image descriptor which aggregates object ratings and object detections at user level and an attention mechanism which boosts highly-rated objects to prevent them from being overwhelmed by low-rated ones. Performance is evaluated per situation by measuring the correlation between the automatic ranking of profile ratings and a manual ground truth.
Keywords