Summary of the project

Multimedia content poses distinct challenges for retrieval, classification, archiving, and search. This project will address four of them.

1. Annotation: Given the exponential growth of video content in media libraries, it is infeasible for humans to provide meaningful, fine-grained labels on a frame-by-frame basis. Methods that can learn without this expensive and limiting annotation are required for clustering and understanding footage at multiple levels of detail and across different modalities.

2. Multi-Modal Semantic Search: Current retrieval methods do not let viewers search video using several modalities at once, such as sound, image, and text; for example, “Find me a scene of two people in Paris with romantic music in the style of Wes Anderson”. One possible embedding-based approach is sketched after this list.

3. Memory and Computational Efficiency: Developing novel methods for challenges one and two is difficult with current neural network architectures. This is partly due to the sheer volume of visual and audio data in video and, consequently, the GPU compute and memory needed to train and deploy video networks efficiently and at scale.

4. Continual Learning: Finally, methods for retrieval and archiving must adapt as new content is added to the database without retraining the model on all previous data. Such a system must therefore address the problem of “catastrophic forgetting”: it needs a mechanism that preserves existing knowledge in the network while learning from new, unseen footage (a simple replay-based mechanism is sketched below).
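
As an illustration of challenge 2, the sketch below shows how a multi-modal query could be served once every clip and every query are embedded in a single shared space. All names, the embedding dimension, and the random stand-in embeddings are assumptions for illustration, not the project's design.

    # Sketch of embedding-based cross-modal retrieval (challenge 2).
    # In practice the embeddings would come from pre-trained CLIP-style
    # text/image/audio encoders projected into one joint space; random
    # tensors stand in for them here.
    import torch
    import torch.nn.functional as F

    EMBED_DIM = 512                                   # assumed embedding size

    # Stand-in for a pre-indexed library: one joint embedding per clip,
    # fused from its visual and audio streams.
    num_clips = 10_000
    clip_embeddings = F.normalize(torch.randn(num_clips, EMBED_DIM), dim=-1)

    def search(query_embedding, k=5):
        """Return the indices of the k clips most similar to the query."""
        query = F.normalize(query_embedding, dim=-1)
        scores = clip_embeddings @ query              # cosine similarities
        return scores.topk(k).indices

    # A text query such as "two people in Paris with romantic music" would be
    # encoded by a (hypothetical) text tower into the same space.
    text_query = torch.randn(EMBED_DIM)
    print(search(text_query))

Because every modality lands in the same space, the same search function can serve text, audio, or image queries, or combinations of them.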
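For challenge 4, one standard mechanism against catastrophic forgetting is experience replay: keep a small sample of past footage and rehearse it alongside new data. The sketch below, built on reservoir sampling, is one illustrative option rather than the project's chosen method.

    # Sketch of an experience-replay buffer (challenge 4). Reservoir
    # sampling keeps a bounded, uniformly random sample of everything seen.
    import random

    class ReplayBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = []
            self.seen = 0

        def add(self, example):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(example)
            else:
                # Replace a stored example with probability capacity/seen,
                # so every example ever seen is kept with equal chance.
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.items[j] = example

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))

    # Each training batch on new footage is mixed with replayed old clips,
    # so the network keeps rehearsing what it already knows.
    buffer = ReplayBuffer(capacity=1_000)
    for clip_id in range(10_000):
        buffer.add(clip_id)
    mixed_batch = list(range(10_000, 10_032)) + buffer.sample(32)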

To solve these problems, we will explore how to leverage pre-trained audio, vision, and video networks, keeping computational expense, and with it the approach’s environmental impact, to a minimum.
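
The sketch below illustrates the frozen-backbone strategy this implies: reuse a pre-trained network as a fixed feature extractor and train only a small head, so each experiment back-propagates through a few thousand parameters rather than the whole network. The choice of ResNet-18 and the 128-d head are illustrative assumptions, not the project's actual backbone.

    # Sketch of frozen pre-trained features with a small trainable head.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18, ResNet18_Weights

    backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
    backbone.fc = nn.Identity()           # expose the 512-d pooled features
    for p in backbone.parameters():
        p.requires_grad = False           # frozen: no gradients, no updates
    backbone.eval()

    head = nn.Linear(512, 128)            # the only trainable parameters

    frames = torch.randn(8, 3, 224, 224)  # a batch of video frames
    with torch.no_grad():
        features = backbone(frames)       # cheap forward pass only
    embeddings = head(features)           # shape: (8, 128)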

The team:

Andrew Gilbert
Jon Weinbren
Ed Fish