Beth Pearson

General Profile:

My name is Beth and I studied Engineering Mathematics at the University of Bristol. This is where I first learned about AI and became interested in areas such as machine learning, natural language processing and computational neuroscience. My favourite part of the degree was my final-year project on machine learning for smart home data, which made me realise my love for research and exploring new ideas. Before joining the Interactive AI CDT I worked as a software developer at a small company called Wilxite for two years. I enjoyed my time there, as it allowed me to develop my programming skills through website building; however, I missed the excitement of research and wanted to focus on AI and machine learning, as this is where my true interests lie. I'm unsure which field my PhD project will focus on, but I am looking forward to figuring it out.

Research Project Summary:

Humans learn by combining senses such as sight and hearing to understand the world around them. Recently there has been rapid development of models which can process both images and text, known as vision-language models (VLMs). They have been very successful in tasks such as image captioning, text-guided image generation and visual question answering. However, they remain limited compared to human cognition. In particular, they struggle with concept binding: associating and combining different attributes or features to accurately describe an object or scene.

For instance, when presented with an image of a blue circle, a VLM might produce inaccurate descriptions, such as "blue square" or "red circle", because it struggles to connect and integrate information about colour and shape.

This discrepancy arises from the VLM's difficulty in merging distinct elements of perception into a cohesive representation. In contrast, human cognition excels at concept binding, enabling us to instantaneously recognise and label a "blue circle" thanks to our innate ability to integrate different facets of visual information.
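As an illustration (not part of the project itself), here is a minimal sketch of how such a binding failure can be probed with an off-the-shelf VLM such as CLIP, via the Hugging Face transformers library. The image file name is hypothetical:

```python
# Probe concept binding by asking CLIP to match an image against captions
# that recombine the same colour and shape attributes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("blue_circle.png")  # hypothetical test image
captions = ["a blue circle", "a blue square", "a red circle", "a red square"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-caption similarity scores

probs = logits.softmax(dim=-1).squeeze()
for caption, p in zip(captions, probs):
    print(f"{caption}: {p:.3f}")
# A model with weak concept binding may score "a blue square" or
# "a red circle" nearly as highly as the correct caption.
```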

Additionally, humans are able to learn new concepts, apply them in novel scenarios, and generalise to unseen combinations. For example, if you understand the concepts of 'purple' and 'cat', you will be able to accurately describe a purple cat if you saw one, despite never having seen one before. VLMs, however, have been shown to be unable to generalise to new, unseen combinations of concepts.

Researchers are actively engaged in refining VLMs to enhance their concept-binding capabilities, aiming to bring these artificial systems closer to the proficiency exhibited by humans in accurately describing and interpreting the visual world.

This project aims to improve the compositional understanding and generalisability of vision-language models so that they can handle more complex descriptions, learn from fewer training examples, and achieve broader versatility by describing extensive combinations of learned concepts. To achieve these objectives, insights will be drawn from generative models, such as diffusion models, which offer the potential for more robust image embeddings than existing state-of-the-art VLMs. Additionally, alternatives for the text processing of VLMs will be explored through examination of current weaknesses in text representation, and additional training datasets will be considered to enhance VLM performance in complex scenarios.

This project falls within the EPSRC AI thematic research area.


Supervisors:

Website:
