My main research interests are in multimodal deep learning, self-supervised representation learning, and music and audio understanding, with a focus on joint modelling of audio and language. Drawing on insights from signal processing, NLP and other areas of machine learning, I develop methods that learn from multiple data modalities and produce representations that help bridge the gap between human and machine understanding of music. As part of my PhD and internship research projects, I have worked on contrastive learning, text-guided audio generation via diffusion, and the evaluation of audio-language models.
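To give a concrete flavour of the audio-language piece, below is a minimal sketch (not my exact method) of a CLIP/CLAP-style contrastive objective that aligns paired audio and text embeddings; the encoders, batch shapes and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired (audio, text) embeddings.

    audio_emb, text_emb: tensors of shape (batch, dim), assumed to come from
    an audio encoder and a text encoder (both hypothetical here).
    """
    # Project onto the unit hypersphere so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; matching audio-text pairs lie on the diagonal.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio retrieval.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

Training with an objective like this pulls matching audio clips and text descriptions together in a shared embedding space, which is what makes cross-modal retrieval and zero-shot tagging possible.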
Humans understand music by processing information that comes in a variety of modalities: we listen to audio, watch music videos, look at album cover art, and read and write reviews. We also give and receive music recommendations, organise music collections into playlists, and search for known and unknown music through voice-activated devices and search engines, or by exchanging information with other people. In a nutshell, our experience of music is informed not only by audio signals, but also by the visual and textual information associated with a musical piece, as well as by contextual factors and our listening history. Music is therefore multimodal in nature, and non-audio modalities carry complementary and supplementary information that shapes our understanding and perception of it.

Machines, on the other hand, are often tasked with analysing waveforms and metadata in isolation to perform tasks such as music retrieval, tagging and recommendation. Typical deep learning approaches to automatic music understanding, for example, rely purely on audio data to extract low-level features and high-level semantic descriptors, ranging from rhythm, timbre and melody to genre and emotional content. One of the core aspirations of my research, and of AI and music research more broadly, is instead to endow machines with a more human-like, and therefore multimodal, ability to understand music. The end goal is to enable more natural and transparent human-computer interaction in the music domain.
If you're curious about multimodal machine learning applied to music, I have created a GitHub repo with a list of academic resources on the topic, which can serve as a useful starting point.
A full list of publications can be found on my Google Scholar profile.