My main research interests are in multimodal deep learning and music information retrieval, with a focus on joint modelling of audio and language. Drawing on insights from signal processing, NLP and other areas of machine learning, I develop methods that extract information from multiple data modalities, with the aim of learning representations that can bridge the gap between human and machine understanding of music.

Why multimodal learning for music?

Humans understand music by processing information that comes in a variety of modalities: we listen to audio, watch music videos, look at album cover art, and read and write reviews. We also give and receive music recommendations, organise music collections into playlists, and search for known and unknown music through voice-activated devices and search engines, or by exchanging information with other people. In a nutshell, our experience of music is informed not only by audio signals, but also by the visual and textual information associated with a musical piece, as well as by contextual factors and our listening history. Music is therefore multimodal in nature, and non-audio modalities hold meaningful information that complements and supplements audio in shaping our understanding and perception of music.

Machines, on the other hand, often analyse waveforms and metadata in isolation to perform tasks such as music retrieval, tagging and recommendation. Typical deep learning approaches to automatic music understanding, for example, rely purely on audio data to extract low-level features and high-level semantic descriptors, ranging from rhythm, timbre and melody to genre and emotional content. One of the core aspirations of my research, and of AI and music research more broadly, is instead to endow machines with a more human-like, and therefore multimodal, ability to understand music. The end goal is to enable more natural and transparent human-computer interaction in the music domain.

For a high-level overview of my research, you can also check out this short video and poster I prepared for a research showcase at the Alan Turing Institute in February 2021, or the poster I presented at the DMRN+14 workshop in December 2019.



  • MusCaps: Generating Captions for Music Audio [arXiv] [Code]
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    International Joint Conference on Neural Networks (IJCNN) 2021 (accepted)