My main research interests are in multimodal deep learning and music information retrieval, with a focus on the joint modelling of audio and language. Drawing on insights from signal processing, NLP and other areas of machine learning, I develop methods to extract information from multiple data modalities, with the aim of learning representations that bridge the gap between human and machine understanding of music.

If you're curious about multimodal machine learning applied to music, I've created a GitHub repo with a list of academic resources on the topic that can serve as a useful starting point.

Why multimodal learning for music?

Humans understand music by processing information that comes in a variety of modalities: we listen to audio, watch music videos, look at album cover art, and read and write reviews. We also give and receive music recommendations, organise music collections into playlists, and search for known and unknown music through voice-activated devices, search engines or by exchanging information with other people. In a nutshell, our experience of music is informed not only by audio signals, but also by the visual and textual information associated with a musical piece, as well as by contextual factors and our listening history. Music is therefore multimodal in nature, and non-audio modalities carry meaningful complementary and supplementary information that shapes our understanding and perception of music.

Machines, on the other hand, are often tasked with analysing waveforms and metadata in isolation to perform tasks such as music retrieval, tagging and recommendation. Typical deep learning approaches to automatic music understanding, for example, rely purely on audio data to extract low-level features and high-level semantic descriptors, ranging from rhythm, timbre and melody to genre and emotional content. One of the core aspirations of my research, and of AI and music research more broadly, is instead to endow machines with a more human-like, and therefore multimodal, ability to understand music. The end goal is to enable more natural and transparent human-computer interaction in the music domain.

For a high-level overview of my research, you can also check out this short video and poster I prepared for a research showcase at the Alan Turing Institute in February 2021 or the poster I presented at the DMRN+14 workshop in December 2019.

Publications
  • Song Describer: a Platform for Collecting Textual Descriptions of Music Recordings
    Ilaria Manco, Benno Weck, Philip Tovstogan, Minz Won, Dmitry Bogdanov
    23rd International Society for Music Information Retrieval Conference (ISMIR 2022) - Late-Breaking/Demo
    [Paper] [Code] [Link]

  • Contrastive Audio-Language Learning for Music
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    23rd International Society for Music Information Retrieval Conference (ISMIR 2022)
    [Paper] [arXiv] [Code]

  • Learning music audio representations via weak language supervision
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    [Paper] [arXiv] [Code]

  • MusCaps: generating captions for music audio
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    International Joint Conference on Neural Networks (IJCNN) 2021
    [Paper] [arXiv] [Code]