My main research interests are in multimodal deep learning, self-supervised representation learning, and music and audio understanding, with a focus on joint modelling of audio and language. Drawing on insights from signal processing, NLP and other areas of machine learning, my work focusses on developing methods that learn from multiple data modalities and produce representations that can bridge the gap between human and machine understanding of music. As part of my PhD and internship research projects, I have worked on contrastive learning, text-guided audio generation via diffusion, and evaluation of audio-language models.

Why multimodal learning for music?

Humans understand music by processing information that comes in a variety of modalities: we listen to audio, watch music videos, look at album cover art, and read and write reviews. We also give and receive music recommendations, organise music collections into playlists, and search for known and unknown music through voice-activated devices, search engines or by exchanging information with other people. In a nutshell, our experience of music is informed not only by audio signals, but also by the visual and textual information associated with a musical piece, as well as by contextual factors and our listening history. Music is therefore multimodal in nature, and non-audio modalities hold meaningful complementary information that shapes how we understand and perceive it. Machines, on the other hand, are often tasked with analysing waveforms and metadata in isolation to perform tasks such as music retrieval, tagging and recommendation. Typical deep learning approaches to automatic music understanding, for example, rely purely on analysing audio data to extract low-level features and high-level semantic descriptors ranging from rhythm, timbre and melody to genre and emotional content. One of the core aspirations of my research, and of AI and music research more broadly, is instead to endow machines with a more human-like, and therefore multimodal, ability to understand music. The end goal is to enable more natural and transparent human-computer interaction in the music domain.

If you're curious about multimodal machine learning applied to music, I have created a GitHub repo with a list of academic resources on the topic that can serve as a useful starting point.


Selected Publications

A full list of publications can be found on my Google Scholar profile.

  • The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
    Ilaria Manco*, Benno Weck*, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam
    Machine Learning for Audio Workshop @ NeurIPS 2023
    [arXiv] [Code]

  • Song Describer: a Platform for Collecting Textual Descriptions of Music Recordings
    Ilaria Manco, Benno Weck, Philip Tovstogan, Minz Won, Dmitry Bogdanov
    23rd International Society for Music Information Retrieval Conference (ISMIR 2022) - Late breaking/Demo
    [Paper] [Code] [Link]

  • Contrastive Audio-Language Learning for Music
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    23rd International Society for Music Information Retrieval Conference (ISMIR 2022)
    [Paper] [arXiv] [Code]

  • Learning music audio representations via weak language supervision
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    [Paper] [arXiv] [Code]

  • MusCaps: generating captions for music audio
    Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
    International Joint Conference on Neural Networks (IJCNN) 2021
    [Paper] [arXiv] [Code]