Ilaria Manco

My main research interests are representation learning, music and audio understanding, and controllable generative modelling, with a focus on multimodal approaches. By using insights from generative modelling, signal processing, NLP and other areas of machine learning, my work focusses on developing methods to learn from multiple data modalities and obtain representations that can bridge the gap between human and machine understanding of music. As part of my PhD and internship research projects, I have worked on contrastive learning, text-guided audio generation via diffusion, and evaluation of audio-language models.

Why multimodal learning for music?

Humans understand music by processing information that comes in a variety of modalities: we listen to audio, watch music videos, look at album cover art, and read and write reviews. We also give and receive music recommendations, organise music collections in playlists, and search for known and unknown music through voice-activated devices and search engines, or by exchanging information with other people. In a nutshell, our experience of music is informed not only by audio signals, but also by the visual and textual information associated with a musical piece, as well as contextual factors and our listening history. Music is therefore multimodal in nature, and non-audio modalities hold meaningful complementary information that shapes how we understand and perceive music.

Machines, on the other hand, are often tasked with analysing waveforms and metadata in isolation to perform tasks such as music retrieval, tagging and recommendation. Typical deep learning approaches to automatic music understanding, for example, rely purely on analysing audio data to extract low-level features and high-level semantic descriptors ranging from rhythm, timbre and melody to genre and emotional content. One of the core aspirations of my research, and of AI and music research more broadly, is instead to endow machines with a more human-like, and therefore multimodal, ability to understand music. The end goal is to enable more natural and transparent human-computer interaction in the music domain.

If you're curious about multimodal machine learning applied to music, I have created a GitHub repo with a list of academic resources on the topic that can serve as a useful starting point.

Selected Publications

A full list of publications can be found on my Google Scholar profile.

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Benno Weck*, Ilaria Manco*, Emmanouil Benetos, Elio Quinton, György Fazekas, Dmitry Bogdanov
25th International Society for Music Information Retrieval Conference (ISMIR 2024)
[arXiv] [Code] [Data] [Website]

Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
Ilaria Manco*, Justin Salamon, Oriol Nieto
25th International Society for Music Information Retrieval Conference (ISMIR 2024)
[arXiv] [Paper]

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
Ilaria Manco*, Benno Weck*, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam
Machine Learning for Audio Workshop @ NeurIPS 2023
[arXiv] [Code] [Data]

Song Describer: a Platform for Collecting Textual Descriptions of Music Recordings
Ilaria Manco, Benno Weck, Philip Tovstogan, Minz Won, Dmitry Bogdanov
23rd International Society for Music Information Retrieval Conference (ISMIR 2022) - Late breaking/Demo
[Paper] [Code] [Link]

Contrastive Audio-Language Learning for Music
Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas
23rd International Society for Music Information Retrieval Conference (ISMIR 2022)
[Paper] [arXiv] [Code]

Learning music audio representations via weak language supervision
Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas
2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
[Paper] [arXiv] [Code]

MusCaps: generating captions for music audio
Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas
International Joint Conference on Neural Networks (IJCNN) 2021
[Paper] [arXiv] [Code]
