
Multimodal Benchmarking

We benchmark prominent SOTA music understanding models on our dataset and showcase their performance on a set of canonical and novel metrics.

Taxonomy of Multimodal Music Understanding Tasks

(Figure: taxonomy of multimodal music understanding tasks, illustrated with an example text-audio annotation.)

Sensitivity Metrics

Given a text-audio pair $(t, a)$, we construct a counterfactual textual annotation $t_k$ by changing only the elements of category $k$ in sentence $t$ to be intuitively different (in the example above we consider $k = \text{situational}$ and therefore modify only the situation in which a song is engaged with). We present a set of metrics $\{s_k : k \in \text{caption sets}\}$ used to assess the output of a text-conditioned music generation model $M$, such that

$$
s_k = \frac{1}{n} \sum_{i=1}^{n} \left[ 1 - \operatorname{cosine\_sim}\left(a_i, \tilde{a}_i^{k}\right) \right]
$$

where $n$ is the number of data points, and $a = M(t)$ and $\tilde{a}^{k} = M(t_k)$ are the generated audio clips associated with an original textual annotation $t$ and with its counterfactual annotation $t_k$ (constructed by exchanging subset $k$ for its counterfactual counterpart), respectively. Each metric is bounded in $(0, 1)$, with higher values indicating greater sensitivity.
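As a minimal sketch of the computation, assume each generated clip has already been mapped to a fixed-size embedding by some audio encoder; then $s_k$ reduces to averaging one minus the pairwise cosine similarities. The function and variable names below (`sensitivity_score`, `emb_orig`, `emb_cf`) are illustrative assumptions, not the benchmark's reference implementation:

```python
import numpy as np

def sensitivity_score(emb_orig: np.ndarray, emb_cf: np.ndarray) -> float:
    """Sketch of s_k over n generated-audio pairs (names are hypothetical).

    emb_orig: (n, d) embeddings of a_i = M(t_i), audio generated from the
              original captions.
    emb_cf:   (n, d) embeddings of a~_i^k = M(t_{i,k}), audio generated from
              the category-k counterfactual captions.
    """
    # L2-normalise each row so the dot product equals cosine similarity.
    orig = emb_orig / np.linalg.norm(emb_orig, axis=1, keepdims=True)
    cf = emb_cf / np.linalg.norm(emb_cf, axis=1, keepdims=True)
    cos_sim = np.sum(orig * cf, axis=1)  # shape (n,)
    # s_k = mean over i of 1 - cosine_sim(a_i, a~_i^k).
    return float(np.mean(1.0 - cos_sim))

# Illustrative usage with random stand-ins for encoder embeddings.
rng = np.random.default_rng(0)
emb_orig = rng.normal(size=(8, 512))
emb_cf = rng.normal(size=(8, 512))
print(f"s_k = {sensitivity_score(emb_orig, emb_cf):.3f}")
```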
