
Multimodal Benchmarking

We benchmark prominent SOTA music understanding models on our dataset and showcase their performance on a set of canonical and novel metrics.

Taxonomy of Multimodal Music Understanding Tasks

(Figure: taxonomy of multimodal music understanding tasks, illustrated with an example text-audio annotation.)

Sensitivity Metrics

Given a text-audio pair $(t, a)$, we construct a counterfactual textual annotation $t_k$ by changing only the elements of category $k$ in sentence $t$ to be intuitively different (in the example above we consider $k = \text{situational}$ and therefore modify only the situation in which a song is engaged with). We present a set of metrics $\{s_k : k \in \text{caption sets}\}$ used to assess the output of a text-conditioned music generation model $M$, such that

$$
s_k = \frac{1}{n} \sum_{i=1}^{n} \left[ 1 - \operatorname{cosine\_sim}\left(a_i, \tilde{a}_i^{k}\right) \right]
$$

where $n$ is the number of data points, and $a = M(t)$ and $\tilde{a}^{k} = M(t_k)$ are the generated audio clips associated with an original textual annotation $t$ and with its counterfactual annotation $t_k$ (constructed by exchanging subset $k$ for its counterfactual counterpart), respectively. Each metric is bounded in $(0, 1)$, with higher values indicating greater sensitivity.
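As a minimal sketch of the computation, assume each generated clip has already been mapped to a fixed-size embedding by some audio encoder; then $s_k$ reduces to averaging one minus the pairwise cosine similarities. The function and variable names below (`sensitivity_score`, `emb_orig`, `emb_cf`) are illustrative assumptions, not the benchmark's reference implementation:

```python
import numpy as np

def sensitivity_score(emb_orig: np.ndarray, emb_cf: np.ndarray) -> float:
    """Sketch of s_k over n generated-audio pairs (names are hypothetical).

    emb_orig: (n, d) embeddings of a_i = M(t_i), audio generated from the
              original captions.
    emb_cf:   (n, d) embeddings of a~_i^k = M(t_{i,k}), audio generated from
              the category-k counterfactual captions.
    """
    # L2-normalise each row so the dot product equals cosine similarity.
    orig = emb_orig / np.linalg.norm(emb_orig, axis=1, keepdims=True)
    cf = emb_cf / np.linalg.norm(emb_cf, axis=1, keepdims=True)
    cos_sim = np.sum(orig * cf, axis=1)  # shape (n,)
    # s_k = mean over i of 1 - cosine_sim(a_i, a~_i^k).
    return float(np.mean(1.0 - cos_sim))

# Illustrative usage with random stand-ins for encoder embeddings.
rng = np.random.default_rng(0)
emb_orig = rng.normal(size=(8, 512))
emb_cf = rng.normal(size=(8, 512))
print(f"s_k = {sensitivity_score(emb_orig, emb_cf):.3f}")
```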
