
Benchmark for Multimodal Research

We benchmarked SOTA music understanding models on the dataset.

Taxonomy of Multimodal Music Understanding Tasks


Sensitivity Metrics

Given a text-audio pair $(t, a)$, we construct a counterfactual textual annotation $t^k$ by changing only the elements of category $k$ in sentence $t$ so that they are intuitively different (in the example above we consider $k =$ situational and therefore modify only the situation in which a song is engaged with). We present a set of metrics $\{s_k : k \in \text{caption sets}\}$ used to assess the output of a text-conditioned music generation model, such that

$$s_k = \frac{1}{n}\left[\sum_{i=1}^{n} \left(1 - \text{cosine\_sim}\left(a_i, \tilde{a}_i^k\right)\right)\right]$$

where $n$ is the number of data points, and $a = M(t)$ and $\tilde{a}^k = M(t^k)$ are the audio generated by the model $M$ from an original textual annotation $t$ and from its counterfactual annotation $t^k$, respectively. The counterfactual annotation is constructed by exchanging subset $k$ for its counterfactual counterpart. Each metric is bounded between $(0, 1)$, with higher values indicating higher sensitivity.
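The metric can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each generated audio clip has already been mapped to a fixed-length embedding vector (the text does not say how the cosine similarity between audio clips is computed), and the function names `cosine_sim` and `sensitivity` are hypothetical.

```python
import math
from typing import Sequence


def cosine_sim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def sensitivity(
    originals: Sequence[Sequence[float]],
    counterfactuals: Sequence[Sequence[float]],
) -> float:
    """Mean of (1 - cosine similarity) over paired embeddings.

    originals[i] is assumed to be an embedding of the audio generated from
    the original annotation; counterfactuals[i] is assumed to be an embedding
    of the audio generated from the counterfactual annotation for category k.
    """
    if len(originals) != len(counterfactuals) or len(originals) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(
        1.0 - cosine_sim(a, a_k) for a, a_k in zip(originals, counterfactuals)
    )
    return total / len(originals)
```

Under these assumptions, passing the same embeddings for both arguments gives a sensitivity of 0 (the model ignored the change to category k), while pairwise orthogonal embeddings give 1.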

Released under the MIT License.