Given a text-audio pair $(t, a)$, we construct a counterfactual textual annotation $t'$ by changing only the elements of a given category in the sentence $t$ so that they are intuitively different (in the example above we consider the situational category and therefore modify only the situation in which a song is engaged with). We present a set of metrics used to assess the output of a text-conditioned music generation model, each of the general form
\[
S = \frac{1}{N} \sum_{i=1}^{N} d\!\left(a_i, a'_i\right),
\]
where $N$ is the number of data points, $d(\cdot,\cdot)$ is a normalized distance between generated audio, and $a_i$, $a'_i$ are the generated audio associated with an original textual annotation $t_i$ and with its counterfactual annotation $t'_i$, constructed by exchanging the category subset for its counterfactual counterpart, respectively. Each metric is bounded between $0$ and $1$, with higher values indicating higher sensitivity.
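To make the computation concrete, the following is a minimal Python sketch of one such metric, assuming each generated audio clip has already been mapped to a fixed-size embedding (e.g., by a pretrained audio encoder) and using a cosine distance rescaled to $[0, 1]$ as the normalized $d$; both the embedding step and the choice of distance are illustrative assumptions, not the paper's exact metrics.

```python
import numpy as np

def cosine_distance01(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine distance rescaled to lie in [0, 1]."""
    sim = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
    return (1.0 - sim) / 2.0

def sensitivity(orig_embeds: list, cf_embeds: list) -> float:
    """Average distance between audio generated from each original annotation
    and from its counterfactual; higher values indicate higher sensitivity."""
    assert len(orig_embeds) == len(cf_embeds), "need one counterfactual per original"
    dists = [cosine_distance01(a, a_cf) for a, a_cf in zip(orig_embeds, cf_embeds)]
    return float(np.mean(dists))  # bounded in [0, 1] since each distance is

# Toy usage with random vectors standing in for real audio-encoder embeddings.
rng = np.random.default_rng(0)
orig = [rng.standard_normal(512) for _ in range(10)]
cf = [rng.standard_normal(512) for _ in range(10)]
print(sensitivity(orig, cf))
```

Under this formulation, a model that ignores the counterfactual edit produces near-identical audio pairs and scores close to $0$, while a model whose output changes substantially with the edit scores closer to $1$.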