Common Fate Model for Unison Source Separation
Fabian-Robert Stöter, Antoine Liutkus,
Roland Badeau, Bernd Edler, Paul Magron
March 21, 2016
Common Source Separation Scenario
Unison Scenario
Wanted
- Signal representation
- exploiting differences in AM, FM, PM
- Easily invertible
- Suitable Model
Related Work
- Frequency-dependent activation matrices by using a source/filter-based model (Hennequin 2011)
- HR-NMF models each complex entry of a time-frequency as a linear combination of its neighbours (Badeau 2011)
- Exploiting AM by computing a modulation spectrogram and factorise using NTF (Barker 2013)
Common Fate in Audio
- Bregman 1994 used term in auditory scene analysis.
- Ability to group sound objects based on their common motion over time
- Humans ability to detect and group sound sources by small differences in the FM and AM modulation is outstanding
Proposing: Transformation which groups common modulation textures to sound sources
Common Fate Transform
Audio
$x \in \mathbb{R}^{72000}$
STFT
$\mathbf{X} \in \mathbb{C}^{352 \times 279}$
Common Fate Transform
STFT Grid
$\mathcal{G} \in \mathbb{C}^{32 \times 48 \times 11 \times 6}$
|
CFT
$\mathcal{V} \in \mathbb{C}^{32 \times 48 \times 11 \times 6}$
|
In Detail
Compared to modulation spectrograms...
- CFT is computed using complex STFT $X$
- Easily invertible
- Models phase dependencies between neighbouring STFT entries
- Patches span/merge several frequency bins
- Results in modulation texture
NMF
$$\sum\limits_{j=1}^{J} \mathbf{w}_{j}(f) \circ \mathbf{h}_{j}(t) $$
Common Fate Model
$$\sum\limits_{j=1}^{J} \mathcal{A}_{j}(a,b,f) \circ \mathbf{h}_{j}(t)$$
Common Fate Model
$$\sum\limits_{j=1}^{J} \mathcal{A}_{j}(a,b,f) \circ \mathbf{h}_{j}(t)$$
CPD/PARAFAC/NTF
$$\sum\limits_{j=1}^{J} \mathbf{w}_{j}(f) \circ \mathbf{m}_{j}(b) \circ \mathbf{h}_{j}(t)$$
Signal Separation
- Compute the CFT from audio signal to get tensor $\mathcal{V}$
- Take the magnitude $|\mathcal{V}|$
- Initialise $\mathcal{A}$ and $\mathbf{h}$ with random non-negative values
- Apply multiplicative update rule to minimize $\beta$-divergence
- Synthesise factorised components using Wiener filtering
- Inverse CFT
Dataset
- Single pitches (C4 at 261.63 Hz)
- Viola
- Cello
- Tenor sax
- English horn
- Flute
- $\rightarrow$ ten mixtures of two instruments each
- Mixtures generated with a simple A — B — (A + B) scheme.
- Data were encoded in 44.1 kHz / 16 bit.
Models
- NMF Non-Negative Matrix Factorization
- MOD CP on modulation spectrogram
- CFM Common Fate Model
- CFMM Common Fate Magnitude Model
- CFMMOD CFMM with $a=1$
- HR-NMF High Resolution NMF model
Evaluation Results
Number of Components
Conclusion
- CFT a transformation based on a complex tensor representation computed from patches of the STFT
- CFM derived from the idea of humans perceiving common modulation over time as one source.
- Our results on unisonous musical instruments indicate that this method can perform well for this scenario.