Input is a real 16 s excerpt of Orphans (mono, 22.05 kHz) — a full
band mix with sung vocals. This run has true ground truth: the isolated
vocal and instrumental stems (from the original session) are loaded separately
and used as references to score each separator. Both separators take only the
mixture as audio input; the fingerprint method (B) is additionally given the
true vocal's fingerprints.
A. REPET-SIM (repetition-based separation, unsupervised):
decompose.nn_filter(|STFT|, median, cosine, width=2 s) →
element-wise minimum with S (the mixture magnitude |STFT|) → decompose.softmask (margins 2/10, power 2)
→ istft with the mixture phase. It models the repeating background and
treats the non-repeating residue as the foreground, here the vocal.
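The element-wise minimum and soft-mask steps of A can be sketched in plain NumPy. This is a minimal illustration, not the code used for this run. The function names are hypothetical, the mask formula x^p / (x^p + ref^p) is the usual soft-mask definition, and the assignment of margin 2 to the background and margin 10 to the foreground is an assumption (the text only says "margins 2/10"). The STFT, the nearest-neighbour median filter, and the inverse STFT are left out.

```python
import numpy as np


def soft_mask(x, x_ref, power=2.0):
    """Return x**p / (x**p + x_ref**p), using 0 where both inputs are 0."""
    xp = np.asarray(x, dtype=float) ** power
    rp = np.asarray(x_ref, dtype=float) ** power
    denom = xp + rp
    mask = np.zeros_like(denom)
    # Divide only where the denominator is positive to avoid 0/0.
    np.divide(xp, denom, out=mask, where=denom > 0)
    return mask


def split_magnitudes(s_mix, s_repeat, margin_bg=2.0, margin_fg=10.0, power=2.0):
    """Split a mixture magnitude spectrogram into foreground and background.

    s_mix    : mixture magnitude |STFT| (called S in the text)
    s_repeat : output of the repetition filter (nn_filter in the text)
    The margin defaults are an assumed reading of "margins 2/10".
    """
    # The background estimate can never exceed the mixture itself.
    background = np.minimum(s_mix, s_repeat)
    residue = s_mix - background  # non-repeating part, always >= 0

    # Each mask compares one component against a margin-scaled competitor.
    mask_bg = soft_mask(background, margin_bg * residue, power)
    mask_fg = soft_mask(residue, margin_fg * background, power)

    # Masked magnitudes; a full pipeline would reattach the mixture phase
    # to each of these and run the inverse STFT.
    return mask_fg * s_mix, mask_bg * s_mix
```

A larger foreground margin makes the vocal mask stricter: a bin is assigned to the foreground only when the residue clearly dominates the scaled background estimate.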
B. Fingerprint (pleco flagship, supervised — it is
given the true vocal's fingerprints):
processAudioToFingerprints → optimizeEqCurves →
reconstructVocal.
Honest framing: both are pure DSP, with no trained model, no weights, and no GPU. On real, dense material the unsupervised REPET-SIM baseline recovers the vocal only weakly: it needs enough repetition in the background, and dense percussion leaks through. The supervised fingerprint method is handed the true vocal as a target, so it wins clearly, as it should. All correlations are measured against the real stems and reported below; nothing is hidden. (Metric: time-domain Pearson correlation. Because the mix ≈ vocal + instrumental, each estimate can be compared sample by sample with its stem, so the metric is meaningful.)
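For reference, the time-domain Pearson correlation named above can be computed as in the following sketch. The function name and the trimming of both signals to a common length are assumptions made for illustration; the scoring code used for this run is not shown here.

```python
import numpy as np


def pearson(estimate, reference):
    """Time-domain Pearson correlation between two mono signals."""
    est = np.asarray(estimate, dtype=float).ravel()
    ref = np.asarray(reference, dtype=float).ravel()

    # Assumption: signals of slightly different length are trimmed to match.
    n = min(est.size, ref.size)
    est = est[:n] - est[:n].mean()
    ref = ref[:n] - ref[:n].mean()

    denom = np.sqrt(np.sum(est ** 2) * np.sum(ref ** 2))
    # A silent (constant) signal has no defined correlation; report 0.
    return float(np.sum(est * ref) / denom) if denom > 0 else 0.0
```

The score is invariant to gain and DC offset, so it rewards waveform similarity rather than loudness. A sign-flipped estimate scores negative, and a score of 1 does not imply the estimate has the right level.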