Vocal separation — REPET-SIM (nn_filter) vs pleco's fingerprint method, on a real vocal mix

Input is a real 16 s excerpt of Orphans (mono, 22.05 kHz): a full-band mix with sung vocals. This run has true ground truth: the isolated vocal and instrumental stems (from the original session) are loaded separately and used only as references to score each separator. Both separators see only the mixture.

A. REPET-SIM (repetition-based separation, unsupervised): decompose.nn_filter(|STFT|, median, cosine, width=2 s) → element-min with S → decompose.softmask (margins 2/10, power 2) → istft with the mixture phase. It models the repeating background and treats the non-repeating residue as foreground.

B. Fingerprint (pleco flagship, supervised: it is given the true vocal's fingerprints): processAudioToFingerprints → optimizeEqCurves → reconstructVocal.

Honest framing: both methods are pure DSP, with no trained model, no weights, and no GPU. On real, dense material the unsupervised REPET-SIM baseline recovers the vocal only weakly (it needs enough background repetition, and dense percussion leaks through). The supervised fingerprint method, which is handed the true vocal as a target, wins clearly, as it should. All correlations are measured against the real stems and reported below; nothing is hidden. (Metric: time-domain Pearson correlation. Because the mix ≈ vocal + instrumental, correlating each output with the matching stem is meaningful.)
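For reference, the time-domain Pearson correlation used as the metric can be computed as in the sketch below. The function name pearson is an assumption for illustration, as is the choice to trim both signals to their common length before comparing; the scoring code used for this run may handle alignment differently.

```python
import numpy as np


def pearson(a, b):
    """Time-domain Pearson correlation between two mono signals.

    Hypothetical helper: trims both inputs to their common length first.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = min(len(a), len(b))

    # Remove the mean (DC offset) from each signal.
    a = a[:n] - a[:n].mean()
    b = b[:n] - b[:n].mean()

    # Normalised inner product; return 0.0 if either signal is constant.
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```

Each separator output would be scored against both stems, for example pearson(foreground, vocal_stem) and pearson(foreground, instrumental_stem), so that a foreground estimate which merely copies the mixture does not look like a good vocal.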

scoreboard (correlations vs the real Orphans stems)

mixture |STFT| (0–~5 kHz)

A. REPET-SIM foreground (should lean vocal)

A. REPET-SIM background (should lean instrumental)

B. fingerprint reconstruction (should track the true vocal)