vocal separation on the REAL Orphans master — REPET-SIM (unsupervised) vs pleco fingerprint (supervised)

Input is Cameron's real orphans-mix.wav (22.05 kHz mono, 16 s), decoded with the package's own decodeWav. The ground truth is the real stems: orphans-vocals.wav (isolated vocal) and orphans-instrumental.wav (summed non-vocal stems). The master is the sample-exact sum of the stems and the stems are mutually decorrelated, so a time-domain Pearson correlation of each estimate against each stem is an honest recovery metric — a working vocal estimate must correlate higher with the true vocal than with the true instrumental.
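The recovery metric described above can be sketched in a few lines. This is a minimal illustration, not the package's own scoring code: `pearson` and `vocalOrderingHolds` are hypothetical helper names, and the inputs are assumed to be equal-length arrays of decoded samples.

```javascript
// Hypothetical helpers (not part of the package): time-domain Pearson
// correlation and the "vocal beats instrumental" ordering check.
function pearson(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("pearson: inputs must be non-empty and equal length");
  }
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;

  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  const denom = Math.sqrt(varA * varB);
  // A constant (e.g. silent) signal has no defined correlation; report 0.
  return denom === 0 ? 0 : cov / denom;
}

// A vocal estimate is considered working when it correlates higher with
// the true vocal stem than with the true instrumental stem.
function vocalOrderingHolds(estimate, trueVocal, trueInstrumental) {
  return pearson(estimate, trueVocal) > pearson(estimate, trueInstrumental);
}
```

Because the stems are mutually decorrelated, an estimate that is mostly vocal with a little instrumental bleed should still pass this ordering check, while the raw mix should not.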

A. REPET-SIM (repetition-based separation, unsupervised): decompose.nn_filter(|STFT|, median, cosine, width=2 s) → element-min with S → decompose.softmask (margins 2/10, power 2) → istft with mix phase.

B. fingerprint (pleco flagship, supervised: it is handed the true vocal's fingerprints): processAudioToFingerprints → optimizeEqCurves → reconstructVocal.
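The element-min and soft-mask steps of pipeline A can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `softMask` and `repetMasks` are hypothetical names, the mask formula follows the usual power-ratio soft mask, the spectrogram is treated as a flat array of magnitude bins, and the "2/10" margins are assumed to mean 2 for the background mask and 10 for the vocal mask.

```javascript
// Hypothetical sketch of the masking stage; names and margin assignment
// are assumptions, not the package's API.

// Power-ratio soft mask: x^p / (x^p + xRef^p), computed after dividing
// both inputs by their maximum for numerical stability.
function softMask(x, xRef, power) {
  const mask = new Float64Array(x.length);
  for (let i = 0; i < x.length; i++) {
    const z = Math.max(x[i], xRef[i]);
    if (z === 0) {
      mask[i] = 0; // both inputs are zero: assign no energy to this source
      continue;
    }
    const a = Math.pow(x[i] / z, power);
    const b = Math.pow(xRef[i] / z, power);
    mask[i] = a / (a + b);
  }
  return mask;
}

// S: mixture magnitudes; filtered: output of the nearest-neighbour median
// filter. Returns soft masks for the background and the vocal.
function repetMasks(S, filtered, marginBackground = 2, marginVocal = 10, power = 2) {
  const n = S.length;
  const repeating = new Float64Array(n);
  const residual = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    // Element-min with S: the repeating part cannot exceed the mixture.
    repeating[i] = Math.min(S[i], filtered[i]);
    residual[i] = S[i] - repeating[i];
  }
  const scale = (arr, k) => arr.map((v) => v * k);
  return {
    background: softMask(repeating, scale(residual, marginBackground), power),
    vocal: softMask(residual, scale(repeating, marginVocal), power),
  };
}
```

Each mask would then be multiplied bin by bin with the mixture magnitude and inverted with the mixture phase. A larger vocal margin makes the vocal mask stricter, so a bin is assigned to the vocal only when the residual clearly dominates the repeating part.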

Honest scoreboard (node spot-run 2026-07-02, deterministic — the web page recomputes the identical numbers): the raw mix is instrumental-dominated (corr voc 0.335 / ins 0.947 — the vocal is buried). Both separators flip that ordering, but the supervised fingerprint wins vocal fidelity by a clear margin (corr voc 0.744 vs REPET 0.438), while REPET's background still retains ~65 % of the true instrumental energy. Gate = the fingerprint's vocal-vs-instrumental ordering; REPET's numbers are reported, not hidden.

full scoreboard (all measured against the REAL stems)

reference ground truth (audition)

row 1 — mixture (the real master, what both separators see)

row 2 — estimated vocal (foreground)

row 3 — estimated background (REPET-SIM backing)