Igbo ASR — Multilingual Speech Cognition Framework
task: 300018
phase: 1 — Acoustic Infrastructure
created: 2026-05-30
repo: hinata-sandpit/igbo-asr
branch: claude/modest-bohr-WoZMI
Vision
A sovereign, framework-independent speech cognition system for Igbo and Nigerian multilingual audio (Igbo / English / Pidgin code-switching). Starting corpus: one 3-hour Nollywood film. End state: a standalone research platform that assimilates, distils, and ultimately replaces all external framework dependencies.
**Sovereignty principle:** Whisper, wav2vec2, pyannote are _bootstrap scaffolding only_.
Each is isolated to a single file. Phase 4 distillation retires them one by one.
The permanent proprietary asset is the **phonetic memory store** (DuckDB + Parquet).
Why Igbo is harder than English
* Tonal: the same consonant-vowel sequence can carry three or more meanings distinguished by pitch alone (H / L / downstepped H)
* 8 oral vowels in two vowel-harmony sets (a ị ọ ụ vs. e i o u; ATR distinction)
* Syllable-timed: timing carries meaning, not just rhythm
* Code-switching: Nollywood speech transitions Igbo ↔ English ↔ Pidgin freely
* Data-scarce: <200h of aligned training audio known publicly
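To make the tonal point concrete, here is a minimal sketch of how per-syllable pitch could be turned into H / L / downstepped-H labels. It is not the code in `pipeline/phonetic.py`: the function names, the speaker reference pitch, and the 1-semitone downstep threshold are all illustrative assumptions. The `akwa` glosses are the standard textbook minimal set.

```python
import math

# Classic Igbo minimal set: the segments "akwa" mean different things by tone.
AKWA_GLOSS = {
    ("H", "H"): "cry",
    ("H", "L"): "cloth",
    ("L", "H"): "egg",
    ("L", "L"): "bed / bridge",
}

def semitones(f0_hz: float, ref_hz: float) -> float:
    """Pitch distance from a reference, in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def label_tones(syllable_f0_hz, speaker_ref_hz, downstep_drop=1.0):
    """Label each syllable H, L, or !H (downstepped high).

    speaker_ref_hz is an assumed per-speaker baseline (e.g. median F0).
    downstep_drop is a hypothetical threshold in semitones, not a tuned value.
    """
    labels = []
    prev_high = None  # semitone level of the immediately preceding high syllable
    for f0 in syllable_f0_hz:
        level = semitones(f0, speaker_ref_hz)
        if level < 0:
            labels.append("L")
            prev_high = None
        elif prev_high is not None and prev_high - level >= downstep_drop:
            labels.append("!H")
            prev_high = level
        else:
            labels.append("H")
            prev_high = level
    return labels

# Two syllables at 150 Hz then 210 Hz for a speaker centred on 175 Hz
tones = tuple(label_tones([150.0, 210.0], speaker_ref_hz=175.0))
print(tones, AKWA_GLOSS[tones])
```

A real system would need smoothed F0 tracks, syllable boundaries, and per-speaker normalisation before thresholds like these mean anything; the sketch only shows why pitch has to be modelled relative to the speaker rather than in absolute Hz.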
Architecture Layers
| Layer | Name | Role | Status |
| --- | --- | --- | --- |
| 1 | Foundation | Raw audio understanding: ingest, surgeon, diarization | Built |
| 2 | Linguistic | Phonetics, tone contours, formants, syllables, language regions | Built |
| 3 | Reasoning | Probabilistic interpretation: multi-researcher consensus engine | Phase 1 |
| 4 | Memory | Acoustic embeddings store (wav2vec2 → own encoder), DuckDB + Parquet | Built |
| 5 | Consensus | Competing researcher systems with weighted Bayesian fusion | Phase 1 |
| 6 | Adaptation | Dialect + speaker profile adaptation, active learning loop | Phase 3 |
| 7 | Output | SRT / VTT (confidence-coded) / JSON annotations aligned to video | Built |
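As an illustration of what "confidence-coded" VTT in the Output layer might look like, the sketch below wraps each cue in a WebVTT class span (`<c.low>`, `<c.mid>`, `<c.high>`) that a player stylesheet could colour. The class names, the 0.5 / 0.8 thresholds, and the segment tuple layout are assumptions for illustration, not the format `pipeline/align.py` actually emits.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp: hh:mm:ss.ttt."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def confidence_class(conf: float, low: float = 0.5, high: float = 0.8) -> str:
    """Bucket a 0..1 confidence into a cue class (thresholds are hypothetical)."""
    if conf < low:
        return "low"
    if conf < high:
        return "mid"
    return "high"

def to_vtt(segments) -> str:
    """segments: iterable of (start_s, end_s, text, confidence)."""
    lines = ["WEBVTT", ""]
    for start, end, text, conf in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(f"<c.{confidence_class(conf)}>{text}</c>")
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)

print(to_vtt([(1.0, 2.5, "Kedu ka ị mere?", 0.42), (2.5, 4.0, "I dey fine.", 0.91)]))
```

Keeping the confidence in a class rather than in the cue text means the same file serves both viewers (colour) and reviewers (filter on `c.low`).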
Phase 1 Pipeline (Built — igbo-asr/)
| Stage | Module | Function | Sovereign? |
| --- | --- | --- | --- |
| 1 | `pipeline/ingest.py` | Video → mono 16kHz WAV + probe metadata | Yes |
| 2 | `pipeline/surgeon.py` | Demucs vocals + spectral denoise + EBU R128 normalise | Yes |
| 3 | `pipeline/diarize.py` | pyannote speaker diarization: timestamps + labels | Partial |
| 4 | `pipeline/lang_detect.py` | Igbo / English / Pidgin region detection (bootstrap + lexical) | Partial |
| 5 | `pipeline/phonetic.py` | pYIN F0, tone H/L/M/R/F, MFCC, LPC formants, nasalization | Yes |
| 6 | `pipeline/bootstrap.py` | Whisper teacher: candidate transcripts (ISOLATED, exits in Phase 4) | Bootstrap |
| 7 | `pipeline/embeddings.py` | wav2vec2 hidden state embeddings, permanent acoustic memory | Yes |
| 8 | `pipeline/consensus.py` | Weighted researcher fusion, confidence intervals per word | Yes |
| 9 | `pipeline/align.py` | SRT / VTT (colour-coded confidence) / JSON annotation export | Yes |
**First run command:**
`python run.py dissect --input /path/to/nollywood.mp4 --no-separate`
(Use `--no-separate` on the first run to skip Demucs; drop the flag once the Demucs dependency is confirmed installed.)
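For orientation, stage 1 (video → mono 16kHz WAV) can be approximated with a single FFmpeg call. The sketch below is an assumption about how such a step might be written, not the contents of `pipeline/ingest.py`; the function names and the 16-bit PCM choice are illustrative.

```python
import subprocess
from pathlib import Path

def build_ingest_command(video: Path, wav_out: Path, sample_rate: int = 16_000) -> list[str]:
    """Build an FFmpeg command that extracts mono PCM audio from a video file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz by default)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM (assumed bit depth)
        str(wav_out),
    ]

def ingest(video: Path, wav_out: Path) -> Path:
    """Run FFmpeg and return the WAV path; raises CalledProcessError on failure."""
    subprocess.run(build_ingest_command(video, wav_out), check=True)
    return wav_out

print(" ".join(build_ingest_command(Path("movie.mp4"), Path("movie.wav"))))
```

Separating command construction from execution keeps the step testable on machines where FFmpeg is not installed.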
Researcher Ensemble
Each researcher exposes a standard `process(audio, sr, segment_id, ...) → ResearcherOutput` interface. The consensus engine calls all researchers in parallel and fuses their outputs by weighted confidence.
| Researcher | Signal | Phase 1 Weight | Phase 4 Weight |
| --- | --- | --- | --- |
| `audio_surgeon` | Signal quality / SNR | Pre-processor (not weighted) | Pre-processor |
| `phonetic_cartographer` | F0, tone, MFCC, formants | 0.40 | 0.50 |
| `linguistic_mapper` | Language ID, code-switching | Signal only | 0.20 |
| `bootstrap (Whisper)` | Text + word timestamps | 0.60 | 0.00 (retired) |
| `acoustic_embeddings` (future) | Phoneme exemplar similarity | n/a | 0.30 |
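The weights in the table can be read as coefficients of a weighted average. The sketch below shows that simplest reading: a normalised weighted mean of per-researcher confidences. It is a stand-in for, not a reproduction of, the fusion in `pipeline/consensus.py`; the `ResearcherOutput` fields and the `"bootstrap"` key are assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResearcherOutput:
    researcher: str              # e.g. "phonetic_cartographer"
    segment_id: str
    confidence: float            # 0..1, this researcher's belief in the segment
    text: Optional[str] = None   # only text-producing researchers fill this in

# Weights copied from the table; researchers without a numeric weight are omitted.
PHASE1_WEIGHTS = {"phonetic_cartographer": 0.40, "bootstrap": 0.60}
PHASE4_WEIGHTS = {
    "phonetic_cartographer": 0.50,
    "linguistic_mapper": 0.20,
    "bootstrap": 0.00,
    "acoustic_embeddings": 0.30,
}

def fuse_confidence(outputs, weights) -> float:
    """Weighted mean of confidences, renormalised over researchers that reported."""
    total_weight = 0.0
    weighted_sum = 0.0
    for out in outputs:
        w = weights.get(out.researcher, 0.0)  # unknown researchers contribute nothing
        weighted_sum += w * out.confidence
        total_weight += w
    if total_weight == 0.0:
        raise ValueError("no weighted researcher reported on this segment")
    return weighted_sum / total_weight

outputs = [
    ResearcherOutput("phonetic_cartographer", "seg-001", 0.50),
    ResearcherOutput("bootstrap", "seg-001", 0.90, text="kedu"),
]
print(fuse_confidence(outputs, PHASE1_WEIGHTS))
print(fuse_confidence(outputs, PHASE4_WEIGHTS))
```

Because the weights are plain data, moving from the Phase 1 column to the Phase 4 column is a configuration change rather than a code change.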
Phonetic Memory Store (Sovereign Asset)
| Table | Content | Sovereign? |
| --- | --- | --- |
| `source_media` | File registry, duration, format | Yes |
| `speaker_segments` | Diarized speaker turns with timestamps | Yes |
| `language_regions` | Ig/EN/PCM regions with confidence | Yes |
| `bootstrap_segments` | Whisper teacher outputs (temporary training signal) | Bootstrap |
| `phonetic_features` | F0, tones, MFCC, F1/F2, nasalization, syllable rate | Yes |
| `acoustic_embeddings` | wav2vec2 hidden states, float32 blobs | Yes |
| `consensus_results` | Final transcripts, confidence, tone labels | Yes |
| `review_queue` (view) | Low-confidence segments for human correction | Yes |
Export: all tables → Parquet (ZSTD) at end of each run. DuckDB file is the working store; Parquet is the durable portable archive.
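A minimal sketch of that end-of-run export, assuming the table names listed above and using DuckDB's `COPY ... TO ... (FORMAT PARQUET, COMPRESSION ZSTD)` statement. The helper names and the one-file-per-table layout are assumptions; `con` is expected to be a DuckDB connection, but any object with an `execute(sql)` method works.

```python
from pathlib import Path

# Base tables from the store; review_queue is a view and is left out here.
TABLES = [
    "source_media",
    "speaker_segments",
    "language_regions",
    "bootstrap_segments",
    "phonetic_features",
    "acoustic_embeddings",
    "consensus_results",
]

def export_statements(out_dir: Path, tables=TABLES) -> list[str]:
    """Return one COPY statement per table, each writing a ZSTD Parquet file."""
    statements = []
    for table in tables:
        target = (out_dir / f"{table}.parquet").as_posix()
        statements.append(
            f"COPY {table} TO '{target}' (FORMAT PARQUET, COMPRESSION ZSTD)"
        )
    return statements

def export_all(con, out_dir: Path) -> None:
    """Create the archive directory and run every export statement."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for sql in export_statements(out_dir):
        con.execute(sql)

for sql in export_statements(Path("archive")):
    print(sql)
```

With the `duckdb` package installed, usage would look like `export_all(duckdb.connect("igbo_asr.duckdb"), Path("archive"))`.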
Phase Plan
| Phase | Goal | Key Deliverables | Timeline | Status |
| --- | --- | --- | --- | --- |
| 1 | Acoustic Infrastructure | Movie dissection pipeline, phonetic store, Whisper bootstrap | M0–2 | Built |
| 2 | Phonetic Memory System | Scale to 10+ films, tone labelling, speaker profiles | M2–4 | Up next |
| 3 | Weakly Supervised Learning | Pseudo-labels, human correction loop, Whisper weight drops to 0.2 | M4–8 | Planned |
| 4 | Distillation (sovereignty) | Own acoustic encoder trained on memory store; Whisper retired | M8–14 | Planned |
| 5 | Native Framework | Custom tone-aware transformer, dialect adapters, mobile inference | Y2–3 | Planned |
Training Data Requirements
| Stage | Hours | Quality | Notes |
| --- | --- | --- | --- |
| Phase 1 MVP | 3h (1 film) | Weakly labelled | Bootstrap transcripts, phonetic features; first iteration |
| Phase 2 target | 50–200h | Weakly labelled | 10–50 Nollywood films, YouTube, church recordings |
| Commercial viability gate | 500–1,500h | Aligned | WER <35% target; diverse speakers required |
| Strong commercial | 5,000–15,000h | Aligned | WER <20%; viable for enterprise API |
| Research frontier | 30,000h+ | Aligned + tone-marked | State-of-the-art; requires annotation team |
Data sources (prioritised)
1. Nollywood films with embedded subtitles (alignment target)
2. Common Voice Igbo dataset (public, ~5h as of 2024)
3. YouTube: Igbo news, church sermons, interviews
4. Nigerian radio broadcasts (NTA, AIT)
5. University of Nigeria linguistics corpora
6. WhatsApp voice notes (consented, community collection)
Tech Stack
* **Audio processing:** FFmpeg · Librosa · SoundFile · SciPy signal
* **Source separation:** Demucs (htdemucs) · spectral subtraction (custom)
* **Diarization:** pyannote.audio 3.1 (HF gated)
* **Bootstrap teacher:** OpenAI Whisper large-v3 (_temporary_)
* **Acoustic embeddings:** wav2vec2-base (Hugging Face) · layer -4
* **Pitch / tone:** pYIN (librosa) · LPC formants (scipy)
* **Data store:** DuckDB 0.10 · Apache Arrow · Parquet ZSTD
* **Training (future):** PyTorch · PyTorch Lightning · DeepSpeed
* **Inference (future):** ONNX Runtime · CTranslate2 · llama.cpp-style
* **Compute:** 2TB self-hosted server (z2) · CPU now, GPU Phase 3
Storage & Compute Budget
| Stage | Raw audio | Features + embeddings | Models | Total |
| --- | --- | --- | --- | --- |
| Phase 1 (1 film, ~3h) | ~1 GB | ~5 GB | ~10 GB | ~16 GB |
| Phase 2 (50 films, ~150h) | ~30 GB | ~80 GB | ~15 GB | ~125 GB |
| Commercial (5,000h) | 1–3 TB | 3–8 TB | ~100 GB | 5–11 TB |
| Research frontier | 6–20 TB | 10–30 TB | ~500 GB | 20–50 TB |
2TB server is sufficient through Phase 2. Phase 3+ requires NAS expansion or cloud cold storage.
Key optimisation: store the extracted embeddings (compact float32) so that retraining reads them directly instead of re-running the raw audio pipeline.
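A back-of-envelope check on the embedding column, as a sketch under stated assumptions: one stored wav2vec2-base layer (768 dimensions) at roughly 50 frames per second in float32, and 16-bit mono 16kHz PCM for the working audio. The figures in the table also cover other features and source files, so only the order of magnitude is comparable.

```python
def raw_audio_bytes(hours: int, sample_rate: int = 16_000, bytes_per_sample: int = 2) -> int:
    """Uncompressed mono PCM size (16-bit depth is an assumption)."""
    return hours * 3600 * sample_rate * bytes_per_sample

def embedding_bytes(hours: int, dim: int = 768, frames_per_second: int = 50,
                    bytes_per_value: int = 4) -> int:
    """Size of one stored hidden-state layer, uncompressed float32."""
    return hours * 3600 * frames_per_second * dim * bytes_per_value

for label, hours in [("Phase 1", 3), ("Phase 2", 150), ("Commercial", 5_000)]:
    audio_gb = raw_audio_bytes(hours) / 1e9
    emb_gb = embedding_bytes(hours) / 1e9
    print(f"{label}: {hours}h -> audio {audio_gb:.2f} GB, embeddings {emb_gb:.2f} GB")
```

Under these assumptions one hour of audio yields about 0.55 GB of embeddings, so 150h comes to roughly 83 GB and 5,000h to roughly 2.8 TB before Parquet compression, which is in line with the Phase 2 estimate and a little under the low end of the commercial range.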
Training time estimates (CPU vs GPU)
| Task | CPU (z2) | 1× A100 |
| --- | --- | --- |
| Phase 1: 1 film, full pipeline | 2–6 h | 20–40 min |
| Phase 3: fine-tune (500h data) | Weeks | 2–5 days |
| Phase 4: distillation (5,000h) | Not viable | 2–4 weeks |
Evaluation Strategy
| Metric | Description | Phase 1 target | Launch gate |
| --- | --- | --- | --- |
| WER | Word Error Rate (standard ASR) | <60% (bootstrap quality) | <20% |
| CER | Character Error Rate | <40% | <12% |
| Tone Error Rate | Correct tone label on voiced segments | Baseline collection | <25% |
| Language ID Acc. | Ig/EN/PCM region classification | >70% | >90% |
| RTF | Real-time factor (1.0 = real-time) | <5.0 (CPU) | <1.0 (GPU) |
| Human score | Native speaker intelligibility 1–5 | Baseline collection | ≥3.5/5 |
**Tone Error Rate** is the key novel metric. Standard WER misses tone-related meaning errors.
Same spelling, wrong tone = wrong word. This must be evaluated separately with native speaker judges.
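The gap between WER and tone errors can be shown with a small sketch. It assumes transcripts carry tone diacritics (grave = low, acute = high, macron = downstep) and that tone labels are available per syllable; the function names and the simple mismatch-rate definition of Tone Error Rate are illustrative, not the project's evaluation code.

```python
import unicodedata

# Combining grave, acute, macron. The dot below (U+0323) is a vowel mark, not a tone.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}

def strip_tone_marks(text: str) -> str:
    """Remove tone diacritics but keep dotted vowels such as ị, ọ, ụ."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate as a fraction of reference words."""
    ref_words = unicodedata.normalize("NFC", ref).split()
    hyp_words = unicodedata.normalize("NFC", hyp).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def tone_error_rate(ref_tones, hyp_tones) -> float:
    """Fraction of aligned syllables whose tone label differs."""
    if len(ref_tones) != len(hyp_tones):
        raise ValueError("tone sequences must be aligned syllable by syllable")
    wrong = sum(r != h for r, h in zip(ref_tones, hyp_tones))
    return wrong / len(ref_tones)

ref, hyp = "àkwá", "ákwà"  # "egg" vs "cloth": same letters, opposite tones
print(wer(strip_tone_marks(ref), strip_tone_marks(hyp)))  # toneless scoring
print(wer(ref, hyp))                                      # tone-marked scoring
print(tone_error_rate(["L", "H"], ["H", "L"]))
```

Scored without tone marks the hypothesis looks perfect; scored with them, or by tone labels, it is entirely wrong, which is the case the separate metric exists to catch.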
Retraining Strategy
* **Continuous active learning:** human corrections → validation → retraining queue
* **Monthly micro-updates:** small fine-tunes on accumulated corrections
* **Quarterly major retrains:** full model refresh with expanded corpus
* **Drift detection:** monitor WER on a held-out test set; alert if it degrades by more than 2%
* **Intermediate representation reuse:** existing embeddings are reused across reruns without re-extracting features from raw audio
* **Researcher weight rebalancing:** as Whisper weight drops, phonetic + embedding weights rise — no code change needed
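The drift-detection bullet could be implemented along these lines. This is a sketch only: the class name is hypothetical, and it assumes the 2% threshold means two absolute percentage points of WER on the held-out set, which the plan does not specify.

```python
from dataclasses import dataclass, field

@dataclass
class DriftMonitor:
    baseline_wer: float            # held-out WER (percent) of the last accepted model
    threshold_points: float = 2.0  # assumed to be absolute percentage points
    history: list = field(default_factory=list)

    def record(self, wer_percent: float) -> bool:
        """Store a new held-out WER and return True if it should raise an alert."""
        self.history.append(wer_percent)
        return wer_percent - self.baseline_wer > self.threshold_points

monitor = DriftMonitor(baseline_wer=34.0)
for measured in (34.4, 35.5, 36.5):
    status = "ALERT" if monitor.record(measured) else "ok"
    print(f"WER {measured:.1f}% -> {status}")
```

If the threshold is meant to be relative instead, the comparison becomes a ratio against `baseline_wer`; either way the held-out set has to stay fixed between retrains for the numbers to be comparable.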
Launch Gates
| Gate | Criteria | Unlocks |
| --- | --- | --- |
| Gate 1: Technical viability | WER <35%, Tone ER collected, Language ID >70% | Beta user programme, researcher publications |
| Gate 2: Commercial usability | WER <20%, RTF <1.0, works on phone/YouTube/podcast audio | API launch, pricing, enterprise outreach |
| Gate 3: Production readiness | Scaling infra, billing, monitoring, GPU cost controls, privacy compliance | Full commercial launch, partnership deals |
Revenue Scenarios
Year 1
£20k–£150k
Research grants · early beta API · subtitle companies · academic licensing
Prerequisite: Gate 1 cleared, niche but functional product.
Key risk: accuracy not yet good enough for paying enterprise.
Year 2
£150k–£1M
API platform · media companies · churches · courts · Nollywood distribution
Prerequisite: Gate 2 cleared (WER <20%).
Differentiation: only system with tone-aware Igbo transcription.
Year 3
£1M–£8M+
Enterprise deals · multilingual expansion (Yoruba, Hausa) · government/compliance · diaspora
Prerequisite: Gate 3 + multilingual platform.
Note: most speech startups fail on inference cost, so GPU cost controls are a Gate 3 blocker.
Revenue scenarios are speculative. Failure modes: data cost escalation, inference cost squeeze, accuracy plateau, enterprise sales cycle length.
The strongest moat is the **phonetic memory store** — proprietary Igbo acoustic data that no competitor can easily replicate.
Full Timeline
M0–1 (now)
Phase 1 — First iteration complete
Pipeline built · Run on first Nollywood film · Phonetic memory seeded · Bootstrap quality baseline
M1–3
Data acquisition + annotation tooling
10–20 more films · Human annotation interface · Tone labelling protocol · Speaker diversity audit
M3–6
Phase 2 — Phonetic Memory at scale
50+ films processed · Tone error rate baseline · Language ID accuracy >70%
M6–12
Phase 3 — Weakly supervised + first serious model
Pseudo-label training · Whisper weight 0.2 · WER target <35% · Gate 1
Y2
Phase 4 — Distillation + sovereignty
Own acoustic encoder · Whisper retired · Tone-aware decoder · API beta · Gate 2
Y3
Phase 5 — Production + multilingual
Mobile inference · Yoruba/Hausa expansion · Enterprise sales · Gate 3 · Revenue Y3 range
Immediate Next Actions
1. Install deps on z2 server: `pip install -r igbo-asr/requirements.txt`
2. Set `HF_TOKEN` env var for pyannote diarization
3. Source a Nollywood film with Igbo/English/Pidgin code-switching
4. First run: `python run.py dissect --input movie.mp4 --no-separate` (skip Demucs first pass)
5. Review `review_queue.json` output — first human correction session establishes annotation protocol
6. Inspect `igbo_asr.duckdb` — verify phonetic features table populated correctly
◆ hinata · projects/igbo-asr.html · task-300018 · 2026-05-30