Igbo ASR — Multilingual Speech Cognition Framework

  task: 300018
  phase: 1 — Acoustic Infrastructure
  created: 2026-05-30
  repo: hinata-sandpit/igbo-asr
  branch: claude/modest-bohr-WoZMI

Vision

A sovereign, framework-independent speech cognition system for Igbo and Nigerian multilingual audio (Igbo / English / Pidgin code-switching). Starting corpus: one 3-hour Nollywood film. End state: a standalone research platform that assimilates, distils, and ultimately replaces all external framework dependencies.

  **Sovereignty principle:** Whisper, wav2vec2, pyannote are _bootstrap scaffolding only_.
  Each is isolated to a single file. Phase 4 distillation retires them one by one.
  The permanent proprietary asset is the **phonetic memory store** (DuckDB + Parquet).

Why Igbo is harder than English

  • Tonal: same CV sequence, 3+ meanings by pitch alone (H / L / Downstepped H)

  • 8 oral vowels in two vowel-harmony sets split by ATR (i e o u vs. ị a ọ ụ)

  • Syllable-timed: timing carries meaning, not just rhythm

  • Code-switching: Nollywood speech transitions Igbo ↔ English ↔ Pidgin freely

  • Data-scarce: <200h of aligned training audio known publicly

Architecture Layers

  | Layer | Name | Role | Status |
  | --- | --- | --- | --- |
  | 1 | Foundation | Raw audio understanding: ingest, surgeon, diarization | Built |
  | 2 | Linguistic | Phonetics, tone contours, formants, syllables, language regions | Built |
  | 3 | Reasoning | Probabilistic interpretation: multi-researcher consensus engine | Phase 1 |
  | 4 | Memory | Acoustic embeddings store (wav2vec2 → own encoder), DuckDB + Parquet | Built |
  | 5 | Consensus | Competing researcher systems with weighted Bayesian fusion | Phase 1 |
  | 6 | Adaptation | Dialect + speaker profile adaptation, active learning loop | Phase 3 |
  | 7 | Output | SRT / VTT (confidence-coded) / JSON annotations aligned to video | Built |
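As an illustration of what "confidence-coded" subtitle output could look like, the sketch below formats one SRT cue and wraps the text in a colour chosen from the segment confidence. The `Cue` class, the thresholds (0.8 / 0.5) and the colour values are assumptions made for this example, not the behaviour of `pipeline/align.py`; the `<font color>` tag is a common SRT extension that not every player honours.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """Hypothetical container for one subtitle cue."""
    index: int
    start: float       # seconds from start of media
    end: float         # seconds from start of media
    text: str
    confidence: float  # 0.0-1.0


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def confidence_colour(confidence: float) -> str:
    """Map confidence to a hex colour (thresholds are illustrative)."""
    if confidence >= 0.8:
        return "#00c853"  # green: likely correct
    if confidence >= 0.5:
        return "#ffab00"  # amber: worth a glance
    return "#d50000"      # red: candidate for the review queue


def format_cue(cue: Cue) -> str:
    """Render one cue as an SRT block with colour-coded text."""
    colour = confidence_colour(cue.confidence)
    return (
        f"{cue.index}\n"
        f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}\n"
        f'<font color="{colour}">{cue.text}</font>\n'
    )
```

Writing the cues to a `.srt` file is then a matter of joining the blocks with a blank line between them.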

Phase 1 Pipeline (Built — igbo-asr/)

  | Stage | Module | Function | Sovereign? |
  | --- | --- | --- | --- |
  | 1 | `pipeline/ingest.py` | Video → mono 16kHz WAV + probe metadata | Yes |
  | 2 | `pipeline/surgeon.py` | Demucs vocals + spectral denoise + EBU R128 normalise | Yes |
  | 3 | `pipeline/diarize.py` | pyannote speaker diarization: timestamps + labels | Partial |
  | 4 | `pipeline/lang_detect.py` | Igbo / English / Pidgin region detection (bootstrap + lexical) | Partial |
  | 5 | `pipeline/phonetic.py` | pYIN F0, tone H/L/M/R/F, MFCC, LPC formants, nasalization | Yes |
  | 6 | `pipeline/bootstrap.py` | Whisper teacher: candidate transcripts (ISOLATED, Phase 4 exits) | Bootstrap |
  | 7 | `pipeline/embeddings.py` | wav2vec2 hidden state embeddings: permanent acoustic memory | Yes |
  | 8 | `pipeline/consensus.py` | Weighted researcher fusion: confidence intervals per word | Yes |
  | 9 | `pipeline/align.py` | SRT / VTT (colour-coded confidence) / JSON annotation export | Yes |

  **First run command:**

  `python run.py dissect --input /path/to/nollywood.mp4 --no-separate`
  (use `--no-separate` on the first run to skip Demucs; drop the flag once the Demucs dependency is confirmed installed)
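For orientation, stage 1 (video → mono 16kHz WAV) can be done with a single FFmpeg call. The sketch below only builds the argument list; the function name and the choice of 16-bit PCM are assumptions for illustration and may differ from what `pipeline/ingest.py` actually does.

```python
from pathlib import Path


def build_ingest_command(video: Path, wav: Path, sample_rate: int = 16_000) -> list[str]:
    """Build an FFmpeg command that extracts mono PCM audio from a video file."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(video),        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # downmix to one channel (mono)
        "-ar", str(sample_rate), # resample to the target rate
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM (assumed format)
        str(wav),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.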

Researcher Ensemble

Each researcher exposes a standard `process(audio, sr, segment_id, ...) → ResearcherOutput` interface. The consensus engine calls all researchers in parallel and fuses their outputs by weighted confidence.

  | Researcher | Signal | Phase 1 Weight | Phase 4 Weight |
  | --- | --- | --- | --- |
  | `audio_surgeon` | Signal quality / SNR | Pre-processor (not weighted) | Pre-processor |
  | `phonetic_cartographer` | F0, tone, MFCC, formants | 0.40 | 0.50 |
  | `linguistic_mapper` | Language ID, code-switching | Signal only | 0.20 |
  | `bootstrap (Whisper)` | Text + word timestamps | 0.60 | 0.00 (retired) |
  | `acoustic_embeddings` (future) | Phoneme exemplar similarity | n/a | 0.30 |
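A minimal sketch of the researcher interface and the weighted fusion step, under stated assumptions: the fields of `ResearcherOutput`, the thread-pool dispatch, and the use of a plain weighted mean of confidences are illustrative choices, not the logic of `pipeline/consensus.py`. Only the researcher names and weights come from the table above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class ResearcherOutput:
    researcher: str    # e.g. "bootstrap" or "phonetic_cartographer"
    confidence: float  # the researcher's own confidence in the segment, 0.0-1.0
    text: str = ""     # candidate transcript; empty for signal-only researchers


class Researcher(Protocol):
    name: str

    def process(self, audio: Sequence[float], sr: int, segment_id: str) -> ResearcherOutput: ...


def run_researchers(researchers, audio, sr, segment_id):
    """Call every researcher on the same segment using a thread pool."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.process, audio, sr, segment_id) for r in researchers]
        return [f.result() for f in futures]


def fused_confidence(outputs, weights):
    """Weighted mean of confidences; researchers with no positive weight are skipped."""
    weighted = [
        (weights[o.researcher], o.confidence)
        for o in outputs
        if weights.get(o.researcher, 0.0) > 0.0
    ]
    total = sum(w for w, _ in weighted)
    if total == 0.0:
        raise ValueError("no weighted researcher produced an output")
    return sum(w * c for w, c in weighted) / total


# Weights copied from the table; pre-processors and signal-only rows are omitted.
PHASE_1_WEIGHTS = {"phonetic_cartographer": 0.40, "bootstrap": 0.60}
PHASE_4_WEIGHTS = {
    "phonetic_cartographer": 0.50,
    "linguistic_mapper": 0.20,
    "bootstrap": 0.00,
    "acoustic_embeddings": 0.30,
}
```

Because the weights live in a dictionary rather than in the fusion code, moving from the Phase 1 to the Phase 4 column means passing a different mapping.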

Phonetic Memory Store (Sovereign Asset)

  | Table | Content | Sovereign? |
  | --- | --- | --- |
  | `source_media` | File registry, duration, format | Yes |
  | `speaker_segments` | Diarized speaker turns with timestamps | Yes |
  | `language_regions` | Ig/EN/PCM regions with confidence | Yes |
  | `bootstrap_segments` | Whisper teacher outputs (temporary training signal) | Bootstrap |
  | `phonetic_features` | F0, tones, MFCC, F1/F2, nasalization, syllable rate | Yes |
  | `acoustic_embeddings` | wav2vec2 hidden states, float32 blobs | Yes |
  | `consensus_results` | Final transcripts, confidence, tone labels | Yes |
  | `review_queue` (view) | Low-confidence segments for human correction | Yes |

Export: all tables → Parquet (ZSTD) at end of each run. DuckDB file is the working store; Parquet is the durable portable archive.
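The end-of-run export can be expressed with DuckDB's `COPY ... TO` statement. The sketch below only generates the SQL, so it runs without a database; the helper name and the output directory layout are assumptions, and the table list is taken from the table above (the `review_queue` view is left out here).

```python
from pathlib import Path

# Base tables from the phonetic memory store.
TABLES = [
    "source_media",
    "speaker_segments",
    "language_regions",
    "bootstrap_segments",
    "phonetic_features",
    "acoustic_embeddings",
    "consensus_results",
]


def export_statements(out_dir, tables=TABLES):
    """Return one COPY statement per table, writing ZSTD-compressed Parquet."""
    out = Path(out_dir)
    return [
        f"COPY {table} TO '{(out / (table + '.parquet')).as_posix()}' "
        f"(FORMAT PARQUET, COMPRESSION ZSTD);"
        for table in tables
    ]


# Usage with a real store (requires the duckdb package):
#   import duckdb
#   con = duckdb.connect("igbo_asr.duckdb")
#   for statement in export_statements("archive"):
#       con.execute(statement)
```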

Phase Plan

  | Phase | Goal | Key Deliverables | Timeline | Status |
  | --- | --- | --- | --- | --- |
  | 1 | Acoustic Infrastructure | Movie dissection pipeline, phonetic store, Whisper bootstrap | M0–2 | Built |
  | 2 | Phonetic Memory System | Scale to 10+ films, tone labelling, speaker profiles | M2–4 | Up next |
  | 3 | Weakly Supervised Learning | Pseudo-labels, human correction loop, Whisper weight drops to 0.2 | M4–8 | Planned |
  | 4 | Distillation (sovereignty) | Own acoustic encoder trained on memory store; Whisper retired | M8–14 | Planned |
  | 5 | Native Framework | Custom tone-aware transformer, dialect adapters, mobile inference | Y2–3 | Planned |

Training Data Requirements

  | Stage | Hours | Quality | Notes |
  | --- | --- | --- | --- |
  | Phase 1 MVP | 3h (1 film) | Weakly labelled | Bootstrap transcripts, phonetic features; first iteration |
  | Phase 2 target | 50–200h | Weakly labelled | 10–50 Nollywood films, YouTube, church recordings |
  | Commercial viability gate | 500–1,500h | Aligned | WER <35% target; diverse speakers required |
  | Strong commercial | 5,000–15,000h | Aligned | WER <20%; viable for enterprise API |
  | Research frontier | 30,000h+ | Aligned + tone-marked | State-of-the-art; requires annotation team |

Data sources (prioritised)

  • Nollywood films with embedded subtitles (alignment target)

  • Common Voice Igbo dataset (public, ~5h as of 2024)

  • YouTube — Igbo news, church sermons, interviews

  • Nigerian radio broadcasts (NTA, AIT)

  • University of Nigeria linguistics corpora

  • WhatsApp voice notes (consented, community collection)

Tech Stack

    Audio processing
    FFmpeg · Librosa · SoundFile · SciPy signal

    Source separation
    Demucs (htdemucs) · spectral subtraction (custom)

    Diarization
    pyannote.audio 3.1 (HF gated)

    Bootstrap teacher
    OpenAI Whisper large-v3 — _temporary_

    Acoustic embeddings
    wav2vec2-base (Hugging Face) · layer -4

    Pitch / tone
    pYIN (librosa) · LPC formants (scipy)

    Data store
    DuckDB 0.10 · Apache Arrow · Parquet ZSTD

    Training (future)
    PyTorch · PyTorch Lightning · DeepSpeed

    Inference (future)
    ONNX Runtime · CTranslate2 · llama.cpp-style

    Compute
    2TB self-hosted server (z2) · CPU now, GPU Phase 3

Storage & Compute Budget

  | Stage | Raw audio | Features + embeddings | Models | Total |
  | --- | --- | --- | --- | --- |
  | Phase 1 (1 film, ~3h) | ~1 GB | ~5 GB | ~10 GB | ~16 GB |
  | Phase 2 (50 films, ~150h) | ~30 GB | ~80 GB | ~15 GB | ~125 GB |
  | Commercial (5,000h) | 1–3 TB | 3–8 TB | ~100 GB | 5–11 TB |
  | Research frontier | 6–20 TB | 10–30 TB | ~500 GB | 20–50 TB |

  2TB server is sufficient through Phase 2. Phase 3+ requires NAS expansion or cloud cold storage.
  Key optimisation: store the extracted embeddings (compact float32) so that retraining reads them directly instead of re-running the raw-audio pipeline.
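A back-of-envelope check on the two largest per-hour costs, as a sketch under explicit assumptions: audio stored as 16-bit mono PCM at 16 kHz, and one wav2vec2-base layer stored uncompressed (768-dimensional float32 vectors at roughly 50 frames per second). It ignores the source video, MFCC and F0 features, additional layers, and Parquet compression, so it is not meant to reproduce the budget figures above, only to show how they scale with hours.

```python
def raw_audio_gb(hours: float, sample_rate: int = 16_000, bytes_per_sample: int = 2) -> float:
    """Size of mono PCM audio in gigabytes (10^9 bytes)."""
    return hours * 3600 * sample_rate * bytes_per_sample / 1e9


def embedding_gb(
    hours: float,
    dim: int = 768,               # wav2vec2-base hidden size
    frames_per_second: int = 50,  # about one frame per 20 ms of audio
    bytes_per_value: int = 4,     # float32
) -> float:
    """Size of one uncompressed embedding layer in gigabytes (10^9 bytes)."""
    return hours * 3600 * frames_per_second * dim * bytes_per_value / 1e9


if __name__ == "__main__":
    for label, hours in [("Phase 1", 3), ("Phase 2", 150), ("Commercial", 5000)]:
        print(f"{label}: audio {raw_audio_gb(hours):.2f} GB, "
              f"embeddings {embedding_gb(hours):.2f} GB")
```

Under these assumptions a single embedding layer is several times larger than the PCM audio it was computed from, which is why the features column dominates the raw audio column at every stage.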

Training time estimates (CPU vs GPU)

  | Task | CPU (z2) | 1× A100 |
  | --- | --- | --- |
  | Phase 1: 1 film, full pipeline | 2–6h | 20–40 min |
  | Phase 3: fine-tune (500h data) | Weeks | 2–5 days |
  | Phase 4: distillation (5,000h) | Not viable | 2–4 weeks |

Evaluation Strategy

  | Metric | Description | Phase 1 target | Launch gate |
  | --- | --- | --- | --- |
  | WER | Word Error Rate (standard ASR) | <60% (bootstrap quality) | <20% |
  | CER | Character Error Rate | <40% | <12% |
  | Tone Error Rate | Correct tone label on voiced segments | Baseline collection | <25% |
  | Language ID Acc. | Ig/EN/PCM region classification | >70% | >90% |
  | RTF | Real-time factor (1.0 = real-time) | <5.0 (CPU) | <1.0 (GPU) |
  | Human score | Native speaker intelligibility 1–5 | Baseline collection | ≥3.5/5 |

  **Tone Error Rate** is the key novel metric. Standard WER misses tone-related meaning errors.
  Same spelling, wrong tone = wrong word. This must be evaluated separately with native speaker judges.
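To make the distinction concrete, the sketch below computes WER with a standard word-level edit distance and a simple tone error rate. The tone metric assumes the reference and hypothesis tone labels are already aligned one-to-one per syllable, which is a simplification; the function names are illustrative and not taken from the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def tone_error_rate(ref_tones, hyp_tones) -> float:
    """Share of syllables whose tone label differs (labels pre-aligned 1:1)."""
    if not ref_tones or len(ref_tones) != len(hyp_tones):
        raise ValueError("tone sequences must be non-empty and equal in length")
    errors = sum(r != h for r, h in zip(ref_tones, hyp_tones))
    return errors / len(ref_tones)
```

A hypothesis can score a WER of 0.0 on orthography without tone marks and still have a non-zero tone error rate, which is the case the separate metric is meant to expose.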

Retraining Strategy

* **Continuous active learning:** human corrections → validation → retraining queue

* **Monthly micro-updates:** small fine-tunes on accumulated corrections

* **Quarterly major retrains:** full model refresh with expanded corpus

* **Drift detection:** monitor WER on a held-out test set; alert if it degrades by more than 2%

* **Intermediate representation reuse:** existing embeddings are reused on reruns without re-extracting them from raw audio

* **Researcher weight rebalancing:** as Whisper weight drops, phonetic + embedding weights rise — no code change needed
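Two of the items above, drift detection and weight rebalancing, reduce to small calculations. The sketch below reads the 2% threshold as absolute WER points and rescales the remaining researchers proportionally when one weight is lowered; both readings are assumptions for illustration, and the project's own schedule (see the Phase 4 weights) need not be proportional.

```python
def wer_drifted(baseline_wer: float, current_wer: float, threshold: float = 0.02) -> bool:
    """True when held-out WER has risen by more than `threshold` (absolute)."""
    return (current_wer - baseline_wer) > threshold


def rebalance(weights: dict, retiring: str, new_weight: float) -> dict:
    """Lower one researcher's weight and scale the others so the total stays 1.0."""
    others = {name: w for name, w in weights.items() if name != retiring}
    other_total = sum(others.values())
    if not 0.0 <= new_weight <= 1.0 or other_total <= 0.0:
        raise ValueError("invalid target weight or no other weighted researchers")
    scale = (1.0 - new_weight) / other_total
    rebalanced = {name: w * scale for name, w in others.items()}
    rebalanced[retiring] = new_weight
    return rebalanced
```

For example, `rebalance({"phonetic_cartographer": 0.40, "bootstrap": 0.60}, "bootstrap", 0.2)` moves the weight released by Whisper onto the phonetic researcher while keeping the weights normalised.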

Launch Gates

  | Gate | Criteria | Unlocks |
  | --- | --- | --- |
  | Gate 1: Technical viability | WER <35%, Tone ER collected, Language ID >70% | Beta user programme, researcher publications |
  | Gate 2: Commercial usability | WER <20%, RTF <1.0, works on phone/YouTube/podcast audio | API launch, pricing, enterprise outreach |
  | Gate 3: Production readiness | Scaling infra, billing, monitoring, GPU cost controls, privacy compliance | Full commercial launch, partnership deals |

Revenue Scenarios

  Year 1

    £20k–£150k
    Research grants · early beta API · subtitle companies · academic licensing

    Prerequisite: Gate 1 cleared, niche but functional product.
    Key risk: accuracy not yet good enough for paying enterprise.

  Year 2

    £150k–£1M
    API platform · media companies · churches · courts · Nollywood distribution

    Prerequisite: Gate 2 cleared (WER <20%).
    Differentiation: only system with tone-aware Igbo transcription.

  Year 3

    £1M–£8M+
    Enterprise deals · multilingual expansion (Yoruba, Hausa) · government/compliance · diaspora

    Prerequisite: Gate 3 + multilingual platform.
    Note: most speech startups fail on inference cost, so GPU cost controls are a Gate 3 blocker.

  Revenue scenarios are speculative. Failure modes: data cost escalation, inference cost squeeze, accuracy plateau, enterprise sales cycle length.
  The strongest moat is the **phonetic memory store** — proprietary Igbo acoustic data that no competitor can easily replicate.

Full Timeline

  M0–1 (now)

    Phase 1 — First iteration complete
    Pipeline built · Run on first Nollywood film · Phonetic memory seeded · Bootstrap quality baseline

  M1–3

    Data acquisition + annotation tooling
    10–20 more films · Human annotation interface · Tone labelling protocol · Speaker diversity audit

  M3–6

    Phase 2 — Phonetic Memory at scale
    50+ films processed · Tone error rate baseline · Language ID accuracy >70%

  M6–12

    Phase 3 — Weakly supervised + first serious model
    Pseudo-label training · Whisper weight 0.2 · WER target <35% · Gate 1

  Y2

    Phase 4 — Distillation + sovereignty
    Own acoustic encoder · Whisper retired · Tone-aware decoder · API beta · Gate 2

  Y3

    Phase 5 — Production + multilingual
    Mobile inference · Yoruba/Hausa expansion · Enterprise sales · Gate 3 · Revenue Y3 range

Immediate Next Actions

  1. Install deps on z2 server: `pip install -r igbo-asr/requirements.txt`

  2. Set `HF_TOKEN` env var for pyannote diarization

  3. Source a Nollywood film with Igbo/English/Pidgin code-switching

  4. First run: `python run.py dissect --input movie.mp4 --no-separate` (skip Demucs first pass)

  5. Review `review_queue.json` output — first human correction session establishes annotation protocol

  6. Inspect `igbo_asr.duckdb` — verify phonetic features table populated correctly

◆ hinata · projects/igbo-asr.html · task-300018 · 2026-05-30