Email Intelligence Pipeline — Heimerdinger NLP
Deployed State
Container: CT101 (heimerdinger-nlp) on Z2 Proxmox
Stack: Python 3.11, PyTorch 2.12, sentence-transformers 3.4, BERTopic 0.17, spaCy 3.8, scikit-learn
Three-Phase Pipeline
┌────────────────────────────────────────────────────────────────────────┐
│ P1: Embed │
│ BAAI/bge-small-en-v1.5 (384d) │
│ subject + sender + snippet → dense vector │
│ Output: embeddings.npy (N×384), faiss.index, manifest.jsonl │
├────────────────────────────────────────────────────────────────────────┤
│ P2: Topic Model │
│ BERTopic (HDBSCAN + c-TF-IDF) │
│ Unsupervised topic discovery, split personal/professional │
│ Output: topic-model/{personal,professional}/, topic-map.json │
├────────────────────────────────────────────────────────────────────────┤
│ P3: Classifier │
│ CalibratedClassifierCV(LogisticRegression) on BGE embeddings │
│ 12-class supervised categorisation │
│ Output: classifier.pkl, label-encoder.json │
└────────────────────────────────────────────────────────────────────────┘
Corpus
| Source | Messages | Date |
|---|---|---|
| Old archive (pre–Graph API) | 47,807 | Embedded 2026-06-04 |
| New archive (Graph API + Gmail backfill) | 35,077 | Backfilled 2026-06-11, not yet embedded |
| Combined | ~82,884 | |
Artefact Inventory
All artefacts at /opt/hinata-sandpit/resources/email-intelligence/ on Z2.
| File | Size | Description |
|---|---|---|
| embeddings.npy | 70 MB | (47807, 384) float32 dense vectors |
| faiss.index | 70 MB | FAISS flat L2 index over embeddings |
| manifest.jsonl | 21 MB | Per-email metadata: id, subject, sender, date, account, commander, category, embed_idx |
| classifier.pkl | 42 KB | Calibrated LogReg trained on keyword-derived labels |
| label-encoder.json | 763 B | 12-class label mapping |
| topic-map.json | 12 KB | 39 topics (4 personal, 35 professional) with top words and commander mapping |
| topics.jsonl | 25 MB | Per-email topic assignment |
| topic-model/personal/ | 1.1 MB | BERTopic model: ctfidf + topic embeddings + topics.json |
| topic-model/professional/ | 340 KB | BERTopic model |
| sender-domain-map.json | 526 KB | Domain → frequency mapping |
| sender-fingerprints.json | 151 KB | Sender behavioural fingerprints |
| sender-temporal.json | 29 KB | Sender time-of-day patterns |
| thread-graph.json | 7 MB | Reply-chain graph |
| sender-centroids.json | 2 B | Empty; not yet computed |
Scripts
| Script | Location | Function |
|---|---|---|
| email-p1-embed.py | /opt/hinata-sandpit/scripts/ | Embeds corpus with BGE-small, writes embeddings.npy + faiss.index + manifest.jsonl |
| email-p2-topics.py | /opt/hinata-sandpit/scripts/ | Runs BERTopic on embeddings, writes topic-model/ + topic-map.json |
| email-p3-classify.py | /opt/hinata-sandpit/scripts/ | Trains LogReg on embeddings + labels, writes classifier.pkl |
| email-batch-classify.py | CT101 /root/ | Batch inference: reads JSONL, loads BGE + classifier, outputs label + confidence per email |
Note: email-p1-embed.py still reads from /Users/nnamdi/Sandpit/hinata/resources/email-poller (the old Mac path). The new archive is at /mnt/data/hinata/mail-archive/, so the script must be repointed before the new archive can be embedded.
Classification Labels
| ID | Label | Commander | Description |
|---|---|---|---|
| 0 | marketing | — | Promotions, offers, newsletters with commercial intent |
| 1 | passive-newsletter | — | Non-commercial subscriptions, digests |
| 2 | ecommerce | — | Order confirmations, shipping, returns |
| 3 | career | zuko | Job opportunities, recruiter outreach, applications |
| 4 | finance | bulma | Payments, statements, transactions, tax |
| 5 | housing | — | Property, rentals, utilities |
| 6 | health | allmight | Fitness, nutrition, recovery |
| 7 | events | luffy | Invites, RSVPs, tickets, gigs |
| 8 | music_arts | squidward | Concerts, albums, galleries, creative |
| 9 | learning | shikamaru | Courses, tutorials, research, papers |
| 10 | security | itachi | Sign-in alerts, password resets, 2FA |
| 11 | general | — | Uncategorised |
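The table above can be held in memory as a simple lookup. The sketch below is a hypothetical in-memory form; the actual layout of label-encoder.json is not shown in this page, and the names `LABELS`, `commander_for` and `ROUTABLE` are illustrative only.

```python
from typing import Dict, Optional, Tuple

# Label IDs and commander mapping copied from the table above.
# None marks labels with no commander (shown as a dash in the table).
LABELS: Dict[int, Tuple[str, Optional[str]]] = {
    0: ("marketing", None),
    1: ("passive-newsletter", None),
    2: ("ecommerce", None),
    3: ("career", "zuko"),
    4: ("finance", "bulma"),
    5: ("housing", None),
    6: ("health", "allmight"),
    7: ("events", "luffy"),
    8: ("music_arts", "squidward"),
    9: ("learning", "shikamaru"),
    10: ("security", "itachi"),
    11: ("general", None),
}


def commander_for(label_id: int) -> Optional[str]:
    """Return the commander for a label ID, or None if the label has none."""
    return LABELS[label_id][1]


# Labels that can actually be routed to a commander.
ROUTABLE = sorted(name for name, commander in LABELS.values() if commander)
print(len(ROUTABLE), ROUTABLE)
```

Seven of the twelve labels map to a commander; predictions for the other five can never be applied by the ML fallback, whatever their confidence.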
Topic Model Summary
| Segment | Topics | Emails | Largest topic |
|---|---|---|---|
| Personal | 4 | 33,086 | Topic 0 (general, n=32,440) |
| Professional | 35 | 4,277 | Topic 4 (zuko/career, n=766) |
Live vs ML Classification
The mail pollers (outlook-graph-poller.py, gmail-api-poller.py) use hybrid routing: keyword + sender-domain matching first, ML fallback for unrouted emails via email-batch-classify.py on CT101.
| Method | Precision | Coverage | Latency |
|---|---|---|---|
| Keyword/domain rules (primary) | High (hand-tuned) | ~25% of emails matched | <1ms |
| P3 LogReg on BGE (fallback) | Medium (trained on keyword labels) | Unrouted emails with confidence ≥ 0.7 | ~2s cold start + ~5ms/email |
The ML fallback calls CT101 via pct exec with batch input. Only predictions with confidence ≥ 0.7 and a mapped commander are applied. Predictions below threshold are logged as uncertain for active learning review.
Routed messages carry a source field: "keyword" or "ml". ML-routed messages also carry ml_confidence.
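The routing decision described above can be sketched as follows. This is a minimal illustration, not the pollers' actual code: `route`, `Routing`, and the stub `keyword_match` / `ml_predict` callables are hypothetical names, and the real ML call goes through pct exec on CT101 rather than an in-process function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # threshold stated in the text above


@dataclass
class Routing:
    commander: Optional[str]
    source: Optional[str]                  # "keyword" or "ml"; None if unrouted
    ml_confidence: Optional[float] = None
    uncertain: bool = False                # flagged for active learning review


def route(
    email: dict,
    keyword_match: Callable[[dict], Optional[str]],
    ml_predict: Callable[[dict], Tuple[str, float]],
    commander_of: Dict[str, str],
) -> Routing:
    # 1. Keyword / sender-domain rules take priority.
    commander = keyword_match(email)
    if commander is not None:
        return Routing(commander, "keyword")
    # 2. ML fallback: apply only confident predictions with a mapped commander.
    label, confidence = ml_predict(email)
    commander = commander_of.get(label)
    if confidence >= CONFIDENCE_THRESHOLD and commander is not None:
        return Routing(commander, "ml", ml_confidence=confidence)
    # 3. Otherwise leave unrouted; below-threshold predictions are uncertain.
    return Routing(None, None, ml_confidence=confidence,
                   uncertain=confidence < CONFIDENCE_THRESHOLD)


# Stub rules and model, for illustration only.
commander_of = {"career": "zuko", "finance": "bulma"}
keyword = lambda e: "bulma" if "invoice" in e["subject"].lower() else None
ml = lambda e: ("career", 0.82) if "role" in e["subject"].lower() else ("general", 0.41)

by_rule = route({"subject": "Invoice 1234"}, keyword, ml, commander_of)
by_ml = route({"subject": "New role at Acme"}, keyword, ml, commander_of)
unrouted = route({"subject": "Hello"}, keyword, ml, commander_of)
print(by_rule, by_ml, unrouted, sep="\n")
```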
Embedding Model Specifications
| Model | Dim | Speed (CPU) | RAM |
|---|---|---|---|
| bge-small-en-v1.5 (current) | 384 | 3ms/email | 130 MB |
| bge-base-en-v1.5 | 768 | 8ms/email | 430 MB |
| bge-large-en-v1.5 | 1024 | 20ms/email | 1.3 GB |
| all-MiniLM-L6-v2 | 384 | 2ms/email | 90 MB |
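For orientation, the P1 step (subject + sender + snippet → dense vector) might look roughly like the sketch below. The newline separator, the batch size and the `normalize_embeddings=True` setting are assumptions, as are the helper names; email-p1-embed.py may do any of these differently.

```python
from typing import List, Optional

MODEL_NAME = "BAAI/bge-small-en-v1.5"  # current model, 384 dimensions


def build_embed_text(subject: Optional[str], sender: Optional[str],
                     snippet: Optional[str]) -> str:
    """Join the three fields into one string, skipping empty ones.

    The newline separator is an assumption, not taken from the P1 script.
    """
    parts = [subject, sender, snippet]
    return "\n".join(p.strip() for p in parts if p and p.strip())


def embed(texts: List[str]):
    """Return an (N, 384) float32 array of embeddings."""
    # Imported lazily so build_embed_text works without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    # Normalisation is an assumption; with unit vectors, L2 ranking
    # gives the same order as cosine similarity.
    return model.encode(texts, batch_size=64, normalize_embeddings=True)


text = build_embed_text(
    "Interview next week",
    "recruiter@example.com",
    "Hi, are you free on Tuesday?",
)
print(text)
```

Swapping in one of the other models in the table only requires changing `MODEL_NAME`, but the embedding dimension changes with it, so embeddings.npy, faiss.index and the P3 classifier would all need rebuilding.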
BERTopic Configuration
```json
{
  "min_topic_size": 10,
  "nr_topics": "auto",
  "n_gram_range": [1, 1],
  "top_n_words": 10,
  "language": "english",
  "zeroshot_min_similarity": 0.7
}
```
Integration Points
| Consumer | Endpoint / File | Reads |
|---|---|---|
| Mail pollers (Outlook/Gmail) | email-batch-classify.py on CT101 | ML fallback for unrouted emails (live, confidence ≥ 0.7) |
| mail-30d-digest.py | classifier.pkl | Category scoring for digest (not yet wired) |
| Studio MailFlags | /api/mail-30d-digest | Alerts, calendar signals, insights (currently keyword-only) |
| FAISS index | faiss.index | Semantic search ("find emails similar to X") |
| Topic model | topic-model/ | Topic exploration, drift detection |
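The semantic search consumer relies on a flat L2 index, which performs an exhaustive nearest-neighbour scan. The pure-Python stand-in below illustrates what such a search computes on toy 3-d vectors; it does not use FAISS itself, and `l2_search` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def l2_search(index_vectors: Sequence[Sequence[float]],
              query: Sequence[float],
              k: int = 3) -> List[Tuple[int, float]]:
    """Return the k nearest rows as (row index, squared L2 distance).

    Exhaustive scan over every stored vector, nearest first.
    """
    scored = [
        (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
        for i, vec in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "index": in the real pipeline each row would be a 384-d embedding,
# and the row index would correspond to embed_idx in manifest.jsonl.
vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
]
hits = l2_search(vectors, [1.0, 0.0, 0.0], k=2)
print(hits)  # [(1, 0.0), (0, 1.0)]
```

The returned row indices would then be joined against manifest.jsonl to recover the subject, sender and date of each matching email.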
Gold Labels
File: /mnt/data/hinata/resources/mail/gold-labels.jsonl (created 2026-06-12, empty — awaiting first labels)
Schema:
```json
{"message_id": "abc123", "label": "career", "source": "michael", "date": "2026-06-12"}
```
Gold labels are human-verified category assignments. They override keyword-derived labels during P3 retraining. The file is append-only.
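The append-only file and the override rule can be sketched as below. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: a later line for the same message_id supersedes an earlier one, and gold labels for messages without a keyword label are included in the training set.

```python
import json
import os
import tempfile
from typing import Dict


def append_gold_label(path: str, message_id: str, label: str,
                      source: str, date: str) -> None:
    """Append one record; existing lines are never rewritten."""
    record = {"message_id": message_id, "label": label,
              "source": source, "date": date}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_gold_labels(path: str) -> Dict[str, str]:
    """Read the JSONL file into {message_id: label}."""
    gold: Dict[str, str] = {}
    if not os.path.exists(path):
        return gold
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rec = json.loads(line)
                gold[rec["message_id"]] = rec["label"]  # assumption: last wins
    return gold


def merge_labels(keyword_labels: Dict[str, str],
                 gold: Dict[str, str]) -> Dict[str, str]:
    """Gold labels override keyword-derived labels for the same message."""
    return {**keyword_labels, **gold}


# Demonstration against a temporary file rather than the real path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gold-labels.jsonl")
    append_gold_label(path, "abc123", "career", "michael", "2026-06-12")
    merged = merge_labels(
        {"abc123": "marketing", "def456": "finance"},
        load_gold_labels(path),
    )
print(merged)  # {'abc123': 'career', 'def456': 'finance'}
```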
Cross-links: reference_mail-poller-architecture · reference_z2-service-catalog · reference_z2-container-architecture