Email Intelligence Pipeline — Heimerdinger NLP

Deployed State

Container: CT101 (heimerdinger-nlp) on Z2 Proxmox

Stack: Python 3.11, PyTorch 2.12, sentence-transformers 3.4, BERTopic 0.17, spaCy 3.8, scikit-learn

Three-Phase Pipeline

┌────────────────────────────────────────────────────────────────────────┐
│ P1: Embed                                                              │
│ BAAI/bge-small-en-v1.5 (384d)                                        │
│ subject + sender + snippet → dense vector                             │
│ Output: embeddings.npy (N×384), faiss.index, manifest.jsonl           │
├────────────────────────────────────────────────────────────────────────┤
│ P2: Topic Model                                                        │
│ BERTopic (HDBSCAN + c-TF-IDF)                                        │
│ Unsupervised topic discovery, split personal/professional             │
│ Output: topic-model/{personal,professional}/, topic-map.json          │
├────────────────────────────────────────────────────────────────────────┤
│ P3: Classifier                                                         │
│ CalibratedClassifierCV(LogisticRegression) on BGE embeddings          │
│ 12-class supervised categorisation                                    │
│ Output: classifier.pkl, label-encoder.json                            │
└────────────────────────────────────────────────────────────────────────┘
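The P1 stage in the diagram can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `email-p1-embed.py`: the helper names `compose_text` and `embed_and_index` and the `" | "` separator are hypothetical, and writing `manifest.jsonl` is omitted. The library calls themselves (`SentenceTransformer.encode`, `faiss.IndexFlatL2`, `faiss.write_index`) are standard.

```python
def compose_text(subject: str, sender: str, snippet: str) -> str:
    """Join the three fields into one string (the separator is an assumption)."""
    parts = [subject.strip(), sender.strip(), snippet.strip()]
    return " | ".join(p for p in parts if p)


def embed_and_index(texts, out_dir="."):
    """Embed texts with BGE-small and build a flat L2 FAISS index."""
    # Heavy dependencies are imported lazily so compose_text stays lightweight.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    vectors = np.asarray(model.encode(texts), dtype="float32")  # shape (N, 384)

    index = faiss.IndexFlatL2(vectors.shape[1])  # exact (brute-force) L2 search
    index.add(vectors)

    np.save(f"{out_dir}/embeddings.npy", vectors)
    faiss.write_index(index, f"{out_dir}/faiss.index")
    return vectors, index
```

For example, `compose_text("Invoice", "billing@example.com", "Your statement")` yields one string that is then passed to the encoder.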

Corpus

Source	Messages	Status
Old archive (pre–Graph API)	47,807	Embedded 2026-06-04
New archive (Graph API + Gmail backfill)	35,077	Backfilled 2026-06-11, not yet embedded
Combined	~82,884

Artefact Inventory

All artefacts at /opt/hinata-sandpit/resources/email-intelligence/ on Z2.

File	Size	Description
`embeddings.npy`	70 MB	(47807, 384) float32 dense vectors
`faiss.index`	70 MB	FAISS flat L2 index over embeddings
`manifest.jsonl`	21 MB	Per-email metadata: id, subject, sender, date, account, commander, category, embed_idx
`classifier.pkl`	42 KB	Calibrated LogReg trained on keyword-derived labels
`label-encoder.json`	763 B	12-class label mapping
`topic-map.json`	12 KB	39 topics (4 personal, 35 professional) with top words and commander mapping
`topics.jsonl`	25 MB	Per-email topic assignment
`topic-model/personal/`	1.1 MB	BERTopic model: ctfidf + topic embeddings + topics.json
`topic-model/professional/`	340 KB	BERTopic model
`sender-domain-map.json`	526 KB	Domain → frequency mapping
`sender-fingerprints.json`	151 KB	Sender behavioural fingerprints
`sender-temporal.json`	29 KB	Sender time-of-day patterns
`thread-graph.json`	7 MB	Reply-chain graph
`sender-centroids.json`	2 B	Empty — not yet computed
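
The 70 MB figure for `embeddings.npy` follows directly from the array shape. A quick check, assuming 4 bytes per float32 and ignoring the small `.npy` header; a flat FAISS index stores the raw vectors, so it comes out at about the same size:

```python
# Sanity-check the embeddings.npy size from its shape (47807, 384) float32.
N_EMAILS = 47_807
DIM = 384
BYTES_PER_FLOAT32 = 4

raw_bytes = N_EMAILS * DIM * BYTES_PER_FLOAT32
raw_mib = raw_bytes / (1024 ** 2)

print(raw_bytes)          # 73431552
print(round(raw_mib, 1))  # 70.0
```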

Scripts

Script	Location	Function
`email-p1-embed.py`	`/opt/hinata-sandpit/scripts/`	Embeds corpus with BGE-small, writes embeddings.npy + faiss.index + manifest.jsonl
`email-p2-topics.py`	`/opt/hinata-sandpit/scripts/`	Runs BERTopic on embeddings, writes topic-model/ + topic-map.json
`email-p3-classify.py`	`/opt/hinata-sandpit/scripts/`	Trains LogReg on embeddings + labels, writes classifier.pkl
`email-batch-classify.py`	CT101 `/root/`	Batch inference: reads JSONL, loads BGE + classifier, outputs label + confidence per email

The P1 script still reads from /Users/nnamdi/Sandpit/hinata/resources/email-poller, an old Mac path. The new archive lives at /mnt/data/hinata/mail-archive/, so the script's input path needs updating before the new archive can be embedded.

Classification Labels

ID	Label	Commander	Description
0	marketing	—	Promotions, offers, newsletters with commercial intent
1	passive-newsletter	—	Non-commercial subscriptions, digests
2	ecommerce	—	Order confirmations, shipping, returns
3	career	zuko	Job opportunities, recruiter outreach, applications
4	finance	bulma	Payments, statements, transactions, tax
5	housing	—	Property, rentals, utilities
6	health	allmight	Fitness, nutrition, recovery
7	events	luffy	Invites, RSVPs, tickets, gigs
8	music_arts	squidward	Concerts, albums, galleries, creative
9	learning	shikamaru	Courses, tutorials, research, papers
10	security	itachi	Sign-in alerts, password resets, 2FA
11	general	—	Uncategorised

Topic Model Summary

Segment	Topics	Emails	Largest topic
Personal	4	33,086	Topic 0 (general, n=32,440)
Professional	35	4,277	Topic 4 (zuko/career, n=766)

Live vs ML Classification

The mail pollers (outlook-graph-poller.py, gmail-api-poller.py) use hybrid routing: keyword + sender-domain matching first, ML fallback for unrouted emails via email-batch-classify.py on CT101.

Method	Precision	Coverage	Latency
Keyword/domain rules (primary)	High (hand-tuned)	~25% of emails matched	<1ms
P3 LogReg on BGE (fallback)	Medium (trained on keyword labels)	Unrouted emails with confidence ≥ 0.7	~2s cold start + ~5ms/email

The ML fallback calls CT101 via pct exec with batch input. Only predictions with confidence ≥ 0.7 and a mapped commander are applied. Predictions below threshold are logged as uncertain for active learning review.

Routed messages carry a source field: "keyword" or "ml". ML-routed messages also carry ml_confidence.
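
The routing rules above can be sketched as a single decision function. This is a minimal illustration, not the pollers' actual code: `route`, `COMMANDERS`, `ML_THRESHOLD` and the `status` field are hypothetical names. Only the 0.7 threshold, the `source` and `ml_confidence` fields, and the label-to-commander mapping come from this document.

```python
from typing import Optional

# Label -> commander mapping, copied from the Classification Labels table.
# Labels without a commander are omitted, so they can never be ML-routed.
COMMANDERS = {
    "career": "zuko",
    "finance": "bulma",
    "health": "allmight",
    "events": "luffy",
    "music_arts": "squidward",
    "learning": "shikamaru",
    "security": "itachi",
}

ML_THRESHOLD = 0.7  # confidence cut-off stated in this document


def route(keyword_commander: Optional[str],
          ml_label: Optional[str],
          ml_confidence: float) -> dict:
    """Hypothetical routing decision: keyword match first, ML fallback second."""
    if keyword_commander is not None:
        return {"commander": keyword_commander, "source": "keyword"}

    if ml_confidence < ML_THRESHOLD:
        # Below threshold: not applied, kept for active learning review.
        return {"commander": None, "status": "uncertain"}

    commander = COMMANDERS.get(ml_label) if ml_label else None
    if commander is None:
        # Confident prediction, but the label has no mapped commander.
        return {"commander": None, "status": "unmapped"}

    return {"commander": commander, "source": "ml", "ml_confidence": ml_confidence}
```

A keyword hit always wins, so `route("bulma", "career", 0.99)` returns the keyword result and never consults the ML prediction.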

Embedding Model Specifications

Model	Dim	Speed (CPU)	RAM
`bge-small-en-v1.5` (current)	384	3ms/email	130 MB
`bge-base-en-v1.5`	768	8ms/email	430 MB
`bge-large-en-v1.5`	1024	20ms/email	1.3 GB
`all-MiniLM-L6-v2`	384	2ms/email	90 MB
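
The table can be turned into a rough cost estimate for embedding the new archive. This is back-of-envelope arithmetic only: it takes the per-email CPU speeds above at face value, ignores model load time and batching effects, and the `estimate` helper is a hypothetical name.

```python
# Back-of-envelope embedding cost from the table above (CPU ms per email).
MODELS = {
    "bge-small-en-v1.5": {"dim": 384, "ms_per_email": 3},
    "bge-base-en-v1.5": {"dim": 768, "ms_per_email": 8},
    "bge-large-en-v1.5": {"dim": 1024, "ms_per_email": 20},
    "all-MiniLM-L6-v2": {"dim": 384, "ms_per_email": 2},
}
NEW_ARCHIVE = 35_077  # messages backfilled but not yet embedded


def estimate(n_emails: int, spec: dict) -> tuple:
    """Return (seconds of CPU time, MiB of float32 vectors)."""
    seconds = n_emails * spec["ms_per_email"] / 1000
    vector_mib = n_emails * spec["dim"] * 4 / 1024 ** 2
    return seconds, vector_mib


for name, spec in MODELS.items():
    seconds, mib = estimate(NEW_ARCHIVE, spec)
    print(f"{name}: ~{seconds / 60:.1f} min, ~{mib:.0f} MiB of float32 vectors")
```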

BERTopic Configuration

json

{
  "min_topic_size": 10,
  "nr_topics": "auto",
  "n_gram_range": [1, 1],
  "top_n_words": 10,
  "language": "english",
  "zeroshot_min_similarity": 0.7
}
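
A sketch of how this configuration could be passed to BERTopic, assuming the keys are forwarded unchanged as constructor arguments (they match BERTopic's parameter names). The helper names `bertopic_kwargs` and `build_topic_model` are hypothetical and not taken from `email-p2-topics.py`. Note that in BERTopic `zeroshot_min_similarity` only has an effect when a `zeroshot_topic_list` is also supplied, and this document does not list one.

```python
import json

CONFIG_JSON = """
{
  "min_topic_size": 10,
  "nr_topics": "auto",
  "n_gram_range": [1, 1],
  "top_n_words": 10,
  "language": "english",
  "zeroshot_min_similarity": 0.7
}
"""


def bertopic_kwargs(config: dict) -> dict:
    """Convert the JSON config into keyword arguments for BERTopic."""
    kwargs = dict(config)
    # JSON has no tuple type; the n-gram range is expected as a tuple.
    kwargs["n_gram_range"] = tuple(kwargs["n_gram_range"])
    return kwargs


def build_topic_model(config: dict):
    """Instantiate BERTopic from the config (requires bertopic installed)."""
    from bertopic import BERTopic  # imported lazily: heavy dependency

    return BERTopic(**bertopic_kwargs(config))


kwargs = bertopic_kwargs(json.loads(CONFIG_JSON))
```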

Integration Points

Consumer	Endpoint / File	Purpose
Mail pollers (Outlook/Gmail)	`email-batch-classify.py` on CT101	ML fallback for unrouted emails (live, confidence ≥ 0.7)
mail-30d-digest.py	`classifier.pkl`	Category scoring for digest (not yet wired)
Studio MailFlags	`/api/mail-30d-digest`	Alerts, calendar signals, insights (currently keyword-only)
FAISS index	`faiss.index`	Semantic search ("find emails similar to X")
Topic model	`topic-model/`	Topic exploration, drift detection
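
The semantic search row can be illustrated with a small lookup sketch. It assumes that `embed_idx` in `manifest.jsonl` is the row position of the email in `faiss.index`; that link is an assumption, and the function names are hypothetical.

```python
import json


def manifest_by_embed_idx(jsonl_lines):
    """Build {embed_idx: record} from manifest.jsonl lines."""
    by_idx = {}
    for line in jsonl_lines:
        if line.strip():
            rec = json.loads(line)
            by_idx[int(rec["embed_idx"])] = rec
    return by_idx


def search_similar(index, query_vectors, by_idx, k=5):
    """Return [(record, distance), ...] for the first query vector.

    `index` is anything with a FAISS-style search(x, k) -> (distances, ids).
    For a real index, `query_vectors` must be a float32 array of shape (1, 384).
    """
    distances, ids = index.search(query_vectors, k)
    hits = []
    for dist, idx in zip(distances[0], ids[0]):
        if int(idx) == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        hits.append((by_idx[int(idx)], float(dist)))
    return hits
```

In real use the index would come from `faiss.read_index("faiss.index")`, and the query vector from encoding the query text with the same BGE model used in P1.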

Gold Labels

File: /mnt/data/hinata/resources/mail/gold-labels.jsonl (created 2026-06-12, empty — awaiting first labels)

Schema:

json

{"message_id": "abc123", "label": "career", "source": "michael", "date": "2026-06-12"}

Gold labels are human-verified category assignments. They override keyword-derived labels during P3 retraining. The file is append-only.
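
The append-only and override behaviour can be sketched as below. The function names are hypothetical, and the rule that a later line for the same `message_id` wins is an assumption; the document only states that gold labels override keyword-derived ones.

```python
import json
from pathlib import Path


def append_gold_label(path: Path, message_id: str, label: str,
                      source: str, date: str) -> None:
    """Append one record; mode "a" never rewrites existing lines."""
    record = {"message_id": message_id, "label": label,
              "source": source, "date": date}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_gold_labels(path: Path) -> dict:
    """Return {message_id: label}; a later line for the same id wins (assumption)."""
    gold = {}
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    rec = json.loads(line)
                    gold[rec["message_id"]] = rec["label"]
    return gold


def merge_labels(keyword_labels: dict, gold_labels: dict) -> dict:
    """Gold labels override keyword-derived labels for the same message."""
    return {**keyword_labels, **gold_labels}
```

Because an empty or missing file yields an empty dict, retraining with `merge_labels` falls back to the keyword-derived labels until the first gold labels arrive.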

Cross-links: reference_mail-poller-architecture · reference_z2-service-catalog · reference_z2-container-architecture
