Email Intelligence Pipeline — Heimerdinger NLP

Deployed State

Container: CT101 (heimerdinger-nlp) on Z2 Proxmox

Stack: Python 3.11, PyTorch 2.12, sentence-transformers 3.4, BERTopic 0.17, spaCy 3.8, scikit-learn

Three-Phase Pipeline

┌────────────────────────────────────────────────────────────────────────┐
│ P1: Embed                                                              │
│   BAAI/bge-small-en-v1.5 (384d)                                        │
│   subject + sender + snippet → dense vector                            │
│   Output: embeddings.npy (N×384), faiss.index, manifest.jsonl          │
├────────────────────────────────────────────────────────────────────────┤
│ P2: Topic Model                                                        │
│   BERTopic (HDBSCAN + c-TF-IDF)                                        │
│   Unsupervised topic discovery, split personal/professional            │
│   Output: topic-model/{personal,professional}/, topic-map.json         │
├────────────────────────────────────────────────────────────────────────┤
│ P3: Classifier                                                         │
│   CalibratedClassifierCV(LogisticRegression) on BGE embeddings         │
│   12-class supervised categorisation                                   │
│   Output: classifier.pkl, label-encoder.json                           │
└────────────────────────────────────────────────────────────────────────┘

Corpus

| Source | Messages | Date |
|---|---|---|
| Old archive (pre–Graph API) | 47,807 | Embedded 2026-06-04 |
| New archive (Graph API + Gmail backfill) | 35,077 | Backfilled 2026-06-11, not yet embedded |
| Combined | ~82,884 | |

Artefact Inventory

All artefacts at /opt/hinata-sandpit/resources/email-intelligence/ on Z2.

| File | Size | Description |
|---|---|---|
| embeddings.npy | 70 MB | (47807, 384) float32 dense vectors |
| faiss.index | 70 MB | FAISS flat L2 index over embeddings |
| manifest.jsonl | 21 MB | Per-email metadata: id, subject, sender, date, account, commander, category, embed_idx |
| classifier.pkl | 42 KB | Calibrated LogReg trained on keyword-derived labels |
| label-encoder.json | 763 B | 12-class label mapping |
| topic-map.json | 12 KB | 39 topics (4 personal, 35 professional) with top words and commander mapping |
| topics.jsonl | 25 MB | Per-email topic assignment |
| topic-model/personal/ | 1.1 MB | BERTopic model: ctfidf + topic embeddings + topics.json |
| topic-model/professional/ | 340 KB | BERTopic model |
| sender-domain-map.json | 526 KB | Domain → frequency mapping |
| sender-fingerprints.json | 151 KB | Sender behavioural fingerprints |
| sender-temporal.json | 29 KB | Sender time-of-day patterns |
| thread-graph.json | 7 MB | Reply-chain graph |
| sender-centroids.json | 2 B | Empty (not yet computed) |
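
The sizes of embeddings.npy and faiss.index follow from the array shape: a flat index stores the raw float32 vectors, so both come to roughly N × 384 × 4 bytes. The sketch below checks that figure and projects the size once the new archive is embedded. It assumes the same 384-dimensional model and no de-duplication between the two archives; the function name is illustrative and not part of the pipeline.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024


def embedding_bytes(n_messages: int, dim: int = 384) -> int:
    """Raw payload of an (n_messages, dim) float32 array, ignoring the small .npy header."""
    return n_messages * dim * BYTES_PER_FLOAT32


# Currently embedded: the old archive only.
current = embedding_bytes(47_807)

# Projection (assumption): old + new archive, same model, no de-duplication.
combined = embedding_bytes(47_807 + 35_077)

print(f"current:  {current / MIB:.1f} MiB")   # current:  70.0 MiB
print(f"combined: {combined / MIB:.1f} MiB")  # combined: 121.4 MiB
```

The 70 MB listed for both files matches the first figure; the second is only an estimate of what re-running P1 over the combined corpus would produce.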

Scripts

| Script | Location | Function |
|---|---|---|
| email-p1-embed.py | /opt/hinata-sandpit/scripts/ | Embeds corpus with BGE-small, writes embeddings.npy + faiss.index + manifest.jsonl |
| email-p2-topics.py | /opt/hinata-sandpit/scripts/ | Runs BERTopic on embeddings, writes topic-model/ + topic-map.json |
| email-p3-classify.py | /opt/hinata-sandpit/scripts/ | Trains LogReg on embeddings + labels, writes classifier.pkl |
| email-batch-classify.py | CT101 /root/ | Batch inference: reads JSONL, loads BGE + classifier, outputs label + confidence per email |

The P1 script still reads from /Users/nnamdi/Sandpit/hinata/resources/email-poller, an old Mac path. The new archive lives at /mnt/data/hinata/mail-archive/, so the script's input path must be updated before the new archive can be embedded.

Classification Labels

| ID | Label | Commander | Description |
|---|---|---|---|
| 0 | marketing | | Promotions, offers, newsletters with commercial intent |
| 1 | passive-newsletter | | Non-commercial subscriptions, digests |
| 2 | ecommerce | | Order confirmations, shipping, returns |
| 3 | career | zuko | Job opportunities, recruiter outreach, applications |
| 4 | finance | bulma | Payments, statements, transactions, tax |
| 5 | housing | | Property, rentals, utilities |
| 6 | health | allmight | Fitness, nutrition, recovery |
| 7 | events | luffy | Invites, RSVPs, tickets, gigs |
| 8 | music_arts | squidward | Concerts, albums, galleries, creative |
| 9 | learning | shikamaru | Courses, tutorials, research, papers |
| 10 | security | itachi | Sign-in alerts, password resets, 2FA |
| 11 | general | | Uncategorised |

Topic Model Summary

| Segment | Topics | Emails | Largest topic |
|---|---|---|---|
| Personal | 4 | 33,086 | Topic 0 (general, n=32,440) |
| Professional | 35 | 4,277 | Topic 4 (zuko/career, n=766) |
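
The summary is easier to read as shares. The short calculation below uses only the counts from the table (variable names are illustrative) and shows how much of each segment falls into its largest topic.

```python
# Counts copied from the Topic Model Summary table.
segments = {
    "personal": {"emails": 33_086, "largest_topic": 32_440},
    "professional": {"emails": 4_277, "largest_topic": 766},
}


def largest_topic_share(emails: int, largest_topic: int) -> float:
    """Fraction of a segment's emails assigned to its largest topic."""
    return largest_topic / emails


for name, counts in segments.items():
    share = largest_topic_share(counts["emails"], counts["largest_topic"])
    print(f"{name}: {share:.1%}")
# personal: 98.0%
# professional: 17.9%
```

On these numbers the personal segment is almost entirely one general topic, while the professional segment is spread across its 35 topics.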

Live vs ML Classification

The mail pollers (outlook-graph-poller.py, gmail-api-poller.py) use hybrid routing: keyword + sender-domain matching first, ML fallback for unrouted emails via email-batch-classify.py on CT101.

| Method | Precision | Coverage | Latency |
|---|---|---|---|
| Keyword/domain rules (primary) | High (hand-tuned) | ~25% of emails matched | <1ms |
| P3 LogReg on BGE (fallback) | Medium (trained on keyword labels) | Unrouted emails with confidence ≥ 0.7 | ~2s cold start + ~5ms/email |

The ML fallback calls CT101 via pct exec with batch input. Only predictions with confidence ≥ 0.7 and a mapped commander are applied. Predictions below threshold are logged as uncertain for active learning review.

Routed messages carry a source field: "keyword" or "ml". ML-routed messages also carry ml_confidence.
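
The routing order described above can be sketched as follows. This is a minimal illustration, not the pollers' actual code: `route`, `keyword_router`, `ml_predict`, and the `status` field are hypothetical names, and the handling of a confident prediction with no mapped commander (returned here as `"unmapped"`) is an assumption. The 0.7 threshold, the `source` values, and `ml_confidence` come from this page.

```python
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.7

# Label -> commander, copied from the Classification Labels table.
# Labels with no commander there (marketing, passive-newsletter, ecommerce,
# housing, general) are left out, so they can never be routed by ML.
LABEL_TO_COMMANDER = {
    "career": "zuko",
    "finance": "bulma",
    "health": "allmight",
    "events": "luffy",
    "music_arts": "squidward",
    "learning": "shikamaru",
    "security": "itachi",
}


def route(
    email: dict,
    keyword_router: Callable[[dict], Optional[str]],
    ml_predict: Callable[[dict], tuple[str, float]],
) -> dict:
    """Keyword/domain rules first; ML fallback only for unrouted emails."""
    commander = keyword_router(email)
    if commander is not None:
        return {"commander": commander, "source": "keyword"}

    label, confidence = ml_predict(email)
    commander = LABEL_TO_COMMANDER.get(label)
    if confidence >= CONFIDENCE_THRESHOLD and commander is not None:
        return {"commander": commander, "source": "ml", "ml_confidence": confidence}

    # Not applied. Below-threshold predictions are kept as "uncertain" for
    # review; the "unmapped" status is an assumption of this sketch.
    status = "uncertain" if confidence < CONFIDENCE_THRESHOLD else "unmapped"
    return {"commander": None, "status": status, "label": label, "ml_confidence": confidence}


# Usage with stand-in callables (the real ones live in the pollers and on CT101).
no_keyword_match = lambda email: None
print(route({"subject": "Interview invite"}, no_keyword_match, lambda e: ("career", 0.91)))
print(route({"subject": "Hello"}, no_keyword_match, lambda e: ("career", 0.55)))
```

Passing the two stages in as callables keeps the decision logic testable without loading BGE or the classifier.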

Embedding Model Specifications

| Model | Dim | Speed (CPU) | RAM |
|---|---|---|---|
| bge-small-en-v1.5 (current) | 384 | 3ms/email | 130 MB |
| bge-base-en-v1.5 | 768 | 8ms/email | 430 MB |
| bge-large-en-v1.5 | 1024 | 20ms/email | 1.3 GB |
| all-MiniLM-L6-v2 | 384 | 2ms/email | 90 MB |

BERTopic Configuration

json
{
  "min_topic_size": 10,
  "nr_topics": "auto",
  "n_gram_range": [1, 1],
  "top_n_words": 10,
  "language": "english",
  "zeroshot_min_similarity": 0.7
}
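
If this configuration is kept as JSON and passed to the BERTopic constructor, one detail needs care: JSON has no tuple type, while BERTopic documents `n_gram_range` as a tuple. The sketch below shows one way to load it. It assumes the config is available as a JSON string; `bertopic_kwargs` is a hypothetical helper, and the BERTopic call is left as a comment so the snippet runs without the library installed.

```python
import json

CONFIG_JSON = """
{
  "min_topic_size": 10,
  "nr_topics": "auto",
  "n_gram_range": [1, 1],
  "top_n_words": 10,
  "language": "english",
  "zeroshot_min_similarity": 0.7
}
"""


def bertopic_kwargs(raw: str) -> dict:
    """Parse the JSON config into keyword arguments for the BERTopic constructor."""
    cfg = json.loads(raw)
    # JSON arrays load as lists; convert to the tuple form BERTopic documents.
    cfg["n_gram_range"] = tuple(cfg["n_gram_range"])
    return cfg


kwargs = bertopic_kwargs(CONFIG_JSON)
print(kwargs["n_gram_range"])  # (1, 1)

# With bertopic installed, the model could then be built as:
#   from bertopic import BERTopic
#   topic_model = BERTopic(**kwargs)
```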

Integration Points

| Consumer | Endpoint / File | Reads |
|---|---|---|
| Mail pollers (Outlook/Gmail) | email-batch-classify.py on CT101 | ML fallback for unrouted emails (live, confidence ≥ 0.7) |
| mail-30d-digest.py | classifier.pkl | Category scoring for digest (not yet wired) |
| Studio MailFlags | /api/mail-30d-digest | Alerts, calendar signals, insights (currently keyword-only) |
| FAISS index | faiss.index | Semantic search ("find emails similar to X") |
| Topic model | topic-model/ | Topic exploration, drift detection |

Gold Labels

File: /mnt/data/hinata/resources/mail/gold-labels.jsonl (created 2026-06-12, empty — awaiting first labels)

Schema:

json
{"message_id": "abc123", "label": "career", "source": "michael", "date": "2026-06-12"}

Gold labels are human-verified category assignments. They override keyword-derived labels during P3 retraining. The file is append-only.
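
A minimal sketch of how the append-only file and the override rule could work together, assuming one JSON object per line in the schema above. The function names are hypothetical, and letting a later line for the same message_id win is an assumption about how corrections would be recorded in an append-only file.

```python
import json


def append_gold_label(path: str, message_id: str, label: str, source: str, date: str) -> None:
    """Append one human-verified label; existing lines are never rewritten."""
    record = {"message_id": message_id, "label": label, "source": source, "date": date}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_gold_labels(path: str) -> dict[str, str]:
    """Return message_id -> label. Assumption: the latest line per id wins."""
    gold: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                record = json.loads(line)
                gold[record["message_id"]] = record["label"]
    return gold


def apply_gold_labels(keyword_labels: dict[str, str], gold: dict[str, str]) -> dict[str, str]:
    """Training labels for P3: gold labels override keyword-derived ones."""
    return {**keyword_labels, **gold}
```

For example, with keyword labels `{"abc123": "marketing"}` and the gold record shown in the schema, `apply_gold_labels` returns `{"abc123": "career"}`.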

Cross-links: reference_mail-poller-architecture · reference_z2-service-catalog · reference_z2-container-architecture