
Mail Poller Architecture

2026-06-14 — DOC STATUS: Z2-host poller blocks below are retired per how-to_mail-poller-consolidation-and-backpop. CT102 is the sole canonical poller (Michael ruling 2026-06-14). The /opt/jimmy-brain-ops/scripts/{gmail-api,outlook-graph}-poller.py boxes are kept here only as a record of what was deleted; they are not running infrastructure. Update of this diagram is tracked in the consolidation playbook step 15.

2026-06-14 — retention scope: the 1y window (since 2025-06-14) is an OAuth stability test (--oauth-test), not a retention boundary. Retention is full history, append-only. CT102 holds one continuous corpus of Mac archive (Oct 2016 → Jun 2026) + live polls under canonical {account}/{YYYY}/{MM}/{sha256[:16]}.json paths. Container + mount discipline: nothing on filesystem root.

System Overview

┌─────────────────────────────────────────────────────────────────┐
│ Z2 Proxmox VE                                                   │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────────────┐   │
│ │ ct103: itachi (Debian 12) — Vaultwarden                  │   │
│ │  └─ /bin/bw CLI (NODE_EXTRA_CA_CERTS=/ssl/certs.pem)   │   │
│ │     ├─ outlook-graph-credentials                        │   │
│ │     ├─ outlook-tokens-{account} (3x)                   │   │
│ │     ├─ gmail_oauth_client                               │   │
│ │     └─ gmail_oauth_token                                │   │
│ └──────────────────────────────────────────────────────────┘   │
│                     ↓ (sync-mail-creds.sh)                     │
│ ┌──────────────────────────────────────────────────────────┐   │
│ │ Z2 host — /opt/jimmy-brain-ops/scripts/                  │   │
│ │  ├─ outlook-graph-poller.py                             │   │
│ │  ├─ gmail-api-poller.py                                 │   │
│ │  ├─ sync-mail-creds.sh (ExecStartPre)                  │   │
│ │  └─ /mnt/data/hinata/                                   │   │
│ │     ├─ secrets/mail/*.json (cached credentials)        │   │
│ │     ├─ mail-archive/                                    │   │
│ │     │  ├─ {date}.jsonl (daily aggregate)               │   │
│ │     │  ├─ gmail-michael-asolo1/{year}/{month}/*.json   │   │
│ │     │  ├─ hotmail-michael-asolo/{year}/{month}/*.json  │   │
│ │     │  ├─ outlook-michael-nnamah/{year}/{month}/*.json │   │
│ │     │  └─ outlook-n-nnamah/{year}/{month}/*.json       │   │
│ │     ├─ data/mail/gmail-cursor-*.json                   │   │
│ │     └─ resources/mail/mail-digest-{source}.json        │   │
│ └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│ External APIs (via network):                                   │
│ ├─ Gmail API: gmail.googleapis.com/gmail/v1/users/me        │
│ ├─ Google Token: oauth2.googleapis.com/token                 │
│ ├─ Microsoft Graph: graph.microsoft.com/v1.0/me/messages    │
│ └─ Microsoft Token: login.microsoftonline.com/...            │
└─────────────────────────────────────────────────────────────────┘

Process Flow

Normal Execution (every 15 minutes)

systemd timer triggers

/opt/hinata/mail-poller/mail-poller.py executes

Load state.json (cursor positions)
  ├─ Gmail: last_uid per folder
  └─ Outlook: last_received datetime

Poll each account
  ├─ Gmail (IMAP)
  │  ├─ Connect to imap.gmail.com:993
  │  ├─ Authenticate with app password
  │  ├─ For each folder:
  │  │  ├─ Fetch UIDs > last_uid
  │  │  ├─ Parse RFC2822 message
  │  │  └─ Extract: subject, from, to, date, body_text, body_html
  │  └─ Update folder state (last_uid)

  └─ Outlook (Graph API) × 3
     ├─ Refresh OAuth2 access_token
     ├─ GET /me/messages?$filter=receivedDateTime gt {last_received}
     ├─ For each message:
     │  └─ Extract: subject, from, to, date, body_text, body_html
     └─ Update account state (last_received)

Archive messages
  ├─ For each message:
  │  ├─ Calculate: message_hash = SHA256({message_id}) → {hash[:16]}
  │  ├─ Determine: year_month = message.date.strftime("%Y/%m")
  │  ├─ Create path: archive/{account}/{year_month}/{hash}.json
  │  └─ Write JSON to disk

Save state.json (updated cursors)

Exit (0 = success)

systemd logs to journalctl

State Management

state.json Structure

json
{
  "gmail-michael-asolo1": {
    "last_received": "2026-06-11T19:00:00Z",
    "updated_at": "2026-06-11T20:57:47"
  },
  "hotmail-michael-asolo": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  },
  "outlook-michael-nnamah": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  },
  "outlook-n-nnamah": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  }
}

Key points:

  • last_received (Gmail API / Graph API): ISO8601 datetime of most recent message. Next poll fetches messages received after this timestamp.
  • last_poll: Timestamp of the last successful poll run (the Gmail entry in the example above records it as updated_at).

Invariants:

  • State is only updated after successful poll
  • If poll fails, state is not modified (next run will retry same messages)
  • Cursors monotonically increase (no going backwards)
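
The invariants above can be sketched as follows. This is a minimal illustration, not the poller's actual code: the helper names parse_ts, advance_cursor and save_state are hypothetical, and the atomic write via a temporary file is an assumption about how a partial write could be avoided.

```python
import json
import os
import tempfile
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def advance_cursor(state: dict, account: str, candidate: str) -> bool:
    """Move last_received forward only; return True if the cursor changed."""
    current = state.get(account, {}).get("last_received")
    if current is not None and parse_ts(candidate) <= parse_ts(current):
        return False  # never move backwards
    state.setdefault(account, {})["last_received"] = candidate
    return True


def save_state(state: dict, path: str) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

Calling save_state only after a poll has completed keeps the previous state.json intact when a run fails part-way.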

Archive Format

Directory Structure

/opt/hinata/mail-poller/archive/
├── gmail/
│   ├── 2026/
│   │   ├── 05/
│   │   │   ├── a1b2c3d4e5f60718.json
│   │   │   ├── b2c3d4e5f6071829.json
│   │   │   └── ...
│   │   └── 06/
│   │       └── ...
│   └── 2027/
│       └── ...
├── hotmail-michael-asolo/
│   ├── 2026/
│   │   ├── 05/
│   │   └── 06/
│   └── 2027/
├── outlook-michael-nnamah/
│   └── ...
└── outlook-n-nnamah/
    └── ...

Naming scheme: {account}/{YYYY}/{MM}/{message_hash}.json

where:

  • account = "gmail" | "hotmail-michael-asolo" | "outlook-michael-nnamah" | "outlook-n-nnamah"
  • YYYY/MM = year/month extracted from message date
  • message_hash = SHA256(message_id)[:16] — first 16 hex chars of SHA256 hash
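
As a rough sketch, the naming scheme can be computed like this. The function names are illustrative, and encoding the Message-ID as UTF-8 before hashing is an assumption; the source only states SHA256(message_id)[:16].

```python
import hashlib
from datetime import datetime, timezone
from pathlib import PurePosixPath


def message_hash(message_id: str) -> str:
    """Return the first 16 hex characters of SHA-256(message_id)."""
    # Assumption: the Message-ID is hashed as UTF-8 bytes.
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]


def archive_path(account: str, message_id: str, date: datetime) -> PurePosixPath:
    """Build {account}/{YYYY}/{MM}/{message_hash}.json for one message."""
    return PurePosixPath(
        account,
        date.strftime("%Y"),
        date.strftime("%m"),
        f"{message_hash(message_id)}.json",
    )


path = archive_path(
    "outlook-n-nnamah",
    "<abc123.mail@gmail.com>",
    datetime(2026, 6, 5, 10, 30, 45, tzinfo=timezone.utc),
)
print(path)  # outlook-n-nnamah/2026/06/<16 hex chars>.json
```

Because the path depends only on the account, the message date and the Message-ID, archiving the same message twice resolves to the same file.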

Message JSON Format

json
{
  "account": "gmail",
  "email": "michael.asolo1@gmail.com",
  "message_id": "<abc123.mail@gmail.com>",
  "message_hash": "a1b2c3d4e5f60718",
  "date": "2026-06-05T10:30:45+00:00",
  "subject": "Test Email Subject",
  "from": "sender@example.com",
  "to": "michael.asolo1@gmail.com",
  "body_text": "Plain text content...",
  "body_html": "<html>...</html>",
  "year_month": "2026/06"
}

Field meanings:

  • account: Which account this email came from
  • email: Email address of the account
  • message_id: RFC2822 Message-ID header (unique identifier from server)
  • message_hash: Short hash of message_id (for archive path)
  • date: ISO8601 timestamp when message was received
  • subject: Email subject line
  • from: Sender address
  • to: Recipient addresses (semicolon-separated for multiple)
  • body_text: Plaintext body
  • body_html: HTML body (if available)
  • year_month: Denormalized copy of date in YYYY/MM format (for archive path)

Polling Protocols

Gmail (REST API — OAuth2)

Script: gmail-api-poller.py at /opt/jimmy-brain-ops/scripts/

Credentials:

  • Client registration: gmail_oauth_client.json (Vaultwarden item gmail_oauth_client)
    • Fields: client_id, client_secret, token_uri, auth_uri, project_id
  • Account token: gmail_oauth_token.json (Vaultwarden item gmail_oauth_token)
    • Fields: access_token, refresh_token, token_type, refreshed

Token refresh flow:

  1. POST to token_uri (default https://oauth2.googleapis.com/token)
  2. Payload: client_id + client_secret + refresh_token + grant_type=refresh_token
  3. Response: new access_token (+ optionally refreshed refresh_token)
  4. Write updated tokens to disk and back to Vaultwarden
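
Steps 1 to 3 could look roughly like the sketch below, using only the standard library. The function names are hypothetical, the dict shapes follow the credential fields listed above, and step 4 (writing to disk and Vaultwarden) is left out. No request is sent until the caller passes the result to urllib.request.urlopen.

```python
import urllib.parse
import urllib.request

DEFAULT_TOKEN_URI = "https://oauth2.googleapis.com/token"


def build_refresh_request(client: dict, token: dict) -> urllib.request.Request:
    """Build the form-encoded POST described in steps 1 and 2."""
    payload = urllib.parse.urlencode({
        "client_id": client["client_id"],
        "client_secret": client["client_secret"],
        "refresh_token": token["refresh_token"],
        "grant_type": "refresh_token",
    }).encode("ascii")
    return urllib.request.Request(
        client.get("token_uri", DEFAULT_TOKEN_URI),
        data=payload,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def merge_refresh_response(token: dict, response: dict) -> dict:
    """Step 3: take the new access_token; keep the old refresh_token unless rotated."""
    updated = dict(token)
    updated["access_token"] = response["access_token"]
    updated["refresh_token"] = response.get("refresh_token", token["refresh_token"])
    return updated
```

Keeping the old refresh_token when the response omits one matters because the response only optionally carries a new one.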

Re-auth flow: gmail-oauth-reauth.py at ~/Sandpit/hinata-sandpit/scripts/ (Mac-only, requires browser). Starts localhost:8090 HTTP server, opens Google consent URL, captures code, exchanges for tokens, pushes to Z2 via SCP. Required when refresh_token expires or is revoked.

Incremental polling:

  1. Refresh access_token (if older than 55 minutes)
  2. GET https://gmail.googleapis.com/gmail/v1/users/me/messages?q=in:inbox after:{epoch}
  3. For each message ref: GET .../messages/{id}?format=full
  4. Parse headers + decode MIME body (plain text preferred, HTML fallback with tag stripping)
  5. Update cursor file at /mnt/data/hinata/data/mail/gmail-cursor-gmail-michael-asolo1.json
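
A minimal sketch of steps 2 and 4 follows, assuming the format=full response shape in which each MIME part carries mimeType, an optional parts list, and base64url-encoded body.data. The helper names are illustrative, and the HTML tag stripping used for the fallback is not shown.

```python
import base64
import urllib.parse
from typing import Optional

GMAIL_BASE = "https://gmail.googleapis.com/gmail/v1/users/me"


def list_url(after_epoch: int) -> str:
    """URL for step 2; the q parameter has to be URL-encoded."""
    query = urllib.parse.urlencode({"q": f"in:inbox after:{after_epoch}"})
    return f"{GMAIL_BASE}/messages?{query}"


def decode_part(data: str) -> str:
    """Decode base64url body data, restoring any stripped padding."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


def find_body(payload: dict, mime_type: str) -> Optional[str]:
    """Depth-first search of the MIME tree for the first part of mime_type."""
    if payload.get("mimeType") == mime_type and payload.get("body", {}).get("data"):
        return decode_part(payload["body"]["data"])
    for part in payload.get("parts", []):
        found = find_body(part, mime_type)
        if found is not None:
            return found
    return None
```

A caller would try find_body(payload, "text/plain") first and fall back to find_body(payload, "text/html") only when it returns None, which matches the plain-text-preferred rule in step 4.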

Backfill mode: --backfill --since YYYY-MM iterates month by month using after:{epoch} before:{epoch} queries. Deduplicates via per-message JSON file existence check.
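
The month-by-month iteration might be sketched as below. The generator name is hypothetical, and treating month boundaries as UTC midnight is an assumption; the source does not say which timezone the backfill windows use.

```python
from datetime import datetime, timezone


def month_windows(since: str, until: datetime):
    """Yield (after_epoch, before_epoch) per month, from since (YYYY-MM) up to until.

    until must be timezone-aware; boundaries are UTC midnight (an assumption).
    """
    year, month = (int(part) for part in since.split("-"))
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    while start < until:
        if start.month == 12:
            end = start.replace(year=start.year + 1, month=1)
        else:
            end = start.replace(month=start.month + 1)
        yield int(start.timestamp()), int(end.timestamp())
        start = end


windows = list(month_windows("2025-11", datetime(2026, 2, 1, tzinfo=timezone.utc)))
print(len(windows))  # 3 (November, December, January)
```

Each pair would fill the after:{epoch} before:{epoch} query for one month, and the per-message file existence check then skips anything already archived.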

Archive paths:

  • JSONL: /mnt/data/hinata/mail-archive/{date}.jsonl
  • Per-message JSON: /mnt/data/hinata/mail-archive/gmail-michael-asolo1/{year}/{month}/{hash}.json

Account: michael.asolo1@gmail.com (single account)

Microsoft Graph API (OAuth2)

Credentials:

  • App registration: outlook-graph-credentials.json
    json
    {
      "client_id": "uuid",
      "client_secret": "secret",
      "tenant_id": "consumers"
    }
  • Per-account tokens: outlook-tokens-{account}.json
    json
    {
      "access_token": "token",
      "refresh_token": "refresh",
      "updated": "2026-06-05T10:30:45.123456"
    }

Token refresh flow:

  1. POST to https://login.microsoftonline.com/consumers/oauth2/v2.0/token
  2. Payload: client_id + client_secret + refresh_token + grant_type=refresh_token
  3. Response: new access_token (+ refreshed refresh_token)
  4. Write updated tokens back to disk

Incremental polling:

  1. Refresh access_token (if expired)
  2. GET /me/messages?$filter=receivedDateTime gt {last_received}
  3. Response: array of message objects (JSON)
  4. Extract fields from item dict
  5. Update state.json[account]["last_received"]
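
Steps 2, 4 and 5 might be sketched as follows. The function names are hypothetical; the field names (internetMessageId, receivedDateTime, from.emailAddress.address, toRecipients, body.contentType, body.content) are those of the Graph message resource. Adding $orderby and joining recipients with "; " are assumptions, as is comparing timestamps as strings, which only works when all values share the same UTC format.

```python
import urllib.parse

GRAPH_MESSAGES = "https://graph.microsoft.com/v1.0/me/messages"


def messages_url(last_received: str) -> str:
    """Step 2: filter on receivedDateTime, oldest first (ordering is an assumption)."""
    params = {
        "$filter": f"receivedDateTime gt {last_received}",
        "$orderby": "receivedDateTime asc",
    }
    return f"{GRAPH_MESSAGES}?{urllib.parse.urlencode(params)}"


def item_to_record(account: str, item: dict) -> dict:
    """Step 4: map one Graph message object onto the archive fields."""
    body = item.get("body", {})
    is_html = body.get("contentType", "").lower() == "html"
    recipients = [
        r.get("emailAddress", {}).get("address", "")
        for r in item.get("toRecipients", [])
    ]
    return {
        "account": account,
        "message_id": item.get("internetMessageId", ""),
        "date": item.get("receivedDateTime", ""),
        "subject": item.get("subject", ""),
        "from": item.get("from", {}).get("emailAddress", {}).get("address", ""),
        "to": "; ".join(recipients),
        "body_text": "" if is_html else body.get("content", ""),
        "body_html": body.get("content", "") if is_html else "",
    }


def newest_received(items: list, fallback: str) -> str:
    """Step 5: the next cursor is the latest receivedDateTime seen, else unchanged."""
    return max([i["receivedDateTime"] for i in items] + [fallback])
```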

Advantages:

  • Single API endpoint (no per-folder logic)
  • Built-in filtering (by date, subject, etc.)
  • OAuth2 tokens (no password stored)

Disadvantages:

  • More complex token management
  • JSON response (need to parse body_text from mixed content)
  • Requires Azure app registration

Error Handling

Per-Account Isolation

If one account fails, others continue:

python
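for account_key, config in accounts_to_poll.items():
    try:
        # poll_account stands in for the provider-specific poll function
        # (Gmail or Graph); the name is illustrative.
        count, messages = poll_account(account_key, config)
    except Exception as e:
        logger.error(f"[{account_key}] Poll failed: {e}")
        continue  # Continue to next account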

Partial Failure Recovery

If a message fails to parse/archive, it's logged but doesn't stop the run:

python
for message in messages:
    try:
        archive_message(message)
    except Exception as e:
        logger.error(f"Failed to archive: {e}")
        continue  # Continue to next message

State Not Saved on Failure

State is written only if the entire account poll succeeds:

python
if not dry_run:
    save_state(state)  # Only if no exceptions above

This ensures cursors don't advance for failed runs, allowing retry without skipping messages.
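
Putting per-account isolation and the save-on-success rule together, a run loop could look like the sketch below. All names here (run_once, poll_fn, archive_fn, fake_poll) are hypothetical. In this version an archive failure also leaves that account's cursor unchanged, which is stricter than the per-message skip described under Partial Failure Recovery.

```python
import logging

logger = logging.getLogger("mail-poller")


def run_once(accounts: dict, state: dict, poll_fn, archive_fn) -> dict:
    """Poll each account; return a new state with cursors advanced only on success."""
    new_state = {key: dict(value) for key, value in state.items()}
    for account_key, config in accounts.items():
        try:
            messages, cursor = poll_fn(account_key, config, state.get(account_key, {}))
            for message in messages:
                archive_fn(account_key, message)
        except Exception:
            logger.exception("[%s] poll failed; cursor left unchanged", account_key)
            continue  # other accounts still run
        new_state.setdefault(account_key, {})["last_received"] = cursor
    return new_state


# Demo with stub functions: one account fails, one succeeds.
def fake_poll(key, config, cursor):
    if key == "bad":
        raise RuntimeError("simulated API error")
    return ["m1"], "2026-06-06T00:00:00Z"


archived = []
before = {
    "bad": {"last_received": "2026-06-05T00:00:00Z"},
    "good": {"last_received": "2026-06-05T00:00:00Z"},
}
after = run_once({"bad": {}, "good": {}}, before, fake_poll,
                 lambda key, msg: archived.append((key, msg)))
```

After the demo, the failed account keeps its old cursor and will retry the same window on the next run, while the successful account moves forward.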

Performance Characteristics

Time Complexity

Operation               Time          Notes
Connect to Gmail IMAP   50–100 ms     SSL handshake
IMAP UID search         10–50 ms      Server-side filtered
Fetch N messages        100–500 ms    Depends on message size
Parse message           1–5 ms        RFC2822 parsing
Refresh token           100–300 ms    HTTP POST to Azure
Graph API fetch         200–800 ms    HTTP GET with filter
Archive N messages      50–200 ms     Disk I/O
Save state.json         1–5 ms        JSON serialization

Space Complexity

Item                   Size         Notes
state.json             <50 KB       Independent of archive size
Message JSON           10–100 KB    Depends on body size
Archive (47K emails)   4.2 GB       Average ~90 KB per message

Network Overhead

  • Gmail IMAP: 4 bytes per UID (~12 KB for a search over 3000 UIDs)
  • Graph API: ~50 bytes per field × N messages
  • Token refresh: ~500 bytes request, ~1 KB response

Scaling Considerations

Single-threaded Sequential Polling

Current implementation polls accounts sequentially:

  1. Gmail (all folders)
  2. Outlook account 1
  3. Outlook account 2
  4. Outlook account 3

Total runtime: 2–5 seconds (typical)

If scaling to >10 accounts, consider:

  • Parallel polling (async/threading) — would require stdlib changes
  • Batching (poll and archive a few times per day instead of every 15 min)
  • Filter rules (only fetch certain folders or date ranges)
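
One possible shape for the parallel-polling option is a thread pool from the standard library, sketched here. The function poll_all and the poll_fn(account_key, config) signature are hypothetical, and the worker count is arbitrary. Threads fit because polling is dominated by network waits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def poll_all(accounts: dict, poll_fn, max_workers: int = 4) -> dict:
    """Run poll_fn(account_key, config) concurrently; map account to result or exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(poll_fn, key, cfg): key for key, cfg in accounts.items()}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep per-account isolation
                results[key] = exc
    return results
```

Results are collected in the calling thread, so state updates and archive writes can stay sequential even when the network calls overlap.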

Archive I/O Bottleneck

Writing files to disk is single-threaded. For >1000 emails per run:

  • Consider batch archiving (write to single file, split on read)
  • Or use database backend (SQLite, PostgreSQL)

Current design (individual files) is good for:

  • Simple access patterns (git clone for backups)
  • Transparent format (human-readable JSON)
  • Per-message isolation (no locking issues)

Memory Usage

Script uses <50 MB RSS (resident set size):

  • Email bodies held in memory during run
  • State dict is small
  • No caching of archive index

For >100K emails per run, consider:

  • Streaming processing (don't hold all messages in memory)
  • Or increase ct102 available RAM

Operational Notes

Idempotency

The script is idempotent:

  • Running twice in a row = same result
  • Archive deduplication: message_hash prevents duplicates
  • State advancement: cursor only moves forward

Safe to run manually or via cron without conflicts.
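
Archive deduplication by path can be illustrated with an exclusive-create write. This is a sketch under the assumption that an existing file means the message is already archived; write_once is a hypothetical name, and the hash in the demo path is a placeholder.

```python
import json
import os
import tempfile


def write_once(path: str, record: dict) -> bool:
    """Create path with record as JSON; return False if it already exists."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    try:
        # "x" mode raises FileExistsError instead of overwriting.
        with open(path, "x", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False)
    except FileExistsError:
        return False
    return True


root = tempfile.mkdtemp()
target = os.path.join(root, "gmail", "2026", "06", "0123456789abcdef.json")
first = write_once(target, {"subject": "Test Email Subject"})
second = write_once(target, {"subject": "Test Email Subject"})
print(first, second)  # True False
```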

Observability

Logging:

  • All events logged to journalctl (systemd)
  • Structured format: timestamp, level, component, message
  • Errors logged with full exception info

State inspection:

bash
# Last poll times
jq 'to_entries[] | {account: .key, last_poll: (.value.last_poll // .value.updated_at)}' state.json

# Message counts
find archive -name "*.json" | wc -l

# Archive growth
du -sh archive/*

Disaster Recovery

If state.json is lost or corrupted:

  • Delete the damaged state.json, if one remains
  • The next run treats this as a first run and fetches all messages
  • May take 5–10 minutes for large accounts (all IMAP UIDs fetched)
  • No data loss (archive is separate)

If archive is lost:

  • Emails are not re-fetched automatically (the cursors in state.json already point past them)
  • To re-archive: delete state.json + run again

If credentials are leaked:

  • Immediately regenerate passwords/tokens in Azure + Gmail
  • Update credential files on ct103
  • Monitor account for unauthorized access

Future Extensions

Filters

Add --since, --until options for backfill:

bash
mail-poller.py --since 2024-01-01 --until 2024-06-01

Deduplication

Check if message_id already exists in archive before fetch.

Database Backend

Replace file-based archive with SQLite:

/opt/hinata/mail-poller/archive.db
  ├─ messages table (id, account, message_id, date, subject, from, to, body_text, body_html)
  ├─ state table (account, last_uid, last_received)
  └─ indices (account, date, from, subject)

Would enable:

  • Fast queries (SQL)
  • Transaction safety
  • Full-text search
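
A minimal sketch of the proposed schema with the standard sqlite3 module follows. The column types, the UNIQUE (account, message_id) constraint and the index name are assumptions added for illustration; from and to are quoted because they are SQL keywords. Full-text search would need an additional FTS table, which is not shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    account    TEXT NOT NULL,
    message_id TEXT NOT NULL,
    date       TEXT NOT NULL,
    subject    TEXT,
    "from"     TEXT,
    "to"       TEXT,
    body_text  TEXT,
    body_html  TEXT,
    UNIQUE (account, message_id)
);
CREATE TABLE IF NOT EXISTS state (
    account       TEXT PRIMARY KEY,
    last_uid      INTEGER,
    last_received TEXT
);
CREATE INDEX IF NOT EXISTS idx_messages_account_date ON messages (account, date);
"""

COLUMNS = ("account", "message_id", "date", "subject", "from", "to", "body_text", "body_html")


def open_archive(path: str) -> sqlite3.Connection:
    """Open (or create) the archive database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_message(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert one record; return False when (account, message_id) already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO messages "
        '(account, message_id, date, subject, "from", "to", body_text, body_html) '
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        tuple(record.get(col) for col in COLUMNS),
    )
    conn.commit()
    return cur.rowcount == 1
```

The uniqueness constraint gives the database the same deduplication property that the hash-based file path provides today.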

Classification Integration

Hook into Heimerdinger NLP classifier:

archive message → send to classifier API → store result in archive.db

API Endpoint

Expose FastAPI endpoint:

GET /api/emails?account=gmail&from=2026-06-01&to=2026-06-05&limit=50

Would enable Studio access to archive without direct filesystem access.