Mail Poller Architecture
2026-06-14 - DOC STATUS: Z2-host poller blocks below are retired per how-to_mail-poller-consolidation-and-backpop. CT102 is the sole canonical poller (Michael ruling 2026-06-14). The /opt/jimmy-brain-ops/scripts/{gmail-api,outlook-graph}-poller.py boxes are kept here only as a record of what was deleted; they are not running infrastructure. Update of this diagram is tracked in the consolidation playbook step 15.
2026-06-14 - retention scope: the 1-year window (since 2025-06-14) is an OAuth stability test (--oauth-test), not a retention boundary. Retention is full history, append-only. CT102 holds one continuous corpus of Mac archive (Oct 2016 → Jun 2026) + live polls under canonical {account}/{YYYY}/{MM}/{sha256[:16]}.json paths. Container + mount discipline: nothing on filesystem root.
System Overview
┌─────────────────────────────────────────────────────────────────┐
│                          Z2 Proxmox VE                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ ct103: itachi (Debian 12) - Vaultwarden                   │  │
│  │   └─ /bin/bw CLI (NODE_EXTRA_CA_CERTS=/ssl/certs.pem)     │  │
│  │      ├─ outlook-graph-credentials                         │  │
│  │      ├─ outlook-tokens-{account} (3x)                     │  │
│  │      ├─ gmail_oauth_client                                │  │
│  │      └─ gmail_oauth_token                                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                            ↓ (sync-mail-creds.sh)               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Z2 host - /opt/jimmy-brain-ops/scripts/                   │  │
│  │   ├─ outlook-graph-poller.py                              │  │
│  │   ├─ gmail-api-poller.py                                  │  │
│  │   ├─ sync-mail-creds.sh (ExecStartPre)                    │  │
│  │   └─ /mnt/data/hinata/                                    │  │
│  │      ├─ secrets/mail/*.json (cached credentials)          │  │
│  │      ├─ mail-archive/                                     │  │
│  │      │  ├─ {date}.jsonl (daily aggregate)                 │  │
│  │      │  ├─ gmail-michael-asolo1/{year}/{month}/*.json     │  │
│  │      │  ├─ hotmail-michael-asolo/{year}/{month}/*.json    │  │
│  │      │  ├─ outlook-michael-nnamah/{year}/{month}/*.json   │  │
│  │      │  └─ outlook-n-nnamah/{year}/{month}/*.json         │  │
│  │      ├─ data/mail/gmail-cursor-*.json                     │  │
│  │      └─ resources/mail/mail-digest-{source}.json          │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  External APIs (via network):                                   │
│  ├─ Gmail API: gmail.googleapis.com/gmail/v1/users/me           │
│  ├─ Google Token: oauth2.googleapis.com/token                   │
│  ├─ Microsoft Graph: graph.microsoft.com/v1.0/me/messages       │
│  └─ Microsoft Token: login.microsoftonline.com/...              │
└─────────────────────────────────────────────────────────────────┘
Process Flow
Normal Execution (every 15 minutes)
systemd timer triggers
    ↓
/opt/hinata/mail-poller/mail-poller.py executes
    ↓
Load state.json (cursor positions)
  ├─ Gmail: last_uid per folder
  └─ Outlook: last_received datetime
    ↓
Poll each account
  ├─ Gmail (IMAP)
  │   ├─ Connect to imap.gmail.com:993
  │   ├─ Authenticate with app password
  │   ├─ For each folder:
  │   │   ├─ Fetch UIDs > last_uid
  │   │   ├─ Parse RFC2822 message
  │   │   └─ Extract: subject, from, to, date, body_text, body_html
  │   └─ Update folder state (last_uid)
  │
  └─ Outlook (Graph API) × 3
      ├─ Refresh OAuth2 access_token
      ├─ GET /me/messages?$filter=receivedDateTime gt {last_received}
      ├─ For each message:
      │   └─ Extract: subject, from, to, date, body_text, body_html
      └─ Update account state (last_received)
    ↓
Archive messages
  └─ For each message:
      ├─ Calculate: message_hash = SHA256({message_id}) → {hash[:16]}
      ├─ Determine: year_month = message.date.strftime("%Y/%m")
      ├─ Create path: archive/{account}/{year_month}/{hash}.json
      └─ Write JSON to disk
    ↓
Save state.json (updated cursors)
    ↓
Exit (0 = success)
systemd captures the run's output in the journal (read it with journalctl)
State Management
state.json Structure
json
{
  "gmail-michael-asolo1": {
    "last_received": "2026-06-11T19:00:00Z",
    "updated_at": "2026-06-11T20:57:47"
  },
  "hotmail-michael-asolo": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  },
  "outlook-michael-nnamah": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  },
  "outlook-n-nnamah": {
    "last_received": "2026-06-05T10:30:45Z",
    "last_poll": "2026-06-05T10:30:45.123456"
  }
}
Key points:
- last_received (Gmail API / Graph API): ISO 8601 datetime of the most recent message. The next poll fetches messages received after this timestamp.
- last_poll: Timestamp of the last successful poll run.
Invariants:
- State is only updated after successful poll
- If poll fails, state is not modified (next run will retry same messages)
- Cursors monotonically increase (no going backwards)
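The invariants above can be sketched as follows. This is a minimal illustration, not the poller's actual code: the helper names `load_state`, `advance_cursor`, and `save_state` are hypothetical, and the write-to-temp-then-rename step is an assumption about how one might keep state.json intact if a run dies mid-write.

```python
import json
import os
import tempfile


def load_state(path: str) -> dict:
    """Return the saved cursors, or an empty dict on a first run."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def advance_cursor(state: dict, account: str, last_received: str, polled_at: str) -> None:
    """Move an account's cursor forward; an older timestamp is ignored."""
    previous = state.get(account, {}).get("last_received", "")
    # ISO 8601 UTC strings in one fixed format sort correctly as plain strings.
    if last_received >= previous:
        state[account] = {"last_received": last_received, "last_poll": polled_at}


def save_state(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then swap it in atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)  # leave no stray temp file behind
        raise
```

Calling `save_state` only after a poll has returned without raising gives the "state is only updated after successful poll" behaviour; a failed run leaves the previous file untouched.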
Archive Format
Directory Structure
/opt/hinata/mail-poller/archive/
├── gmail/
│   ├── 2026/
│   │   ├── 05/
│   │   │   ├── a1b2c3d4e5f6a7b8.json
│   │   │   ├── b2c3d4e5f6a7b8c9.json
│   │   │   └── ...
│   │   └── 06/
│   │       └── ...
│   └── 2027/
│       └── ...
├── hotmail-michael-asolo/
│   ├── 2026/
│   │   ├── 05/
│   │   └── 06/
│   └── 2027/
├── outlook-michael-nnamah/
│   └── ...
└── outlook-n-nnamah/
    └── ...
Naming scheme: {account}/{YYYY}/{MM}/{message_hash}.json
where:
- account = "gmail" | "hotmail-michael-asolo" | "outlook-michael-nnamah" | "outlook-n-nnamah"
- YYYY/MM = year/month extracted from the message date
- message_hash = SHA256(message_id)[:16], the first 16 hex characters of the SHA-256 digest
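As a small illustration of the naming scheme, the path for one message could be derived like this. The function names and the UTF-8 encoding of the Message-ID before hashing are assumptions; the source only states SHA256(message_id)[:16].

```python
import hashlib
from datetime import datetime
from pathlib import PurePosixPath


def message_hash(message_id: str) -> str:
    """First 16 hex characters of the SHA-256 digest of the Message-ID."""
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]


def archive_path(root: str, account: str, message_id: str, date: datetime) -> PurePosixPath:
    """Build {root}/{account}/{YYYY}/{MM}/{message_hash}.json."""
    return PurePosixPath(root) / account / date.strftime("%Y/%m") / f"{message_hash(message_id)}.json"


path = archive_path(
    "/opt/hinata/mail-poller/archive",
    "gmail",
    "<abc123.mail@gmail.com>",
    datetime(2026, 6, 5, 10, 30, 45),
)
print(path)
```

Because the file name depends only on the Message-ID, the same message always maps to the same path, which is what makes re-runs safe.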
Message JSON Format
json
{
  "account": "gmail",
  "email": "michael.asolo1@gmail.com",
  "message_id": "<abc123.mail@gmail.com>",
  "message_hash": "a1b2c3d4e5f6a7b8",
  "date": "2026-06-05T10:30:45+00:00",
  "subject": "Test Email Subject",
  "from": "sender@example.com",
  "to": "michael.asolo1@gmail.com",
  "body_text": "Plain text content...",
  "body_html": "<html>...</html>",
  "year_month": "2026/06"
}
Field meanings:
- account: Which account this email came from
- email: Email address of the account
- message_id: RFC2822 Message-ID header (unique identifier from server)
- message_hash: Short hash of message_id (for archive path)
- date: ISO8601 timestamp when message was received
- subject: Email subject line
- from: Sender address
- to: Recipient addresses (semicolon-separated for multiple)
- body_text: Plaintext body
- body_html: HTML body (if available)
- year_month: Denormalized copy of date in YYYY/MM format (for archive path)
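To make the field list concrete, here is a hedged sketch of building such a record from a raw RFC 2822 message using only the standard library `email` package. `build_record` is a hypothetical helper, and the exact separator (";" with no space) and the use of the Date header as the "received" time are assumptions.

```python
import hashlib
from email import message_from_bytes, policy
from email.utils import getaddresses, parsedate_to_datetime


def build_record(raw: bytes, account: str, email_addr: str) -> dict:
    """Parse raw RFC 2822 bytes into the archive record layout."""
    msg = message_from_bytes(raw, policy=policy.default)
    message_id = str(msg.get("Message-ID", "")).strip()
    date = parsedate_to_datetime(str(msg["Date"]))
    # get_body returns None when the message has no part of that type.
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    recipients = [addr for _, addr in getaddresses([str(v) for v in msg.get_all("To", [])])]
    return {
        "account": account,
        "email": email_addr,
        "message_id": message_id,
        "message_hash": hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16],
        "date": date.isoformat(),
        "subject": str(msg.get("Subject", "")),
        "from": str(msg.get("From", "")),
        "to": ";".join(recipients),  # separator format is an assumption
        "body_text": text_part.get_content() if text_part else None,
        "body_html": html_part.get_content() if html_part else None,
        "year_month": date.strftime("%Y/%m"),
    }
```

Messages fetched through Graph or the Gmail REST API arrive as JSON rather than raw RFC 2822, so this sketch covers only the raw-message case.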
Polling Protocols
Gmail (REST API — OAuth2)
Script: gmail-api-poller.py at /opt/jimmy-brain-ops/scripts/
Credentials:
- Client registration: gmail_oauth_client.json (Vaultwarden item gmail_oauth_client)
  - Fields: client_id, client_secret, token_uri, auth_uri, project_id
- Account token: gmail_oauth_token.json (Vaultwarden item gmail_oauth_token)
  - Fields: access_token, refresh_token, token_type, refreshed
Token refresh flow:
- POST to token_uri (default https://oauth2.googleapis.com/token)
- Payload: client_id + client_secret + refresh_token + grant_type=refresh_token
- Response: new access_token (+ optionally refreshed refresh_token)
- Write updated tokens to disk and back to Vaultwarden
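The refresh steps above might look roughly like this with the standard library. `build_refresh_request` and `merge_refresh_response` are hypothetical names; the sketch only constructs the request and merges a response dict, and leaves the network call, the disk write, and the Vaultwarden push to the caller.

```python
import urllib.parse
import urllib.request

DEFAULT_TOKEN_URI = "https://oauth2.googleapis.com/token"


def build_refresh_request(client: dict, token: dict) -> urllib.request.Request:
    """Form-encoded POST carrying the refresh_token grant."""
    payload = urllib.parse.urlencode({
        "client_id": client["client_id"],
        "client_secret": client["client_secret"],
        "refresh_token": token["refresh_token"],
        "grant_type": "refresh_token",
    }).encode("ascii")
    token_uri = client.get("token_uri", DEFAULT_TOKEN_URI)
    return urllib.request.Request(token_uri, data=payload, method="POST")


def merge_refresh_response(token: dict, response: dict) -> dict:
    """Return a new token dict; keep the old refresh_token unless a new one was issued."""
    updated = dict(token)
    updated["access_token"] = response["access_token"]
    updated["refresh_token"] = response.get("refresh_token", token["refresh_token"])
    return updated
```

A caller would pass the request to `urllib.request.urlopen`, decode the JSON body, and feed it to `merge_refresh_response` before persisting the result.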
Re-auth flow: gmail-oauth-reauth.py at ~/Sandpit/hinata-sandpit/scripts/ (Mac-only, requires browser). Starts localhost:8090 HTTP server, opens Google consent URL, captures code, exchanges for tokens, pushes to Z2 via SCP. Required when refresh_token expires or is revoked.
Incremental polling:
- Refresh access_token (if older than 55 minutes)
- GET https://gmail.googleapis.com/gmail/v1/users/me/messages?q=in:inbox after:{epoch}
- For each message ref: GET .../messages/{id}?format=full
- Parse headers + decode MIME body (plain text preferred, HTML fallback with tag stripping)
- Update cursor file at /mnt/data/hinata/data/mail/gmail-cursor-gmail-michael-asolo1.json
Backfill mode: --backfill --since YYYY-MM iterates month by month using after:{epoch} before:{epoch} queries. Deduplicates via per-message JSON file existence check.
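A minimal sketch of the month-by-month iteration, assuming UTC month boundaries and an explicit `until` cutoff (neither is stated in the source). `month_windows` and `backfill_query` are hypothetical names, and keeping `in:inbox` in the backfill query is also an assumption carried over from the incremental query.

```python
from datetime import datetime, timezone


def month_windows(since: str, until: datetime):
    """Yield (after_epoch, before_epoch) for each calendar month from `since` (YYYY-MM) to `until`."""
    year, month = (int(part) for part in since.split("-"))
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    while start < until:
        # First day of the following month; December rolls over into January.
        nxt = datetime(start.year + (start.month == 12), start.month % 12 + 1, 1, tzinfo=timezone.utc)
        yield int(start.timestamp()), int(min(nxt, until).timestamp())
        start = nxt


def backfill_query(after_epoch: int, before_epoch: int) -> str:
    """Gmail search string for one window."""
    return f"in:inbox after:{after_epoch} before:{before_epoch}"


for after, before in month_windows("2025-11", datetime(2026, 2, 1, tzinfo=timezone.utc)):
    print(backfill_query(after, before))
```

Each window ends exactly where the next begins, so a message on a boundary may match two queries; the per-message file existence check described above absorbs that overlap.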
Archive paths:
- JSONL: /mnt/data/hinata/mail-archive/{date}.jsonl
- Per-message JSON: /mnt/data/hinata/mail-archive/gmail-michael-asolo1/{year}/{year-month}/{hash}.json
Account: michael.asolo1@gmail.com (single account)
Microsoft Graph API (OAuth2)
Credentials:
- App registration: outlook-graph-credentials.json
json
{ "client_id": "uuid", "client_secret": "secret", "tenant_id": "consumers" }
- Per-account tokens: outlook-tokens-{account}.json
json
{ "access_token": "token", "refresh_token": "refresh", "updated": "2026-06-05T10:30:45.123456" }
Token refresh flow:
- POST to https://login.microsoftonline.com/consumers/oauth2/v2.0/token
- Payload: client_id + client_secret + refresh_token + grant_type=refresh_token
- Response: new access_token (+ refreshed refresh_token)
- Write updated tokens back to disk
Incremental polling:
- Refresh access_token (if expired)
- GET /me/messages?$filter=receivedDateTime gt {last_received}
- Response: array of message objects (JSON)
- Extract fields from each item dict
- Update state.json[account]["last_received"]
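The incremental query could be assembled along these lines. `build_messages_url` and `newest_received` are hypothetical helpers; the `$orderby` clause is an addition not mentioned in the source, included on the assumption that oldest-first ordering makes cursor advancement easier to reason about.

```python
import urllib.parse

GRAPH_MESSAGES = "https://graph.microsoft.com/v1.0/me/messages"


def build_messages_url(last_received: str) -> str:
    """URL for messages received strictly after the saved cursor."""
    params = {
        "$filter": f"receivedDateTime gt {last_received}",
        "$orderby": "receivedDateTime asc",  # assumption, not in the source
    }
    # Keep "$" literal in parameter names; percent-encode spaces and colons.
    query = urllib.parse.urlencode(params, safe="$", quote_via=urllib.parse.quote)
    return f"{GRAPH_MESSAGES}?{query}"


def newest_received(messages: list, fallback: str) -> str:
    """Next cursor value: the latest receivedDateTime seen, or the old cursor if none."""
    return max((m["receivedDateTime"] for m in messages), default=fallback)


print(build_messages_url("2026-06-05T10:30:45Z"))
```

`newest_received` relies on Graph returning timestamps in one consistent ISO 8601 UTC format, so plain string comparison matches chronological order.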
Advantages:
- Single API endpoint (no per-folder logic)
- Built-in filtering (by date, subject, etc.)
- OAuth2 tokens (no password stored)
Disadvantages:
- More complex token management
- JSON response (need to parse body_text from mixed content)
- Requires Azure app registration
Error Handling
Per-Account Isolation
If one account fails, others continue:
python
for account_key, config in accounts_to_poll.items():
    try:
        # poll_account is a placeholder for the provider-specific poll_*() function
        count, messages = poll_account(account_key, config)
    except Exception as e:
        logger.error(f"[{account_key}] Poll failed: {e}")
        continue  # Continue to next account
Partial Failure Recovery
If a message fails to parse/archive, it's logged but doesn't stop the run:
python
for message in messages:
    try:
        archive_message(message)
    except Exception as e:
        logger.error(f"Failed to archive: {e}")
        continue  # Continue to next message
State Not Saved on Failure
State is only written if entire account poll succeeds:
python
if not dry_run:
    save_state(state)  # Only if no exceptions above
This ensures cursors don't advance for failed runs, allowing retry without skipping messages.
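Putting the three fragments together, one run might be structured as in the sketch below. This is an illustrative composition under stated assumptions, not the script's real control flow: `run_once`, the `pollers` mapping, and the `kind` config key are all hypothetical, and each poller is assumed to return `(messages, new_cursor)`.

```python
import logging

logger = logging.getLogger("mail-poller")


def run_once(accounts, state, pollers, archive, save_state, dry_run=False):
    """Poll every account; a failing account keeps its old cursor and never blocks the rest."""
    new_state = dict(state)
    for account_key, config in accounts.items():
        try:
            messages, cursor = pollers[config["kind"]](config, state.get(account_key))
        except Exception:
            logger.exception("[%s] Poll failed", account_key)
            continue  # cursor for this account is left untouched
        for message in messages:
            try:
                archive(message)
            except Exception:
                logger.exception("[%s] Failed to archive", account_key)
        new_state[account_key] = cursor  # advance only after a successful poll
    if not dry_run:
        save_state(new_state)
    return new_state
```

One consequence worth checking against the real script: because a single archive failure is logged and skipped, the cursor can still advance past that message, so it would need a backfill to recover.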
Performance Characteristics
Time Complexity
| Operation | Time | Notes |
|---|---|---|
| Connect to Gmail IMAP | 50–100ms | SSL handshake |
| IMAP UID search | 10–50ms | Server-side filtered |
| Fetch N messages | 100–500ms | Depends on message size |
| Parse message | 1–5ms | RFC2822 parsing |
| Refresh token | 100–300ms | HTTP POST to Azure |
| Graph API fetch | 200–800ms | HTTP GET with filter |
| Archive N messages | 50–200ms | Disk I/O |
| Save state.json | 1–5ms | JSON serialization |
Space Complexity
| Item | Size | Notes |
|---|---|---|
| state.json | <50 KB | Independent of archive size |
| Message JSON | 10–100 KB | Depends on body size |
| Archive (47K emails) | 4.2 GB | Average ~90 KB per message |
Network Overhead
- Gmail IMAP: 4 bytes per UID (~12 KB to search 3,000 UIDs)
- Graph API: ~50 bytes per field × N messages
- Token refresh: ~500 bytes request, ~1 KB response
Scaling Considerations
Single-threaded Sequential Polling
Current implementation polls accounts sequentially:
- Gmail (all folders)
- Outlook account 1
- Outlook account 2
- Outlook account 3
Total runtime: 2–5 seconds (typical)
If scaling to >10 accounts, consider:
- Parallel polling (async/threading) — would require stdlib changes
- Batching (archive multiple times per day instead of every 15 min)
- Filter rules (only fetch certain folders or date ranges)
Archive I/O Bottleneck
Writing files to disk is single-threaded. For >1000 emails per run:
- Consider batch archiving (write to single file, split on read)
- Or use database backend (SQLite, PostgreSQL)
Current design (individual files) is good for:
- Simple access patterns (git clone for backups)
- Transparent format (human-readable JSON)
- Per-message isolation (no locking issues)
Memory Usage
Script uses <50 MB RSS (resident set size):
- Email bodies held in memory during run
- State dict is small
- No caching of archive index
For >100K emails per run, consider:
- Streaming processing (don't hold all messages in memory)
- Or increase ct102 available RAM
Operational Notes
Idempotency
The script is idempotent:
- Running twice in a row = same result
- Archive deduplication: message_hash prevents duplicates
- State advancement: cursor only moves forward
Safe to run manually or via cron without conflicts.
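One way to get the deduplication property at write time is exclusive file creation, sketched below. `archive_once` is a hypothetical helper, and whether the real script skips or overwrites an existing file is not stated in the source.

```python
import json
from pathlib import Path


def archive_once(path: Path, record: dict) -> bool:
    """Write the record only if no file exists at `path`; return True when written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails with FileExistsError instead of overwriting.
        with open(path, "x", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
    except FileExistsError:
        return False
    return True
```

With this shape, running the poller twice over the same messages leaves the archive byte-for-byte unchanged after the first pass.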
Observability
Logging:
- All events logged to journalctl (systemd)
- Structured format: timestamp, level, component, message
- Errors logged with full exception info
State inspection:
bash
# Last poll times (state.json is an object keyed by account name)
jq 'to_entries[] | {account: .key, last_poll: .value.last_poll}' state.json
# Message counts
find archive -name "*.json" | wc -l
# Archive growth
du -sh archive/*
Disaster Recovery
If state.json is lost or corrupted:
- Delete state.json if a damaged copy remains
- Next run will treat as "first run" and fetch all messages
- May take 5–10 minutes for large accounts (all IMAP UIDs fetched)
- No data loss (archive is separate)
If archive is lost:
- Emails not re-fetched (state.json remembers what was archived)
- To re-archive: delete state.json + run again
If credentials are leaked:
- Immediately regenerate passwords/tokens in Azure + Gmail
- Update credential files on ct103
- Monitor account for unauthorized access
Future Extensions
Filters
Add --since, --until options for backfill:
bash
mail-poller.py --since 2024-01-01 --until 2024-06-01
Deduplication
Check if message_id already exists in archive before fetch.
Database Backend
Replace file-based archive with SQLite:
/opt/hinata/mail-poller/archive.db
├─ messages table (id, account, message_id, date, subject, from, to, body_text, body_html)
├─ state table (account, last_uid, last_received)
└─ indices (account, date, from, subject)
Would enable:
- Fast queries (SQL)
- Transaction safety
- Full-text search
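A possible starting point for that schema, using the standard library `sqlite3` module. Column types, the UNIQUE constraint on (account, message_id), and the helper names are assumptions layered on the column list above; full-text search would need an FTS table that this sketch does not include.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY,
    account    TEXT NOT NULL,
    message_id TEXT NOT NULL,
    date       TEXT NOT NULL,
    subject    TEXT,
    "from"     TEXT,
    "to"       TEXT,
    body_text  TEXT,
    body_html  TEXT,
    UNIQUE (account, message_id)
);
CREATE TABLE IF NOT EXISTS state (
    account       TEXT PRIMARY KEY,
    last_uid      INTEGER,
    last_received TEXT
);
CREATE INDEX IF NOT EXISTS idx_messages_account_date ON messages (account, date);
CREATE INDEX IF NOT EXISTS idx_messages_from ON messages ("from");
"""

COLUMNS = ("account", "message_id", "date", "subject", "from", "to", "body_text", "body_html")


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_message(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert one record; return False if (account, message_id) is already stored."""
    column_list = ", ".join(f'"{c}"' for c in COLUMNS)  # quoted: from/to are SQL keywords
    placeholders = ", ".join("?" * len(COLUMNS))
    sql = f"INSERT OR IGNORE INTO messages ({column_list}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(sql, [record.get(c) for c in COLUMNS])
    return cur.rowcount == 1
```

The UNIQUE constraint plays the role that the hash-named file plays in the current design: a repeated insert is silently ignored.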
Classification Integration
Hook into Heimerdinger NLP classifier:
archive message → send to classifier API → store result in archive.db
API Endpoint
Expose FastAPI endpoint:
GET /api/emails?account=gmail&from=2026-06-01&to=2026-06-05&limit=50
Would enable Studio access to the archive without direct filesystem access.