How to rehash the Mac mail archive into CT102 canonical paths
This is the executable spec for rehash-mac-import.py. Trunks specs the contract; Jimmy Neutron writes the .py from this document. The script lands the ~14.7K-message Mac archive (sha1[:12] filenames, Oct 2016 → Jun 2026) into CT102's canonical /mail-archive/{account}/{YYYY}/{MM}/{sha256[:16]}.json tree in one continuous corpus with live polls.
Background context (why this script exists, not what it does): see explanation_mail-poller-historical-use-cases.
Locked contract
| Property | Value |
|---|---|
| Script path | /opt/hinata/mail-poller/rehash-mac-import.py (CT102 — sibling of mail-poller.py) |
| Invocation surface | ssh hinata-z2 'pct exec 102 -- /opt/hinata/mail-poller/rehash-mac-import.py [flags]' |
| Runtime | Python 3.11 stdlib only — pathlib, hashlib, json, email, argparse, logging. No pip. |
| Source path (read-only) | /Users/nnamdi/Sandpit/hinata/resources/email-poller/ on the Mac. Reached over NFS from CT102 via the host bind (see § Source access). |
| Target tree | /mail-archive/{account}/{YYYY}/{MM}/{sha256[:16]}.json (CT102 canonical, bind of /mnt/data/hinata/mail-archive) |
| Idempotency journal | /mail-archive/_journal/mac-import.jsonl (one JSONL line per write) + /mail-archive/_journal/rehash-state.json (resume cursor) |
| Hashing | Source filename = sha1(message_id)[:12]; target filename = sha256(message_id)[:16]. Message-id is read from inside the JSON envelope, never inferred from filename. |
| Envelope normalisation | Output envelope matches reference_mail-poller-z2.md § Archive Message Format exactly: same field names, same order. Missing fields are inferred from the RFC2822 headers in the source file; a field that is still absent is written as an empty string (never null). |
| Move semantics | The Mac source stays in place until _journal/rehash-state.json reports status: complete. Deletion of the Mac source is a separate manual step (§ Source cleanup). |
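
The hashing and target-tree rows above can be sketched as follows. This is a minimal illustration, not the script itself: the function names are hypothetical, and it assumes the message-id is hashed as UTF-8 text exactly as stored in the envelope and that the digests are hex strings, neither of which the contract states explicitly.

```python
import hashlib
from pathlib import PurePosixPath


def source_name(message_id: str) -> str:
    """Mac-side filename stem: first 12 hex chars of sha1(message_id)."""
    # Assumption: UTF-8 encoding of the message-id string.
    return hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]


def target_name(message_id: str) -> str:
    """CT102 filename stem: first 16 hex chars of sha256(message_id)."""
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]


def target_path(root: str, account: str, year: int, month: int,
                message_id: str) -> PurePosixPath:
    """Build {root}/{account}/{YYYY}/{MM}/{sha256[:16]}.json."""
    return (PurePosixPath(root) / account / f"{year:04d}" / f"{month:02d}"
            / f"{target_name(message_id)}.json")
```

Because both names are derived from the message-id, the source filename never needs to be parsed to compute the target filename.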
Source access (Mac source over NFS)
Per supreme-court/runtime/container-storage-strategy: the Mac is a client. CT102 reads the Mac archive via:
- Mac exports `~/Sandpit/hinata/resources/email-poller/` read-only over the Tailscale-gated NFS share already in place for the inversion law (`100.64.0.0/10` CIDR only).
- Z2 host mounts the Mac NFS export at `/mnt/mac-mail-import/` (read-only).
- CT102 bind-mounts `/mnt/mac-mail-import/` at `/mac-mail-import/` (read-only inside the container).
The script reads from /mac-mail-import/ only and never writes to that path. If NFS is unavailable, the script writes nothing and exits non-zero with the message `NFS source unavailable — start at host mount`.
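
One way the availability check could look is sketched below. The function names and the exit code 1 are assumptions. The sketch also treats an empty directory as unavailable, on the assumption that an unmounted bind point usually shows up as an empty directory; the spec does not say how unavailability is detected.

```python
import os
import sys

NFS_ERROR = "NFS source unavailable — start at host mount"


def source_available(source_root: str) -> bool:
    """True only if the root can be listed and holds at least one entry."""
    try:
        with os.scandir(source_root) as entries:
            # Assumption: an empty mount point means the NFS export is absent.
            return next(entries, None) is not None
    except OSError:
        return False


def require_source(source_root: str) -> None:
    """Exit non-zero before any write if the source cannot be read."""
    if not source_available(source_root):
        print(NFS_ERROR, file=sys.stderr)
        raise SystemExit(1)
```

Calling `require_source(args.source)` before opening the journal or the target tree keeps the "writes nothing" guarantee.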
Flags
| Flag | Default | Purpose |
|---|---|---|
| `--source PATH` | /mac-mail-import | Override source root (test fixtures) |
| `--target PATH` | /mail-archive | Override target root (test fixtures) |
| `--dry-run` | false | List intended writes + counts, write nothing, no journal entries |
| `--chunk-size N` | 500 | Messages per chunk before fsync + journal flush |
| `--checkpoint-every N` | 500 | Update rehash-state.json cursor every N messages (same cadence as chunk by default) |
| `--account NAME` | all | Restrict to one account dir (gmail-michael-asolo1, etc.) |
| `--resume` | false | Resume from rehash-state.json cursor; skip processed message-ids |
| `--verbose` | false | One log line per message processed |
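
The flag table maps directly onto `argparse`. The sketch below mirrors the names and defaults above; the `build_parser` helper name is hypothetical, and help strings are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line surface mirroring the flag table."""
    parser = argparse.ArgumentParser(prog="rehash-mac-import.py")
    parser.add_argument("--source", metavar="PATH", default="/mac-mail-import")
    parser.add_argument("--target", metavar="PATH", default="/mail-archive")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--chunk-size", metavar="N", type=int, default=500)
    parser.add_argument("--checkpoint-every", metavar="N", type=int, default=500)
    parser.add_argument("--account", metavar="NAME", default="all")
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser
```

`argparse` exposes dashed flags with underscores, so `--chunk-size` is read back as `args.chunk_size` and `--dry-run` as `args.dry_run`.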
Conversion algorithm
For each *.json file under /mac-mail-import/:
- Read the source file. Parse JSON.
- Read `message_id` from inside the envelope. If absent, parse RFC2822 headers from `body_text` / raw block to recover it. If still absent, log to `_journal/rehash-state.json` under `unrecoverable` and skip.
- Compute `target_hash = sha256(message_id)[:16]`.
- Derive `account` from the source path (`gmail-michael-asolo1` → `gmail`; outlook accounts map 1:1 by name per `reference_mail-poller-z2.md` § Accounts).
- Derive `year_month` from the `date` field (ISO8601 parse). If `date` is missing, fall back to the RFC2822 `Date:` header. If still missing, log to `unrecoverable` and skip.
- Build target path `/mail-archive/{account}/{YYYY}/{MM}/{target_hash}.json`.
- Idempotency gate: if target path exists, skip (do not overwrite). This is the safe overlap window with live polls.
- Normalise envelope to the CT102 schema (`reference_mail-poller-z2.md` § Archive Message Format): preserve all source body content; rewrite the `message_hash` field to `target_hash`.
- Write target file atomically (write to `.tmp`, fsync, rename).
- Append one line to `/mail-archive/_journal/mac-import.jsonl`:

  ```json
  {"ts":"<ISO8601>","src":"<source-relative-path>","target":"<target-relative-path>","src_hash":"<sha1:12>","target_hash":"<sha256:16>","account":"<acct>"}
  ```
- Every `--chunk-size` messages: fsync the target dir, update `/mail-archive/_journal/rehash-state.json`.
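
The per-file steps above, up to the atomic write, can be sketched as below. Several things here are assumptions for illustration only: the `raw_headers` field name (the spec only says "`body_text` / raw block"), the helper names, and the returned status strings. Full normalisation to the CT102 schema is left out because that schema lives in `reference_mail-poller-z2.md`; journal and checkpoint handling are also omitted.

```python
import hashlib
import json
import os
from datetime import datetime
from email import message_from_string
from email.utils import parsedate_to_datetime
from pathlib import Path
from typing import Optional, Tuple


def header_value(envelope: dict, name: str) -> str:
    """Recover one RFC 2822 header from a raw header block, or ''."""
    raw = envelope.get("raw_headers") or ""  # hypothetical field name
    return message_from_string(raw).get(name, "") or ""


def parse_date(raw: str) -> Optional[datetime]:
    """Try ISO 8601 first, then the RFC 2822 date format."""
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        pass
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None


def write_atomic(target: Path, envelope: dict) -> None:
    """Write to a .tmp sibling, fsync it, then rename into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, ensure_ascii=False)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, target)


def convert_one(src: Path, target_root: Path,
                account: str) -> Tuple[str, Optional[Path]]:
    """Convert one source file; return (status, target path or None)."""
    envelope = json.loads(src.read_text(encoding="utf-8"))
    message_id = envelope.get("message_id") or header_value(envelope, "Message-ID")
    if not message_id:
        return "unrecoverable: no message-id", None
    when = parse_date(envelope.get("date") or header_value(envelope, "Date"))
    if when is None:
        return "unrecoverable: no date", None
    target_hash = hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]
    target = (target_root / account / f"{when:%Y}" / f"{when:%m}"
              / f"{target_hash}.json")
    if target.exists():  # idempotency gate: never overwrite
        return "skipped_existing", target
    envelope["message_id"] = message_id
    envelope["message_hash"] = target_hash
    write_atomic(target, envelope)
    return "written", target
```

The caller is expected to map the returned status onto the journal line and the `unrecoverable` list, so that `convert_one` itself stays free of journal state.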
Idempotency
The script is safe to re-run any number of times. Re-runs:
- Load `rehash-state.json` if `--resume`, skip every source file whose source-relative path appears in `processed`.
- Even without `--resume`, the per-target `Path.exists()` gate prevents duplicate writes.
- Live polls cannot collide: live writes and rehash writes target the same canonical paths with the same hash scheme. Whoever writes first wins; the other returns False.
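
A check with `Path.exists()` followed by a rename leaves a small window in which two writers can both pass the check. The sketch below shows one possible way to get "first writer wins, the other returns False" without that window. It swaps in `os.link`, which fails if the target already exists, in place of the rename named in the algorithm, so it is an alternative technique rather than what the spec prescribes. The function name is hypothetical, and it assumes the target filesystem supports hard links.

```python
import json
import os
from pathlib import Path


def write_if_absent(target: Path, envelope: dict) -> bool:
    """Return True if this call created `target`, False if it already existed."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Per-process temp name so two writers never share a temp file.
    tmp = target.with_name(f"{target.name}.{os.getpid()}.tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, ensure_ascii=False)
        fh.flush()
        os.fsync(fh.fileno())
    try:
        os.link(tmp, target)  # raises FileExistsError if target exists
        return True
    except FileExistsError:
        return False
    finally:
        os.unlink(tmp)  # the temp name is no longer needed either way
```

A losing writer leaves the existing file untouched, which matches the "do not overwrite" rule of the idempotency gate.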
rehash-state.json schema:
```json
{
"status": "running | complete | aborted",
"started": "<ISO8601>",
"last_update": "<ISO8601>",
"processed_count": 12345,
"skipped_existing": 234,
"unrecoverable_count": 5,
"processed": ["gmail-michael-asolo1/2016/10/<sha1:12>.json", "..."],
"unrecoverable": [{"src":"...", "reason":"no message-id"}, "..."]
}
```

The processed list can grow to ~15K entries; keep it as a flat array for simplicity at this scale (file stays <2 MB).
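
A checkpoint reader and writer for this schema might look like the sketch below. The helper names are hypothetical, timestamps are assumed to be UTC ISO 8601, and the state file is written through a temporary file so that an interrupted run does not leave a half-written cursor. Counter maintenance is left to the caller because the spec does not define exactly which events increment each counter.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string (assumed timestamp format)."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def new_state() -> dict:
    """Empty state matching the schema above."""
    ts = now_iso()
    return {
        "status": "running",
        "started": ts,
        "last_update": ts,
        "processed_count": 0,
        "skipped_existing": 0,
        "unrecoverable_count": 0,
        "processed": [],
        "unrecoverable": [],
    }


def load_state(path: Path) -> dict:
    """Load the existing cursor, or start a fresh one if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return new_state()


def save_state(path: Path, state: dict) -> None:
    """Stamp last_update and replace the state file atomically."""
    state["last_update"] = now_iso()
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)
```

On `--resume`, converting `state["processed"]` to a `set` once at startup keeps the per-file skip check cheap even at ~15K entries.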
Chunking + checkpoint cadence
| Phase | Setting | Rationale |
|---|---|---|
| Default chunk size | 500 messages | At ~30 ms/file (NFS read + sha256 + atomic write), 500 messages ≈ 15 s per chunk. Bounds resume granularity. |
| Default checkpoint | every 500 messages | Matches chunk cadence — one journal sync per chunk. |
| Total corpus | ~14,700 messages | Expected ~30 chunks. At ~30 ms/file that is about 7.5 min of per-file work; budget ≈ 8–10 min wall time for the full run. |
| Disk pressure | bind-mount on the Z2 data disk | Writes batch within the same dir tree the live poller already targets — no extra I/O contention because live polls write 5–50 messages per 15-min run. |
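
The figures in the table follow from three inputs: corpus size, chunk size, and per-file cost. The short calculation below reproduces them; the 30 ms per file is the table's own estimate, not a measurement.

```python
import math

TOTAL_MESSAGES = 14_700    # approximate corpus size from the table
CHUNK_SIZE = 500           # default --chunk-size
SECONDS_PER_FILE = 0.030   # estimated NFS read + sha256 + atomic write

chunks = math.ceil(TOTAL_MESSAGES / CHUNK_SIZE)
seconds_per_chunk = CHUNK_SIZE * SECONDS_PER_FILE
total_minutes = TOTAL_MESSAGES * SECONDS_PER_FILE / 60

print(f"chunks: {chunks}")
print(f"seconds per chunk: {seconds_per_chunk:.0f}")
print(f"per-file work: {total_minutes:.2f} min")
```

This gives 30 chunks of about 15 s each and roughly 7.35 min of per-file work, so the 8–10 min wall-time figure leaves headroom for fsyncs and checkpoint writes.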
Rollback
If mid-move corruption is detected (Python exception, NFS drop, disk error, malformed source file):
- Stop the script: `pkill -f rehash-mac-import` on CT102.
- Inspect `rehash-state.json`: the `status` field shows `running` (mid-flight) or `aborted` (graceful exit). `processed_count` and `last_update` confirm how far it got.
- The Mac source is untouched: the script never writes to `/mac-mail-import/`, so the original sha1[:12] tree is intact and re-runnable.
- Inspect the journal: `_journal/mac-import.jsonl` enumerates every successful write. Tail the last 100 lines to identify the suspect range.
- If a known-bad range needs deletion: `find /mail-archive/{account}/{YYYY}/{MM}/ -newermt "[last-known-good-ts]" -name "*.json" -delete` (`-newermt` compares against a timestamp; plain `-newer` expects a reference file), then re-run with `--resume` after clearing those entries from `processed` in `rehash-state.json`.
- If corruption is global: delete `rehash-state.json` and the journal; the target tree's idempotency gate prevents duplicates on a clean re-run, but live polls may have written new messages in the meantime. Those stay (they are correct by construction).
- Never delete the Mac source until `status: complete` and a manual spot-check of 10 random rehashed files renders correctly.
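
To identify a suspect range from the journal and clear it from the resume cursor, something like the sketch below could be used. The function names are hypothetical. It assumes every journal `ts` is an ISO 8601 string that `datetime.fromisoformat` accepts and that the cutoff uses the same timezone convention as the journal.

```python
import json
from datetime import datetime
from pathlib import Path


def records_after(journal: Path, cutoff_iso: str) -> list:
    """Return journal records whose `ts` is later than the cutoff."""
    cutoff = datetime.fromisoformat(cutoff_iso)
    suspects = []
    with open(journal, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            if datetime.fromisoformat(record["ts"]) > cutoff:
                suspects.append(record)
    return suspects


def clear_processed(state: dict, suspects: list) -> dict:
    """Copy of the state with suspect sources removed from `processed`."""
    bad = {record["src"] for record in suspects}
    cleaned = dict(state)
    cleaned["processed"] = [p for p in state["processed"] if p not in bad]
    return cleaned
```

Because the journal lists only files written by the rehash, the `target` values in the returned records name exactly the rehash-written files in the suspect window and exclude anything the live poller wrote in the same period.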
Source cleanup (post-completion)
After rehash-state.json reports status: complete AND a 10-file spot-check passes AND the journal line count matches processed_count:
- Snapshot the Mac source path (`tar -czf ~/Sandpit/hinata/resources/email-poller-pre-delete-[YYYY-MM-DD].tar.gz -C ~/Sandpit/hinata/resources email-poller/`). The `-C` makes the command independent of the current directory.
- Park the tarball on the Z2 data disk under `/mnt/data/hinata/_one-off-backups/`.
- Delete the Mac source path (`rm -rf ~/Sandpit/hinata/resources/email-poller/`).
- Update reference_mail-poller-z2.md § Future Work to mark the Mac archive move complete.
The tarball is a recovery artefact only — Heimerdinger reads canonical CT102 paths only, never the tarball.
Verification
After every run (or every resume), confirm:
- `find /mail-archive -path /mail-archive/_journal -prune -o -path /mail-archive/_state -prune -o -name "*.json" -print | wc -l` ≥ expected total.
- `jq .status /mail-archive/_journal/rehash-state.json` returns `"complete"` on the final run.
- `wc -l /mail-archive/_journal/mac-import.jsonl` matches `processed_count` in `rehash-state.json`.
- Sample 10 random rehashed files: `jq '{message_id, message_hash, date, subject, from}'` returns populated fields, no nulls.
- Live poll cursor `/mail-archive/_state/state.json` UNCHANGED from pre-cutover snapshot (md5 match): the rehash must never touch the live cursor.
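
The same checks can be scripted with the standard library. The sketch below is one hedged way to do it. The function names are hypothetical, and it reads "no nulls" as "the key is present and not JSON null", so empty strings pass, in line with the envelope rule that absent fields are written as empty strings.

```python
import hashlib
import json
from pathlib import Path

REQUIRED_FIELDS = ("message_id", "message_hash", "date", "subject", "from")


def journal_matches_state(journal_dir: Path) -> bool:
    """status is complete and journal line count equals processed_count."""
    state = json.loads(
        (journal_dir / "rehash-state.json").read_text(encoding="utf-8"))
    with open(journal_dir / "mac-import.jsonl", encoding="utf-8") as fh:
        lines = sum(1 for line in fh if line.strip())
    return state["status"] == "complete" and lines == state["processed_count"]


def null_or_missing(path: Path) -> list:
    """Names of required fields that are missing or null in one file."""
    envelope = json.loads(path.read_text(encoding="utf-8"))
    return [name for name in REQUIRED_FIELDS if envelope.get(name) is None]


def md5_of(path: Path) -> str:
    """Hex md5 of a file, for comparing the live cursor to its snapshot."""
    return hashlib.md5(path.read_bytes()).hexdigest()
```

A run passes when `journal_matches_state` is true, `null_or_missing` returns an empty list for each sampled file, and `md5_of` on `/mail-archive/_state/state.json` equals the value recorded before the cutover.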