How to rehash the Mac mail archive into CT102 canonical paths

This is the executable spec for rehash-mac-import.py. Trunks specs the contract; Jimmy Neutron writes the .py from this document. The script lands the ~14.7K-message Mac archive (sha1[:12] filenames, Oct 2016 → Jun 2026) into CT102's canonical /mail-archive/{account}/{YYYY}/{MM}/{sha256[:16]}.json tree, so that the imported messages and live polls form one continuous corpus.

Background context (why this script exists, not what it does): see explanation_mail-poller-historical-use-cases.

Locked contract

| Property | Value |
|---|---|
| Script path | /opt/hinata/mail-poller/rehash-mac-import.py (CT102, sibling of mail-poller.py) |
| Invocation surface | ssh hinata-z2 'pct exec 102 -- /opt/hinata/mail-poller/rehash-mac-import.py [flags]' |
| Runtime | Python 3.11, stdlib only: pathlib, hashlib, json, email, argparse, logging. No pip. |
| Source path (read-only) | /Users/nnamdi/Sandpit/hinata/resources/email-poller/ on the Mac. Reached over NFS from CT102 via the host bind (see § Source access). |
| Target tree | /mail-archive/{account}/{YYYY}/{MM}/{sha256[:16]}.json (CT102 canonical, bind of /mnt/data/hinata/mail-archive) |
| Idempotency journal | /mail-archive/_journal/mac-import.jsonl (one JSONL line per write) + /mail-archive/_journal/rehash-state.json (resume cursor) |
| Hashing | Source filename = sha1(message_id)[:12]; target filename = sha256(message_id)[:16]. Message-id is read from inside the JSON envelope, never inferred from filename. |
| Envelope normalisation | Output envelope matches reference_mail-poller-z2.md § Archive Message Format exactly: same field names, same order. Missing fields are inferred from RFC2822 headers in the source file; fields that are still absent become an empty string (never null). |
| Move semantics | The Mac source stays in place until _journal/rehash-state.json reports status: complete. Deletion of the Mac source is a separate manual step (§ Source cleanup). |
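
The hashing row can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch assumes the message-id is hashed as UTF-8 bytes exactly as stored in the envelope; the contract does not say whether any normalisation (such as stripping angle brackets) applies.

```python
import hashlib


def source_name(message_id: str) -> str:
    """Legacy Mac filename stem: first 12 hex chars of sha1(message_id)."""
    # Assumption: the id is hashed as UTF-8, unmodified.
    return hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]


def target_name(message_id: str) -> str:
    """CT102 canonical filename stem: first 16 hex chars of sha256(message_id)."""
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]
```

Because both names derive from the message-id alone, the source filename can be used as a cross-check on the envelope, but never as the input to the target hash.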

Source access (Mac source over NFS)

Per supreme-court/runtime/container-storage-strategy: the Mac is a client. CT102 reads the Mac archive via:

  1. Mac exports ~/Sandpit/hinata/resources/email-poller/ read-only over the Tailscale-gated NFS share already in place for the inversion law (100.64.0.0/10 CIDR only).
  2. Z2 host mounts the Mac NFS export at /mnt/mac-mail-import/ (read-only).
  3. CT102 bind-mounts /mnt/mac-mail-import/ at /mac-mail-import/ (read-only inside the container).

The script reads from /mac-mail-import/ only and never writes to that path. If NFS is unavailable, the script writes nothing and exits non-zero with the message "NFS source unavailable — start at host mount".
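
A minimal sketch of that pre-flight check follows. The helper names are hypothetical, and two details are assumptions rather than part of the contract: an empty source directory is treated as an unmounted bind point, and the exit code is 1 (the spec only requires non-zero).

```python
import sys
from pathlib import Path

NFS_ERROR = "NFS source unavailable — start at host mount"


def source_available(source: Path) -> bool:
    """True if the source root is a readable directory with at least one entry."""
    try:
        # Assumption: an unmounted bind point shows up as an empty directory.
        return source.is_dir() and any(source.iterdir())
    except OSError:
        # A stale or dropped NFS mount typically surfaces as an OSError here.
        return False


def require_source(source: Path) -> None:
    """Exit non-zero before any write happens if the source cannot be read."""
    if not source_available(source):
        print(NFS_ERROR, file=sys.stderr)
        sys.exit(1)
```

Calling require_source() before opening the journal or state file keeps the "writes nothing" guarantee simple to reason about.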

Flags

| Flag | Default | Purpose |
|---|---|---|
| --source PATH | /mac-mail-import | Override source root (test fixtures) |
| --target PATH | /mail-archive | Override target root (test fixtures) |
| --dry-run | false | List intended writes and counts; write nothing, including journal entries |
| --chunk-size N | 500 | Messages per chunk before fsync + journal flush |
| --checkpoint-every N | 500 | Update rehash-state.json cursor every N messages (same cadence as chunk by default) |
| --account NAME | all | Restrict to one account dir (gmail-michael-asolo1, etc.) |
| --resume | false | Resume from the rehash-state.json cursor; skip source files already listed in processed |
| --verbose | false | One log line per message processed |
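
The flag table maps directly onto argparse. The sketch below is one possible wiring, not the final script; build_parser and parse_args are hypothetical names, and the defaults are copied from the table.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the table with their documented defaults."""
    parser = argparse.ArgumentParser(prog="rehash-mac-import.py")
    parser.add_argument("--source", type=Path, default=Path("/mac-mail-import"))
    parser.add_argument("--target", type=Path, default=Path("/mail-archive"))
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--chunk-size", type=int, default=500)
    parser.add_argument("--checkpoint-every", type=int, default=500)
    parser.add_argument("--account", default="all")
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

For example, parse_args(["--dry-run", "--account", "gmail-michael-asolo1"]) yields a namespace with dry_run set to True and every other flag at its default.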

Conversion algorithm

For each *.json file under /mac-mail-import/:

  1. Read the source file. Parse JSON.
  2. Read message_id from inside the envelope. If absent, parse RFC2822 headers from body_text / raw block to recover it. If still absent, log to _journal/rehash-state.json under unrecoverable and skip.
  3. Compute target_hash = sha256(message_id)[:16].
  4. Derive account from the source path (gmail-michael-asolo1 → gmail; outlook accounts map 1:1 by name per reference_mail-poller-z2.md § Accounts).
  5. Derive year_month from date field (ISO8601 parse). If date is missing, fall back to RFC2822 Date: header. If still missing, log to unrecoverable and skip.
  6. Build target path /mail-archive/{account}/{YYYY}/{MM}/{target_hash}.json.
  7. Idempotency gate: if target path exists, skip (do not overwrite). This is the safe overlap window with live polls.
  8. Normalise envelope to the CT102 schema (reference_mail-poller-z2.md § Archive Message Format) — preserve all source body content; rewrite message_hash field to target_hash.
  9. Write target file (atomic — write to .tmp, fsync, rename).
  10. Append one line to /mail-archive/_journal/mac-import.jsonl:
    {"ts":"<ISO8601>","src":"<source-relative-path>","target":"<target-relative-path>","src_hash":"<sha1:12>","target_hash":"<sha256:16>","account":"<acct>"}
  11. Every --chunk-size messages: fsync the target dir, update /mail-archive/_journal/rehash-state.json.
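
Steps 3, 5, 6, 7, 9 and 10 above can be sketched as below. All helper names are hypothetical, as is the raw_message parameter, which stands in for wherever the source envelope keeps its raw RFC2822 text. Step 8 (envelope normalisation) is left out because the schema lives in reference_mail-poller-z2.md.

```python
import email
import hashlib
import json
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path
from typing import Optional, Tuple


def year_month(date_field: str, raw_message: str = "") -> Optional[Tuple[str, str]]:
    """Step 5: try the ISO8601 date field, then the RFC2822 Date: header."""
    try:
        dt = datetime.fromisoformat(date_field)
    except (TypeError, ValueError):
        header = email.message_from_string(raw_message).get("Date")
        try:
            dt = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            return None  # caller logs to `unrecoverable` and skips
    return f"{dt.year:04d}", f"{dt.month:02d}"


def target_path(root: Path, account: str, ym: Tuple[str, str], message_id: str) -> Path:
    """Steps 3 and 6: {root}/{account}/{YYYY}/{MM}/{sha256[:16]}.json."""
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()[:16]
    return root / account / ym[0] / ym[1] / f"{digest}.json"


def write_atomic(target: Path, envelope: dict) -> bool:
    """Steps 7 and 9: skip if the target exists, else write .tmp, fsync, rename."""
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, ensure_ascii=False)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, target)  # atomic rename within one filesystem
    return True


def journal_line(src_rel: str, target_rel: str, message_id: str, account: str) -> str:
    """Step 10: one JSONL record with the field names from the spec."""
    raw = message_id.encode("utf-8")
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": src_rel,
        "target": target_rel,
        "src_hash": hashlib.sha1(raw).hexdigest()[:12],
        "target_hash": hashlib.sha256(raw).hexdigest()[:16],
        "account": account,
    })
```

The temporary file ends in .json.tmp, so a *.json glob over the archive never picks up a half-written message.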

Idempotency

The script is safe to re-run any number of times. Re-runs:

  1. With --resume, load rehash-state.json and skip every source file whose source-relative path appears in processed.
  2. Even without --resume, the per-target Path.exists() gate prevents duplicate writes.
  3. Live polls cannot collide: live writes and rehash writes target the same canonical paths with the same hash scheme. Whoever writes first wins; the other sees the existing file, skips the write, and returns False.

rehash-state.json schema:

{
  "status": "running | complete | aborted",
  "started": "<ISO8601>",
  "last_update": "<ISO8601>",
  "processed_count": 12345,
  "skipped_existing": 234,
  "unrecoverable_count": 5,
  "processed": ["gmail-michael-asolo1/2016/10/<sha1:12>.json", "..."],
  "unrecoverable": [{"src":"...", "reason":"no message-id"}, "..."]
}

The processed list can grow to ~15K entries — keep it as a flat array for simplicity at this scale (file stays <2 MB).
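
A flat array on disk does not have to mean linear scans at run time. The sketch below loads processed into a set for skip checks and writes the state file back atomically. The helper names are hypothetical, and the atomic write of the state file is an assumption borrowed from the target-file rule rather than something this spec mandates.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def load_processed(state_path: Path) -> set:
    """Return processed source-relative paths as a set for constant-time lookups."""
    if not state_path.exists():
        return set()
    state = json.loads(state_path.read_text(encoding="utf-8"))
    return set(state.get("processed", []))


def save_state(state_path: Path, state: dict) -> None:
    """Checkpoint via .tmp + fsync + rename so a crash never leaves a torn file."""
    state = dict(state, last_update=datetime.now(timezone.utc).isoformat())
    state_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = state_path.with_name(state_path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, state_path)
```

With --resume, the main loop would test each source-relative path against the returned set before reading the file over NFS.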

Chunking + checkpoint cadence

| Phase | Setting | Rationale |
|---|---|---|
| Default chunk size | 500 messages | At ~30 ms/file (NFS read + sha256 + atomic write), 500 messages ≈ 15 s per chunk. Bounds resume granularity. |
| Default checkpoint | every 500 messages | Matches chunk cadence: one journal sync per chunk. |
| Total corpus | ~14,700 messages | Expected ~30 chunks (14,700 / 500 = 29.4). At 15 s per chunk that is about 7.5 min of per-file work; full run ≈ 8–10 min wall time. |
| Disk pressure | bind-mount on the Z2 data disk | Writes batch within the same dir tree the live poller already targets. No extra I/O contention is expected, because live polls write only 5–50 messages per 15-min run. |
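
The figures in the table follow from simple arithmetic, sketched here so they can be re-derived if the per-file cost or corpus size changes. The function name is hypothetical; the inputs are the table's own estimates.

```python
import math


def estimate(corpus: int, chunk_size: int, per_file_ms: float):
    """Return (chunks, seconds_per_chunk, total_minutes) for a full run."""
    chunks = math.ceil(corpus / chunk_size)
    seconds_per_chunk = chunk_size * per_file_ms / 1000
    total_minutes = corpus * per_file_ms / 1000 / 60
    return chunks, seconds_per_chunk, total_minutes


# 14,700 messages, 500 per chunk, ~30 ms per file
chunks, per_chunk, minutes = estimate(14_700, 500, 30)
# chunks == 30, per_chunk == 15.0 (seconds), minutes is about 7.35
```

The per-file cost alone accounts for roughly 7.35 minutes, so the 8–10 minute wall-time figure leaves some headroom on top of it.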

Rollback

If mid-move corruption is detected (Python exception, NFS drop, disk error, malformed source file):

  1. Stop the script: pkill -f rehash-mac-import on CT102.
  2. Inspect rehash-state.json: the status field shows running (mid-flight) or aborted (graceful exit). processed_count and last_update confirm how far it got.
  3. The Mac source is untouched — the script never writes to /mac-mail-import/, so the original sha1[:12] tree is intact and re-runnable.
  4. Inspect the journal: _journal/mac-import.jsonl enumerates every successful write. Tail the last 100 lines to identify the suspect range.
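  5. If a known-bad range needs deletion: find /mail-archive/{account}/{YYYY}/{MM}/ -newermt "[last-known-good-ts]" -name "*.json" -delete (GNU find's -newer expects a reference file, so a timestamp needs -newermt). This also matches files the live poller wrote after that timestamp, so check the matches against the journal's target field before deleting. Then clear those entries from processed in rehash-state.json and re-run with --resume.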
  6. If corruption is global: delete rehash-state.json and the journal; the target tree's idempotency gate prevents duplicates on a clean re-run, but live polls may have written new messages in the meantime — those stay (they are correct by construction).
  7. Never delete the Mac source until rehash-state.json reports status: complete and a manual spot-check confirms that 10 random rehashed files render correctly.
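
For steps 4 and 5, the journal can identify the suspect range more precisely than file modification times, since it lists only rehash writes. The sketch below is a hypothetical helper; it assumes every ts value carries an explicit UTC offset (for example +00:00) so it can be compared with a timezone-aware cutoff.

```python
import json
from datetime import datetime
from pathlib import Path


def targets_since(journal: Path, cutoff: datetime) -> list:
    """Return target-relative paths of journal entries written at or after cutoff."""
    suspects = []
    with open(journal, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-append can leave a truncated final line; skip it.
                continue
            if datetime.fromisoformat(entry["ts"]) >= cutoff:
                suspects.append(entry["target"])
    return suspects
```

The returned paths are relative to the target root, so they can be joined onto /mail-archive and reviewed before anything is deleted.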

Source cleanup (post-completion)

After rehash-state.json reports status: complete AND a 10-file spot-check passes AND the journal line count matches processed_count:

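  1. Snapshot the Mac source path (tar -czf ~/Sandpit/hinata/resources/email-poller-pre-delete-[YYYY-MM-DD].tar.gz -C ~/Sandpit/hinata/resources email-poller/). The -C flag makes the relative email-poller/ path resolve correctly from any working directory.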
  2. Park the tarball on the Z2 data disk under /mnt/data/hinata/_one-off-backups/.
  3. Delete the Mac source path (rm -rf ~/Sandpit/hinata/resources/email-poller/).
  4. Update reference_mail-poller-z2.md § Future Work to mark the Mac archive move complete.

The tarball is a recovery artefact only — Heimerdinger reads canonical CT102 paths only, never the tarball.

Verification

After every run (or every resume), confirm:

  1. find /mail-archive -path /mail-archive/_journal -prune -o -path /mail-archive/_state -prune -o -name "*.json" -print | wc -l ≥ expected total.
  2. jq .status /mail-archive/_journal/rehash-state.json returns "complete" on the final run.
  3. wc -l /mail-archive/_journal/mac-import.jsonl matches processed_count in rehash-state.json.
  4. Sample 10 random rehashed files: jq '{message_id, message_hash, date, subject, from}' returns populated fields, no nulls.
  5. Live poll cursor /mail-archive/_state/state.json UNCHANGED from pre-cutover snapshot (md5 match) — the rehash must never touch the live cursor.
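
Checks 3 and 4 can also be scripted with the stdlib. The sketch below uses hypothetical helper names and assumes the journal's target field is relative to the archive root, as the <target-relative-path> placeholder suggests. It treats missing or null fields as failures; an empty string passes, since the contract allows it for absent fields.

```python
import json
import random
from pathlib import Path

REQUIRED = ("message_id", "message_hash", "date", "subject", "from")


def journal_matches_state(journal: Path, state_path: Path) -> bool:
    """Check 3: non-empty journal lines equal processed_count."""
    with open(journal, encoding="utf-8") as fh:
        lines = sum(1 for line in fh if line.strip())
    state = json.loads(state_path.read_text(encoding="utf-8"))
    return lines == state["processed_count"]


def sample_rehashed(journal: Path, root: Path, n: int = 10) -> list:
    """Pick up to n random rehashed files, using the journal as the source of truth."""
    with open(journal, encoding="utf-8") as fh:
        targets = [json.loads(line)["target"] for line in fh if line.strip()]
    return [root / t for t in random.sample(targets, min(n, len(targets)))]


def null_fields(path: Path) -> list:
    """Check 4: names of spot-check fields that are missing or null."""
    envelope = json.loads(path.read_text(encoding="utf-8"))
    return [key for key in REQUIRED if envelope.get(key) is None]
```

A run passes the spot-check when null_fields() returns an empty list for every path from sample_rehashed().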