Privacy-First Image Set Contract

When you’re building an on-device image capture pipeline, privacy can’t be something you “handle later.” If a full-frame or identifiable capture ever gets persisted or shipped—even briefly—you’ve already created a privacy risk and a testing headache.

A stronger pattern is to make privacy a data-format guarantee. The system should be incapable of producing an uploadable package that contains identifiable context unless a deliberately different mode is enabled. That shifts privacy from a “policy” to a contract with enforced invariants.

This post lays out a blueprint that keeps the architecture intact: the MCU writes to flash first, then streams to a Linux host over CAN. The MCU generates a minimized image set (ROI-only or downscaled) and persists it atomically via a commit marker. The host verifies completeness and CRC, enforces a strict schema allow-list, and emits a final “safe package” for downstream use (optionally encrypted). The result is privacy-by-construction that’s deterministic, testable, and resilient to partial failure.


The real issue: privacy drift in normal pipelines

Most pipelines drift because “temporary” becomes permanent. Raw frames leak into debug dumps, staging buffers get reused, partial writes become ambiguous after power loss, and host parsers start accepting “best-effort” data to keep the system moving.

Instead of trying to patch these leaks downstream, we aim for a simpler promise: only minimized content can exist in persistent storage and only minimized content can leave the device by default. Everything else becomes explicitly “unsafe mode,” gated and separated.


The approach: a contract that enforces privacy and integrity

The contract is intentionally small and strict: a manifest plus a set of payloads referenced by that manifest. Privacy and integrity are enforced at multiple boundaries, but crucially, minimization happens before flash persistence.

The flow looks like this:

  1. MCU minimizes images into an image set (ROI-only or downscaled)
  2. MCU persists the set atomically using a commit marker
  3. MCU streams the set from flash to host over CAN in CRC-checked chunks
  4. Host validates and enforces schema allow-lists
  5. Host emits a “safe package” for upload/processing (optional encryption)

This design makes failure boring. If a set isn’t committed, it doesn’t exist. If a CRC check fails, the chunk is retried. If the schema violates the allow-list, the set is rejected. No guessing.


Contract design: what “safe” means on the wire

A privacy-first contract is defined as much by what it doesn’t include as what it does. The manifest should never carry fields that indirectly reintroduce identifying context (e.g., original full-frame dimensions, raw frames, debug attachments, or “extra metadata”).

A good manifest has three characteristics: it’s versioned, it’s deterministically parseable, and it’s allow-list friendly.

Here’s an illustrative example:

{
  "contract_version": "1.0",
  "set_id": "1699871234-0042",
  "timestamp_ms": 1699871234123,
  "privacy_mode": "ROI_ONLY",
  "rgb": { "format": "JPEG", "w": 320, "h": 240 },
  "thermal": { "format": "RAW16", "w": 64, "h": 64, "scale": "0.01C" },
  "roi": { "x": 48, "y": 36, "w": 320, "h": 240, "space": "RGB_DOWNSCALED" },
  "payloads": [
    { "type": "RGB_IMAGE", "offset": 4096, "len": 18234, "crc32": "0x9a31..." },
    { "type": "THERMAL_ROI", "offset": 22528, "len": 8192, "crc32": "0x11bf..." }
  ],
  "set_crc32": "0x5c2a..."
}

The important part is the constraint: only minimized dimensions and minimized payloads exist. The contract doesn’t contain a pathway for raw context to “accidentally” ride along.


Privacy minimization at source

This is the non-negotiable boundary: minimization must happen before writing anything to flash. If full frames are ever persisted, the contract becomes a documentation exercise instead of a guarantee.

In practice, there are two workable minimization strategies:

  • ROI-only (preferred): crop RGB to the ROI and persist only that crop; persist thermal as the ROI-equivalent region.
  • Downscaled: downscale the RGB frame aggressively, and still keep the ROI as small as possible.

The firmware should be structured so the flash writer can accept only “minimized payload” types. Raw buffers may exist in RAM, but the storage API should not accept them unless you deliberately build a separate unsafe mode.
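One way to picture the "storage API accepts only minimized payloads" rule is a type gate in front of the writer. The sketch below is illustrative only: real firmware would more likely express this in C or C++ with distinct struct types, and the names MinimizedPayload, FlashWriter, and the MAX_DIMS caps are assumptions made up for the example, not part of the original design.

```python
from dataclasses import dataclass
from enum import Enum


class PayloadType(Enum):
    RGB_ROI = "RGB_ROI"
    RGB_DOWNSCALED = "RGB_DOWNSCALED"
    THERMAL_ROI = "THERMAL_ROI"


# Hypothetical caps; real limits depend on the sensors and the ROI policy.
MAX_DIMS = {
    PayloadType.RGB_ROI: (320, 240),
    PayloadType.RGB_DOWNSCALED: (320, 240),
    PayloadType.THERMAL_ROI: (64, 64),
}


@dataclass(frozen=True)
class MinimizedPayload:
    """A payload that has already been cropped or downscaled."""
    kind: PayloadType
    width: int
    height: int
    data: bytes

    def __post_init__(self):
        if not isinstance(self.kind, PayloadType):
            raise TypeError("kind must be a PayloadType")
        max_w, max_h = MAX_DIMS[self.kind]
        if not (0 < self.width <= max_w and 0 < self.height <= max_h):
            raise ValueError(f"{self.kind.name} dimensions exceed the minimized cap")


class FlashWriter:
    """Stand-in for the storage API: it accepts MinimizedPayload and nothing else."""

    def __init__(self):
        self.stored = []

    def write(self, payload):
        # Raw buffers (plain bytes) never reach persistent storage through this path.
        if not isinstance(payload, MinimizedPayload):
            raise TypeError("flash writer only accepts MinimizedPayload")
        self.stored.append(payload)
```

With this shape, passing a raw frame buffer to write() fails loudly instead of being persisted, and an oversized "ROI" cannot even be constructed.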


Atomic completeness: commit marker that makes state obvious

Flash-first pipelines must treat power loss and partial writes as normal. Your storage format should make a simple question easy to answer: is this set complete?

Two patterns work well, and you can choose based on your storage layer:

Footer-based commit marker (append-style):
Write header → append payloads → write footer (payload table + set CRC + magic) → write final COMMIT word.
On boot, a set is valid only when footer + COMMIT exist and lengths/CRC match.
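A minimal sketch of the footer-based pattern, simplified to a single payload so the ordering is easy to follow. The magic values, the footer layout, and the function names are assumptions for illustration; a real footer would carry the full payload table and set CRC described above.

```python
import struct
import zlib

HEADER_MAGIC = b"ISET"   # hypothetical header magic
COMMIT_WORD = b"CMT!"    # hypothetical commit word, always written last
FOOTER_FMT = "<II"       # payload length, payload CRC32 (little-endian)


def build_set(payload: bytes) -> bytes:
    """Lay out header -> payload -> footer -> COMMIT word, in write order."""
    footer = struct.pack(FOOTER_FMT, len(payload), zlib.crc32(payload))
    return HEADER_MAGIC + payload + footer + COMMIT_WORD


def read_committed_set(blob: bytes):
    """Return the payload if the set is complete and intact, else None."""
    footer_size = struct.calcsize(FOOTER_FMT)
    min_size = len(HEADER_MAGIC) + footer_size + len(COMMIT_WORD)
    if len(blob) < min_size:
        return None
    # No commit word means the write never finished: the set does not exist.
    if not blob.startswith(HEADER_MAGIC) or not blob.endswith(COMMIT_WORD):
        return None
    footer_start = len(blob) - len(COMMIT_WORD) - footer_size
    length, crc = struct.unpack(FOOTER_FMT, blob[footer_start:footer_start + footer_size])
    payload = blob[len(HEADER_MAGIC):footer_start]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

Because the commit word is the last thing written, a power cut at any earlier point leaves a blob that read_committed_set() treats as absent rather than as a partial set.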

Rename-based commit marker (filesystem-style):
Write into a temp name → fsync → rename to “ready.”
On boot, only “ready” names are considered valid.
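On a host or MCU filesystem with POSIX-like semantics, the rename pattern can be sketched as follows. The helper names commit_set and recover are assumptions, and the set_<id>.tmp / set_<id>.bin naming is just one convention; the sketch relies on os.replace being atomic when both names live on the same filesystem.

```python
import os


def commit_set(directory: str, set_id: str, data: bytes) -> str:
    """Write to a temp name, flush to disk, then rename to the 'ready' name."""
    tmp_path = os.path.join(directory, f"set_{set_id}.tmp")
    final_path = os.path.join(directory, f"set_{set_id}.bin")
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # make sure the bytes are on storage first
    os.replace(tmp_path, final_path)  # the rename is the commit point
    return final_path


def recover(directory: str) -> list:
    """Boot-time scan: delete leftover temp files, return committed set names."""
    ready = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".tmp"):
            os.remove(path)           # uncommitted set: treat as if it never existed
        elif name.startswith("set_") and name.endswith(".bin"):
            ready.append(name)
    return ready
```

For stricter durability on Linux you would typically also fsync the containing directory after the rename; that step is left out here to keep the sketch portable.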

The contract-friendly rule remains the same: no commit = no set. That’s what keeps ingestion deterministic.


Integrity: why you want CRC per chunk and per set

Integrity needs two layers because corruption doesn’t happen in one place.

Per-chunk CRC helps you catch corruption during transport (CAN) and enables efficient retries. Per-set CRC catches incorrect assembly, stale chunks, and mismatched payload tables.

This leads to clear behavior:

  • If chunk CRC fails, retry that chunk.
  • If chunk CRCs pass but set CRC fails, discard and re-fetch the set (or re-read from flash).

You don’t need a complex scheme here. CRC32 is fast, practical, and widely supported. Keep in mind that it detects accidental corruption, not deliberate tampering.
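The two layers can be sketched with Python's zlib.crc32. The chunk size and the function names make_chunks and receive are assumptions for illustration; the point is only that a failed chunk CRC yields a retry list, while a failed set CRC invalidates the whole assembly.

```python
import zlib

CHUNK_SIZE = 256  # hypothetical chunk size


def make_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Sender side: split data into (offset, body, crc32) tuples."""
    chunks = []
    for off in range(0, len(data), chunk_size):
        body = data[off:off + chunk_size]
        chunks.append((off, body, zlib.crc32(body)))
    return chunks


def receive(chunks, total_len: int, set_crc: int):
    """Receiver side: returns (data or None, offsets to retry, set_ok)."""
    buf = bytearray(total_len)
    retry = []
    for off, body, crc in chunks:
        # Layer 1: per-chunk CRC catches transport corruption.
        if zlib.crc32(body) != crc or off + len(body) > total_len:
            retry.append(off)
            continue
        buf[off:off + len(body)] = body
    if retry:
        return None, retry, False
    # Layer 2: per-set CRC catches stale or misassembled chunks.
    set_ok = zlib.crc32(bytes(buf)) == set_crc
    return (bytes(buf) if set_ok else None), [], set_ok
```

A chunk that was valid for a different set passes layer 1 but fails layer 2, which is exactly the case the per-set CRC exists for.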


Reliability and Privacy Guarantees

1) Contract definition (versioned + strict)

  • Define a binary image-set container with:
    • format_version, set_id, capture_time_bucket, sensor_config_hash
    • payload_directory[]: (type, offset, length, crc32)
    • ROI geometry: (x, y, w, h) in the minimized coordinate space (no original full-frame dimensions, consistent with the manifest rules above)
  • Payload types (at minimum):
    • RGB_ROI (cropped) or RGB_DOWNSCALED
    • THERMAL_ROI (16-bit)
    • optional META_MIN (only what’s needed for alignment/ordering)
  • Strict parsing rules:
    • reject unknown fields by default (or ignore with version gates)
    • hard caps on payload sizes and ROI bounds

2) Privacy minimization at source (MCU)

  • Enforce ROI-first packaging before flash write:
    • crop RGB to ROI; never store background in default mode
    • thermal stored only as ROI (not full sensor frame) unless required
  • Maintain a controlled fallback mode for engineering builds only:
    • full-frame capture behind a flag + time-limited enable
    • policy on the Linux host (CM4) prevents raw-frame upload unless explicitly permitted

3) Atomic completeness (commit markers)

  • Ensure consumers never see partial sets:
    • Rename-based commit: write set_<id>.tmp then rename to set_<id>.bin
    • or Footer-based commit: append MAGIC + version + final_crc at the end
  • Define recovery rules after power loss:
    • tmp files auto-cleaned
    • missing footer = “invalid set”

4) Integrity (CRC and verification)

  • Per-chunk CRC32 for transport reliability + fast reject
  • Per-set final CRC for end-to-end integrity
  • Verification behavior:
    • CM4 refuses to process/upload a set unless all payload CRCs pass
    • define bounded retries / re-requests for missing chunks

5) CAN streaming compatibility (flash → CM4)

  • Chunk protocol supports:
    • set_id, chunk_id, offset, len, crc32
    • selective retransmit (or at least resume-from-offset)
  • Pacing/backpressure:
    • avoid starving control frames (priority IDs / token bucket)
    • maintain stable throughput during bulk transfers
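As a rough illustration of the chunk protocol fields, here is one possible header layout with selective retransmit bookkeeping. The field widths and the names encode_chunk, decode_chunk, and missing_chunks are assumptions. Note also that a classic CAN frame carries at most 8 data bytes (CAN FD up to 64), so a chunk like this would be segmented across frames by a transport layer that the sketch does not cover.

```python
import struct
import zlib

# Hypothetical layout: set_id (u32), chunk_id (u16), offset (u32), len (u16), crc32 (u32)
HEADER_FMT = "<IHIHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 16 bytes, no padding with "<"


def encode_chunk(set_id: int, chunk_id: int, offset: int, body: bytes) -> bytes:
    """Prefix the chunk body with its header, including a CRC32 of the body."""
    header = struct.pack(HEADER_FMT, set_id, chunk_id, offset, len(body), zlib.crc32(body))
    return header + body


def decode_chunk(raw: bytes):
    """Return (set_id, chunk_id, offset, body), or None if malformed or corrupt."""
    if len(raw) < HEADER_SIZE:
        return None
    set_id, chunk_id, offset, length, crc = struct.unpack_from(HEADER_FMT, raw)
    body = raw[HEADER_SIZE:]
    if len(body) != length or zlib.crc32(body) != crc:
        return None
    return set_id, chunk_id, offset, body


def missing_chunks(expected_count: int, received_ids) -> list:
    """Chunk IDs the receiver should re-request (selective retransmit)."""
    return [i for i in range(expected_count) if i not in received_ids]
```

Carrying set_id in every chunk is what lets the receiver refuse to mix chunks from different sets during a resume.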

6) CM4 enforcement layer (privacy firewall)

  • CM4 responsibilities:
    • validate commit marker + CRCs
    • enforce schema allow-list (drop any unexpected metadata)
    • create an upload manifest (what was included + hashes)
    • optionally encrypt before upload/store (and manage retention policies)

7) Validation plan (prove it works + prove privacy)

  • Soak test:
    • sustained capture + sustained transfer with flash fill conditions
  • Fault injection:
    • dropped chunks, corrupted chunks, reboot mid-write
  • Privacy validation:
    • automated checks that only ROI payload types exist in production sets
    • the manifest lets you verify that outside-ROI content is absent (not just “ignored”)

Host enforcement: strict parsing and schema allow-lists

The host should behave like a security boundary, not a forgiving parser. That means strictness by default:

  • If contract version is unsupported → reject
  • If unknown fields appear → reject
  • If unknown payload types appear → reject
  • If sizes exceed bounds → reject
  • If CRC/completeness fails → reject

“Forward compatibility” is where privacy holes sneak in. If you want extensibility, do it intentionally with version gates and explicit support.
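The reject rules above translate fairly directly into a validator. This is a sketch under assumptions: the allowed field names mirror the illustrative manifest earlier in the post, while MAX_PAYLOAD_LEN, the set of required fields, and the ContractViolation exception are made up for the example.

```python
import json

SUPPORTED_VERSIONS = {"1.0"}
ALLOWED_TOP_LEVEL = {
    "contract_version", "set_id", "timestamp_ms", "privacy_mode",
    "rgb", "thermal", "roi", "payloads", "set_crc32",
}
REQUIRED_TOP_LEVEL = {"contract_version", "set_id", "payloads", "set_crc32"}
ALLOWED_PAYLOAD_FIELDS = {"type", "offset", "len", "crc32"}
ALLOWED_PAYLOAD_TYPES = {"RGB_IMAGE", "THERMAL_ROI"}
MAX_PAYLOAD_LEN = 64 * 1024  # hypothetical size cap


class ContractViolation(Exception):
    """Raised when a manifest does not match the contract exactly."""


def validate_manifest(text: str) -> dict:
    try:
        m = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ContractViolation("manifest is not valid JSON") from exc
    if not isinstance(m, dict):
        raise ContractViolation("manifest must be a JSON object")
    version = m.get("contract_version")
    if not isinstance(version, str) or version not in SUPPORTED_VERSIONS:
        raise ContractViolation("unsupported contract version")
    unknown = set(m) - ALLOWED_TOP_LEVEL
    if unknown:
        raise ContractViolation(f"unknown fields: {sorted(unknown)}")
    missing = REQUIRED_TOP_LEVEL - set(m)
    if missing:
        raise ContractViolation(f"missing fields: {sorted(missing)}")
    payloads = m["payloads"]
    if not isinstance(payloads, list) or not payloads:
        raise ContractViolation("payloads must be a non-empty list")
    for p in payloads:
        if not isinstance(p, dict) or set(p) != ALLOWED_PAYLOAD_FIELDS:
            raise ContractViolation("payload entry has unexpected fields")
        if p["type"] not in ALLOWED_PAYLOAD_TYPES:
            raise ContractViolation("unknown payload type")
        if type(p["len"]) is not int or not (0 < p["len"] <= MAX_PAYLOAD_LEN):
            raise ContractViolation("payload length out of bounds")
    return m
```

The validator never tries to repair or ignore anything: a single unexpected field is enough to reject the whole set, which is what keeps the host from becoming a forgiving parser.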


Safe package emission

After validation, the host emits a final “safe package” (manifest + payloads). At this point, even if you upload it, you’re uploading minimized content that passed allow-list enforcement.

Encryption is optional but often useful if you’re moving data off-device. The important point is: encryption is not the privacy mechanism here — minimization is. Encryption mainly protects against leakage in transit or storage.
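A safe package can also carry a small record of what it contains, so that later audits do not need the payloads themselves. The sketch below assumes SHA-256 as the hash and a JSON record; the function name build_upload_manifest and the record fields are illustrative, not a defined format.

```python
import hashlib
import json


def build_upload_manifest(set_id: str, payloads: dict) -> str:
    """Summarize validated payloads as (type, length, hash) without their content.

    `payloads` maps a payload type name to its already-validated bytes.
    """
    entries = [
        {"type": ptype, "len": len(body), "sha256": hashlib.sha256(body).hexdigest()}
        for ptype, body in sorted(payloads.items())
    ]
    # sort_keys plus sorted entries keeps the output deterministic for diffing.
    return json.dumps({"set_id": set_id, "payloads": entries}, sort_keys=True)
```

Because the output is deterministic for a given set, two records for the same set can be compared byte for byte during an audit.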


At Hoomanely, this contract fits naturally on top of a pipeline that already runs flash-first capture and streams to a Linux host for ingestion. The contract becomes the stable boundary shared across firmware, host-side ingestion, and downstream processing.

What this unlocks is iterative improvement: ROI heuristics and insight quality can evolve without changing the privacy guarantees, because the system still emits only minimized, contract-validated sets by default.


Validation: prove it with tests, not policy

A privacy-first contract becomes powerful when it’s testable. You want three kinds of tests to run routinely.

Soak tests validate stability under long runs: capture → persist → stream → package, repeated for hours. Your goal is a stable system with no memory creep or retry storms.

Fault injection validates determinism: power cuts mid-write, dropped/reordered CAN frames, bit flips in payload chunks. Your expected outcomes are crisp: uncommitted sets vanish, corrupted chunks are retried, and invalid sets are rejected.

Privacy checks validate minimization: automated rules confirm that only allowed dims and payload types exist, and that no fields can carry full-frame context. This is where contract-first privacy becomes provable.


Key takeaways

  • Privacy is a format guarantee, not a policy. If the artifact doesn’t contain identifiable context, no downstream service can “accidentally leak” it.
  • Minimize at the earliest point of truth. Create ROI-only / downscaled payloads before writing to flash, so raw background context is never persisted in the default path.
  • Make completeness explicit and atomic. Use a commit marker (rename/footer). If you can’t commit it, it doesn’t exist—partial sets are treated as invalid and ignored.
  • Integrity is non-negotiable. Add per-chunk and per-set CRC so corruption is detected early and deterministically. If you can’t validate it, you don’t upload it.
  • The host is a “privacy firewall.” CM4 must parse via a strict allow-list: only known payload types, bounded sizes, bounded metadata, and version-gated fields. Everything else is rejected.
  • Separate “debug” from “production” by design. If full-frame capture exists for lab debugging, it must be behind explicit flags, time limits, and a hard rule that prevents raw uploads by default.
  • Metadata is a leak vector—treat it like one. Keep only what’s required for ordering/alignment (set_id, coarse time bucket, ROI geometry, config hash). Avoid anything that implies location/environment.
  • Transport should preserve guarantees. Stream only committed sets; support resume/retry without duplicating or mixing chunks across sets.
  • Build proof into the pipeline. Emit a small “privacy manifest” per set (payload types, ROI bounds, hashes) so you can audit what left the device without storing raw content.
  • Measure it like reliability. Define pass/fail checks: “no non-ROI payloads in production,” “0 uploads without valid commit + CRC,” and continuous soak tests to ensure privacy doesn’t degrade throughput.
  • Default behavior matters more than edge cases. The goal isn’t perfect privacy only when everything goes right—it’s privacy that holds under retries, power loss, partial transfers, and real-home noise.
