Designing Hardware That Explains a Reset Event
In embedded systems, resets are often treated as normal behaviour.
A watchdog expires.
A brownout occurs.
A PMIC toggles reset.
Firmware crashes.
Power sequencing fails.
A peripheral hangs the bus.
A rail collapses temporarily.
The system resets and comes back online.
From the outside, everything appears functional again.
But the actual failure disappears the moment the reset occurs.
This is one of the biggest problems in real-world embedded debugging: the system recovers faster than engineers can understand what happened.
At Hoomanely, we learned this early while building connected edge devices with multiple processors, sensor domains, cameras, external peripherals, and a distributed power architecture. A reset event rarely originates from a single cause. More importantly, the visible symptom is often far removed from the real failure source.
A device may reboot because:
- a rail dipped for 8 milliseconds,
- a peripheral locked an interrupt line,
- a wake source oscillated,
- an external sensor partially powered through a GPIO,
- or firmware entered a timing deadlock during suspend.
After reboot, the evidence is gone.
Traditional debugging methods struggle here because reset events are inherently destructive. They wipe volatile state, restart sequencing, reinitialise peripherals, and clear the exact conditions engineers need to inspect.
This is why we stopped treating resets as isolated processor events.
Instead, we started designing hardware that can explain why the reset happened.
A Reset Is a System-Level Event, Not an MCU Event
Most designs approach reset architecture from the processor’s perspective.
The MCU has:
- NRST,
- watchdog reset,
- brownout reset,
- software reset,
- external reset.
The assumption is that understanding the processor reset source is sufficient.
In practice, modern systems contain multiple independent reset domains:
- PMIC reset logic,
- external watchdogs,
- sensor reset lines,
- radio reset domains,
- Linux processor resets,
- USB reset behaviour,
- power-good monitors,
- retention domains,
- and asynchronous external wake sources.
Now the system can enter inconsistent states where:
- one processor resets while another remains active,
- peripherals reboot while clocks remain unstable,
- or power domains collapse independently.
The processor may correctly report:
“External reset occurred.”
But that does not explain:
- which subsystem triggered it,
- what electrical condition caused it,
- or what sequence preceded it.
The reset source register is accurate but incomplete: it names the reset category, not its origin.
At system scale, reset visibility must extend beyond the MCU.
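As a minimal sketch of extending visibility beyond the MCU, the example below combines the processor's reset flags with a board-level latch that records which external source asserted reset. The bit layouts, the latch itself, and every name here are hypothetical; real MCU flag registers are vendor-specific, and the check order is an assumption rather than a rule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical MCU reset-flag bits; real layouts are vendor-specific. */
#define MCU_RST_EXTERNAL (1u << 0)
#define MCU_RST_WATCHDOG (1u << 1)
#define MCU_RST_BROWNOUT (1u << 2)
#define MCU_RST_SOFTWARE (1u << 3)

/* Hypothetical board-level latch: one bit per source wired to NRST. */
#define BRD_SRC_PMIC      (1u << 0)
#define BRD_SRC_EXT_WDT   (1u << 1)
#define BRD_SRC_PGOOD_MON (1u << 2)

enum reset_origin {
    ORIGIN_NONE,
    ORIGIN_MCU_BROWNOUT,
    ORIGIN_MCU_WATCHDOG,
    ORIGIN_SOFTWARE,
    ORIGIN_EXT_PGOOD_MONITOR,
    ORIGIN_EXT_PMIC,
    ORIGIN_EXT_WATCHDOG,
    ORIGIN_EXT_UNKNOWN,
    ORIGIN_COUNT
};

/* Combine the MCU's view with the board latch to name an origin. */
enum reset_origin explain_reset(uint32_t mcu_flags, uint32_t board_latch)
{
    /* Assumed priority: power faults first, since they can trigger the rest. */
    if (mcu_flags & MCU_RST_BROWNOUT) return ORIGIN_MCU_BROWNOUT;
    if (mcu_flags & MCU_RST_WATCHDOG) return ORIGIN_MCU_WATCHDOG;
    if (mcu_flags & MCU_RST_SOFTWARE) return ORIGIN_SOFTWARE;
    if (mcu_flags & MCU_RST_EXTERNAL) {
        /* "External reset" alone is ambiguous; the latch narrows it down. */
        if (board_latch & BRD_SRC_PGOOD_MON) return ORIGIN_EXT_PGOOD_MONITOR;
        if (board_latch & BRD_SRC_PMIC)      return ORIGIN_EXT_PMIC;
        if (board_latch & BRD_SRC_EXT_WDT)   return ORIGIN_EXT_WATCHDOG;
        return ORIGIN_EXT_UNKNOWN;
    }
    return ORIGIN_NONE;
}

const char *reset_origin_name(enum reset_origin o)
{
    static const char *const names[ORIGIN_COUNT] = {
        "none", "mcu brownout", "mcu watchdog", "software",
        "external: power-good monitor", "external: PMIC",
        "external: watchdog", "external: unknown source",
    };
    assert((unsigned)o < ORIGIN_COUNT);
    return names[o];
}
```

With no latch bits set, an external reset still decodes only as "external: unknown source", which is exactly the gap the MCU register leaves on its own.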

Most Reset Failures Are Timing Problems
One of the hardest truths in embedded systems is that many reset-related failures are not permanent faults.
They are timing failures.
A power rail rises slightly slower during cold temperature startup.
A PMIC enable line transitions before a regulator stabilises.
An interrupt arrives during suspend entry.
A peripheral exits reset before its clock domain becomes valid.
The system fails for milliseconds.
Then recovers.
These failures are dangerous because they rarely appear consistently during lab validation. They emerge:
- after hundreds of power cycles,
- under thermal drift,
- during noisy battery conditions,
- or only during partial subsystem wake scenarios.
Traditional debugging tools struggle because oscilloscopes capture only isolated windows, while firmware logs disappear during reset.
This is why reset architecture must preserve system context before recovery destroys it.

The Reset Cause Must Survive the Reset
One of the most important architectural principles we adopted was simple:
The explanation for a reset must survive the reset itself.
This sounds obvious, but most systems do not actually preserve enough information.
Common implementations only expose:
- watchdog reset,
- software reset,
- brownout reset,
- external reset.
But those categories are too broad.
For example, a watchdog timeout alone says almost nothing useful.
Questions still remain:
- Which subsystem stopped responding?
- Was storage active?
- Was the system entering sleep?
- Was DMA running?
- Did an interrupt storm occur?
- Was a peripheral holding a shared bus?
- Was the rail already unstable before timeout?
Similarly, a brownout flag does not explain:
- which rail collapsed first,
- or whether the collapse was caused by a load transient, a sequencing error, or external power instability.
Meaningful reset visibility requires preserving:
- subsystem state,
- sequencing stage,
- rail validity,
- wake ownership,
- and transition context.
Not just the reset category.
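One way to carry this context across a reset is a small record in memory that is not cleared at startup, guarded by a marker and a checksum so that power-on garbage is never mistaken for evidence. The sketch below is illustrative only: the field set, the marker value, and the function names are assumptions, and placing the record in retained RAM (a no-init linker section or backup SRAM) is toolchain-specific and not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESET_RECORD_MAGIC 0x52535443u  /* arbitrary marker value */

struct reset_record {
    uint32_t magic;
    uint8_t  seq_stage;      /* last sequencing stage reached */
    uint8_t  rails_valid;    /* bitmask of rails reporting power-good */
    uint8_t  wake_owner;     /* subsystem that owned the last wake */
    uint8_t  transition;     /* e.g. 0 = running, 1 = entering sleep */
    uint32_t active_subsys;  /* bitmask: storage, DMA, radio, ... */
    uint32_t checksum;       /* FNV-1a over every byte before this field */
};

static uint32_t record_checksum(const struct reset_record *r)
{
    const uint8_t *p = (const uint8_t *)r;
    uint32_t h = 2166136261u;                    /* FNV-1a offset basis */
    for (size_t i = 0; i < offsetof(struct reset_record, checksum); i++) {
        h ^= p[i];
        h *= 16777619u;                          /* FNV-1a prime */
    }
    return h;
}

/* Call after updating any field, so the record is valid if reset hits next. */
void record_commit(struct reset_record *r)
{
    assert(r != NULL);
    r->magic = RESET_RECORD_MAGIC;
    r->checksum = record_checksum(r);
}

/* Call once early in boot: copies out a valid record, then invalidates it. */
bool record_take(struct reset_record *r, struct reset_record *out)
{
    assert(r != NULL && out != NULL);
    bool ok = (r->magic == RESET_RECORD_MAGIC) &&
              (r->checksum == record_checksum(r));
    if (ok)
        *out = *r;
    r->magic = 0;  /* consume, so a stale record is not reported twice */
    return ok;
}
```

Invalidating the record on read is a deliberate choice here: a later reset that leaves no record then shows up as "no context" rather than replaying the previous failure's state.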

Power Sequencing Should Leave Breadcrumbs
One of the most effective approaches we implemented was making power sequencing self-observable.
Instead of treating sequencing as hidden hardware behaviour, we began exposing sequencing progression explicitly through:
- GPIO state indicators,
- latchable fault flags,
- persistent PMIC status,
- rail-good monitoring,
- subsystem handshake signals,
- and retained debug state.
Now, if a subsystem fails during bring-up, the system can identify:
- how far sequencing progressed,
- which rail failed validation,
- which domain never acknowledged readiness,
- or which dependency violated timing constraints.
This transforms debugging dramatically.
Without sequencing visibility:
“The board resets randomly.”
With sequencing visibility:
“Sensor domain acknowledged power-good, but camera rail never stabilised before watchdog expiration.”
The second statement is actionable engineering information.
The first is noise.
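The breadcrumb idea can be sketched as a bitmask in which each sequencing stage sets its bit on completion, so the lowest clear bit after a failure names the stage that never finished. The stage list and function names below are hypothetical, and the mask is assumed to live somewhere that survives the reset, such as retained memory or externally latched GPIOs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bring-up order for a board with sensor and camera domains. */
enum seq_stage {
    SEQ_CORE_RAIL_GOOD,
    SEQ_IO_RAIL_GOOD,
    SEQ_SENSOR_PGOOD_ACK,
    SEQ_CAMERA_RAIL_GOOD,
    SEQ_APP_READY,
    SEQ_STAGE_COUNT
};

/* Record that a stage completed. */
void seq_mark(uint32_t *crumbs, enum seq_stage s)
{
    assert(crumbs != 0 && (unsigned)s < SEQ_STAGE_COUNT);
    *crumbs |= 1u << s;
}

/* First stage that never completed, or SEQ_STAGE_COUNT if all did. */
enum seq_stage seq_first_incomplete(uint32_t crumbs)
{
    for (int s = 0; s < SEQ_STAGE_COUNT; s++) {
        if (!(crumbs & (1u << s)))
            return (enum seq_stage)s;
    }
    return SEQ_STAGE_COUNT;
}

const char *seq_stage_name(enum seq_stage s)
{
    static const char *const names[SEQ_STAGE_COUNT + 1] = {
        "core rail good", "I/O rail good", "sensor power-good ack",
        "camera rail good", "application ready", "sequencing complete",
    };
    assert((unsigned)s <= SEQ_STAGE_COUNT);
    return names[s];
}
```

If the core, I/O, and sensor stages were marked before the reset, `seq_first_incomplete` returns `SEQ_CAMERA_RAIL_GOOD`, which maps directly onto the kind of actionable statement quoted above.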

Coordinated Reset Architecture Matters
In distributed systems, resets must behave predictably across subsystems.
A common mistake is allowing every processor or peripheral to reset independently without coordination.
Initially, this appears modular.
In practice, it creates inconsistent system states.
For example:
- Linux resets while sensor MCUs remain active,
- peripheral state machines continue running,
- shared buses remain busy,
- or retained interrupts survive unexpectedly.
Now the restarted processor re-enters a system whose external state no longer matches initialisation assumptions.
These are some of the hardest bugs to reproduce because the reset itself creates nondeterministic system conditions.
At Hoomanely, we gradually shifted toward reset orchestration rather than isolated reset ownership.
Some resets must remain local.
Others must propagate across domains intentionally.
The key is that reset behaviour should be architecturally defined, not accidentally emergent.
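One way to make that definition explicit is a propagation table: for each domain, the set of domains that must be reset along with it. The sketch below computes the full set affected by a reset in one domain. The domain list and the policy are invented for illustration and are not a description of any particular product.

```c
#include <assert.h>
#include <stdint.h>

enum domain {
    DOM_HOST, DOM_SENSOR_MCU, DOM_SENSOR_BUS, DOM_CAMERA, DOM_RADIO,
    DOM_COUNT
};
#define DOM_BIT(d) (1u << (d))

/* Hypothetical policy: entry d lists the domains that must follow d. */
const uint32_t reset_policy[DOM_COUNT] = {
    [DOM_HOST]       = DOM_BIT(DOM_SENSOR_MCU) | DOM_BIT(DOM_CAMERA),
    [DOM_SENSOR_MCU] = DOM_BIT(DOM_SENSOR_BUS),
    [DOM_SENSOR_BUS] = 0,
    [DOM_CAMERA]     = 0,
    [DOM_RADIO]      = 0,  /* assumed to stay local */
};

/* Every domain that ends up reset when `origin` resets (transitive). */
uint32_t reset_closure(const uint32_t policy[], int n, int origin)
{
    assert(policy != 0 && n > 0 && n <= 32 && origin >= 0 && origin < n);
    uint32_t set = 1u << origin;
    uint32_t prev;
    do {
        prev = set;
        for (int d = 0; d < n; d++) {
            if (set & (1u << d))
                set |= policy[d];
        }
    } while (set != prev);  /* repeat until no new domain is added */
    return set;
}
```

Under this example policy a host reset pulls in the sensor MCU, the sensor bus behind it, and the camera, while a radio reset stays local. Keeping the policy in a single table makes the propagation reviewable instead of leaving it implicit in wiring.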

Reset Boundaries Are Architectural Boundaries
One of the clearest indicators of system maturity is how intentionally reset boundaries are designed.
In weaker architectures, resets spread unpredictably:
- one subsystem glitches another,
- partial rail collapse triggers unrelated domains,
- shared enables create cascading resets,
- or watchdog recovery unintentionally destabilises neighbouring logic.
In stronger architectures, reset domains are explicit.
Subsystems understand:
- who can reset them,
- which rails affect them,
- what dependencies must exist before startup,
- and which signals remain valid during recovery.
This becomes especially important in systems with:
- external sensors,
- detachable modules,
- SOM-carrier separation,
- distributed processors,
- or mixed-voltage domains.
Reset isolation is not just about fault recovery.
It is about preventing unrelated failures from propagating unnecessarily.

Hardware Should Explain Failure Without Firmware
One major design mistake is relying entirely on firmware logging for reset diagnosis.
Firmware is fragile during failures.
Especially during:
- watchdog events,
- clock instability,
- stack corruption,
- memory faults,
- power collapse,
- or partial resets.
In many critical failures, firmware never gets the opportunity to log useful information.
This is why hardware-level observability matters.
Some of the most valuable debug mechanisms are extremely simple:
- retained fault latches,
- persistent rail indicators,
- sequencing LEDs,
- reset-source propagation lines,
- hardware fault aggregation,
- or GPIO-driven debug checkpoints.
These mechanisms survive conditions where firmware cannot.
More importantly, they provide visibility during the exact moments where software loses control.

Brownouts Rarely Look Like Brownouts
One of the most misleading reset categories is the brownout reset.
Engineers often imagine brownouts as large visible voltage drops.
Real brownouts are frequently much smaller:
- transient dips,
- sequencing instability,
- localised rail collapse,
- fast load spikes,
- or temporary regulator saturation.
Some last only microseconds.
Yet those brief events can:
- corrupt transactions,
- destabilise clocks,
- violate timing assumptions,
- or trigger undefined peripheral behaviour before the processor itself resets.
In distributed systems, different rails react differently to the same event.
One subsystem may recover immediately. Another may partially reset. Another may remain electrically unstable without fully rebooting.
This creates inconsistent system behaviour that looks like firmware instability while actually originating from power integrity.
This is why rail visibility and sequencing observability become essential during reset analysis.
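As a rough illustration of rail visibility, the sketch below scans a buffer of sampled rail voltages for the first excursion below a threshold and reports when it started, how long it lasted, and how deep it went. The struct, function name, and thresholds are assumptions. Sampling only catches dips longer than the sample period, so microsecond-scale events would need a hardware comparator or supervisor latch instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rail_dip {
    size_t   start;   /* index of the first sample below threshold */
    size_t   length;  /* number of consecutive samples below threshold */
    uint16_t min_mv;  /* lowest sample seen during the dip */
};

/* Find the first run of samples below threshold_mv; false if there is none. */
bool find_first_dip(const uint16_t *mv, size_t n, uint16_t threshold_mv,
                    struct rail_dip *out)
{
    assert(mv != NULL && out != NULL);
    size_t i = 0;
    while (i < n && mv[i] >= threshold_mv)
        i++;
    if (i == n)
        return false;

    out->start = i;
    out->length = 0;
    out->min_mv = mv[i];
    while (i < n && mv[i] < threshold_mv) {
        if (mv[i] < out->min_mv)
            out->min_mv = mv[i];
        out->length++;
        i++;
    }
    return true;
}
```

Running this over buffers captured from several rails on a shared timebase and comparing the `start` indices gives a first answer to which rail collapsed first.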

Watchdogs Should Explain What Timed Out
Many systems implement watchdogs as binary recovery tools:
- system alive,
- or system dead.
But watchdog architecture becomes far more useful when paired with subsystem observability.
Instead of simply resetting the processor, the system should preserve:
- which subsystem stopped responding,
- which task failed heartbeat validation,
- whether storage or communication was active,
- and what operating mode existed before timeout.
Otherwise the watchdog only tells engineers:
“Something failed eventually.”
That is rarely enough.
A well-designed watchdog system narrows the problem space immediately.
This is especially important in systems containing:
- asynchronous peripherals,
- distributed processors,
- external sensors,
- and multiple timing domains.
Because not all watchdog failures originate in the processor itself.
Sometimes the processor is merely the victim of another subsystem failing first.
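A common pattern for this is a heartbeat supervisor: each task checks in during a window, and the hardware watchdog is serviced only if every required task did. The sketch below records which tasks were missing before the watchdog is allowed to expire. The task bits, struct, and function names are hypothetical, and on real hardware the `last_missing` and `mode_at_fault` fields would need to live in memory that survives the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task identifiers, one bit each. */
#define TASK_STORAGE (1u << 0)
#define TASK_COMMS   (1u << 1)
#define TASK_SENSORS (1u << 2)

struct wdt_supervisor {
    uint32_t required;       /* tasks that must check in every window */
    uint32_t seen;           /* tasks that have checked in this window */
    uint32_t last_missing;   /* tasks absent in the last failed window */
    uint8_t  mode_at_fault;  /* operating mode when that window failed */
};

/* Called by each task when it makes forward progress. */
void wdt_checkin(struct wdt_supervisor *s, uint32_t task_bit)
{
    assert(s != 0);
    s->seen |= task_bit;
}

/*
 * Called once per window. Returns true if the caller should service the
 * hardware watchdog. On false, the missing tasks are recorded and the
 * caller lets the watchdog expire.
 */
bool wdt_window_ok(struct wdt_supervisor *s, uint8_t current_mode)
{
    assert(s != 0);
    uint32_t missing = s->required & ~s->seen;
    s->seen = 0;  /* start a fresh window */
    if (missing != 0) {
        s->last_missing = missing;
        s->mode_at_fault = current_mode;
        return false;
    }
    return true;
}
```

After the reset, a non-zero `last_missing` turns "something failed eventually" into "the comms task stopped checking in while the system was in mode 2".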

Reset Recovery Must Be Deterministic
A reset event is only half the problem.
Recovery behaviour matters equally.
Many systems recover inconsistently because startup assumptions are invalid after partial subsystem resets.
For example:
- retained peripherals may still hold interrupts,
- buses may remain busy,
- sensors may require stabilisation delays,
- or external domains may not have fully collapsed.
Now recovery timing changes depending on the previous failure mode.
At Hoomanely, we learned that deterministic recovery requires:
- explicit startup sequencing,
- rail validation before initialisation,
- subsystem acknowledgement,
- timeout-aware bring-up,
- and controlled dependency ordering.
Otherwise, resets simply hide instability temporarily before the next failure occurs.
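The requirements above can be sketched as an ordered list of bring-up steps, each with a readiness poll and a timeout, where the sequence stops at the first step that does not become ready. All names are illustrative. The `sim_` functions are simulated stand-ins for reading a power-good pin or a subsystem acknowledgement, and the `elapsed_ms` argument exists only so that the simulation has something to react to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct bringup_step {
    const char *name;
    bool (*is_ready)(uint32_t elapsed_ms);  /* rail-good or ack poll */
    uint32_t timeout_ms;
};

/*
 * Run the steps in order, so earlier entries act as dependencies of later
 * ones. Returns n on success, or the index of the first step that timed out.
 */
int bringup_run(const struct bringup_step *steps, int n,
                void (*delay_ms)(uint32_t))
{
    assert(steps != 0 && delay_ms != 0);
    for (int i = 0; i < n; i++) {
        uint32_t elapsed = 0;
        while (!steps[i].is_ready(elapsed)) {
            if (elapsed >= steps[i].timeout_ms)
                return i;  /* stop here; later steps depend on this one */
            delay_ms(1);
            elapsed++;
        }
    }
    return n;
}

/* Simulated stand-ins for hardware polls and a millisecond delay. */
bool sim_ready_after_2ms(uint32_t elapsed_ms) { return elapsed_ms >= 2; }
bool sim_never_ready(uint32_t elapsed_ms) { (void)elapsed_ms; return false; }
void sim_delay_ms(uint32_t ms) { (void)ms; }
```

Returning the index of the failed step, rather than a bare failure flag, keeps bring-up itself diagnosable: the caller can store that index before deciding whether to retry or hold the system in a safe state.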

Designing Systems That Explain Themselves
The deeper lesson is this:
Reliable systems are not defined by the absence of resets.
They are defined by how clearly they explain reset events.
In complex embedded devices, failures will eventually occur:
- power instability,
- unexpected timing interactions,
- environmental variation,
- asynchronous wake behaviour,
- subsystem faults.
The question is whether the system preserves enough context to make those failures diagnosable.
At Hoomanely, we increasingly design observability directly into the hardware architecture itself.
Not as an afterthought.
Not as a debug-only feature.
But as part of the system contract.
Because modern embedded systems are too distributed, too asynchronous, and too timing-sensitive to rely entirely on post-failure software logging.
The hardware itself must participate in explaining what happened.

Final Thoughts
Reset architecture is often treated as simple infrastructure.
In reality, it is one of the most important observability systems inside an embedded product.
How a system resets reflects its:
- power behaviour,
- subsystem coordination,
- timing integrity,
- sequencing correctness,
- and architectural maturity.
When reset visibility is weak, engineers spend days reproducing failures.
When reset visibility is strong, failures become diagnosable almost immediately.
The difference is not luck.
It is whether the system was intentionally designed to preserve context during failure.
Because in modern embedded systems, the hardest bugs are rarely the ones that crash permanently.
They are the ones that recover before anyone understands why they happened.