Preventing One Feature From Bricking the Whole Device

Modern embedded products are no longer single-purpose machines.

A connected edge device may capture images, process sensor data, communicate over wireless or wired interfaces, store logs, run OTA updates, manage power states, control LEDs, detect proximity, and coordinate with other boards, all inside one compact system.

Each feature may work correctly on its own.

The camera works.
The sensor pipeline works.
The communication stack works.
The OTA mechanism works.
The power management logic works.

But the real question is different:

Can one broken feature take down the entire device?

That is where product-grade embedded design begins.

At Hoomanely, this question comes up often because our systems are not built as isolated demo boards. They are built as long-running, field-deployed products where multiple subsystems must coexist reliably. A feature may fail, a peripheral may hang, a firmware module may behave unexpectedly, or a power domain may become unstable. But the device should not become unrecoverable because one part of the system failed.

A good embedded product is not one where nothing ever goes wrong.

A good embedded product is one where failure stays contained.

A Feature Failure Should Not Become a System Failure

One of the most common architectural mistakes in embedded systems is allowing every feature to share the same critical path.

A new feature gets added. It needs a sensor, a driver, a task, a power rail, a memory buffer, a communication interface, and some persistent storage. The feature works during development, so it gets integrated deeply into the main application flow.

Over time, this pattern repeats.

Eventually, the device reaches a point where one feature can block boot, stall the main loop, exhaust memory, overload a rail, corrupt shared storage, or prevent firmware updates.

That is how a feature becomes dangerous.

Not because the feature itself is bad.

But because the system gives it too much authority.

A camera feature should not be able to permanently block the device from booting.
A sensor driver should not be able to prevent recovery mode.
A wireless stack should not be able to corrupt the only working firmware image.
A UI feature should not be able to disable debug access.
A logging feature should not be able to fill storage until the system becomes unusable.

The core design principle is simple:

Every feature must have a boundary.

Without boundaries, features slowly become system-level risks.

Bricking Usually Comes From Shared Assumptions

When people think about bricking a device, they usually imagine a failed firmware update. That is one type of bricking, but it is not the only one.

A device can become practically bricked even when firmware is technically still present.

For example, the system may boot but immediately crash because a peripheral never responds. It may enter a watchdog loop because a task blocks forever. It may keep resetting because one feature enables a high-current load too early. It may fail to connect because a corrupted configuration file prevents networking. It may never reach OTA recovery because the boot sequence waits indefinitely for a non-critical sensor.

In all these cases, the device is not dead in the physical sense.

But from a user or field-support perspective, it is unavailable.

This happens when the system assumes that every feature will behave correctly every time.

Real products cannot make that assumption.

A robust product must assume that:

  • peripherals may be missing,
  • sensors may return invalid data,
  • storage may become full or corrupted,
  • network connectivity may fail,
  • power rails may droop,
  • firmware modules may crash,
  • and user-triggered flows may happen at inconvenient times.

The system should continue to provide a recovery path even when any of these failures actually occur.

Boot Must Belong to the System, Not to Features

The boot path is the most important part of device recoverability.

If a feature can block boot, that feature has too much control.

During boot, the system should bring up only what is required to establish a stable, recoverable platform. Everything else should be treated as optional until proven healthy.

A common mistake is initialising every subsystem during early boot because it feels convenient. The system powers up, initialises sensors, starts communication interfaces, mounts storage, loads configuration, starts feature services, and then announces readiness.

This works beautifully when every subsystem behaves.

But if one non-critical sensor fails, boot may stall. If a bus hangs, the entire device may appear dead. If a configuration file is corrupted, the application may crash before enabling recovery mechanisms.

At Hoomanely, we prefer a more layered boot philosophy.

The device should first establish a minimum safe state. That means power is stable, the processor is alive, reset cause is understood, debug or recovery access is available, and the firmware can make decisions. Only after that should higher-level features start.

A practical boot hierarchy looks like this:

  1. Establish power and reset stability.
  2. Start the minimum runtime required for recovery.
  3. Validate storage and configuration.
  4. Bring up communication or service access where possible.
  5. Initialise sensors and feature modules.
  6. Mark individual features healthy or degraded.

This structure ensures that optional feature failure does not prevent the system from becoming recoverable.
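The hierarchy above can be sketched in C as follows. This is a minimal illustration, not Hoomanely's actual boot code: the `platform_ops_t` hooks, the `feature_t` table, and the `boot_result_t` values are hypothetical names introduced here. The sketch also assumes each `init` callback returns within a bounded time; a real system would enforce that with timeouts or a supervisor.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { FEATURE_NOT_STARTED, FEATURE_HEALTHY, FEATURE_DEGRADED } feature_state_t;

typedef struct {
    const char *name;
    bool (*init)(void);        /* assumed to return within a bounded time */
    feature_state_t state;
} feature_t;

/* Hypothetical platform hooks, one per early boot stage. */
typedef struct {
    bool (*power_stable)(void);            /* stage 1 */
    bool (*start_recovery_runtime)(void);  /* stage 2 */
    bool (*storage_valid)(void);           /* stage 3 */
    bool (*start_comms)(void);             /* stage 4 */
} platform_ops_t;

typedef enum { BOOT_FAILED, BOOT_RECOVERY_ONLY, BOOT_DEGRADED, BOOT_FULL } boot_result_t;

boot_result_t staged_boot(const platform_ops_t *ops, feature_t *features, size_t count)
{
    /* Stages 1-2: the only steps that are allowed to stop boot. */
    if (!ops->power_stable() || !ops->start_recovery_runtime())
        return BOOT_FAILED;

    /* Stages 3-4: both are attempted; neither blocks the recoverable core. */
    bool storage_ok = ops->storage_valid();
    bool comms_ok = ops->start_comms();
    if (!storage_ok)
        return BOOT_RECOVERY_ONLY;   /* features never run on unvalidated config */

    /* Stages 5-6: each feature is initialised and marked individually. */
    bool all_ok = comms_ok;
    for (size_t i = 0; i < count; ++i) {
        bool ok = features[i].init();
        features[i].state = ok ? FEATURE_HEALTHY : FEATURE_DEGRADED;
        all_ok = all_ok && ok;
    }
    return all_ok ? BOOT_FULL : BOOT_DEGRADED;
}

/* Stand-ins for real drivers, used only for the example below. */
static bool stub_ok(void)   { return true; }
static bool stub_fail(void) { return false; }

boot_result_t example_boot(void)
{
    const platform_ops_t ops = { stub_ok, stub_ok, stub_ok, stub_ok };
    feature_t features[] = {
        { "camera", stub_fail, FEATURE_NOT_STARTED },  /* fails, boot continues */
        { "imu",    stub_ok,   FEATURE_NOT_STARTED },
    };
    return staged_boot(&ops, features, sizeof features / sizeof features[0]);
}
```

In `example_boot`, the camera fails to initialise, yet the function still returns `BOOT_DEGRADED` rather than stopping: the failure is recorded against the feature, not against the system.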

The device should be able to say:

“This feature failed, but the system is alive.”

That difference matters.

Feature Isolation Starts in Hardware

Firmware isolation is important, but hardware isolation is just as important.

If a feature controls a high-current load, shared rail, wake source, reset line, or bus interface, it can affect much more than its own function.

For example, an illumination feature may pull down a shared rail if enabled at the wrong time. A sensor connected to a shared I2C bus may hold SDA low and block other devices. A level shifter may back-power a disabled domain. A peripheral may keep an interrupt line asserted and prevent sleep. A communication transceiver may remain in an unexpected state and disturb bus behaviour.

These are not software-only problems.

They are architecture problems.

Hardware must define clear containment boundaries:

  • A feature power rail should be switchable or isolatable where necessary.
  • A noisy or high-load feature should not share sensitive rails without proper isolation.
  • A hung peripheral should not permanently lock the only control bus.
  • A resettable subsystem should have a way to recover independently.
  • A feature enable line should fail to a safe state.
  • A missing peripheral should not create an undefined electrical condition.

The goal is not to make every feature completely independent. That would be expensive and impractical.

The goal is to prevent one feature from gaining uncontrolled influence over the full system.

Shared Buses Need Failure Containment

Shared communication buses are one of the most common places where one feature can affect many others.

I2C is a classic example. It is simple, widely used, and convenient. But one misbehaving device can hold the bus low and block communication for every other device on that bus.

The same concept applies to other shared resources: SPI chip-select mistakes, UART framing issues, CAN transceiver state problems, shared interrupt lines, and shared reset signals.

A shared bus should be treated as a shared failure domain.

That means the design should include a recovery strategy.

Can the bus be reset?
Can the failing device be power-cycled independently?
Can the firmware detect a stuck bus?
Can the system continue operating in degraded mode?
Can critical devices be separated from experimental or non-critical devices?

In early prototypes, it is tempting to connect many peripherals to one shared bus because it saves pins and routing effort. In production systems, that convenience can become a reliability risk.

For critical designs, bus architecture should be reviewed not only for functionality, but also for failure behaviour.

The question is not only:

“Can all devices communicate?”

It is also:

“What happens when one device stops communicating correctly?”
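For the stuck-I2C case, one widely used first step is the bus-clear procedure described in the I2C-bus specification: if SDA is held low, the controller toggles SCL for up to nine clock pulses and checks whether the device lets go. The sketch below shows that loop under some assumptions: the `i2c_pins_t` callbacks are a hypothetical GPIO abstraction, and the `sim_*` functions simulate a stuck device so the example can run off-target. A real driver would normally follow a successful clear with a STOP condition and a peripheral re-initialisation, which are omitted here.

```c
#include <stdbool.h>

#define I2C_BUS_CLEAR_MAX_PULSES 9

/* Hypothetical bit-bang access to the bus pins. */
typedef struct {
    void (*set_scl)(bool high);
    bool (*read_sda)(void);            /* true when SDA is released (high) */
    void (*delay_half_period)(void);
} i2c_pins_t;

/* Clock SCL until SDA is released or nine pulses have been sent.
 * Returns true if the bus is free afterwards. */
bool i2c_bus_clear(const i2c_pins_t *pins)
{
    for (int pulse = 0;
         pulse < I2C_BUS_CLEAR_MAX_PULSES && !pins->read_sda();
         ++pulse) {
        pins->set_scl(false);
        pins->delay_half_period();
        pins->set_scl(true);
        pins->delay_half_period();
    }
    return pins->read_sda();
}

/* --- Simulated device: holds SDA low until it has seen enough clocks. --- */
static int  sim_clocks_needed;
static bool sim_scl_level = true;

static void sim_set_scl(bool high)
{
    if (high && !sim_scl_level && sim_clocks_needed > 0)
        sim_clocks_needed--;           /* count rising edges */
    sim_scl_level = high;
}

static bool sim_read_sda(void) { return sim_clocks_needed == 0; }
static void sim_delay(void)    { /* no timing needed in simulation */ }

/* Returns true if a device needing `clocks_needed` pulses was recovered. */
bool simulate_stuck_device(int clocks_needed)
{
    const i2c_pins_t pins = { sim_set_scl, sim_read_sda, sim_delay };
    sim_clocks_needed = clocks_needed;
    sim_scl_level = true;
    return i2c_bus_clear(&pins);
}
```

When `i2c_bus_clear` returns false, the bus did not recover by clocking alone. That is the point where the other questions matter: the firmware needs a way to power-cycle the offending device or to carry on in degraded mode.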

Storage Must Never Be a Single Point of Failure

Persistent storage is another common source of device bricking.

A feature may write logs, store images, cache metadata, save configuration, or maintain local state. If storage is not controlled carefully, a feature can consume all available space, corrupt shared files, or leave the system unable to boot cleanly.

This is especially dangerous when configuration, logs, OTA metadata, and runtime data share the same storage region without strict boundaries.

A good storage architecture treats persistence as a controlled system resource, not an unlimited feature convenience.

Important practices include:

  • separating critical configuration from bulk logs,
  • limiting log growth,
  • using atomic writes for important state,
  • validating configuration before applying it,
  • maintaining known-good defaults,
  • preserving OTA rollback metadata,
  • and ensuring recovery mode does not depend on writable feature data.

A logging feature should never brick the device by filling storage.

A corrupted feature cache should never prevent boot.

A failed write should never destroy the last known-good configuration.

Storage failures are inevitable over time. The system must be designed so that storage degradation reduces functionality before it eliminates recoverability.
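Several of these practices can be combined in a small two-slot configuration store. The sketch below is one possible shape, not a description of any particular product: the slot layout, `CONFIG_MAGIC`, the field names, and the default values are all assumptions, and the two slots live in RAM here where a real device would map them to separate flash sectors. A new configuration is always written to the slot that is not currently active, so an interrupted write can only damage the new copy. Loading validates a magic value and a CRC-32, and falls back to built-in defaults if neither slot is valid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONFIG_MAGIC 0x43464731u   /* "CFG1", hypothetical marker */

typedef struct {
    uint32_t log_level;
    uint32_t upload_interval_s;
} config_t;

typedef struct {
    uint32_t magic;
    uint32_t sequence;   /* higher means newer; wraparound ignored in this sketch */
    config_t cfg;
    uint32_t crc;        /* CRC-32 over every field before it */
} config_slot_t;

static const config_t DEFAULT_CONFIG = { 1u, 3600u };

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t slot_crc(const config_slot_t *s)
{
    return crc32_bytes((const uint8_t *)s, offsetof(config_slot_t, crc));
}

static bool slot_valid(const config_slot_t *s)
{
    return s->magic == CONFIG_MAGIC && s->crc == slot_crc(s);
}

/* Copies the newest valid config to *out. Returns the slot index used,
 * or -1 if neither slot was valid and defaults were applied. */
int config_load(const config_slot_t slots[2], config_t *out)
{
    int best = -1;
    for (int i = 0; i < 2; ++i) {
        if (slot_valid(&slots[i]) &&
            (best < 0 || slots[i].sequence > slots[best].sequence))
            best = i;
    }
    *out = (best >= 0) ? slots[best].cfg : DEFAULT_CONFIG;
    return best;
}

/* Writes cfg to the inactive slot; the last known-good slot is untouched. */
void config_save(config_slot_t slots[2], const config_t *cfg)
{
    config_t current;
    int active = config_load(slots, &current);
    int target = (active == 0) ? 1 : 0;

    config_slot_t s;
    s.magic = CONFIG_MAGIC;
    s.sequence = (active >= 0) ? slots[active].sequence + 1u : 1u;
    s.cfg = *cfg;
    s.crc = slot_crc(&s);
    slots[target] = s;
}
```

If the newest slot is later found corrupted, `config_load` silently returns to the previous slot. Functionality may regress to older settings, but the device keeps a valid configuration to boot with.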

OTA Must Be Protected From Feature Complexity

OTA update systems deserve special protection.

A device that can update itself can recover from many field issues. But if the OTA path depends too heavily on normal application behaviour, it can become fragile.

For example, if OTA requires the full application to boot, all feature services to initialise, network configuration to load correctly, storage to be fully healthy, and no watchdog loops to occur, then OTA is not truly a recovery mechanism. It is just another feature.

A robust OTA architecture should be closer to infrastructure than application logic.

It should be protected from ordinary feature failures as much as possible.

This usually means:

  • a bootloader or recovery path that remains independent,
  • firmware image validation,
  • rollback support,
  • separation between update metadata and application logs,
  • safe handling of interrupted updates,
  • and a way to enter recovery even when the main application is unhealthy.

The most dangerous OTA failure is not simply a failed download.

It is a system that loses the ability to update because another feature broke first.

That is how recoverable bugs become field-service problems.
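Rollback support is often built around a small piece of boot metadata that the bootloader owns. The following sketch assumes a two-slot (A/B) layout with hypothetical fields: `image_valid` stands for the result of an image check performed elsewhere, `confirmed` is set by the application once it has run healthily, and `trial_boots_left` limits how many times an unconfirmed image may be tried. None of these names come from a specific bootloader.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_RECOVERY (-1)   /* no bootable slot: stay in recovery */

typedef struct {
    bool    image_valid;       /* integrity/signature check result (computed elsewhere) */
    bool    confirmed;         /* application reported a healthy run */
    uint8_t trial_boots_left;  /* remaining attempts for an unconfirmed image */
} slot_meta_t;

typedef struct {
    slot_meta_t slot[2];
    uint8_t     preferred;     /* slot selected by the most recent update */
} boot_meta_t;

/* Decide which slot to boot. Updates the metadata as a side effect, so the
 * caller is expected to persist *m before jumping to the image. */
int select_boot_slot(boot_meta_t *m)
{
    int pref = m->preferred & 1;
    int other = pref ^ 1;
    slot_meta_t *p = &m->slot[pref];

    if (p->image_valid) {
        if (p->confirmed)
            return pref;
        if (p->trial_boots_left > 0) {
            p->trial_boots_left--;     /* spend one trial attempt */
            return pref;
        }
    }

    /* Preferred image is invalid or never confirmed itself: roll back. */
    if (m->slot[other].image_valid && m->slot[other].confirmed) {
        m->preferred = (uint8_t)other;
        return other;
    }
    return BOOT_RECOVERY;
}

/* Called by the application once it considers itself healthy. */
void confirm_running_image(boot_meta_t *m, int running_slot)
{
    m->slot[running_slot & 1].confirmed = true;
}
```

The important property is that the decision never depends on the application: a new image that crashes before calling `confirm_running_image` simply runs out of trial boots, and the previous confirmed image is selected again.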

Feature Flags Are Not Just Product Tools

In cloud software, feature flags are often used for controlled rollouts.

In embedded systems, they can also be used for containment.

A feature flag allows a device or fleet to disable a problematic feature without replacing the whole firmware image immediately. This can be extremely valuable when a feature behaves unexpectedly under field conditions that were not reproduced during validation.

But feature flags must be designed carefully.

If a feature flag is stored in a corrupted configuration region, depends on the failing feature itself, or is applied too late in the boot process, it may not help.

A good embedded feature flag system should be available early, simple to evaluate, and safe by default.

The system should be able to boot and decide:

“This feature is disabled until proven safe.”

This is especially useful for features that interact with power, sensors, storage, or communications.

Feature flags are not a substitute for good architecture.

But they are a powerful layer of operational safety.
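As a rough illustration of "available early, simple to evaluate, and safe by default", the sketch below stores the flag word next to its bitwise complement. The feature bits, `flag_record_t`, and the all-disabled default are assumptions made for this example. If the two words do not match, which includes erased (all `0xFF`) and zeroed storage, every optional feature is treated as disabled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits. */
#define FEATURE_CAMERA        (1u << 0)
#define FEATURE_ILLUMINATION  (1u << 1)
#define FEATURE_BULK_LOGGING  (1u << 2)

/* With no trustworthy record, optional features stay off. */
#define SAFE_DEFAULT_FLAGS    0u

typedef struct {
    uint32_t flags;
    uint32_t flags_inverted;   /* must equal ~flags for the record to be trusted */
} flag_record_t;

/* Build a record for storage. */
flag_record_t flag_record_make(uint32_t flags)
{
    flag_record_t rec = { flags, (uint32_t)~flags };
    return rec;
}

/* Cheap enough to run before any feature is initialised. */
uint32_t effective_flags(const flag_record_t *rec)
{
    if (rec == NULL || rec->flags != (uint32_t)~rec->flags_inverted)
        return SAFE_DEFAULT_FLAGS;
    return rec->flags;
}

bool feature_enabled(uint32_t flags, uint32_t feature)
{
    return (flags & feature) != 0u;
}
```

Because evaluation is a single comparison with no dependency on a filesystem, a parser, or the feature itself, it can run at the very start of boot, before anything that the flags are meant to guard.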

Watchdogs Should Recover the System, Not Hide the Problem

Watchdogs are often used as a final line of defence. If the system hangs, the watchdog resets it.

That is necessary, but not sufficient.

A watchdog that simply resets the device may create an endless reboot loop if the same feature fails again immediately after boot.

A better watchdog strategy preserves context and supports escalation.

For example, if the device crashes once during normal operation, reboot normally. If it crashes repeatedly during the same feature initialisation, skip that feature on the next boot. If failures continue, enter safe mode. If safe mode starts successfully, keep communication and OTA available.

This approach turns the watchdog from a blind reset tool into part of a recovery policy.

The system learns enough from repeated failures to avoid repeating the same mistake indefinitely.

A practical recovery ladder might look like this:

  • First failure: restart the affected task or subsystem.
  • Repeated feature failure: disable that feature temporarily.
  • Repeated system reset: enter degraded boot.
  • Continued instability: enter recovery mode.
  • Recovery available: wait for remote update or service action.

This prevents one feature from trapping the device in a permanent reset loop.
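The reset-level rungs of that ladder can be expressed as a small decision function over state that survives a reset. This is only a sketch: the `retained_state_t` layout, the thresholds, and the function names are hypothetical, and the structure is assumed to live in memory that is preserved across watchdog resets (for example a no-init RAM section or backup registers). The first rung, restarting a task in place, happens before any reset and is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative thresholds; real values depend on the product. */
#define FEATURE_SKIP_THRESHOLD 2u
#define DEGRADED_THRESHOLD     4u
#define RECOVERY_THRESHOLD     6u

typedef enum { BOOT_NORMAL, BOOT_SKIP_FEATURE, BOOT_DEGRADED, BOOT_RECOVERY } boot_mode_t;

/* Assumed to be placed in reset-retained memory. */
typedef struct {
    uint8_t consecutive_wdt_resets;
    uint8_t feature_in_progress;    /* id being initialised; 0 = none */
    uint8_t last_failed_feature;    /* feature to skip when BOOT_SKIP_FEATURE */
    uint8_t same_feature_failures;
} retained_state_t;

/* Bracket each feature initialisation so a reset can be attributed to it. */
void feature_init_begin(retained_state_t *r, uint8_t id) { r->feature_in_progress = id; }
void feature_init_done(retained_state_t *r)              { r->feature_in_progress = 0; }

/* Call once the system has run stably for a while. */
void boot_marked_stable(retained_state_t *r) { r->consecutive_wdt_resets = 0; }

/* Run early in boot, before any optional feature starts. */
boot_mode_t decide_boot_mode(retained_state_t *r, bool reset_was_watchdog)
{
    if (!reset_was_watchdog) {               /* clean start: forget history */
        r->consecutive_wdt_resets = 0;
        r->feature_in_progress = 0;
        r->last_failed_feature = 0;
        r->same_feature_failures = 0;
        return BOOT_NORMAL;
    }

    if (r->consecutive_wdt_resets < UINT8_MAX)
        r->consecutive_wdt_resets++;

    if (r->feature_in_progress != 0) {       /* reset happened inside a feature init */
        if (r->feature_in_progress == r->last_failed_feature) {
            if (r->same_feature_failures < UINT8_MAX)
                r->same_feature_failures++;
        } else {
            r->last_failed_feature = r->feature_in_progress;
            r->same_feature_failures = 1;
        }
        r->feature_in_progress = 0;
    }

    if (r->consecutive_wdt_resets >= RECOVERY_THRESHOLD)    return BOOT_RECOVERY;
    if (r->consecutive_wdt_resets >= DEGRADED_THRESHOLD)    return BOOT_DEGRADED;
    if (r->same_feature_failures >= FEATURE_SKIP_THRESHOLD) return BOOT_SKIP_FEATURE;
    return BOOT_NORMAL;
}
```

With these example thresholds, a feature that hangs during initialisation twice in a row is skipped on the third boot, and a device that keeps resetting anyway moves to degraded boot and then to recovery instead of looping forever.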

Safe Mode Should Be a First-Class Architecture

Safe mode is one of the most valuable patterns in embedded product design.

It is the system state where the device intentionally starts with minimum functionality to preserve recoverability.

Safe mode does not need to support every product feature. It only needs to support enough capability to inspect, configure, update, or service the device.

A useful safe mode may include:

  • stable power initialisation,
  • basic processor runtime,
  • minimal communication path,
  • reset and fault reporting,
  • OTA or service update support,
  • configuration reset,
  • and enough diagnostics to identify failed subsystems.

It should avoid non-essential sensors, high-current loads, advanced processing, unnecessary storage writes, and experimental feature paths.

Safe mode is not a failure.

Safe mode is a controlled recovery state.

A device that enters safe mode is still alive. A device trapped before safe mode is not.

Degraded Operation Is Better Than Total Failure

Not every feature failure should produce a full device failure.

If a non-critical sensor fails, the device may still provide partial value. If a camera fails, other sensing modes may continue. If wireless is unavailable, local buffering may continue. If storage is limited, the system may reduce logging rather than crash.

This requires firmware and product architecture to understand degraded states.

Instead of treating every failure as fatal, the system should classify failures:

  • critical failures that require recovery,
  • feature failures that disable only one capability,
  • temporary failures that may retry,
  • environmental failures that require waiting,
  • and configuration failures that can fall back to defaults.
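In firmware, this classification can be made explicit rather than left implicit in scattered error handling. The sketch below maps each class from the list above to a single response. The enum names, `MAX_RETRIES`, and the choice to disable a feature once retries are exhausted are assumptions for illustration, not a prescribed policy.

```c
/* Failure classes, mirroring the list above. */
typedef enum {
    FAILURE_CRITICAL,
    FAILURE_FEATURE,
    FAILURE_TEMPORARY,
    FAILURE_ENVIRONMENTAL,
    FAILURE_CONFIGURATION
} failure_class_t;

typedef enum {
    ACTION_ENTER_RECOVERY,
    ACTION_DISABLE_FEATURE,
    ACTION_RETRY,
    ACTION_WAIT,
    ACTION_USE_DEFAULTS
} recovery_action_t;

#define MAX_RETRIES 3u   /* illustrative retry budget */

/* Map a classified failure to one response. retries_so_far only matters
 * for temporary failures. */
recovery_action_t action_for_failure(failure_class_t cls, unsigned retries_so_far)
{
    switch (cls) {
    case FAILURE_CRITICAL:
        return ACTION_ENTER_RECOVERY;
    case FAILURE_FEATURE:
        return ACTION_DISABLE_FEATURE;
    case FAILURE_TEMPORARY:
        /* A failure that never clears stops being "temporary". */
        return (retries_so_far < MAX_RETRIES) ? ACTION_RETRY
                                              : ACTION_DISABLE_FEATURE;
    case FAILURE_ENVIRONMENTAL:
        return ACTION_WAIT;
    case FAILURE_CONFIGURATION:
        return ACTION_USE_DEFAULTS;
    }
    return ACTION_ENTER_RECOVERY;   /* unknown class: take the conservative path */
}
```

Keeping the mapping in one place makes the degraded states reviewable: anyone can read which failures are allowed to cost a feature and which are allowed to cost the whole device.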

This is where product thinking and engineering thinking meet.

The user does not care whether a subsystem driver returned a timeout. The user cares whether the device is usable, recoverable, and trustworthy.

A degraded device with clear behaviour is far better than a device that silently bricks itself trying to maintain full functionality.

Observability Makes Containment Practical

Failure containment only works if the system can identify what failed.

If every failure looks like a generic reboot, the system cannot make intelligent recovery decisions.

This is why observability must be part of the architecture.

Useful signals include reset reason, boot stage, last successful subsystem initialisation, watchdog context, rail-good status, storage health, bus recovery count, feature startup result, and safe-mode entry reason.

These do not need to be complicated. Even a small amount of retained state can dramatically improve diagnosis.

For example, knowing that the device failed three times immediately after enabling a specific feature is far more useful than knowing only that the watchdog reset occurred.

The goal is to make the system explain:

“I am alive, but I skipped this feature because it repeatedly failed.”

That level of visibility changes debugging, field support, and fleet reliability.
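A small retained record is often enough to produce that kind of statement. The sketch below is one possible layout: the `boot_diag_t` fields, `DIAG_MAGIC`, and the summary format are invented for this example, and the record is assumed to sit in memory that survives a reset. The numeric codes for reset reason, boot stage, and feature id are placeholders that a real product would define for itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DIAG_MAGIC 0x44494147u   /* "DIAG", marks an initialised record */
#define NO_FEATURE 0u

/* Assumed to be placed in reset-retained memory. */
typedef struct {
    uint32_t magic;
    uint8_t  reset_reason;       /* product-defined code */
    uint8_t  boot_stage;         /* last boot stage reached */
    uint8_t  skipped_feature;    /* NO_FEATURE if nothing was skipped */
    uint8_t  feature_failures;   /* consecutive failures of that feature */
    uint16_t bus_recovery_count;
    uint16_t safe_mode_entries;
} boot_diag_t;

/* After a cold start the retained memory holds garbage: reset it once. */
void diag_prepare(boot_diag_t *d)
{
    if (d->magic != DIAG_MAGIC)
        *d = (boot_diag_t){ .magic = DIAG_MAGIC };
}

void diag_enter_stage(boot_diag_t *d, uint8_t stage) { d->boot_stage = stage; }

void diag_feature_skipped(boot_diag_t *d, uint8_t feature, uint8_t failures)
{
    d->skipped_feature = feature;
    d->feature_failures = failures;
}

/* Format a one-line status for a log, console, or telemetry message.
 * Returns the snprintf result. */
int diag_summary(const boot_diag_t *d, char *buf, size_t len)
{
    if (d->skipped_feature == NO_FEATURE)
        return snprintf(buf, len, "alive stage=%u reset=%u",
                        (unsigned)d->boot_stage, (unsigned)d->reset_reason);
    return snprintf(buf, len,
                    "alive stage=%u reset=%u skipped_feature=%u failures=%u",
                    (unsigned)d->boot_stage, (unsigned)d->reset_reason,
                    (unsigned)d->skipped_feature, (unsigned)d->feature_failures);
}
```

A line such as `alive stage=5 reset=2 skipped_feature=3 failures=3` costs a few bytes of retained state, yet it tells support which feature to look at without anyone needing physical access to the device.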

Hoomanely’s Practical View: Features Must Earn Trust

At Hoomanely, we increasingly think of features as participants in a shared system rather than isolated capabilities.

A feature must earn trust before it receives full authority.

It should not automatically get unlimited power access, unlimited storage writes, unrestricted boot influence, permanent wake authority, or control over shared buses without recovery paths.

The design mindset becomes:

Can this feature fail safely?
Can it be disabled remotely?
Can it be skipped during boot?
Can its power be removed?
Can its bus be recovered?
Can the system still update without it?
Can support understand what happened?

These questions may seem strict during development, but they prevent painful field failures later.

Because in product engineering, the costliest failures are not always the ones that happen often.

They are the ones that leave no recovery path.

Final Thoughts

Preventing one feature from bricking the whole device is not a single mechanism.

It is a design philosophy.

It requires hardware boundaries, firmware supervision, safe boot paths, recoverable storage, protected OTA, meaningful watchdog behaviour, feature flags, degraded operation, and system observability.

Most importantly, it requires accepting that features will fail.

A sensor may hang.
A bus may lock.
A file may corrupt.
A rail may dip.
A firmware module may crash.
A new feature may behave differently in the field than it did in the lab.

The system should be ready for that.

At Hoomanely, the goal is not to build devices that depend on perfect behaviour from every subsystem. The goal is to build devices that remain recoverable when one subsystem behaves badly.

Because a feature is valuable only if it improves the product without threatening the product.

And the strongest embedded systems are not the ones where every feature is tightly connected.

They are the ones where every feature is allowed to fail without taking the whole device down.
