The "Brick-Proof" Bootloader: Designing an A/B Swap Partition
In the unforgiving world of embedded systems, a failed firmware update can transform a functional device into an expensive paperweight. At Hoomanely, where our pet health monitoring devices operate continuously in homes worldwide, device reliability isn't just a technical requirement—it's essential for maintaining the trust of pet owners who depend on our technology for their companions' wellbeing.
The challenge of safe firmware updates becomes critical when devices operate in remote locations without technical support. A single corrupted update could render a pet monitoring system inoperable, potentially missing crucial health events. This reality drove our engineering team to develop what we call a "brick-proof" bootloader architecture, which has so far produced zero bricked devices across thousands of over-the-air updates.
Our solution uses dual-bank flash with hardware-backed A/B partitioning, which enables fast rollback and keeps the system available even when a new firmware image fails. This approach has proven itself in production, handling complex multi-stage updates while preserving device functionality in every failure scenario we have encountered or deliberately tested.

The Critical Problem: Update Failure Modes
Traditional firmware update mechanisms suffer from fundamental vulnerabilities during the update process. Power failures, communication interruptions, or corrupted firmware images can leave devices in unrecoverable states. The most dangerous scenario occurs when the bootloader itself becomes corrupted, creating a complete system failure that requires physical intervention.
Our pet monitoring devices face additional challenges due to their deployment environment. Unlike industrial systems with dedicated maintenance teams, these devices must operate reliably in homes where technical support isn't readily available. A bricked device doesn't just mean downtime—it means potentially missing critical health indicators for beloved pets.
Analysis of field failure modes revealed several critical vulnerabilities in standard update approaches. Single-bank updates overwrite existing firmware before validation, creating vulnerability windows where device failure results in total loss of functionality. Partial write failures can corrupt both bootloader and application code, requiring factory recovery procedures.
Dual-Bank Architecture: Hardware-Enforced Safety
The foundation of our brick-proof approach lies in dual-bank flash architecture available in advanced microcontrollers. This hardware feature provides two complete flash banks, allowing one bank to remain active while updates are written to the alternate bank. The key innovation lies in how we leverage the hardware bank swapping mechanism to achieve atomic updates.
Each bank contains a complete firmware image, so the system keeps full functionality throughout the update process. The hardware bank swap mechanism operates at the memory controller level: modifying a single option bit remaps which physical bank appears at the boot address, with the new mapping taking effect when the option bytes are reloaded, typically at the next reset. Because the mapping flips as a unit, there is no window in which the device holds a half-copied image, which is the vulnerability present in copy-based software update mechanisms.
Our implementation maps logical addresses to physical banks dynamically, ensuring that active firmware remains untouched during updates. The inactive bank serves as the staging area for new firmware, undergoing complete validation before any commitment to the update process.
// Bank mapping logic handles hardware swap state
bool swap_bank = (optsr_cur & FLASH_OPTSR_SWAP_BANK_Msk) != 0;
if (swap_bank) {
    // When swapped: Bank1 logical = Bank2 physical
    physical_bank = (bank_number == 1) ? FLASH_BANK_2 : FLASH_BANK_1;
} else {
    // Normal mapping: Bank1 logical = Bank1 physical
    physical_bank = (bank_number == 1) ? FLASH_BANK_1 : FLASH_BANK_2;
}
This hardware abstraction ensures that update logic remains consistent regardless of the current bank configuration, while the underlying hardware provides the safety guarantees necessary for field deployment.
Multi-Stage Boot Process: Defense in Depth
Our bootloader architecture implements a multi-stage boot process that provides multiple recovery points in case of failure. The First Stage Boot Loader (FSBL) resides in protected flash memory and handles initial system validation, while the main bootloader manages application loading and update coordination.
The FSBL performs critical hardware initialization and basic system health checks before transferring control to the main bootloader. This separation ensures that even if the main bootloader becomes corrupted, the FSBL can initiate recovery procedures or activate the alternate bank directly.
Each boot stage includes validation mechanisms that verify both code integrity and hardware compatibility. CRC32 validation detects corrupted or truncated firmware images (it confirms integrity rather than authenticity, since a CRC is not a cryptographic signature), while hardware-specific checks prevent loading incompatible firmware versions that could damage system components.
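As a rough illustration of the integrity and compatibility checks described above, the sketch below computes a standard CRC-32 (the reflected 0xEDB88320 polynomial used by zlib and Ethernet) and compares it against a header field. The `image_header_t` layout, its field names, and `image_is_valid` are hypothetical and only stand in for whatever header format a real bootloader uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).
 * Table-free, so it is slow but small enough for an early boot stage. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Hypothetical image header; the field names are illustrative only. */
typedef struct {
    uint32_t image_size;   /* payload length in bytes */
    uint32_t expected_crc; /* CRC-32 of the payload */
    uint32_t hw_revision;  /* board revision the image targets */
} image_header_t;

/* Returns 1 if the payload matches the header CRC and targets this board. */
int image_is_valid(const image_header_t *hdr, const uint8_t *payload,
                   uint32_t board_hw_revision)
{
    if (hdr->hw_revision != board_hw_revision) {
        return 0; /* wrong hardware: reject before touching the payload */
    }
    return crc32_compute(payload, hdr->image_size) == hdr->expected_crc;
}
```

A quick sanity check for any CRC-32 implementation of this variant is the ASCII string "123456789", whose checksum is 0xCBF43926.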

Boot Sequence Flow:
- FSBL: Hardware initialization and system validation
- Bootloader: Application verification and bank management
- Application: Main firmware with integrated update client
- Recovery: Automatic fallback on any validation failure
This layered approach provides multiple opportunities for error detection and recovery, ensuring that system failures result in graceful degradation rather than complete device loss.
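The fallback rule in the boot sequence above can be sketched as a small decision function. This is a simplified model, not our production FSBL: the enum names and `select_boot_action` are hypothetical, and the two booleans stand for the outcome of whatever validation each bank's image has gone through.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the boot-time decision. */
typedef enum {
    BOOT_ACTIVE_BANK,    /* normal boot of the currently mapped image */
    BOOT_ALTERNATE_BANK, /* fall back to the image in the other bank */
    BOOT_RECOVERY_MODE   /* no usable image: stay in the loader and wait */
} boot_action_t;

/* Prefer the active image, then the alternate one, then recovery. */
boot_action_t select_boot_action(bool active_valid, bool alternate_valid)
{
    if (active_valid) {
        return BOOT_ACTIVE_BANK;
    }
    if (alternate_valid) {
        return BOOT_ALTERNATE_BANK;
    }
    return BOOT_RECOVERY_MODE;
}
```

Keeping the decision in one pure function makes it easy to unit test on a host machine, separately from the hardware-specific code that performs the actual bank swap.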
CAN FD Over-the-Air Updates: Efficient and Reliable
Our OTA update mechanism uses CAN FD (Controller Area Network with Flexible Data-Rate) for firmware delivery. CAN FD carries up to 64 data bytes per frame, compared with 8 for classic CAN, and includes frame-level CRC checking and error signaling. These are the same properties that have made it common in automotive and industrial applications where update reliability is paramount.
The update process begins with comprehensive pre-validation, including firmware size checks, compatibility verification, and available space confirmation. Chunked transfer with bitmap tracking ensures that partial transfers can resume without restarting the entire update process.
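A minimal sketch of such a pre-validation step is shown below. The `ota_manifest_t` structure, the result codes, and both function names are assumptions made for illustration; the real manifest format is not described here.

```c
#include <stdint.h>

/* Hypothetical result codes for the pre-validation step. */
typedef enum {
    OTA_PRECHECK_OK = 0,
    OTA_PRECHECK_BAD_SIZE,
    OTA_PRECHECK_NO_SPACE,
    OTA_PRECHECK_INCOMPATIBLE
} ota_precheck_t;

/* Hypothetical update manifest sent before any firmware chunk. */
typedef struct {
    uint32_t image_size;  /* total firmware size in bytes */
    uint32_t chunk_size;  /* payload bytes carried per chunk */
    uint32_t hw_revision; /* board revision the image targets */
} ota_manifest_t;

/* Reject an update before a single byte is written to flash. */
ota_precheck_t ota_precheck(const ota_manifest_t *m, uint32_t bank_size,
                            uint32_t board_hw_revision)
{
    if (m->image_size == 0u || m->chunk_size == 0u) {
        return OTA_PRECHECK_BAD_SIZE;
    }
    if (m->image_size > bank_size) {
        return OTA_PRECHECK_NO_SPACE;
    }
    if (m->hw_revision != board_hw_revision) {
        return OTA_PRECHECK_INCOMPATIBLE;
    }
    return OTA_PRECHECK_OK;
}

/* Number of chunks needed; call only after ota_precheck() returned OK,
 * which guarantees chunk_size is non-zero. */
uint32_t ota_total_chunks(const ota_manifest_t *m)
{
    uint32_t full = m->image_size / m->chunk_size;
    return full + ((m->image_size % m->chunk_size) != 0u ? 1u : 0u);
}
```

The chunk count is computed with a division and a remainder test rather than the usual add-then-divide idiom, so it cannot overflow for image sizes near the 32-bit limit.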
Each firmware chunk undergoes immediate validation upon receipt, with corrupted chunks marked for retransmission. The chunk bitmap tracks completion status, enabling intelligent retry logic that minimizes bandwidth usage while ensuring complete firmware delivery.
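// CAN FD optimized chunk processing
uint32_t bitmap_size = (total_chunks + 7) / 8;
// calloc zero-fills the bitmap so no chunk starts out marked as received
ota_session.chunk_bitmap = (uint8_t*)calloc(bitmap_size, 1);
if (ota_session.chunk_bitmap == NULL) {
    // Out of memory: abort the session (error handling is application-specific)
}
// Track chunk completion to enable resume capability
uint8_t chunk_mask = (uint8_t)(1u << (chunk_id % 8));
if (chunk_received_successfully &&
    !(ota_session.chunk_bitmap[chunk_id / 8] & chunk_mask)) {
    // Count each chunk only once, even if it is retransmitted
    ota_session.chunk_bitmap[chunk_id / 8] |= chunk_mask;
    ota_session.chunks_received++;
}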
The CAN FD transport layer provides inherent error detection and recovery mechanisms that complement our application-level validation. This combination ensures that communication failures don't result in corrupted firmware installation, maintaining system integrity throughout the update process.
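To show how the chunk bitmap supports resuming a transfer, the sketch below scans for the next chunk that still needs to be requested. The function names are hypothetical; the bit layout matches the snippet above (chunk `n` lives in bit `n % 8` of byte `n / 8`).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the bit for chunk_id is set in the bitmap. */
bool chunk_is_received(const uint8_t *bitmap, uint32_t chunk_id)
{
    return (bitmap[chunk_id / 8u] & (uint8_t)(1u << (chunk_id % 8u))) != 0;
}

/* Returns the first missing chunk at or after `start`,
 * or total_chunks when every remaining chunk has arrived. */
uint32_t next_missing_chunk(const uint8_t *bitmap, uint32_t total_chunks,
                            uint32_t start)
{
    for (uint32_t id = start; id < total_chunks; id++) {
        if (!chunk_is_received(bitmap, id)) {
            return id;
        }
    }
    return total_chunks;
}
```

After a reconnect, the receiver can call `next_missing_chunk` repeatedly, passing the previous result plus one as `start`, to build the list of chunks to request again instead of restarting the whole transfer.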
Atomic Bank Switching: The Safety Guarantee
The core safety mechanism relies on atomic bank switching implemented through hardware option bytes. Once new firmware passes complete validation in the inactive bank, a single option bit modification makes it the active image from the next boot while preserving the previous version for potential rollback.
This atomic operation occurs at the hardware level, eliminating the possibility of partial completion that could leave the system in an undefined state. The previous firmware remains completely intact and accessible, providing an immediate fallback option if issues are detected with the new firmware.
Bank switching validation includes comprehensive checks of both the new firmware's functionality and the switching mechanism itself. Pre-switch validation ensures that the target bank contains valid, compatible firmware, while post-switch validation confirms successful activation of the new image.

Critical Implementation Details:
- Single-bit swap operation: Atomic at hardware level
- Instant activation: No copying or moving of firmware images
- Preserved rollback: Previous firmware remains intact and accessible
- Validation checkpoints: Multiple verification stages before commitment
The elegance of this approach lies in its simplicity: the most delicate part of a firmware update (the actual switching) is handled entirely by hardware, which removes a whole class of software-induced failure modes.
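One common way to connect post-switch validation to automatic rollback is a trial-boot counter, sketched below. This is an assumption about how such logic could look, not a description of our production code: `update_status_t`, `MAX_TRIAL_BOOTS`, and both functions are hypothetical, and the status record is assumed to live in storage that survives a reset.

```c
#include <stdint.h>

#define MAX_TRIAL_BOOTS 3u /* hypothetical limit before rolling back */

typedef enum { IMAGE_CONFIRMED, IMAGE_TRIAL } image_state_t;

/* Assumed to be kept in reset-persistent storage. */
typedef struct {
    image_state_t state;
    uint8_t trial_boots; /* boots attempted since the bank swap */
} update_status_t;

typedef enum { DECISION_BOOT, DECISION_ROLLBACK } boot_decision_t;

/* Called by the bootloader on every reset, before jumping to the app. */
boot_decision_t bootloader_on_reset(update_status_t *st)
{
    if (st->state == IMAGE_CONFIRMED) {
        return DECISION_BOOT;
    }
    if (st->trial_boots >= MAX_TRIAL_BOOTS) {
        /* The new image never confirmed itself: swap back to the old
         * bank, which then counts as the confirmed image again. */
        st->state = IMAGE_CONFIRMED;
        st->trial_boots = 0u;
        return DECISION_ROLLBACK;
    }
    st->trial_boots++;
    return DECISION_BOOT;
}

/* Called by the application once its own health checks have passed. */
void application_confirm_image(update_status_t *st)
{
    st->state = IMAGE_CONFIRMED;
    st->trial_boots = 0u;
}
```

The design choice here is that the new firmware must actively prove it works. An image that crashes or hangs before confirming itself is rolled back after a bounded number of attempts without any action from the user.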
Field Performance: Zero-Brick Achievement
Production deployment across thousands of devices has validated our brick-proof architecture's effectiveness. Over eighteen months of field operation, we've achieved zero device failures due to firmware updates, including scenarios involving power failures during critical update phases.
Performance metrics demonstrate the system's robustness under real-world conditions. Update completion rates exceed 99.8%, with failed updates automatically rolling back without user intervention. The average rollback time measures under two seconds, ensuring minimal disruption to monitoring services.
Most critically, we've encountered no scenarios where devices became unrecoverable due to firmware update failures. Even deliberate corruption tests and simulated power failures during bank switching operations result in successful recovery to the previous firmware version.

Field Performance Metrics:
- Update success rate: 99.8% on first attempt
- Zero brick incidents: 0 unrecoverable devices in 18 months
- Rollback time: <2 seconds to previous firmware
- Recovery capability: 100% success rate for failed updates
These results validate the architectural choices and demonstrate the practical value of hardware-backed safety mechanisms in production environments.
Reliability in Critical Applications
At Hoomanely, our mission to provide continuous health monitoring for pets demands unprecedented reliability from our embedded systems. Pet health events can occur at any time, making device downtime unacceptable. Our brick-proof bootloader architecture ensures that firmware improvements never compromise device availability.
The dual-bank approach enables us to deploy sophisticated health monitoring algorithms and machine learning models while maintaining the safety net of proven firmware versions. This capability is particularly valuable for edge AI applications, where new models must be deployed safely without risking device functionality.
Our field-deployed devices demonstrate how robust update mechanisms enable aggressive innovation cycles. Teams can deploy experimental features and advanced algorithms knowing that any issues will result in automatic rollback rather than device failure. This safety net accelerates development while maintaining the reliability that pet owners depend on.
The technology stack supporting these updates extends beyond the bootloader itself. Our sensor fusion platform coordinates updates across multiple subsystems, ensuring that thermal sensors, proximity detection, and health monitoring algorithms remain synchronized throughout update processes.
Implementation Insights: Lessons from Production
Developing a truly brick-proof bootloader requires attention to subtle details that only emerge during real-world deployment. Hardware timing considerations, flash memory endurance, and electromagnetic interference all impact update reliability in ways that laboratory testing cannot fully capture.
One critical insight involves the interaction between bank swapping and memory-mapped peripherals. Careful coordination ensures that memory-mapped hardware configurations remain valid across bank switches, preventing system instability after firmware updates.
Flash memory endurance considerations become significant in systems performing frequent updates. Our architecture minimizes option byte modifications by batching multiple validation checks before committing to bank switches, extending hardware lifetime while maintaining update safety.
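The idea of batching checks before touching the option bytes can be sketched as a single gate function. The `validation_result_t` fields and `should_program_swap_bit` are hypothetical names chosen for this example; the point is only that the option byte is programmed when every check has passed and the requested state actually differs from the current one.

```c
#include <stdbool.h>

/* Hypothetical summary of the checks run on the staged image. */
typedef struct {
    bool crc_ok;        /* image checksum matched */
    bool size_ok;       /* image fits the target bank */
    bool hw_compatible; /* image targets this board revision */
} validation_result_t;

/* Returns true only when an option byte write is both safe and needed. */
bool should_program_swap_bit(const validation_result_t *v,
                             bool current_swap, bool desired_swap)
{
    bool all_passed = v->crc_ok && v->size_ok && v->hw_compatible;
    /* Skip the write when the bit already holds the desired value,
     * so a repeated request costs no option byte write cycle. */
    return all_passed && (current_swap != desired_swap);
}
```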
Production Deployment Considerations:
- Flash endurance management: Minimizing option byte write cycles
- Peripheral state preservation: Ensuring hardware compatibility across updates
- Electromagnetic tolerance: Robust operation in electrically noisy environments
- Power management: Handling updates during varying power conditions
These real-world factors significantly influence bootloader design decisions and highlight the importance of comprehensive field testing for safety-critical update mechanisms.
Beyond Basic Safety: Advanced Recovery Mechanisms
While basic dual-bank switching provides excellent safety guarantees, production systems benefit from additional recovery mechanisms that handle edge cases and system degradation scenarios. Our implementation includes intelligent retry logic, progressive timeout handling, and diagnostic capabilities that support field troubleshooting.
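Progressive timeout handling is often implemented as capped exponential backoff. The sketch below is one such scheme under assumed parameters; `retry_timeout_ms` and its arguments are hypothetical, and the actual timing values used in production are not given here.

```c
#include <stdint.h>

/* Timeout for a given retry attempt: base_ms doubled once per attempt,
 * never exceeding max_ms. Attempt 0 is the first try. */
uint32_t retry_timeout_ms(uint32_t attempt, uint32_t base_ms, uint32_t max_ms)
{
    if (base_ms == 0u) {
        return 0u; /* nothing to scale */
    }
    uint32_t timeout = base_ms;
    for (uint32_t i = 0; i < attempt && timeout < max_ms; i++) {
        /* Clamp before doubling so the multiplication cannot overflow. */
        timeout = (timeout > max_ms / 2u) ? max_ms : timeout * 2u;
    }
    return (timeout > max_ms) ? max_ms : timeout;
}
```

With a 100 ms base and a 1000 ms cap, successive attempts wait 100, 200, 400, 800, and then 1000 ms for every later attempt.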
Advanced diagnostics enable remote identification of update issues without requiring physical device access. Comprehensive logging throughout the update process provides visibility into failure modes and enables continuous improvement of the update mechanism.
The system also implements intelligent update scheduling that considers device usage patterns and power availability. Updates automatically defer during critical monitoring periods, ensuring that health monitoring never experiences interruption due to maintenance activities.
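A deferral rule of this kind might look like the sketch below. Everything in it is illustrative: the `device_state_t` fields, the battery threshold, and `should_defer_update` are assumptions, and a real device would feed the decision from its own power and monitoring subsystems.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_BATTERY_PERCENT_FOR_UPDATE 50u /* hypothetical threshold */

/* Hypothetical snapshot of the conditions that matter for scheduling. */
typedef struct {
    bool monitoring_critical; /* e.g. a health event is being recorded */
    bool on_external_power;   /* device is currently mains powered */
    uint8_t battery_percent;  /* only relevant when not on external power */
} device_state_t;

/* Returns true when the update should wait for a better moment. */
bool should_defer_update(const device_state_t *s)
{
    if (s->monitoring_critical) {
        return true; /* never interrupt critical monitoring */
    }
    if (!s->on_external_power &&
        s->battery_percent < MIN_BATTERY_PERCENT_FOR_UPDATE) {
        return true; /* not enough energy margin to finish safely */
    }
    return false;
}
```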
Key Takeaways
Building a truly brick-proof bootloader requires hardware-software co-design that leverages the safety features available in modern microcontrollers. Dual-bank flash architecture with hardware-backed bank swapping provides the foundation for atomic updates that eliminate traditional firmware update vulnerabilities.
The combination of multi-stage boot processes, comprehensive validation, and fast rollback creates a system that has recovered from every update failure we have seen in the field or induced in testing. Field deployment validates these approaches, demonstrating zero-brick performance across thousands of devices over eighteen months.
Success in critical applications requires going beyond basic functionality to address the subtle reliability challenges that emerge in real-world deployments. The investment in robust update architecture pays dividends in reduced support costs, improved user confidence, and the ability to deploy advanced features without compromising system reliability.
This architectural approach reflects Hoomanely's commitment to building technology that pet owners can depend on completely. When health monitoring systems must operate continuously and reliably, every technical decision contributes to our mission of enabling longer, healthier lives for pets through unwavering technological reliability.