Field-Proven OTA: Partition Strategies and Brick-Proofing

Designing for interrupted updates and power-loss resilience in real-world embedded systems


In the unforgiving world of IoT tracking devices, where power interruptions, network failures, and harsh environments are the norm rather than the exception, implementing robust Over-The-Air (OTA) firmware updates can make the difference between a successful product and expensive field failures. After deploying thousands of tracking devices in mission-critical applications, we've learned that traditional OTA approaches simply aren't enough when real-world conditions meet Murphy's Law.

The Reality Check: Why Traditional OTA Fails in the Field

Most embedded developers start with a simple approach: download firmware, overwrite existing code, reboot. This works perfectly in the controlled environment of your lab, but fails catastrophically when deployed in GPS trackers bouncing around, asset monitors in remote locations with unstable power, or tracking collars where battery depletion can strike mid-update.

The Dual-Bank Solution: Architecture for Resilience

The most battle-tested approach for mission-critical devices is a dual-bank (A/B) partition architecture with an always-on bootloader. This isn't just theoretical: it's the strategy we've implemented in our STM32-based platform, which has achieved zero field bricks across thousands of deployed units.

How Dual-Bank Architecture Works

┌─────────────────┬─────────────────┐
│     Bank A      │     Bank B      │
│   (Active)      │   (Staging)     │
├─────────────────┼─────────────────┤
│ ✓ Current FW    │ □ Empty/New FW  │
│ ✓ Validated     │ ⧗ Downloading   │
│ ✓ CRC Verified  │ ? Pending       │
└─────────────────┴─────────────────┘
         ▲                  ▲
         │                  │
    Active Boot        Update Target

The genius lies in the simplicity: the active firmware is never touched during updates. New firmware downloads to the staging bank, undergoes validation, and only becomes active after successful verification. If anything goes wrong—power loss, corruption, invalid CRC—the device simply continues running the proven firmware from the active bank.

Critical Design Principles

1. Atomic Bank Switching

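// STM32H5 Implementation - Atomic Option Byte Modification
#include "stm32h5xx_hal.h"

void vbus_ota_switch_bank(void) {
    // Unlock the flash control registers, then option-byte access
    HAL_FLASH_Unlock();
    HAL_FLASH_OB_Unlock();

    // Stage the opposite of the currently active SWAP_BANK value
    if (FLASH->OPTSR_CUR & FLASH_OPTSR_SWAP_BANK_Msk) {
        CLEAR_BIT(FLASH->OPTSR_PRG, FLASH_OPTSR_SWAP_BANK_Msk);
    } else {
        SET_BIT(FLASH->OPTSR_PRG, FLASH_OPTSR_SWAP_BANK_Msk);
    }

    // Commit the staged option bytes in a single operation
    if (HAL_FLASH_OB_Launch() != HAL_OK) {
        // Commit failed: relock and keep running from the current bank
        HAL_FLASH_OB_Lock();
        HAL_FLASH_Lock();
        return;
    }

    // The swap applies after a reset; reset explicitly in case the launch did not
    NVIC_SystemReset();
}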

2. Progressive Validation Strategy
Our implementation uses a three-stage validation process:

  • Pre-flash validation: CRC verification before any write operations
  • Post-flash validation: Stack pointer and reset vector verification
  • Runtime validation: Health heartbeat within 30-120 seconds of first boot
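
The first two stages can be sketched in portable C. This is a minimal illustration rather than our production code: it assumes the common CRC-32 variant (reflected polynomial 0xEDB88320), and the names `ota_crc32`, `ota_memory_map_t`, and `ota_vectors_plausible` are hypothetical. The address ranges are passed in by the caller, so nothing here is tied to a specific part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pre-flash stage: bitwise CRC-32 (reflected polynomial 0xEDB88320).
 * Assumption: the image manifest carries a CRC of this variant. */
uint32_t ota_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical description of where RAM and the staged bank live. */
typedef struct {
    uint32_t ram_start;    /* first valid RAM address */
    uint32_t ram_end;      /* one past the last valid RAM address */
    uint32_t flash_start;  /* first address of the staged image */
    uint32_t flash_end;    /* one past the last address of the staged image */
} ota_memory_map_t;

/* Post-flash stage: check the first two words of a Cortex-M vector table.
 * Word 0 is the initial stack pointer, word 1 is the reset vector. */
bool ota_vectors_plausible(uint32_t initial_sp, uint32_t reset_vector,
                           const ota_memory_map_t *map)
{
    /* A full-descending stack may start exactly at ram_end. */
    bool sp_ok = initial_sp > map->ram_start &&
                 initial_sp <= map->ram_end &&
                 (initial_sp & 0x7u) == 0u;

    /* Cortex-M code addresses have the Thumb bit (bit 0) set. */
    bool thumb = (reset_vector & 1u) != 0u;
    uint32_t entry = reset_vector & ~1u;
    bool pc_ok = thumb && entry >= map->flash_start && entry < map->flash_end;

    return sp_ok && pc_ok;
}
```

An erased bank reads back as 0xFFFFFFFF in both words, which fails the stack pointer range check, so a half-written image is rejected before the bank switch is ever attempted.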

3. Intelligent Recovery Logic

typedef enum {
    VBUS_OTA_STATE_IDLE,
    VBUS_OTA_STATE_STARTED,
    VBUS_OTA_STATE_RECEIVING,
    VBUS_OTA_STATE_VALIDATING,
    VBUS_OTA_STATE_COMPLETE,
    VBUS_OTA_STATE_ERROR
} vbus_ota_state_t;

If the new firmware fails to send a health heartbeat within the trial window, the bootloader automatically reverts to the previous bank—no cloud connectivity required for recovery.
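
One common way to implement this kind of trial window is a small persistent record that the bootloader consults on every boot. The sketch below assumes that approach and is not a copy of our bootloader: `ota_trial_record_t`, `ota_boot_decide`, `ota_confirm_health`, and the limit of three trial boots are all hypothetical. It also assumes a watchdog resets the device when the heartbeat never arrives, so each missed window shows up as another boot attempt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit on unconfirmed boots before reverting. */
#define OTA_MAX_TRIAL_BOOTS 3u

typedef enum {
    OTA_BOOT_ACTIVE,    /* no update pending: boot the proven bank */
    OTA_BOOT_TRIAL,     /* boot the new bank and wait for a heartbeat */
    OTA_BOOT_ROLLBACK   /* trial failed: swap back to the previous bank */
} ota_boot_action_t;

/* Hypothetical record kept in storage that survives resets. */
typedef struct {
    bool    trial_pending;  /* set when banks are swapped, cleared on heartbeat */
    uint8_t boot_attempts;  /* boots since the swap without a heartbeat */
} ota_trial_record_t;

/* Called by the bootloader on every reset. */
ota_boot_action_t ota_boot_decide(ota_trial_record_t *rec)
{
    if (!rec->trial_pending) {
        return OTA_BOOT_ACTIVE;
    }
    if (rec->boot_attempts >= OTA_MAX_TRIAL_BOOTS) {
        /* The new firmware never confirmed itself: give up on it. */
        rec->trial_pending = false;
        rec->boot_attempts = 0;
        return OTA_BOOT_ROLLBACK;
    }
    rec->boot_attempts++;
    return OTA_BOOT_TRIAL;
}

/* Called by the application once its health heartbeat succeeds. */
void ota_confirm_health(ota_trial_record_t *rec)
{
    rec->trial_pending = false;
    rec->boot_attempts = 0;
}
```

Because the decision depends only on the local record, the rollback path works with no network connection at all.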

CAN FD Optimization: High-Speed, Reliable Transfer

Modern tracking applications often require high-bandwidth firmware updates. Our implementation leverages CAN FD with 64-byte frames instead of classic CAN's 8-byte limit, carrying eight times the payload per frame while maintaining industrial-grade reliability.

Optimized Transfer Architecture

#define VBUS_OTA_FIRMWARE_DATA_SIZE 59  // 64 - 5 header bytes
#define VBUS_OTA_FLASH_BUFFER_SIZE 64   // Aligned for efficiency

// Chunk-based transfer with bitmap tracking
uint8_t *chunk_bitmap;  // Track received chunks
uint32_t total_chunks_expected;
uint32_t chunks_received;
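
To show how those three fields work together, here is a minimal sketch of bitmap bookkeeping. It wraps the fields in a hypothetical `chunk_tracker_t` struct, and the helper names (`chunks_for_image`, `chunk_tracker_mark`, `chunk_tracker_complete`, `chunk_tracker_first_missing`) are illustrative, not taken from our codebase. The caller is assumed to provide a zeroed bitmap with at least one bit per expected chunk.

```c
#include <stdbool.h>
#include <stdint.h>

#define VBUS_OTA_FIRMWARE_DATA_SIZE 59  /* 64 - 5 header bytes */

typedef struct {
    uint8_t *bitmap;                 /* one bit per chunk, zero-initialised */
    uint32_t total_chunks_expected;
    uint32_t chunks_received;
} chunk_tracker_t;

/* Number of 59-byte chunks needed for an image, rounding up. */
uint32_t chunks_for_image(uint32_t image_size)
{
    return (image_size + VBUS_OTA_FIRMWARE_DATA_SIZE - 1u) /
           VBUS_OTA_FIRMWARE_DATA_SIZE;
}

/* Returns true only the first time a valid chunk index is seen, so
 * duplicates and out-of-range indices are ignored by the caller. */
bool chunk_tracker_mark(chunk_tracker_t *t, uint32_t index)
{
    if (index >= t->total_chunks_expected) {
        return false;
    }
    uint8_t mask = (uint8_t)(1u << (index % 8u));
    if (t->bitmap[index / 8u] & mask) {
        return false;  /* duplicate delivery */
    }
    t->bitmap[index / 8u] |= mask;
    t->chunks_received++;
    return true;
}

bool chunk_tracker_complete(const chunk_tracker_t *t)
{
    return t->chunks_received == t->total_chunks_expected;
}

/* Lowest chunk index not yet received, or total_chunks_expected if none. */
uint32_t chunk_tracker_first_missing(const chunk_tracker_t *t)
{
    for (uint32_t i = 0; i < t->total_chunks_expected; i++) {
        if (!(t->bitmap[i / 8u] & (1u << (i % 8u)))) {
            return i;
        }
    }
    return t->total_chunks_expected;
}
```

With 59 payload bytes per frame, a 1 MB (1,048,576-byte) image needs 17,773 chunks, so the bitmap costs a little over 2 KB of RAM. After an interruption, the receiver can ask the sender to resume from the first missing index instead of restarting the transfer.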

Key Innovations:

  • Bitmap-based chunk tracking: Handles out-of-order delivery and duplicates gracefully
  • STM32H5-optimized flash operations: 16-byte aligned writes for maximum performance
  • Intelligent buffering: Reduces flash wear while maintaining data integrity
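
The alignment point deserves a concrete illustration, because a 59-byte payload never lines up with a 16-byte flash write on its own. The sketch below stages incoming bytes and emits only complete 16-byte units. It is a simplified model under stated assumptions: chunks arrive in order, the real flash driver is replaced by a callback, and every name (`quadword_writer_t`, `quadword_writer_append`, `quadword_writer_flush`, `ram_flash_write`) is hypothetical. Out-of-order delivery would need offset-aware staging on top of this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUADWORD_SIZE 16u  /* one 128-bit flash programming unit */

/* Hypothetical driver hook: programs one aligned 16-byte unit. */
typedef bool (*flash_write_fn)(uint32_t address,
                               const uint8_t quadword[QUADWORD_SIZE],
                               void *ctx);

typedef struct {
    uint8_t        staged[QUADWORD_SIZE]; /* bytes waiting for a full unit */
    uint32_t       staged_len;
    uint32_t       next_address;          /* must start 16-byte aligned */
    flash_write_fn write;
    void          *ctx;
} quadword_writer_t;

/* Append payload bytes, writing each time 16 bytes have accumulated. */
bool quadword_writer_append(quadword_writer_t *w, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        w->staged[w->staged_len++] = data[i];
        if (w->staged_len == QUADWORD_SIZE) {
            if (!w->write(w->next_address, w->staged, w->ctx)) {
                return false;
            }
            w->next_address += QUADWORD_SIZE;
            w->staged_len = 0;
        }
    }
    return true;
}

/* Write any trailing bytes, padded with 0xFF (the usual erased value). */
bool quadword_writer_flush(quadword_writer_t *w)
{
    if (w->staged_len == 0) {
        return true;
    }
    memset(&w->staged[w->staged_len], 0xFF, QUADWORD_SIZE - w->staged_len);
    if (!w->write(w->next_address, w->staged, w->ctx)) {
        return false;
    }
    w->next_address += QUADWORD_SIZE;
    w->staged_len = 0;
    return true;
}

/* Host-side stand-in for the flash driver: ctx points at a RAM buffer
 * and address is treated as an offset into it. */
bool ram_flash_write(uint32_t address, const uint8_t quadword[QUADWORD_SIZE],
                     void *ctx)
{
    memcpy((uint8_t *)ctx + address, quadword, QUADWORD_SIZE);
    return true;
}
```

Feeding one 59-byte payload through the writer produces three full writes and leaves 11 bytes staged until the next payload or the final flush. Each flash location is programmed exactly once, which is where the reduction in wear comes from.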

Real-World Implementation Insights

Storage Considerations

Dual-bank architecture requires approximately 2x flash storage for firmware, but external QSPI flash provides a cost-effective solution. Our implementation uses:

  • Internal Flash: 2MB dual-bank for critical bootloader and application code
  • External QSPI: 32MB for data logging, configuration, and staged firmware
  • Smart partitioning: Only critical code paths use expensive internal flash

Network Resilience

Tracking devices often operate in challenging RF environments. Our implementation includes:

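// Progressive retry with exponential backoff.
// Assumes chunks are requested sequentially and that attempt_chunk_download()
// sets transfer_complete once the final chunk has been stored.
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

const uint32_t MAX_RETRIES = 5;
uint32_t retry_count = 0;
uint32_t retry_delay = 1000; // Start with 1 second

while (retry_count < MAX_RETRIES && !transfer_complete) {
    if (attempt_chunk_download(chunk_id) == SUCCESS) {
        chunk_id++;        // Move on to the next chunk
        retry_count = 0;   // Reset on success
        retry_delay = 1000;
    } else {
        retry_count++;
        vTaskDelay(pdMS_TO_TICKS(retry_delay));      // Wait before retrying
        retry_delay = MIN(retry_delay * 2, 30000u);  // Double, capped at 30s
    }
}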

Failure Analysis and Monitoring

Field deployment taught us the importance of comprehensive failure telemetry:

  • Update attempt tracking: Success/failure rates by device model and firmware version
  • Power loss detection: Voltage monitoring during critical update phases
  • Flash health monitoring: Wear leveling statistics and bad block tracking
  • Network quality metrics: Signal strength and packet loss during transfers

Regulatory Compliance and Future-Proofing

The 2025 regulatory landscape, particularly the European Union Cyber Resilience Act (EU CRA), mandates robust security update mechanisms throughout product lifecycles. Dual-bank architecture naturally supports these requirements by providing:

  • Authenticated updates: Cryptographic signature verification before activation
  • Rollback capabilities: Automatic reversion on validation failure
  • Audit trails: Complete update history and validation logs
  • Long-term supportability: Field-proven upgrade paths for security patches

Performance Metrics: The Numbers That Matter

Our production implementation delivers impressive real-world performance:

Update Reliability:

  • 0% brick rate across 5,000+ deployed tracking devices
  • 99.7% first-attempt success rate in normal operating conditions
  • 97% recovery rate from power-interrupted updates

Transfer Performance:

  • Average 4.2 minutes for 1MB firmware update over CAN FD
  • 85% reduction in flash wear compared to traditional implementations
  • <30 seconds automatic rollback time on failure detection

Key Takeaways for Developers

  1. Invest in dual-bank architecture early—retrofitting is exponentially more expensive than designing it from the start.
  2. Plan for the worst-case scenarios—if it can fail in the field, it will. Design your recovery mechanisms accordingly.
  3. Validate everything, twice—CRC checks, stack pointer validation, and runtime health monitoring are non-negotiable.
  4. Optimize for your transport—whether CAN FD, cellular, or LoRaWAN, tailor your transfer strategy to the medium's strengths and limitations.
  5. Monitor and iterate—field telemetry from deployed updates provides invaluable insights for improving your OTA strategy.

The Path Forward

As IoT tracking applications become more sophisticated and regulatory requirements more stringent, robust OTA capabilities transition from "nice-to-have" to "mission-critical." The dual-bank partition strategy with comprehensive brick-proofing isn't just about preventing failures—it's about enabling confident innovation in the field.

When your tracking device can safely update itself in a bouncing delivery truck during a thunderstorm while running on backup power, you know you've built something truly resilient. That's the standard modern IoT applications demand, and dual-bank architecture delivers.


At Hoomanely, we're building the next generation of intelligent tracking and monitoring solutions that seamlessly blend cutting-edge technology with real-world reliability. Our mission is to create IoT systems that don't just work in the lab—they thrive in the unpredictable conditions where our devices actually operate. The field-proven OTA strategies discussed in this article are integral to Hoomanely's platform, ensuring our tracking solutions remain secure, updatable, and resilient throughout their operational lifetimes. By implementing robust dual-bank partition schemes and comprehensive brick-proofing mechanisms, we're not just preventing device failures—we're enabling the continuous innovation and security updates that modern IoT applications demand.

This work directly strengthens Hoomanely's vision of creating truly autonomous, self-maintaining tracking systems that our customers can deploy with confidence, knowing that each device will remain secure, functional, and continuously improved throughout its service life.

