The Silent Tax: When Compression Demands More RAM Than Your System Has
In the high-stakes world of embedded firmware development, "optimization" is usually the golden word. We optimize for speed, for power consumption, and perhaps most frantically, for memory. We count bytes like misers counting pennies, shaving off struct padding and squashing enums into uint8_t wherever possible. We analyze map files, hunt down stack overflows, and obsess over the footprint of every library we link. So, when the requirement comes down to "save storage space and transmission bandwidth on the bus," integrating a compression library seems like a no-brainer. It is the textbook solution: trade a few milliseconds of CPU cycles to reduce the data footprint by 50% or more. It sounds like a win-win scenario, the kind of engineering decision that gets simpler as you explain it.
But recently, while working on the firmware for a high-speed imaging module, we ran headfirst into a paradox that is rarely discussed in the glossy brochures of compression algorithms or the "Quick Start" guides of repositories: to make data smaller, you often need a momentary explosion of memory usage that can bring a constrained system to its knees. This is the story of how an attempt to save memory almost caused us to run out of it entirely, and the architectural gymnastics required to solve it. It is a tale of caching co-processors, bus arbitration, and the unyielding laws of entropy.
The Setup: A Pipeline Built on Speed and Assumptions
Our system was relatively standard for a high-performance imaging device, but "standard" in embedded terms implies a strict set of constraints. We were capturing high-resolution images from a sensor, storing them, and then transmitting them over a high-speed field bus to a host controller. The images were roughly 529 KB each: raw Bayer data, consisting of unadulterated pixel values from the sensor.
In the world of desktop computing or server-side backend development, 529 KB is a rounding error. You could lose it in your L2 cache and not notice. A Python script wouldn’t even blink at allocating a hundred times that amount. But on a microcontroller (MCU), specifically the high-performance Cortex-M7 series we were using, 529 KB is a significant chunk of real estate. It’s not just "data"; it’s a dominant feature of the memory map.
We had a few distinct memory regions to play with:
- Internal SRAM (System RAM): Super fast, zero wait states, accessible by the core at full clock speed. However, it is limited in size (roughly 1MB total, but fragmented into Tightly Coupled Memory, System RAM, etc.). This is where the stack, the heap, and critical ISR vectors live.
- External PSRAM (Pseudo-Static RAM): Huge by comparison (8 MB), but significantly slower. It lives across the OCTOSPI interface. Accessing it requires traversing the bus, incurring latency penalties, and dealing with cache lines.
- External OCTOSPI Flash: Non-volatile storage. This is the final destination for persistent data, but writing to it is slow and requires sector erases.
The initial pipeline was simple: Capture the image via the camera interface directly into a buffer in PSRAM. Verify the image integrity. Write it to Flash. Later, to offload the data, we read from Flash and sent it over the bus.

The bottleneck, predictably, was the bus. Even at high speeds, sending 529 KB of raw data took roughly 500 to 800 milliseconds per image. During this time, the bus was saturated, and the user was waiting.
The obvious solution was to compress the image before storage or before transmission. If we could get even a modest 2:1 compression ratio, we’d cut the transmission time in half. We chose a legendary generic block-compression algorithm known for its blazing decompression speed and respectable compression performance. It was industry-standard, lightweight, open-source, and theoretically perfect for real-time systems.
The Trap of the "Worst Case" Calculation
We integrated the library, wrote a simple wrapper function, and prepared to test. The logic seemed sound:
- Take the 529 KB source buffer (sitting in PSRAM).
- Allocate a destination buffer for the compressed output.
- Run the standard compression function.
- Profit from the reduced file size.
We decided to be "optimistic" with our memory allocation. Since we expected the images (mostly dark environments with some thermal noise) to compress well, we figured a 300 KB destination buffer would be plenty. That’s nearly 60% of the original size. Surely, the algorithm wouldn't need more than that?
The first time we ran the code, the system didn't crash. It didn't hang. But the compression function returned failure immediately. It didn't even try to compress a single byte.
We dug into the logs. The error wasn't "compression failed due to bad data." It was a check we had implemented ourselves, a safeguard recommended by the library documentation:
    if (dst_capacity < CompressionBound(src_size)) {
        LOG_ERROR("Dest buffer too small!");
        return -1;
    }
This single check is where reality collided with our assumptions. CompressionBound(src_size) calculates the maximum size the output could reach in the worst case.
The core tenet of lossless compression is that it must handle every possible input pattern. If you feed the algorithm a string of repeating characters, it compresses to nearly nothing. But if you feed it completely random noise (data with maximum entropy), it cannot compress it. In fact, due to the overhead of headers, token flags, and structural metadata, a compressed file of random noise will actually be larger than the original input.
For a 529 KB input, the bound calculation returned roughly 531 KB.

The library was effectively telling us: "I cannot guarantee I won't write past the end of your buffer unless you give me 531 KB of space."
This was the paradox. To save 200 KB of ultimate storage space, we first had to allocate a contiguous block of RAM larger than the original file during the processing phase.
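To make the arithmetic concrete, here is a minimal sketch of a bound calculation. The library's actual formula is not shown in this write-up, so the expression below is an assumption: it is the `n + n/255 + 16` form that LZ4's `LZ4_COMPRESSBOUND` macro uses, chosen because it reproduces the roughly 531 KB figure for a 529 KB input. The function name `compression_bound` is hypothetical.

```c
#include <stddef.h>

/* Assumed worst-case bound of the form n + n/255 + 16.
 * The real library's formula may differ; check its documentation. */
size_t compression_bound(size_t src_size)
{
    return src_size + (src_size / 255u) + 16u;
}
```

For 529 * 1024 = 541,696 input bytes, this gives 541,696 + 2,124 + 16 = 543,836 bytes, or about 531.1 KB. A 300 KB destination buffer therefore fails the check before a single byte is compressed.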
The Memory Tetris Challenge
This requirement triggered a cascading series of problems that exposed the fragility of our memory architecture.
In a standard operating system environment like Linux or Windows, you just allocate 531 KB dynamically. The OS handles the physical mapping. If physical RAM is full, the OS swaps old pages to disk to make room. You might experience a page fault, a slight stutter, and then life goes on. Reliability is handled by the virtual memory manager.
In embedded firmware running an RTOS on a bare-metal MCU, allocating 531 KB dynamically is often a death sentence for the heap.
Our internal SRAM was roughly 1 MB, but it was heavily fragmented. We had the RTOS heap, the main stack, the process stacks, the static BSS section, and various DMA double-buffers scattered across the memory map. Finding a contiguous, linear 531 KB hole in internal SRAM was impossible. It simply wasn't there. Attempting to allocate it would result in a NULL pointer, or worse, a hard fault if we tried to access it blindly.
This meant we had to use the external PSRAM.
"Okay," we reasoned, "We have 8 MB of PSRAM. That's plenty of space."
But we used PSRAM for the source image already. The captured image arrived from the camera interface into a dedicated buffer in PSRAM. To compress it, we now needed:
- Source Buffer: 529 KB (In PSRAM)
- Destination Buffer: 531 KB (In PSRAM, because SRAM can't hold it)
- Algorithm Internal State: ~16 KB (Hash tables for the sliding window, usually on the Stack or in SRAM)
So now we have two massive buffers living in external memory, and the CPU needs to shuttle data between them.

The Performance Penalty and the Bus War
We allocated the second buffer in PSRAM and ran the specific test case. It worked, technically. The image compressed down to roughly 250 KB. Success?
Not quite. The compression time was abysmal.
The algorithm is designed to be fast because it looks for repeating patterns within a sliding window. It does this by reading the input and writing to the output aggressively. When both the Source and Destination are in PSRAM, every single read and write has to cross the external memory bus.
The CPU runs at 480 MHz. The external memory interface, even running at 100 MHz in Double Data Rate (DDR) mode, has significant latency compared to the L1 cache or internal SRAM. By forcing the CPU to fetch data from external RAM, process it, and write it back to external RAM, we were stalling the pipeline. We were effectively thrashing the external bus.
We were seeing compression times of 400-500ms. While that might sound fast to a human, in a real-time system attempting to capture bursts of images at 10 Hz, half a second is an eternity. It blocked the main communication task. The watchdog timer, set to bark at 100ms intervals for strict responsiveness, began to threaten system resets.
The Cache Coherency Nightmare
To make matters worse, introducing heavy CPU processing on PSRAM contents brought the Data Cache (D-Cache) into play.
The Cortex-M7 has a sophisticated cache. When the algorithm reads from PSRAM, the cache controller pulls in a "line" of data (32 bytes) into the fast L1 cache. The CPU modifies it or reads it. If we write to the destination buffer, those writes sit in the cache (Dirty state) until they are evicted back to physical PSRAM.
This is fine for pure CPU operations. But our system used DMA (Direct Memory Access) to move the final compressed data to the SPI Flash controller.
DMA does not see the Cache. DMA sees physical RAM.

If the compression function finished writing the compressed data, that data might still be sitting in the CPU's cache, not in the physical PSRAM chips. If we then triggered a DMA transfer to write that buffer to Flash, the DMA would read the old (stale) data from physical RAM, effectively writing garbage or zeros to the Flash.
To fix this, we had to implement rigorous Cache Maintenance operations:
- Before Compression: Invalidate Cache by Address — Ensure the CPU isn't reading stale values if DMA just put the image there.
- After Compression: Clean Cache by Address — Force the CPU to flush all calculated compressed bytes out to physical RAM so the DMA can see them later.
These cache maintenance operations are blocking and expensive. They added another 5-10ms of overhead, further eating into our tight timing budget.
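One detail worth sketching is that by-address cache maintenance works on whole 32-byte lines, so the address range has to be expanded to line boundaries first. The helper below is illustrative only: `cache_range_align` and `cache_range_t` are hypothetical names, and the target-side calls in the trailing comment assume a CMSIS-Core based project rather than describing the exact code we shipped.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u  /* Cortex-M7 D-cache line size in bytes */

typedef struct {
    uintptr_t start;  /* start address rounded down to a cache line */
    size_t    size;   /* length rounded up to whole cache lines */
} cache_range_t;

/* Expand [addr, addr + len) so it covers only whole cache lines. */
cache_range_t cache_range_align(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    cache_range_t r;
    uintptr_t end = (addr + len + mask) & ~mask;

    r.start = addr & ~mask;
    r.size  = (size_t)(end - r.start);
    return r;
}

/* On target (assumption: CMSIS-Core is available), the sequence would be
 * roughly:
 *
 *   cache_range_t s = cache_range_align((uintptr_t)src, src_size);
 *   SCB_InvalidateDCache_by_Addr((void *)s.start, (int32_t)s.size);
 *   ... run the compression ...
 *   cache_range_t d = cache_range_align((uintptr_t)dst, compressed_size);
 *   SCB_CleanDCache_by_Addr((void *)d.start, (int32_t)d.size);
 */
```

Because an invalidate discards everything in the affected lines, a buffer that shares a cache line with unrelated data can silently lose that data. Aligning and padding DMA-visible buffers to 32 bytes avoids the problem.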
The "Optimistic" Approach vs. The "Safe" Approach
At this point, faced with slow performance and high memory usage, we considered cheating.
We thought, "We know our images. They will never be random noise. They will always compress to under 300 KB. Why can't we just give the library a 300 KB buffer and tell it to try?"
We looked into the specific implementation of the compression function. It takes a destination capacity argument. If the algorithm detects that it is about to write beyond that capacity, it stops and returns an error code.
So, theoretically, we could pass a smaller buffer. We could blindly allocate 300 KB and hope for the best.
But this introduced a Non-Deterministic Failure Mode.
Imagine the device is deployed in the field. It works perfectly for months. Then, on one particularly hot day, the sensor noise floor rises due to thermal effects. Or perhaps the user points the camera at a scene with incredibly high-frequency detail—like a field of gravel, or static on a screen. The entropy of the image spikes. Suddenly, the image doesn't compress to 300 KB—it compresses to 310 KB.
If we had allocated 300 KB, the compression would fail.
What happens then?
- Discard the data? Unacceptable. The user asked for a picture; we can't just delete it because it was "too detailed."
- Send uncompressed? We could, but can the downstream receiver handle an uncompressed packet when it expects a compressed one? And does our transmission ring buffer have space for the full 529 KB blob?
- Crash? The worst option, but a likely one if we didn't handle the error code correctly.
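To show what the "send uncompressed" option would demand from the protocol, here is a hypothetical frame header with an explicit encoding flag. None of this is our actual wire format: `frame_header_t`, `frame_describe`, and the convention that a compressed size of 0 signals failure are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_RAW = 0, ENC_COMPRESSED = 1 } frame_encoding_t;

typedef struct {
    uint8_t  encoding;       /* one of frame_encoding_t */
    uint32_t payload_size;   /* bytes that follow the header */
    uint32_t original_size;  /* size of the uncompressed image */
} frame_header_t;

/* Build a header from the compressor's result.
 * Assumed convention: compressed_size == 0 means the compressor failed,
 * for example because the destination buffer was too small. */
frame_header_t frame_describe(size_t compressed_size, size_t original_size)
{
    frame_header_t h;
    h.original_size = (uint32_t)original_size;

    if (compressed_size == 0u || compressed_size >= original_size) {
        /* Fall back to raw: the receiver must be able to handle this. */
        h.encoding     = (uint8_t)ENC_RAW;
        h.payload_size = (uint32_t)original_size;
    } else {
        h.encoding     = (uint8_t)ENC_COMPRESSED;
        h.payload_size = (uint32_t)compressed_size;
    }
    return h;
}
```

Even with such a flag, every buffer downstream still has to be sized for the raw 529 KB case, which is exactly the cost the undersized buffer was meant to avoid.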
In embedded engineering, correctness is paramount. We cannot rely on "probably won't happen." We have to design for "even if the worst thing happens, the system survives."
The Pivot: Architecture over Algorithms
We realized we couldn't optimize the algorithm to solve the memory constraint without sacrificing safety. We had to optimize the architecture.
We couldn't fit the worst-case destination buffer in fast internal SRAM. We couldn't tolerate the uncertainty of an undersized buffer.
The solution was a dedicated "Compression Scratchpad" in PSRAM, managed with strict ownership rules, effectively creating a dedicated memory lane for this operation.

We allocated a persistent, static buffer temp_buffer of 600 KB in PSRAM at system startup. This memory was reserved. It was "dead" space to the rest of the system, but it was our safety net.
When a compression request came in:
- We claimed a mutex for the scratchpad to prevent any other task from touching it.
- We ran the compression function, reading from the verified image source and writing to this huge scratchpad.
- Because the scratchpad was guaranteed to be larger than the theoretical bound, we knew the operation would never fail due to size. It was mathematically impossible.
- Once compression finished, we looked at the actual size (e.g., 250 KB).
- We then performed a memcpy from the scratchpad to the final destination buffer, or directly queued it for DMA.
"Wait," you might ask. "You added a memcpy? Isn't that slower?"
Yes, we added a memory copy. In a desktop app, this is heresy. In embedded, it was the price of deterministic reliability.
By using the scratchpad, we decoupled the safety requirement (needing 531 KB max) from the storage requirement (needing only ~250 KB eventually).
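The ownership rules can be sketched roughly as follows. This is a host-runnable approximation, not our production code: the boolean flag stands in for an RTOS mutex, `stub_compress` and `stub_bound` are a placeholder codec that simply stores the input behind a one-byte tag (mimicking the incompressible worst case), and `scratchpad_compress` and `scratchpad_release` are hypothetical names. On target, `temp_buffer` would be placed in PSRAM through a linker section.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCRATCHPAD_SIZE (600u * 1024u)

/* Reserved once at startup; ordinary static storage on a host build. */
static uint8_t temp_buffer[SCRATCHPAD_SIZE];

/* Stand-in for an RTOS mutex (not safe across real tasks). */
static bool scratchpad_owned = false;

/* Placeholder codec: worst-case output is the input plus a 1-byte tag. */
size_t stub_bound(size_t src_size) { return src_size + 1u; }

size_t stub_compress(const uint8_t *src, size_t src_size,
                     uint8_t *dst, size_t dst_capacity)
{
    if (dst_capacity < stub_bound(src_size)) {
        return 0u;                   /* 0 signals failure */
    }
    dst[0] = 0x00u;                  /* tag: "stored" */
    memcpy(&dst[1], src, src_size);
    return src_size + 1u;
}

/* Claim the scratchpad and compress into it. Returns a pointer to the
 * result, or NULL if the scratchpad is busy or the input is too large.
 * The caller copies *out_size bytes out, then calls scratchpad_release(). */
const uint8_t *scratchpad_compress(const uint8_t *src, size_t src_size,
                                   size_t *out_size)
{
    if (scratchpad_owned || stub_bound(src_size) > SCRATCHPAD_SIZE) {
        return NULL;
    }
    scratchpad_owned = true;

    size_t n = stub_compress(src, src_size, temp_buffer, SCRATCHPAD_SIZE);
    if (n == 0u) {
        scratchpad_owned = false;
        return NULL;
    }
    *out_size = n;
    return temp_buffer;
}

void scratchpad_release(void) { scratchpad_owned = false; }
```

A caller would allocate or select a final buffer only after `out_size` is known, copy exactly that many bytes with `memcpy`, and then release the scratchpad so the next request can proceed.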
To solve the latency/watchdog issue, we reworked the task handling. Since the default compression call is blocking, we couldn't yield in the middle of it. So we lowered the priority of the compression task, allowing the watchdog task to preempt it, and we fed the watchdog before and after the heavy call. We also explored splitting the compression into chunks, but the complexity of maintaining state across chunks with the stream API outweighed the benefits, so we stayed with the scratchpad approach.
The Streaming Alternative (And Why We Didn't Implement It)
A savvy reader might ask: "Why not use chunked compression? Why not compress in blocks?"
Chunked compression (streaming) is indeed the standard answer to "low memory" problems. You break the 529 KB file into 8 KB chunks. You compress each chunk into a small 8 KB internal SRAM buffer, send it out, and repeat.
We explored this. It solves the buffer size problem perfectly. You never need more than ~16 KB of RAM.
However, it introduces Frame Overhead. Each compressed chunk has its own header. If you have too many small chunks, the compression ratio worsens.
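A quick back-of-the-envelope sketch shows how the framing cost scales with chunk size. The 8-byte per-chunk header used in the note below is an assumed figure for illustration, not the library's real frame format, and both function names are hypothetical.

```c
#include <stddef.h>

/* Number of chunks needed to cover total_size (chunk_size must be > 0). */
size_t chunk_count(size_t total_size, size_t chunk_size)
{
    return (total_size + chunk_size - 1u) / chunk_size;
}

/* Total bytes spent on per-chunk headers. */
size_t framing_overhead(size_t total_size, size_t chunk_size,
                        size_t header_size)
{
    return chunk_count(total_size, chunk_size) * header_size;
}
```

With 8 KB chunks, a 529 KB image needs 67 chunks, so an 8-byte header costs only 536 bytes in total. Shrinking the chunks to 64 bytes would need 8,464 chunks and about 66 KB of headers. The header bytes are only part of the cost: if each chunk is compressed independently, it also starts with no match history, which lowers the ratio further.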
More importantly, it introduces System Complexity. Our current pipeline verification relied on having the "whole image" ready to checksum and verify before committing it to storage. Streaming means the image exists in a transient state. If a transmission error occurs halfway through, or if the flash write fails on the 10th chunk, you cannot easily "retry" without re-compressing the first half of the image, which might have already been overwritten in the sensor buffer.
In our specific case, looking at the code complexity of a streaming implementation versus the "brute force" PSRAM scratchpad, the scratchpad won. Memory is there to be used, after all. Leaving 8 MB of PSRAM empty to save "clean code points" while struggling to fit things into 512KB of SRAM is false economy.
Challenges Faced & Lessons Learned
The journey to stable compression was not just about calling a C function. It forced us to confront the physical reality of our hardware.
Challenge 1: The Fragmentation Lie. We initially thought we had free RAM because the heap manager reported 200 KB free. But it was fragmented into tiny 10 KB holes. You cannot fit a 50 KB output buffer into five 10 KB holes without some complex scatter-gather scheme. Lesson: Total Free RAM != Max Allocatable Block.
Challenge 2: The Debugger's Blind Spot. When the library failed silently (returning failure codes), it was easy to blame the code. "It's buggy," we said. It wasn't. It was protecting us. We had to trace the execution flow into the assembly of the bounds check to understand why it was rejecting our buffer. Lesson: Read the docs on bounds, specifically the "Worst Case" sections.
Challenge 3: The Latency Surprise. Moving data to PSRAM fixed the crash but broke the timing. We had to re-evaluate our task priorities. The compression_task had to be lower priority than the camera_capture_task to prevent frame drops. The massive memory transactions on the external bus also contended with the Camera DMA. We had to tune the bus priorities in the bus matrix to ensure the camera always won.
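The lesson from Challenge 1 is easy to express in code. The sketch below models the heap's free list as a plain array of block sizes, which is a simplification: a real allocator keeps a linked list and may expose these numbers through its own statistics call. The names `heap_summary_t`, `summarize_free_blocks`, and `can_allocate_contiguous` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t total_free;     /* sum of all free blocks */
    size_t largest_block;  /* biggest single contiguous block */
} heap_summary_t;

/* Walk the free blocks and report both the total and the largest block. */
heap_summary_t summarize_free_blocks(const size_t *block_sizes, size_t count)
{
    heap_summary_t s = { 0u, 0u };
    for (size_t i = 0u; i < count; ++i) {
        s.total_free += block_sizes[i];
        if (block_sizes[i] > s.largest_block) {
            s.largest_block = block_sizes[i];
        }
    }
    return s;
}

/* A request succeeds only if one single block can hold it. */
bool can_allocate_contiguous(const size_t *block_sizes, size_t count,
                             size_t request)
{
    return summarize_free_blocks(block_sizes, count).largest_block >= request;
}
```

With twenty free 10 KB blocks, the total is 200 KB, yet a 50 KB request still fails because the largest contiguous block is only 10 KB. The number to watch before a large allocation is the largest block, not the total.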
Conclusion
Compression is not magic. It is a trade. Usually, we think of it as trading CPU cycles for storage. But in the embedded world, there is a hidden third currency: Peak Memory Allocation.
To make a file smaller, you must momentarily possess the capacity to hold something larger.
For us, the solution wasn't to fight this requirement or to gamble on "average case" scenarios. It was to accept valid worst-case inputs as a fundamental system constraint. We allocated the massive buffer. We accepted the memory bus latency. We optimized the surrounding system—caches, DMAs, and task priorities—to handle that latency.
The result? A robust system that runs 24/7 without a single memory fault, regardless of whether the camera sees a pitch-black room or a chaotic static field. In firmware, boring stability is the only metric that truly matters.