Synchronizing Time Across Multiple SoMs: Building a Distributed Clock System
The Problem: When Time Doesn't Add Up
Picture this: You're debugging a critical issue in production. The logs show that a sensor event was processed before it was captured. Impossible? No, just unsynchronized clocks.
This was our reality when building distributed embedded systems. Our platforms rely on multiple Systems-on-Module (SoMs) working in concert, each handling different sensors and processing tasks. But without synchronized time, we were essentially flying blind.
Why Multiple SoMs? The Distributed Architecture Advantage
Modern embedded systems increasingly use multiple SoMs rather than a single monolithic computer. Consider our platforms - Everhub for edge computing and Everbowl for intelligent multi-sensor processing - both leverage this distributed architecture.
The Case for Distribution
Processing Distribution: When you have multiple high-bandwidth sensors (cameras, thermal imaging, audio streams), dedicating separate SoMs prevents resource contention. One SoM handles a subset of sensors while another processes different inputs simultaneously.
Thermal Management: High-performance processing generates heat. Distributing the computational load across multiple physical modules prevents thermal throttling and hotspots that plague single-board solutions under sustained load.
Isolation and Reliability: If one SoM encounters an issue or needs to reboot, the others continue operating. Critical communication functions remain available even if sensor processing temporarily fails.
Specialization: Each SoM can be optimized for its specific task—some configured for low-latency sensor capture, others for compute-intensive processing, and one dedicated to system coordination and external connectivity.
Scalability: Need to add more sensors? Integrate a new processing algorithm? Simply add or upgrade the relevant SoM without redesigning the entire system.
Parallel Processing: True hardware parallelism - multiple sensors being captured and processed simultaneously on different physical modules, not just different threads.
But this architectural elegance introduces a fundamental challenge: time synchronization.
The Challenge: When Every SoM Lives in Its Own Time
In a multi-SoM system, events happen across different modules constantly. Let's say you have a system with multiple SoMs where:
- SoM 1 handles certain sensors
- SoM 2 handles different sensors
- Master SoM coordinates communication
Real-world scenario:
- SoM 1 captures a sensor reading at its local timestamp
- SoM 2 processes data at its local timestamp
- Master SoM logs a system event at its local timestamp
Question: Which event actually happened first?
Without synchronized clocks, you simply cannot know. Each SoM's clock runs independently, and they drift apart over time.
What You Cannot Do Without Synchronized Time
1. Correlate Multi-Sensor Data
Modern systems fuse data from multiple sensors across different SoMs. Was one sensor's reading present when another sensor triggered? Without synchronized timestamps, correlation becomes guesswork.
2. Calculate True Latencies
How long did your processing pipeline take? If sensor capture happens on one SoM and processing completes on another, the calculated latency is meaningless if the clocks disagree.
3. Reconstruct Event Sequences
During debugging or incident analysis, you need to reconstruct what happened. But if logs from different SoMs have inconsistent timestamps, you might see effects appearing before causes—making root cause analysis impossible.
4. Perform Accurate Sensor Fusion
Sensor fusion algorithms depend on temporal alignment. If data from different SoMs has misaligned timestamps due to clock drift, fusion algorithms produce incorrect results or miss detections entirely.
5. Meet System Requirements
Many applications have hard timing requirements:
- Real-time systems need provable maximum latencies
- Safety-critical systems require timing certification
- Data logging must have consistent, auditable timestamps
- Synchronization with external systems demands a common time reference
Example: The Multi-Sensor Detection Problem
Consider a system detecting events using sensors distributed across SoMs:
- Sensor on SoM 1 detects signature
- Sensor on SoM 1 captures additional data
- Sensor on SoM 2 picks up related signal
- Sensor on SoM 1 confirms reading
Question: Are these sensors detecting the same event or different events?
For accurate sensor fusion and event correlation, you need to know:
- Are these from the same object/event or different occurrences?
- What's the temporal correlation?
- Which sensor detected it first?
- What's the propagation delay between sensors?
With unsynchronized clocks, even if all sensors detect the same event simultaneously, the timestamps could differ significantly—making fusion algorithms produce incorrect results.
The Physics of the Problem: Clock Drift is Inevitable
Every SoM has its own crystal oscillator driving its system clock. These crystals are imperfect physical devices, not perfect timekeepers.
Why Crystals Drift
Environmental Factors:
- Temperature: Crystals change frequency with temperature. A temperature change causes the crystal to oscillate faster or slower, changing the rate at which the clock ticks.
- Supply voltage fluctuations: The oscillator's frequency depends on stable power. Voltage variations cause frequency variations.
- Mechanical stress: Physical vibration affects the crystal's resonance frequency. In moving platforms or environments with machinery, this is significant.
- Electromagnetic interference: External radio frequency signals can modulate the oscillator frequency.
Manufacturing Variability:
- Tolerance: Even brand new crystals have inherent frequency tolerance from their nominal specification. No crystal oscillates at exactly its rated frequency.
- Aging: Crystal frequency drifts over months and years as the material properties change.
- Unit-to-unit variation: No two crystals behave identically, even from the same manufacturing batch.
Real-World Impact
Even with high-quality crystals, clocks drift apart over time:
- Short term (seconds to minutes): Small but measurable drift begins
- Medium term (hours): Drift accumulates noticeably
- Long term (days): Clocks can be significantly out of sync
Practical implications:
Two SoMs powered on at the same moment will show different clock readings within minutes. The difference grows continuously. Temperature changes accelerate drift.
Consider this: If one SoM is processing heavy computations and runs warmer, while another SoM handles lighter loads and stays cooler, their clock drift rates will be different. The warmer SoM's crystal may drift faster than the cooler one's.
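To put rough numbers on this, the sketch below converts a frequency error in parts per million into accumulated offset. The +20 ppm and -30 ppm figures are illustrative assumptions, not measurements from our hardware.

```python
def accumulated_drift_us(ppm_error: float, elapsed_s: float) -> float:
    """Offset in microseconds built up by a constant frequency error.

    1 ppm means the clock gains or loses 1 microsecond per second.
    """
    return ppm_error * elapsed_s


# Hypothetical pair of SoMs: one crystal runs 20 ppm fast, the other 30 ppm slow.
relative_ppm = 20.0 - (-30.0)  # the two clocks separate at 50 ppm

for label, seconds in [("1 minute", 60), ("1 hour", 3600), ("1 day", 86400)]:
    drift_ms = accumulated_drift_us(relative_ppm, seconds) / 1000.0
    print(f"{label}: {drift_ms:.1f} ms apart")
```

At a relative error of 50 ppm the two clocks separate by about 3 ms per minute and 180 ms per hour, and that assumes the rate stays constant, which temperature changes prevent.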
Why One-Time Calibration Doesn't Work
You might think: "Let's just measure the clock offset once at startup and correct for it!"
Problem: Drift accumulates continuously.
Even if you perfectly calibrate clocks at startup:
- Temperature changes during operation cause frequency drift
- The SoM that was cool at startup warms up and its clock frequency changes
- Supply voltage variations cause additional drift
- Clock drift is not constant—it changes with environmental conditions
If you calibrate at room temperature startup, but then the system heats up during operation, the calibration becomes invalid within minutes.
You need continuous, active synchronization.
The Solution: A Master Clock and Synchronization Protocol
Our solution designates one SoM as the master clock (the time authority) and implements a synchronization protocol that continuously corrects the other SoMs' clocks.
Architecture Overview

Design Principles
Master SoM Selection: Choose the SoM that:
- Has external connectivity (useful for synchronizing to absolute time sources like NTP or GPS, though optional)
- Has a stable thermal environment (SoMs with lower processing loads tend to have less thermal variation)
- Has reliable power supply
- Plays a central coordination role in the system
Client SoMs: All other SoMs become time clients. They periodically synchronize their local clocks to the master's reference time.
Communication: The SoMs communicate over whatever bus connects them—CAN, Ethernet, SPI, or any other protocol. The synchronization algorithm adapts to the communication medium's characteristics.
Why this works: Instead of having each SoM try to keep perfect time independently (impossible with crystal drift), we establish one reference clock and continuously measure and correct the differences between that reference and each client's local clock.
The Core Algorithm: Measuring and Correcting Clock Offset
The synchronization process needs to answer one fundamental question: "How different is my clock from the master's clock?"
This is trickier than it sounds because the very act of asking the question takes time—messages don't travel instantaneously between SoMs.
Step 1: Clock Offset Measurement
The Protocol (Cristian's Algorithm):
When a client SoM wants to synchronize:
- Client sends sync request to master, recording its local time as t0
- Master receives request, timestamps it as t1 using the master's clock
- Master sends response containing the timestamp t1
- Client receives response, recording its local time as t3
Now the client has three timestamps: t0 and t3 from its own clock, and t1 from the master's clock.
The Calculation:
Round-Trip Time (RTT) = t3 - t0
(This is how long it took for the request to go to master and response to return)
Estimated One-Way Delay = RTT / 2
(Assuming symmetric delay—message takes similar time in both directions)
Clock Offset = t1 - (t0 + One-Way Delay)
             = t1 - (t0 + RTT/2)
Error Bound = RTT / 2
(Our uncertainty in the measurement)
Why this works:
If we assume the message takes roughly the same time traveling from client to master as from master to client (symmetric delay), then when the master timestamped the request at t1, the client's clock should have read t0 + RTT/2.
The difference between what the master's clock said (t1) and what the client's clock should have said (t0 + RTT/2) reveals the clock offset between them.
Visual Example:
Imagine the client's clock reads 1000.000 ms when it sends the request. The response returns when the client's clock reads 1000.003 ms. The RTT is 3 microseconds.
If the message traveled symmetrically, it took 1.5 microseconds each way. So when the master received the request and timestamped it, the client's clock should have read 1000.0015 ms.
But suppose the master's timestamp says 2000.010 ms. This means the client's clock is behind the master's clock by approximately 2000.010 - 1000.0015 = 1000.0085 milliseconds.
Now the client knows: "My clock is about 1000 ms behind the master."
The Key Insight: We've measured the clock offset despite the communication delay by using the round-trip time to estimate which instant on the client's clock the master's timestamp corresponds to.
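The calculation is small enough to sketch directly. The names `OffsetSample` and `measure_offset` are hypothetical, chosen for illustration. The example values are the ones from the visual example, converted to microseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetSample:
    rtt: float          # round-trip time, measured on the client clock
    offset: float       # master time minus client time
    error_bound: float  # worst-case error if the delay was fully asymmetric


def measure_offset(t0: float, t1: float, t3: float) -> OffsetSample:
    """Apply the offset calculation to one request/response exchange.

    t0 and t3 come from the client clock, t1 from the master clock.
    All three must use the same unit.
    """
    rtt = t3 - t0
    if rtt < 0:
        raise ValueError("t3 must not be earlier than t0")
    one_way = rtt / 2
    return OffsetSample(rtt=rtt, offset=t1 - (t0 + one_way), error_bound=one_way)


# The visual example, expressed in microseconds.
sample = measure_offset(t0=1_000_000.0, t1=2_000_010.0, t3=1_000_003.0)
print(f"offset = {sample.offset / 1000.0} ms, +/- {sample.error_bound} us")
```

A positive offset means the client clock is behind the master. The error bound travels with the offset so later stages can judge how much to trust the sample.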
Step 2: Handling Communication Variability
Not all sync measurements are equally reliable. Communication delay varies due to:
- Message queuing in buffers (other traffic delays our sync messages)
- CPU scheduling delays (the receiving SoM might be busy with other tasks)
- Interrupt latency (delay before the timestamp is actually captured)
- Bus contention (multiple devices trying to communicate simultaneously)
The Problem: If a sync request gets delayed in a queue for 5 milliseconds, but we assume symmetric delay, our offset calculation will be wrong by about 2.5 milliseconds.
Solution: Weight measurements by their quality.
We track the minimum observed RTT over recent measurements. This represents the "best case" communication delay—when the path is clear with no queuing or contention.
When a new measurement arrives:
- If its RTT is close to the minimum, it's a high-quality measurement with little delay variation
- If its RTT is much higher, it likely suffered from queuing or contention, making it less reliable
We give higher weight to measurements with low RTT and lower weight (or completely discard) measurements with suspiciously high RTT.
Example: If our minimum observed RTT is 500 microseconds, and a new measurement has RTT of 550 microseconds, we trust it highly. But if a measurement has RTT of 5000 microseconds, we know something delayed it—probably queuing—and we either discard it or weight it very lightly.
This strategy filters out measurements corrupted by communication delays, keeping only the clean, reliable measurements.
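One possible weighting scheme is sketched below. The window size, the rejection ratio of 3, and the `min_rtt / rtt` weight are assumptions chosen for illustration, not the tuned values from our deployment.

```python
from collections import deque


class RttFilter:
    """Weights sync samples by how close their RTT is to the recent minimum."""

    def __init__(self, window: int = 32, reject_ratio: float = 3.0):
        self._rtts = deque(maxlen=window)  # recent RTTs, oldest dropped first
        self._reject_ratio = reject_ratio  # hypothetical tuning parameter

    def weight(self, rtt: float) -> float:
        """Return a weight in [0, 1]; 0 means discard the sample."""
        if rtt <= 0:
            raise ValueError("rtt must be positive")
        self._rtts.append(rtt)
        min_rtt = min(self._rtts)
        if rtt > self._reject_ratio * min_rtt:
            return 0.0  # far above the best case: assume queuing or contention
        return min_rtt / rtt


rtt_filter = RttFilter()
for rtt_us in (500.0, 550.0, 5000.0):
    print(rtt_us, round(rtt_filter.weight(rtt_us), 3))
```

With a minimum of 500 microseconds, the 550 microsecond sample keeps a weight of about 0.91 and the 5000 microsecond sample is discarded. The sliding window lets the minimum recover if the path genuinely gets slower.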
Step 3: Correcting Two Types of Error
Simple offset correction isn't enough. Clock drift has two independent components that need separate treatment:
- Phase Error: The current clock offset (e.g., "my clock currently reads 5ms behind master")
- Frequency Error: The rate of drift (e.g., "my clock gains 50 microseconds per second relative to master")
Why both matter:
If you only correct phase (just adding 5ms to fix the current offset), the underlying frequency error means the clocks immediately start drifting apart again. Within seconds, you're out of sync again.
If you only correct frequency (adjusting how fast the clock ticks), you fix the drift rate, but the initial offset remains. You still have the 5ms error, even though it's not getting worse.
You need to correct both simultaneously.
The Challenge: With a single correction mechanism, frequency adjustments interfere with phase corrections. When you adjust the clock rate to fix frequency, it looks like a phase error. When you adjust the offset to fix phase, it looks like a frequency error. The system fights itself and never converges stably.
The Solution: Two-stage correction process with independent controls.
┌──────────────┐     ┌───────────────┐     ┌───────────────┐
│    Master    │────►│    Virtual    │────►│    Virtual    │────► Synchronized
│  Reference   │     │   Clock #1    │     │   Clock #2    │      Time
│    Clock     │     │               │     │               │
└──────────────┘     └───────┬───────┘     └───────┬───────┘
                             │                     │
                      ┌──────▼─────────┐    ┌──────▼─────────┐
                      │   Frequency    │    │     Phase      │
                      │   Correction   │    │   Correction   │
                      │  (slow loop)   │    │  (fast loop)   │
                      └────────────────┘    └────────────────┘
Stage 1 - Frequency Correction (slow, long-term):
This stage observes how the phase offset changes over time. If your clock is consistently gaining 50 microseconds per second relative to the master, this indicates a frequency error.
The frequency correction adjusts a virtual clock's tick rate. If the local crystal runs slightly fast, we slow down the virtual clock proportionally. If it runs slow, we speed up the virtual clock.
Measurements are taken over long intervals (many seconds) because frequency is a rate-of-change measurement. You need to observe the accumulated drift over time to accurately measure the frequency error. This long measurement window also filters out short-term noise and communication jitter.
Stage 2 - Phase Correction (fast, short-term):
This stage measures the instantaneous phase offset between the (now frequency-corrected) virtual clock and the master.
It directly adjusts the clock phase to eliminate the current offset. If there's still a 2ms offset after frequency correction, the phase correction adds 2ms.
Measurements are taken frequently (every second or few seconds) to quickly eliminate transient errors and respond to sudden changes.
Why two stages work:
The frequency correction provides a stable foundation—a clock ticking at approximately the right rate. The phase correction makes fine adjustments on top of that stable base. They don't interfere because they operate on different time scales and different aspects of the clock (rate vs. offset).
This approach is called a "phase-locked loop" (PLL), a well-established control theory technique. The two-stage design gives us two "degrees of freedom" allowing the system to independently control both frequency and phase, achieving rapid and stable convergence.
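The sketch below shows one simple way to arrange the two loops. It is not our production servo: the class name, the gains, and the 10-second frequency window are assumptions. The slow loop estimates the rate from the slope of the raw (uncorrected) offset, so phase adjustments cannot disturb it. The fast loop removes a fraction of the remaining offset on every sample.

```python
class TwoStageDiscipline:
    """Slow frequency loop plus fast phase loop (illustrative gains)."""

    def __init__(self, freq_interval=10.0, freq_gain=0.5, phase_gain=0.5):
        self.freq_interval = freq_interval  # seconds between rate updates
        self.freq_gain = freq_gain
        self.phase_gain = phase_gain
        self.rate = 0.0      # fractional rate correction (-50e-6 means -50 ppm)
        self.phase = 0.0     # seconds added on top of the rate-corrected clock
        self._anchor = None  # (local, virtual) pair where the rate last changed
        self._window = None  # (local, raw offset) at the start of the slow window

    def _virtual(self, local):
        """Stage 1 output: local time with only the rate correction applied."""
        anchor_local, anchor_virtual = self._anchor
        return anchor_virtual + (local - anchor_local) * (1.0 + self.rate)

    def to_sync(self, local):
        """Stage 2 output: rate-corrected time plus the phase offset."""
        return self._virtual(local) + self.phase

    def update(self, local, master):
        """Feed one measurement: client time and the estimated master time."""
        if self._anchor is None:
            self._anchor = (local, local)
            self._window = (local, master - local)
        # Slow loop: slope of the raw offset over a long window.
        start_local, start_offset = self._window
        if local - start_local >= self.freq_interval:
            slope = ((master - local) - start_offset) / (local - start_local)
            # Re-anchor first so the rate change causes no jump in time.
            self._anchor = (local, self._virtual(local))
            self.rate += self.freq_gain * (slope - self.rate)
            self._window = (local, master - local)
        # Fast loop: remove part of the remaining offset on every sample.
        residual = master - self.to_sync(local)
        self.phase += self.phase_gain * residual


# Simulated client: 50 ppm fast, starting 4900 s away from the master.
servo = TwoStageDiscipline()
for t in range(200):
    servo.update(local=100.0 + t * (1 + 50e-6), master=5000.0 + t)
print(f"rate correction: {servo.rate * 1e6:.2f} ppm")
```

In this simulation the rate estimate settles near -50 ppm and the predicted time for the next sample lands within a microsecond of the master. Real measurements carry jitter, so lower gains and RTT-based weighting would normally be applied before samples reach the servo.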
Step 4: Tracking Accuracy Over Time
Between synchronization measurements, we don't know exactly how accurate our clock is. The error grows as time passes since the last sync.
Every timestamp comes with an error estimate:
Current_Error = Last_Measurement_Error + (Time_Since_Last_Sync × Residual_Uncertainty)
Components:
Last_Measurement_Error: This comes from the RTT of the last sync measurement. If RTT was 1000 microseconds, our measurement error is approximately 500 microseconds (RTT/2). This is the inherent uncertainty in that measurement.
Residual_Uncertainty: Even after frequency correction, there's still some unmeasured drift. The frequency correction isn't perfect—temperature might be changing, the crystal might be aging. We estimate this residual drift rate (typically a few parts per million).
Time_Since_Last_Sync: The longer since our last sync measurement, the more uncertainty accumulates due to residual drift.
Example:
Last sync measurement had RTT of 1000 microseconds, giving measurement error of 500 microseconds. Our residual frequency uncertainty is 10 PPM (parts per million), meaning 10 microseconds per second. If 5 seconds have passed since the last sync:
Current_Error = 500 µs + (5 seconds × 10 µs/second)
              = 500 µs + 50 µs
              = 550 µs
Our current timestamp has an estimated error bound of 550 microseconds.
Out-of-Sync Detection:
If Current_Error exceeds your accuracy requirement (say, 1000 microseconds), the system enters out-of-sync state:
- Stops providing timestamps (prevents delivering bad data)
- Logs a warning for system diagnostics
- Attempts more frequent synchronization to recover
- Resumes normal operation once error drops below threshold
Critical design principle: Rather than delivering timestamps with unknown accuracy, we fail explicitly and loudly when we cannot meet requirements. Silent failures with bad timestamps corrupt data and cause subtle bugs. Loud failures are debuggable.
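The error formula and the out-of-sync check can be sketched as two small functions. The function names are hypothetical. The sketch relies on 1 ppm being 1 microsecond of drift per second.

```python
def current_error_us(last_rtt_us: float, seconds_since_sync: float,
                     residual_ppm: float) -> float:
    """Error bound in microseconds: RTT/2 plus drift accumulated since the last sync."""
    return last_rtt_us / 2.0 + seconds_since_sync * residual_ppm


def is_in_sync(error_us: float, max_error_us: float) -> bool:
    """True while the error bound still meets the accuracy requirement."""
    return error_us <= max_error_us


# RTT of 1000 us, 10 ppm residual uncertainty, 1000 us accuracy requirement.
for elapsed_s in (5.0, 60.0):
    error = current_error_us(1000.0, elapsed_s, 10.0)
    print(f"{elapsed_s:>5.1f} s since sync: {error:.0f} us, in sync: {is_in_sync(error, 1000.0)}")
```

Five seconds after a sync the bound is 550 microseconds and the clock is usable. After a full minute without a sync it has grown to 1100 microseconds, which exceeds the 1000 microsecond requirement, so the service would stop handing out timestamps.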
Implementation Philosophy: The Transform Function Approach
We deliberately do not adjust the operating system's clock. Instead, we maintain a mathematical transform function that converts local monotonic time to synchronized time.
Why Not Adjust the System Clock?
Privilege Requirements: Adjusting system clocks typically requires administrator or root privileges. Applications running as regular users can't modify the system clock. This would force our synchronization service to run with elevated privileges, violating the principle of least privilege.
Software Conflicts: If other time services are running (like NTP daemon, systemd-timesyncd, or chronyd), they'll fight with our service for control of the system clock. They'll undo our adjustments, and we'll undo theirs, creating oscillations and instability.
Application Compatibility: Some applications expect monotonic time—time that only goes forward, never backward. If we adjust the system clock backward to correct an offset, applications might malfunction. Databases, loggers, and many other programs depend on monotonic time.
Flexibility: By not touching the system clock, different processes can theoretically synchronize to different time references if needed (useful for testing or multi-master scenarios).
How the Transform Function Works
Instead of changing the underlying hardware clock, we mathematically transform its readings.
The concept: We maintain parameters describing the relationship between local time and synchronized time:
- Phase offset: A constant offset to add (e.g., "add 1000 milliseconds to local time")
- Frequency multiplier: A scaling factor for elapsed time (e.g., "local clock runs 1.0001× too fast, so multiply elapsed time by 0.9999")
When application code requests the current synchronized timestamp:
- Read the local monotonic clock (the hardware clock that always advances forward)
- Calculate how much time has elapsed since our reference point
- Apply the frequency correction to that elapsed time (scale it by the multiplier)
- Add the phase offset
- Return the result as the synchronized timestamp
Example:
Suppose at some reference moment, local clock read 1000.000 seconds and we determined that corresponded to 5000.000 seconds in master time. Now, local clock reads 1005.000 seconds—5 seconds have elapsed locally.
If our frequency correction is 0.9999 (local clock runs slightly fast), corrected elapsed time is 5.000 × 0.9999 = 4.9995 seconds.
If our phase offset is +0.002 seconds, synchronized time is 5000.000 + 4.9995 + 0.002 = 5005.0015 seconds.
We've converted local time 1005.000 into synchronized time 5005.0015 without ever touching the system clock.
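A minimal sketch of such a transform follows. The class and field names are hypothetical. The sample values are the reference point, multiplier, and phase offset from the example.

```python
import time
from dataclasses import dataclass


@dataclass
class ClockTransform:
    ref_local: float               # local monotonic reading at the reference point (s)
    ref_sync: float                # synchronized time at that same moment (s)
    freq_multiplier: float = 1.0   # scales elapsed local time
    phase_offset: float = 0.0      # constant correction added afterwards (s)

    def to_sync(self, local: float) -> float:
        """Convert a local monotonic reading into synchronized time."""
        elapsed = local - self.ref_local
        return self.ref_sync + elapsed * self.freq_multiplier + self.phase_offset

    def now(self) -> float:
        """Synchronized time for the current instant; the OS clock is only read."""
        return self.to_sync(time.monotonic())


transform = ClockTransform(ref_local=1000.0, ref_sync=5000.0,
                           freq_multiplier=0.9999, phase_offset=0.002)
print(f"{transform.to_sync(1005.0):.4f} s")
```

Because `to_sync` is a pure function of its inputs, it can be unit-tested with fixed readings, which is one of the practical advantages of not touching the system clock.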
Benefits:
- Works as an unprivileged user process
- No interference with system clock or other time services
- Clean, mathematical transformation that's easy to test and validate
- Robust against system clock adjustments by other software
Updating the Transform
As sync measurements arrive, we update the transform parameters using the two-stage correction process described earlier.
The frequency discipline updates the frequency multiplier based on long-term drift observations. The phase discipline updates the phase offset based on short-term measurements.
Both updates use "gains" (weighting factors) that control how aggressively we respond to measurements. Too aggressive causes oscillation and instability. Too conservative causes slow convergence. The gains are tuned through testing to achieve rapid convergence with stability.
Practical Considerations
Thread Safety
The synchronization service must handle concurrent access from multiple threads safely.
Multiple application threads might request timestamps simultaneously while a background thread is updating the transform parameters based on new sync measurements. Without proper synchronization, race conditions could cause corrupted timestamps or crashes.
We use mutexes (mutual exclusion locks) to ensure atomic operations:
- Reading the transform parameters requires locking
- Updating the transform parameters requires locking
- The sync status flag uses atomic operations for lock-free checking
The implementation minimizes lock contention by checking sync status with a fast atomic read before attempting the slower locked read of transform parameters.
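The locking pattern might look like the sketch below. Python has no atomic flag type, so `threading.Event` stands in for the atomic status flag here. The class name and the tuple layout are assumptions for illustration.

```python
import threading


class SharedTransform:
    """Transform parameters shared between reader threads and the sync thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._synced = threading.Event()     # stand-in for the atomic status flag
        self._params = (0.0, 0.0, 1.0, 0.0)  # ref_local, ref_sync, multiplier, phase

    def update(self, ref_local, ref_sync, multiplier, phase, in_sync=True):
        """Called by the background sync thread after each measurement."""
        with self._lock:
            self._params = (ref_local, ref_sync, multiplier, phase)
        if in_sync:
            self._synced.set()
        else:
            self._synced.clear()

    def read(self):
        """Return a consistent snapshot, or None when out of sync."""
        if not self._synced.is_set():  # cheap check, no lock taken
            return None
        with self._lock:
            return self._params


shared = SharedTransform()
shared.update(1000.0, 5000.0, 0.9999, 0.002)
print(shared.read())
```

Storing the parameters as one tuple means a reader always sees a matching set of values, never a new multiplier paired with an old phase offset.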
API Design
Simple API: For most applications, provide a straightforward interface:
- get_timestamp(): Returns current synchronized timestamp or error if out of sync
- is_synchronized(): Checks if the clock is currently synchronized
Advanced API: For applications with strict timing requirements, provide detailed information:
- get_timestamp_with_error(): Returns timestamp plus error bound and sync status
- Applications can decide whether the current accuracy is sufficient for their needs
Example use case: A sensor fusion algorithm might require sub-millisecond accuracy. It calls get_timestamp_with_error(), checks if error bound is below 1000 microseconds, and only proceeds with high-accuracy fusion if the requirement is met. Otherwise, it falls back to a degraded mode or skips that fusion cycle.
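A sketch of how these three calls could fit together is shown below. Only the function names come from the API described here. The `SyncClock` facade, the `TimestampInfo` fields, the exception type, and the injected callables are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class OutOfSyncError(RuntimeError):
    """Raised instead of returning a timestamp of unknown accuracy."""


@dataclass(frozen=True)
class TimestampInfo:
    timestamp_us: int
    error_bound_us: float
    synchronized: bool


class SyncClock:
    def __init__(self, read_time_us: Callable[[], int],
                 read_error_us: Callable[[], float], max_error_us: float):
        self._read_time_us = read_time_us    # e.g. backed by the transform
        self._read_error_us = read_error_us  # e.g. backed by the error tracker
        self._max_error_us = max_error_us

    def is_synchronized(self) -> bool:
        return self._read_error_us() <= self._max_error_us

    def get_timestamp_with_error(self) -> TimestampInfo:
        error = self._read_error_us()
        return TimestampInfo(self._read_time_us(), error, error <= self._max_error_us)

    def get_timestamp(self) -> int:
        info = self.get_timestamp_with_error()
        if not info.synchronized:
            raise OutOfSyncError(f"error bound {info.error_bound_us} us exceeds limit")
        return info.timestamp_us


# Fusion code that needs a sub-millisecond error bound.
clock = SyncClock(lambda: 5_005_001_500, lambda: 550.0, max_error_us=1000.0)
info = clock.get_timestamp_with_error()
mode = "high-accuracy" if info.synchronized and info.error_bound_us < 1000.0 else "degraded"
print(mode)
```

Raising an exception from `get_timestamp()` is one way to make the out-of-sync case impossible to ignore. A return code or optional value would serve the same purpose in other languages.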
Configuration
Each SoM has a configuration file specifying its role and parameters:
Master configuration:
- Role: master
- Whether to sync to external time source (NTP/GPS)
- External time source address if enabled
Client configuration:
- Role: client
- Master SoM identifier/address
- Sync interval (how often to request sync)
- Maximum acceptable error bound
- Various tuning parameters (gains, RTT thresholds, etc.)
Adaptive synchronization:
Rather than syncing at a fixed rate, intelligent implementations adapt:
- Start with frequent syncs (every few hundred milliseconds) during initial lock acquisition
- Gradually increase interval (up to several seconds) once stably synchronized
- Return to frequent syncs if error bound starts growing
- Balance accuracy requirements against communication and CPU overhead
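The client settings and the adaptive interval can be sketched as follows. Every field name and default value is a placeholder, and the "half of the error budget" trigger is an assumption for illustration rather than a tuned rule.

```python
from dataclasses import dataclass


@dataclass
class ClientConfig:
    master_address: str            # identifier or address of the master SoM
    min_interval_s: float = 0.25   # fast rate used while acquiring lock
    max_interval_s: float = 4.0    # slowest rate once stably synchronized
    max_error_us: float = 1000.0   # maximum acceptable error bound


def next_interval(cfg: ClientConfig, current_interval_s: float, error_us: float) -> float:
    """Back off while the error bound is comfortable, tighten when it grows."""
    if error_us > 0.5 * cfg.max_error_us:
        return cfg.min_interval_s  # error is eating the budget: sync often again
    return min(current_interval_s * 2.0, cfg.max_interval_s)


cfg = ClientConfig(master_address="som-master")  # hypothetical address
interval = cfg.min_interval_s
for error_us in (120.0, 110.0, 100.0, 100.0, 100.0, 800.0):
    interval = next_interval(cfg, interval, error_us)
    print(f"error {error_us:6.1f} us -> next sync in {interval} s")
```

Doubling the interval keeps bus traffic low once the clock is stable, while the immediate drop back to the minimum interval limits how long a growing error can go uncorrected.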
Master SoM Implementation
The master SoM runs a simple service that responds to sync requests:
When a sync request arrives from a client:
- Read the master's current time from its local clock
- Send that timestamp back to the requesting client
- Done
The master doesn't need to track clients or maintain per-client state. Each sync request is independent.
Optionally, the master can synchronize its own clock to an external absolute time source (NTP server over the internet, GPS receiver, etc.). This isn't required for relative synchronization between SoMs, but provides absolute UTC time if needed.
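A stateless master over UDP could be as small as the sketch below. The wire format (a 32-bit sequence number echoed back with a 64-bit nanosecond timestamp) and the port number are assumptions, not our actual protocol. The sequence number lets a client match each response to its request without the master keeping any state.

```python
import socket
import struct
import time

REQUEST = struct.Struct("!I")    # hypothetical format: 32-bit sequence number
RESPONSE = struct.Struct("!Iq")  # sequence number + master time in nanoseconds


def handle_sync_request(payload: bytes, now_ns: int) -> bytes:
    """Build the response for one request; keeps no per-client state."""
    (seq,) = REQUEST.unpack(payload)
    return RESPONSE.pack(seq, now_ns)


def serve(host: str = "0.0.0.0", port: int = 50000) -> None:
    """Answer sync requests forever (port number is a placeholder)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            payload, client = sock.recvfrom(64)
            now_ns = time.monotonic_ns()  # timestamp as close to receipt as possible
            if len(payload) != REQUEST.size:
                continue  # ignore malformed requests
            sock.sendto(handle_sync_request(payload, now_ns), client)


# Exercise the handler without opening a socket.
reply = handle_sync_request(REQUEST.pack(7), now_ns=123_456_789)
print(RESPONSE.unpack(reply))
```

Keeping the request handling in a pure function makes the protocol testable without hardware. Only `serve` depends on the transport.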
Communication Abstraction
The sync protocol works over various communication media—CAN bus, Ethernet UDP, SPI, shared memory, etc. The core algorithm (measure RTT, calculate offset, update transform) remains the same.
An abstraction layer handles media-specific details:
- How to send sync requests
- How to receive sync responses
- How to encode/decode timestamps in messages
- Media-specific timing characteristics
This allows the same synchronization code to work across different hardware platforms and communication protocols by simply swapping the communication backend.
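One way to express this abstraction is an interface with a single exchange operation, sketched below. The names `SyncTransport`, `LoopbackTransport`, and `sync_once` are hypothetical. The loopback backend exists only to show how a test double plugs in.

```python
from abc import ABC, abstractmethod
from typing import Callable, Tuple


class SyncTransport(ABC):
    """Media-specific part: CAN, UDP, SPI, shared memory, and so on."""

    @abstractmethod
    def exchange(self, timeout_s: float) -> int:
        """Send one sync request and return the master timestamp in nanoseconds."""


class LoopbackTransport(SyncTransport):
    """In-memory backend for tests; the 'master' is just a callable."""

    def __init__(self, master_clock_ns: Callable[[], int]):
        self._master_clock_ns = master_clock_ns

    def exchange(self, timeout_s: float) -> int:
        return self._master_clock_ns()


def sync_once(transport: SyncTransport, local_clock_ns: Callable[[], int],
              timeout_s: float = 0.1) -> Tuple[float, float]:
    """Media-independent part: returns (offset, error bound) in nanoseconds."""
    t0 = local_clock_ns()
    t1 = transport.exchange(timeout_s)
    t3 = local_clock_ns()
    rtt = t3 - t0
    return t1 - (t0 + rtt / 2), rtt / 2


# Fake clocks: the client reads 1000 then 1004, the master answers 5002.
local_readings = iter([1000, 1004])
offset, bound = sync_once(LoopbackTransport(lambda: 5002), lambda: next(local_readings))
print(offset, bound)
```

A CAN or SPI backend would implement the same `exchange` method, so `sync_once` and everything built on it stay unchanged when the bus changes.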
Performance Expectations
The achievable accuracy depends on several factors:
Communication Characteristics:
- Latency: Lower latency enables better accuracy. Fast communication (microseconds) supports sub-millisecond sync.
- Jitter: Consistent delay is more important than low delay. Highly variable delay degrades accuracy.
- Reliability: Dropped messages require retransmission, degrading sync frequency.
System Characteristics:
- Temperature stability: Thermal transients cause frequency drift, requiring more frequent sync.
- Processing load: Heavy CPU load can delay message handling, increasing jitter.
- Clock quality: Better oscillators (a TCXO rather than a standard crystal) drift less and are easier to synchronize.
Typical Results:
With well-designed systems (low-latency communication, moderate thermal stability, reasonable processing load), sub-millisecond synchronization is routinely achievable. Many deployments achieve accuracy in the hundreds of microseconds.
With challenging conditions (high-latency or jittery communication, severe thermal transients, heavy processing load), accuracy degrades but typically remains within a few milliseconds.
The Key Characteristic: The system continuously measures and reports its own accuracy. You always know how accurate your timestamps are. The system degrades gracefully—accuracy gets worse under stress, but predictably and measurably. When accuracy becomes unacceptable, the system detects this and reports it rather than silently delivering bad data.
Validation and Testing
Stress Testing
Introduce challenging conditions deliberately:
- Network stress: Generate burst traffic to increase communication delays
- Thermal stress: Use thermal chambers or heat guns to create temperature transients
- Processing load: Run CPU-intensive tasks to create scheduling delays
- Long duration: Run continuously for days or weeks to observe long-term behavior
This reveals how the system behaves under real-world stress and validates graceful degradation. Good synchronization systems maintain accuracy even under stress, and when they can't, they detect and report the problem rather than silently failing.
Common Pitfalls to Avoid
Don't assume one-time calibration suffices: New developers often think "just measure the offset at startup." This fails within minutes as thermal drift accumulates. You need continuous, active synchronization throughout system lifetime.
Don't ignore communication variability: Treating all measurements equally degrades accuracy. Some measurements are corrupted by delays. Weight or discard poor measurements.
Don't silently deliver bad timestamps: It's tempting to always return something. But returning inaccurate timestamps without warning corrupts data and causes mysterious bugs. Better to fail explicitly.
Don't forget thermal management: Clock drift is largely temperature-driven. Systems that don't manage thermal environment see worse synchronization performance. Consider thermal design early.
Don't skip validation: It's easy to implement sync code that seems to work but has subtle bugs. Always validate with hardware ground truth and statistical analysis.
Conclusion
Synchronizing time across multiple SoMs is fundamental to building reliable distributed embedded systems. Without it, you cannot:
- Correlate multi-sensor data accurately
- Measure true system latencies
- Debug timing-related issues
- Perform accurate sensor fusion
- Meet safety or compliance requirements
The solution we've described combines several key elements:
- One SoM designated as master clock (time authority)
- Cristian's algorithm for measuring clock offset despite communication delay
- Two-stage phase-locked loops for independently correcting frequency and phase
- Transform function approach avoiding system clock modification
- Continuous accuracy tracking with explicit error bounds
- Graceful degradation and loud failure when requirements can't be met
With proper implementation, sub-millisecond time synchronization is achievable across all SoMs using commodity hardware with no specialized timing equipment. This enables sophisticated applications, from industrial automation to intelligent sensor systems, to operate reliably with precise, consistent timing across all distributed components.
The investment in time synchronization infrastructure pays dividends throughout the system lifecycle: faster debugging (timestamps actually mean something), more reliable operation (sensor fusion works correctly), and confidence that your distributed system's view of time is consistent and accurate.
Building this infrastructure takes effort upfront, but the alternative, dealing with unsynchronized clocks, causes endless debugging pain and unreliable systems. Get time synchronization right from the start.
Key Takeaway: Don't assume your SoMs' clocks agree. They don't, and they never will without active synchronization. Build synchronization in from the beginning, track accuracy explicitly, and fail loudly when precision degrades. Your future debugging self will thank you.