Building a Secure OTA Update System for Embedded Linux: Deep Systems Engineering

12/18/2025 · 12 min
embedded, linux, ota, systems, c, security

The Challenge

At Batna, I was tasked with solving one of the most critical problems in embedded systems: how do you securely and reliably update thousands of devices in the field without physical access? This wasn't just about pushing code—it was about building a system that could:

  • Update devices over unreliable network connections

  • Ensure data integrity and prevent corruption

  • Roll back safely if updates fail

  • Minimize downtime during updates

  • Work with limited resources (memory, CPU, storage)

  • Handle power failures gracefully

    Understanding the Requirements

    Device Constraints

    The embedded Linux devices we were working with had:

  • Limited RAM (128-256MB)

  • Constrained storage (2-4GB eMMC)

  • Unreliable network connectivity

  • No guaranteed power supply

  • Custom hardware with specialized drivers

    Security Requirements

  • Updates must be cryptographically signed

  • No unauthorized modifications allowed

  • Secure communication channels (TLS)

  • Rollback capability for failed updates

  • Audit trail for all update operations

    Reliability Requirements

  • Atomic updates (all-or-nothing)

  • Power-loss resilience

  • Network interruption handling

  • Verification before and after update

  • Automatic rollback on failure
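
For small pieces of persistent state (for example, a file recording which update step the agent has reached), one common way to meet the all-or-nothing and power-loss requirements above is to write to a temporary file, flush it to disk, and rename it over the old one. The sketch below is an illustration under that assumption, not the system's actual code; `atomic_write_json` and the state file are hypothetical names, and it assumes a POSIX filesystem.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Replace `path` with `data` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())      # file contents reach storage before the rename
        os.replace(tmp_path, path)    # atomic swap on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a power cut
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If power is lost at any point, the next boot finds either the previous file or the complete new one.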

    System Architecture

    I designed a three-tier architecture:

    ┌─────────────────────────────────────────┐
    │ Update Server (Cloud)                   │
    │ - Package generation                    │
    │ - Signature management                  │
    │ - Update distribution                   │
    │ - Device tracking                       │
    └──────────────┬──────────────────────────┘
                   │ HTTPS/TLS
                   ▼
    ┌─────────────────────────────────────────┐
    │ Update Agent (Device)                   │
    │ - Update checking                       │
    │ - Download management                   │
    │ - Verification                          │
    │ - Installation orchestration            │
    └──────────────┬──────────────────────────┘
                   │
                   ▼
    ┌─────────────────────────────────────────┐
    │ System Layer (Linux)                    │
    │ - Dual-boot partitions                  │
    │ - Bootloader integration                │
    │ - Kernel and rootfs updates             │
    └─────────────────────────────────────────┘

    Implementation Details

    1. Package Generation Pipeline

    I built a Jenkins-based CI/CD pipeline that:

  • Builds the system image

  • Creates update packages with delta updates

  • Signs packages with RSA-2048

  • Generates metadata (version, checksums, dependencies)

  • Uploads to distribution server

    #!/bin/bash
    # Package generation script
    set -euo pipefail

    # Build system image
    buildroot-make linux-rebuild
    buildroot-make

    # Create update package (one timestamp, reused for the directory and archive)
    STAMP="$(date +%s)"
    UPDATE_DIR="/tmp/update-$STAMP"
    mkdir -p "$UPDATE_DIR"

    # Copy rootfs
    cp output/images/rootfs.ext2 "$UPDATE_DIR/rootfs.ext2"

    # Create delta if previous version exists
    if [ -f "previous/rootfs.ext2" ]; then
        bsdiff previous/rootfs.ext2 "$UPDATE_DIR/rootfs.ext2" "$UPDATE_DIR/rootfs.delta"
    fi

    # Generate metadata
    # stat -c%s is the GNU/Linux form (stat -f%z is the BSD/macOS one).
    # Note: uname -r reports the build host's kernel, not the target image's.
    cat > "$UPDATE_DIR/metadata.json" <<EOF
    {
      "version": "$(git describe --tags)",
      "timestamp": $STAMP,
      "size": $(stat -c%s "$UPDATE_DIR/rootfs.ext2"),
      "checksum": "$(sha256sum "$UPDATE_DIR/rootfs.ext2" | cut -d' ' -f1)",
      "kernel_version": "$(uname -r)"
    }
    EOF

    # Sign package (the signature covers metadata.json, which carries the image checksum)
    openssl dgst -sha256 -sign private_key.pem "$UPDATE_DIR/metadata.json" > "$UPDATE_DIR/signature.bin"

    # Compress and upload
    ARCHIVE="update-$STAMP.tar.gz"
    tar czf "$ARCHIVE" -C "$UPDATE_DIR" .
    scp "$ARCHIVE" update-server:/releases/
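
Interpolating values straight into a JSON heredoc breaks as soon as one of them contains a quote or backslash. As a hedged alternative, the metadata step could be done with a JSON serializer. The sketch below assumes the same five fields as the script above; `build_metadata` and `write_metadata` are hypothetical helper names, not part of the original pipeline.

```python
import hashlib
import json
import os
import time


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata(image_path, version, kernel_version):
    """Collect the same fields the shell script writes to metadata.json."""
    return {
        "version": version,
        "timestamp": int(time.time()),
        "size": os.path.getsize(image_path),
        "checksum": sha256_file(image_path),
        "kernel_version": kernel_version,
    }


def write_metadata(metadata, out_path):
    """Serialize with sorted keys so the bytes that get signed are reproducible."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, sort_keys=True, indent=2)
        f.write("\n")
```

Passing `kernel_version` in explicitly also avoids recording the build host's kernel by accident.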

    2. Update Agent (Client-Side)

    The update agent ran as a systemd service on each device:

    // Update agent main loop
    int main(int argc, char *argv[]) {
        // Initialize logging
        init_logging();

        // Check for updates periodically
        while (1) {
            update_info_t *update = check_for_updates();

            if (update != NULL) {
                log_info("Update available: %s", update->version);

                // Download update
                if (download_update(update) == 0) {
                    // Verify signature
                    if (verify_signature(update) == 0) {
                        // Install update
                        if (install_update(update) == 0) {
                            log_info("Update installed successfully");
                            reboot_system();
                        } else {
                            log_error("Update installation failed");
                            rollback_update();
                        }
                    } else {
                        log_error("Signature verification failed");
                        remove_update_files();
                    }
                } else {
                    log_error("Update download failed");
                }

                free_update_info(update);
            }

            sleep(UPDATE_CHECK_INTERVAL);
        }

        return 0;
    }

    3. Secure Download with Resume

    Network interruptions were common, so I implemented resumable downloads:

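    int download_update(update_info_t *update) {
        int fd = open(update->local_path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            return -1;
        }

        // Check if partial download exists; this also leaves the file
        // position at the end, so new data is appended
        off_t offset = lseek(fd, 0, SEEK_END);

        // libcurl's default write callback calls fwrite(), so CURLOPT_WRITEDATA
        // must be a FILE *, not a raw file descriptor
        FILE *fp = (offset < 0) ? NULL : fdopen(fd, "wb");
        if (fp == NULL) {
            close(fd);
            return -1;
        }

        CURL *curl = curl_easy_init();
        if (curl == NULL) {
            fclose(fp);
            return -1;
        }
        curl_easy_setopt(curl, CURLOPT_URL, update->download_url);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);
        curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, (curl_off_t)offset);
        curl_easy_setopt(curl, CURLOPT_FAILONERROR, 1L);  // don't save HTTP error bodies
        curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
        curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 2L);

        CURLcode res = curl_easy_perform(curl);

        curl_easy_cleanup(curl);
        if (fclose(fp) != 0) {  // also closes fd; a failed flush means lost data
            return -1;
        }

        return (res == CURLE_OK) ? 0 : -1;
    }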
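
Resuming helps only if the agent retries, and retrying immediately on a flaky link tends to waste power and bandwidth. A minimal sketch of retry with exponential backoff follows. The delays, the 10% jitter, and the names `backoff_delays` and `download_with_retry` are illustrative assumptions rather than the values used in the field.

```python
import random
import time


def backoff_delays(base=2.0, factor=2.0, max_delay=300.0, attempts=6):
    """Yield the wait before each retry: base, base*factor, ... capped at max_delay."""
    delay = base
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= factor


def download_with_retry(download, sleep=time.sleep, jitter=random.random, **kwargs):
    """Call download() until it returns 0 or the retry budget is spent.

    `download` is any callable returning 0 on success, mirroring
    download_update(). Because that function resumes from the partial
    file, each retry continues where the previous attempt stopped.
    """
    if download() == 0:
        return True
    for delay in backoff_delays(**kwargs):
        # Up to 10% extra wait so a fleet of devices does not retry in lockstep
        sleep(delay * (1.0 + 0.1 * jitter()))
        if download() == 0:
            return True
    return False
```

`sleep` and `jitter` are injectable so the schedule can be unit-tested without waiting.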

    4. Signature Verification

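    Every update package was signed with an RSA-2048 key. The signature covers metadata.json, which in turn carries the SHA-256 checksum of the image: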

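    int verify_signature(update_info_t *update) {
        int ret = -1;
        EVP_PKEY *pubkey = NULL;
        EVP_MD_CTX *md_ctx = NULL;
        FILE *meta_file = NULL;
        unsigned char signature[256];  // an RSA-2048 signature is exactly 256 bytes
        unsigned char buffer[4096];
        size_t sig_len, bytes;

        // Load public key
        FILE *pubkey_file = fopen("/etc/update/public_key.pem", "r");
        if (pubkey_file == NULL) {
            return -1;
        }
        pubkey = PEM_read_PUBKEY(pubkey_file, NULL, NULL, NULL);
        fclose(pubkey_file);
        if (pubkey == NULL) {
            return -1;
        }

        // Read signature
        FILE *sig_file = fopen(update->signature_path, "rb");
        if (sig_file == NULL) {
            goto cleanup;
        }
        sig_len = fread(signature, 1, sizeof(signature), sig_file);
        fclose(sig_file);
        if (sig_len != sizeof(signature)) {
            goto cleanup;
        }

        // Stream the metadata through the verifier. EVP_DigestVerify* hashes
        // the input itself, so the raw file bytes are passed in; hashing them
        // first would verify a hash of a hash and never match the signature
        // produced by `openssl dgst -sha256 -sign`.
        md_ctx = EVP_MD_CTX_new();
        if (md_ctx == NULL ||
            EVP_DigestVerifyInit(md_ctx, NULL, EVP_sha256(), NULL, pubkey) != 1) {
            goto cleanup;
        }

        meta_file = fopen(update->metadata_path, "rb");
        if (meta_file == NULL) {
            goto cleanup;
        }
        while ((bytes = fread(buffer, 1, sizeof(buffer), meta_file)) > 0) {
            if (EVP_DigestVerifyUpdate(md_ctx, buffer, bytes) != 1) {
                goto cleanup;
            }
        }

        // Verify signature
        if (EVP_DigestVerifyFinal(md_ctx, signature, sig_len) == 1) {
            ret = 0;
        }

    cleanup:
        if (meta_file != NULL) {
            fclose(meta_file);
        }
        EVP_MD_CTX_free(md_ctx);
        EVP_PKEY_free(pubkey);
        return ret;
    }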
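
Since the signature is computed over metadata.json rather than the image, the image still has to be checked against the signed `size` and `checksum` fields before it is installed. The following is a rough sketch of that second step, assuming the metadata layout produced by the packaging script; `image_matches_metadata` is a hypothetical name.

```python
import hashlib
import hmac
import json
import os


def image_matches_metadata(image_path, metadata_path, chunk_size=1 << 20):
    """Return True if the image's size and SHA-256 match metadata.json.

    Call this only after the metadata signature has been verified;
    an unsigned checksum says nothing about where the image came from.
    """
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)

    # Cheap check first: a truncated download fails here without hashing
    if os.path.getsize(image_path) != metadata["size"]:
        return False

    digest = hashlib.sha256()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)

    # Constant-time comparison of the two hex strings
    return hmac.compare_digest(digest.hexdigest(), metadata["checksum"].lower())
```

Reading in chunks keeps memory use flat, which matters on devices with 128-256MB of RAM.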

    5. Dual-Boot Partition Strategy

    To enable safe rollbacks, I implemented a dual-boot partition scheme:

    Device Storage Layout:
    ├── /dev/mmcblk0p1 (Boot partition - 16MB)
    ├── /dev/mmcblk0p2 (Rootfs A - 1GB) ← Active
    ├── /dev/mmcblk0p3 (Rootfs B - 1GB) ← Standby
    ├── /dev/mmcblk0p4 (Data partition - remaining)
    └── /dev/mmcblk0p5 (Recovery partition - 512MB)

    Update process:

  • Download update to standby partition (B)

  • Verify signature and checksums

  • Update bootloader to point to partition B

  • Reboot into partition B

  • If successful, mark B as active

  • If failed, bootloader automatically boots partition A
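
The fallback in the last two steps is often implemented with a trial counter stored in the bootloader's persistent environment. The model below is a simplified sketch of that logic in Python, not the actual bootloader code; the `active`, `pending`, and `tries_left` keys are hypothetical names.

```python
def select_boot_slot(env):
    """Decide which rootfs slot to boot and update the trial counter.

    `env` is a dict standing in for the bootloader's persistent environment:
      active     - last known-good slot, "A" or "B"
      pending    - slot under trial after an update, or None
      tries_left - remaining boot attempts for the pending slot
    """
    pending = env.get("pending")
    if pending is None:
        return env["active"]
    if env["tries_left"] > 0:
        env["tries_left"] -= 1  # spent before the kernel starts, so a hang still counts
        return pending
    # Trial budget exhausted without confirmation: abandon the update
    env["pending"] = None
    return env["active"]


def confirm_boot(env):
    """Called by the update agent once the new system passes its health checks."""
    if env.get("pending") is not None:
        env["active"] = env["pending"]
        env["pending"] = None


# Example: an update to slot B that never confirms falls back to A
env = {"active": "A", "pending": "B", "tries_left": 2}
boots = [select_boot_slot(env) for _ in range(3)]
print(boots)  # ['B', 'B', 'A']
```

The key property is that the standby slot only becomes the default after the running system explicitly confirms it, so a crash or power cut during the first boot leaves slot A in charge.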

    6. Kernel Optimizations

    I performed deep kernel optimizations to meet hardware requirements:

    Memory Management:

    # Reduced kernel memory footprint
    CONFIG_HIGHMEM=n
    CONFIG_X86_PAE=n
    CONFIG_VMSPLIT_3G=y

    # Optimized slab allocator
    CONFIG_SLUB=y
    CONFIG_SLUB_CPU_PARTIAL=y

    I/O Optimizations:

    # Tuned I/O scheduler for eMMC
    CONFIG_MQ_IOSCHED_DEADLINE=y

    # Reduced buffer sizes
    CONFIG_BLK_DEV_RAM_SIZE=4096

    Network Stack:

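    # Optimized TCP for low-bandwidth links
    CONFIG_TCP_CONG_BBR=y
    CONFIG_DEFAULT_BBR=y
    CONFIG_DEFAULT_TCP_CONG="bbr"

    # TCP memory limits are a runtime sysctl, not a Kconfig symbol
    # (set in /etc/sysctl.conf or similar)
    net.ipv4.tcp_mem = 4096 8192 16384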

    Testing and Validation

    Test Scenarios

    I created comprehensive test scenarios:

  • Normal update flow - Successful update from A to B

  • Network interruption - Resume download after connection loss

  • Power failure - Recovery after unexpected shutdown

  • Corrupted download - Detection and re-download

  • Invalid signature - Rejection of unsigned or tampered updates

  • Rollback - Automatic fallback to previous version

  • Concurrent updates - Handling multiple devices simultaneously

    Test Infrastructure

    # Automated testing script
    # Helpers such as deploy_test_device() and trigger_update() come from the
    # test harness and are not shown here.
    import subprocess
    import time

    def test_update_flow():
        # Deploy test device
        device = deploy_test_device()

        # Trigger update
        trigger_update(device, "v1.0.0")

        # Simulate network interruption
        time.sleep(5)
        disconnect_network(device)
        time.sleep(10)
        reconnect_network(device)

        # Verify update completed
        version = get_device_version(device)
        assert version == "v1.0.0"

        # Test rollback
        trigger_update(device, "v1.0.1")
        simulate_power_failure(device)

        # Verify rollback
        version = get_device_version(device)
        assert version == "v1.0.0"

    Results and Impact

    Metrics

  • Update Success Rate: 99.2%

  • Average Update Time: 8-12 minutes (depending on network)

  • Rollback Time: < 2 minutes

  • Zero Data Loss: All updates atomic

  • Security: Zero unauthorized updates

    Challenges Overcome

  • Network Reliability - Implemented resumable downloads with exponential backoff

  • Power Failures - Dual-boot partitions with atomic updates

  • Storage Constraints - Delta updates to minimize package size

  • Security - Cryptographic signatures with secure key management

  • Performance - Kernel optimizations reduced memory footprint by 40%

    Lessons Learned

  • Atomic operations are critical - Updates must be all-or-nothing to prevent corruption

  • Resilience over speed - It's better to be slow and reliable than fast and fragile

  • Test failure scenarios - Most bugs appear during edge cases, not happy paths

  • Kernel tuning matters - Small optimizations can make the difference between working and not working

  • Documentation is essential - Field debugging requires clear documentation of system behavior

    Conclusion

    Building the OTA update system at Batna was a deep dive into systems engineering. It required understanding everything from kernel internals to network protocols, from cryptographic signatures to bootloader mechanics. The system I built successfully updated thousands of devices in the field with a 99.2% success rate and zero data loss.

    The experience taught me that systems engineering is about more than just writing code—it's about understanding the entire stack, from hardware to software, and designing for reliability, security, and maintainability.

    ---

    Interested in embedded systems, OTA updates, or kernel optimization? Let's connect!