Noisy Neighbors in Real-Time IIoT Edge Systems

EDF Scheduling, Interference, and Knowledge Graph Insights

Posted by Christopher O’Hara on September 25, 2025

This post is framed as an AIOps case study, where we combine runtime observability, scheduling analysis, and knowledge graphs to diagnose a noisy-neighbor fault in a multi-robot IIoT environment.

Industrial IoT (IIoT) deployments are increasingly built around shared compute resources at the edge. Instead of sending every sensor reading and control command to the cloud, robots, conveyors, and vision systems push workloads to a local edge server. This improves latency, reduces backhaul bandwidth, and provides resilience when connectivity is degraded.

But with shared compute comes contention. Just as in cloud data centers, edge servers experience the noisy neighbor problem: one workload dominates the CPU or memory, introducing jitter and deadline misses for others. In a shopfloor environment with three cooperating robots, this can mean one robot’s large sensor-fusion update or batch vision inference slows down the others’ motion planning loops. The effect is not just slower throughput — it can directly compromise real-time guarantees that safety controllers, collision avoidance, or force-control loops rely on.

In this post, we explore the noisy neighbor phenomenon in an IIoT context. We start with a best-case scenario using Earliest Deadline First (EDF) scheduling, then show how jitter accumulates when tasks are preempted by competing workloads. By mapping this to time diagrams inspired by Buttazzo’s real-time scheduling charts, we can visualize how locked compute windows, task switching, and processor starvation appear. Finally, we connect this to the system-level cause of the noisy neighbor and the knowledge graph results that characterize it.


Real-Time Scheduling and the EDF Perspective

To understand the noisy neighbor effect on an IIoT shopfloor, it helps to frame the problem in real-time scheduling theory. Buttazzo’s treatment of task scheduling highlights that algorithms like Earliest Deadline First (EDF) are optimal for preemptive uniprocessor systems: if a set of periodic tasks can be scheduled at all, EDF will schedule them without missing deadlines. Each task $\tau_i$ is released at regular intervals (its period $T_i$) with a computation time $C_i$ and a deadline $D_i$. EDF ensures that the ready task with the closest absolute deadline always runs next.

In the best case, with three robots sharing edge compute resources, EDF would provide deterministic execution: motion-control loops, sensor updates, and path-planning tasks each complete within their deadlines. The time diagrams would show clean, contiguous execution blocks, and every task instance would complete before its deadline.

The challenge arises once jitter and lag enter the picture. In practice, noisy neighbors on shared edge servers cause two key degradations:

  1. Jitter in task start times.
    A robot’s motion-control task might be ready to execute, but if another workload is holding the processor longer than expected (e.g., a heavy vision inference job), the start of the motion task is delayed. EDF still prioritizes deadlines correctly, but the effective slack shrinks.

  2. Lag in distributed updates.
    IIoT robots rely on frequent updates to their environment model (e.g., obstacle maps or coordination signals). If one robot’s update is late, the others act on stale data. Even if each task eventually executes, the control loop’s effective latency grows beyond real-time bounds.

When these effects combine, the system may miss hard deadlines — for example, a collision-avoidance update delayed just enough that two robots enter the same zone without awareness of each other. In Buttazzo-style diagrams, the result is clear: instead of contiguous, deadline-aligned blocks, tasks are interrupted or shifted, leaving gaps where they should not be. Those “empty” areas correspond to processor contention, directly leading to hazard risks.


Visualizing Best-Case vs. Noisy Neighbor Execution

The diagrams below illustrate the difference between ideal EDF scheduling and what happens when noisy neighbor interference introduces jitter.

Best-Case Execution (EDF):
EDF Best Case

In this case, three robots ($\tau_{r1}, \tau_{r2}, \tau_{r3}$) share the processor. Each task activates at its interval (hatched region) and is scheduled immediately by EDF based on deadline order. All tasks complete before their deadlines, with clean execution windows and no overlap. This is the theoretical guarantee EDF provides under schedulability conditions: for periodic tasks whose deadlines equal their periods, all deadlines are met if and only if the total utilization $\sum_i C_i / T_i$ is at most 1, where $C_i$ is each task’s computation time and $T_i$ its period.
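As a worked check of the utilization bound, assume three hypothetical tasks with computation times $C_i$ and periods $T_i$ of (1, 4), (2, 5), and (3, 10) ticks. These numbers are illustrative only and are not measured from the shopfloor model.

```latex
U = \sum_{i=1}^{3} \frac{C_i}{T_i}
  = \frac{1}{4} + \frac{2}{5} + \frac{3}{10}
  = 0.95 \le 1
```

With deadlines equal to periods, this set is schedulable under EDF, but only 5% of the processor is spare. Adding a fourth hypothetical task with $C = 1$ and $T = 10$ would raise $U$ to 1.05, at which point no scheduler can meet every deadline.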

Noisy Neighbor Jitter and Deadline Misses:
EDF Noisy Neighbor

Here, we introduce interference from a noisy neighbor workload. Tasks are delayed in starting, visible as shifts in execution blocks into later parts of their activation intervals. This causes:

  • Execution stretching: longer blocks where one robot’s task monopolizes compute.
  • Task displacement: $\tau_{r2}$ and $\tau_{r3}$ start later than intended, encroaching on their deadlines.
  • Deadline misses: some tasks now execute partially or entirely after their deadline markers (arrows), breaking real-time guarantees.

The effect is cumulative: small jitters propagate forward, shrinking the available slack for subsequent tasks. Even if each robot’s job executes eventually, the timing violations mean that motion commands and perception updates no longer arrive when the robots need them.
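The contrast between the two diagrams can be reproduced with a small tick-based simulation. This is a minimal sketch under stated assumptions, not the model behind the figures: the task parameters are hypothetical, deadlines equal periods, a job that is still pending at its deadline is aborted, and the noisy neighbor is modeled as a set of ticks during which it holds the CPU outside the scheduler's control.

```python
def run_edf(tasks, horizon, stolen=frozenset()):
    """Simulate preemptive EDF on one processor in unit ticks.

    tasks:  list of (name, period, wcet); each deadline equals its period.
    stolen: ticks during which the noisy neighbor holds the CPU (assumed model).
    Returns a list of (task_name, missed_deadline) tuples.
    """
    ready = []   # pending jobs as [absolute_deadline, name, remaining_ticks]
    misses = []
    for t in range(horizon + 1):
        # Any job still pending at its deadline has missed; record and abort it.
        misses += [(name, d) for d, name, _ in ready if d <= t]
        ready = [job for job in ready if job[0] > t]
        if t == horizon:
            break
        # Release a new job for every task whose period starts at this tick.
        for name, period, wcet in tasks:
            if t % period == 0:
                ready.append([t + period, name, wcet])
        if t in stolen or not ready:
            continue  # CPU is held by the neighbor, or there is nothing to run
        job = min(ready)  # earliest absolute deadline first
        job[2] -= 1
        if job[2] == 0:
            ready.remove(job)
    return misses


# Hypothetical task set: utilization 1/4 + 2/5 + 3/10 = 0.95.
TASKS = [("r1", 4, 1), ("r2", 5, 2), ("r3", 10, 3)]

baseline = run_edf(TASKS, horizon=20)
disturbed = run_edf(TASKS, horizon=20, stolen={0, 1, 2})
print("baseline misses: ", baseline)
print("disturbed misses:", disturbed)
```

In this toy run the undisturbed schedule misses nothing, while three stolen ticks at the start cause r2 to miss its deadline at tick 5 and r3 to miss at tick 10. The second miss happens well after the neighbor has released the CPU, which is the forward propagation of jitter described above.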


Causes of the Noisy Neighbor

The shopfloor model involves three MyAGV robots running edge compute with camera/IMU/LiDAR sensors at distinct rates. Each sensor event consumes local CPU for preprocessing and may probabilistically trigger a cloud inference. Control loops run at 10 Hz and sometimes depend on the most recent cloud inference before issuing an actuator command. Response times are measured against a 150 ms SLA, with dwell counters to flag sustained violations. This makes each control loop sensitive to both edge load and cloud inference delays.
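A dwell counter of the kind mentioned above could look like the following sketch. The 150 ms SLA comes from the model description; the dwell threshold of three consecutive violations and the function name are assumptions made for illustration.

```python
SLA_MS = 150.0  # response-time SLA stated for the shopfloor model


def dwell_alerts(response_times_ms, dwell_threshold=3):
    """Return sample indices where consecutive SLA violations reach the threshold.

    dwell_threshold is a hypothetical setting; a single slow response resets
    nothing by itself, but any on-time response resets the counter to zero.
    """
    alerts = []
    dwell = 0
    for i, rt in enumerate(response_times_ms):
        dwell = dwell + 1 if rt > SLA_MS else 0
        if dwell == dwell_threshold:
            alerts.append(i)  # alert once per sustained run of violations
    return alerts


samples = [100, 160, 170, 180, 190, 120, 155, 156, 157]
print(dwell_alerts(samples))
```

An isolated spike never raises an alert here, whereas a run of three or more late responses raises exactly one, which keeps the alert stream focused on sustained violations.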

Cloud compute is represented by two servers, each with a queue and a dynamic load balancer that routes jobs by combining CPU load and queue depth. Service times inflate with server load, creating a feedback loop: load raises latency, which further inflates load. A periodic high-load process pinned to cloud-b acts as the noisy neighbor, cycling CPU upward during its duty phase. When cloud-b is under load, the balancer diverts jobs to cloud-a, which accumulates queue depth. As cloud-b cools, jobs swing back, causing oscillations. This interaction between periodic load, reactive balancing, and service inflation creates the jitter and deadline misses seen in the EDF diagrams. At the edge, added CPU bumps from preprocessing and inference packaging compound the effect, so delays ripple directly into control loops and hazard scenarios.
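The interaction between the periodic load, the reactive balancer, and service-time inflation can be sketched as a toy discrete-time model. Every constant below (arrival rate, drain rate, score weights, duty cycle) is an assumption chosen for illustration and is not taken from the actual simulation.

```python
def simulate_balancer(steps=20, period=10, duty=5):
    """Toy model of two servers behind a reactive balancer (all constants assumed)."""
    queue = {"cloud-a": 0.0, "cloud-b": 0.0}
    history = []
    for t in range(steps):
        # The noisy neighbor adds CPU load to cloud-b during its duty phase.
        neighbor_cpu = 0.6 if t % period < duty else 0.0
        cpu = {
            "cloud-a": min(1.0, 0.1 * queue["cloud-a"]),
            "cloud-b": min(1.0, neighbor_cpu + 0.1 * queue["cloud-b"]),
        }
        # Routing score combines CPU load and queue depth; the lowest score wins.
        score = {s: cpu[s] + 0.1 * queue[s] for s in queue}
        target = min(score, key=score.get)
        queue[target] += 2.0  # two jobs arrive per step
        # Service time inflates with CPU load, so fewer jobs drain per step.
        for s in queue:
            queue[s] = max(0.0, queue[s] - 1.5 / (1.0 + cpu[s]))
        history.append(
            (t, target, round(queue["cloud-a"], 2), round(queue["cloud-b"], 2))
        )
    return history


history = simulate_balancer()
for row in history:
    print(row)
```

In this sketch the balancer sends the first five steps to cloud-a while the neighbor is active on cloud-b, so cloud-a builds a backlog. As soon as the neighbor goes quiet, routing swings to cloud-b. The later steps show how the two queues then trade places, which is the oscillation the paragraph describes.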

CPU Evidence of Noisy Neighbor Behavior

We can also confirm the noisy neighbor directly by inspecting CPU traces. Cloud-b shows a clear duty cycle pattern where CPU periodically spikes to high levels before dropping back down, while myagv-1 exhibits fluctuating but bounded edge CPU usage. This confirms the presence of a background workload contending for resources on cloud-b.

cloud-b CPU
myagv-1 CPU
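A duty-cycle pattern like the one on cloud-b can also be detected programmatically. The sketch below finds rising edges across a threshold and reports the median spacing between them. The threshold, the synthetic trace, and the function names are all assumptions; real traces would need smoothing first.

```python
def rising_edges(trace, threshold):
    """Indices where the trace crosses the threshold from below."""
    return [
        i for i in range(1, len(trace))
        if trace[i - 1] < threshold <= trace[i]
    ]


def estimate_period(trace, threshold):
    """Median gap between rising edges, or None if fewer than two edges exist."""
    edges = rising_edges(trace, threshold)
    gaps = sorted(b - a for a, b in zip(edges, edges[1:]))
    return gaps[len(gaps) // 2] if gaps else None


# Synthetic trace: 5 quiet samples followed by 5 busy samples, repeated 3 times.
cloud_b_cpu = ([10] * 5 + [80] * 5) * 3
print(rising_edges(cloud_b_cpu, threshold=50))
print(estimate_period(cloud_b_cpu, threshold=50))
```

A stable period across many cycles points to a scheduled background job rather than organic load, which tends to be less regular.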



Knowledge Graph View of the Shopfloor

To organize these dynamics, we export the run artifacts into a knowledge graph. Robots (myagv-1..3) are nodes with properties like avg_edge_cpu, and each sensor (camera, imu, lidar) is a child node annotated with its operating rate. Cloud servers (cloud-a, cloud-b) are nodes with aggregated properties such as avg_cpu and avg_qdepth derived from time-series data. Edges encode relationships like HAS_SENSOR and inference triggers. This graph enables structured queries that link topology and performance — for example, “which robots saw SLA dwell alerts while their chosen server showed inflated service time?”

One challenge for the KG is that the state space can blow up quickly. Even in this scenario, with only three robots and a lot of abstraction (processes and tasks are hidden), we have 112 rows of values. See the next image for an idea of the breadth of the data (and why tabular data structures are not going to work well):

Variables

In general, we need to constrain the KG to information that is immediately intelligible and explainable to humans, while keeping a fully connected KG backend to run queries on. You might be surprised to learn that a single faulty sensor can cause issues for a compute service, and that conventional fault tree analysis (FTA) and failure mode and effects analysis (FMEA) might not find it.

In the snapshot, robots average 18–20% edge CPU. Sensors report their configured rates (camera 15 Hz, imu 100 Hz, lidar 10 Hz). cloud-b shows much higher avg_cpu (≈33.5) than cloud-a (≈10.0), consistent with the noisy-neighbor duty cycle, but its queue depth is lower (0.56 vs 1.51). This inversion is a tell-tale sign of reactive load balancing: cloud-b appears saturated, so the balancer shifts load away, keeping its queue short, while cloud-a silently accumulates backlog. This KG representation not only captures the runtime symptoms of noisy-neighbor interference but also highlights system-level asymmetries that explain why control latencies spike downstream.

KG nodes summary
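The CPU and queue-depth inversion can be checked mechanically once the server nodes are exported. The sketch below uses the avg_cpu and avg_qdepth values quoted from the snapshot; the dictionary layout and the function name are assumptions, not the actual export format.

```python
# Server node properties as quoted in the snapshot above.
servers = {
    "cloud-a": {"avg_cpu": 10.0, "avg_qdepth": 1.51},
    "cloud-b": {"avg_cpu": 33.5, "avg_qdepth": 0.56},
}


def cpu_queue_inversions(nodes):
    """Return (hot, cool) pairs where hot has higher CPU but a shorter queue."""
    names = sorted(nodes)
    return [
        (hot, cool)
        for hot in names
        for cool in names
        if nodes[hot]["avg_cpu"] > nodes[cool]["avg_cpu"]
        and nodes[hot]["avg_qdepth"] < nodes[cool]["avg_qdepth"]
    ]


print(cpu_queue_inversions(servers))
```

A non-empty result flags servers whose CPU is being consumed by something other than the queued inference jobs, which is the signature of a co-located background workload.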


System Connections and Technologies

To make the interactions concrete, we visualize the whole dataflow from the robots to the cloud and back. On the edge, each MyAGV hosts sensors (camera 15 Hz, LiDAR 10 Hz, IMU 100 Hz) that feed a ROS 2 preprocessing front end (SLAM/feature extraction) and a 10 Hz control loop. An edge gateway bridges ROS 2 to transport (MQTT or HTTP), emitting telemetry and inference requests. In the cloud path, a least-load load balancer routes requests to inference services (containerized model servers on EKS). A periodic background noisy-neighbor workload shares the same node as one inference service and injects CPU/IO contention. Requests and results can traverse an event backbone (Kafka/MSK topics: topic.infer.req, topic.infer.resp, topic.telemetry.raw, topic.ops.alerts) so inference outputs and telemetry are available to multiple consumers. The control loop consumes the latest inference results and updates actuators; when delays occur, they propagate directly into control latency.

For observability and coordination, telemetry lands in MongoDB (time-series JSON), and enriched relationships (who waited on whom, contention edges, routing decisions, incident nodes) are written to Neo4j. A stream processor (Flink/Spark) normalizes telemetry, correlates inference latencies with server load, derives risk edges, and emits operational alerts back to the bus. The three PlantUML views correspond to: (1) a component-level map of edge, cloud, and data services with contention edges, (2) a full dataflow including event topics and processing pipeline, and (3) a minimal overview suitable for slides. Together they show where jitter is introduced (noisy neighbor on the cloud node), how routing decisions react (load balancer), how results return to control, and how the data layer captures incidents for later analysis.

System Diagram 1
System Diagram 2
System Diagram 3


Knowledge Graph Updates and Queryability

We extend the KG beyond static inventory so relationships are first-class and analyzable. Robots, sensors, preprocessors, control loops, the edge gateway, topics, the load balancer, inference services, the background workload, and data services are all nodes with typed properties (e.g., sensor hz, control hz, LB policy, service node=cloud-a|cloud-b). Edges encode operational semantics, not just wiring: HAS_SENSOR (robot→sensor), FEEDS (dataflow), PUBLISHES_TO and CONSUMES_FROM (bus interactions), ROUTES_TO (LB decision), WRITES_TO (persistence), and INTERFERES_WITH (contention). This lets edge context carry equal weight to cloud node stats in queries: you can traverse from a control loop’s latency incident to the specific result topic that fed it, to the inference service that produced it, to the LB decision that routed it, and finally to the noisy-neighbor that interfered with that service on its host.

With these semantics in place, we can run concise Cypher patterns that answer operator questions directly. Examples:

  • Sensors feeding a given robot’s preprocess
    MATCH (a:Sensor)-[:FEEDS]->(b:Preprocess) WHERE a.robot_id='myagv-1' RETURN a.id, a.kind, b.id;
  • Inference results reaching controllers
    MATCH (t:Topic {id:'topic.infer.resp'})-[:FEEDS]->(c:ControlLoop) RETURN t.id, c.id;
  • Load-balancer fan-out
    MATCH (lb:LoadBalancer)-[:ROUTES_TO]->(svc:InferenceService) RETURN lb.id, svc.id;
  • Explicit interference edges
    MATCH (nn:BackgroundWorkload)-[r:INTERFERES_WITH]->(svc:InferenceService) RETURN nn.id, svc.id, r;
  • Telemetry lineage
    MATCH (t:Topic {id:'topic.telemetry.raw'})-[:FEEDS]->(sp:StreamProcessor)-[:WRITES_TO]->(db:Database) RETURN t.id, sp.id, db.id;

Because relationships model contention and routing decisions, path queries naturally correlate edge-side timings (control dwell, preprocessing bursts) with cloud-side causes (duty-cycle spikes, queue growth), making edge information as actionable as node information for root-cause and safety analysis.

Knowledge Graph Visualization

GQL


Root Cause Discovery with the KG

The KG structure allows us to trace the path of an SLA breach. For example, an SLA alert raised by r1-control can be followed upstream: the control loop consumes from topic.infer.resp, which in turn is published by inference-service-b. That service has an incoming INTERFERES_WITH edge from workload-b, the noisy neighbor. At the same time, the KG shows that the load balancer initially routed requests to service-b because of low queue depth, even though its CPU load was inflated by background contention. This pattern reveals the subtle interaction: the balancer’s local metrics favored the wrong node, masking the actual interference and leading directly to jitter and missed deadlines downstream.

By combining runtime traces with this graph structure, we can identify that the root cause is not just “cloud-b is overloaded,” but “the load balancer misclassified cloud-b as available due to shallow queues, while CPU contention from workload-b extended service times.” This difference is crucial for operators: mitigation is not simply scaling up resources, but rethinking load-balancing criteria and isolation from colocated background tasks.
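The upstream trace can be illustrated without a graph database. The sketch below walks typed edges backwards from a control loop until it reaches an INTERFERES_WITH edge. The node ids r1-control, topic.infer.resp, inference-service-b, and workload-b come from the example above; inference-service-a, load-balancer, and the edge-list layout are hypothetical additions for illustration.

```python
# (source, relationship, destination) triples; partly hypothetical node ids.
EDGES = [
    ("topic.infer.resp", "FEEDS", "r1-control"),
    ("inference-service-a", "PUBLISHES_TO", "topic.infer.resp"),
    ("inference-service-b", "PUBLISHES_TO", "topic.infer.resp"),
    ("load-balancer", "ROUTES_TO", "inference-service-a"),
    ("load-balancer", "ROUTES_TO", "inference-service-b"),
    ("workload-b", "INTERFERES_WITH", "inference-service-b"),
]


def interference_paths(edges, start):
    """Walk edges upstream from start; return paths ending in INTERFERES_WITH."""
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault(dst, []).append((src, rel))
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for src, rel in incoming.get(node, []):
            if src in path:
                continue  # avoid revisiting a node on the same path
            step = path + [rel, src]
            if rel == "INTERFERES_WITH":
                paths.append(step)  # reached a contention source
            else:
                stack.append((src, step))
    return paths


for path in interference_paths(EDGES, "r1-control"):
    print(" <- ".join(path))
```

Only the branch through inference-service-b yields a path, so the traversal singles out workload-b as the contention source behind the r1-control alert while ignoring the healthy service.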


Conclusions and Challenges

The case study highlights several systemic challenges in multi-agent IIoT environments:

  • Distributed computing and resource allocation. Shared cloud nodes amplify contention risks when workloads are colocated, especially with dynamic service-time inflation.
  • Dynamic load balancing. Reactive, metric-driven routing can oscillate under interference, producing emergent instability instead of equilibrium.
  • Multi-agent timing sensitivity. Edge control loops depend on predictable inference returns; small jitters propagate and cascade into deadline misses and safety hazards.
  • Observability at multiple levels. Without telemetry correlated across edge, cloud, and data services, operators cannot identify that contention, not just utilization, is the fault driver.

Toward Advanced Mitigations

Several advanced techniques can reduce exposure: partitioning noisy workloads with container isolation or cgroups, applying mixed-criticality scheduling at the node level, refining load balancer policies to weight historical latencies instead of instantaneous queues, or deploying real-time containers with CPU pinning and IO throttling. Stream processors can augment this with adaptive policies based on incident frequency.
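One of these mitigations, weighting historical latencies instead of instantaneous queues, can be sketched with an exponentially weighted moving average (EWMA). The class name, the smoothing factor, and the sample latencies are assumptions; a production policy would also need to handle stale servers and probing.

```python
class LatencyAwareBalancer:
    """Route to the server with the lowest smoothed latency (illustrative sketch)."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # assumed smoothing factor; higher reacts faster
        self.ewma_ms = {s: None for s in servers}

    def observe(self, server, latency_ms):
        """Fold one measured response latency into the server's EWMA."""
        prev = self.ewma_ms[server]
        if prev is None:
            self.ewma_ms[server] = float(latency_ms)
        else:
            self.ewma_ms[server] = (1 - self.alpha) * prev + self.alpha * latency_ms

    def pick(self):
        """Choose the lowest EWMA; servers with no samples yet are tried first."""
        return min(
            self.ewma_ms,
            key=lambda s: self.ewma_ms[s] if self.ewma_ms[s] is not None else 0.0,
        )


lb = LatencyAwareBalancer(["cloud-a", "cloud-b"])
lb.observe("cloud-a", 40)
lb.observe("cloud-b", 120)   # inflated by background contention
lb.observe("cloud-b", 30)    # one fast response during a quiet phase
print(lb.pick(), lb.ewma_ms)
```

Because a single fast sample only moves the average part of the way, cloud-b is not immediately treated as healthy again, which damps the swing-back behavior that an instantaneous queue-depth policy produces.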

The caveat is that emergent behavior is often unpredictable. Even with advanced controls, new interaction loops (for example, between balancers, autoscalers, and stream alerts) can produce unexpected oscillations. For this reason, system designers must balance intervention with humility: sometimes the system will self-organize in ways that evade prediction.


System Architecture and -ilities

At the architecture level, the key -ilities are:

  • Scalability (elasticity of inference services without destabilizing load-balancing),
  • Reliability (ensuring safety-critical loops remain bounded even under interference),
  • Maintainability (using graph-based models to diagnose faults and replay causal chains),
  • Resilience (graceful degradation when deadlines are missed, rather than catastrophic collisions).

By embedding both structural diagrams and runtime knowledge graphs, we gain a dual perspective: the designed system and the emergent system. Together, they inform how we architect future IIoT platforms that are robust to noisy neighbors, aware of emergent contention, and traceable from the edge loop to the cloud node.


Why This Matters

This investigation demonstrates how localized interference can ripple into system-wide hazards. The noisy-neighbor workload is not unique to IIoT; it is a general property of shared resources in distributed systems. What makes the shopfloor case distinct is the direct tie between compute jitter and physical safety: delayed inference results lead to robots missing collision-avoidance windows.

By combining EDF scheduling analysis, system-level diagrams, and knowledge-graph reasoning, we see how to make invisible interactions explicit. The lesson is not only that resource contention matters, but that it must be captured at multiple levels: scheduling theory, telemetry traces, and relational graphs. This approach is transferable from smart factories to autonomous vehicles, from cloud robotics to edge AI. In all cases, the same principle holds: systems must be architected not just for throughput, but for resilience under contention and transparency of failure modes.

The broader message is architectural. Robust IIoT systems must anticipate contention, embrace observability, and maintain verifiability. Without this, emergent behaviors will remain opaque, and safety will depend on luck. With it, operators can reason across abstraction layers, from a single queue spike to a hazard event, and design mitigations that make real-time autonomy dependable.

Looking ahead, these methods form the backbone of future AIOps, where automation, inference, and graph reasoning converge to manage complex distributed systems with safety and resilience in mind.


Reference

Buttazzo, G. C. (2011). Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications (3rd ed.). Springer. https://doi.org/10.1007/978-1-4614-0676-1