Kalman Filters: Multimodal LLM Breaking

Kalman filtering is often presented with simplified block diagrams that hide where noise actually enters the system and whether a direct feedthrough exists. For robust evaluation (or grading) of a solver, the diagram must be interpreted, not ignored. This post provides: (i) a concise problem statement that requires the diagram, (ii) a full step-by-step discrete-time Kalman filter derivation with all matrix multiplications shown, (iii) compact definitions of the model symbols and indexing, and (iv) brief notes on extensions (EKF/UKF), linearization, and fixed-interval smoothing (Rauch–Tung–Striebel).

Why this example (and why it can break LLMs)

This example is deliberately small enough for pencil-and-paper yet rich enough to matter: process noise enters through $B$, and the measurement has both additive noise and a nonzero direct feedthrough $D$. Those two details (the noise channel and $D\neq0$) are exactly where many solutions (and models) go wrong. That makes it useful for:

Pedagogy: you can derive all steps by hand and see each matrix multiplication.
Stress-testing solvers/LLMs: omitting $D u_k$ from the innovation or replacing $BQB^\top$ with a bare $Q$ yields confident but wrong answers. The diagram forces the correct equations.

Control Systems Diagram:
[Figure: control system block diagram]

source: https://www.mathworks.com/help/examples/control/win64/kalmdemo_02.png

Problem Setup

Inspect the control-system block diagram from MathWorks (shown above), which introduces the Kalman filter. The block diagram shows that (1) process noise enters through the same channel as the control input (via $B$), (2) measurement noise is added after the plant, and (3) the output has nonzero direct feedthrough $D$. Each feature must appear in the filter equations: the noise channel as $BQB^\top$ in the prediction covariance, the measurement noise as $R$ in the innovation covariance, and the feedthrough as $D u_k$ in the innovation.

Model and data.
$A=\begin{bmatrix}1&1\\0&1\end{bmatrix},\quad B=\begin{bmatrix}0.5\\1\end{bmatrix},\quad C=\begin{bmatrix}1&0\end{bmatrix},\quad D=0.2;\quad Q=0.04,\ R=0.09;\quad \hat x_0=\begin{bmatrix}0\\0\end{bmatrix},\ P_0=\mathrm{diag}(1,1);$ $u=[\,2.0,\ 0.0,\ 0.5\,],\qquad y=[\,1.50,\ 1.60,\ 4.00\,].$

One-sentence symbol definitions.
$A$ is the state-transition matrix, $B$ is the control/input matrix (the process noise uses this same channel), $C$ is the output/measurement matrix, $D$ is the direct-feedthrough input-to-output gain, $Q$ is the process-noise covariance, $R$ is the measurement-noise covariance, and $x$ is the system state vector.

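Indexing convention. Let $k=0$ correspond to the first pair $(u_0,y_0)$. The first prior is propagated from the initial estimate with $u_{-1}=0$: $\hat x^-_0 = A\,\hat x_0 + B\,u_{-1}$. On the right-hand side, $\hat x_0$ and $P_0$ are the given initial values (they play the role of $\hat x_{-1}$ and $P_{-1}$ in the recursion); after the first update, the same symbols denote the posterior at $k=0$.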

Physical meaning (vehicle/robot mindset).
For a simple kinematics example, take $x=[\text{position},\ \text{velocity}]^\top$.

$A$ advances position by adding velocity; velocity persists.
$B$ maps commanded acceleration (or a thrust proxy) to the state; driving process noise through $B$ encodes that the same channel that moves the plant also injects uncertainty (actuator jitter, drive ripple).
$C$ measures position; $D$ models instantaneous input leakage to the sensor (e.g., feedthrough/cross-coupling).
$Q$ is the variance of the drive noise; $R$ is the variance of the sensor noise.
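To make the position/velocity reading concrete, here is a minimal noise-free sketch that pushes the problem's input sequence through $A$ and $B$. Starting from rest and storing $B$ as a 1-D vector are assumptions made for brevity.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # position += velocity; velocity persists
B = np.array([0.5, 1.0])                 # input channel, stored as a 1-D vector

x = np.zeros(2)                          # [position, velocity], starting at rest
trajectory = []
for u in [2.0, 0.0, 0.5]:                # input sequence from the problem data
    x = A @ x + B * u                    # noise-free propagation
    trajectory.append(x.copy())

print(np.array(trajectory))
```

The first input moves the state from rest to position 1 and velocity 2. With zero input on the next step, the position advances by the carried velocity while the velocity stays put.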

Kalman Filter Equations (discrete-time, matching the diagram)

\[\begin{aligned} x_{k+1} &= A x_k + B\,(u_k + w_k),\qquad w_k\sim\mathcal N(0,Q),\\ y_k &= C x_k + D u_k + v_k,\qquad\ \ \ v_k\sim\mathcal N(0,R), \end{aligned}\] \[\begin{aligned} \hat x^-_k &= A \hat x_{k-1} + B u_{k-1},\\ P^-_k &= A P_{k-1} A^\top + B Q B^\top,\\ \tilde r_k &= y_k - \big(C\hat x^-_k + D u_k\big),\\ S_k &= C P^-_k C^\top + R,\\ K_k &= P^-_k C^\top S_k^{-1},\\ \hat x_k &= \hat x^-_k + K_k \tilde r_k,\\ P_k &= (I - K_k C)\,P^-_k,\\ \hat y_k &= C \hat x_k + D u_k. \end{aligned}\]
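The recursion above can be sketched as a single predict/update function. This is an illustrative NumPy sketch, not a reference implementation: the name `kf_step` and its argument order are hypothetical, vectors are 1-D arrays, and matrices are 2-D arrays. The explicit inverse is acceptable here only because $S_k$ is 1×1.

```python
import numpy as np

def kf_step(x_post, P_post, u_prev, u_now, y_now, A, B, C, D, Q, R):
    """One predict/update cycle for x+ = A x + B (u + w), y = C x + D u + v."""
    # Predict: the previous input drives the state; noise enters through B.
    x_prior = A @ x_post + B @ u_prev
    P_prior = A @ P_post @ A.T + B @ Q @ B.T
    # Innovation: subtract the full predicted output, including D u_k.
    r = y_now - (C @ x_prior + D @ u_now)
    S = C @ P_prior @ C.T + R
    K = P_prior @ C.T @ np.linalg.inv(S)
    # Update the state, the covariance, and the output estimate.
    x_new = x_prior + K @ r
    P_new = (np.eye(len(x_new)) - K @ C) @ P_prior
    y_hat = C @ x_new + D @ u_now
    return x_new, P_new, y_hat

# Problem data, applied to the first step (k = 0, with u_{-1} = 0).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.2]])
Q = np.array([[0.04]])
R = np.array([[0.09]])

x0, P0, y_hat0 = kf_step(np.zeros(2), np.eye(2),
                         np.array([0.0]), np.array([2.0]), np.array([1.50]),
                         A, B, C, D, Q, R)
print(x0, y_hat0)
```

Passing `u_prev` and `u_now` separately keeps the indexing explicit: the prior uses $u_{k-1}$, while the innovation and the output estimate use $u_k$.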

Bayesian anatomy: prior vs posterior (and why other fields care).

Prior (a priori): $(\hat x^-_k, P^-_k)$ summarizes our belief before seeing $y_k$.
Likelihood: innovation $\tilde r_k$ and its covariance $S_k$ measure how surprising the measurement is, given the prior.
Posterior (a posteriori): $(\hat x_k, P_k)$ after fusing the evidence.
This exact prior→likelihood→posterior loop is what turns up in:
Predictive coding / neurointelligence: descending predictions (priors) corrected by ascending prediction errors (innovations).
Reinforcement learning: belief updates in partially observable settings; “critic” signals look a lot like innovations.
Classical control with observers: the KF is the Luenberger observer that learns its gain from noise statistics.
It’s the same story: a disciplined prior, a measurable surprise, a corrected posterior. Real systems are messy and rarely linear, but the structure generalizes.

Worked Example — Full Step-by-Step Derivation

Consider the figure of a standard discrete-time LTI system. Pay close attention to:
1) the channel through which process noise enters,
2) the point at which measurement noise is injected, and
3) the presence of direct feedthrough from input to output.

Use the diagram-consistent stochastic model:
$x_{k+1}=A\,x_k + B\,(u_k+w_k),\quad w_k\sim\mathcal N(0,Q);\qquad y_k=C\,x_k + D\,u_k + v_k,\quad v_k\sim\mathcal N(0,R).$

Indexing and conventions
Let $k=0$ index the first pair $(u_0,y_0)$ and use $u_{-1}=0$ for the first prediction. All quantities are dimensionless.

Kalman filter equations to apply (consistent with the diagram)
$\hat x^-_k = A \hat x_{k-1} + B u_{k-1}, \qquad P^-_k = A P_{k-1} A^\top + B Q B^\top$

$\tilde r_k = y_k - (C\hat x^-_k + D u_k),\qquad S_k = C P^-_k C^\top + R,$

$K_k = P^-_k C^\top S_k^{-1},\qquad \hat x_k = \hat x^-_k + K_k \tilde r_k,\qquad P_k = (I - K_k C)P^-_k,$

$\hat y_k = C \hat x_k + D u_k.$

Task
Compute the sequence up to $k=2$ and report $\hat y_2$ (the estimated output at $k=2$) to five significant figures.
Important: honor the noise placement ($BQB^\top$ in the prediction covariance) and the direct feedthrough term $D u_k$ in the innovation and output. Note for LLM Breaking: You must feed the image of the system (above) along with the input parameters and the task requirements.

Precompute $BQB^\top$ with explicit multiplications. $BQ=\begin{bmatrix}0.5\\1\end{bmatrix}\,0.04 =\begin{bmatrix}0.02\\0.04\end{bmatrix},\qquad BQB^\top=\begin{bmatrix}0.02\\0.04\end{bmatrix}\begin{bmatrix}0.5&1\end{bmatrix} =\begin{bmatrix}0.01&0.02\\0.02&0.04\end{bmatrix}.$ Why $BQB^\top$? Because the diagram says process noise enters via $B$. If you add $Q$ directly to $P^-_k$ you’re pretending noise acts on the state with identity gain. That’s not this system.
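A quick numerical check of this product, next to the mistaken identity-gain version, can be sketched as follows. The variable names are illustrative.

```python
import numpy as np

B = np.array([[0.5], [1.0]])
Q = np.array([[0.04]])

BQBt = B @ Q @ B.T               # noise mapped through the input channel
bare_Q = Q.item() * np.eye(2)    # the common mistake: identity-gain noise

print(BQBt)
print(np.linalg.matrix_rank(BQBt), np.linalg.matrix_rank(bare_Q))
```

The ranks make the difference visible: a single scalar noise source acting through $B$ can perturb the state only along the direction of $B$, so $BQB^\top$ has rank 1, whereas $Q\,I$ spreads independent noise over both states.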

Step $k=0$ (use $u_{-1}=0$)

Prediction. $\hat x^-_0=A\hat x_0+B\cdot 0 =\begin{bmatrix}1&1\\0&1\end{bmatrix}\!\begin{bmatrix}0\\0\end{bmatrix} =\begin{bmatrix}0\\0\end{bmatrix}.$ $A P_0=\begin{bmatrix}1&1\\0&1\end{bmatrix}\!\begin{bmatrix}1&0\\0&1\end{bmatrix} =\begin{bmatrix}1&1\\0&1\end{bmatrix},\quad (A P_0)A^\top=\begin{bmatrix}1&1\\0&1\end{bmatrix}\!\begin{bmatrix}1&0\\1&1\end{bmatrix} =\begin{bmatrix}2&1\\1&1\end{bmatrix}.$ $P^-_0=\begin{bmatrix}2&1\\1&1\end{bmatrix}+\begin{bmatrix}0.01&0.02\\0.02&0.04\end{bmatrix} =\begin{bmatrix}2.01&1.02\\1.02&1.04\end{bmatrix}.$

Innovation, covariance, gain. $C\hat x^-_0 + D u_0 = \begin{bmatrix}1&0\end{bmatrix}\!\begin{bmatrix}0\\0\end{bmatrix}+0.2\cdot 2.0=0.4,$ $\tilde r_0=1.50-0.40=1.10,\qquad C P^-_0=\begin{bmatrix}2.01&1.02\end{bmatrix},\qquad S_0=2.01+0.09=2.10,$ $P^-_0 C^\top=\begin{bmatrix}2.01\\1.02\end{bmatrix},\qquad K_0=\frac{1}{2.10}\begin{bmatrix}2.01\\1.02\end{bmatrix} =\begin{bmatrix}0.95714286\\0.48571429\end{bmatrix}.$

Update. $K_0\tilde r_0=\begin{bmatrix}1.05285714\\0.53428571\end{bmatrix},\quad \hat x_0=\begin{bmatrix}0\\0\end{bmatrix}+\begin{bmatrix}1.05285714\\0.53428571\end{bmatrix} =\begin{bmatrix}1.05285714\\0.53428571\end{bmatrix}.$ $K_0C=\begin{bmatrix}0.95714286&0\\0.48571429&0\end{bmatrix},\quad I-K_0C=\begin{bmatrix}0.04285714&0\\-0.48571429&1\end{bmatrix},$ $P_0=(I-K_0C)P^-_0 =\begin{bmatrix}0.08614286&0.04371429\\0.04371429&0.54457143\end{bmatrix}.$

Output estimate. $\hat y_0=C\hat x_0+D u_0=1.05285714+0.40=1.45285714.$

Step $k=0$ — Interpretation.

What we computed: the very first prior/posterior with zero pre-input.
Why it works: the gain tilts toward position (large first component of $K_0$) because that’s what we measure.
Why $BQB^\top$ matters here: even with $u_{-1}=0$, actuator-path noise still inflates $P^-_0$.
Physical meaning: starting from rest, the first measurement pulls position and (to a lesser extent) velocity toward the observed track.

Step $k=1$ (use $u_0=2.0$)

Prediction. $A\hat x_0=\begin{bmatrix}1&1\\0&1\end{bmatrix}\!\begin{bmatrix}1.05285714\\0.53428571\end{bmatrix} =\begin{bmatrix}1.58714285\\0.53428571\end{bmatrix},\quad B u_0=\begin{bmatrix}1.0\\2.0\end{bmatrix},$ $\hat x^-_1=\begin{bmatrix}2.58714285\\2.53428571\end{bmatrix}.$ $A P_0=\begin{bmatrix}0.12985715&0.58828572\\0.04371429&0.54457143\end{bmatrix},\quad (A P_0)A^\top=\begin{bmatrix}0.71814287&0.58828572\\0.58828572&0.54457143\end{bmatrix},$ $P^-_1=\begin{bmatrix}0.72814286&0.60828572\\0.60828572&0.58457143\end{bmatrix}.$

Innovation, covariance, gain. $C\hat x^-_1 + D u_1 = 2.58714285 + 0 = 2.58714285,\quad \tilde r_1=1.60-2.58714285=-0.98714285,$ $C P^-_1=\begin{bmatrix}0.72814286&0.60828572\end{bmatrix},\quad S_1=0.72814286+0.09=0.81814286,$ $P^-_1 C^\top=\begin{bmatrix}0.72814286\\0.60828572\end{bmatrix},\quad K_1=\frac{1}{0.81814286}\begin{bmatrix}0.72814286\\0.60828572\end{bmatrix} =\begin{bmatrix}0.88999476\\0.74349572\end{bmatrix}.$

Update. $K_1\tilde r_1=\begin{bmatrix}-0.87855196\\-0.73393649\end{bmatrix},\quad \hat x_1=\begin{bmatrix}2.58714285\\2.53428571\end{bmatrix} +\begin{bmatrix}-0.87855196\\-0.73393649\end{bmatrix} =\begin{bmatrix}1.70859089\\1.80034922\end{bmatrix}.$ $K_1C=\begin{bmatrix}0.88999476&0\\0.74349572&0\end{bmatrix},\quad I-K_1C=\begin{bmatrix}0.11000524&0\\-0.74349572&1\end{bmatrix},$ $P_1=(I-K_1C)P^-_1 =\begin{bmatrix}0.08009953&0.06691461\\0.06691461&0.13231360\end{bmatrix}.$

Output estimate. $\hat y_1=C\hat x_1+D u_1=1.70859089+0=1.70859089.$

Step $k=1$ — Interpretation.

What we computed: a prior that includes actual actuation $u_0$ plus actuator-path uncertainty via $BQB^\top$.
Why the innovation is negative: the predicted output overshot; the filter pulls state back.
Why $D$ matters even when $u_1=0$: $D$ still shaped the previous innovation at $k=0$; forgetting $D$ anywhere breaks consistency.
Physical meaning: commanded acceleration at $k=0$ raised both position and velocity; the measurement at $k=1$ said “too high,” so we trim.

Step $k=2$ (use $u_1=0$)

Prediction. $A\hat x_1=\begin{bmatrix}1&1\\0&1\end{bmatrix}\!\begin{bmatrix}1.70859089\\1.80034922\end{bmatrix} =\begin{bmatrix}3.50894011\\1.80034922\end{bmatrix},\quad B u_1=\begin{bmatrix}0\\0\end{bmatrix},$ $\hat x^-_2=\begin{bmatrix}3.50894011\\1.80034922\end{bmatrix}.$ $A P_1=\begin{bmatrix}0.14701414&0.19922821\\0.06691461&0.13231360\end{bmatrix},\quad (A P_1)A^\top=\begin{bmatrix}0.34624235&0.19922821\\0.19922821&0.13231360\end{bmatrix},$ $P^-_2=\begin{bmatrix}0.35624236&0.21922821\\0.21922821&0.17231360\end{bmatrix}.$

Innovation, covariance, gain. $C\hat x^-_2 + D u_2 = 3.50894011 + 0.2\cdot 0.5 = 3.60894011,\quad \tilde r_2=4.00-3.60894011=0.39105989,$ $C P^-_2=\begin{bmatrix}0.35624236&0.21922821\end{bmatrix},\quad S_2=0.35624236+0.09=0.44624236,$ $P^-_2 C^\top=\begin{bmatrix}0.35624236\\0.21922821\end{bmatrix},\quad K_2=\frac{1}{0.44624236}\begin{bmatrix}0.35624236\\0.21922821\end{bmatrix} =\begin{bmatrix}0.79831588\\0.49127612\end{bmatrix}.$

Update. $K_2\tilde r_2=\begin{bmatrix}0.31218932\\0.19211839\end{bmatrix},\quad \hat x_2=\begin{bmatrix}3.50894011\\1.80034922\end{bmatrix} +\begin{bmatrix}0.31218932\\0.19211839\end{bmatrix} =\begin{bmatrix}3.82112943\\1.99246761\end{bmatrix}.$ $K_2C=\begin{bmatrix}0.79831588&0\\0.49127612&0\end{bmatrix},\quad I-K_2C=\begin{bmatrix}0.20168412&0\\-0.49127612&1\end{bmatrix},$ $P_2=(I-K_2C)P^-_2 =\begin{bmatrix}0.07184843&0.04421485\\0.04421485&0.06461201\end{bmatrix}.$

Output estimate. $\hat y_2=C\hat x_2+D u_2=3.82112943+0.10=3.92112943.$

Step $k=2$ — Interpretation.

What we computed: coast on dynamics (no input) plus measurement with nonzero $u_2$ feedthrough in the sensor path.
Why the innovation is positive: predicted output too low; the update nudges both position and velocity upward.
Why the covariances shrink: repeated measurements reduce uncertainty in the measured subspace.
Physical meaning: with no new thrust, position advances by carried velocity; the sensor’s small direct input term ($D u_2$) still tweaks what we expect to read.
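The covariance and gain recursions never touch $u_k$ or $y_k$, so the shrinkage can be checked without any data. A minimal sketch, with $C$ stored as a 1-D vector for brevity:

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.5, 1.0])
C = np.array([1.0, 0.0])
Q, R = 0.04, 0.09

P = np.eye(2)
traces, gains = [], []
for k in range(3):
    P_prior = A @ P @ A.T + Q * np.outer(B, B)   # noise enters through B
    S = C @ P_prior @ C + R                      # scalar innovation variance
    K = P_prior @ C / S                          # gain as a length-2 vector
    P = (np.eye(2) - np.outer(K, C)) @ P_prior
    traces.append(float(np.trace(P)))
    gains.append(K)

print(traces)
```

The traces of $P_0$, $P_1$, $P_2$ come out near 0.631, 0.212, and 0.136, matching the sums of the diagonal entries reported in the three steps.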

Final Answer

Ordered list (5 s.f.): $\boxed{[\,1.4529,\ 1.7086,\ 3.9211\,]}$
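The hand derivation can be cross-checked with a short loop. This is a sketch under the same indexing convention ($u_{-1}=0$; the prior uses $u_{k-1}$, the innovation uses $u_k$), with vectors stored as 1-D arrays for brevity.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.5, 1.0])
C = np.array([1.0, 0.0])
D, Q, R = 0.2, 0.04, 0.09
u = [2.0, 0.0, 0.5]
y = [1.50, 1.60, 4.00]

x, P, u_prev = np.zeros(2), np.eye(2), 0.0
y_hat = []
for u_k, y_k in zip(u, y):
    # Predict with the previous input; process noise goes through B.
    x = A @ x + B * u_prev
    P = A @ P @ A.T + Q * np.outer(B, B)
    # Innovation includes the feedthrough term D u_k.
    r = y_k - (C @ x + D * u_k)
    S = C @ P @ C + R
    K = P @ C / S
    # Update and record the output estimate.
    x = x + K * r
    P = (np.eye(2) - np.outer(K, C)) @ P
    y_hat.append(float(C @ x + D * u_k))
    u_prev = u_k

print([f"{v:.5g}" for v in y_hat])   # ['1.4529', '1.7086', '3.9211']
```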

Assumptions and When They Break

Linearity: dynamics are linear in $x$ and $u$; output is linear in $x$ and $u$ with direct feedthrough $D$.
Gaussianity: $w_k$ and $v_k$ are zero-mean, white, Gaussian, with covariances $Q$ and $R$, independent of each other and of the initial state.
Correct noise placement: process noise acts through $B$ (hence $BQB^\top$); measurement noise is added after the plant.
Indexing: the prior at step $k$ uses $u_{k-1}$; the innovation at $k$ uses $u_k$ and $y_k$.
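To see how much the noise-placement and feedthrough assumptions matter numerically, the following sketch reruns the worked example with each mistake switched on. The function name and flags are hypothetical. Here "dropping $D$" means omitting $D u_k$ from both the innovation and the output estimate, and "bare $Q$" means adding $Q\,I$ instead of $BQB^\top$.

```python
import numpy as np

def final_output(use_feedthrough=True, noise_through_B=True):
    """Return y_hat at k=2 for the worked example, optionally with a modelling error."""
    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    B = np.array([0.5, 1.0])
    C = np.array([1.0, 0.0])
    D, Q, R = 0.2, 0.04, 0.09
    u, y = [2.0, 0.0, 0.5], [1.50, 1.60, 4.00]

    Qx = Q * np.outer(B, B) if noise_through_B else Q * np.eye(2)
    x, P, u_prev = np.zeros(2), np.eye(2), 0.0
    y_hat = 0.0
    for u_k, y_k in zip(u, y):
        x = A @ x + B * u_prev
        P = A @ P @ A.T + Qx
        d = D * u_k if use_feedthrough else 0.0   # feedthrough term, or its omission
        S = C @ P @ C + R
        K = P @ C / S
        x = x + K * (y_k - (C @ x + d))
        P = (np.eye(2) - np.outer(K, C)) @ P
        y_hat = float(C @ x + d)
        u_prev = u_k
    return y_hat

correct = final_output()
no_D = final_output(use_feedthrough=False)
bare_Q = final_output(noise_through_B=False)
print(correct, no_D, bare_Q)
```

Both variants run without complaint and return plausible-looking numbers that differ from the correct $\hat y_2$ in the second decimal place. That is what makes these errors easy to miss.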

Violations (nonlinear dynamics, non-Gaussian noise, correlated $w_k$–$v_k$, time-varying or state-dependent noise) require modifications. There are basically no truly linear systems in the wild; this LTI setup is a controlled sandbox, not a worldview.

Extensions

Extended Kalman Filter (EKF)

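For nonlinear models $x_{k+1}=f(x_k,u_k)+G(x_k,u_k)\,w_k,\qquad y_k=h(x_k,u_k)+v_k,$ linearize each function about the most recent estimate of its argument, i.e. the dynamics about the previous posterior and the measurement about the current prior: $A_{k-1}=\left.\frac{\partial f}{\partial x}\right|_{\hat x_{k-1},u_{k-1}},\qquad C_k=\left.\frac{\partial h}{\partial x}\right|_{\hat x^-_k,u_k}.$ Propagate the mean through the nonlinear functions themselves ($\hat x^-_k=f(\hat x_{k-1},u_{k-1})$, predicted output $h(\hat x^-_k,u_k)$), use $A_{k-1}$ and $C_k$ in the covariance and gain recursions, and map $Q$ through the effective noise channel (e.g., $G_{k-1} Q G_{k-1}^\top$, with $G_{k-1}$ evaluated at the same point as $A_{k-1}$). This preserves the diagram-grounded logic: the covariance prediction must follow the actual noise path.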

Linearization techniques. Compute analytic Jacobians when possible; otherwise use automatic differentiation or finite differences with step-size control. For discrete-time models induced by continuous dynamics, linearize the discretized model or use first-order discretization of the continuous-time Jacobians.
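When an analytic Jacobian is unavailable, a central-difference approximation is one option. The sketch below is illustrative only: `numerical_jacobian`, its `rel_step` default, and the example dynamics `f` are all assumptions, and the step-size rule (scale with $|x_i|$, floor at 1) is a simple heuristic rather than a tuned choice.

```python
import numpy as np

def numerical_jacobian(func, x, rel_step=1e-6):
    """Central-difference Jacobian of func at x (func maps R^n to R^m)."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(func(x)).size
    J = np.zeros((m, x.size))
    for i in range(x.size):
        h = rel_step * max(1.0, abs(x[i]))   # scale the step with |x_i|
        dx = np.zeros_like(x)
        dx[i] = h
        f_plus = np.atleast_1d(func(x + dx))
        f_minus = np.atleast_1d(func(x - dx))
        J[:, i] = (f_plus - f_minus) / (2.0 * h)
    return J

# Hypothetical nonlinear dynamics, used only to exercise the routine.
def f(x):
    return np.array([x[0] + x[1], x[1] - 0.1 * np.sin(x[0])])

x_lin = np.array([0.3, -1.2])
J_num = numerical_jacobian(f, x_lin)
J_exact = np.array([[1.0, 1.0], [-0.1 * np.cos(x_lin[0]), 1.0]])
print(np.max(np.abs(J_num - J_exact)))
```

Comparing against an analytic Jacobian on a small test function, as done here, is a cheap sanity check before trusting the finite-difference version inside a filter.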

Unscented Kalman Filter (UKF)

Avoid Jacobians by propagating sigma points through $f(\cdot)$ and $h(\cdot)$, then recover mean and covariance by weighted recombination. Preserve direct feedthrough in the predicted measurement $\hat y^-_k$, either inside $h(\cdot)$ or explicitly via $D u_k$. For noise through $B$, treat the process noise via augmented sigma points or, when the channel is known and additive, by adding $BQB^\top$ after the transform.
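The sigma-point propagation can be sketched with the basic unscented transform. This uses the simple $\kappa$-weighted form, not the scaled variant with separate mean and covariance weights; the function name and the default `kappa=1.0` are assumptions. Feeding it the linear dynamics of the worked example is a useful check, because the transform should then reproduce $A\hat x + Bu$ and $APA^\top$ up to rounding.

```python
import numpy as np

def unscented_transform(func, mean, cov, kappa=1.0):
    """Propagate (mean, cov) through func using 2n+1 sigma points."""
    n = mean.size
    L = np.linalg.cholesky((n + kappa) * cov)     # lower-triangular square root
    # Sigma points: the mean, then the mean plus/minus each column of L.
    points = np.vstack([mean, mean + L.T, mean - L.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    images = np.array([func(p) for p in points])
    m_out = weights @ images                      # weighted mean
    dev = images - m_out
    P_out = (weights[:, None] * dev).T @ dev      # weighted covariance
    return m_out, P_out

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.5, 1.0])
Q = 0.04
m = np.array([1.05285714, 0.53428571])            # posterior from step k=0
P = np.array([[0.08614286, 0.04371429],
              [0.04371429, 0.54457143]])

m_pred, P_ut = unscented_transform(lambda p: A @ p + B * 2.0, m, P)
P_pred = P_ut + Q * np.outer(B, B)                # additive noise through B
print(m_pred, P_pred)
```

Adding $BQB^\top$ after the transform is the "known additive channel" route. If the noise entered nonlinearly, the state would instead be augmented with $w_k$ before the sigma points are generated.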

Practical Variants

Time-varying $R_k$ (gating, heteroscedastic sensors): choose $R_k$ from context available before the update (e.g., based on $\hat y^-_k$); update $S_k$ and $K_k$ accordingly.
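Correlated process and measurement noise: if the process noise that drives the state into step $k$ is correlated with the measurement noise at step $k$, $\mathrm{cov}[w_{k-1},v_k]=N\neq 0$, use $K_k=(P^-_k C^\top + B N)\,S_k^{-1}$ with $S_k=C P^-_k C^\top + R + CBN + (CBN)^\top$; the covariance update then becomes $P_k=P^-_k-K_kS_kK_k^\top$ rather than $(I-K_kC)P^-_k$.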
Model mismatch: use an inflated $Q$ or adapt $Q,R$ via innovation statistics; avoid silently dropping $D$.
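The correlated-noise expressions in the list above translate directly into code. The sketch below simply transcribes them; the function name is hypothetical, $N$ is passed as a 1×1 array, and the value `0.01` is an arbitrary illustration. Setting $N=0$ should recover the standard gain, which gives a simple consistency check.

```python
import numpy as np

def correlated_gain(P_prior, C, B, R, N):
    """Gain and innovation covariance when process and measurement noise correlate."""
    M = B @ N                                   # cross-covariance mapped into state space
    S = C @ P_prior @ C.T + R + C @ M + (C @ M).T
    K = (P_prior @ C.T + M) @ np.linalg.inv(S)
    return K, S

B = np.array([[0.5], [1.0]])
C = np.array([[1.0, 0.0]])
R = np.array([[0.09]])
P_prior = np.array([[2.01, 1.02], [1.02, 1.04]])   # prior covariance from step k=0

K_std, S_std = correlated_gain(P_prior, C, B, R, np.array([[0.0]]))
K_cor, S_cor = correlated_gain(P_prior, C, B, R, np.array([[0.01]]))  # illustrative N
print(K_std.ravel(), K_cor.ravel())
```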

Fixed-Interval Smoothing (Rauch–Tung–Striebel)

When the full measurement sequence $y_{0:T}$ is available, the RTS smoother improves state estimates by a backward pass:

Forward pass (standard KF): compute $\hat x_k, P_k$ for $k=0,\dots,T$.

Backward pass (initialize with $\hat x_T^s=\hat x_T$, $P_T^s=P_T$, then for $k=T-1,\dots,0$): $J_k=P_k A^\top (P_{k+1}^-)^{-1},\qquad \hat x_k^s=\hat x_k + J_k(\hat x_{k+1}^s - \hat x_{k+1}^-),$ $P_k^s=P_k + J_k\,(P_{k+1}^s - P_{k+1}^-)\,J_k^\top.$ The forward pass must therefore also store the priors $\hat x^-_k, P^-_k$. The smoother leverages future information to reduce estimation variance; it preserves the same noise-placement logic used in the forward filter (e.g., $BQB^\top$ in $P_{k+1}^-$).
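The two passes can be sketched on the worked example's three samples. This is a minimal illustration under the same indexing convention as the forward filter, with vectors stored as 1-D arrays; the list names are arbitrary.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.5, 1.0])
C = np.array([1.0, 0.0])
D, Q, R = 0.2, 0.04, 0.09
u, y = [2.0, 0.0, 0.5], [1.50, 1.60, 4.00]

# Forward pass: store priors and posteriors for every step.
x, P, u_prev = np.zeros(2), np.eye(2), 0.0
x_prior, P_prior, x_filt, P_filt = [], [], [], []
for u_k, y_k in zip(u, y):
    xm = A @ x + B * u_prev
    Pm = A @ P @ A.T + Q * np.outer(B, B)
    S = C @ Pm @ C + R
    K = Pm @ C / S
    x = xm + K * (y_k - (C @ xm + D * u_k))
    P = (np.eye(2) - np.outer(K, C)) @ Pm
    x_prior.append(xm); P_prior.append(Pm)
    x_filt.append(x); P_filt.append(P)
    u_prev = u_k

# Backward pass: the last smoothed estimate equals the last filtered one.
x_s, P_s = list(x_filt), list(P_filt)
for k in range(len(y) - 2, -1, -1):
    J = P_filt[k] @ A.T @ np.linalg.inv(P_prior[k + 1])
    x_s[k] = x_filt[k] + J @ (x_s[k + 1] - x_prior[k + 1])
    P_s[k] = P_filt[k] + J @ (P_s[k + 1] - P_prior[k + 1]) @ J.T

for k in range(len(y)):
    print(k, x_filt[k], x_s[k], np.trace(P_filt[k]), np.trace(P_s[k]))
```

The smoothed covariance traces should never exceed the filtered ones, which is a convenient sanity check on any RTS implementation.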

Summary

This example demonstrates a diagram-grounded Kalman filter derivation in which noise pathways and direct feedthrough are enforced by design. The explicit steps show (i) why the innovation must subtract $D u_k$, (ii) why the prediction covariance must use $BQB^\top$, and (iii) how indexing aligns priors and measurements. The same discipline extends to EKF/UKF and to RTS smoothing, where accurate handling of noise placement, linearization, and indexing remains essential for reliable state estimation—useful as a hand-derivable tutorial and a compact stress test for automated solvers.

A worked LTI example, explicit derivation, and practical extensions