Unsupervised Learning for Software Defect Prediction

KC1 (NASA), Method Motivations, Effort–Difficulty Maps, and PHM Transfer

Posted by Christopher O’Hara on February 12, 2024

Modern engineering often begins with lots of data and few trustworthy labels. The NASA KC1 software dataset is a classic example: many module metrics (Halstead, complexity, coupling), sparse/uncertain defect labels, and strong class imbalance. In such settings, unsupervised learning isn’t a convenience—it’s a design requirement: we need to discover structure, measure difficulty/risk, and prioritize work before labels are reliable.

Machine learning, especially in unsupervised settings, rarely gives us immediate satisfaction. Many outputs look messy or unhelpful at first. But each one carries meaning, and together they build the case for a final actionable result. Below, I reflect on the KC1 images and explain why frustration is part of the process—and why trusting the stepwise theory matters.

This post explains why each method was chosen, what it reveals that others do not, how the results translate into a practical effort–difficulty map for prioritizing code review and testing, and where these ideas transfer to PHM (prognostics & health management) and anomaly detection in physical systems.

👉 Full code and experiments: KAS NASA KC1 Notebook

Industry Note:

In real projects, unsupervised learning rarely gives neat answers on the first try. Each method here—PCA, clustering, anomaly detection, SOM—looks imperfect alone. But taken together, they converge on something actionable: a map of where to spend engineering time for the best return. The value is not in a single algorithm but in building a systematic pipeline that transforms messy data into a triage dashboard for decision-making.

This post is not about presenting a polished “best model.” It is a worked example of the thought process when applying unsupervised learning to a noisy, imbalanced dataset. In practice, we have to try methods that may at first look unhelpful or frustrating—PCA blobs, overlapping clusters, over-flagged anomalies. But each step carries information if we know how to interpret it. By trusting the process piece by piece, we still arrive at meaningful results: in this case, an effort–difficulty map that prioritizes testing and review. The lesson is that machine learning for engineering is less about chasing a perfect algorithm and more about reasoning systematically through what each method tells us along the way.


The KC1 Dataset

The KC1 dataset comes from NASA’s Metrics Data Program (MDP), a collection of real-world software defect datasets.

  • Source: A NASA software project (a C++ storage-management system for receiving and processing ground data).
  • Content: Each row corresponds to a software module, with static code metrics such as lines of code, cyclomatic complexity, Halstead effort/volume, coupling, and branching factors.
  • Labels: Modules are marked as defective or non-defective, though these labels are known to be noisy and imbalanced (only about 15–20% are defective).
  • Challenge: Sparse and unreliable defect labels, high class imbalance, and heterogeneous metrics make it a benchmark problem for unsupervised or semi-supervised learning.

The KC1 dataset is widely used in software engineering research to test methods for defect prediction and has a natural analogy to fault detection in engineered systems.

👉 Direct access: NASA KC1 Dataset (MDP Repository)

Problem Framing: Why Unsupervised First?

  • Reality: Defect labels are expensive, noisy, and imbalanced (few defective modules among many clean ones).
  • Goal: Triage modules by difficulty/risk and allocate limited engineering effort (review, testing, refactoring) for the best ROI.
  • Principle: Start with structure discovery (manifolds, clusters, outliers) → then risk scoring → finally prioritization using an effort–difficulty lens.

We use Halstead effort ($e$) and volume ($v$) as effort proxies, while difficulty comes from unsupervised risk signals (cluster separation, anomaly scores, reconstruction error, SOM activations).


Data Preprocessing (Motivation)

Why normalize/denoise first?
KC1 metrics are heterogeneous (orders of magnitude apart); unscaled distances distort neighborhood and density estimates, which breaks clustering and anomaly detection. PCA/t-SNE are only meaningful when features are on comparable scales.

What we do

  • Standardize features; winsorize extreme tails if necessary.
  • Handle missing values; remove trivially collinear metrics.
  • Preserve $e$ and $v$ for later effort mapping.
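
The steps above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the notebook's code: the function name `preprocess`, the percentile limits, and the 0.95 correlation cutoff are all assumptions chosen for the example. In a real run, the raw $e$ and $v$ columns would be copied aside before this step so they remain available for the effort map.

```python
import numpy as np

def preprocess(X, lower_pct=1.0, upper_pct=99.0, corr_limit=0.95):
    """Impute, winsorize, drop near-duplicate columns, then standardize.

    All thresholds are illustrative assumptions, not values from the notebook.
    Returns the scaled matrix and the indices of the columns that were kept.
    """
    X = np.asarray(X, dtype=float).copy()

    # Median imputation per column (NaN marks a missing value).
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]

    # Winsorize: clip each column to its own percentile range.
    lo = np.percentile(X, lower_pct, axis=0)
    hi = np.percentile(X, upper_pct, axis=0)
    X = np.clip(X, lo, hi)

    # Greedily drop columns almost perfectly correlated with a kept column.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < corr_limit for k in keep):
            keep.append(j)
    X = X[:, keep]

    # Standardize to zero mean and unit variance (constant columns stay at 0).
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std, keep

# Synthetic stand-in for heavy-tailed code metrics.
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(200, 4))
raw = np.column_stack([raw, 2.0 * raw[:, 0]])  # a trivially collinear metric
raw[3, 1] = np.nan                             # one missing value
X_scaled, kept = preprocess(raw)
```

The collinear fifth column is dropped, and the remaining four end up on a common scale, which is what the distance-based methods later in the post rely on.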

PCA — Linear Structure, Denoising, and Interpretable Axes

Motivation.
If variation is approximately linear, PCA gives the least-squares optimal low-dimensional basis. It reduces noise, mitigates collinearity, and yields axes engineers understand (e.g., size/volume vs complexity/coupling). It is the baseline manifold for everything that follows.

What PCA reveals that others don’t.

  • Global variance directions (not just local neighborhoods).
  • How much of the dataset geometry is explainable linearly.
  • A physically interpretable 2D plane to overlay effort/volume.

Limitations.
Misses curved manifolds; clusters can appear merged.
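
To make "how much of the geometry is explainable linearly" concrete, here is a small sketch of PCA through the SVD, run on synthetic data with two hidden linear factors. The helper `pca_svd` and the toy data are assumptions for illustration; on KC1 the input would be the standardized metric matrix.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA through the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance share per component
    scores = Xc @ Vt[:n_components].T        # coordinates in the PCA plane
    return scores, Vt[:n_components], explained

# Toy data: six observed features driven by two latent linear factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

scores, components, explained = pca_svd(X)
linear_share = explained[:2].sum()           # fraction captured by the 2D plane
```

A `linear_share` close to 1 says the 2D plane is a faithful summary; a low value is the numeric counterpart of the "blob" in the figure and a hint that nonlinear methods are worth the effort.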

Figure — PCA
KC1 PCA

At first glance, this figure is frustrating. The clusters look arbitrary—one large dense blob and some scattered points far away. It doesn’t scream “defects” or “effort–difficulty.” But if we stop here, we’d miss the point: clustering in raw PCA space mostly tells us that linear variance alone doesn’t cleanly separate risk.

Interpretation: There are different regimes of code modules, but their boundaries aren’t sharply aligned with defects. It’s not satisfying, but it shows us why we need richer models.


t-SNE — Local Neighborhoods and Nonlinear Manifolds

Motivation.
When defect patterns are nonlinear or locally tight, t-SNE preserves local neighborhoods better than PCA. It’s a visual microscope for seeing pockets of similar modules that may correspond to high-risk code styles.

What t-SNE reveals.

  • Fine-grained clumps that PCA smears.
  • Candidate cluster counts and unusual neighborhoods for inspection.

Cautions.
Global distances are not meaningful; tune perplexity; fix random seeds for reproducibility; use as exploration, not a classifier.
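
The cautions above translate into a short perplexity sweep with a fixed seed. This sketch assumes scikit-learn is available and uses synthetic blobs in place of KC1; the perplexity values are arbitrary examples, not tuned settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in: three tight groups in a 5-dimensional metric space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 5)) for c in (0.0, 4.0, 8.0)])

embeddings = {}
for perplexity in (10, 30):
    # A fixed random_state and PCA initialization make each run repeatable.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
```

Clumps that persist across perplexities are worth inspecting; clumps that appear at only one setting are more likely artifacts of the embedding.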

Clustering (k-Means / GMM / Hierarchical) — From Structure to Groups

Motivation.
Once we glimpse structure, we need groups to (1) summarize modules, (2) compare group-level risk, and (3) route different test strategies.

  • k-Means (fast, centroid-based): good when clusters are roughly spherical in the feature space you use (PCA plane or standardized metrics).
  • GMM (soft probabilities): when clusters overlap; gives a defect-likelihood proxy via membership probabilities.
  • Hierarchical: exposes multi-scale structure; the dendrogram helps choose $k$ from the data rather than arbitrarily.

What clustering reveals.

  • Stable mode families of modules (coding styles, complexity regimes).
  • Border cases with ambiguous membership → hardest to classify ⇒ good candidates for review.

Limitations.
Cluster shapes and scales matter; verify with silhouette/BIC; don’t assume one “true” $k$.
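
One way to put this into practice is to sweep $k$, record silhouette and BIC, and then read border cases off the GMM membership probabilities. The sketch below assumes scikit-learn and uses overlapping synthetic blobs; the `ambiguity` score (one minus the largest membership probability) is an illustrative choice, not a definition taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: three overlapping groups in a 2D (e.g. PCA) plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 2)) for c in (0.0, 3.0, 6.0)])

results = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    results[k] = {"silhouette": silhouette_score(X, labels), "bic": gmm.bic(X)}

# Lower BIC is better; treat the winner as a candidate, not the "true" k.
best_k = min(results, key=lambda k: results[k]["bic"])
gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)

membership = gmm.predict_proba(X)            # soft assignment per module
ambiguity = 1.0 - membership.max(axis=1)     # high = fuzzy membership
border_cases = np.argsort(ambiguity)[::-1][:10]
```

The indices in `border_cases` are the modules whose membership is least certain, which is the review queue the bullet on border cases describes.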

Figure — Clusters
KC1 Clusters

Here the picture looks noisy too: lots of overlapping clusters, spaghetti-like boundaries, and no obvious “risk frontier.” Frustrating? Yes. But the result is still interpretable: t-SNE is showing local neighborhoods of similar modules, and GMM probabilities are telling us that membership is fuzzy.

Interpretation: This confirms that defects aren’t globally distinct—they live inside otherwise normal neighborhoods. That’s valuable insight, even if the visual doesn’t “solve” the problem.


Isolation Forest — Density-Independent Outliers

Motivation.
Clustering finds centers; we also need to find rare events that don’t belong anywhere. Isolation Forest isolates points by random splits; anomalies need fewer splits.

What IF reveals.

  • Local anomalies within otherwise “safe” clusters.
  • Outliers in mixed metrics where distance is unreliable.

Limitations.
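Sensitive to the contamination rate and to nonlinear feature transforms such as log scaling (per-feature linear rescaling leaves its axis-aligned splits unchanged); scores are relative, not absolute probabilities.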
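
A short sketch shows how the score and the flag relate. It assumes scikit-learn; the data is synthetic with ten planted outliers, and the contamination value of 0.05 is an arbitrary example rather than a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in: 300 ordinary rows plus 10 planted outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(300, 4))
planted = rng.normal(loc=6.0, size=(10, 4))
X = np.vstack([inliers, planted])

forest = IsolationForest(n_estimators=200, contamination=0.05, random_state=0).fit(X)

# score_samples is higher for normal points, so negate it: higher = more anomalous.
anomaly_score = -forest.score_samples(X)

# predict() applies the contamination threshold: -1 marks a flagged row.
flagged = forest.predict(X) == -1
```

The share of rows in `flagged` follows the `contamination` setting, not any natural gap in the data, so the continuous `anomaly_score` is usually the better input for a later difficulty score.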

Figure — Isolation Forest
KC1 Isolation Forest

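This output can also feel unrewarding: almost everything looks anomalous. Why bother? The meaning here is that an Isolation Forest scores each point by how few random splits it takes to isolate it, and the share it flags is set by the contamination threshold rather than by any clean gap in the data. On heavy-tailed, skewed metrics like KC1’s, many modules sit in sparse regions and isolate quickly, so many of them get marked as “outliers.”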

Interpretation: The method is telling us the dataset is extremely imbalanced, and defect-like behavior is scattered throughout. Again, not a neat separation, but it teaches us that simple anomaly scores will over-flag without additional structure.


Autoencoder — Nonlinear Reconstruction Error as Difficulty

Motivation.
When “normal” code lies on a curved manifold, the right notion of difficulty is how far a module is from what the network can reconstruct. Autoencoders supply a nonlinear anomaly score that complements IF.

What AE reveals.

  • Modules that are manifold-distant, even if not density-sparse.
  • Interactions among metrics (e.g., unusual coupling/size combos) invisible to linear models.

Risk & mitigation.
If trained naively on mixed data, the AE can learn to reconstruct anomalies. Use early stopping, robust loss, or bias training to likely-normal regions.
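
As a rough illustration of reconstruction error as a score, the sketch below uses scikit-learn's `MLPRegressor` as a small bottleneck autoencoder by fitting the input to itself. That is a stand-in chosen to keep the example short, not the network used in the notebook; the layer sizes and the synthetic curved manifold are assumptions. Training only on the first 300 rows mimics "bias training to likely-normal regions"; with real data that subset would have to be picked by some heuristic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in: "normal" rows lie near a 1D curve in 3D, "odd" rows do not.
rng = np.random.default_rng(0)
t = rng.uniform(-2.0, 2.0, size=300)
normal = np.column_stack([t, t**2, np.sin(t)]) + 0.05 * rng.normal(size=(300, 3))
odd = rng.uniform(-2.0, 2.0, size=(10, 3)) + np.array([0.0, -4.0, 3.0])
X = np.vstack([normal, odd])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Bottleneck network trained to reproduce its own input (assumed architecture).
autoencoder = MLPRegressor(
    hidden_layer_sizes=(16, 2, 16),
    activation="tanh",
    early_stopping=True,   # holds out part of the training rows for validation
    max_iter=2000,
    random_state=0,
)
autoencoder.fit(X[:300], X[:300])   # train on the likely-normal rows only

# Per-row mean squared reconstruction error: higher = further from the manifold.
recon_error = np.mean((autoencoder.predict(X) - X) ** 2, axis=1)
```

Rows with the largest `recon_error` are the ones the network cannot reproduce from what it learned as normal, which is the difficulty signal this section describes.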

Figure — Autoencoder
KC1 Autoencoder

The autoencoder output also feels disappointing at first: red and black points scattered without a crisp defect boundary. Where’s the signal? But what it’s really showing us is the nonlinear reconstruction difficulty: which modules deviate most from what the network considers “normal.” This is fair, because Professor [X] said AEs are good for every task! (small joke)

Interpretation: Even if it doesn’t form neat groups, it adds another dimension of difficulty. Modules with consistently high reconstruction error are telling us, “these combinations of metrics are unusual and fragile.” It’s not a final answer, but it adds another layer of evidence before we move to SOM.

Self-Organizing Map (SOM) — Topology + Operator Interpretability

Motivation.
Engineers need dashboards where similar modules land in nearby cells. SOMs preserve topology on a 2D grid and let us overlay task-specific fields (effort, difficulty, delta). They are ideal for turning ML outputs into actionable maps.

What SOM reveals.

  • Regions of consistent coding style or complexity regime.
  • “Hot” cells where many difficult modules accumulate.
  • A canvas to combine metrics and model scores.

Limitations.
Grid size and initialization matter; treat as calibrated visualization rather than a classifier.
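
For readers who have not used a SOM, the core training loop is short enough to write out. This is a minimal NumPy sketch with a Gaussian neighborhood and linearly decaying learning rate; the function names, grid size, and decay schedule are assumptions for illustration, and a dedicated SOM library would normally replace it.

```python
import numpy as np

def train_som(X, rows=6, cols=6, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    # Initialize each cell's weight vector from a randomly chosen data row.
    weights = X[rng.integers(0, len(X), size=rows * cols)].astype(float)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for step in range(n_iter):
        frac = step / n_iter
        lr = lr0 * (1.0 - frac)                   # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5       # neighborhood shrinks to 0.5
        x = X[rng.integers(0, len(X))]
        bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
        grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma**2))
        # Pull the best-matching unit and its grid neighbors toward the sample.
        weights += lr * influence[:, None] * (x - weights)
    return weights

def best_matching_units(X, weights):
    """Flat index of the closest SOM cell for every row of X."""
    d2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 3)) for c in (-2.0, 0.0, 2.0)])
weights = train_som(X)
cells = best_matching_units(X, weights)           # one cell per module
cell_rows, cell_cols = np.divmod(cells, 6)        # grid coordinates on the 6x6 map
```

Once every module has a cell, any per-module quantity (effort, difficulty, an anomaly score) can be averaged per cell and drawn as a heatmap on the same grid.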

Figure — SOM overview
KC1 SOM


Results: Effort–Difficulty Mapping (The “So What”)

Motivation.
Resources are limited. We must prioritize modules where improving quality is cheapest for the largest risk reduction. We operationalize:

  • Effort ($E$): engineering cost proxy (Halstead $e$ or $v$, plus complexity/coupling as needed).
  • Difficulty ($D$): unsupervised risk score (e.g., max of normalized IF score, AE reconstruction error, GMM “low-likelihood”, cluster border proximity, SOM activation).

A simple, tunable priority score is
$S = \lambda \, D + (1-\lambda)\, \tilde{E}$, where $\lambda \in [0, 1]$ sets the trade-off and $\tilde{E}$ is a scaled effort term. We then plot $S$ on the SOM to get an operator-friendly triage map.
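
A small worked example of the score, using the formula exactly as written. The min-max normalization, the `log1p` compression of effort, and the four made-up modules are assumptions for illustration. Whether $\tilde{E}$ should rise or fall with cost is a scaling choice left open here; passing `1 - E_tilde` instead would penalize expensive modules.

```python
import numpy as np

def minmax(x):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def priority_score(effort, risk_signals, lam=0.7):
    """S = lam * D + (1 - lam) * E_tilde, with D the max of normalized signals."""
    D = np.max(np.vstack([minmax(s) for s in risk_signals]), axis=0)
    E_tilde = minmax(np.log1p(effort))   # log compression is an assumption
    return lam * D + (1.0 - lam) * E_tilde, D, E_tilde

# Four hypothetical modules: Halstead effort and two unsupervised risk signals.
effort = np.array([10.0, 1000.0, 50.0, 5000.0])
if_score = np.array([0.40, 0.45, 0.70, 0.42])
ae_error = np.array([0.01, 0.02, 0.30, 0.05])

S, D, E_tilde = priority_score(effort, [if_score, ae_error], lam=0.7)
ranking = np.argsort(S)[::-1]            # module indices, highest priority first
```

With $\lambda = 0.7$ the third module (index 2) ranks first: it has the highest difficulty on both signals, and that outweighs its modest effort term.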

Figure — SOM effort–difficulty

  • Difference map (volume - difficulty): red = over-engineered/low risk (de-prioritize), blue = low-effort/high-difficulty (fast wins), mid-orange = moderate-effort/high-difficulty (best ROI):
    KC1 SOM Diff

This is where things start to click. The SOM heatmaps finally tie the effort proxy (volume) to the difficulty proxy (defect likelihood) in a way that produces structure we can act on.

  • Some cells show high volume but low difficulty → large but stable code.
  • Others show low volume but high difficulty → small, fragile modules.
  • The difference map crystallizes the intuition: prioritize the moderate-effort/high-difficulty zones for best ROI.

Interpretation: The SOM stage converts abstract metrics into an actionable effort–difficulty triage map.
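
The difference map itself is just a per-cell average of two quantities followed by a subtraction. Here is a minimal sketch on a hypothetical 2×2 grid, assuming each module already has a SOM cell index and that volume and difficulty have both been scaled to [0, 1]; the values are invented for the example.

```python
import numpy as np

def cell_means(values, cells, n_cells):
    """Average a per-module quantity over the modules mapped to each SOM cell."""
    sums = np.bincount(cells, weights=values, minlength=n_cells)
    counts = np.bincount(cells, minlength=n_cells)
    means = np.full(n_cells, np.nan)             # empty cells stay NaN
    np.divide(sums, counts, out=means, where=counts > 0)
    return means

rows, cols = 2, 2
cells = np.array([0, 0, 1, 3, 3, 3])                   # hypothetical cell per module
volume = np.array([0.9, 0.7, 0.2, 0.5, 0.6, 0.4])      # scaled effort proxy
difficulty = np.array([0.1, 0.3, 0.8, 0.9, 0.7, 0.8])  # scaled risk score

diff_map = (
    cell_means(volume, cells, rows * cols)
    - cell_means(difficulty, cells, rows * cols)
).reshape(rows, cols)
```

Positive cells are large but stable code, negative cells are small but fragile, and empty cells are left as NaN so they can be masked in the plot.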

Takeaways (KC1).

  1. Many high-effort / low-difficulty modules are false alarms—large but stable.
  2. A nontrivial pocket of low-effort / high-difficulty modules yields quick wins.
  3. The moderate-effort / high-difficulty band is the sweet spot for test/refactor investment.

This is the operational output of the project: a map that tells teams where to spend the next hour.


How Methods Fit Together (Motivated Pipeline)

  1. PCA → robust, interpretable base space; remove noise/collinearity.
  2. t-SNE → inspect nonlinear local structure; choose candidate $k$.
  3. Clustering (k-Means/GMM/Hier.) → define groups & border cases; get soft difficulty via GMM likelihood.
  4. Isolation Forest → catch easily isolated outliers that clustering misses.
  5. Autoencoder → capture nonlinear reconstruction difficulty.
  6. SOM → fuse $E$ & $D$ into actionable 2D maps for triage.

Each step is motivated by a capability gap left by the previous one.

Results Journey – Trusting the Process

The journey across these images is a reminder: in unsupervised ML, each method reveals one lens, often messy, rarely decisive.

  • PCA + k-Means → showed us variance isn’t enough.
  • t-SNE + GMM → confirmed that defects live inside otherwise normal neighborhoods.
  • Isolation Forest → exposed the imbalance problem.
  • Autoencoder → added nonlinear reconstruction difficulty as another layer of evidence.
  • SOM → finally delivered a clear prioritization map.

If we judged PCA or IF results alone, we’d abandon the project in frustration. But step by step, layering clustering, anomaly detection, and topology-preserving maps, the landscape sharpened.

Lesson: Not every figure must be satisfying, but we must understand what each result means—even if it isn’t what we hoped for. Only then do we arrive at the actionable insight: the effort–difficulty map.


PHM / Fault Prediction Transfer (Why It Generalizes)

The broader systems design principles (see my FSMs and Logic Controllers post) apply here too:

  • Separation of concerns: preprocessing (feature extraction) vs clustering (state discovery)
  • Feedback and history: anomaly detection improves when tracking data trajectories, not just snapshots
  • Encoding and compositionality: clusters and anomaly scores become system codes fed to schedulers, watchdogs, and diagnostics

Replace “module” with component/sensor/subsystem; replace $e,v$ with maintenance effort proxies (downtime, access cost, swap price). The same methods deliver:

  • PCA/t-SNE: healthy manifold vs drifting states.
  • Clustering: operating modes and degradation stages.
  • IF/AE: rare precursors and nonlinear failure signatures.
  • SOM: control-room maps of where the plant/aircraft/robot is trending.
  • Effort–Difficulty: plan inspections by risk × cost, not size or tradition.

Why This Matters

Software defect prediction is structurally similar to fault detection in engineered systems:

  • Software modules ↔ mechanical subsystems, sensors, or valves
  • Defects ↔ faults, anomalies, or degradations
  • Code metrics ↔ sensor features, vibration signals, or process logs

Effort–difficulty framing unifies these domains under a single principle:
👉 Not all anomalies are equal. Prioritize based on both cost and likelihood.

Practical Notes and Caveats

  • Validation without labels: use stability checks (seed/perplexity sweeps), cluster separation (silhouette), density diagnostics, and post-hoc audits on small labeled subsets.
  • Thresholds: treat anomaly thresholds as operational dials (precision/recall trade-offs differ for safety-critical vs exploratory review).
  • Reproducibility: fix seeds, log preprocessing; for t-SNE/AE, publish configs.
  • Ethos: unsupervised is decision support, not oracle—pair with code review and tests.
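
The seed-sweep stability check can be sketched in a few lines. This example assumes scikit-learn and synthetic data; it reruns k-Means under different seeds and compares the labelings pairwise with the adjusted Rand index, which ignores how the clusters happen to be numbered.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three reasonably separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(80, 3)) for c in (0.0, 3.0, 6.0)])

# One labeling per seed; n_init=1 so each seed really is a different start.
labelings = [
    KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    for seed in range(5)
]

# Pairwise agreement between runs: 1.0 means identical partitions.
ari = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
stability = float(np.mean(ari))
```

A `stability` near 1 means the grouping does not depend on the seed; a low value is a warning that the clusters, and any triage built on them, may be artifacts of initialization.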

Closing Thoughts

Unsupervised learning on KC1 highlights a pathway for fault detection and PHM in broader engineering contexts:

  • Software engineering → catching risky modules before deployment
  • Aerospace → identifying precursors to subsystem failures
  • Industry → clustering anomalies for predictive maintenance
  • Robotics → embedding safety/awareness directly into navigation and control loops

The lesson: unsupervised structures exist everywhere—by uncovering them, we enable systems to adapt, anticipate, and manage risk even when explicit guidance is missing.


Conclusion

KC1 shows that unsupervised learning is a systems tool, not just a visualization trick. By motivating each method for what it uniquely contributes and fusing them on an interpretable SOM effort–difficulty map, we move from “interesting plots” to actionable triage. The same blueprint scales to PHM and anomaly detection for physical systems where labels are scarce and stakes are high.

👉 Notebook: KAS NASA KC1