Data Imputation Strategies: Utilizing Expectation-Maximization (EM) for Handling Missing Not At Random (MNAR) Data

Introduction
Missing data is one of the most common reasons analytics projects go off track. A model that performs well in development can fail in production simply because real-world records arrive incomplete. Many teams treat missing values as a cleaning task: fill with a mean, drop rows, or add a “missing” flag. That can work when data is missing completely at random, but it becomes risky when the missingness is systematic. Missing Not At Random (MNAR) means the probability of a value being missing depends on the value itself or on an unobserved factor. For example, high-income customers might avoid sharing salary, or patients with severe symptoms might be less likely to respond to follow-ups. In such cases, simple imputation can distort distributions and bias outcomes. For learners in a data scientist course, understanding EM-based imputation is useful because it brings statistical rigour to a very practical problem.
Understanding MNAR and Why It Is Hard
To choose the right strategy, it helps to distinguish the three standard missingness mechanisms:
- MCAR (Missing Completely At Random): missingness is unrelated to observed or unobserved data.
- MAR (Missing At Random): missingness depends only on observed variables (e.g., age, region); once those are accounted for, it is unrelated to the missing values themselves.
- MNAR (Missing Not At Random): missingness depends on the missing value itself or hidden drivers.
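The three mechanisms above can be made concrete with a small simulation. The sketch below is illustrative only: the variables (age, income), the coefficients, and the logistic missingness probabilities are all made-up assumptions, not taken from any real dataset. It compares two summaries computed from the observed rows alone: the raw mean of income, and the intercept of a regression of income on age.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: age is always observed; income may go missing.
age = rng.normal(40, 10, n)
income = 30_000 + 1_000 * (age - 40) + rng.normal(0, 8_000, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed probability that income is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.5),                     # constant for everyone
    "MAR": sigmoid((age - 40) / 10),             # depends on observed age only
    "MNAR": sigmoid((income - 30_000) / 8_000),  # depends on income itself
}

observed_mean = {}
intercept_at_40 = {}
for name, p in p_missing.items():
    seen = rng.random(n) >= p  # True where income is observed
    observed_mean[name] = income[seen].mean()
    # Regress income on centred age using the observed rows only.
    slope, intercept = np.polyfit(age[seen] - 40, income[seen], 1)
    intercept_at_40[name] = intercept
    print(f"{name}: observed mean = {observed_mean[name]:,.0f}, "
          f"predicted income at age 40 = {intercept:,.0f}")

print(f"Full-data mean = {income.mean():,.0f}; true income at age 40 = 30,000")
```

In this setup the raw observed mean is off under both MAR and MNAR, but conditioning on age repairs the MAR case, while the MNAR case stays biased even after conditioning. That residual bias is the part no observed variable can correct.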
MNAR is challenging because you cannot fully diagnose it from observed data alone. You can suspect it based on domain knowledge and patterns (e.g., a form field left blank more often at high values), but the missing values are precisely what you do not see. That is why MNAR requires modelling assumptions. EM (Expectation-Maximization) becomes helpful because it estimates parameters of a data-generating process while accounting for incomplete observations, rather than forcing a simplistic fill-in.
What Expectation-Maximization (EM) Does for Imputation
EM is an iterative algorithm used to find maximum likelihood estimates when data is incomplete or involves latent variables. In the context of missing values, EM treats missing entries as latent variables and alternates between:
- E-step (Expectation): Using the observed data and the current parameter estimates, compute the expected values of the quantities that involve missing entries (in a Gaussian model, their conditional means and second moments).
- M-step (Maximization): Update model parameters to maximise the likelihood using the observed data plus the expected missing values from the E-step.
This repeats until the parameters stabilise. The outcome is not just a single imputed dataset; it is a fitted probabilistic model from which imputations can be drawn. In many practical workflows, EM is used to produce a completed dataset (or multiple completed datasets) while retaining relationships between variables more faithfully than basic methods.
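The E-step and M-step can be sketched for one common case: a multivariate normal model with NaN-coded gaps. This is a minimal illustration, not a production implementation. The function name `em_mvn` and the simulated data are hypothetical, and the sketch assumes ignorable (MAR) missingness, so on its own it does not address MNAR.

```python
import numpy as np

def em_mvn(X, n_iter=200, tol=1e-6):
    """EM estimates of mean and covariance for data containing NaNs,
    assuming multivariate normality and ignorable (MAR) missingness."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        # E-step: conditional means of missing entries, plus the
        # conditional covariance that a plain fill-in would ignore.
        X_hat = np.where(miss, 0.0, X)
        C = np.zeros((d, d))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                X_hat[i, m] = mu[m] + B @ (X[i, o] - mu[o])
                C[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
            else:  # row with nothing observed
                X_hat[i, m] = mu
                C += sigma
        # M-step: update parameters from the expected sufficient statistics.
        mu_new = X_hat.mean(axis=0)
        diff = X_hat - mu_new
        sigma_new = (diff.T @ diff + C) / n
        converged = (np.abs(mu_new - mu).max() < tol
                     and np.abs(sigma_new - sigma).max() < tol)
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    # X_hat holds the conditional-mean imputations from the last E-step.
    return mu, sigma, X_hat

# Simulated example: x2 is more likely to be missing when x1 is large (MAR).
rng = np.random.default_rng(1)
true_mu = np.array([0.0, 1.0])
true_sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
X_full = rng.multivariate_normal(true_mu, true_sigma, size=2000)
X = X_full.copy()
p_miss = 1.0 / (1.0 + np.exp(-2.0 * X_full[:, 0]))
X[rng.random(2000) < p_miss, 1] = np.nan

mu_hat, sigma_hat, X_imp = em_mvn(X)
print("complete-case mean of x2:", np.nanmean(X[:, 1]))
print("EM estimate of mean of x2:", mu_hat[1])
```

The accumulated term `C` is what separates EM from simply plugging in conditional means: without it, the estimated covariance would be too small.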
Using EM Specifically for MNAR Settings
EM alone does not automatically “solve” MNAR. The critical step is to incorporate a missingness model or an assumption that links missingness to values. Two common modelling approaches are:
1) Selection Models
Selection models jointly specify:
- A model for the data (e.g., a multivariate normal model for continuous variables, or a regression model for outcomes), and
- A model for the missingness indicator (whether a value is missing), where missingness can depend on the unobserved value.
EM is then used to estimate parameters for both parts. This is useful when you believe, for example, that people with higher values are less likely to report them.
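In symbols, the selection-model factorisation can be written as below. The notation is an assumption introduced here for illustration: y is the (partly unobserved) data, r_i = 1 indicates that y_i is missing, θ parameterises the data model and ψ the missingness model. The logistic form is just one common choice.

```latex
f(y, r \mid \theta, \psi) = f(y \mid \theta)\, f(r \mid y, \psi),
\qquad
\Pr(r_i = 1 \mid y_i, \psi) = \frac{1}{1 + \exp\{-(\psi_0 + \psi_1 y_i)\}}
```

In this form, ψ_1 = 0 means missingness does not depend on the value, and a positive ψ_1 encodes the belief that higher values are more likely to go unreported.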
2) Pattern-Mixture Models
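Pattern-mixture models group records by missingness pattern and model the data distribution within each pattern. For example, you might model responders and non-responders separately and then combine them, weighting each group by its share of the records. Because the data never reveal the distribution of the unobserved values among non-responders, that part of the model must be pinned down by explicit assumptions (for example, a shift relative to responders). EM helps estimate parameters when some groups have partially observed data.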
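A minimal sketch of the responder/non-responder idea for a single variable follows. The function name `pattern_mixture_mean` and the shift parameter `delta` are hypothetical. `delta` is an assumption supplied by the analyst, not something estimated from the data.

```python
import numpy as np

def pattern_mixture_mean(y, delta):
    """Overall mean under a simple two-pattern mixture.

    y     : 1-D array, NaN where the unit did not respond.
    delta : ASSUMED difference between the non-responder mean and the
            responder mean (cannot be estimated from the observed data).
    """
    y = np.asarray(y, dtype=float)
    responded = ~np.isnan(y)
    w_resp = responded.mean()          # share of responders
    mean_resp = y[responded].mean()    # estimable from the data
    mean_nonresp = mean_resp + delta   # identified only by assumption
    return w_resp * mean_resp + (1.0 - w_resp) * mean_nonresp

y = np.array([10.0, 12.0, np.nan, 11.0, np.nan])
print(pattern_mixture_mean(y, delta=0.0))  # same as the responder mean
print(pattern_mixture_mean(y, delta=5.0))  # non-responders assumed 5 units higher
```

Setting `delta=0.0` reproduces the answer you would get by ignoring the missingness. Any other value is a statement about how non-responders differ.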
In both approaches, domain logic is essential. A strong MNAR imputation strategy is as much about reasonable assumptions and sensitivity testing as it is about algorithms. This is a common theme in applied learning, including projects you might see in a data science course in Pune, where real-world data quality issues are part of the challenge.
Practical Workflow: EM-Based Imputation You Can Defend
Here is a practical, defensible process for EM imputation under suspected MNAR:
- Explore missingness patterns: Check missing rates by segment (region, channel, tenure), correlations with observed variables, and whether missingness spikes at meaningful thresholds.
- Start with a baseline MAR model: Fit an EM imputation under MAR assumptions first. This gives a reference point.
- Add an MNAR mechanism: Incorporate a selection model or pattern-mixture assumption that reflects the domain reality.
- Run EM and validate: Compare distributions before and after imputation, and evaluate downstream model stability (coefficients, feature importance, error rates).
- Perform sensitivity analysis: Because MNAR depends on assumptions, test multiple plausible settings. For example, assume missing values are on average 10% higher than predicted under MAR, then 20% higher, and observe the impact.
The goal is not to claim certainty, but to show how conclusions change under realistic missingness scenarios.
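The sensitivity step in the workflow above can be sketched as follows, assuming you already have a set of MAR-based imputations for the missing entries. The function name `sensitivity_table` and the toy numbers are hypothetical, and the overall mean stands in for whatever downstream quantity you care about.

```python
import numpy as np

def sensitivity_table(observed, mar_imputed, uplifts=(0.0, 0.10, 0.20)):
    """Overall mean when MAR-imputed values are inflated by each uplift.

    An uplift of 0.10 encodes the assumption that missing values are,
    on average, 10% higher than the MAR model predicts.
    """
    observed = np.asarray(observed, dtype=float)
    mar_imputed = np.asarray(mar_imputed, dtype=float)
    results = {}
    for u in uplifts:
        completed = np.concatenate([observed, mar_imputed * (1.0 + u)])
        results[u] = completed.mean()
    return results

observed = np.array([40.0, 50.0, 60.0])   # values that were actually recorded
mar_imputed = np.array([50.0, 50.0])      # imputations from a MAR model
table = sensitivity_table(observed, mar_imputed)
for uplift, mean in table.items():
    print(f"uplift {uplift:.0%}: overall mean = {mean:.2f}")
```

If the conclusion you care about holds across the whole range of plausible uplifts, it is robust to the MNAR assumption. If it flips, that dependence is itself a finding worth reporting.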
Common Pitfalls to Avoid
EM is often misused when teams skip the assumptions:
- Treating EM output as “truth”: Under MNAR, imputation is model-based inference, not a factual recovery of missing values.
- Ignoring model mismatch: If your data is highly non-normal or has strong nonlinear relationships, a simple EM model may not fit well. Consider transformations, alternative likelihoods, or multiple imputation variants.
- Overlooking uncertainty: Analysing a single imputed dataset as if it were fully observed underestimates variance. If possible, draw multiple imputations from the fitted model and combine estimates (for example, with Rubin's rules) to reflect uncertainty.
- Not documenting assumptions: MNAR work must be auditable. Record why you chose the missingness mechanism and what sensitivity tests were run.
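For the uncertainty pitfall above, the standard way to combine results from several imputed datasets is Rubin's rules. The sketch below assumes you have already fitted the same analysis on each of m completed datasets and collected the point estimates and their variances. The function name `pool_rubin` and the toy inputs are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    pooled = q.mean()                          # pooled point estimate
    within = u.mean()                          # average within-imputation variance
    between = q.var(ddof=1)                    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

pooled, total_var = pool_rubin([10.0, 11.0, 12.0], [1.0, 1.0, 1.0])
print(f"pooled estimate = {pooled:.2f}, total variance = {total_var:.3f}")
```

The between-imputation term is exactly what a single imputation leaves out, which is why single-imputation standard errors come out too small.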
Conclusion
Handling MNAR data is less about “filling blanks” and more about modelling why the blanks exist. Expectation-Maximization provides a structured way to estimate parameters and impute values while respecting statistical relationships, but MNAR requires explicit assumptions about the missingness process. When paired with careful pattern analysis and sensitivity testing, EM-based imputation can reduce bias and improve the reliability of downstream models. For professionals building capability through a data scientist course or advancing applied skills via a data science course in Pune, mastering EM for MNAR is a practical step toward more trustworthy, real-world analytics and machine learning.



