Incomplete data is one of the most common and most serious problems, not only for data professionals such as data analysts, data scientists, and machine learning engineers, but also for the decision-makers sitting at the end of an organisation's data pipeline. Completeness is an important data quality dimension, and it must be handled properly if the data is to be fully utilised while avoiding the pitfalls that lead to spurious conclusions. This is especially important when data drives key decisions through downstream modelling or AI systems.
While many people realise the need to handle missing values appropriately, few are well-equipped to do so. There is still a widespread belief that a simple approach such as ignoring the missingness (specifically, dropping incomplete records, also known as listwise deletion) is good enough. While this works in certain cases, we believe it is important to have a framework for handling missingness properly, and this article aims to provide one. The article starts with a few examples from the financial services industry where ignoring the missingness can lead to an erroneous conclusion. After that, we venture into the technical side of handling missing data (without any mathematics, of course) and conclude with a practical procedure.
Why can’t we just ignore the missingness?
Consider a bank that needs to calculate and model its capital requirements (Florez-Lopez, 2010). Under the Basel II accord, a bank may use the Internal Ratings-Based (IRB) approach to calculate regulatory capital only if it satisfies certain conditions and requirements and obtains approval. Two of these requirements are considered especially crucial. Firstly, banks must categorise their exposures into the asset classes defined by the accord, such as corporate, sovereign, bank, retail, and equity. Secondly, banks must provide accurate estimates of the risk parameters, particularly the Probability of Default (PD). The problem is that for certain asset classes, such as retail, the PD must be estimated from internal data, and in practice internal records are often incomplete. As such, estimating PD proves to be a tricky activity.
In such a case, ignoring or dropping the missing values at the modelling stage leads to a great loss of information and hence to an incorrect capital requirement calculation for the bank. Consider another case: Credit Default Swap (CDS) risk estimation (Bauer, Angelini, & Denev, 2017), which uses daily CDS quotes as the predictor. Daily quotes for short and medium maturities (6M, 1Y, 2Y, 5Y, 7Y) are frequently observed, whereas daily quotes for longer maturities (15Y, 20Y, 30Y) are frequently missing. What should we do to estimate the overall risk? Should we simply exclude the longer maturities from the equation and proceed? By doing so, we would fail to capture the true risk of the CDSs.
These examples show the need for a framework to handle missingness, which involves two stages:
- Understand why the data is missing
- Utilise the right techniques
Stage 1: Understanding why the data is missing
This is the most important step in handling missingness and, unfortunately, it is not trivial. Luckily for us, Rubin (a famous statistician) provides a useful and concrete categorisation. According to Rubin (1976), missingness in a data set can be categorised under three different mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing not at Random (MNAR). Without diving into the formal mathematical definition of each mechanism, here is a brief explanation:
- Missing Completely at Random (MCAR): Missingness is unrelated to any variable, observed or unobserved, including the missing values themselves. In other words, values are missing purely by chance.
- Missing at Random (MAR): Missingness is related to one or more observed variables but, once those variables are accounted for, not to the missing values themselves. Contrary to the name, the missingness under this mechanism is not random at all but rather quite systematic. Although the name is somewhat misleading, we will stick with it because it is still commonly used in the literature.
- Missing not at Random (MNAR): Missingness is related to unobserved variables, to the values of the missing variables themselves, or to both.
To bring the explanation to life, let’s consider the CDS risk estimation case. This is how the missing values could be generated under each mechanism:
- Missing Completely at Random (MCAR): A system failure occurs in the database where the data set is stored and causes the daily quotes for some maturities to go missing without any pattern.
- Missing at Random (MAR): The daily quotes of some CDSs are missing because they are illiquid (liquidity needs to be an observed variable).
- Missing not at Random (MNAR): The daily quotes of some CDSs are missing because the quotes themselves fall below a certain threshold.
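As a minimal sketch of the three mechanisms, the Python snippet below simulates them on synthetic data and compares the mean of the quotes that remain observed. The liquidity score, its linear link to the quote, the 30% loss rate, and the cut-offs of -0.5 and 85 are illustrative assumptions, not figures from the CDS study.

```python
import random
import statistics

random.seed(0)

# Hypothetical synthetic data: each row is one CDS with an observed
# liquidity score and a daily quote that is positively related to it.
rows = []
for _ in range(5000):
    liquidity = random.gauss(0.0, 1.0)
    quote = 100.0 + 20.0 * liquidity + random.gauss(0.0, 10.0)
    rows.append((liquidity, quote))


def observed_mean(rows, is_missing):
    """Mean of the quotes that survive the given missingness rule."""
    return statistics.fmean(q for liq, q in rows if not is_missing(liq, q))


true_mean = statistics.fmean(q for _, q in rows)

# MCAR: every quote has the same 30% chance of being lost.
mcar_mean = observed_mean(rows, lambda liq, q: random.random() < 0.3)

# MAR: quotes are lost only for illiquid CDSs; liquidity itself stays observed.
mar_mean = observed_mean(rows, lambda liq, q: liq < -0.5)

# MNAR: quotes are lost when the quote itself is below a threshold.
mnar_mean = observed_mean(rows, lambda liq, q: q < 85.0)

print(f"true mean          : {true_mean:.1f}")
print(f"observed under MCAR: {mcar_mean:.1f}")
print(f"observed under MAR : {mar_mean:.1f}")
print(f"observed under MNAR: {mnar_mean:.1f}")
```

In this toy setup, the observed mean stays close to the true mean only under MCAR. Under MAR the shift can in principle be corrected because liquidity is observed, whereas under MNAR the cause of the shift is hidden in the missing quotes themselves.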
Knowing these three mechanisms, however, is not enough, because deciding which mechanism causes the missingness in the data is much more complex than it seems. Firstly, we may not have much prior information at hand. For example, how do we know that we can assume the missingness is MCAR? How do we know that our missing values are not caused by unobserved variables? Secondly, the missing data mechanisms are not mutually exclusive: a data set can contain more than one mechanism, and very often it does. In a situation like this, there are two approaches. The first is to use statistical techniques, such as hypothesis testing, to try to reject MCAR. However, as with any hypothesis test, failing to reject MCAR does not confirm that the mechanism is MCAR. Hence, hypothesis testing is not a panacea and is in fact only a first step. The second, much more robust approach is to collect extra information about the missingness, for example through discussion with subject matter experts (Bauer, Angelini, & Denev, 2017) or the data owners.
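The hypothesis-testing idea can be sketched as follows: split an observed variable by whether the quote is missing and compare the two groups with an unpaired (Welch) t statistic. The data sets, the liquidity variable, and the missingness rates below are hypothetical, and the statistic is computed by hand so the example needs only the standard library.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch's unpaired t statistic for the difference in means."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


def missingness_t(rows):
    """Compare liquidity between rows with and without an observed quote."""
    with_quote = [liq for liq, quote in rows if quote is not None]
    without_quote = [liq for liq, quote in rows if quote is None]
    return welch_t(with_quote, without_quote)


def drop_quote(probability):
    """Return None (missing) with the given probability, else a dummy quote."""
    return None if random.random() < probability else 100.0


random.seed(1)
liquidity = [random.gauss(0.0, 1.0) for _ in range(4000)]

# Hypothetical data set 1: quotes vanish with a flat 20% probability (MCAR).
mcar_rows = [(liq, drop_quote(0.2)) for liq in liquidity]

# Hypothetical data set 2: illiquid names lose quotes far more often (MAR).
mar_rows = [(liq, drop_quote(0.5 if liq < 0 else 0.05)) for liq in liquidity]

t_mcar = missingness_t(mcar_rows)
t_mar = missingness_t(mar_rows)
print(f"t statistic, flat missingness        : {t_mcar:.2f}")
print(f"t statistic, liquidity-driven missing: {t_mar:.2f}")
```

A large absolute t value is evidence against MCAR. A small one does not prove MCAR, and neither outcome can separate MAR from MNAR, which is why the discussion with subject matter experts remains necessary.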
The methods above can also be combined with a technique called the inclusive analysis strategy, which aims to prevent the most problematic mechanism: MNAR. The intuition behind this strategy is to design and execute the data collection stage carefully so that it captures variables that are highly correlated with the features prone to be missing. By doing this, we make the missingness depend on variables we have observed, ‘forcing’ the missing features towards MAR.
Stage 2: Utilising the right techniques
After understanding the 'why', we can proceed to the various techniques that are widely available, from the simplest, such as mean, median, or mode imputation, to deep learning algorithms such as Generative Adversarial Networks (GANs). The bad news is that there is no one-size-fits-all technique, because the right choice depends on the missing data mechanism. However, there is a consensus that imputing missing values with the mean, median, or mode should be done with extra caution, because it introduces statistical bias for two reasons. Firstly, it treats the imputed value as if it were exactly the true value, which understates the variability in the data. Secondly, it does not take into account the relationships between features in the data. Another piece of advice is to avoid applying imputation techniques arbitrarily, regardless of the missing data mechanism, because most techniques operate under the MAR assumption rather than MNAR. Applying them blindly may cause serious ramifications down the line. Additionally, if the missing values constitute only a small percentage of the data (around 1% to 5%), it might be worthwhile to ignore them, unless of course the missing values are crucial to the analysis, as in our examples above. Here is a list of popular techniques for each missing data mechanism:
- MCAR: listwise deletion, Multivariate Imputation by Chained Equations (MICE) (Buuren & Groothuis-Oudshoorn, 2011), Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977)
- MAR: Multivariate Imputation by Chained Equations (MICE) (Buuren & Groothuis-Oudshoorn, 2011), Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977)
- MNAR: Heckman selection model (Heckman, 1976)
The list above is not exhaustive. For example, under a MAR or MCAR assumption, it is also possible to impute missing values with a tree-based model such as MissForest (Stekhoven & Buhlmann, 2012), or with a GAN-based algorithm (Guo, Wan, & Ye, 2019) for time series data.
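To illustrate why mean imputation calls for caution, the sketch below removes values completely at random from synthetic data and fills them in twice: once with the mean, and once with stochastic regression imputation (a least-squares prediction plus random residual noise). The data-generating process, the 30% missingness rate, and all variable names are assumptions made for this example.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical complete data: x is always observed and y depends on x.
xs = [random.gauss(0.0, 1.0) for _ in range(3000)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]

# Remove 30% of y completely at random (MCAR) so the truth stays known.
missing = [random.random() < 0.3 for _ in xs]
obs_x = [x for x, m in zip(xs, missing) if not m]
obs_y = [y for y, m in zip(ys, missing) if not m]

# Mean imputation: every gap receives the same value.
y_mean = statistics.fmean(obs_y)
mean_imputed = [y_mean if m else y for y, m in zip(ys, missing)]

# Stochastic regression imputation: fit y on x by least squares on the
# observed rows, then add random residual noise to each prediction.
x_bar = statistics.fmean(obs_x)
slope = (sum((x - x_bar) * (y - y_mean) for x, y in zip(obs_x, obs_y))
         / sum((x - x_bar) ** 2 for x in obs_x))
intercept = y_mean - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(obs_x, obs_y)]
residual_sd = statistics.stdev(residuals)
regression_imputed = [
    (intercept + slope * x + random.gauss(0.0, residual_sd)) if m else y
    for x, y, m in zip(xs, ys, missing)
]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - a_bar) ** 2 for ai in a)
                           * sum((bi - b_bar) ** 2 for bi in b))


var_true = statistics.variance(ys)
var_mean_imputed = statistics.variance(mean_imputed)
var_regression_imputed = statistics.variance(regression_imputed)
corr_true = pearson(xs, ys)
corr_mean_imputed = pearson(xs, mean_imputed)
corr_regression_imputed = pearson(xs, regression_imputed)

print(f"variance   : true {var_true:.2f}, mean-imputed {var_mean_imputed:.2f}, "
      f"regression-imputed {var_regression_imputed:.2f}")
print(f"corr(x, y) : true {corr_true:.2f}, mean-imputed {corr_mean_imputed:.2f}, "
      f"regression-imputed {corr_regression_imputed:.2f}")
```

In this toy setting, mean imputation shrinks both the variance of y and its correlation with x, which corresponds to the two sources of bias described above, while the regression-based fill stays much closer to the complete data.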
A General Guideline
There is no one-size-fits-all technique, because the field of incomplete data analysis is wide and depends heavily on practical considerations. We have distilled our recommendations into a practical guideline for handling missing values:
1. Have a discussion with the data champions or subject matter experts.
   - Do we need all variables in our analysis? If most of the missing values are located within variables that are not needed, drop those variables.
   - Are the missing values themselves a subject of interest in the analysis?
   - Can we deduce the missing data mechanisms right away?
2. Check the number of rows that have missing values and calculate their proportion of the whole data set.
   - If the proportion is somewhere around 1% to 5% and there is a strong reason to believe the missingness is MCAR, drop these rows. Otherwise, proceed to step 3.
3. Identify the missing data mechanisms.
   - Start by using statistical tests, such as the unpaired t-test, to check whether MCAR can be rejected.
     - If the missing values are MCAR, it is possible to use listwise deletion or other methods.
     - If the missing values are not MCAR, do not use listwise deletion; use other methods.
   - Unless shown otherwise, assume MAR and MNAR are both present.
   - Have another round of discussion with the data owners or subject matter experts.
     - Can we exclude one of the two?
4. Use a combination of techniques to impute the missing values and compare the results.
   - If time is of the utmost importance, techniques such as mean, median, or mode imputation can be used. Stochastic regression imputation is a more advanced technique that is still easy to implement and generally gives better results.
   - If the data is much more complicated, meaning many observations and features, multiple-imputation techniques such as MICE or model-based techniques such as EM or GAN-based imputation can be utilised.
5. Once we have used a combination of techniques, evaluate the imputation accuracy using test data sets that are related to the specific case at hand.
   - Use more than one test data set if possible.
   - If multiple test data sets are used, calculate the average imputation accuracy.
   - Select the technique that yields the highest average imputation accuracy.
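The last two parts of the guideline (apply several techniques, then compare their imputation accuracy on test data) can be sketched as below. Known values are hidden on purpose, each technique fills them in, and the root mean squared error (RMSE) against the hidden truth serves as the accuracy measure, so the highest accuracy corresponds to the lowest average RMSE. The three synthetic test data sets, the 20% masking rate, and the two candidate techniques are assumptions chosen for illustration.

```python
import math
import random
import statistics


def impute_mean(x, y_obs):
    """Fill gaps (None) in y_obs with the mean of the observed values."""
    fill = statistics.fmean(y for y in y_obs if y is not None)
    return [fill if y is None else y for y in y_obs]


def impute_regression(x, y_obs):
    """Fill gaps (None) in y_obs with least-squares predictions from x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
    x_bar = statistics.fmean(xi for xi, _ in pairs)
    y_bar = statistics.fmean(yi for _, yi in pairs)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
             / sum((xi - x_bar) ** 2 for xi, _ in pairs))
    return [y_bar + slope * (xi - x_bar) if yi is None else yi
            for xi, yi in zip(x, y_obs)]


def masked_rmse(impute, x, y_true, hidden):
    """Hide the flagged values, impute them, and score against the truth."""
    y_obs = [None if h else y for y, h in zip(y_true, hidden)]
    y_filled = impute(x, y_obs)
    errors = [(f - t) ** 2 for f, t, h in zip(y_filled, y_true, hidden) if h]
    return math.sqrt(statistics.fmean(errors))


rng = random.Random(3)
techniques = {"mean": impute_mean, "regression": impute_regression}
scores = {name: [] for name in techniques}

for _ in range(3):  # three hypothetical complete test data sets
    x = [rng.gauss(0.0, 1.0) for _ in range(500)]
    y_true = [1.5 * xi + rng.gauss(0.0, 0.5) for xi in x]
    # The same mask is reused for every technique so the comparison is fair.
    hidden = [rng.random() < 0.2 for _ in y_true]
    for name, impute in techniques.items():
        scores[name].append(masked_rmse(impute, x, y_true, hidden))

average_rmse = {name: statistics.fmean(vals) for name, vals in scores.items()}
best = min(average_rmse, key=average_rmse.get)
print(average_rmse)
print("selected technique:", best)
```

The masking here is completely at random, so the comparison says little about performance under MAR or MNAR. In practice the mask should mimic the missingness pattern identified earlier in the procedure.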
References
Bauer, J., Angelini, O., & Denev, A. (2017). Imputation of multivariate time series data - performance benchmarks for multiple imputation and spectral techniques. SSRN 2996611.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society.
Florez-Lopez, R. (2010). Effects of Missing Data in Credit Risk Scoring. A Comparative Analysis of Methods to Achieve Robustness in the Absence of Sufficient Data. The Journal of the Operational Research Society.
Guo, Z., Wan, Y., & Ye, H. (2019). A data imputation method for multivariate time series based on generative adversarial network. Neurocomputing.
Rubin, D. B. (1976). Inference and Missing Data. Biometrika.
Stekhoven, D. J., & Buhlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 112-118.