The problem of data biases
It is common practice at universities for math and statistics professors to demonstrate models using perfect, and in many cases even fictional, data. When students later apply these models in real-world scenarios, however, they encounter the stark reality that perfect data does not exist. Most datasets carry underlying biases, many of which remain unidentified. And even when biases are recognized, there is no formula to correct for them.
A significant part of my role as a data scientist is identifying potential issues in the data and assessing the limitations of both the model and the dataset. The most prevalent of these issues are data biases, which I group into selection/sample biases, variable biases, and what I call temporal biases.
Data biases are usually an issue in inferential statistics and statistical learning. Selection bias arises when the distribution of items and their characteristics in the data does not reflect the actual distribution in the total population. It occurs when either of two theoretical requirements is violated:
The first requirement concerns how the data are collected, in particular the use of a random selection method. Random selection means that each item in the total population has an equal chance of being included in the sample. In practice this theoretical requirement is difficult to meet, especially when human data are involved: some people simply prefer not to participate in surveys, and certain population groups are less likely to be reached through specific communication technologies or at certain times.
The second requirement is a sufficiently large number of randomly selected items relative to the total population, a condition that follows from the law of large numbers.
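Both requirements can be illustrated with a small simulation. The sketch below uses a made-up population in NumPy: the mean of a truly random sample converges to the population mean as the sample grows (law of large numbers), while a deliberately non-random sample stays off target no matter how large it is. The population and the selection channel are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: a right-skewed, income-like variable.
population = rng.lognormal(mean=10, sigma=0.5, size=1_000_000)
true_mean = population.mean()

# Requirement 1: random selection -- every item has an equal chance of inclusion.
# Requirement 2: the sample must be large enough (law of large numbers).
for n in [10, 100, 1_000, 10_000, 100_000]:
    random_sample = rng.choice(population, size=n, replace=False)
    print(f"n={n:>6}: random sample mean deviates from the population mean "
          f"by {abs(random_sample.mean() - true_mean):.2f}")

# A non-random selection (e.g. a survey channel that only reaches the top of
# the distribution) remains biased however large the sample becomes.
biased_sample = np.sort(population)[-100_000:]
print(f"Non-random sample of 100,000: deviation "
      f"{abs(biased_sample.mean() - true_mean):.2f}")
```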
Another form of data bias occurs when essential variables are misrepresented or missing. If independent variables crucial for explaining the dependent variable are absent from the dataset, logical or mathematical models will yield higher errors. Variable bias can also compound selection bias when the same variable rests on different underlying definitions and therefore represents distinct quantities.
This issue becomes pronounced in country-level cross-sectional data, where the data are collected from various sources or countries that define variables differently. Variable biases may also emerge when measurement methods are inaccurate, for example when diverse instruments rely on different parameters or when external disturbances interfere with the measurement process.
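To make the missing-variable point concrete, here is a minimal, hypothetical simulation: a linear regression fitted with and without one of the two variables that truly drive the outcome. All variable names, coefficients, and noise levels are invented for illustration; the only claim is that omitting a crucial predictor raises the model's error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 5_000

# Two hypothetical independent variables that truly drive y.
x1 = rng.normal(size=n)  # an observed predictor
x2 = rng.normal(size=n)  # a crucial predictor missing from many datasets
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

full_model = LinearRegression().fit(np.column_stack([x1, x2]), y)
partial_model = LinearRegression().fit(x1.reshape(-1, 1), y)

mse_full = mean_squared_error(y, full_model.predict(np.column_stack([x1, x2])))
mse_partial = mean_squared_error(y, partial_model.predict(x1.reshape(-1, 1)))

print(f"MSE with both variables: {mse_full:.2f}")    # close to the noise variance
print(f"MSE with x2 omitted:     {mse_partial:.2f}")  # substantially higher
```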
A third type of bias arises along the temporal dimension, when historical changes shape variables or items over time. A machine learning algorithm trained on historical data struggles to predict future events if the conditions leading to those events were not reflected in the original training data. Temporal biases can also be linked to variable biases, for instance when the definition of a variable changes over time, hindering comparison of the same variable across different periods.
Similarly, changes in measurement behavior pose challenges. Take conflict data, for instance, where casualties resulting from armed conflict are often reported by the media. Yet media attention spans are limited, and journalists' focus on a specific conflict can fluctuate over time, complicating the analysis of casualty data and the conclusions that can be drawn about conflict dynamics.
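A small, hypothetical sketch of the temporal problem: a model is trained only on "historical" data, and the relationship between the predictor and the outcome then shifts in the "future" period. The shift itself is fabricated for illustration, but it shows how a model calibrated on pre-change conditions degrades once those conditions no longer hold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

# Hypothetical "historical" period: y depends on x with slope 2.
x_hist = rng.normal(size=2_000)
y_hist = 2.0 * x_hist + rng.normal(scale=0.3, size=2_000)

# Hypothetical "future" period: conditions change and the slope drops to 0.5.
x_future = rng.normal(size=2_000)
y_future = 0.5 * x_future + rng.normal(scale=0.3, size=2_000)

# Train only on the historical period, then evaluate on both periods.
model = LinearRegression().fit(x_hist.reshape(-1, 1), y_hist)

print("MAE on historical data:", round(mean_absolute_error(
    y_hist, model.predict(x_hist.reshape(-1, 1))), 2))
print("MAE on future data:    ", round(mean_absolute_error(
    y_future, model.predict(x_future.reshape(-1, 1))), 2))
```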
Unbiased data are a vital prerequisite for robust statistical analysis at every level. Without them, conclusions drawn from statistical comparisons, inferential statistics, and probability theory are unreliable. While methods such as Naive Bayes are less sensitive to whether the data were randomly selected, inferential statistics and classical probability theory rely on the assumption of normally distributed data.
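One way to act on that last point is to check the normality assumption before relying on it. The sketch below applies SciPy's Shapiro-Wilk test to two synthetic samples, one genuinely normal and one skewed; the samples are made up, and the test is only one of several possible checks, not a prescribed procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

normal_sample = rng.normal(loc=0, scale=1, size=500)
skewed_sample = rng.exponential(scale=1, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    stat, p_value = stats.shapiro(sample)
    verdict = "no evidence against normality" if p_value > 0.05 else "normality rejected"
    print(f"{name} sample: W={stat:.3f}, p={p_value:.4f} -> {verdict}")
```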