
Pitfalls of Data Reduction

“Data Reduction” is the technical name for what statisticians, scientists, and analysts do when there’s too much information to process. In a world where the internet gives you hundreds of conflicting answers to pretty much any question, it’s a necessary step.

There are several approaches, depending on what data’s available and what you need to know about it: averages, case studies, changes of variables, and regression.

Each approach has strengths and weaknesses, and choosing the wrong approach has the potential to mislead the researchers and/or their audience. Note that just because the information is misleading doesn’t mean it’s wrong or intentionally deceptive. Researchers often have intuitions, opinions, and a vested interest in the subject of their research - but there are ways to mitigate those problems. I intend to discuss both “lying with statistics” and “verifying scientific research” in future posts. For now, I’ll discuss each of the major data reduction methods in turn.

Averages

[Figure: Some alternate averaging functions]

There are several types of “average” used for different purposes. I’m slightly broadening the notion of an “average” here: I use it to refer not just to the mean, median, or mode, but also to windowed and weighted averages. Some averages also use a change of variables as an intermediate step. (For example, you can construct the root-mean-square average, or take the inverse of the mean of the inputs’ inverses, better known as the harmonic mean - see the picture at right.) What averages have in common is that the output is conceptually the same kind of quantity as each of the inputs. If you average together the heights of a thousand people, you get a single height as the output. Additionally, the average will always fall somewhere between the minimum and maximum values of the inputs.
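As a minimal sketch of those alternate averaging functions (the heights below are made up purely for illustration), each flavor of average lands somewhere between the smallest and largest input:

```python
import statistics
import math

# A small made-up sample of heights, in centimeters.
heights = [152.0, 160.5, 168.0, 175.5, 181.0, 193.0]

n = len(heights)
arithmetic_mean = sum(heights) / n
median = statistics.median(heights)
root_mean_square = math.sqrt(sum(h ** 2 for h in heights) / n)
# "Inverse of the mean of the inputs' inverses" -- the harmonic mean.
harmonic_mean = n / sum(1.0 / h for h in heights)

for name, value in [("mean", arithmetic_mean),
                    ("median", median),
                    ("root-mean-square", root_mean_square),
                    ("harmonic mean", harmonic_mean)]:
    # Every one of these averages falls between the min and max input.
    assert min(heights) <= value <= max(heights)
    print(f"{name:>17}: {value:6.1f} cm")
```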

Averages are useful for developing a description of a typical sample, but there is much potential trouble in that notion of “typical.” For example, a country’s per-capita GDP is effectively the mean income. However, suppose Economy A has one million people, one of whom makes $100 billion per year while the rest have no income at all. Compare that with Economy B, which also has one million people, each of whom makes $100,000. Both economies have the same population and GDP, and thus the same per-capita GDP, but their living conditions are likely to be very different. Although A and B are obviously extreme cases, economists and other social scientists must concern themselves with various intermediate possibilities.
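A quick sketch with hypothetical income lists for the two economies shows how the mean hides exactly the difference that the median exposes:

```python
import statistics

# Hypothetical incomes: Economy A has one earner of $100 billion and
# 999,999 people with no income; Economy B has one million people
# each earning $100,000.
economy_a = [100_000_000_000] + [0] * 999_999
economy_b = [100_000] * 1_000_000

for name, incomes in [("A", economy_a), ("B", economy_b)]:
    mean = sum(incomes) / len(incomes)   # per-capita GDP: total income / population
    median = statistics.median(incomes)
    print(f"Economy {name}: mean = ${mean:,.0f}, median = ${median:,.0f}")

# Economy A: mean = $100,000, median = $0
# Economy B: mean = $100,000, median = $100,000
```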

Averages are often paired with measures of uncertainty and variation such as the standard deviation. Even when the average income is quite high, there will be a few people with very low, possibly negative, incomes. Averages and standard deviations tend to be more robust when more cases are considered, but they may also be thrown off by a few highly unusual cases. It is often necessary to clean the data by removing erroneous, obviously wrong, or extremely odd data points. At times, the assessment and elimination of outliers can cause controversy.
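As a rough illustration of how a single bad data point can throw off both numbers (the incomes and the cutoff below are invented, and the median-based rule is just one of several cleaning strategies), compare the summary statistics before and after removing the outlier:

```python
import statistics

# Made-up incomes plus one wildly implausible entry -- perhaps a data-entry error.
incomes = [42_000, 51_000, 38_500, 47_250, 60_000, 55_750, 5_000_000_000]

print("raw mean: ", statistics.mean(incomes))   # dominated by the one huge value
print("raw stdev:", statistics.stdev(incomes))

# Flag points far from the median, measured in multiples of the median
# absolute deviation (MAD) rather than the standard deviation, since the
# standard deviation itself is inflated by the outlier.  The factor of 10
# is an arbitrary cutoff chosen for this example.
med = statistics.median(incomes)
mad = statistics.median(abs(x - med) for x in incomes)
cleaned = [x for x in incomes if abs(x - med) <= 10 * mad]

print("cleaned mean: ", statistics.mean(cleaned))
print("cleaned stdev:", statistics.stdev(cleaned))
```

Deciding where that cutoff belongs (and defending it) is precisely where the controversy over outlier elimination tends to start.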

Case Studies

A case study is an in-depth analysis of a single individual or situation. Case studies are usually used for exploratory analysis, or when getting sufficient information from a large number of individuals is prohibitively expensive. For example, medical case studies may track a few individuals over the course of their entire lives. Although the statistical relevance of an individual history is small, a case study is less invasive than a controlled experiment with a similar level of detail and sidesteps many ethical issues. Case studies can help identify which variables are important for later, more rigorous investigation.

Case studies are also very useful in describing issues to laypeople. Your knowledge of your personal history and circumstances is, in some sense, a case study: your life is the “case,” and your analysis and introspection of your personal history is the “study.” Similarly, your knowledge of your friends’ and family members’ histories forms around 150 additional case studies. Many news articles can also be viewed as limited case studies, describing a single crime, political debate, or event. (I am admittedly blurring the line between “case study” and “example” here; a case study is more detailed and rigorous than an example.)

Because case studies (and examples) do not sample a large number of individual cases, they are vulnerable to bias and to fallacies like post hoc ergo propter hoc. Assertions about general trends should never be based primarily on a single case study: Barack Obama being elected President of the United States did not signal the end of racism, or even of racism in US politics. People often view themselves and their friends as harder-working and more capable than those in other groups. This can lead them to assume that because something wasn’t a problem for them, it must not be a valid problem for anyone.

Change of Variables

Sometimes it is possible to simplify a problem by combining several raw variables, inferring something that is not directly measurable. A simple application might track the difference between each individual’s weight before and after participating in an exercise program. It may be useful to convert a planet’s distance, apparent brightness, and albedo into a diameter, or to estimate a star’s temperature from the ratio of its emission at two different colors. For automated analyses, this is often combined with regression (discussed below) to search for patterns in the data: correlated variables or combinations of variables. Changing variables typically involves a model of the real-world quantities, and that model necessarily makes assumptions. Sometimes the assumptions are faulty, which can lead to confusion - and also to new discoveries. Once the models are verified, they can be used to describe quantities that are not directly measurable because of constraints of time or distance.
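As a toy version of the exercise-program example (the participant IDs and weights below are invented), the raw before/after measurements get collapsed into a single derived variable per person, and the choice of that variable is itself a modeling assumption:

```python
# Raw measurements: each participant's weight (kg) before and after a
# hypothetical exercise program.  The derived variable -- weight change --
# is what the analysis actually cares about, not the raw weights.
before = {"p01": 82.4, "p02": 95.1, "p03": 70.8, "p04": 88.0}
after  = {"p01": 79.9, "p02": 94.6, "p03": 71.2, "p04": 84.3}

change = {pid: after[pid] - before[pid] for pid in before}

for pid, delta in change.items():
    print(f"{pid}: {delta:+.1f} kg")

# The change of variables bakes in an assumption: that the absolute
# difference, rather than (say) the percentage change, is the meaningful
# quantity.  A different model would derive a different variable.
mean_change = sum(change.values()) / len(change)
print(f"mean change: {mean_change:+.2f} kg")
```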

Regression

I’m using “regression” as a catch-all for a wide variety of automated and semi-automated pattern-extraction algorithms, specifically including both supervised and unsupervised machine learning techniques. What all these techniques have in common is the goal of restructuring the data into a more convenient or intuitive form based on the data itself. Supervised learning (and classic statistical regression) techniques output rules of thumb like “these things tend to happen together” or “if you measure X between X_low and X_high, you’ll probably find Y between Y_low and Y_high.” Unsupervised learning techniques tend to emit groupings of similar cases.

Statistical regression and machine learning techniques are similar in that both tend to rely on averages of quantities like the ratios of variables. Despite the popularity of machine learning and AI, there is nothing magical happening here: the computer merely iterates through many combinations of variables and different types of averages, looking for patterns that match a specified format. All the caveats associated with averages apply here as well. Changing variables is often a component of the analysis feeding into a regression or machine-learning application, so those caveats apply too.

In addition, regression and machine-learning techniques may be vulnerable to humans’ and computers’ capacity to see patterns where none exist - a problem that can be exacerbated by publish-or-perish mentalities. There are statistical techniques to measure the significance of any correlation, and they should be used rigorously.
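As a small sketch of that last point (the data here are synthetic, and NumPy and SciPy are assumed to be available), a simple linear regression reports not just a slope but a p-value, which is what separates a real relationship from a pattern seen in noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: y genuinely depends on x (plus noise); z is pure noise.
x = rng.uniform(0.0, 10.0, size=50)
y = 2.5 * x + rng.normal(0.0, 4.0, size=50)
z = rng.normal(0.0, 4.0, size=50)

for name, target in [("y (real relationship)", y), ("z (no relationship)", z)]:
    fit = stats.linregress(x, target)
    print(f"{name}: slope = {fit.slope:+.2f}, r = {fit.rvalue:+.2f}, "
          f"p-value = {fit.pvalue:.3g}")

# A small p-value says the apparent trend is unlikely to be a fluke of this
# sample; a large one is a warning that we may be fitting a pattern to noise.
```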
