Book Rounds: Statistical Smarts


Book Rounds, Professional Skills Development / Tuesday, July 2nd, 2019

How to Lie with Statistics

Darrell Huff

A humorous, insightful, and easy read outlining the foibles of statistics and our interpretation of them. Highly recommended as a refresher for those who feel they've lost touch with statistics, or as a reasonable reference to offer owners who wish to be well-informed. The author uses many real-life examples of statistics gone wrong, which greatly improves understanding.

The title is rather tongue-in-cheek, as the author points out that our infatuation with precision washes right over our common sense just as often as the misuse or misinterpretation of statistics causes problems. The author sums up his purpose with: "The crooks already know these tricks; honest men must learn them in self-defense."

The common manipulations he suggests we be wary of are as follows: 

Sample Bias: Even the most honest of researchers can be blindsided by unintentional sample bias. Be especially wary of self-selecting or self-reporting samples. Conclusions must be limited to the population actually represented, which rarely constitutes a "true" representative sample. Sample size can also bias the results: small samples are rarely representative of the underlying distribution of data points. Check both the number and the source of data points.
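The self-selection problem above can be sketched in a few lines of code. This is a hypothetical simulation (the satisfaction scores and response behavior are invented for illustration): happier customers are assumed to be more likely to answer a survey, so the survey's average overshoots the true population average.

```python
import random

random.seed(0)

# Hypothetical population of 10,000 customers with satisfaction
# scores from 1-10, roughly uniform (invented for illustration).
population = [random.randint(1, 10) for _ in range(10_000)]

# Self-selecting survey: assume the chance of responding grows
# with the score, so happier customers answer more often.
respondents = [s for s in population if random.random() < s / 10]

pop_mean = sum(population) / len(population)
sample_mean = sum(respondents) / len(respondents)

print(f"population mean:          {pop_mean:.2f}")
print(f"self-selected survey mean: {sample_mean:.2f}")
# The survey mean lands well above the population mean, even though
# every individual respondent answered honestly.
```

The bias here requires no dishonesty at all, which is Huff's point: the sampling mechanism alone distorts the result.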

Average: Average can actually be a very non-specific term, and may be used to refer to the mean, median, or mode. Frequently, the mean is used. This may be helpful in some cases, but since it is the sum of all values divided by the number of values, it may not give a clear picture: a small number of samples lying at one extreme can skew the mean toward that end of the range. The median is the value above which half of the values lie and below which the other half lie; it gives a clearer picture of the distribution. The mode is the most frequently occurring value, so it can still skew the picture, but toward frequency rather than range. Knowing which term is being used as "average" gives you a bit more information to understand the bias. The author warns that "an unqualified 'average' may be … meaningless".
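The three "averages" can be seen side by side with a quick sketch. The incomes below are made up for illustration: most cluster in the 40s (in thousands), and one outlier drags the mean far from anything typical, while the median and mode stay put.

```python
from statistics import mean, median, mode

# Hypothetical neighborhood incomes in thousands: most cluster
# around 40-50, one outlier drags the mean upward.
incomes = [40, 42, 45, 45, 48, 50, 52, 1_000]

print(mean(incomes))    # 165.25 : skewed toward the outlier
print(median(incomes))  # 46.5   : half lie above, half below
print(mode(incomes))    # 45     : the most frequent value
```

All three are honest "averages" of the same data, which is exactly why an unqualified "average" tells you so little.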



Data presentation: The data may be presented in a variety of ways that warp our understanding. Graphs or visual representations can be manipulated in manners that invite us to subconsciously draw conclusions that may not be correct. A graph should be labeled on both axes. Check those labels to make sure they make sense: do you know what the numbers represent, or is there just a set of numbers strung up the side of the graph? The data may be honestly and fairly transformed (say, placed on a logarithmic scale), yet the visual impression exaggerates the differences between groups. Alternatively, the scale can be truncated to omit chunks of numbers, which again toys with the visual perception of the data. Always be suspicious of visual depictions that aren't labeled.
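The truncated-axis trick can be demonstrated even with plain text bars. In this invented example, two products differ by only 4%, but starting the "axis" at 95 instead of 0 makes one bar look five times longer than the other:

```python
# Hypothetical sales figures: the two products are nearly identical.
sales = {"Product A": 96, "Product B": 100}

def bar(value, baseline, scale=2):
    """Render a text bar for the portion of `value` above `baseline`."""
    return "#" * ((value - baseline) * scale)

# Axis starting at zero: the bars look, correctly, almost the same.
for name, v in sales.items():
    print(f"{name}: {bar(v, baseline=0, scale=1)}")

# Axis truncated to start at 95: B suddenly looks 5x bigger than A,
# though the underlying numbers haven't changed at all.
for name, v in sales.items():
    print(f"{name}: {bar(v, baseline=95)}")
```

Nothing about the data was falsified; only the baseline moved, which is what makes the trick so easy to miss when axis labels are absent.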

https://imgs.xkcd.com/comics/normal_distribution.png

Correlation vs Causation: A very common error and tendency is to identify correlation and presume causation. Correlation only indicates that two factors vary together. It is entirely possible (and frequently probable) that this is merely a coincidence. Sensationalism and natural human tendencies will presume that this commonality indicates that one caused the other. Even if causation is present, the statistics indicating correlation don't clarify which factor is the cause and which the effect. In other words, a chicken-or-the-egg conundrum should be considered if you believe that a reported correlation indicates causation. Finally, correlation may suggest a direct relationship when the relationship is actually the consequence of an entirely different factor. For example, every time your child gets a cold, they always have a runny nose and a cough. That doesn't mean the cough causes the runny nose, nor that the runny nose causes the cough. Rather, the cold virus causes both events to occur!

p-value: This is a critical evaluator of the data: it estimates the probability of seeing results at least this extreme by chance alone, if no true difference exists. The generally accepted threshold for significance is p<0.05. While generally accepted, this cutoff is somewhat arbitrarily set. A researcher may choose to set significance at p<0.01, partly to impress upon others their integrity and stringent evaluation methods, and partly to indicate that the probability that their findings are random chance is particularly low. There is nothing to stop a researcher from setting their significance threshold at 0.1 or even 0.5, other than perhaps peer ridicule. Be suspicious of data presented or implied to be significant which is not accompanied by a p-value.
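To make the "due to chance" idea concrete, here is a minimal permutation test on invented treatment and control measurements. It repeatedly shuffles the group labels and asks how often a random relabeling produces a difference at least as large as the one observed; that fraction is a p-value.

```python
import random

random.seed(2)

# Hypothetical measurements, invented for illustration.
treated = [14, 15, 16, 17, 18, 19]
control = [10, 11, 12, 13, 14, 15]
observed = sum(treated) / len(treated) - sum(control) / len(control)

# Permutation test: shuffle the pooled values, split them into two
# random "groups", and count how often chance alone produces a mean
# difference at least as large as the observed one.
pooled = treated + control
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if diff >= observed:
        count += 1

p_value = count / trials
print(f"observed difference: {observed}, p-value ~ {p_value:.4f}")
```

Here chance rarely matches the observed difference, so the p-value lands well under 0.05; a researcher quoting "significant" results is implicitly claiming this kind of calculation came out small.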

https://www.facebook.com/sassyeconometrics/posts/i-hope-p-value-jokes-are-still-funny/1953677048179439/

Numbers: Be cautious of a false sense of security when numbers are thrown at you. We tend to assume greater validity when precise numbers are given, but that can be a false assumption. Numeric differences may be insignificant when associated with subjective material. For example, intelligence tests: can you really qualify someone as smarter than another individual based on a two-point difference in scores? Percentages can also manipulate our understanding of data, and can do some pretty underhanded tricks to our brains. Beware of data presented or compared as percentages. A 1% increase in salary for a company's employees could be roughly a hundred dollars (for an employee who grosses a thousand a month) or ten thousand dollars (for an employee who grosses one million per year). The CEO and the employee are going to have very different feelings about that 1% increase.

Indirect conclusions: Drawing a conclusion by inference from the information is a very natural human tendency, but it can be very risky, and very wrong. These indirect conclusions may be made for you (often by non-professionals interpreting data, such as news reporters or media sources), or be designed to prey on your humanness, letting you unwittingly do the dirty work. For instance, a hand soap may claim to reduce bacteria by 99% (with proven studies). You think, "That's fantastic. Got to be better than the hand soap that makes no claims about effectiveness! I'll buy this (for $0.80 more)!" And yet, no one studied (or reported) whether that reduction in bacteria actually changes the incidence of disease. Your supercomputer of a brain subconsciously drew that inferred conclusion. That extra $0.80 may not have accomplished anything other than relieving you of some spare change. Advertisers love this dirty little trick, and it can be employed even by people we think we should trust.

https://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/#7fcf22013a67

As the author points out, “despite its mathematical base, statistics is as much an art as it is a science.” Often, there are multiple statistical methods that may be appropriate, and the statistician must subjectively select which they feel reflects the data best. Nevertheless, it is prudent to ask yourself for every statistic you face: “Does it make sense?” Do not be white-washed by the sciency feel of numbers, abandoning your common sense! Those with unscrupulous biases are hoping you’ll do just that. Now we know their dirty, lying tricks, though, and are prepared! Go forth, and be skeptical!
