Explainer: Correlation, causation, coincidence and more | Science News for Students

Statistics don’t always say what people think they might
Jul 24, 2015 — 7:15 am EST
Over a 10-year period, Americans’ fondness for margarine correlated strongly with the divorce rate in Maine. Yet there’s no reason to think one caused the other. It’s an instance of two unrelated data sets showing a coincidental pattern.

Tyler Vigen/“Spurious Correlations”/(CC BY 4.0)

Eating more mozzarella cheese shouldn’t make engineering schools hand out more diplomas. Yet between 2000 and 2009, the more mozzarella that Americans downed, the more doctorates in civil engineering that U.S. universities awarded. Over a 10-year period, as levels of one went up, so did the other. The two showed a strong positive correlation. Yet almost certainly this happened by coincidence. One did not cause the other.

This is a cheesy example. Still, it shows an important point about statistics: Correlation is not the same thing as causation. Causation means that one thing actually caused the other.
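To make the idea of a "strong positive correlation" concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient, one common measure of how closely two lists of numbers rise and fall together. The numbers in `cheese` and `degrees` are invented for illustration. They are not the real mozzarella or doctorate figures, and the function name `pearson_r` is simply a label chosen for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two lists vary together.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each list varies on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(var_x * var_y)

# Invented numbers (NOT real data): two series that both drift upward over 10 years.
cheese = [9.0, 9.4, 9.5, 9.8, 10.1, 10.3, 10.6, 10.9, 11.0, 11.2]
degrees = [480, 500, 510, 540, 550, 560, 600, 630, 640, 650]

r = pearson_r(cheese, degrees)
print(round(r, 2))  # a value close to 1 means a strong positive correlation
```

A value near 1 says only that the two lists moved together. Nothing in the calculation knows, or can tell you, whether one thing caused the other.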

Another complication: Many events or trends can have multiple causes. And sometimes two variables might both be due to a third factor. All of this can sometimes confuse, or confound, a statistical study. (Statistics involves collecting and analyzing numerical data in large quantities and interpreting their meaning.)
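The "third factor" idea can be shown with a small simulation. In this hedged Python sketch, all the data are randomly generated, and the names `hidden`, `a` and `b` are made up for the example. A hidden factor drives two variables. Neither variable affects the other, yet the two end up strongly correlated. Looking only at cases where the hidden factor is nearly the same makes most of that correlation disappear.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(var_x * var_y)

random.seed(1)  # fixed seed so the simulated data are the same on every run

n = 2000
# The hidden third factor (simulated, not real data).
hidden = [random.gauss(0, 1) for _ in range(n)]
# Each variable is the hidden factor plus its own random noise.
# Neither variable is ever computed from the other.
a = [h + random.gauss(0, 0.5) for h in hidden]
b = [h + random.gauss(0, 0.5) for h in hidden]

# Correlation across all cases: strong, because both follow the hidden factor.
r_all = pearson_r(a, b)

# Correlation among cases where the hidden factor is nearly the same.
band = [(x, y) for h, x, y in zip(hidden, a, b) if abs(h) < 0.1]
r_band = pearson_r([x for x, _ in band], [y for _, y in band])

print("all cases:", round(r_all, 2))
print("hidden factor held nearly constant:", round(r_band, 2))
```

Comparing only cases with similar values of the third factor is a simple form of "controlling" for it.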

Experiments can rule out such other — or confounding — causes by having a test group and a control group. But that’s not always possible or ethical. For example, researchers would not want to expose children to toxic chemicals just to see what bad effects might follow.

Fortunately, statistics offers mathematical tools that can account for possible confounders. That allows scientists to see how much a change in one variable might be linked to differences in something else.

Researchers built such a tool into their computer model for a recent study about lead. The model had data about lead in children’s blood and scores on a third-grade test. The researchers wanted to look for any link between those two variables. In addition, the model had data on family income, ethnicity and other things.

The statistical tool used math to rule out possible effects from those other factors. That let the model measure just the relationship between lead and test scores. Compared to children with no lead poisoning, children with even low levels of lead in their blood were more likely to fail the reading and math portions of the test. Environmental Health published the research on April 7, 2015.
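One common way to "rule out" another factor mathematically is regression adjustment. The sketch below is only an illustration of that general idea. It is not the researchers' model, and the article does not say exactly which method they used. All data are simulated, and the names `income`, `exposure`, `score` and `TRUE_EFFECT` are assumptions made up for the example. The simulated outcome depends on both an exposure and a confounder, and the exposure itself is tied to the confounder.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Least-squares slope of ys against xs (straight line with an intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def residuals(xs, ys):
    """What is left of ys after removing its straight-line trend against xs."""
    b = slope(xs, ys)
    a = mean(ys) - b * mean(xs)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

random.seed(2)  # fixed seed so the simulated data are repeatable
n = 5000
TRUE_EFFECT = -2.0  # assumed effect of exposure on score, built into the simulation

# Simulated confounder (think of it as a standardized family-income measure).
income = [random.gauss(0, 1) for _ in range(n)]
# Exposure tends to be higher when income is lower.
exposure = [-0.8 * c + random.gauss(0, 1) for c in income]
# The score depends on BOTH exposure and income, plus random noise.
score = [TRUE_EFFECT * x + 3.0 * c + random.gauss(0, 1)
         for x, c in zip(exposure, income)]

# Naive estimate: ignores income, so income's effect gets blamed on exposure.
naive = slope(exposure, score)

# Adjusted estimate: first remove the part of each variable that income
# explains, then relate what is left over.
adjusted = slope(residuals(income, exposure), residuals(income, score))

print("naive slope:", round(naive, 2))
print("adjusted slope:", round(adjusted, 2))
```

In this simulation the naive slope overstates the harm, while the adjusted slope lands close to the value built into the data. Real studies face a harder problem: they can adjust only for confounders that someone thought to measure.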

Power Words

computer model     A program that runs on a computer that creates a model, or simulation, of a real-world feature, phenomenon or event.

confounding   (in statistics) A situation where one or more unrecognized variables (conditions or events) were responsible for some effect. This could give the faulty impression that the effect was due to something else. Confounding often occurs when researchers did not “control” for the possibility that other variables were or could be at work.

control     A part of an experiment where there is no change from normal conditions. The control is essential to scientific experiments. It shows that any new effect is likely due only to the part of the test that a researcher has altered. For example, if scientists were testing different types of fertilizer in a garden, they would want one section of it to remain unfertilized, as the control. Its area would show how plants in this garden grow under normal conditions. And that gives scientists something against which they can compare their experimental data.

correlation   A mutual relationship or connection between two variables. When there is a positive correlation, an increase in one variable is associated with an increase in the other. (For instance, scientists might correlate an increase in time spent watching TV with an increase in risk of obesity.) Where there is an inverse correlation, an increase in one value is associated with a decrease in the other. (Scientists might correlate an increase in TV watching with a decrease in time spent exercising each week.) A correlation between two variables does not necessarily mean one is causing the other. 

data     Facts and statistics collected together for analysis but not necessarily organized in a way that gives them meaning. For digital information (the type stored by computers), those data typically are numbers stored in a binary code, portrayed as strings of zeros and ones.

doctoral degree     Also known as a PhD or doctorate, this is an advanced degree offered by universities, typically after five or six years of study, for work that creates new knowledge. People qualify to begin this type of graduate study only after having first completed a college degree (a program that typically takes four years of study).

engineer  A person who uses science to solve problems. As a verb, to engineer means to design a device, material or process that will solve some problem or unmet need.

engineering   The field of research that uses math and science to solve practical problems.

ethics   (adj. ethical) A code of conduct for how people interact with others and their environment. To be ethical, people should treat others fairly, avoid cheating or dishonesty in any form and avoid taking or using more than their fair share of resources (which means, to avoid greed). Ethical behavior also would not put others at risk without alerting people to the dangers beforehand and having them choose to accept the potential risks.

factor    Something that plays a role in a particular condition or event; a contributor. 

lead     A toxic heavy metal (abbreviated as Pb) that in the body moves to where calcium wants to go. The metal is particularly toxic to the brain. In a child’s developing brain, it can permanently impair IQ, even at relatively low levels.

model    A simulation of a real-world event (usually using a computer) that has been developed to predict one or more likely outcomes.

statistical analysis    A mathematical process that allows scientists to draw conclusions from a set of data. In research, a result is significant (from a statistical point of view) if the observed difference between two or more conditions is unlikely to be due to chance. A statistically significant result means that researchers would be unlikely to see that much of a difference if the conditions being measured really had no effect.

statistical significance    In research, a result that appears reliable — or significant — from a mathematical point of view. Findings that are statistically significant have a reduced likelihood that the apparent link between two or more conditions is a fluke, or merely due to chance.

statistics    The practice or science of collecting and analyzing numerical data in large quantities and interpreting their meaning. Much of this work involves reducing errors that might be attributable to random variation. A professional who works in this field is called a statistician.

variable   (in mathematics) A letter used in a mathematical expression that may take on different values. (in experiments) A factor that can be changed, especially one allowed to change in a scientific experiment. For instance, when researchers measure how much insecticide it might take to kill a fly, they might change the dose or the age at which the insect is exposed. Both the dose and age would be variables in this experiment.

Citation

A. Evens et al. “The impact of low-level lead toxicity on school performance among children in the Chicago Public Schools: a population-based retrospective cohort study.” Environmental Health. Vol. 14, April 7, 2015, p. 21. doi: 10.1186/s12940-015-0008-9.

S. Magzamen et al. “Quantile regression in environmental health: Early life lead exposure and end-of-grade exams.” Environmental Research. Vol. 137, February 2015, p. 108. doi: 10.1016/j.envres.2014.12.004.