Why? Humans like answers to that question, but the answers are increasingly hard to find. That matters … for science and for policy.

Too Much Information, Part II: The Veil of Cause

This week Morning Feature looks at an irony of the Information Age: with so much measured data available, it’s easier to make arguments but harder to draw reliable conclusions. Yesterday we considered the first half of that problem, why a cloud of numbers enables us to make so many more and different arguments. Today we ponder the second half, why the veil of causation makes it so difficult to draw reliable conclusions.

The rise of Big Data

“Big Data” is as much a marketing term as a precisely defined concept. Yet as we saw yesterday, we increasingly live in a cloud of numbers. How many people searched the internet for flu symptoms or remedies over the past month? How many searched for houses in the Syracuse, New York area in the last three months? Such numbers are easy to find and, as Steve Lohr wrote in the New York Times earlier this month, those numbers can be very useful:

Researchers have found a spike in Google search requests for terms like “flu symptoms” and “flu treatments” a couple of weeks before there is an increase in flu patients coming to hospital emergency rooms in a region (and emergency room reports usually lag behind visits by two weeks or so).

In economic forecasting, research has shown that trends in increasing or decreasing volumes of housing-related search queries in Google are a more accurate predictor of house sales in the next quarter than the forecasts of real estate economists.

Yet there are limits, as Lohr notes:

With huge data sets and fine-grained measurement, statisticians and computer scientists note, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data, says Trevor Hastie, a statistics professor at Stanford, is that “many bits of straw look like needles.”

Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, “one of the most pernicious uses of data.”
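
To see Hastie’s point in miniature, here is a small Python sketch. It is our own illustration, not anything from Lohr’s article, and every name in it (pearson, count_false_needles, the 30-point series, the 0.36 cutoff) is an assumption chosen for the example. It compares a target series of pure random noise against many other series of pure random noise and counts how many clear a correlation cutoff that is roughly the conventional 5% significance threshold for 30 data points. Nothing in the data means anything, yet the number of apparent “needles” should grow as the haystack grows.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_false_needles(n_series, n_points=30, threshold=0.36, seed=0):
    """Count noise series whose correlation with a noise target looks 'real'.

    Both the target and every candidate are independent random draws, so
    any correlation that clears the threshold is a false discovery.
    The threshold of 0.36 is an assumed cutoff, roughly the usual 5%
    two-sided level for 30 points.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(candidate, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Same noise, bigger haystacks: the count of apparent needles grows.
    for n in (100, 1000, 10000):
        print(n, "noise series ->", count_false_needles(n), "apparent needles")
```

We would expect roughly one series in twenty to pass the cutoff by chance alone, so searching ten thousand series should turn up hundreds of “discoveries,” none of them real.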

Why do Republicans hate women?

For example, Google {“Democrats hate women”} and you’ll get 19,300 hits, many of which purport to prove that Democrats do hate women. Does this mass of data prove that Democrats hate women? Hardly, but you could easily write a blog post claiming that, with plenty of links and quotes. On the other hand, Google {“Republicans hate women”} and you’ll get 104,000 hits. Does that prove Republicans hate women more than Democrats hate women? No, but it does suggest that a lot more people perceive and write online about Republicans hating women. If you’re a Republican candidate or campaign strategist, those numbers should concern you … even if you think that common perception is false.

Still, those numbers only suggest relative perceptions of the major parties’ hatred of women. That is not the question we first asked – “Why do Republicans hate women?” – and in fact we haven’t even proved that Republicans do hate women. We might try to explain the perception, and you probably already have an explanation in mind. Your explanation might be very similar to an explanation I could offer. But that similarity would show only that we agree. It doesn’t prove we’re correct.

Doing the research

We could read through the 104,000 hits and see what explanations other writers gave. But many of those reasons also reflect what the writers had read or heard elsewhere: in TV news programs, newspapers, magazine articles, assignments for women’s studies or political science classes, and – of course – other online writers. Internalizing and repeating such reasons is rarely an exercise in rigorous research and reflection. More often, it’s a matter of familiarity, narrative coherence, and stickiness. And we should remember that writing online is not as introspective as writing a personal diary. On most websites, writers hope readers will reply in comments. The reasons offered in those 104,000 hits reflect, in part, the writers’ belief (or hope) that readers would agree.

So if we read all 104,000 hits, we might say with confidence: “A Google search of {“Republicans hate women”} gave 104,000 hits, as compared to the 19,300 hits from {“Democrats hate women”}, and here are some familiar, coherent, sticky, and hoped-would-be-accepted-by-readers reasons that online writers gave to support claims that Republicans hate women.”

Now look back where our search began: “Why do Republicans hate women?” Even in the Information Age, we can’t really answer that question. We can’t even be certain Republicans do hate women. We’d have to define “hate” in specific terms – a score based on observable actions or statements – and then gather data. And assuming our definition were reasonable, and that we had gathered the data, and that the data showed Republicans score higher than Democrats on our “hatred of women” scale …

… we still wouldn’t know why.

What we see, and what we don’t

We can see the relative numbers of hits on those two searches. We can follow the links and see the reasons online writers gave in claiming that “Republicans hate women.” We could, if we had the training and funding, construct the study I described above and – hypothetically – see that Republicans score higher than Democrats on a reasonable “hatred of women” score.

But then we reach a veil. We can’t see the why.

The problem is not methodology. It’s not merely that asking Republicans why they hate women would probably not yield entirely truthful, coherent, or useful answers. The problem is deeper than that, as Jonah Lehrer wrote last December in Wired:

[C]auses are a strange kind of knowledge. This was first pointed out by David Hume, the 18th-century Scottish philosopher. Hume realized that, although people talk about causes as if they are real facts – tangible things that can be discovered – they’re actually not at all factual. Instead, Hume said, every cause is just a slippery story, a catchy conjecture, a “lively conception produced by habit.” When an apple falls from a tree, the cause is obvious: gravity. Hume’s skeptical insight was that we don’t see gravity – we see only an object tugged toward the earth. We look at X and then at Y, and invent a story about what happened in between. We can measure facts, but a cause is not a fact – it’s a fiction that helps us make sense of facts.

The truth is, our stories about causation are shadowed by all sorts of mental shortcuts. Most of the time, these shortcuts work well enough. They allow us to hit fastballs, discover the law of gravity, and design wondrous technologies. However, when it comes to reasoning about complex systems – say, the human body – these shortcuts go from being slickly efficient to outright misleading.

Lehrer’s article – which begins by exploring the $21 billion failure of a cholesterol drug and moves on to the surprising relationship between bulging discs and back pain – is worth reading in full. He discusses why science moved from direct observation to statistical analysis, and why scientists are bumping against the limits of that analysis. (Short answer: they’ve already made the easy discoveries, and those that remain require a lot more – and more expensive – data to validate.) And even when scientists are reasonably confident they’ve found a meaningful correlation … the why is still “a fiction that helps us make sense of facts,” rather than a fact we can see directly.

The End of Theory?

In 2008, Wired’s Chris Anderson wrote an article titled The End of Theory that concluded:

The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

We don’t know why people search online for flu symptoms. Some may think they have symptoms. Others may think they’ve seen symptoms in friends, family members, neighbors, or coworkers. Others may merely have a hunch. Yet even without knowing why people made those searches, the research says, if internet searches for flu symptoms rise sharply, doctors and hospitals should expect an increase in flu patients in about two weeks. And even without proving that Republicans do indeed hate women – let alone the unanswerable why – the 104,000 hits for “Republicans hate women” vs. 19,300 hits for “Democrats hate women” should worry Republican candidates and strategists.

Ironically, there is too much information floating around for us to be certain about many things. But there is also too much information to ignore.


Happy Leap Day!