Gmail Peter Burgess
FT article: 'Big data: big mistake?'
Andrew Simmons
Thu, Jun 19, 2014 at 4:44 PM
To: peterbnyc@gmail.com, christopher macrae , NAhuja@wri.org
March 2014, Tim Harford, Financial Times
www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html
the article covers:
causation vs. correlation -- (too often, big data analysts apparently care more about correlation -- identifying statistical patterns in the data, theory-free -- than about causation.) Example: Google Flu Trends
sample bias and sampling error (analysts overlook the risk of sample bias more often) - Sampling error is when a randomly chosen sample doesn't reflect the underlying population purely by chance; sampling bias is when the sample isn't randomly chosen at all. A bigger sample does not necessarily mean a better sample. Classic example: the 1936 election prediction of Roosevelt vs. Landon -- Literary Digest's huge but inaccurate sample of 2.4 million responses vs. Gallup's far more accurate survey of just 3,000 interviews.
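The sampling-error vs. sampling-bias distinction above can be simulated in a few lines. This is a minimal sketch with made-up numbers (a hypothetical population that is 55% Roosevelt supporters, loosely echoing the 1936 landslide; the sample sizes are illustrative), not data from the article:

```python
import random

random.seed(0)

# Hypothetical population: 55% support Roosevelt (1 = supporter).
# All numbers here are illustrative assumptions, not the article's data.
population = [1] * 55_000 + [0] * 45_000
random.shuffle(population)

# Sampling error only: a small but RANDOM sample lands close to the truth.
small_random = random.sample(population, 3_000)
print(sum(small_random) / len(small_random))  # close to 0.55

# Sampling bias: a much bigger but NON-random sample (here, drawn from a
# list sorted so one group comes first) misses badly despite its size.
sorted_pop = sorted(population)           # all the non-supporters first
biased_big = sorted_pop[:40_000]          # big, but systematically skewed
print(sum(biased_big) / len(biased_big))  # 0.0 -- big, and wrong
```

The point, as in the Literary Digest case: the bigger sample has essentially no sampling error, but its sampling bias makes it worthless.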
Another example of sample bias: the 'digital divide' issue in civic engagement, where analysts too often take a big data set to mean (implicitly) that 'N = All' -- that N represents the entire background population rather than a sample. Examples given are Twitter as well as Boston's 'Street Bump' smartphone app, which automatically reports potholes (the 'digital divide' being that its users are a specific demographic of smartphone owners -- wealthier, younger -- so non-users are excluded).
the false-positive issue (example: Target mailing advertising material to customers flagged as potentially pregnant based on their other purchases)
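The arithmetic behind the false-positive issue is the familiar base-rate problem: when the condition being predicted is rare, even a fairly accurate classifier flags mostly the wrong people. A quick sketch, with every number invented for illustration (Target's actual model and figures are not public):

```python
# Hypothetical base-rate illustration -- all figures are assumptions.
shoppers = 1_000_000
base_rate = 0.02            # assume 2% of shoppers are actually pregnant
sensitivity = 0.90          # model detects 90% of true cases
false_positive_rate = 0.05  # and wrongly flags 5% of everyone else

true_positives = shoppers * base_rate * sensitivity                 # 18,000
false_positives = shoppers * (1 - base_rate) * false_positive_rate  # 49,000

precision = true_positives / (true_positives + false_positives)
print(round(precision, 3))  # ~0.269: most flagged shoppers are not pregnant
```

Under these assumed numbers, nearly three out of four mailings go to false positives, even though the classifier looks accurate in isolation.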
the multiple-comparisons problem (John Ioannidis, 'Why Most Published Research Findings Are False') - excerpt: 'it is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern 'statistically significant.' The multiple-comparisons problem arises when a researcher looks at many possible patterns. ... There are various ways to deal with this [e.g., transparency], but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. [Example: how to determine whether trial vitamins, against a placebo, given to schoolchildren 'work'?] Without careful analysis, the ratio of genuine patterns to spurious patterns -- of signal to noise -- quickly tends to zero.'
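The multiple-comparisons point can be demonstrated on pure noise: test enough candidate 'patterns' at the 5% level and some will come out 'significant' by chance alone. A minimal sketch (the 200 tests and coin-flip setup are my own illustration, not from the article):

```python
import random

random.seed(1)

# 200 candidate "patterns", each tested on pure noise: 100 fair coin
# flips, called "significant" if the head count falls in the roughly
# 5% two-sided tail region (<= 40 or >= 60 heads out of 100).
n_tests, n_flips = 200, 100
significant = 0
for _ in range(n_tests):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if heads <= 40 or heads >= 60:
        significant += 1

# Typically around 10 of the 200 all-noise tests come out "significant".
print(significant)
```

Every one of those 'findings' is spurious by construction, which is the signal-to-noise collapse the excerpt describes.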
Some familiar urban policy-related examples:
Are crime rates in a city during a specific period down as a result of specific police practices (e.g., stop-and-frisk), or do they reflect broader demographic changes / crime-rate declines seen across the country?
Similarly, on average, New York City residents apparently now live a few years longer than they did 10 years ago. Michael Bloomberg, when talking about urban sustainability and climate change, attributes this to his policies that led to lower carbon emissions and measurably better air quality in Manhattan. While better air quality is indeed healthier and can, over the long term, lead to greater longevity, is this really what happened? Perhaps a different sample is being measured. Other factors: younger generations live longer; wealthier people live longer (again, a larger high-income population in Manhattan, which experienced hyper-gentrification over the last decade). The NYT recently covered the income/longevity issue in 'Income Gap, Meet the Longevity Gap', comparing Fairfax County, Virginia with a county in West Virginia. www.nytimes.com/2014/03/16/business/income-gap-meet-the-longevity-gap.html
My master's dissertation at LSE was on the London Olympic Park Legacy regeneration. Part of the political legitimacy of the entire Olympics was the claim that the Games would lead to social regeneration of the surrounding area in 20 years' time, because of the project's interventions and improvements. This is measured by the UK's quite holistic 'indices of multiple deprivation' (the UK indices go beyond the American census and community survey's income levels and demographic data to include and map, for example, health and wellbeing, access/distance to green space, schools, and other amenities and services). The defined areas surrounding the Olympic Park (Stratford, east Tower Hamlets, south Hackney) fare among the worst on social indicators in the whole of the UK; supporters of the Olympics wish to make that area comparable to west London. The fine print is that the project would, of course, lead to gentrification (my dissertation was about HOW this was done, covertly, with specific targeted infrastructure improvements to raise property values in specific areas at specific times). Critics of social regeneration policy, or property-led regeneration, correctly point out that in 20 years you'd be measuring a different sample of the population, having out-priced/displaced the residents in the original sample (who presumably will have moved further out and will be harmed rather than helped by such a spatial intervention from the government).