[Informatics] Some problems with statistics [LONG]

Mark mark at vceit.com
Thu Apr 14 14:02:47 AEST 2016


Hi all. For those of you looking for case studies of bad statistical use, I
found an interesting read in this...

'STATISTICS DONE WRONG - THE WOEFULLY COMPLETE GUIDE' by Alex Reinhart
no starch press - info at nostarch.com www.nostarch.com
ISBN-10: 1-59327-620-6 ISBN-13: 978-1-59327-620-1

Some rather long but thought-provoking excerpts from the book may be
useful for you and your students when evaluating statistical data during
hypothesis research.

In brief: there are many problems to be found in published research data.

---------------

The problem of rejecting valid conclusions because of unimportant errors...

A conclusion supported by poor statistics can still be correct —
statistical and logical errors do not make a conclusion wrong, but merely
unsupported.

The problem of only publishing exciting findings...

We only ever see a fraction of medical research, for instance, because few
scientists bother publishing “We Tried This Medicine and It Didn’t Seem to
Work.” In addition, editors of prestigious journals must maintain their
reputation for groundbreaking results, and peer reviewers are naturally
prejudiced against negative results. When presented with papers with
identical methods and writing, reviewers grade versions with negative
results more harshly and detect more methodological errors.

The pharmaceutical industry seems particularly tempted to bias evidence by
neglecting to publish studies that show their drugs do not work;
 subsequent reviewers of the literature may be pleased to find that 12
studies indicate a drug works, without knowing that 8 other unpublished
studies suggest it does not. Of course, it’s likely that such results would
not be published by peer-reviewed journals even if they were submitted—a
strong bias against unexciting results means that studies saying “it didn’t
work” never appear and other researchers never see them. Missing data and
publication bias plague science, skewing our perceptions of important
issues.
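
If you want to show the kids just how badly this can skew things, here is
a little Python sketch of my own (not from the book, and with made-up
numbers): a drug with no real effect is "tested" in 200 small studies, and
only the exciting-looking results get "published".

import random
import statistics

# Illustrative sketch only (made-up numbers): a drug with NO real effect is
# tested in many small studies, but only the exciting-looking results are
# "published". Compare the full picture with the published picture.

random.seed(1)

def run_study(n=30, true_effect=0.0):
    """One study: average improvement across n treated patients."""
    improvements = [random.gauss(true_effect, 1.0) for _ in range(n)]
    return statistics.mean(improvements)

all_studies = [run_study() for _ in range(200)]

# Journals prefer positive findings: pretend only clearly positive results appear.
published = [effect for effect in all_studies if effect > 0.3]

print(f"All {len(all_studies)} studies - average effect: {statistics.mean(all_studies):+.2f}")
print(f"Published {len(published)} studies - average effect: {statistics.mean(published):+.2f}")
# The drug does nothing, yet the published literature suggests it works.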

The problem with small sample sizes...

In the United States, counties with the lowest rates of kidney cancer tend
to be Midwestern, Southern, and Western rural counties. Why might this be?
Maybe rural people get more exercise or inhale less-polluted air. Or
perhaps they just lead less stressful lives.
On the other hand, counties with the highest rates of kidney cancer tend to
be Midwestern, Southern, and Western rural counties.
The problem, of course, is that rural counties have the smallest
populations. A single kidney cancer patient in a county with 10 residents
gives that county the highest kidney cancer rate in the nation. Small
counties hence have much more variation in kidney cancer rates simply
because they have so few residents.
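
That small-sample effect is easy to reproduce. Here is a rough Python
sketch with invented figures (the same underlying "cancer" rate everywhere,
applied to tiny and large counties) rather than real kidney cancer data:

import random

# Rough sketch with invented figures: every county has the SAME underlying
# rate, but the small counties show wildly varying observed rates by chance.

random.seed(42)
TRUE_RATE = 0.005  # hypothetical risk, identical in every county

def observed_rate(population):
    """Count simulated 'cases' in one county and return cases per resident."""
    cases = sum(1 for _ in range(population) if random.random() < TRUE_RATE)
    return cases / population

small_counties = [observed_rate(100) for _ in range(500)]     # 100 residents each
large_counties = [observed_rate(100_000) for _ in range(50)]  # 100,000 residents each

print("True rate everywhere:", TRUE_RATE)
print("Small counties: lowest %.4f, highest %.4f" % (min(small_counties), max(small_counties)))
print("Large counties: lowest %.4f, highest %.4f" % (min(large_counties), max(large_counties)))
# The tiny counties produce both the highest and the lowest rates in the 'nation'.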

The problem with false positives that sound exciting...

http://xkcd.com/882
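
(That is the famous "jelly bean" comic: test twenty colours that do
nothing and one of them will look "significant" sooner or later.) A quick
Python sketch of the same idea, using fake data of my own invention:

import math
import random
import statistics

# Sketch of the xkcd 882 "jelly bean" problem (illustrative only): test 20
# jelly-bean colours that have NO real effect on acne, and quite often at
# least one of them will look "significant" at p < 0.05 purely by chance.

random.seed(0)

def fake_study(n=50):
    """One fake study: compare 'acne scores' of a control group and a jelly-bean
    group. Neither group really differs; returns an approximate two-sided
    z-test p-value."""
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(statistics.variance(control) / n + statistics.variance(treated) / n)
    z = abs(diff / se)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

experiments, colours = 500, 20
at_least_one = sum(
    1 for _ in range(experiments)
    if any(fake_study() < 0.05 for _ in range(colours))
)
print(f"In {at_least_one} of {experiments} experiments, at least one of the "
      f"{colours} do-nothing colours looked 'significant' at p < 0.05.")
# Theory says roughly 1 - 0.95**20, i.e. about 64%, will throw up a false alarm.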

The problem with correlation and causation...

When you have used multiple regression to model some outcome—like the
probability that a given person will suffer a heart attack, given that
person's weight, cholesterol, and so on — it’s tempting to interpret each
variable on its own. You might survey thousands of people, asking whether
they’ve had a heart attack and then doing a thorough physical examination,
and produce a model. Then you use this model to give health advice: lose
some weight, you say, and make sure your cholesterol levels fall within
this healthy range. Follow these instructions, and your heart attack risk
will decrease by 30%!
But that's not what your model says. The model says that people with
cholesterol and weight within that range have a 30% lower risk of heart
attack; it doesn’t say that if you put an overweight person on a diet and
exercise routine, that person will be less likely to have a heart attack.
You didn't collect data on that! You didn't intervene and change the weight
and cholesterol levels of your volunteers to see what would happen.
There could be a confounding variable here. Perhaps obesity and high
cholesterol levels are merely symptoms of some other factor that also
causes heart attacks; exercise and statin pills may fix them but perhaps
not the heart attacks.
The regression model says lower cholesterol means fewer heart attacks, but
that's correlation, not causation.
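
Here is a toy Python illustration of that trap (my own sketch, not from
the book): a hidden factor drives both "weight" and "heart-attack risk",
weight itself does nothing in the simulation, and yet a regression of risk
on weight still finds a strong slope.

import random
import statistics

# Toy sketch of a confounder (invented data): a hidden factor drives BOTH
# weight and heart-attack risk. Weight itself does nothing here, yet a
# regression of risk on weight still finds a strong relationship.

random.seed(7)
n = 2000

hidden = [random.gauss(0, 1) for _ in range(n)]      # unmeasured factor
weight = [h + random.gauss(0, 0.5) for h in hidden]  # caused by the factor
risk   = [h + random.gauss(0, 0.5) for h in hidden]  # also caused by the factor
# Note: 'risk' is built from 'hidden' only -- changing weight changes nothing.

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

print(f"Regression slope of risk on weight: {slope(weight, risk):.2f}")
# A clearly positive slope -- but putting someone on a diet (changing weight
# alone) would not change their risk at all in this simulated world.
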
One example of this problem occurred in a 2010 trial testing whether
omega-3 fatty acids, found in fish oil and commonly sold as a health
supplement, can reduce the risk of heart attacks. The claim that omega-3
fatty acids reduce heart attack risk was supported by several observational
studies, along with some experimental data. Fatty acids have
anti-inflammatory properties and can reduce the level of triglycerides in
the bloodstream—two qualities known to correlate with reduced heart attack
risk. So it was reasoned that omega-3 fatty acids should reduce heart
attack risk.
But the evidence was observational. Patients with low triglyceride levels
had fewer heart problems, and fish oils reduce triglyceride levels, so it
was spuriously concluded that fish oil should protect against heart
problems. Only in 2013 was a large randomized controlled trial published,
in which patients were given either fish oil or a placebo (olive oil) and
monitored for five years. There was no evidence of a beneficial effect of
fish oil.
Another problem arises when you control for multiple confounding factors.
It’s common to interpret the results by saying, “If weight increases by one
pound, with all other variables held constant, then heart attack rates
increase by...” Perhaps that is true, but it may not be possible to hold
all other variables constant in practice. You can always quote the numbers
from the regression equation, but in reality the act of gaining a pound of
weight also involves other changes. Nobody ever gains a pound with all
other variables held constant, so your regression equation doesn’t
translate to reality.

The problem of Simpson's paradox...

When statisticians are asked for an interesting paradoxical result in
statistics, they often turn to Simpson’s paradox. Simpson's paradox arises
whenever an apparent trend in data, caused by a confounding variable, can
be eliminated or reversed by splitting the data into natural groups. There
are many examples of the paradox, so let me start with the most popular.
In 1973, the University of California, Berkeley, received 12,763
applications for graduate study. In that year’s admissions process, 44% of
male applicants were accepted but only 35% of female applicants were. The
university administration, fearing a gender discrimination lawsuit, asked
several of its faculty to take a closer look at the data.
Graduate admissions, unlike undergraduate admissions, are handled by each
academic department independently. The initial investigation led to a
paradoxical conclusion: of 101 separate graduate departments at Berkeley,
only four departments showed a statistically significant bias against
admitting women. At the same time, six departments showed a bias against
men, which was more than enough to cancel out the deficit of women caused
by the other four departments.
How could Berkeley as a whole appear biased against women when individual
departments were generally not? It turns out that men and women did not
apply to all departments in equal proportion. For example, nearly
two-thirds of the applicants to the English department were women, while
only 2% of mechanical engineering applicants were. Furthermore, some
graduate departments were more selective than others.
These two factors accounted for the perceived bias. Women tended to apply
to departments with many qualified applicants and little funding, while men
applied to departments with fewer applicants and surpluses of research
grants. The bias was not at Berkeley, where individual departments were
generally fair, but further back in the educational process, where women
were being shunted into fields of study with fewer graduate opportunities.
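
The arithmetic behind that reversal is easy to check with small made-up
numbers. Here is a two-department toy version in Python (invented figures,
not Berkeley's real 1973 data):

# Toy two-department version of the Berkeley example (made-up numbers).
# In EACH department women are admitted at a higher rate than men, yet the
# combined figures look biased against women.

# (applicants, admitted) per group
admissions = {
    "Engineering (easy to enter)": {"men": (800, 500), "women": (100, 65)},
    "English (hard to enter)":     {"men": (200, 30),  "women": (800, 130)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in admissions.items():
    for sex, (applied, admitted) in groups.items():
        rate = admitted / applied
        print(f"{dept:28s} {sex:5s}: {rate:5.1%} admitted ({admitted}/{applied})")
        totals[sex][0] += applied
        totals[sex][1] += admitted

print()
for sex, (applied, admitted) in totals.items():
    print(f"Overall {sex:5s}: {admitted / applied:5.1%} admitted ({admitted}/{applied})")
# Women applied mostly to the selective department, so the combined rate
# reverses the within-department pattern: Simpson's paradox.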

The problem with making mistakes...

Surveys of statistically significant results reported in medical and
psychological trials suggest that many p values are wrong and some
statistically insignificant results are actually significant when computed
correctly. Even the prestigious journal Nature isn’t perfect, with roughly
38% of papers making typos and calculation errors in their p values. Other
reviews find examples of misclassified data, erroneous duplication of data,
inclusion of the wrong dataset entirely, and other mix-ups, all concealed
by papers that did not describe their analysis in enough detail for the
errors to be easily noticed.

The problem of data decay when seeking to verify the data used in previous
research...

Another problem is the difficulty of keeping track of data as computers are
replaced, technology goes obsolete, scientists move to new institutions,
and students graduate and leave labs. If the dataset is no longer in use by
its creators, they have no incentive to maintain a carefully organized
personal archive of datasets, particularly when data has to be
reconstructed from floppy disks and filing cabinets. One study of 516
articles published between 1991 and 2011 found that the probability of data
being available decayed over time. For papers more than 20 years old, fewer
than half of datasets were available. Some authors could not be contacted
because their email addresses had changed; others replied that they
probably had the data but that it was on a floppy disk and they no longer
had a floppy drive, or that the data had been on a stolen computer or was
otherwise lost.


Regards, Mark

with thanks to

'STATISTICS DONE WRONG - THE WOEFULLY COMPLETE GUIDE' by Alex Reinhart

-- 

Mark Kelly

mark at vceit.com
http://vceit.com