Micah Sittig

(I sent this post as an e-mail to my Stats students this morning.)

Hi fellow statisticians,

I found a couple of interesting articles online this week that I'd like to share with you. They matter because they show how much you need good statistics in order to reach reliable conclusions from data.

"Linguistic tone is related to the population frequency of the adaptive haplogroups of two brain size genes, ASPM and Microcephalin"
http://www.ling.ed.ac.uk/%7Es0340638/tonegenes/tonegenessummary.html
The first is a "statistical study" released by a couple of linguists from the UK. This study is making the news because people are using it to say that you can't learn Chinese if you don't have the right genes. I quote Language Log:

Dediu and Ladd again

http://itre.cis.upenn.edu/~myl/languagelog/archives/004554.html
This research began (as I understand it) when Bob Ladd noticed that maps of the geographical distribution of two particular genetic traits (from a couple of articles published in Science in 2005) seemed similar to what he knew of the geographical distribution of lexical tone. So he worked with Dan Dediu to check this impression statistically. They looked at the geographical distribution of 26 linguistic traits (from Haspelmath et al.'s World Atlas of Language Structures) correlated with the geographical distribution of 1,000 genetic variants, yielding 26,000 correlations. The statistical distribution of these correlations looks like this:

And sure enough, as the arrow on the graph indicates, the correlation between the geographical distribution of lexical tone and the two genes of interest is way out in the tail of this bell-shaped curve of empirical gene/language correlations. This clearly tends to confirm Bob's original intuition.

But even in a distribution created by completely random effects, without any meaningful connections among the variables studied, something has to be way out in the tail.

There's a tricky point of statistical reasoning here, which is clear in general, although even experts often seem to wind up disagreeing about particular cases. If you started with 26,000 correlations, and no reason to expect any causal relations among the variables, and a situation (like geographical diffusion of genetic and cultural traits) where it's not clear what distribution of correlations to expect in the absence of causal connections, then you probably wouldn't draw any conclusions at all from the fact that some of the correlations were fairly high. On the other hand, if you thought that there might be some meaningful connections in the mix, then the tails of the distribution are a good source of examples to study further -- and this is the main conclusion that D & L draw from their results, quite properly.

So it's important to be cautious when using results like this because, to quote our textbook:

"Many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true." (YMS pg 592)

So not only is the study being distorted by story-hungry journalists, but the statistical methods used to support the paper's hypothesis are dubious to begin with. Oops!
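If you'd like to see the textbook's warning in action, here is a minimal Python sketch. It is only an illustration, not the method used in the paper: it assumes each "variable" is 50 independent random normal values (so no real relationship exists anywhere), and the function names, the sample size of 50 and the cutoff of 0.279 (roughly the two-sided 5% critical value of r for n = 50) are my own choices. At the 5% level, 26,000 tests on pure noise should give about 0.05 × 26,000 = 1,300 "significant" correlations by chance alone.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def null_correlations(n_tests, n_points, seed=2007):
    """Correlate n_tests pairs of unrelated random variables.

    Every pair is independent noise, so every null hypothesis is true.
    """
    rng = random.Random(seed)
    rs = []
    for _ in range(n_tests):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]
        rs.append(pearson_r(xs, ys))
    return rs


if __name__ == "__main__":
    N_TESTS = 26_000     # same number of correlations as in the quote above
    N_POINTS = 50        # assumed sample size, chosen for illustration
    CRITICAL_R = 0.279   # approximate two-sided 5% cutoff for n = 50

    rs = null_correlations(N_TESTS, N_POINTS)
    hits = sum(1 for r in rs if abs(r) > CRITICAL_R)

    print("expected by chance:", round(0.05 * N_TESTS))
    print("'significant' correlations found:", hits)
    print("largest |r| from pure noise:", max(abs(r) for r in rs))
```

Run it and compare the count of "significant" correlations with the 1,300 we expect by chance, then look at how large the most extreme correlation is even though nothing in the data is related to anything else.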

Apple Takes a Bite out of DRM - MrMick - Maximum PC
http://www.maximumpc.com/article/itunes_256_vs_128_bit
The second article is about a test that was done to see if music fans could distinguish between audio files encoded at two different high-quality bit rates, using two different qualities of earbuds. I recommend you read it, but be wary of the results because the experiment design has some serious flaws. Can you pick them out? How would you change the setup if you were asked to repeat this experiment?

Have a nice weekend. I'm looking forward to seeing your presentations on Tuesday.

Micah Sittig

Friday, June 01, 2007
