http://nerdsonwallstreet.typepad.com/my_weblog/2007/04/stupid_data_min.html
This collective web thing actually works. I got an email from ace quant investment manager John Bogle, who’d seen a post from Paul Kedrosky. Both were looking for a copy of “Stupid Data Miner Tricks”, a paper in the current Journal of Investing.
The JoI is not fully onboard the “information wants to be free” train, so as a good citizen of the interweb series of tubes, I’m depositing an earlier version right here. Download dataminejune_2000.pdf
Here’s the introduction:
Disraeli’s warning that “there are three kinds of lies: lies, damn lies and statistics” is particularly true when too much computation is applied to too little data. This paper presents some egregious yet instructional examples of data mining, and describes ways to avoid similar mishaps.
It started out over ten years ago as a set of joke slides showing silly spurious correlations. These statistically appealing relationships between the stock market and dairy products and third world livestock populations have been cited often, in Business Week, the Wall Street Journal, the book “A Mathematician Plays the Stock Market”, and many others. Students from Bill Sharpe’s classes at Stanford seem to be familiar with them. This was expanded to include some actual content about data mining, and reissued as an academic working paper in 2001. Occasional requests for this arrive from distant corners of the world. So I’d like to thank the editors of the Journal of Trading for publishing this.
Without taking a hatchet to the original, the advice here is still valuable, perhaps more so, now that there is so much more data to mine. Monthly data arrives as one data point, once a month. It’s hard to avoid data mining sins if you look twice. Ticks, quotes, and executions arrive in millions per minute, and many of the practices which fail the statistical sniff tests for low frequency data can now be used responsibly. New frontiers in data mining have been opened up by the availability of vast amounts of textual information. Whatever raw material you choose, fooling yourself remains an occupational hazard in quantitative trading.
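For the DIY crowd, here is a minimal Python sketch of what “too much computation applied to too little data” looks like in practice. It is not from the paper: the function names, the ten-point series length, and the candidate counts are all assumptions picked for illustration. Both the “market” series and every candidate “predictor” are pure random noise, so any correlation the search turns up is spurious by construction.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_points, n_candidates, seed=0):
    """Largest |correlation| found between one random target series and
    n_candidates unrelated random series, each of length n_points.
    (Hypothetical helper, written only for this illustration.)"""
    rng = random.Random(seed)
    # Stand-in for a short history of "market returns": pure noise.
    target = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Stand-in for one more candidate "predictor": also pure noise.
        candidate = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # Same ten data points every time; only the amount of searching changes.
    for tries in (1, 10, 100, 5000):
        r = best_spurious_correlation(n_points=10, n_candidates=tries)
        print(f"{tries:>5} candidates tried: best |r| = {r:.2f}")
```

With the seed held fixed, each larger search includes the smaller ones, so the best fit can only get better as more candidates are tried, even though none of them has anything to do with the target. That is the whole trick: the data did not get more informative, the search just got longer.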
PS - DIY dataminers will want to check this: http://swivel.com/