The Power of Huge Quantities of Data

How do you spell Britney Spears? A little while ago I heard a talk by a Google researcher that boiled down to the following statement: with enough data, many computational problems turn into search problems. For example, if you have an index of a billion web pages, you can make a darn good spell checker because you've seen 200 different ways to spell Britney Spears and can tell which one is most common. You can also translate between human languages because you've seen documents in so many of them. And you can make Google Local or Yahoo Local or MSN Local because you can cross-check information across many sites: if you see Bob's Pizza associated with a particular phone number on five sites, that number is probably the number of Bob's Pizza. My question, then, is what else can you do when you have mind-bendingly large quantities of text?
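Here is a rough sketch of the spell-checker-as-search idea in Python. It is not how Google does it; it only assumes you already have a table of how often each spelling was observed. The function name `suggest_spelling`, the similarity cutoff, and the counts are all made up for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def suggest_spelling(query, observed_counts, cutoff=0.75):
    """Return the most frequently observed spelling that resembles `query`.

    `observed_counts` maps a lowercase spelling to how often it was seen.
    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    normalized = query.lower()
    # "Search" step: find observed spellings that look like the query.
    candidates = get_close_matches(
        normalized, list(observed_counts), n=10, cutoff=cutoff
    )
    if not candidates:
        return query  # nothing similar was ever seen, so leave it alone
    # "Vote" step: the spelling seen most often wins.
    return max(candidates, key=lambda spelling: observed_counts[spelling])


# Toy counts, invented for this example.
observed = Counter({
    "britney spears": 500,
    "brittany spears": 40,
    "brittney spears": 30,
    "britany spears": 10,
})

suggestion = suggest_spelling("Britny Spears", observed)
```

The spell checker knows nothing about English or about pop stars. It looks up similar strings and trusts the majority, which is why it only gets good when the table is enormous.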

  • Identify trends. BlogPulse is one example. I also experimented with this once, mining common links and phrases from 40,000 LiveJournal posts, but it could be done on a much larger scale. Could a program identify a change in rhetoric across an entire region or follow cultural shifts?
  • Mine information on people. I’ve seen this done, but never well. (But this is creepy. I’m not sure the world really needs better stalking tools.)
  • Verify links. For example (shameless plug), my site AbsurdlyCool FreebieFinder finds freebies online and avoids referral links by checking each link against multiple sites. A link with an embedded referral ID won't appear identically on many sites, so it gets ignored. The same approach could be applied to other domains.
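The link-verification idea from the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not the actual FreebieFinder code: the function name `links_seen_on_many_sites`, the `min_sites` threshold, and the example URLs are all assumptions.

```python
from collections import defaultdict


def links_seen_on_many_sites(links_by_site, min_sites=3):
    """Keep only links that appear, character for character, on enough sites.

    `links_by_site` maps a site name to the links found on that site.
    `min_sites` is an assumed threshold; tune it to the size of the crawl.
    """
    sites_per_link = defaultdict(set)
    for site, links in links_by_site.items():
        for link in links:
            # A set, so a link repeated on one site still counts once.
            sites_per_link[link].add(site)
    return {
        link for link, sites in sites_per_link.items()
        if len(sites) >= min_sites
    }


# Hypothetical crawl results: each referral link is unique to one site.
crawl = {
    "site-a.example": [
        "http://shop.example/free-sample",
        "http://shop.example/free-sample?ref=a1",
    ],
    "site-b.example": [
        "http://shop.example/free-sample",
        "http://shop.example/free-sample?ref=b2",
    ],
    "site-c.example": ["http://shop.example/free-sample"],
}

verified = links_seen_on_many_sites(crawl, min_sites=3)
```

The same counting works for the Bob's Pizza case: swap links for (business, phone number) pairs and keep the pairs that enough independent sites agree on.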

Please comment on other possibilities. What if you have images as well as text?
