Data Driven Computer Science

Does a growing segment of computer science have more in common with particle physics than algorithm design?

Consider five small factoids:

When David Karger posted about List.it’s growing user adoption (more than 14,000!) a few weeks ago, the primary benefit he cited was the mound of usage data the Haystack group now has available to analyze.
Any major systems conference today is bound to have several star papers from the big industry players (Google, Yahoo, Microsoft, Facebook, etc.) containing research made possible by company-proprietary data sets.
Academics in natural language processing struggle to compete with companies like Google when it comes to algorithms that need lots of data. Google simply has so much data, and so many computers, that it can do heavy computational lifting others cannot.
One of the prizes in Yahoo!’s Key Scientific Challenges contest is access to a portfolio of their private data sets for research.
The Eyebrowse project is motivated, in part, by the fact that as researchers, we have no idea how people actually use the web. This is the sole privilege of companies like Google, Microsoft, and Yahoo! that own and run advertisement networks that track your movements across the web. Among other things, Eyebrowse is a research project to help researchers gain access to this information.

It is clear that there is a growing subset of computer science that is not about computers, but rather about the information we suddenly have available as a result of computers. What’s interesting is that this new type of study is found scattered throughout the subfields of computer science, yet it is distinctly different in nature from traditional computer science research. This raises the question of whether we need to adapt our existing approaches to research and education to reflect this new type of work.

Does the research start at the generation of the data set, for example, or at the analysis of it? This is an incredibly important question because a good data set will make or break a research paper. Should PhD students spend their first two years building and marketing a platform (essentially running a startup company) and then the following four years analyzing the usage data it generated? Should they take a lesson from business school students and embed themselves in corporations to gain access to proprietary data sets for study? Should they limit themselves to studying only public data sets?

Who pays to build data sets? Good data sets are expensive to obtain. Particle physicists spend billions of dollars constructing particle accelerators just so they can record a few milliseconds of good data. But governments willingly provide the money and resources to help them gather this data because there isn’t a market for gluon data. There is, however, a market for your social networking behavior and web advertising clicks, so we shouldn’t hold our breath waiting for the NSFs of the world to fund a multibillion-dollar social network just to gather behavioral data.

Should we require researchers to publish data sets alongside their papers? My sense from biology students is that some biology labs today defensively guard their data to make sure they beat others to publication. How do we avoid data hoarding while still respecting the fact that generating a good data set takes a lot of insight and work?

If trends continue, data sets will become an increasingly important fuel for computer science research. Hopefully we can learn from other scientific disciplines how to cope with being data-driven, and adopt community standards that encourage an environment of collaboration and sharing.