Managing the information overload

Alvin Toffler coined the term “information overload” to describe how individuals can be overwhelmed by masses of information in postindustrial society. Now, many businesses and scientific researchers find themselves facing something similar. “The amount of data that needs to be stored and processed is exploding,” says Yale computer scientist Daniel Abadi.

Two approaches to handling such data have become popular: parallel database management systems (DBMSs), developed to efficiently manage structured data—the sort of data that can be represented on a grid—and MapReduce, created by Google to allow flexible searches of the more free-form content of the Web.
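The MapReduce model mentioned above boils down to two phases: a "map" step that transforms raw records into key-value pairs, and a "reduce" step that aggregates all values sharing a key. A minimal sketch of the idea, using the classic word-count example (the document data here is invented for illustration):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big systems", "data everywhere"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'systems': 1, 'everywhere': 1}
```

In a real deployment such as Google's, the map and reduce tasks run in parallel across many machines, which is what makes the model scale to Web-sized, free-form data.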

Abadi and his students have developed a new open-source system called HadoopDB, which combines the efficiency of DBMSs with the adaptability and scalability of MapReduce. Abadi likens it to the old Mac versus PC trope: “Windows is closed and proprietary. Macs are a little more open—they at least come with lots of open-source-based Linuxy goodness. DBMSs tend to be closed and proprietary, but MapReduce is known for the open sourcing.”
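The hybrid idea can be illustrated in miniature: each cluster node runs its own single-node DBMS, the "map" step pushes a SQL aggregation down to every node's local database (where it runs efficiently), and the "reduce" step merges the partial results. This is a hypothetical sketch only, with SQLite standing in for the per-node DBMS and invented sample data; it is not HadoopDB's actual code:

```python
import sqlite3

def make_node(rows):
    # Each "node" holds one shard of a sales table in its own local DBMS.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

nodes = [
    make_node([("east", 10), ("west", 5)]),
    make_node([("east", 7), ("west", 3)]),
]

# "Map": push the SQL aggregation down to each node's local DBMS.
partials = [
    node.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
    for node in nodes
]

# "Reduce": merge the per-node partial aggregates into a final answer.
totals = {}
for partial in partials:
    for region, subtotal in partial:
        totals[region] = totals.get(region, 0) + subtotal

print(totals)
```

The efficiency comes from the pushdown: each node's DBMS does the heavy filtering and grouping locally, so only small partial results cross the network to be combined MapReduce-style.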

Currently, DBMSs are used by everyone from retailers mining their purchase records to scientists doing high-throughput analysis of biochemical compounds. The systems they use may be adequate today, but if HadoopDB succeeds as its creators hope, it will allow a wide range of users to handle increasingly large data sets. Says Abadi, “The problem we're trying to solve is tomorrow's data workloads.”