Data Big Gulp
July 17, 2008
This morning I completed a long-overdue mailbox clean-up. You know, the intensive one that purges three-month-old messages with 300 KB attachments that you thought you were going to need soon but never did.
The effort blew away some 30 MB of not-really-useful data, greatly simplifying my digital life. Managing the home computer with the 120 GB hard drive is a different matter altogether. My troubles, however, pale in comparison to those of researchers who sequence genes or study samples using light-sheet fluorescent imaging. They have terabyte problems.
This week’s inaugural meeting of the Information Overload Research Group (IORG) in New York City seems to suggest there is a data pandemic, calling this overload the “world’s greatest challenge to productivity.”
Certainly, the monolithic piles of 0s and 1s have already pestered high-level researchers, many of whom are producing monstrous data sets from physics R&D. For example, a Univ. of Chicago team last fall produced the world’s largest compressible, homogeneous isotropic turbulence simulation. The effort generated 154 TB in 75 million files. The transfer of just 23 TB of this data to different computers took three weeks. Government-funded researchers are attempting to build distributed computer grids to help solve what has become a “petascale” problem, but these efforts are still in their infancy.
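To put those figures in perspective, here is a back-of-the-envelope sketch of the arithmetic. It assumes decimal units (1 TB = 10^12 bytes), reads "three weeks" as exactly 21 days, and treats the transfer as running at a constant rate. None of that is stated in the report, and the function and variable names are purely illustrative.

```python
# Rough arithmetic on the turbulence-simulation figures quoted above.
# Assumptions (not from the source): decimal units, 21 days, constant rate.

TB = 10**12  # bytes in a decimal terabyte
MB = 10**6   # bytes in a decimal megabyte


def average_rate_mb_per_s(bytes_moved: float, days: float) -> float:
    """Average throughput in MB/s for moving bytes_moved over the given days."""
    seconds = days * 24 * 60 * 60
    return bytes_moved / seconds / MB


# 23 TB moved in three weeks
rate = average_rate_mb_per_s(23 * TB, 21)

# 154 TB spread over 75 million files
avg_file_mb = (154 * TB) / 75_000_000 / MB

# Time to move the full 154 TB at the same average rate
days_for_full_set = 154 / 23 * 21

print(f"Average transfer rate: {rate:.1f} MB/s")
print(f"Average file size:     {avg_file_mb:.2f} MB")
print(f"Full data set at that rate: {days_for_full_set:.0f} days")
```

Under these assumptions the transfer averaged only about 12.7 MB/s, the typical file was around 2 MB, and moving the entire data set at the same pace would take well over four months. That suggests the sheer number of small files, not just the total volume, is part of the problem.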
Even research on data overload itself has burgeoned in the past few years (IORG cites 16 notable studies on email overload since 1999), and most experts recognize that data management and storage will become a significant theoretical and engineering challenge in the coming years. This philosophically recursive R&D work reveals some obvious but still unfortunate findings. For example, an email that goes unanswered for 24 hours (in practice, often a single “8-hour” workday) will likely never be answered at all. Companies such as Microsoft are developing probabilistic machine learning tools to help people triage email automatically and reduce the number of unnecessary emails.
No question, interruptions to productivity (such as the one I’m writing now!) are bad for efficiency, but I have a competing theory: the more data you have, the more likely you are to find a solution.
You just need to learn how to find what you need. And delete the rest.