Researchers and businesspeople around the world now have at
their disposal a new way to perform massive computations over large
quantities of unstructured data more quickly and easily than they've
ever imagined.
The reason: a Microsoft Research-developed computing tool called
Dryad,
a name derived from shy tree deities found in Greek mythology.
Dryad and a related programming model called DryadLINQ
constitute technology that simplifies running complex data-analysis
applications across hundreds or even thousands of servers on
familiar, widely used Windows software.
After nearly six years of research into Dryad and DryadLINQ, as
well as their use in-house on Microsoft projects such as Kinect and
Bing, Dryad and DryadLINQ are entering commercial use. Starting
Jan. 26, a technology preview of
Dryad and DryadLINQ will be built into the Windows HPC
Server 2008 R2 high-performance computing line and eventually
will be integrated with Microsoft
SQL Server and Windows Azure.
HPC Server is designed to give customers tremendous computing power
and an easy management experience, all using off-the-shelf
hardware.
Michael Isard, a Microsoft Research Silicon Valley principal researcher
instrumental in launching the Dryad project, says the new
technology is an excellent example of how Microsoft views
computing.
"This is an opportunity to democratize large-scale,
data-intensive computing," he says. "In areas such as
customer-relationship management, business intelligence, planning,
and infrastructure, all those tasks where companies now have access
to a vast amount of data, Dryad and DryadLINQ can make sense of that
data."
How Dryad Works
The Dryad project consists of two key components. The Dryad tool
itself provides reliable computing across thousands of servers.
DryadLINQ, built on Microsoft's .NET
Language Integrated Query (LINQ), enables developers to write
their applications in a SQL-like query language, using familiar
programming tools such as Microsoft
Visual Studio. Most programmers will work only with DryadLINQ;
once they have launched their application into the cloud, Dryad
will do the rest, invisibly.
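To give a feel for the query style described above, here is a minimal
sketch in Python. It is an analogy only: DryadLINQ itself is built on
.NET LINQ, and the Query class, its methods, and the sample purchase
records below are hypothetical names invented for this illustration.
The point is that the programmer states what to filter and aggregate,
and the steps run only when a result is requested.

```python
from collections import defaultdict

# Hypothetical sample data: (customer, category, amount) purchase records.
PURCHASES = [
    ("alice", "books", 12.0),
    ("bob", "games", 30.0),
    ("alice", "games", 25.0),
    ("carol", "books", 8.0),
    ("bob", "books", 5.0),
]


class Query:
    """Hypothetical lazy query: records steps, runs them only on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def where(self, predicate):
        # Return a new query with one more filter step; nothing runs yet.
        return Query(self._source, self._steps + (("where", predicate),))

    def select(self, fn):
        # Return a new query with one more projection step.
        return Query(self._source, self._steps + (("select", fn),))

    def collect(self):
        # Execute the recorded steps in order over the source rows.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "where" else map(fn, rows)
        return list(rows)

    def sum_by(self, key, value):
        # Group the resulting rows by key(row) and sum value(row) per group.
        totals = defaultdict(float)
        for row in self.collect():
            totals[key(row)] += value(row)
        return dict(totals)


# "Total spend per category, counting only purchases of at least 6.0."
big_spend = (
    Query(PURCHASES)
    .where(lambda p: p[2] >= 6.0)
    .sum_by(key=lambda p: p[1], value=lambda p: p[2])
)
```

Because the query is only a description of the work until it is
executed, a runtime is free to decide where and how the steps run,
which is the role the article assigns to Dryad.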
A third piece, the Distributed Storage Catalog (DSC), is a
distributed file system built for Dryad. It manages the data that
Dryad is processing, keeping it stored reliably and safely with
user-configurable redundancy. The DSC also keeps the data close to
the servers processing it, so time is not wasted transmitting the
data to a server.
Dryad and DryadLINQ make it easier for programmers to take
advantage of the power of parallel computing, in which rows of
servers or multicore processors within a single machine tackle a
single computing problem. Such computing is extremely powerful,
especially with so-called unstructured data such as information on
buying habits that a retailer might collect from tens of thousands
of customers but that has not been tagged or annotated, in contrast
to structured data found, for instance, in a SQL database.
It is difficult, though, to harness the power afforded by
parallel computing. Most programmers are more familiar with writing
sequential programs, in which Action A is followed by Action B,
then Action C. It is challenging to think and program in
parallel.
While DryadLINQ enables developers to write
their applications in a query language using Visual Studio,
Dryad breaks up the program and assigns it across clusters of
servers or processors. In effect, Dryad acts as a computing traffic
cop, sending data down potentially millions of computing pathways.
It helps make sure that when one piece of data is modified, other
servers don't also change that data. It balances the computing load
between many computers, and it re-routes computing traffic if an
error or communications problem temporarily takes one or even
several servers offline.
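The splitting and re-routing described above can be sketched as a toy
simulation in Python. This is not Dryad's scheduler: the workers are
plain functions run one after another, and partition, make_worker,
run_job, and WorkerOffline are hypothetical names used only for
illustration. The sketch splits the data into chunks, hands each chunk
to a worker, and moves a chunk to another worker when the first choice
is offline.

```python
class WorkerOffline(Exception):
    """Raised by a simulated worker that cannot be reached."""


def partition(items, n):
    """Split items into n chunks, dealing them out round-robin."""
    return [items[i::n] for i in range(n)]


def make_worker(name, online=True):
    """Build a simulated worker; an offline worker fails every task."""
    def run(task, chunk):
        if not online:
            raise WorkerOffline(name)
        return task(chunk)
    return run


def run_job(task, chunks, workers):
    """Run task on every chunk, re-routing a chunk if its worker is offline."""
    results = []
    for i, chunk in enumerate(chunks):
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            try:
                results.append(worker(task, chunk))
                break
            except WorkerOffline:
                continue  # try the next worker for this same chunk
        else:
            raise RuntimeError("no worker available for chunk %d" % i)
    return results


# Three simulated workers, one of them offline.
workers = [make_worker("w0"), make_worker("w1", online=False), make_worker("w2")]
chunks = partition(list(range(1, 101)), 4)
partial_sums = run_job(sum, chunks, workers)
total = sum(partial_sums)
```

The caller only supplies the task (here, sum) and the data. Which
worker handles which chunk, and what happens when one is offline, stays
inside run_job, which mirrors the division of labor the article
describes between the programmer and Dryad.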
That removes a huge burden from programmers and lets them focus
on the problem they are trying to solve, not how the computers will
act in parallel.
"We want programmers to be able to write their programs without
having to think about things like fault tolerance [a byproduct of
parallel computing's complexity]," says Yuan Yu, a principal
researcher at Microsoft Research Silicon Valley who led the
creation of the DryadLINQ component.
"We want them to be able to write sequential and declarative
code, and then that same code can be run on a single machine, on a
multicore machine, or on a cluster of machines. That's the beauty of
the DryadLINQ programming model."
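The idea of writing code once and choosing the execution strategy
separately can be illustrated with a short Python sketch. It is an
analogy under stated assumptions, not DryadLINQ: the word_count and run
functions and the sample lines are hypothetical, and a thread pool from
Python's standard library stands in for whatever runtime would spread
the work across cores or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    """User code, written as if it runs on one item at a time."""
    return len(line.split())


def run(lines, mapper=map):
    """Apply the same user code; only the mapper that executes it changes."""
    return sum(mapper(word_count, lines))


# Hypothetical input data.
LINES = ["dryad runs jobs", "linq writes queries", "one two three four"]

# Sequential execution with the built-in map.
sequential_total = run(LINES)

# Same code, executed through a pool of workers instead.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_total = run(LINES, mapper=pool.map)
```

The user-written word_count function is untouched in both cases. Only
the mapper passed to run changes, which is the separation between
program and execution environment that Yu describes.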
A second benefit is that Dryad gives programmers
supercomputer-level power with everyday programming tools and
relatively inexpensive hardware.
"This is a much cheaper way of doing things," Yu says. "Everything
is a commodity: a commodity operating system, using commodity servers
and switches. Dryad deals with the reliability and the bandwidth
issues."
Dryad also utilizes Microsoft's big investment in the cloud. As
Dryad is integrated with Azure, all a programmer will need to take
advantage of Dryad is a client and an Azure connection. Whether
they are working on a cluster or the cloud, programmers can store
their data and then manipulate it through their DryadLINQ-written
applications. On a cluster, the DSC unit manages the data to keep
it close to the processors working on it, so time is not lost in
communicating data between servers.
"The only thing we'll give the customer is some client software
for writing DryadLINQ programs," Isard says. "They'll basically write
the program on their machine and submit it to Windows Azure, where
Dryad is running internally."
The Evolution of Dryad
Dryad had its roots in an idea developed in October 2004 by
Isard, then working on search for Microsoft, when he recognized the
need for a large-scale data-intensive computation platform and
began discussions with researchers at Microsoft to build on the
idea.
Not long afterward, the newly created Dryad came into widespread
use within Microsoft's search offering, where it was used on
thousands of servers. But while the tool worked well, the
programming interface was awkward. Yu recognized the potential of
LINQ to serve as the front-end programming tool for Dryad, and
started the DryadLINQ project in September 2006. By early 2008, the
Dryad/DryadLINQ combination was made available within Microsoft. A
release to a small collection of academic researchers followed.
Dryad also was adopted as a key tool for the development of the Xbox 360 Kinect gaming
device. The DryadLINQ research paper won a best-paper award
in 2008 during the eighth USENIX Symposium on Operating Systems
Design and Implementation.
"It was easily the largest project in our lab," Yu says. "And this
was a long-term project, so management had to believe in it. But
they said, 'We believe in you guys, so here is the money you need to
build a server cluster to do the research.' Also, the entire lab was
very supportive: we built the (Dryad) system, and many researchers
are using it for real work. Their feedback, in particular, has been
invaluable in refining the DryadLINQ programming model."
Isard adds that while it might seem Dryad had a long gestation,
the timing of its market release is right.
"I think the HPC product group moved at the right time, when they
saw the opportunity," he says. "We were a year or two ahead of the
curve on the research side, but we were ready when the product
group saw a need for it."
Dryad Enters the Market
A big step is coming, as Dryad and DryadLINQ become fully
productized as part of the Microsoft HPC Server suite. They also will
be integrated with Microsoft SQL Server and Windows Azure to give
customers from academia to the business community a new, powerful
computing tool.
Isard is confident that Dryad's ease of use and familiar
Microsoft tools will win over developers.
"Dryad will particularly appeal to customers who would love
to keep using Windows and Excel and Visual Studio and all the tools
they already use," he says, "and need a technology for unstructured
data analysis that really scales."
John Dunagan, a principal architect for Microsoft's High Performance
Computing group, thinks HPC Server customers who use Dryad will
find that they now can solve problems that had been
challenging.
"We're convinced that we will delight our customers, both with the
pure capability of the system, as well as its ease of use," he says.
"What I really like about Dryad is that it is not just about handling a
problem in a better way, it is also about new possibilities in
computing that you couldn't imagine before."
The Microsoft Research team that worked on Dryad is pleased to
see its project in a position to seek a larger audience.
"Offering an easy-to-use but powerful, data-intensive computing
tool is exciting to see," Isard says. "It will benefit a whole new
set of Microsoft customers."