Mega Grid for Mega Science
Grid computing unites scientists around the world and uses their collective computing
power to investigate science’s unanswered questions.
The Large Hadron Collider (LHC), being built by CERN, the European Organization
for Nuclear Research, near Geneva, Switzerland, is the largest scientific instrument
on the planet. It is designed to accelerate protons to nearly the speed of light
and collide them with each other in the search for answers to some of science’s
unanswered questions, such as the origin of mass. When it comes online in 2007,
the LHC will be able to “see” up to 40 million collision events per second, enabling
the detectors of its main experiments, ALICE (A Large Ion Collider Experiment),
ATLAS (A Toroidal LHC ApparatuS), CMS (Compact Muon Solenoid), and LHCb
(Large Hadron Collider beauty experiment) to watch as the energy of these collisions
mimics the conditions that existed a fraction of a second after the Big Bang.
All this capability will spell a new era for particle physicists and, for that
matter, science around the world. But there are real challenges. In operation,
the LHC will produce roughly 15 petabytes of data annually, the equivalent of
about 3 million DVDs or 100,000 times the storage capacity of the average desktop
computer. Thousands of scientists around the world will need to access and analyze
this data to find elusive evidence of new particles and forces.
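As a rough check on these comparisons, the short Python sketch below redoes the arithmetic. The 4.7 GB single-layer DVD capacity and the decimal units are assumptions that the article does not state, and the desktop figure is simply what the 100,000-fold comparison implies.

```python
# Back-of-the-envelope check of the LHC data-volume comparisons.
# Assumptions (not from the article): decimal units, 4.7 GB single-layer DVD.
PETABYTE = 10**15
GIGABYTE = 10**9

annual_data_bytes = 15 * PETABYTE      # "roughly 15 petabytes of data annually"
dvd_capacity_bytes = 4.7 * GIGABYTE    # assumed DVD capacity

# Number of DVDs needed to hold one year of data.
dvds_per_year = annual_data_bytes / dvd_capacity_bytes

# Desktop disk size implied by "100,000 times the storage capacity".
implied_desktop_gb = annual_data_bytes / 100_000 / GIGABYTE

print(f"DVDs per year: {dvds_per_year / 1e6:.1f} million")
print(f"Implied desktop disk: {implied_desktop_gb:.0f} GB")
```

Under these assumptions the result is a little over 3 million DVDs, and the comparison implies a desktop disk of about 150 GB.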
No single institution could easily store all of the data produced
by the LHC in one place and also provide enough computing power to support the scientists
who will need daily access to the data. To deal with the vast amounts of data
and the accessibility issues, the CERN scientists have turned to grid computing.
Grids are the most recent step taken in tapping into the power of distributed
computing and storage resources across the world.
Life before the grid
Initially, scientists used clusters of computers to overcome the lack of computational
power. In this approach, first explored in the early 1980s, groups of coupled computers
work together to solve complex problems that cannot be solved by one machine alone. Computer
clusters, for their part, are still used in supercomputer centers, research labs,
and industry to provide significant improvements in total computing power.
The next step in increasing computational power was distributed computing, a form of
parallel computing in which the computers used to complete the tasks are in multiple
geographic locations. Applications are distributed between two or more computers over a network
to accomplish tasks that are too complex for one computer alone. One example of
distributed computing is SETI@home. This program uses idle CPU time on Internet-connected
computers to crunch data from radio telescopes in the Search for Extraterrestrial
Intelligence (SETI). By using distributed computers, the program creates a powerful
computing system with global reach and supercomputer capabilities. Indeed, distributed
computing was the first real step toward today’s computing grids.
Enter the grid
In March 2005, the LCG project surpassed 100 sites in 31 countries,
which made it the world’s largest scientific grid. Photo: CERN
The term “grid” arose in the late 1990s to describe a computing infrastructure
that works like a power grid. Users would be able to access computing resources
as needed without worrying about where they came from, much as a person accesses
the electric grid. The “power stations” on the computing grid are clusters of
computers, and the “power lines” are the fiber optics of the Internet.
In 1995, Ian Foster at Argonne National Laboratory and the Univ. of Chicago, Ill.,
and Carl Kesselman in the Information Sciences Institute at the Univ. of Southern
California, Los Angeles, known as the fathers of grid computing, developed I-WAY,
the first true grid computer. Foster, R&D Magazine’s 2003 Innovator of the
Year, and Kesselman looked at ways of using network technology to build very
large, powerful systems by getting machines in different locations to work on parts
of a problem and then combining their results, rather than by writing software to
run on multiple processors in parallel. Ultimately, these ideas together formed
I-WAY, which enlisted high-speed networks to connect end resources at 17 sites
across North America, marking the start of grid computing.
In the summer of 2000, Kesselman went to Geneva to give a seminar on grid computing,
and the LHC Computing Grid (LCG) was born. A grid was chosen for the LHC because
the significant costs of maintaining and upgrading the necessary resources are
more easily handled in a distributed environment. In this way, individual institutes
and national organizations could fund local computing resources and retain responsibility
for them, while at the same time still contributing to the global goal.
World’s largest international scientific grid
When the LHC is running optimally, access to experimental data will need to be
provided for more than 5,000 scientists in 500 research institutes and universities
worldwide that are participating in LHC experiments. In addition, this data needs
to be available over the 15-year estimated lifetime of the LHC.
Grids, grids everywhere
The Large Hadron Collider Computing Grid collaborates with many other major
grid development projects and production environments around the world.
EGEE: The Enabling Grids for E-sciencE (EGEE) project brings
together scientists and engineers from more than 90 institutions in 32 countries.
The project was conceived from the start as a four-year effort, and its second
two-year phase began on April 1, 2006. EGEE is funded by the European
Commission and is a major contributor to the operations of the LCG project.
www.eu-egee.org
GridPP: GridPP is a collaboration of particle physicists
and computer scientists from the UK and CERN. Currently, this grid has 17
UK institutions. When the LHC opens in 2007, GridPP will be used to process
the accompanying data deluge by contributing the equivalent of 10,000 PCs
to this worldwide effort.
www.gridpp.ac.uk
INFN Grid: The INFN Grid project is used by INFN, Italy’s
National Institute for Nuclear Physics, to develop and deploy grid middleware
services. The INFN Grid provides, deploys, and operates an open source release,
essentially based on EGEE gLite middleware, tailored to the needs of the
Italian grid infrastructure and user communities.
http://grid.infn.it
NorduGrid: NorduGrid develops and deploys a set of tools
and services, the Advanced Resource Connector (ARC) middleware, which is
free software. The core of the collaboration historically consists of
several Nordic academic and research institutes. NorduGrid interoperates
with the LCG.
www.nordugrid.org
Grid3: Grid3 is operated jointly by the U.S. Grid projects
iVDGL, GriPhyN, and PPDG, and the U.S. participants in the LHC experiments
ATLAS and CMS. Project highlights include participation by more than 25
sites across the U.S. and Korea, which collectively provide more than 2,000
CPUs.
www.ivdgl.org/grid2003
OSG: The Open Science Grid (OSG) is a distributed computing
infrastructure for large-scale scientific research, built and operated by
a consortium of U.S. universities, national laboratories, scientific collaborations,
and software developers. The OSG integrates computing and storage resources
from more than 50 sites in the U.S., Asia, and South America.
www.opensciencegrid.org
All of these requirements led to the creation of the LCG, the mission of which
is to build and maintain a data storage and analysis infrastructure for the entire
high-energy physics community that will use the LHC. The LCG is a worldwide network
of thousands of PCs, organized into large clusters and linked by ultra-high speed
connections to create the world’s largest international scientific computing grid.
Among the LCG’s goals are:
• Developing different software components to support the physics application
software in a Grid environment.
• Developing and deploying computing services based on a distributed Grid model.
• Managing users and their rights in an international, heterogeneous, and non-centralized
Grid environment.
• Managing acquisition, installation, and capacity planning for the large number
of commodity hardware components that form the physical platform for the LCG.
In addition to linking with individual PCs worldwide, the LCG collaborates with
many existing science grid infrastructures, among them the E.U.-funded Enabling
Grids for E-sciencE (EGEE) project and the U.S. Open Science Grid (OSG) project
(see sidebar). At the EGEE’06 conference in Geneva in September, CERN Director
General Robert Aymar emphasized the importance of such grids to the LCG. “We are
just over one year away from the anticipated launch of the Large Hadron Collider.
We expect this device will open up new horizons in particle physics,” says Aymar.
“The EGEE infrastructure is a key element in making the LHC Computing Grid possible,
and thus the success of the LHC is linked to the success of the EGEE project.”
In terms of deliverables, the LCG is already being tested by ALICE, ATLAS, CMS,
and LHCb to simulate the computing conditions expected once the LHC goes online.
As a result, LCG partners are achieving record-breaking results for high-speed
data transfers, distributed processing, and storage. For example, in 2005, eight
major computing centers completed a challenge to sustain a continuous flow of
600 MB/sec on average for 10 days from CERN to seven sites in Europe and the U.S.
This exercise was part of a service challenge designed to test the infrastructure
of the LCG. The total amount of data transmitted in the challenge—500 TB—would
take about 250 years to download using a typical 512 kb/sec household broadband
connection.
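The figures in this paragraph can be checked with a few lines of Python. This is only a rough sketch: it assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes, 1 kb = 1,000 bits) and a 365.25-day year, none of which the article specifies.

```python
# Rough check of the 2005 service-challenge figures quoted above.
# Assumptions (not from the article): decimal units and a 365.25-day year.
SECONDS_PER_DAY = 86_400

# Total volume moved at a sustained 600 MB/sec for 10 days.
sustained_rate_bytes_per_s = 600 * 10**6
challenge_days = 10
total_bytes = sustained_rate_bytes_per_s * challenge_days * SECONDS_PER_DAY
total_tb = total_bytes / 10**12

# Time to download the quoted 500 TB over a 512 kb/sec household link.
household_rate_bits_per_s = 512 * 10**3
download_seconds = 500 * 10**12 * 8 / household_rate_bits_per_s
download_years = download_seconds / (365.25 * SECONDS_PER_DAY)

print(f"Data moved in the challenge: {total_tb:.0f} TB")
print(f"Download time at 512 kb/sec: {download_years:.0f} years")
```

Under these assumptions the totals come out near 520 TB and just under 250 years, consistent with the rounded figures in the text.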
Vicky White, head of the Fermilab Computing Division, Batavia, Ill., one of the
challenge participants, commented, “High-energy physicists have been transmitting
large amounts of data around the world for years, but this has usually been in
relatively brief bursts and between two sites. Sustaining such high rates of data
for days on end to multiple sites is a breakthrough, and augurs well for achieving
the ultimate goals for grid computing.”
Even with these successes, the developers of the LCG still face
challenges. Among them are ensuring adequate levels
of network bandwidth between the contributing resources, maintaining coherence
of software versions installed in various locations, coping with heterogeneous
hardware, managing and protecting the data, and providing accounting mechanisms
so that different groups have fair access, based on their needs and contributions
to the infrastructure. Other challenges include how to balance local ownership
of resources while making them available to the larger community and how to overcome
local security worries about giving access to “anonymous” non-local users.
The brain of LCG
Linking thousands of computers together into one grid requires the use of standard
protocols and services. As such, the brain of any computer grid is its middleware,
which makes the many different networks and resources of the grid
look seamless to the user and allows the user to submit a job to the entire grid.
The middleware draws from resource brokers, replica managers, and information
services to determine where to best run each job. It then copies or moves the
files as necessary, then returns the results to the user, without the user knowing
where the results came from. Security is paramount in such a system. Without authorization,
authentication, and accounting, there is no grid.
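To make the matchmaking step concrete, here is a minimal Python sketch of how a broker might combine an information service and a replica catalog to pick a site. It is a toy illustration only: the site names, the dataset name, and the data structures are hypothetical and do not reflect the actual LCG middleware or its interfaces.

```python
# Illustrative only: a toy resource broker. The site names, dataset name,
# and data structures are hypothetical and do not reflect the LCG middleware.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Site:
    name: str
    free_cpus: int  # what an information service might report


# Stand-in for a replica catalog: dataset name -> sites holding a copy.
REPLICAS = {"run-001": {"site-a", "site-c"}}

# Stand-in for an information service: current state of each site.
SITES = [Site("site-a", 12), Site("site-b", 200), Site("site-c", 40)]


def choose_site(dataset: str, cpus_needed: int) -> Tuple[Site, bool]:
    """Return (site, needs_copy), preferring sites that already hold the data."""
    candidates = [s for s in SITES if s.free_cpus >= cpus_needed]
    if not candidates:
        raise RuntimeError("no site has enough free CPUs")
    holders = REPLICAS.get(dataset, set())
    local = [s for s in candidates if s.name in holders]
    # Fall back to any site with capacity if no data holder can run the job.
    pool = local or candidates
    best = max(pool, key=lambda s: s.free_cpus)
    return best, best.name not in holders


site, needs_copy = choose_site("run-001", cpus_needed=30)
print(site.name, "copy data first" if needs_copy else "data already local")
```

The sketch prefers running a job where the data already lives and copies files only when no such site has capacity. A real broker would also weigh authorization, queue lengths, and network cost.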
The middleware chosen for the LCG is the Globus Toolkit, which won a Special
R&D 100 Award in 2002 for the Most Promising New Technology. Led by Kesselman
and Foster, Globus is an open source project that grew out of the grid community’s
attempts to solve real problems that are encountered by real application projects.
It provides many of the basic services needed to construct grid applications such
as security, resource discovery, resource management, and data access. Globus
enables the LCG to interpret a user’s request and then autonomously find the appropriate
computing resources. It then breaks the job into smaller tasks, allocates the
computing power, and starts solving the problem.
Rivers of data
To be processed, the massive amounts of data from the LHC will need to be
distributed worldwide. A four-tiered model was chosen for the data distribution.
The Tier-0 center of the LCG at CERN will encompass data acquisition and initial
processing of the data. In addition, all data will be recorded on a primary backup
tape kept at CERN. After initial processing, the Tier-0 center will distribute
the data to a series of Tier-1 centers, large computer centers with sufficient
storage capacity for the data and with around-the-clock support for the grid.
The 11 large Tier-1 centers will carry out the data-heavy analysis and will then
make the data available to the 100 Tier-2 centers in 40 countries.
These centers each consist of one or several collaborating computing facilities
which can store sufficient data and provide adequate computing power for specific
analysis tasks. The Tier-2 centers will simulate the details of the experiments
and support the various analysis efforts of groups or individuals. Individual
scientists from around the world will access the Tier-2 facilities through Tier-3
computing resources, which can consist of local clusters in a university department
or even individual PCs, and which may be allocated to LCG on a regular basis.
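The tier hierarchy described above can be summarized as a small data structure. The following Python sketch is illustrative only: the center counts for Tier-1 and Tier-2 come from the article, the role descriptions are paraphrased, and the function name is a hypothetical helper.

```python
# Illustrative only: the four-tier model as a simple data structure.
# Tier-1 and Tier-2 counts come from the article; roles are paraphrased.
from typing import List

TIERS = [
    {"tier": 0, "centers": 1, "role": "data acquisition, initial processing, tape backup at CERN"},
    {"tier": 1, "centers": 11, "role": "data-heavy analysis, around-the-clock grid support"},
    {"tier": 2, "centers": 100, "role": "simulation and support for analysis groups"},
    {"tier": 3, "centers": None, "role": "local clusters or individual PCs used by scientists"},
]


def data_path(from_tier: int, to_tier: int) -> List[int]:
    """Tiers that data passes through on its way down the hierarchy."""
    if not 0 <= from_tier <= to_tier <= 3:
        raise ValueError("data flows from a lower-numbered tier to a higher one")
    return list(range(from_tier, to_tier + 1))


for t in data_path(0, 3):
    print(f"Tier-{t}: {TIERS[t]['role']}")

# Average number of Tier-2 centers served by each Tier-1 center.
tier2_per_tier1 = TIERS[2]["centers"] / TIERS[1]["centers"]
print(f"Average Tier-2 centers per Tier-1 center: {tier2_per_tier1:.1f}")
```

With the article's counts, each Tier-1 center serves about nine Tier-2 centers on average. The real assignment of Tier-2 centers to Tier-1 centers is not described in the text.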
Looking to the future
As the date for the LHC’s start approaches, the scale of the LCG in terms of the
number of sites is already close to its target of 50,000 PCs. Computational and
storage capacity is also beginning to ramp up, but work remains to be done on
improving the overall reliability of the LCG. The LCG will continue to grow over
the next year by adding sites and increasing resources available at existing sites.
In addition, the exponential increase in processor speed and disk storage capacity
inherent to the IT industry will help achieve the LCG’s ambitious computing goals.
Once the LHC goes online, the LCG will unite scientists around the world in searching
for the answers to some of science’s most intriguing questions.
—Martha Walz