Mega Grid for Mega Science
Grid computing unites scientists around the world and uses their collective computing
power to investigate science’s unanswered questions.
The Large Hadron Collider (LHC), being built near Geneva, Switzerland, by CERN, the European Organization for Nuclear Research, is the largest scientific instrument on the planet. It is designed to accelerate protons to nearly the speed of light and collide them in the search for answers to some of science’s unanswered questions, such as the origin of mass. When it comes online in 2007, the LHC will be able to “see” up to 40 million collision events per second, enabling the detectors of its main experiments, ALICE (A Large Ion Collider Experiment), ATLAS (A Toroidal LHC ApparatuS), CMS (Compact Muon Solenoid), and LHCb (Large Hadron Collider beauty experiment), to watch as the energy of these collisions mimics the conditions that existed a fraction of a second after the Big Bang.

All this capability will spell a new era for particle physicists and, for that matter, science around the world. But there are real challenges. In operation, the LHC will produce roughly 15 petabytes of data annually, the equivalent of about 3 million DVDs or 100,000 times the storage capacity of the average desktop computer. Thousands of scientists around the world will need to access and analyze this data to find elusive evidence of new particles and forces. No single institution could easily store all of the data produced by the LHC in one place and also provide enough computing power to support the scientists who will need daily access to it.

To deal with the vast amounts of data and the accessibility issues, the CERN scientists have turned to grid computing. Grids are the most recent step in tapping into the power of distributed computing and storage resources across the world.

Life before the grid

Initially, scientists used clusters of computers to overcome the lack of computational power. First explored in the early 1980s, clusters are groups of coupled computers that work together to solve complex problems that could not be solved by one machine alone.
Computer clusters, for their part, are still used in supercomputer centers, research labs, and industry to provide significant improvements in total computing power.

The next step in increasing computational power was distributed computing, a form of parallel computing in which the computers working on a task are in multiple geographic locations. Applications are distributed between two or more computers over a network to accomplish tasks that are too complex for one computer alone.

One example of distributed computing is SETI@Home. This program uses idle CPU time on Internet-connected computers to crunch data from radio telescopes in the Search for Extraterrestrial Intelligence (SETI). By using distributed computers, the program creates a powerful computing system with global reach and supercomputer capabilities. Indeed, distributed computing was the first real step toward today’s computing grids.

Enter the grid
In 1995, Ian Foster of Argonne National Laboratory and the Univ. of Chicago, Ill., and Carl Kesselman of the Information Sciences Institute at the Univ. of Southern California, Los Angeles, known as the fathers of grid computing, developed I-WAY, the first true grid computer. Foster, R&D Magazine’s 2003 Innovator of the Year, and Kesselman looked at ways of using network technology to build very large, powerful systems. Rather than writing software to run on multiple processors in parallel, they got machines in different locations to work on parts of a problem and then combined the results. Ultimately, these ideas together formed I-WAY, which enlisted high-speed networks to connect end resources at 17 sites across North America, marking the start of grid computing.

In the summer of 2000, Kesselman went to Geneva to give a seminar on grid computing, and the LHC Computing Grid (LCG) was born. A grid was chosen for the LHC because the significant costs of maintaining and upgrading the necessary resources are more easily handled in a distributed environment. In this way, individual institutes and national organizations could fund local computing resources and retain responsibility for them, while still contributing to the global goal.

World’s largest international scientific grid

When the LHC is running optimally, access to experimental data will need to be provided for more than 5,000 scientists in 500 research institutes and universities worldwide that are participating in LHC experiments. In addition, this data needs to remain available over the 15-year estimated lifetime of the LHC.
The work of building the LCG includes:
• Developing different software components to support the physics application software in a Grid environment.
• Developing and deploying computing services based on a distributed Grid model.
• Managing users and their rights in an international, heterogeneous, and non-centralized Grid environment.
• Managing acquisition, installation, and capacity planning for the large number of commodity hardware components that form the physical platform for the LCG.

In addition to linking with individual PCs worldwide, the LCG collaborates with many existing science grid infrastructures, among them the E.U.-funded Enabling Grids for E-sciencE (EGEE) project and the U.S. Open Science Grid (OSG) project (see sidebar). At the EGEE’06 conference in Geneva in September, CERN Director General Robert Aymar emphasized the importance of such grids to the LCG.

“We are just over one year away from the anticipated launch of the Large Hadron Collider. We expect this device will open up new horizons in particle physics,” said Aymar. “The EGEE infrastructure is a key element in making the LHC Computing Grid possible, and thus the success of the LHC is linked to the success of the EGEE project.”

In terms of deliverables, the LCG is already being tested by ALICE, ATLAS, CMS, and LHCb to simulate the computing conditions expected once the LHC goes online. As a result, LCG partners are achieving record-breaking results for high-speed data transfers, distributed processing, and storage.

For example, in 2005, eight major computing centers completed a challenge to sustain a continuous flow of 600 MB/sec on average for 10 days from CERN to seven sites in Europe and the U.S. This exercise was part of a service challenge designed to test the infrastructure of the LCG. The total amount of data transmitted in the challenge, 500 TB, would take about 250 years to download using a typical 512 kb/sec household broadband connection.
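The service-challenge figures can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation and rests on assumptions the article does not state: decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), 512 kb/sec read as 512,000 bits per second, and a 365-day year. The function names are illustrative.

```python
# Back-of-the-envelope check of the service-challenge figures.
# Assumptions (not stated in the article): decimal units, 512 kb/sec read
# as 512,000 bits per second, and a 365-day year.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def total_bytes(rate_mb_per_sec: float, days: float) -> float:
    """Bytes moved at a sustained rate over a number of days."""
    return rate_mb_per_sec * 1e6 * days * SECONDS_PER_DAY


def years_to_download(n_bytes: float, link_kbit_per_sec: float) -> float:
    """Years needed to move n_bytes over a link of the given speed."""
    seconds = n_bytes * 8 / (link_kbit_per_sec * 1e3)
    return seconds / SECONDS_PER_YEAR


# 600 MB/sec sustained for 10 days.
challenge_bytes = total_bytes(600, 10)
print(f"Data moved: {challenge_bytes / 1e12:.0f} TB")

# The article's rounded 500 TB over a 512 kb/sec household link.
years = years_to_download(500e12, 512)
print(f"Years at 512 kb/sec for 500 TB: {years:.0f}")
```

Under these assumptions the sustained rate works out to roughly 518 TB over the 10 days, and 500 TB would take a little under 250 years on the household link, which is in line with the rounded figures quoted.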
Vicky White, head of the Fermilab Computing Division, Batavia, Ill., one of the challenge participants, commented, “High-energy physicists have been transmitting large amounts of data around the world for years, but this has usually been in relatively brief bursts and between two sites. Sustaining such high rates of data for days on end to multiple sites is a breakthrough, and augurs well for achieving the ultimate goals for grid computing.”

However, even with all of these successes, the developers of the LCG are still dealing with some challenges. These include ensuring adequate levels of network bandwidth between the contributing resources, maintaining coherence of software versions installed in various locations, coping with heterogeneous hardware, managing and protecting the data, and providing accounting mechanisms so that different groups have fair access, based on their needs and contributions to the infrastructure. Other challenges include how to balance local ownership of resources with making them available to the larger community, and how to overcome local security worries about giving access to “anonymous” non-local users.

The brain of LCG

Linking thousands of computers together into one grid requires the use of standard protocols and services. The brain of any computer grid is therefore its middleware, which makes the many different networks and resources of the grid look seamless to the user and allows the user to submit a job to the entire grid.

The middleware draws from resource brokers, replica managers, and information services to determine where best to run each job. It then copies or moves the files as necessary and returns the results to the user, without the user knowing where the results came from. Security is paramount in such a system. Without authorization, authentication, and accounting, there is no grid.
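To make the brokering idea concrete, here is a toy sketch of how middleware might pick a site for a job. It is not how any real grid middleware is implemented: the `Site`, `Job`, and `choose_site` names, the site list, and the selection rule (prefer a site that already holds the data, then the one with the most free CPUs) are all assumptions made for illustration.

```python
# Toy resource broker. Every name here (Site, Job, choose_site) is
# hypothetical and is not part of any real grid middleware.
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    free_cpus: int
    datasets: set = field(default_factory=set)  # replicas held locally


@dataclass
class Job:
    owner: str
    dataset: str
    cpus_needed: int


def choose_site(job, sites, authorized_users):
    """Pick a site for the job, preferring one that already holds the data."""
    # Security first: users who are not authorized get nothing.
    if job.owner not in authorized_users:
        raise PermissionError(f"{job.owner} is not authorized")
    # Information service: which sites have enough free capacity?
    candidates = [s for s in sites if s.free_cpus >= job.cpus_needed]
    if not candidates:
        return None
    # Replica manager: prefer sites that hold the dataset, so no copy is
    # needed; break ties by the most free CPUs.
    return max(candidates, key=lambda s: (job.dataset in s.datasets, s.free_cpus))


sites = [
    Site("site-a", free_cpus=200, datasets={"run-001"}),
    Site("site-b", free_cpus=500),
    Site("site-c", free_cpus=50, datasets={"run-001"}),
]
job = Job(owner="alice", dataset="run-001", cpus_needed=100)
best = choose_site(job, sites, authorized_users={"alice"})
print(best.name)  # site-a: enough CPUs and a local replica
```

The user only submits the job; which site runs it is decided by the broker, which is the sense in which the grid looks seamless from the outside.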
The middleware chosen for the LCG is the Globus Toolkit, which won a Special R&D 100 Award in 2002 for the Most Promising New Technology. Led by Kesselman and Foster, Globus is an open source project that grew out of the grid community’s attempts to solve real problems encountered by real application projects. It provides many of the basic services needed to construct grid applications, such as security, resource discovery, resource management, and data access.

Globus enables the LCG to interpret a user’s request and then autonomously find the appropriate computing resources. It then breaks the job into smaller tasks, allocates the computing power, and starts solving the problem.

Rivers of data

To process the massive amounts of data from the LHC, the data will need to be distributed worldwide. A four-tiered model was chosen for the data distribution.

The Tier-0 center of the LCG at CERN will handle data acquisition and initial processing of the data. In addition, all data will be recorded on a primary tape backup kept at CERN. After initial processing, the Tier-0 center will distribute the data to a series of Tier-1 centers, large computer centers with sufficient storage capacity for the data and with around-the-clock support for the grid.

The 11 large Tier-1 centers will carry out the data-heavy analysis and will then make the data available to the 100 Tier-2 centers in 40 countries. These centers each consist of one or several collaborating computing facilities, which can store sufficient data and provide adequate computing power for specific analysis tasks. The Tier-2 centers will simulate the details of the experiments and support the various analysis efforts of groups or individuals.

Individual scientists from around the world will access the Tier-2 facilities through Tier-3 computing resources, which can consist of local clusters in a university department or even individual PCs, and which may be allocated to LCG on a regular basis.
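The four-tier layout can be pictured as a simple tree, with each center attached to one in the tier above it. The sketch below is only an illustration under that simplifying assumption: the `Center` class, the `path_to_source` helper, and all site names other than CERN are made up, and real centers are not limited to a single parent.

```python
# Toy model of the four-tier layout. Site names other than CERN are
# made up, and the one-parent-per-center rule is a simplification.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Center:
    name: str
    tier: int
    parent: Optional["Center"] = None

    def __post_init__(self):
        # Tier-0 has no parent; every other center sits one tier below its parent.
        expected = 0 if self.parent is None else self.parent.tier + 1
        if self.tier != expected:
            raise ValueError(f"{self.name}: expected tier {expected}, got {self.tier}")


def path_to_source(center):
    """Centers between this one and Tier-0, nearest first."""
    path = []
    node = center
    while node is not None:
        path.append(f"Tier-{node.tier}:{node.name}")
        node = node.parent
    return path


tier0 = Center("CERN", 0)
tier1 = Center("example-tier1", 1, parent=tier0)
tier2 = Center("example-tier2", 2, parent=tier1)
tier3 = Center("university-cluster", 3, parent=tier2)

# Trace where a scientist's local resources ultimately get their data from.
print(" -> ".join(path_to_source(tier3)))
```

Walking up the tree from a Tier-3 resource passes through one Tier-2 and one Tier-1 center before reaching CERN, which mirrors the path the data takes in the opposite direction.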
Looking to the future

As the date for the LHC’s start approaches, the scale of the LCG, in terms of the number of sites, is already close to its target of 50,000 PCs. Computational and storage capacity is also beginning to ramp up, but work remains to be done on improving the overall reliability of the LCG.

The LCG will continue to grow over the next year by adding sites and increasing the resources available at existing sites. In addition, the exponential increase in processor speed and disk storage capacity inherent to the IT industry will help achieve the LCG’s ambitious computing goals. Once the LHC goes online, the LCG will unite scientists around the world in searching for the answers to some of science’s most intriguing questions.

- Martha Walz