Sunday, November 16, 2008

Tour of Ranger - Loud and Windy

A visit to the Ranger cluster at TACC, the largest public research cluster in the world at 579 TFLOPS, is not your typical data center tour. The system took $30M and two and a half years to implement, with total costs of $60M over four years. Two sides of the data center have glass walls, making for a nice showcase.

Before going in, Glen gives us a tour of one of the blades. The cluster has almost 4000 of these, for a total of 15,744 processors, 62,976 cores, and 123 TB of memory.
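As a quick sanity check on those totals, here is a small Python sketch. The processor, core, and memory figures are the ones quoted above; the four-sockets-per-blade value and the reading of "TB" as binary terabytes are my assumptions, chosen because they make the numbers come out evenly.

```python
# Figures quoted on the tour.
PROCESSORS = 15_744
CORES = 62_976
MEMORY_TB = 123

# Assumption (not stated on the tour): four processor sockets per blade.
SOCKETS_PER_BLADE = 4

cores_per_processor = CORES // PROCESSORS      # quad-core parts
blades = PROCESSORS // SOCKETS_PER_BLADE       # consistent with "almost 4000"

# Assumption: "TB" is binary here (1 TB = 1024 GB).
memory_per_core_gb = MEMORY_TB * 1024 / CORES
memory_per_blade_gb = memory_per_core_gb * cores_per_processor * SOCKETS_PER_BLADE

print(f"{cores_per_processor} cores per processor")
print(f"{blades} blades")
print(f"{memory_per_core_gb:.1f} GB per core, {memory_per_blade_gb:.0f} GB per blade")
```

Under those assumptions the figures line up: 3,936 blades, four cores per processor, and 2 GB of memory per core.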

When you enter the room, the immediate sensation is of an intense environment. It's very loud, and in the rows, very windy. The APC chillers I'm in front of are on each side of the blade racks, returning cooled air from the enclosed hot aisle. Unfortunately I don't have enough hair to really give you a true sense of it.

Here we see six SunBlade 6000 chassis in two racks, each with twelve blades. Storage racks with 1.7 PB of capacity, built from Sun Thumpers running the Lustre filesystem, are at the ends of each aisle. Data can only stay on the system for 30 days, as users are creating 5-20 TB/day.
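To get a feel for why a retention limit matters, here is a rough Python sketch of the storage arithmetic. It uses only the capacity, retention window, and daily data rates quoted above; treating 1.7 PB as 1,700 TB and assuming a constant daily rate are my simplifications, not anything TACC stated.

```python
CAPACITY_TB = 1700        # 1.7 PB, taken as decimal terabytes (assumption)
RETENTION_DAYS = 30       # purge window quoted on the tour

def steady_state_tb(daily_tb: float) -> float:
    """Data held at any time if everything older than the window is purged."""
    return daily_tb * RETENTION_DAYS

def days_to_fill(daily_tb: float) -> float:
    """Days until the filesystem is full if nothing is ever purged."""
    return CAPACITY_TB / daily_tb

for rate in (5, 20):      # low and high ends of the quoted 5-20 TB/day
    print(f"{rate} TB/day: {steady_state_tb(rate):.0f} TB held, "
          f"full in {days_to_fill(rate):.0f} days without purging")
```

At the high end of the range, the filesystem would fill in under three months with no purge policy at all.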

These are power distribution units (PDUs) for the cluster, which draws 2.4 MW at peak load, or enough power for 2400 typical residential homes. At $0.06/kWh, the power bill is ~$1M/year. There are no UPSes or generators for the system, though they are planning to add UPSes for the storage and network fabric. Also, notice the floor vents: there is room-based HVAC in addition to the in-row systems to manage humidity and the larger environment.
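The power bill figure is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the peak draw and electricity price are the ones quoted above, while the `load_fraction` parameter (average draw as a fraction of peak) is a hypothetical knob I added, since the cluster presumably doesn't sit at peak all year.

```python
PEAK_KW = 2400            # 2.4 MW peak draw
PRICE_PER_KWH = 0.06      # dollars per kWh
HOURS_PER_YEAR = 24 * 365
HOMES = 2400              # homes quoted as the equivalent load

def annual_cost(load_fraction: float = 1.0) -> float:
    """Yearly cost in dollars at a given average load (fraction of peak)."""
    return PEAK_KW * load_fraction * HOURS_PER_YEAR * PRICE_PER_KWH

kw_per_home = PEAK_KW / HOMES   # implied average draw per home

print(f"At continuous peak: ${annual_cost():,.0f}/year")
print(f"At 80% of peak:     ${annual_cost(0.8):,.0f}/year")
print(f"Implied draw per home: {kw_per_home:.1f} kW")
```

Running flat out all year would cost about $1.26M, so the ~$1M figure corresponds to an average load of roughly 80% of peak.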

The cluster itself takes 2000 sf, but another 1500 sf is needed for PDUs and chillers. The room overall is 6000 sf.

The enclosed hot aisles allow for much greater efficiency in cooling the tremendous heat load. The fire suppression is water sprinklers (dry pipe pre-action), but they had to add more smoke sensors and alarms to deal with the enclosed aisles, as there was concern that someone working inside a closed aisle wouldn't hear or see an alarm.

This is one of two massive Magnum InfiniBand switches, the world's largest such device, with 3456 non-blocking ports. It is so dense that engineering the cabling was a major challenge.

Ranger supports over 1500 users and 400 research projects, and has handled more than 300,000 jobs for a total of some 220,000,000 CPU hours. There are larger clusters out there, but Ranger gets kudos for its density, and for putting together a high-performing system on a considerably smaller budget than the DOE labs.

