IBM is working on a 120-petabyte storage unit built from 200,000 HDDs. That's a lot of storage!
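As a quick sanity check on those two figures, the arithmetic below works out the implied capacity per drive. It assumes decimal units (1 PB = 10^15 bytes) and treats the 120 PB as the simple sum of the drive capacities, which the article does not actually state, so take the result as a rough estimate only.

```python
# Rough per-drive capacity implied by the headline numbers.
# Assumptions (not from the article): decimal units, and 120 PB is
# the raw sum of all drive capacities with no redundancy overhead.
PETABYTE = 10**15
GIGABYTE = 10**9

total_bytes = 120 * PETABYTE
num_drives = 200_000

per_drive_gb = total_bytes // num_drives // GIGABYTE
print(f"about {per_drive_gb} GB per drive")
```

If some of the raw capacity goes to redundant copies, the drives would need to be larger than this figure to leave 120 PB usable.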
Steve Conway, a vice president of research with the analyst firm IDC who specializes in high-performance computing (HPC), says IBM's repository is significantly bigger than previous storage systems. "A 120-petabyte storage array would easily be the largest I've encountered," he says. The largest arrays available today are about 15 petabytes in size. Supercomputing problems that could benefit from more data storage include weather forecasts, seismic processing in the petroleum industry, and molecular studies of genomes or proteins, says Conway.
IBM's engineers developed a series of new hardware and software techniques to enable such a large jump in data-storage capacity. One challenge was finding a way to efficiently combine the thousands of hard drives that the system is built from. As in most data centers, the drives sit in horizontal drawers stacked inside tall racks. Yet IBM's researchers had to make the drawers significantly wider than usual to fit more disks into a smaller area. The disks must also be cooled with circulating water rather than standard fans.
The inevitable failures that occur regularly in such a large collection of disks present another major challenge, says Bruce Hillsberg, IBM's director of storage research and leader of the project. IBM uses the standard tactic of storing multiple copies of data on different disks, but it employs new refinements that allow a supercomputer to keep working at almost full speed even when a drive breaks down.
When a lone disk dies, the system pulls data from other drives and writes it to the disk's replacement slowly, so the supercomputer can continue working. If more failures occur among nearby drives, the rebuilding process speeds up to reduce the risk that yet another failure occurs and wipes out some data permanently. Hillsberg says the result is a system that should not lose any data for a million years, with no compromise on performance.
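The rebuild behavior described above can be sketched as a simple throttle: a single failure gets a small share of disk bandwidth, and each additional failure among nearby drives raises that share until the rebuild runs flat out. This is only an illustration of the idea, not IBM's algorithm. The function name `rebuild_rate`, the 10% starting share, the doubling rule, and the 200 MB/s ceiling are all assumptions made up for the example.

```python
def rebuild_rate(failed_nearby: int,
                 max_rate_mb_s: float = 200.0,
                 base_percent: int = 10) -> float:
    """Return the bandwidth (MB/s) to spend on rebuilding.

    failed_nearby: number of failed drives in the same neighborhood
                   (hypothetical notion of "nearby"; the article does
                   not define it).
    All default values are illustrative assumptions.
    """
    if failed_nearby <= 0:
        return 0.0  # nothing to rebuild
    # Start slowly for one failure, then double the share of bandwidth
    # for each further failure, capped at 100%.
    percent = min(100, base_percent * 2 ** (failed_nearby - 1))
    return max_rate_mb_s * percent / 100


for failures in range(0, 6):
    print(failures, "failed ->", rebuild_rate(failures), "MB/s")
```

The point of the design is the trade-off: with one failure the remaining copies make data loss unlikely, so the rebuild can yield to the supercomputer's workload. As failures accumulate, the risk of losing the last copy grows and the rebuild takes priority.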
The new system also benefits from a file system known as GPFS that was developed at IBM Almaden to give supercomputers faster data access. It spreads individual files across multiple disks so that many parts of a file can be read or written at the same time. GPFS also enables a large system to keep track of its many files without laboriously scanning through every one. Last month a team from IBM used GPFS to index 10 billion files in 43 minutes, effortlessly breaking the previous record of one billion files scanned in three hours.
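To make the idea of spreading a file across disks concrete, here is a minimal round-robin striping sketch. It is not GPFS code and says nothing about how GPFS actually lays out blocks. The functions `stripe` and `unstripe`, the in-memory "disks", and the tiny block size are hypothetical, chosen only to show why consecutive blocks landing on different disks lets them be read or written in parallel.

```python
from typing import List


def stripe(data: bytes, num_disks: int, block_size: int) -> List[List[bytes]]:
    """Split data into fixed-size blocks and deal them out round-robin."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks: List[List[bytes]]) -> bytes:
    """Reassemble the original data by reading one block per disk in turn."""
    out = []
    longest = max((len(d) for d in disks), default=0)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


data = b"abcdefghij"
disks = stripe(data, num_disks=3, block_size=2)
for n, blocks in enumerate(disks):
    print("disk", n, blocks)
print(unstripe(disks) == data)
```

In this toy layout, blocks 0, 1, and 2 sit on three different disks, so a reader that wants the first six bytes can fetch from all three at once instead of waiting on a single drive.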