Big Data Storage and Preservation at the Library of Congress  

July 25

Since I haven’t stepped inside one in nearly ten years, I had no idea that libraries were an endangered species. Some, like the Mark Twain Library here in Detroit, shut their doors for financial reasons. Others, like the Ancient Library of Alexandria in Egypt, were destroyed centuries ago, and their massive collections of irreplaceable knowledge went up in flames with them. Libraries seem to have lost more than a bit of their luster in the Internet age, but the U.S. Library of Congress is striving to preserve its historic archives and significance for the long haul.

The world’s largest library, the Library of Congress is an enormous treasure chest of American history and so much more. Here you can find everything from vintage newspapers and photographs to films and manuscripts submitted for copyright registration by authors such as Yours Truly. There’s obviously a lot of information stored in this iconic federal institution, but how much is a lot? The answer isn’t necessarily as cut and dried as a numbers person like me would prefer.

What’s in a Number?

According to the LOC itself, the Library of Congress contains more than 120 million items. This includes both print content and an ever-expanding digital collection composed of images, web pages, and files harboring all kinds of content. But when it comes to pinning a number on that diverse and dynamic collection, things can get pretty complicated. Thanks to a 2009 post on the LOC blog, we can at least try to sort out this mess by breaking these items down into sections.

Publicly accessible data online. The LOC estimated that it had roughly 15.3 million digital items freely available to the public online. This translates to about 74 terabytes of data.
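For a rough sense of scale, those two figures can be turned into an average size per item. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the LOC post does not specify, and the function name is just for illustration.

```python
# Back-of-the-envelope average size of a publicly available digital item.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6


def average_item_size_mb(total_terabytes: float, item_count: float) -> float:
    """Return the mean size per item in megabytes."""
    return total_terabytes * TB / item_count / MB


# Figures quoted above: about 74 TB spread over roughly 15.3 million items.
avg = average_item_size_mb(74, 15.3e6)
print(f"{avg:.1f} MB per item")  # prints "4.8 MB per item"
```

An average near 5 MB says nothing about any single item, since the collection mixes small web pages with large image scans.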

Physical items. The physical collection in the LOC was estimated to contain 32 million items. However, at the time of the blog post, that figure was said to cover only books and other printed items, which account for about a quarter of the whole physical collection.

Video and audio media. The LOC claimed that its collection of movies, videos, and audio recordings tallied up to about six million items. These items are stored at the Packard Campus facility in Culpeper, Virginia, which houses movies made all the way back in the late 1800s. Even at a digitization rate of three to five thousand terabytes per year, it would likely take the library’s National Audio-Visual Conservation Center decades to digitize the entire collection.

Five years later and those estimates have no doubt grown. Five years later and the library still hasn’t spilled the beans regarding exactly how much data it has in the vault. That hasn’t stopped other sources from trying to guesstimate and hint at the figures in numerous comparisons. Here are some examples:

The data repository at the Library of Congress is massive, but apparently has nothing on Facebook.

According to The Wire, Facebook’s photo library alone, with well over 140 billion photos under its belt, is more than 10,000 times bigger than the LOC repository.

Gizmodo offers what could be the closest thing to an accurate figure, at least in terms of digital content. The online tech publication reported that the NSA collects roughly 74 terabytes of data every six hours, which is said to be about what the LOC has stored in its digital archives.

Jaimon Joseph of IBNLive seems to have a good idea of how much data is stored in the world’s largest library. The Indian blogger claims that the Honeywell India Technology Centre stores approximately 32 terabytes of data, which he says is five times more than the capacity at the Library of Congress.

Finally, CenturyLink predicts that a whopping 7.9 zettabytes of data will be created by 2015. That supposedly adds up to 18 million times more than what the LOC has tucked away in its digital archives.
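Taken at face value, three of these comparisons imply very different sizes for the LOC’s digital archive. The sketch below simply works backward from each quoted ratio. It assumes decimal units (1 TB = 10^12 bytes, 1 ZB = 10^21 bytes) and leaves out The Wire’s Facebook comparison, since no per-photo size is given to convert 140 billion photos into bytes.

```python
# Work backward from each quoted comparison to the LOC size it implies.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 ZB = 10**21 bytes).
TB = 10**12
ZB = 10**21

implied_bytes = {
    # Gizmodo: the NSA's six-hour intake (74 TB) is about the LOC's archive.
    "Gizmodo": 74 * TB,
    # IBNLive: 32 TB is said to be five times the LOC's capacity.
    "IBNLive": 32 * TB / 5,
    # CenturyLink: 7.9 ZB is said to be 18 million times the LOC's archive.
    "CenturyLink": 7.9 * ZB / 18e6,
}

for source, size in implied_bytes.items():
    print(f"{source}: about {size / TB:.1f} TB")

# Ratio between the largest and smallest implied estimate.
spread = max(implied_bytes.values()) / min(implied_bytes.values())
print(f"Largest estimate is about {spread:.0f} times the smallest")
```

The implied figures come out to roughly 74 TB, 6.4 TB, and 439 TB, so the comparisons disagree by well over an order of magnitude.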

Preservation Challenges and Strategies

Although the actual size of the LOC data collection is shrouded in mystery, what isn’t so mysterious are the challenges it faces in preserving that mass of historical jewels and incoming information. All forms of media deteriorate over time, and converting old media to digital can make matters even more complex when factoring in the unpredictability of hard drives and their confounding failure rates. With so much important, unique information in its care, the Library of Congress has to be meticulous in numerous aspects of data preservation, including:

  • How data is recorded
  • Where it’s stored
  • Quality of storage media
  • Cost of storage media
  • How storage media is physically handled
  • How data will be recovered in the event of failure
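Recovering from a failure starts with noticing that something has gone wrong. One common way to do that in digital preservation is a fixity check: record a checksum for every file, then recompute and compare later. The sketch below is a generic illustration, not a description of the LOC’s own tooling, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are missing or no longer match."""
    damaged = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative_path)
    return damaged
```

In use, build_manifest would run once when a collection is stored, and find_damaged would run on a schedule afterward. Any path it returns is a candidate for restoring from another copy.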

Despite these immense challenges, the Library of Congress is taking a proactive approach to preserving the availability and integrity of its informational assets. Each year, the LOC hosts the Preservation Storage Meeting, where the focus is building out storage architectures for its digital archive. At the 2013 meeting, the LOC revealed that 30 percent of its storage investments are dedicated to expanding capacity, while the remaining 70 percent is allocated to the continual refreshing of technology resources. There were also discussions around the library’s ongoing migration from tape to newer storage technology, and the viability of the cloud.

To improve its storage capabilities, the Library of Congress recently turned to Avere Systems, a leading provider of NAS-based storage solutions. The firm is using its integrated storage technology to optimize the data across the library’s website and file archives. Use of the Avere FXT Series, which supports a robust 150 TB of flash per cluster, enables site visitors to enjoy fast and efficient access to a wealth of publicly available content online. The investment in Avere is one of several layers in a hybrid data management strategy geared to preserve the past, present, and future of the Library of Congress.

Photo Credit: Mark Brennan via Flickr