Lessons Learned from GitLab's Massive Backup Failure

FEBRUARY 15TH, 2017
GitLab.com, a multi-million dollar startup, lost around 300 GB of data when its backup processes failed. It is just one of many cases showing that backups don't matter if you can't restore from them, and that IT admins need to make a solid habit of testing their backups and recovery capacity. We live in a world where fewer than 50 percent of disaster recovery plans function without hiccups. Whether we are talking about massive open-source projects, such as MongoDB or GitLab, or small private companies with in-house systems, nothing is safe anymore. Human error, hackers, or system failures can cause chaos in the enterprise.

No One is Too Big (or Too Small) to Fail

GitLab offers a web-based Git repository manager, where programmers can share and work on code together. Founded by two enthusiastic Ukrainian developers, the company now has over 1,400 contributors using the platform. Investors pumped over $25 million into the company to make sure it is a success. And it's actually working: big names such as Sony, IBM, NASA and CERN are using the platform. Do you think your five-person plumbing business in a small town has no data of interest to hackers? Think again. The numbers show that 43 percent of cyber attacks target small businesses, and that hardware and software failures are among the most common issues businesses face. The mindset that we're too big or too small to fail needs to go out the window fast, in any industry.

Human Error is Rampant

Statistics show that human error is the main cause of data loss. In fact, we debated this in another post titled "There's no such thing as a computer error". And yet, many IT admins and business owners still think 'it can't happen to me'. While it is true that you must trust your employees - you wouldn't hire them unless they were competent, right? - it would be a mistake to bury your head in the sand. Having a fail-safe backup solution for your critical systems is not negotiable. It looks as though a system administrator at GitLab tried to fix a problem on the website by clearing out the backup database and restarting the copying process. But the admin accidentally deleted the primary database instead, and by the time he realized it, it was too late. Around 300 GB of data was gone.

Testing Backup Systems is a Must

Testing backup systems is essential, and testing just once a month or once a year is not enough. Companies today create data at an incredible rate, so the quantity of data you lose with every hour of downtime can be mind-boggling. "Out of 5 backup/replication techniques deployed none are working reliably or set up in the first place", said the GitLab blog post documenting the data loss incident. "We ended up restoring a 6 hours old backup," added the unlucky devs at GitLab.com.
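What a routine restore test can look like is sketched below, as a rough illustration rather than a description of GitLab's setup or of any particular backup product. It assumes a plain directory of files backed up into a zip archive; the function names (`create_backup`, `verify_restore`) and the checksum manifest are hypothetical choices made for this example. The idea is simply to restore the backup into a scratch location and confirm that what comes back matches what was recorded when the backup was taken.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def create_backup(source: Path, archive: Path) -> dict[str, str]:
    """Zip every file under source and return a manifest of checksums.

    Assumption: the archive path lives outside the source directory.
    """
    manifest = checksums(source)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in sorted(manifest):
            bundle.write(source / name, arcname=name)
    return manifest


def verify_restore(archive: Path, manifest: dict[str, str]) -> list[str]:
    """Restore the archive into a scratch directory and compare it
    with the manifest recorded at backup time.

    Returns the sorted relative paths that are missing, unexpected,
    or changed. An empty list means the restore matched the manifest.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as bundle:
            bundle.extractall(scratch)
        restored = checksums(Path(scratch))
    return sorted(
        name
        for name in manifest.keys() | restored.keys()
        if manifest.get(name) != restored.get(name)
    )
```

Run on a schedule, a check like this turns "we have backups" into "we restored last night's backup and it matched"; any non-empty result should raise an alert. For a database, the equivalent test would be to load the dump into a throwaway instance and run sanity queries against it, which this file-level sketch does not cover.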

Placing the Blame is Redundant

In the aftermath of the incident, GitLab's VP of Marketing told Business Insider that no one individual was to blame for the mishap. Rather, the whole team felt the heat once it turned out the backups weren't working. So if you have a data loss incident and you're looking to shoot one person in the head (with a Nerf gun, of course), think twice. When things go south and backups fail along with systems, it will be very hard to pin the problem on one particular cubicle. The mindset should be that of teamwork:
If we win, we win together - if we fail, we will most certainly fail together. So always do your best.
Any punitive actions applied to one or more employees won't bring back lost data, revenue or company reputation. Owning up to mistakes and making sure things are buttoned up for the future, however, will help.

What Data Really Matters to YOU?

The GitLab wipe did not touch clients' repositories; it affected only database content such as comments and bug reports. But that was a lucky break. Do you know where the important data lies in your company? And do you know where and how it is backed up? What happens when a system needs to be changed or migrated? Taking backups is fine, sure. But ask yourself all these questions on a regular basis if you want to make sure revenue-generating systems are online 24/7. Intelligent backup systems like File Backup & Recovery allow easy management of backups on thousands of endpoints. This solves the problem of backup management at a time when IT administrators are scratching their heads about how to implement a safe storage infrastructure and how to maintain it easily. Solid imaging solutions like ShadowProtect SPX work on both Linux and Windows and provide full image backup and replication for both physical and virtual environments. As always, keep a backup (or two) and make sure you can easily restore - there are plenty of options. Stay safe!