GitLab is down. Oh, not again!

Why is it happening? Who’s to blame? What to do?

GitLab Fan
GitLab Fan Club

--

Why does it actually happen?

There are actually two reasons for downtime at GitLab.com: planned downtime during deployments, and outages caused by bugs and infrastructure problems.

Cause #1: Deployments

So why do deployments cause downtime? To answer that, we need to understand how GitLab.com is deployed. There isn’t much information in public, but we can piece together the bigger picture by analyzing documentation and conversations in GitLab’s public issue tracker.

Some facts first:

  1. GitLab is distributed as operating system packages. For example, on Ubuntu you can install GitLab CE with the “apt-get install gitlab-ce” command, and update it with “apt-get update && apt-get install gitlab-ce”.
  2. GitLab releases a new version every month. For example, the latest one was 8.15. Regressions and bugs fixed after a release end up in patch releases like 8.15.3. There are also security releases, which patch vulnerabilities in the last three releases of GitLab, e.g. 8.15.4. Finally, there are RC releases, which are deployed on GitLab.com a couple of days before each major release.
  3. For every release, GitLab has to build packages for all supported operating systems.

So we can assume that deployments on GitLab.com are done using the same “apt-get” mechanism.

Now we have to figure out why this way of deploying causes downtime. The catch is that sometimes GitLab has to change the structure of the database. Now imagine a web application writing to a database that is being modified by a migration at the same time. Obviously, this can lead to data corruption. That is why the whole application has to be shut down during migrations to prevent such problems.
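To make the conflict concrete, here is a toy sketch in Python with SQLite (hypothetical table names, nothing GitLab-specific): a migration renames a table while the old application code, still running, keeps writing to the old name.

```python
import sqlite3

# In-memory database standing in for the production database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# The migration runs while the application is still up:
db.execute("ALTER TABLE users RENAME TO users_v2")

# Old application code, unaware of the migration, keeps writing:
try:
    db.execute("INSERT INTO users VALUES (1, 'alice')")
except sqlite3.OperationalError as err:
    print("write failed mid-migration:", err)
```

In a real system the failure modes are messier than a clean error: half-applied writes and reads of half-migrated data. Taking the application down for the duration of the migration sidesteps all of them.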

The package-based way of distribution is very convenient for administrators who upgrade their GitLab once every couple of months, but it apparently has downsides for GitLab.com, which has to be updated much more often.

It seems GitLab is not happy with this situation either. There is an issue created by Sean McGivern in GitLab’s public issue tracker: Zero-downtime migrations.

Sean highlights that migrations are not the only cause of downtime, but notes that overcoming this problem would be a significant benefit.

In a comment on this issue, Yorick Peterse confirms our guess about package-based deployments, and says that the only way for GitLab.com to get zero-downtime deploys is to switch to a SaaS-like deployment strategy:

The first requirement is that we start deploying individual commits, instead of bundling days of work into RCs. Once we can deploy individual commits we’d have to require developers push each migration as a separate commit. Finally, to then do a zero downtime deploy you need to carefully merge changes into master one by one, and deploy them one by one.
This is a rather complex procedure, and I don’t see this working out any time soon (if ever).

It doesn’t sound very optimistic so far, but let’s hope for the best. You can find more details in the issue itself.
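For a taste of what zero-downtime migrations involve, here is a minimal sketch (again with a hypothetical SQLite schema, not GitLab’s actual approach): a purely additive migration, one common building block of such schemes, does not break application code that predates it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# Additive migration: only add a column; renames and drops would be
# deferred to a later release, once no running code uses the old name.
db.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Old application code, which lists its columns explicitly, still works:
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
rows = db.execute("SELECT id, name, email FROM users").fetchall()
print(rows)  # the new column is simply NULL for old rows
```

This is only the easy half of the job, of course: getting every change into this additive shape, commit by commit, is exactly the “rather complex procedure” the quote above describes.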

The discussion is open, and if you have something to suggest, your comments are more than welcome. You can also subscribe to the comments on this issue, or at least give it a thumbs up (you need to be logged in on GitLab.com to do that).

Cause #2: Outages caused by bugs and infrastructure problems

A couple of facts about GitLab.com infrastructure:

  1. GitLab.com sits on top of NFS.
  2. Almost every operation in GitLab involves git and is I/O-heavy.
  3. GitLab is gaining more and more popularity, which means more users, more projects, and therefore more load on the servers.

Shit happens to everyone

Unfortunately, outages are a common problem for cloud services. We have all seen GitHub unicorning from time to time. Even giants like AWS sometimes go down, leaving lots of companies without infrastructure.

GitLab is growing fast, and sometimes the infrastructure can’t keep up with the growing demand.

The good thing is that every incident gets documented in an issue in GitLab’s infrastructure issue tracker, posted on Twitter, and sometimes turned into a blog post.

Stop yelling at cloud

If GitLab becomes so critical to your company’s development infrastructure that you are ready to start complaining on Twitter, it means it is time for you to take care of the availability of your critical services yourself.

And with GitLab, you can do that at literally no additional cost. GitLab CE should be more than enough if your company has fewer than 100 developers.

Aside: make sure to use export/import to copy your projects from GitLab.com to your own instance. With the sync feature, you can even keep the repositories synchronized.

Summary

  1. There are two causes of downtime on GitLab.com: database migrations during deployments, and infrastructure problems.
  2. GitLab is aware of the problem and is looking for a way to implement zero-downtime deployments. Your help is welcome.
  3. Shit happens. GitLab.com is not an exception.
  4. If you are not satisfied with the free GitLab.com, take care of the infrastructure yourself: move to your own server, like Codepen did.
