Thread: Downtime Over
BAILOPAN
Join Date: Jan 2004
Old 01-22-2011 , 09:11   Downtime Over
#1

Hello, Everyone.

I'd like to talk about what happened this week, what the current state of recovery is, and what you can do to help.

If you don't want to read, the bottom line is: we need your help! Our webserver has gone through a bit of a shock.

The dirty secret is that we have always kept the donation goal significantly less than our actual costs. In the past, I felt like we should get by on what we can, without asking more of people. Perhaps that was the right attitude a few years ago, but now the community has really grown. That's awesome! But it means we have to be more proactive and responsible about our infrastructure.

So, if you want to help, please donate! We need to upgrade our hardware, backup capabilities, and more.

I'll be talking more over the next few weeks as we bring things online and start on longer-term improvements.

What Happened

Early Wednesday morning, all AlliedModders websites became very slow. We'd come to recognize this as an intermittent problem, usually causing site errors, and always characterized by extremely high disk I/O wait times. What we didn't realize is that our primary hard drive had been failing, and on Wednesday it failed completely.

We did not have RAID, so it quickly became a worst-case situation. We had partial backups, but I didn't know what was included. The backup system wouldn't let me see without having a working operating system. So I decided the best course was to keep the server offline and try to copy as much data as I could before the drive completely failed. But the drive quickly degraded so much that I decided it was best not to attempt anything further.

Meanwhile, the communication channel with our provider wasn't good. I now know how to deal with this better in the future, but suffice it to say we wasted a lot of time. I didn't want to replace the drive without first securing physical ownership of the old one, in order to send it to a recovery service. We got that negotiated on Thursday night. Then we had the drive replaced and an identical one added for RAID-1.

Very, very early Friday morning, I reinstalled the operating system and restored our partial backups.

Recovery

The damage report is pretty good. Our partial backups had enough to restore:
  • Forums, avatars, most attachments (ONLINE)
  • SourceMod Sites (ONLINE)
  • Metamod:Source, AMX Mod X sites (ONLINE)
  • AM Wiki (ONLINE)
  • @alliedmods.net e-mails
Our partial backups did not include:
  • Bugzilla
  • @alliedmods.net e-mail service
  • WC3Mods
  • Superhero Mod
  • AMXBans
  • UAIO
  • CSDM/CS:S DM
  • Some forum attachments (possibly from Monday through Wednesday)
What was not affected:
  • Source Code repositories, hgweb (ONLINE)
  • Buildbots

This list isn't comprehensive. Our partial backups left out anything that could otherwise be easily recovered, so a lot of our infrastructure may simply be broken. Files might be missing, pages might not work, services might be down, etc. I will try to list those in a second post, and cross them off as they come back online.

Why didn't you do X, Y, Z, etc.?

I've gotten a lot of suggestions, rants, and complaints from people about various things over the past few days. Why didn't we have RAID? Why didn't we do complete backups? Why don't we switch hosting? Some of it has been really helpful. I especially owe MatthiasVance, asherkin, devicenull and others in #smdevs and #sourcemod for their advice.

It's important to put this site into perspective. It started out of my first college dorm room. It was a computer sitting next to my desktop, made from scrap parts. When it broke, we had our first donations drive to buy a new server. In 2005, we started renting a dedicated server. There was no way I could afford it as a college student, and we worked out a deal with SteamFriends (then, GameConnect) to be sponsored. That ended in 2006.

We've always run things on a tight budget, and our whole motif is kind of, "We're scrappy, but we get things done!" We didn't have any backups at all until 2008. Off-site backup charges by the GB, so I was pretty selective in choosing what to back up. We didn't have a drive fail until 2010.

But it's clear we as a community have grown really big, and that's awesome. We almost always meet the donation goal, which is a spectacular testament to how much people care about the project. It sucks when things like this happen. So immediately, here's what I'm doing:
  • We now use RAID.
  • We will begin backing up things that were missed by the partial backup scheme.
  • The old drive is being sent to a data recovery service. Hopefully we can get more data back.
  • We will start running monitoring software to detect future problems.
  • The site will be bumpy over the next few weeks as little missing pieces are discovered.

Thanks for your patience and support. I'll answer questions in this thread, or by e-mail if you're more comfortable with that.
__________________
egg

Last edited by Fyren; 01-23-2011 at 20:41.