AlliedModders (https://forums.alliedmods.net/index.php)
-   News (https://forums.alliedmods.net/forumdisplay.php?f=59)
-   -   Downtime Over (https://forums.alliedmods.net/showthread.php?t=148196)

BAILOPAN 01-22-2011 09:11

Downtime Over
 
Hello, Everyone.

I'd like to talk about what happened this week, what the current state of recovery is, and what you can do to help.

If you don't want to read, the bottom line is: we need your help! Our webserver has gone through a bit of shock.

The dirty secret is that we have always kept the donation goal significantly less than our actual costs. In the past, I felt like we should get by on what we can, without asking more of people. Perhaps that was the right attitude a few years ago, but now the community has really grown. That's awesome! But it means we have to be more proactive and responsible about our infrastructure.

So, if you want to help, please donate! We need to upgrade our hardware, backup capabilities, and more.

I'll be talking more over the next few weeks as we bring things online and start on longer-term improvements.

What Happened

Early Wednesday morning, all AlliedModders websites became very slow. We'd come to recognize this as an intermittent problem, usually causing site errors, and always characterized by extremely high disk I/O wait times. What we didn't realize is that our primary hard drive had been failing, and on Wednesday it failed completely.

We did not have RAID, so it quickly became a worst-case situation. We had partial backups, but I didn't know what was included. The backup system wouldn't let me see without having a working operating system. So I decided the best course was to keep the server offline and try to copy as much data as I could before the drive completely failed. But the drive quickly degraded so much that I decided it was best not to attempt anything further.

Meanwhile, the communication channel with our provider wasn't good. I now know how to deal with this better in the future, but suffice it to say we wasted a lot of time. I didn't want to replace the drive without first securing physical ownership of the old one, in order to send it to a recovery service. We got that negotiated on Thursday night. Then we had the drive replaced and an identical one added for RAID-1.

Very, very early Friday morning, I reinstalled the operating system and restored our partial backups.

Recovery

The damage report is pretty good. Our partial backups had enough to restore:
  • Forums, avatars, most attachments (ONLINE)
  • SourceMod Sites (ONLINE)
  • Metamod:Source, AMX Mod X sites (ONLINE)
  • AM Wiki (ONLINE)
  • @alliedmods.net e-mails
Our partial backups did not include:
  • Bugzilla
  • @alliedmods.net e-mail service
  • WC3Mods
  • Superhero Mod
  • AMXBans
  • UAIO
  • CSDM/CS:S DM
  • Some forum attachments (possibly from Monday through Wednesday)
What was not affected:
  • Source Code repositories, hgweb (ONLINE)
  • Buildbots

This list isn't comprehensive. Our partial backups left out anything that could otherwise be easily recovered, so a lot of our infrastructure may simply be broken. Files might be missing, pages might not work, services might be down, etc. I will try to list those in a second post, and cross them off as they come back online.

Why didn't you do X, Y, Z, etc?

I've gotten a lot of suggestions, rants, and complaints from people about various things over the past few days. Why didn't we have RAID? Why didn't we do complete backups? Why don't we switch hosting? Some of it has been really helpful. I especially owe MatthiasVance, asherkin, devicenull and others in #smdevs and #sourcemod for their advice.

It's important to put this site into perspective. It started out of my first college dorm room. It was a computer sitting next to my desktop, made from scrap parts. When it broke, we had our first donations drive to buy a new server. In 2005, we started renting a dedicated server. There was no way I could afford it as a college student, and we worked out a deal with SteamFriends (then, GameConnect) to be sponsored. That ended in 2006.

We've always run things on a tight budget, and our whole motto is kind of, "We're scrappy, but we get things done!" We didn't have any backups at all until 2008. Off-site backup charges by the GB, so I was pretty selective in choosing what to back up. We didn't have a drive fail until 2010.

But it's clear we as a community have grown really big, and that's awesome. We almost always meet the donation goal, which is a spectacular testament to how much people care about the project. It sucks when things like this happen. So immediately, here's what I'm doing:
  • We now use RAID.
  • We will begin backing up things that were missed by the partial backup scheme.
  • The old drive is being sent to a data recovery service. Hopefully we can get more data back.
  • We will start running monitoring software to detect future problems.
  • The site will be bumpy over the next few weeks as little missing pieces are discovered.
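
For illustration, the high disk I/O wait symptom described under "What Happened" is the kind of thing a monitoring check can watch for. The following is only a minimal sketch in Python, assuming a Linux host that exposes /proc/stat. The function names and the 30% threshold are hypothetical, and this is not the monitoring software referred to in the list above.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent waiting on I/O between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    THRESHOLD = 30.0  # hypothetical alert level, not a recommendation
    first = read_cpu_line()
    time.sleep(5)
    second = read_cpu_line()
    percent = iowait_percent(first, second)
    status = "WARNING" if percent > THRESHOLD else "ok"
    print(f"{status}: iowait at {percent:.1f}%")
```

Run from cron, a check like this could send a warning when I/O wait stays high. High I/O wait alone does not prove a drive is failing, so it would be a prompt to investigate rather than a diagnosis.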

Thanks for your patience and support. I'll answer questions in this thread, or by e-mail if you're more comfortable with that.

BAILOPAN 01-22-2011 09:11

Re: Downtime
 
Not yet functional:
  • Symbol servers (NO DATA)
  • Stat Query cronjob
  • Log rotating
  • E-mail

asherkin 01-22-2011 09:19

Re: Downtime Over
 
I would like to be the first to congratulate you on dealing with this calmly (stressful event is stressful :P) and getting stuff up and running again as quickly as possible.

Nice Work!

Zylius 01-22-2011 10:13

Re: Downtime Over
 
Nicely Done :)

Arkshine 01-22-2011 10:18

Re: Downtime Over
 
Thanks for your hard work, BAILOPAN. I will gladly donate.

rautamiekka 01-22-2011 10:19

Re: Downtime Over
 
What happened to the stuff the backups didn't include?

Rautamiekka File Server is glad to help by storing any data.

MindeLT 01-22-2011 10:56

Re: Downtime Over
 
good to see you back :)

MindeLT 01-22-2011 11:09

Re: Downtime Over
 
good to see you back :)

hlstriker 01-22-2011 11:12

Re: Downtime Over
 
Good to see the websites are back online!

Thanks to everyone helping for their hard work :)

Malachi 01-22-2011 11:14

Re: Downtime Over
 
Bail,

Let me know if you guys need any help with hardware. I sometimes have access to older servers that our company throws away...

-Mal.


All times are GMT -4.