
Downtime Over


BAILOPAN
Join Date: Jan 2004
Old 01-22-2011 , 09:11   Downtime Over
#1

Hello, Everyone.

I'd like to talk about what happened this week, what the current state of recovery is, and what you can do to help.

If you don't want to read, the bottom line is: we need your help! Our webserver has gone through a bit of shock.

The dirty secret is that we have always kept the donation goal significantly less than our actual costs. In the past, I felt like we should get by on what we can, without asking more of people. Perhaps that was the right attitude a few years ago, but now the community has really grown. That's awesome! But it means we have to be more proactive and responsible about our infrastructure.

So, if you want to help, please donate! We need to upgrade our hardware, backup capabilities, and more.

I'll be talking more over the next few weeks as we bring things online and start on longer-term improvements.

What Happened

Early Wednesday morning, all AlliedModders websites became very slow. We'd come to recognize this as an intermittent problem, usually causing site errors, and always characterized by extremely high disk I/O wait times. What we didn't realize is that our primary hard drive had been failing, and on Wednesday it failed completely.

We did not have RAID, so it quickly became a worst-case situation. We had partial backups, but I didn't know what was included, and the backup system wouldn't let me see without a working operating system. So I decided the best course was to keep the server offline and try to copy as much data as I could before the drive completely failed. But the drive quickly degraded so much that I decided it was best not to attempt anything further.

Meanwhile, the communication channel with our provider wasn't good. I now know how to deal with this better in the future, but suffice it to say we wasted a lot of time. I didn't want to replace the drive without first securing physical ownership of the old one, in order to send it to a recovery service. We got that negotiated on Thursday night. Then we had the drive replaced and an identical one added for RAID-1.

Very, very early Friday morning, I reinstalled the operating system and restored our partial backups.

Recovery

The damage report is pretty good. Our partial backups had enough to restore:
  • Forums, avatars, most attachments (ONLINE)
  • SourceMod Sites (ONLINE)
  • Metamod:Source, AMX Mod X sites (ONLINE)
  • AM Wiki (ONLINE)
  • @alliedmods.net e-mails
Our partial backups did not include:
  • Bugzilla
  • @alliedmods.net e-mail service
  • WC3Mods
  • Superhero Mod
  • AMXBans
  • UAIO
  • CSDM/CS:S DM
  • Some forum attachments (possibly from Monday through Wednesday)
What was not affected:
  • Source Code repositories, hgweb (ONLINE)
  • Buildbots

This list isn't comprehensive. Our partial backups don't have anything that could otherwise be easily recovered, so a lot of our infrastructure may simply be broken. Files might be missing, pages might not work, services might be down, etc. I will try to list those in a second post, and cross them off as they come back online.

Why didn't you do X, Y, Z, etc.?

I've gotten a lot of suggestions, rants, and complaints from people about various things over the past few days. Why didn't we have RAID? Why didn't we do complete backups? Why don't we switch hosting? Some of it has been really helpful. I especially owe MatthiasVance, asherkin, devicenull and others in #smdevs and #sourcemod for their advice.

It's important to put this site into perspective. It started out of my first college dorm room. It was a computer sitting next to my desktop, made from scrap parts. When it broke, we had our first donation drive to buy a new server. In 2005, we started renting a dedicated server. There was no way I could afford it as a college student, and we worked out a deal with SteamFriends (then, GameConnect) to be sponsored. That ended in 2006.

We've always run things on a tight budget, and our whole motto is kind of, "We're scrappy, but we get things done!" We didn't have any backups at all until 2008. Off-site backup is charged by the GB, so I was pretty selective in choosing what to back up. We didn't have a drive fail until 2010.

But it's clear we as a community have grown really big, and that's awesome. We almost always meet the donation goal, which is a spectacular testament to how much people care about the project. It sucks when things like this happen. So immediately, here's what I'm doing:
  • We now use RAID.
  • We will begin backing up things that were missed by the partial backup scheme.
  • The old drive is being sent to a data recovery service. Hopefully we can get more data back.
  • We will start running monitoring software to detect future problems.
  • The site will be bumpy over the next few weeks as little missing pieces are discovered.

Thanks for your patience and support. I'll answer questions in this thread, or by e-mail if you're more comfortable with that.
__________________
egg

Last edited by Fyren; 01-23-2011 at 20:41.
BAILOPAN
Join Date: Jan 2004
Old 01-22-2011 , 09:11   Re: Downtime
#2

Not yet functional:
  • Symbol servers (NO DATA)
  • Stat Query cronjob
  • Log rotating
  • E-mail
__________________
egg

Last edited by Fyren; 01-22-2011 at 17:29.
asherkin
SourceMod Developer
Join Date: Aug 2009
Location: OnGameFrame()
Old 01-22-2011 , 09:19   Re: Downtime Over
#3

I would like to be the first to congratulate you on dealing with this calmly (stressful event is stressful) and getting stuff up and running again as quickly as possible.

Nice Work!
Zylius
SourceMod Donor
Join Date: Nov 2009
Old 01-22-2011 , 10:13   Re: Downtime Over
#4

Nicely Done
Arkshine
AMX Mod X Plugin Approver
Join Date: Oct 2005
Old 01-22-2011 , 10:18   Re: Downtime Over
#5

Thanks for your hard work, BAILOPAN. I will gladly donate.
rautamiekka
Veteran Member
Join Date: Jan 2009
Location: Finland
Old 01-22-2011 , 10:19   Re: Downtime Over
#6

What happened to the stuff the backups didn't include?

Rautamiekka File Server is glad to help by storing any data.
__________________
Links to posts I received Karma from:
Big thanks to all who gave Karma
MindeLT
Senior Member
Join Date: Dec 2010
Location: Lithuania
Old 01-22-2011 , 10:56   Re: Downtime Over
#7

good to see you back
MindeLT
Senior Member
Join Date: Dec 2010
Location: Lithuania
Old 01-22-2011 , 11:09   Re: Downtime Over
#8

good to see you back
hlstriker
Green Gaben
Join Date: Mar 2006
Location: OH-IO!
Old 01-22-2011 , 11:12   Re: Downtime Over
#9

Good to see the websites are back online!

Thanks to everyone helping for their hard work
Malachi
Senior Member
Join Date: Jun 2010
Location: USA
Old 01-22-2011 , 11:14   Re: Downtime Over
#10

Bail,

Let me know if you guys need any help with hardware. I sometimes have access to older servers that our company throws away...

-Mal.
All times are GMT -4.