Trinity: Shall we play a game?
Tyler: Yeah. How about Global Thermonuclear War?
Trinity: Wouldn’t you prefer a nice game of chess?
Tyler: Later. Right now let’s play Global Thermonuclear War.
Okay, so that wasn’t exactly what happened, but I thought it might bring a good laugh to movie buffs. Three weeks ago, all my sites went down and stayed down for an entire week.
Here’s what happened:
On the afternoon of April 6th, I noticed that I couldn’t connect to one of my websites. I wasn’t alarmed in the least, since I had been having a lot of server problems lately. The server administration team I pay could never figure out what was wrong or what might be causing it, other than that I had a very old setup. Fortunately, whenever my server went down it was an easy fix, almost always requiring nothing more than an Apache restart.
However, when I tried to SSH in, I got an error saying I couldn’t connect. This usually meant that the server needed to be physically restarted by hand. So, I opened a ticket with ThePlanet – the company I host my servers with – and waited for them to reboot it. I was still not worried at this point, as this had been happening about once a week for the past couple of months.
However, after a couple of hours, the technician replied to say that he couldn’t even log in to the console on the physical machine. He took this photo of what the server was displaying:
Basically, the server was completely broken and all of its data had been wiped.
The series of events that followed created the perfect storm. I can’t stress enough how much Murphy’s Law played a role in all this, as it would normally never take anywhere near as long to restore my sites.
First off, yes, I did have backup procedures in place. In fact, I have three levels of backups: DiskSync set up on each of my servers, backing up to an offsite datacenter; backups on my PC at home; and monthly backups on password-protected DVDs. For a very detailed description of DiskSync, check out my post Backing Up: Better Safe than Sorry, from when I originally set it up (it’s a really good technical read).
So, the only way that I’ll ever lose ALL my data is if both datacenters (located in Dallas, Texas) somehow blow up and my condo in BC, Canada gets set on fire… all at the same time. I’m probably much more likely to win 100 million in the lottery, so I was never terrified that all my data might be lost forever. The shameful part of all this is that my sites were down for an entire week, which is an eternity online.
I should also mention that I have two dedicated servers. One of them runs Apache and the sites, and the other is dedicated to MySQL. Trinity was the name of the server that ran Apache, and Bertha is the name of the one that runs MySQL. It was Trinity that was affected, which meant that all the really important data (blog entries, forum posts, etc.) was still perfectly intact on Bertha.
Anyway, Trinity crashed on the afternoon of April 6th, and I received word from ThePlanet’s technician late that afternoon that all my data was wiped. This was terrible news, but I didn’t think it was the end of the world since I had DiskSync set up, which meant that I should have been able to do a quick restore of the server. However, since Trinity was completely unresponsive, it would require an OS reload.
Before ordering an OS reload, I decided that this was a good time to finally upgrade Trinity to a brand new machine. I bought Trinity five years ago, back in 2003, so she was pretty ancient. The only reason I hadn’t upgraded her sooner was that it was too much of a hassle. Since she was already down and required an OS reload anyway, I took advantage of the situation and bought a brand new machine.
Fortunately for me, my timing was lucky, as ThePlanet had a private undisclosed special going on (I still don’t understand why they would have an undisclosed special… it makes absolutely no sense…) for one of their servers: an IDE Intel Dual Xeon 2.8GHz with 2GB of RAM, 2 x 80GB Hard Drives, and 2500GB of monthly bandwidth for only $147/month. The special was only running for a short time, and if I purchased during it, I’d get to pay that price for life. The normal price is $258/month, so I was getting it at roughly 43% off for life. That is the only good thing that came out of this whole mess.
After I purchased the new server (which I nicknamed Abby), I spoke to ThePlanet’s sales team again to ask them what they recommended I do in my situation. The sales rep suggested that I order a new hard drive on Trinity so that they could try to transfer my old data over and then put the hard drive into Abby. I agreed, and waited for my new server and the new disk on Trinity to get set up, as the servers have to be built and set up.
All of this happened on April 6th. Why do I keep repeating that date? Well, because I was going on vacation to the River Rock Casino Resort on April 7th. In fact, I had to get up around 6:30am in order to catch the 7:30am ferry. I ended up staying up until around 4:30am trying to organize the server fix, which meant I had to finish packing when I woke up. This is just one of the many problems that turned this into a huge mess.
Fortunately, I was taking the Wandering Labourer (my laptop) with me on the trip, which gave me some comfort as I could orchestrate the restoration from my hotel room.
My new server, DiskSync backup, and hard drive on Trinity were all set up the next day, so I contacted ThePlanet to transfer my hard drive from Trinity over to Abby, as the saleswoman had suggested the previous day. Unfortunately, they refused to do this. After asking why, I got this ridiculous answer: “I’m afraid this is impossible – your new server is in a completely different datacenter, far away from here.” I was very angry and told them that I ordered the extra hard drive specifically because the saleswoman recommended I do it so it could be transferred over. My pleas were in vain, however, since I was later told that even if both servers were in the same datacenter, it is their policy not to do hard drive swaps between servers. So, that was a waste of $100 right there.
Anyhow, I contacted my server administration company to restore all my sites and transfer them over from Trinity to Abby. This server administration company, which I will not name here for fear of some form of possible reprisal, ended up being the #1 reason that made the recovery process take so long.
I’ve been using this company for close to two years, and have always paid them a year up front. Basically, by paying them a monthly (or yearly in my case) fee, you get unlimited server administration support from them. In the beginning they were great – fast and pretty knowledgeable, but over time they started taking longer and longer to respond, and their skill level seemed to greatly diminish as well, although it was never that great.
While I was at the River Rock playing poker and just generally having fun, I left the task of restoring my server in the hands of this company, but I helped out as much as I could. I was even on my way out to eat breakfast one day when I got an e-mail on my BlackBerry from the technician telling me he needed me ASAP, so I immediately turned around and actually jogged back through the casino to my hotel room (if you ever want to make security nervous, jog through a casino).
I would check in with the “technician” (if you could call him that) whenever I came back to my hotel room to get a progress report, and I was told that progress was being made. However, it was getting harder and harder to get a reply from him. To make a long story short, it eventually got so bad that I was constantly sending IMs and e-mails trying to get somebody from the company to finally restore my data and fix my server. I phoned them several times and got no reply. I even bought “911” service, which meant that I was paying $20 an hour (much cheaper than ThePlanet’s $150/hour) for emergency help. But in the end, I was treated like just another customer who had a small problem, such as needing to install a PHP module or something. I stressed to them many times how important this was and that it was an emergency, and asked why I bothered to pay for their services if they wouldn’t even help me when it was most crucial.
Finally, after a week of accomplishing nothing, the technician the company had assigned to fix my server told me that he was sorry, but that all my data was lost forever and that he couldn’t restore it.
I couldn’t believe it… I had DiskSync – how could he not restore the data from DiskSync? DiskSync was my failsafe, even if the data was completely wiped. By now, I had returned from my trip and was finally back home.
Unfortunately, as part of the whole “perfect storm” scenario, I got sick as soon as I got back home (probably picking it up from one of the players at the poker tables). It started as a really sore throat but later progressed into a fever. I was not in the mood to work, but did as much as I could…
Stay tuned for the second half of the story where I’ll explain how everything ended up getting fixed, outline the effects of the crash in terms of revenue and traffic, and give a list of actual benefits that came out of the whole ordeal.