My Server Crash: What the Hell Happened? Part 1 of 2

April 28, 2008 Posted by Tyler Cruz

Trinity: Shall we play a game?
Tyler: Yeah. How about Global Thermonuclear War?
Trinity: Wouldn’t you prefer a nice game of chess?
Tyler: Later. Right now let’s play Global Thermonuclear War.
Trinity: Fine.

Okay, so that wasn’t exactly what happened, but I thought it might bring a good laugh to movie buffs. Three weeks ago, all my sites went down and stayed down for an entire week.

Here’s what happened:

On the afternoon of April 6th, I noticed that I couldn’t connect to one of my websites. I wasn’t alarmed in the least, since I had been having a lot of server problems lately. The server administration team I pay could never figure out what was wrong or what might be causing it, other than that I had a very old setup. Fortunately, whenever my server went down it was always an easy fix, almost always requiring nothing more than an Apache restart.

However, when I tried to SSH in, I was given an error saying I couldn’t connect. This usually meant that the server needed to be physically restarted by hand. So I opened a ticket with ThePlanet, the company I host my servers with, and waited for them to reboot it. I was still not worried at this point, as this had been happening about once a week for the past couple of months.

However, after a couple of hours, the technician replied saying that he couldn’t even log in to the console on the physical machine. He took this photo of what the server was displaying:

[Photo: the server’s console display]

Basically, the server was completely broken and all the data had been completely wiped.

The series of events that followed basically created the perfect storm. I can’t stress enough how much Murphy’s Law played a role in all this, as it would normally never take anywhere near as long to restore my sites.

First off, yes, I did have backup procedures in place. In fact, I have three levels of backups: I have DiskSync set up on each of my servers, with the backups stored in an offsite datacenter; I have backups on my PC at home; and I have monthly backups made onto password-protected DVDs as well. For a very detailed description of DiskSync, check out my post Backing Up: Better Safe than Sorry from when I originally set it up (it’s a really good technical read).

So, the only way that I’ll ever lose ALL my data is if both datacenters (located in Dallas, Texas) somehow blow up and my condo in BC, Canada gets set on fire… all at the same time. I’m probably much more likely to win $100 million in the lottery, so I was never terrified that all my data might be lost forever. The shameful part of all this is that my sites were down for an entire week, which is an eternity online.

I should also mention that I have two dedicated servers. One of them basically runs Apache and the sites, and the other is designated for MySQL usage. Trinity was the name of the server that ran Apache, and Bertha is the name of the one that runs MySQL. It was Trinity that was affected, which meant that all the really important data (blog entries, forum posts, etc.) was still perfectly intact on Bertha.

Anyway, Trinity crashed on the afternoon of April 6th, and I received word from ThePlanet’s technician late that afternoon that all my data was wiped. This was terrible news, but I didn’t think it was the end of the world since I had DiskSync set up, which basically meant that I should have been able to do a quick restore of the server. However, since Trinity was completely non-responsive, it would require an OS reload.

Before ordering an OS reload, I decided that this was a good time to finally upgrade Trinity to a brand new machine. I bought Trinity five years ago, back in 2003, so she was pretty ancient. The only reason I hadn’t upgraded her sooner was that it was too much of a hassle to do so. Since she was already down and required an OS reload anyway, I took advantage of the situation and bought a brand new machine.

Fortunately for me, my timing was good, as ThePlanet had a private undisclosed special going on (I still don’t understand why they would have an undisclosed special… it makes absolutely no sense…) for one of their servers: an IDE Intel Dual Xeon 2.8GHz with 2GB of RAM, 2 x 80GB hard drives, and 2500GB of monthly bandwidth for only $147/month. The special was only running for a short time, and if I purchased now, I’d get to pay that price for life. The normal price is $258/month, so I was basically getting about 43% off for life. That is the only good thing that came out of this whole mess.

After I purchased the new server (which I nicknamed Abby), I spoke to ThePlanet’s sales team again to ask what they recommended I do in my situation. The sales rep suggested that I order a new hard drive on Trinity so that they could try to transfer my old data over and then put the hard drive into Abby. I agreed, and waited for my new server and the new disk on Trinity to be built and set up.

All of this happened on April 6th. Why do I keep repeating that date? Well, because I was going on vacation to the River Rock Casino Resort on April 7th. In fact, I had to get up around 6:30am in order to catch the 7:30am ferry. I ended up staying up until around 4:30am trying to organize the server fix, and I still had to finish packing when I woke up. This is just one of the many problems that turned this into a huge mess.

Fortunately, I was taking the Wandering Labourer (my laptop) with me on the trip, which gave me some comfort as I could orchestrate the restoration from my hotel room.

My new server, DiskSync backup, and hard drive on Trinity were all set up the next day, so I contacted ThePlanet to transfer my hard drive from Trinity over to Abby, as the saleswoman had told me the previous day. Unfortunately, they refused to do this. After asking why, I got this ridiculous answer: “I’m afraid this is impossible – your new server is in a completely different datacenter, far away from here.” I was very angry and told them that I ordered the extra hard drive specifically because the saleswoman recommended I do it so it could be transferred over. My pleas were in vain, however, since I was later told that even if it were in the same datacenter, it is their policy not to do hard drive swaps between servers. So, that was a waste of $100 right there.

Anyhow, I contacted my server administration company to restore all my sites and transfer them over from Trinity to Abby. This server administration company, which I will not name here for fear of some form of reprisal, ended up being the #1 reason the recovery process took so long.

I’ve been using this company for close to two years, and have always paid them a year up front. Basically, by paying them a monthly (or yearly in my case) fee, you get unlimited server administration support from them. In the beginning they were great – fast and pretty knowledgeable, but over time they started taking longer and longer to respond, and their skill level seemed to greatly diminish as well, although it was never that great.

While I was at the River Rock playing poker and just generally having fun, I left the task of restoring my server in the hands of this company, but I helped out as much as I could. I was even on my way out to eat breakfast one day when I got an e-mail on my Blackberry from the technician telling me he needed me ASAP, so I immediately turned around and actually jogged back through the casino to my hotel room. (If you ever want to make security nervous, jog through a casino.)

I would check in with the “technician” (if you could call him that) whenever I came back to my hotel room to get a progress report, and I was told that progress was being made. However, it was getting harder and harder to get a reply from him. To make a long story short, it eventually got so bad that I was constantly sending IMs and e-mails trying to get somebody from the company to finally restore my data and fix my server. I even phoned them several times and got no reply. I even bought “911” service, which meant that I was paying $20 an hour (much cheaper than ThePlanet’s $150/hour) for emergency help. But in the end, I was basically treated like just another customer who had a small problem, such as needing to install a PHP module or something. I stressed to them many times how important this was and that it was an emergency, and asked why I bothered to pay for their services if they wouldn’t even help me when it was most crucial.

Finally, after a week of accomplishing nothing, the technician the company had assigned to fix my server told me that he was sorry, but that all my data was lost forever and that he couldn’t restore it.

I couldn’t believe it… I had DiskSync, so how could he not restore the data from it? DiskSync was my failsafe, even if the data was completely wiped. By now, I had returned from my trip and was finally back home.

Unfortunately, as part of the whole “perfect storm” scenario, I got sick as soon as I got back home (probably picking it up from one of the players at the poker tables). It started as a really sore throat but later progressed into a fever. I was not in the mood to work, but did as much as I could…

Stay tuned for the second half of the story where I’ll explain how everything ended up getting fixed, outline the effects of the crash in terms of revenue and traffic, and give a list of actual benefits that came out of the whole ordeal.

Posted: April 28th, 2008 under My Websites  

26 Responses to “My Server Crash: What the Hell Happened? Part 1 of 2”

  1. Clog Money says:

    This is one of the most interesting posts I have ever read. For a start I am shocked that this could happen. I work for a telecoms company and we have servers generating thousands of pounds every minute for customers and some don’t pay as much for server support as you do. I am shocked that something could take so long, especially as it appears to be some kind of hard drive failure. Why was there no RAID in place?

    Unfortunately the more I participate in the world of business the more I realize how many companies are run by cowboys and monkeys. I really hope you will take this further with your hosting company. I would definitely not be as calm about the situation as you appear to be.

  2. Wow, crazy story! Can’t wait to read the rest of it.. I was wondering what had happened while you were gone, didn’t realize it was quite so catastrophic.

    Just curious, why a separate server for Apache and MySQL?

    • Clog Money says:

      It often makes sense to have the web server and database on separate servers. It helps you primarily with load sharing and backup. It all depends on how you want your systems to operate. Personally I would have mysql running on both the web server and its own machine and have the database replicated between the two.

  3. Mubin says:

    Server admin companies are a joke, they are outsourced to India or China, and the only thing they are really good at is wasting time.

    You need to get a dedicated box with thewird.

  4. Finally, we get to read some posts about your week-long downtime! A good read, and I look forward to reading part 2.

    – Martin Reed

  5. That is why I do everything myself. I don’t trust other people to do anything for me, because I know that they will never dedicate themselves to my problem the way I can.

    But you are starting to convince me to have more backups of my data.

  6. Hey glad to hear the details. We’ll have to see the rest. I hope you can follow-up with a new detailed post on an easy 1-2-3 back-up plan. Maybe a .pdf for boneheads like me.

  7. Jimson Lee says:

    Interesting!

    My advice is to find a hosting provider at Harbour Centre, Vancouver, as they can provide dedicated servers between $149 – $249 range.

    At least you’ll be a ferry ride away from your servers. Plus, there are some inexpensive Server Admin companies that do the job well under $150/hour.

    I won’t mention companies in case people think I’m spamming!

  8. ToddW says:

    I guess by not paying that $150/hr to ThePlanet you really saved a lot of money by having all your sites offline making no money ;)

  9. Bill Gere says:

    This was an excellent post Tyler. It’s very entertaining (since we know everything turned out alright) and I can’t wait to read part two. I must say that your post makes me want to back up all of my sites ASAP! Have you ever considered just biting the bullet and doing everything yourself?

    • ToddW says:

      What’s wrong with good ol’ RAID1?

      If a hard drive fails, TP will install a new drive and reconfigure the RAID, and you’ll be up in under 4 hours with your sites live again… even on Christmas (ask me how I know).

      What I do now is overkill but works great and can even let you run your sites on your back-up server if for some odd reason server 1 datacenter/network is down for a long period of time.

      1. Primary server has RAID1.
      2. Primary backs up to back-up/dev server.
      3. Back-up/Dev server has RAID1
      4. I download weekly backups from the back-up/Dev server and store them locally.
      5. Monthly I burn DVD back-ups and store them off-site/fireproof.

  10. Mike Huang says:

    It really does suck that you had to endure all those problems. Sometimes paying more for server hosting could save you rather than rip a hole out of your pockets :) I’m currently with Hostgator and they’re great in support! :)

    -Mike

  11. Chris says:

    I guess this is what happens when you pay such a small price for a dedicated server/team. I pay $450 per month and many of my friends laugh that I’m getting ripped off, but I’m paying for the service and peace of mind rather than the bandwidth space. I use Rackspace, and can’t recommend them enough.

    Glad it’s all sorted now though Tyler, and I look forward to reading part 2 :)

  12. This is one hell of an experience, dude! I was so shocked to read the headline and then read that all the data was wiped… God! How did you save yourself from a heart attack? I am eagerly waiting for your part 2, Tyler.

  13. Tyler, you made a few critical errors which I hope you have corrected:

    1. You are using theplanet as your server provider. Don’t take my word for it, check them out on Google, plenty of horror stories. You get what you pay for!

    2. You didn’t have RAID-1 on your server. You should always get RAID-1 and you will always have a mirror of your data if your primary drive fails. Yes, there are some caveats to this, but never use a single drive system, get RAID-1.

    3. You are using a hosting company that does not give you managed support. If you used a better company, which you paid a little more in hosting fees, they would do simple admin tasks for you like restarting Apache, repairing your server when it failed, etc. Most reputable companies do this, and yes, they charge a little more than theplanet does.

    Before you say it, I am not telling you these things because I am link dropping for our company. I am telling you these things because I see this kind of thing happen all the time and I think you could use some honest, reliable advice so that you don’t find yourself in this situation again.

    Regards,
    Richard.

  14. KushMoney says:

    I must say from the start I knew this was going to be a good read. I don’t see how you could be so calm. I am a very calm person but I don’t think I could be as calm as you to handle this problem.

    Some parts of this made me say you are smart and other parts made me laugh. Reading that your sites were down for a week, no traffic, no REVENUE. Now that made me sad for you.

    I can’t wait to read how this was fixed.

  15. Leo says:

    I first read this happened to you by reading John Chow’s blog. This is actually my first comment here and it is to say that it is one of the most interesting ever.

    I promise I will make a backup of all my sites. Hope you are back on track already, despite all the revenue losses this might have caused you.

  16. Richard says:

    Wow mate. It’s amazing to see how much hassle has been caused by this, not just to you but to many different people.

    Glad everything is back to normal though, looking forward to part 2. Love the style you’ve written the article in and I’m definitely going to return to your blog to read further articles.

    Good luck with getting your earnings back on track!

  17. [...] Young entrepreneur Tyler Cruz speaks about his down time and the effects of [...]

  18. Josh Buckley says:

    Wow! This is quite scary… reminds me to do the best in my ability to keep my server safe.

  19. Wade says:

    BACK UP, BACK UP.. ALWAYS, PEOPLE, or you will end up like Tyler… (Your stalkers are probably saying, then I will never back up.. Fabulous!)

    Shudogg Dot Com – Make Money Online Blogging

  20. [...] anytime. Its uncalled for and unwanted too! Recently I saw two blogs suffering tragedies. One was TylerCruz.com and other one was GatherSuccess.com Former one suffered at the hands of inefficient server [...]

  21. [...] worst of it all was when my server was completely wiped out in April of 2008. ThePlanet was of little-to-no-help in restoring my data and after being told [...]
