So… the site was down for a few days

Hi everybody, let me start off by saying that it’s good to be back, and even better that I don’t have to try to reconstruct any data.

So on the 6th I got this email from our hosting company:

We will be performing a hardware upgrade on all of our nodes in our current Dallas location on Friday, November 7, 2008. The upgrade window will be from 1:00 AM to 3:00 AM EST. During this time your server will be unavailable for up to 45 minutes.

I didn’t think anything of it until I noticed that the server was down when I checked at 9 AM Friday. I opened a support ticket, and they told me that on restart the server (note that we’re on a VPS on a very, very large RAID array) required an fsck.

As of 11 AM EST they had created a page to track updates on this issue.

Hardware node in Dallas is going through a forced file system check. At this time there is no reason to suspect any possible data loss. The hardware was upgraded at 1:39:04 AM, and when booted back up it required a manual fsck (file system check). Due to the size and complexity of the RAID array, this process has taken several hours and is continuing.

1:02 PM EST (11/7/2008)

File system check is moving along at the expected rate. There is no ETA for its completion at this time. We will update this page every hour until the fsck is complete and all servers are brought back online.

2:25 PM EST (11/7/2008)

We are getting closer to the end of the fsck, but at this time there is still no ETA, since the rate of progress fluctuates. We will have another update before 3:30 PM EST.

3:25 PM EST (11/7/2008)

Fsck is nearing completion, ETA still unknown.

4:30 PM EST (11/7/2008)

Fsck has completed 3 times and is on its 4th pass. On every pass, fsck makes adjustments to the file system and is working correctly, which leads us to not suspect any data loss. Although downtime is an obvious concern of ours, your data is our top priority in this matter, and we must wait for the process to complete fully to avoid any possible corruption.

4:42 PM EST (11/7/2008)

Dallas node BART is back online with no errors. Only one node remains down, and we expect that to come up shortly.

We are lucky enough to be on that last node.

5:56 PM EST (11/7/2008)

The remaining node, MAGGIE, is at 41% on its current fsck. (It was at 31% at 5:29 PM EST.)

6:45 PM EST (11/7/2008)

MAGGIE is at 61% on its current fsck.

7:55 PM EST (11/7/2008)

Fsck is complete. The server is doing a normal file integrity check.

8:24 PM EST (11/7/2008)

10% done on file integrity check.

10:45 PM EST (11/7/2008)

40% done on file integrity check.

12:20 AM EST (11/8/2008)

81.6% done on file integrity check.

I’m seeing a light at the end of the tunnel.

1:45 AM EST (11/8/2008)

MAGGIE had finished all disk checks, but upon the final reboot required another manual fsck. We are investigating this issue to find out what exactly went wrong, and how this can be avoided in the future. Once all servers have been brought online, we will be launching a full investigation to find out exactly how this can be handled better in the future. We will be implementing an RSS feed for times like this to keep our customers up to date. In addition to that, we will also be opening our forums up again, and once our new customer portal is complete, there will be additional lines of communication with our staff. We will be sending out an email within the next 24 hours describing in detail what happened with this hardware upgrade.
We had upgraded ALL of our customer hardware nodes in Dallas, TX. After the upgrade, only two servers forced a manual fsck (both servers had an uptime of about 400+ days). The last of those two servers is coming up now. We will also be investigating why exactly this node took so long to complete its fsck.
We will have an update on this server once it has made significant progress on the 5th pass, probably within the next 2 hours.

NOOOOOOOOOOOOOOOO!

4:26 AM EST (11/8/2008)

There is no progress to report at this time. Fsck is still continuing on this node, in an effort to bring the data and servers back online.

6:30 AM EST (11/8/2008)

There is no progress to report at this time. Fsck is still continuing on this node, in an effort to bring the data and servers back online. There has been progress made, but at this stage there is no graphical representation of status.

8:48 AM EST (11/8/2008)

MAGGIE is at Lost+Found Inode stage, and progressing.

10:51 AM EST (11/8/2008)

MAGGIE is still at Lost+Found Inode stage, and progressing.

1:27 PM EST (11/8/2008)

We are continuing to monitor the progress, but it’s still in the Lost+Found Inode stage (Stage 2). We’ll have an update once it’s finished with this.

4:20 PM EST (11/8/2008)

The process continues. At this time we have no reason to suspect any data loss. The RAID array on this node holds over 1TB of data, which is why the process is taking so long. We are doing everything we can to get this node back online. Once the node comes back online, we will be migrating all servers to an emergency hardware node that we’ve rushed in from our new Dallas DC (InfoMart) to avoid any future issues that may have been caused by the fsck. This process will be transparent to all servers on MAGGIE, as you will keep your IPs and your servers will remain online. Within 12 hours of the node coming up, all servers on MAGGIE will be running VZ4 with the latest kernels. Our SLA guarantee applies to this downtime, and all users affected by this downtime (about 35) will receive a 100% credit for next month’s hosting on the new node.
This has been, by far, the longest downtime in ZONE.NET history, and the first real downtime this node has ever faced (uptime was somewhere above 350+ days according to our records). We’re doing all we can now to prepare for the node coming back online, and will keep you updated with any information we have.

Edit: This has since been upgraded to 3 months’ credit.

8:15 PM EST (11/8/2008)

Server is still going through Lost+Found Inode stage, and progressing. It’s still on the same round of Lost+Found and has not cycled through yet.

10:21 PM EST (11/8/2008)

Server is still going through manual fsck.

2:40 AM EST (11/9/2008)

Server is still going through manual fsck. We will have a lengthy update within a few hours.

It’s about this time that I realized something I should have immediately. When I was dealing with the whole mail-to-MSN issue, it required creating a TXT DNS record, which Network Solutions didn’t support, so I had moved DNS management to our server. With DNS down, emails weren’t being delivered to our @anime-pulse.com accounts, which are actually hosted on Gmail. So I moved DNS back to Network Solutions (discovering in the process that in the past 6 months they have added TXT support) to get email back. During these two days we probably missed some emails, and for that I apologize. Anything sent to the animepulse@gmail.com account was delivered fine.
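
For anyone curious, here’s roughly the kind of check you can run to confirm a TXT record is actually resolving before shuffling DNS around. This is a minimal sketch using the dnspython library, not our actual setup; the expected SPF-style value is just an illustrative assumption.

```python
# A minimal sketch: verify a TXT record is visible in DNS before moving
# mail around. Assumes the dnspython package is installed
# (pip install dnspython); the expected value below is illustrative.
import dns.resolver

def has_txt(domain, expected_substring):
    """Return True if any TXT record on `domain` contains the substring."""
    try:
        answers = dns.resolver.resolve(domain, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    for rdata in answers:
        # A TXT rdata may be split into several character strings; join them.
        text = b"".join(rdata.strings).decode("utf-8", errors="replace")
        if expected_substring in text:
            return True
    return False

# e.g. confirm an SPF-style record survived the move back to Network Solutions
print(has_txt("anime-pulse.com", "v=spf1"))
```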

5:54 AM EST (11/9/2008)

We have started to restore customer backups to the new emergency node that was brought in. We did not have an empty node at the current Dallas location due to the recent opening of our InfoMart location (servers were moved there). Unfortunately, some customer backups are irretrievable from our backup node. When trying to restore these backups, we are getting errors due to corrupted backups. We are unsure of the number of customers whose backups are not available, but we estimate it to be somewhere around 10. Please open a ticket with support, and we will let you know if backups are available for your server. We are still running the manual fsck on the node, and are hopeful of retrieving the data. At this time, there is not much to update in terms of progress. We will give another update once the fsck has finished, with the outcome.

Scary.

9:40 AM EST (11/9/2008)

We have received a swarm of backup requests from all customers on MAGGIE, as expected. We are going through each one individually and restoring these backups as we go. If you have opened a ticket but have not received a response immediately, it is not a sign that we are not working on the issue. We will update your ticket once we have tried to restore your backup.

2:37 PM EST (11/9/2008)

File system check is still going on at this point. In addition to all customers getting SLA credit, we will also be giving customers the option of having a hot-spare VPS server running at all times. This server will have the minimum resources needed to do rsync backups, and in case of any downtime in the future, we will transfer over your IPs and increase resources on that server to allow your sites to run, avoiding any downtime. We will be offering this service to our customers on MAGGIE for 3 months at no charge, and at a discount for the lifetime of the account. We have never faced any downtime as serious as this, and we’re doing all we can to bring the servers back online.

7:19 PM EST (11/9/2008)

ZONE.NET was temporarily down earlier today. Once everything has settled, ZONE.NET will be issuing a full report on what happened during the upgrade, as well as what steps will be taken to prevent this from happening again for any of our customers. After all servers have been brought online, we’ll be implementing an RSS feed to provide faster status updates, and the hot-spare VPS servers will be started. We will also be looking into some sort of CDP solution for our backups, as well as what happened with the several backups that were unable to be restored. We are also implementing a policy to never do any hardware upgrades unless VPS servers are first moved to a new server. In addition to this, in the upcoming week we will be upgrading all of our VPS nodes to VZ4 with the latest kernels. Before any upgrades take place, your servers will be moved individually to a new server to avoid any downtime or potential issues.
This weekend has been a nightmare for ZONE.NET. We haven’t ever faced any serious downtime before, and will take all precautions to avoid anything like this in the future. We will also be making some internal changes to improve response times on requests such as DNS/PTR and things requiring host-node access. We do not take downtime lightly, and if you’re concerned about the future status of your server at ZONE.NET, please contact me directly at richard@zone.net.
At this time there is nothing new to report on the status of MAGGIE. It is still going through the fsck.

11:22 PM EST (11/9/2008)

FSCK is still running. We will continue to let it run for a while longer to try to recover data for those whose backups were corrupted. We have restored servers for those customers who requested it and whose backups did not have errors.

Guess whose backups did have errors? Yep, ours. So now I’m looking at having to recreate the last 3 months of blog posts (the amount of time since I last manually backed up the main site), so I went and pulled up the Google cache pages and downloaded the source of every post since then.
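
If you ever find yourself in the same spot, the cache-scraping can be scripted. Here’s a minimal sketch, assuming the requests package and Google’s cache-URL format; the post URLs below are hypothetical placeholders, not our real permalinks.

```python
# A minimal sketch: pull each lost page out of Google's cache and save the
# HTML locally. Assumes the `requests` package is installed and that the
# webcache.googleusercontent.com URL format works; post URLs are placeholders.
import time
import requests

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"

posts = [
    "anime-pulse.com/?p=1234",  # hypothetical post URLs
    "anime-pulse.com/?p=1235",
]

for url in posts:
    resp = requests.get(CACHE_PREFIX + url, timeout=30)
    if resp.status_code == 200:
        # Derive a filesystem-safe name from the URL.
        fname = url.replace("/", "_").replace("?", "_") + ".html"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(resp.text)
    time.sleep(5)  # be polite; rapid-fire cache requests get throttled
```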

1:47 AM EST (11/10/2008)

We are still waiting for the FSCK to complete. Expect another update in the morning.

6:10 PM EST (11/10/2008)

MAGGIE is online at this moment. It has completed its fsck after 4.5 days of downtime, and our wait has paid off. If we had become impatient and stopped the fsck process, 100% of the data would certainly have been lost. It is not unheard of for an fsck process to take this long, especially given the size of the RAID arrays and the number of drives involved. We appreciate all our customers who waited for the server to come back online.
2.4GB of data was lost during this fsck on the whole node. We may still be able to recover this data. If you have lost data, please open a ticket with support with “DATA LOSS” as the ticket title.
We are also investigating what happened to the backups. Here are the facts so far: the backups were made regularly, and each server had backups on the backup node. For some reason, which we are investigating, some backups could not be restored while others were. When trying to restore the affected backups we hit errors and could not finish the restore; other servers restored with no problems.
We will need to do testing on the backup server to see if the problem was with MAGGIE itself or with the backup server the backups were stored on. Regardless, we will be issuing a statement sometime this week with the complete cause. As for the cause of the server going down and remaining down: we are also not sure what prompted the forced fscks on MAGGIE, since all other servers came back online with no issues. Out of all the nodes in our current Dallas location, this was the only one affected. It’s a shame that something that was supposed to increase performance on all of our nodes ended up taking one of them offline for so long.
It is extremely important for all customers to make their own backups and store them off-site. We do keep backups of our servers, but you should not rely on these as your only backups, since you have no way of checking their status yourself. We do these backups as a courtesy, to be used as a last resort to bring back servers faced with issues like this. Aside from disk corruption, there are also location-based factors, such as fires, that can cause all data at a datacenter to be lost. Though the chances of these are slim, they are still a cause for concern. The more ways you back up your server, the more options you have when your server is down.
We will now be doing an audit of all our other backup nodes to see if the issue is widespread or was just with this node.
You may have faced slow ticket response times during this outage; all available techs were working on other issues caused by the downtime, and over the past 4 days we have received a large rush of tickets.
All servers are back online, and should be resolving properly.

Luckily, I haven’t discovered any data loss so far. I am going through and taking a manual backup of all the databases as well as the websites now, and I will start taking weekly backups again, storing them locally as well as on Amazon, so worst case, if something like this happens again, we only lose 7 days of blog, forum and wiki posts.
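
For the curious, here’s roughly what that weekly backup will look like. This is a minimal sketch, assuming mysqldump is on the PATH and the boto3 library is configured with AWS credentials; the database names, web root, and bucket name are placeholders, not our real setup.

```python
# A minimal sketch of a weekly backup: dump the databases, tar the web root,
# keep a local copy, and push a second copy off-site to Amazon S3.
# Assumes mysqldump is on PATH and boto3 has AWS credentials configured;
# names and paths below are illustrative placeholders.
import datetime
import subprocess

import boto3

STAMP = datetime.date.today().isoformat()
DATABASES = ["blog", "forum", "wiki"]      # hypothetical database names
WEB_ROOT = "/var/www/anime-pulse.com"      # hypothetical site path
BUCKET = "anime-pulse-backups"             # hypothetical S3 bucket

# Dump each database to its own file.
for db in DATABASES:
    with open(f"{db}-{STAMP}.sql", "wb") as out:
        subprocess.run(["mysqldump", db], stdout=out, check=True)

# Archive the site files together with the fresh SQL dumps.
archive = f"site-backup-{STAMP}.tar.gz"
subprocess.run(
    ["tar", "czf", archive, WEB_ROOT]
    + [f"{db}-{STAMP}.sql" for db in DATABASES],
    check=True,
)

# The local archive stays on disk; upload the off-site copy to S3.
boto3.client("s3").upload_file(archive, BUCKET, archive)
```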

Again, I apologize for the amount of downtime over the weekend, and if you sent any emails to anime-pulse.com email accounts and didn’t receive a response, please resend them.

As always, thanks for your support and enjoy the show!
