Adventures in Networking

Archive for October, 2009

Backing up is hard to do…

Making backups is one of the cardinal rules of using any kind of computer system, be it your home machine with your digital photos or a mission-critical enterprise application. Computers are imperfect machines, made by imperfect human beings, and they do fail. If the data is important to you, or to your employer, then a backup is not just good advice, it is absolutely essential! It is one of the most important responsibilities of a system admin. Neglecting the backups or failing to perform them is a surefire recipe for disaster, and could easily cost you your job. Whenever I have a server crash, the stress level of the situation is directly related to the date and time of the last full backup of that machine. If the data was backed up recently, and your documentation is current, then it should be a relatively simple matter to rebuild the system and get things back to normal.

This week we finally got our new server ready to hold our SNMP monitoring system (we use What’s Up Gold, and I’ll have to do another post about this sometime). We use an off-box database to hold all the data for this system, rather than the default, which just installs Microsoft SQL “light” (MSDE) on the same box. In theory, moving the engine to a new server is a very simple matter: just shut down the engine on the old box, install it on the new box, and point it at the database. I’ve done this a half dozen times, so it should’ve been very simple. One of my system admins ran through a default install on the new server, which installed a local database, then he grabbed me to show him how to connect the engine to the database server. We tried changing the ODBC connection string, and couldn’t get it to connect (I don’t remember the error message). We concluded that we might as well just un-install, and re-install without the local database.

The first lesson in all of this is to read warning messages VERY carefully. The un-install asked if we wanted to remove our data and settings. We answered yes, thinking it would just remove the local database. Oops. Despite the errors we encountered while trying to connect, we had entered the information to connect to the live database. So the un-install promptly “dropped” the live database on the SQL server. I mean it was completely gone, all traces of it. My heart sank for just a second, but then I figured, well, we’ll just roll back to last night’s backup. That’s where the fun started.

We are using Novanet Backup on the database product (I’ll have to give a review in another post, but needless to say I like it), and I remembered creating and manually testing the backup job a month ago. However, I neglected to actually schedule the job to run. Oops, again. Now, it’s a whole other discussion as to why this was not noticed, but in a nutshell it’s because it was a new system, and it was just “IT division stuff.” In any case, that first manual backup was our salvation. We have made some significant changes to the network over the last month, and of course we lost all of September’s statistics with regard to uptime, latency, etc. But we had 90% of everything, and with about an hour’s worth of work the system was usable again. We’ll still need to spend a few more hours reviewing the documentation and adding and changing devices to match what’s out there now.

Even though it ended up costing us a half day’s work, our collective butt was saved because we had at least the one backup from a month ago. Had we not had that, it would’ve taken weeks to re-enter everything from our documentation into the monitoring system. Not only that, but we would’ve lost all historical data as to the performance of our entire infrastructure for the last year or so.

From this experience, I have vowed to work with my guys to ensure that EVERYTHING, including all data we IT folks need to do our jobs (not just our “customers’” data), is backed up, and that those backups are being religiously monitored and tested. It’s funny how even a seasoned pro can let things slide occasionally. I’ll have to do another post about why I hate tape and why I don’t use it for anything. Yes, you read that right: we have reliable backups, including off-site, cold storage, etc., all without tape. Why everyone seems to be so stuck in the stone age is beyond me, but as I said, that’s a story for another day.
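
For anyone wondering what “religiously monitored” might look like in practice, here is a minimal sketch of a freshness check in Python. It is not how our backup product works and it is not what we run; it just assumes the backups land as files in some directory and complains if the newest one is too old. The function names and the 26-hour threshold (a nightly job plus some slack) are my own assumptions for illustration. A check like this would have flagged our never-scheduled job after the first day.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the most recently modified file in
    backup_dir, or None if the directory contains no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir, max_age_hours=26, now=None):
    """True if there is no backup file at all, or the newest one is older
    than max_age_hours. The 26-hour default is an assumed threshold."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours
```

Run something like `backup_is_stale(some_backup_dir)` from a scheduled task and send an alert when it returns True. Keep in mind that a fresh file only proves the job ran; it says nothing about whether the backup can actually be restored, so test restores are still needed.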