OK, so this isn’t going to be quite as magical as the ’12 days’ you’re used to. We’ll admit that. There aren’t going to be eleven pipers piping or seven swans a-swimming. But what you will get is one very happy IT professional who knows that there’s no chance her holiday will be ruined by catastrophic data loss. If you want a holiday period where you can relax with your family and not have to worry about your servers, you need to be prepared for system recovery. Here’s how you do it.
Oh and by the way, probably don’t do just one of these things per day over 12 days… it’s a song. You get what we’re doing here, right? Yeah, you get it.
Everybody, sing along!
The serious stuff
OK, holiday cheer aside, this is important stuff. Here’s a quick rundown of all the steps mentioned in the song.
Calculate all RTPO needs
The first step when planning any system recovery procedure is to establish your Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for each machine and each business-critical application in your environment. The RTO tells you how quickly you need to recover, and the RPO tells you how much data you can afford to lose. Here’s some further reading on how to calculate RTO and RPO.
Create a backup schedule
Once you’ve determined your RTOs and RPOs, it’s time to create a backup schedule. Your RTPO goals are what will determine the best backup schedule for you. For example, if you have an RPO of 15 minutes on your SQL Server, you’ll need to schedule backups of that server’s transactions at least every 15 minutes. And don’t forget, at least one backup should always be kept offsite! Read more on scheduling backups.
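To make the RPO example concrete, here is a minimal sketch of how you might check a schedule against an RPO. It assumes the worst-case data loss is simply the longest gap between two consecutive backups. The function names and sample times are hypothetical and are not part of any backup product.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times):
    """Return the longest gap between consecutive backups."""
    ordered = sorted(backup_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=timedelta(0))


def meets_rpo(backup_times, rpo):
    """True if no gap between backups is longer than the RPO."""
    return worst_case_data_loss(backup_times) <= rpo


# Hypothetical schedules for one server, starting at 9:00 am.
start = datetime(2015, 12, 24, 9, 0)
every_15_minutes = [start + timedelta(minutes=15 * i) for i in range(5)]
hourly = [start + timedelta(hours=i) for i in range(5)]

rpo = timedelta(minutes=15)
print(meets_rpo(every_15_minutes, rpo))  # True
print(meets_rpo(hourly, rpo))            # False
```

An hourly schedule fails the check because up to an hour of transactions could be lost, which is four times what a 15 minute RPO allows.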
Identify key assets
Not all machines and applications are created equal. There are always going to be certain elements of your environment that are more mission-critical than others. These need to be identified, because they will be your highest priority when it comes time for performing a recovery.
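One rough way to turn that priority list into something you can act on is to sort your assets so the business-critical ones with the tightest RTOs come first. The sketch below is illustrative only: the `Asset` fields, server names and RTO values are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    rto_minutes: int
    business_critical: bool


def recovery_order(assets):
    """Business-critical assets first, then by shortest RTO."""
    return sorted(assets, key=lambda a: (not a.business_critical, a.rto_minutes))


# Hypothetical environment.
assets = [
    Asset("file-server", 240, False),
    Asset("sql-server", 60, True),
    Asset("mail-server", 120, True),
    Asset("print-server", 480, False),
]

for asset in recovery_order(assets):
    print(asset.name)
```

With these sample values the SQL server is recovered first and the print server last.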
Consider hardware issues
Hardware is always a key consideration in system recovery. If your servers are irreparably damaged, what will you recover to? Do you have an old server that can be used as a temporary fix in the short term? How long will it take to source, purchase and bring new hardware onsite? If the crash happens over a weekend or holiday, will you be able to contact the vendor right away? These are important factors to establish.
Plan offsite retrieval
In a disaster scenario, chances are that your onsite backups are going to be destroyed along with your servers. That means to perform a system recovery you need to get your offsite backups back on site as quickly as possible. However, there’s more to consider here than you might think. If you’re using the cloud as offsite, what happens if your internet connection is down? If you’re storing physical drives offsite, what happens if roads are blocked and you can’t get to them?
Have a Plan-B offsite
Following on from the above point, this one’s pretty self-explanatory. If you can’t get to your primary offsite backup, having a second offsite backup in a different location or on a different media type could definitely help your business continuity.
Assign specific duties
When it comes to system recovery, everyone in your team needs to know what their role is. If someone will be responsible for retrieving the offsite backups, make sure they know that. If another person will be responsible for sourcing the new hardware or configuring the fall-back server, they need to know that too. Assigning roles ahead of time will help make your recovery process as seamless as possible.
Develop a contingency
Again, this follows on from the point above: things don’t always go to plan! What happens if your sysadmin, who’s responsible for coordinating the entire recovery process, is away on holiday and out of phone contact? Make sure there’s always more than one person who knows the procedures and what needs to be done.
Document procedures
To that end: document everything! Your entire recovery process from start to finish needs to be clearly documented, and that documentation needs to be tested and easily accessible. In the stress of a recovery situation, people are bound to forget specific procedures – documenting them clearly and referring back to that documentation frequently prevents that from happening.
Draft communications
A system outage is going to affect the entire business, so the entire business needs to be kept in the loop. You should always include stakeholders from other departments in your recovery planning process, and plan out ahead of time how you will communicate progress to them while a recovery is underway. Drafting communications ahead of time, to be sent out at key progress points during the recovery process, is a good idea.
Rehearse entire process
The last thing you want is for your first run-through of your system recovery procedures to be in a real-life disaster scenario. Things are going to go wrong. People are going to forget things. Practice makes perfect, and when it comes to bringing mission-critical systems back online, being as close to perfect as possible is super important.
Test your backups often!
The most important point of all. By far. The most carefully thought out, well-oiled, much-practiced recovery process in the world will be a giant failure if the actual system recovery backups won’t recover. It doesn’t matter what backup solution you’re using or how much you’re spending on it. It doesn’t matter how many times automatic verification tells you the backups were successful: you need to test them. Regularly. It’s not just best practice, it’s common sense.
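As one example of what a test can look like, here is a minimal sketch that compares a test restore against the original files using SHA-256 checksums. It assumes you have already restored a backup into a scratch folder and that the source files haven’t changed since the backup ran. The function names are hypothetical, and this only checks file contents; it doesn’t prove a full system will boot, so it complements a real test recovery rather than replacing it.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MB chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or different in the restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if (not restored_file.is_file()
                or sha256_of(source_file) != sha256_of(restored_file)):
            problems.append(str(relative))
    return problems
```

An empty list means every source file was found in the restore with identical contents. Anything else is a list of files to investigate before you need that backup for real.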
Happy Holidays!
Now that we’ve got the important stuff out of the way, we can get back to some good old-fashioned holiday cheer! We hope you have a fantastic holiday season. Everyone at BackupAssist is wishing you good food, good friends, good family and good times! See you all in 2016!