Calculating RTO/RPO – make IT bulletproof for business

Your office burned down over the weekend.  We’re sorry to be the bearers of bad news, but it happened. What will your business do now? How fast can your IT infrastructure be back up and running? What data will it need to recover with the highest priority?

These are the questions that will inevitably arise when hammering out a Disaster Recovery plan.  In the industry, we call these questions the Recovery Time Objective (RTO) and Recovery Point Objective (RPO).  But how do you determine the ideal RTO/RPO for your business?

In this article, we’re going to explain the concepts of RTO/RPO and how you can apply them to disaster-proofing your business’s IT infrastructure. That way, when you get the news that your systems and data are smoldering ashes (or less melodramatically, that your server has crashed), your first question won’t be: “what do we do now?”

Recovery Time Objective

Your RTO refers to how long your business can afford to have its vital systems and data offline before it really starts affecting your bottom line.  In even simpler terms, it means: how quickly do you need to recover?

When determining an acceptable RTO, you should be focusing on individual applications rather than entire servers.  The stopwatch starts on your RTO the instant an important application is no longer usable, and won’t stop until it can be used again as normal by all users who require it.

You’re on your way to understanding RTO/RPO.

Recovery Point Objective

The second part of the RTO/RPO considerations is, perhaps unsurprisingly, RPO.  Rather than recovery time, RPO is concerned with the data held in your systems and applications.  Determining your RPO means working out how much of that data, measured as a window of time before the failure, your company could afford to lose.

This means that if you could afford to lose 2 hours’ worth of data from your Exchange Server without it having a serious impact on your business, then your RPO would be 2 hours.

While RTO will determine the speed necessary for your recovery, RPO determines how frequently you should be making backups and what kind of backups will be necessary.

RPO is important for backups because it addresses the question of when and how often to back up.  If you back up at 9pm, and the next day at 10am you lose a hard drive and all of the data on it, then there could be up to 13 hours’ worth of data lost – known as a “backup gap”.

In practice, the lost data may only be emails and Word docs that staff worked on from 8am to 10am when people were in the office. But then again, it could be a full 13 hours’ worth of data from a database that was being updated all night by an important application… The key is to determine what data you would lose on each type of system if it went down, and how much of this you could afford (if any).
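To make the backup gap concrete, here is a minimal Python sketch that measures the gap between the last backup and a failure and checks it against an RPO. The function names and the calendar date are made up for illustration; only the 9pm and 10am times come from the example above.

```python
from datetime import datetime


def backup_gap_hours(last_backup: datetime, failure: datetime) -> float:
    """Hours between the last backup and the failure (the worst-case data loss)."""
    if failure < last_backup:
        raise ValueError("failure must not be earlier than the last backup")
    return (failure - last_backup).total_seconds() / 3600


def meets_rpo(last_backup: datetime, failure: datetime, rpo_hours: float) -> bool:
    """True if the data lost in this failure stays within the RPO."""
    return backup_gap_hours(last_backup, failure) <= rpo_hours


# The date is arbitrary; the times match the example in the text.
last_backup = datetime(2024, 3, 4, 21, 0)  # 9pm backup
failure = datetime(2024, 3, 5, 10, 0)      # drive lost at 10am the next day

gap = backup_gap_hours(last_backup, failure)
print(gap)                                    # 13.0
print(meets_rpo(last_backup, failure, 2.0))   # False: a 2-hour RPO is missed
```

A nightly backup is fine for a system with a 24-hour RPO, but the same schedule fails the 2-hour Exchange Server example.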

Okay, now you understand what RTO/RPO means, let’s put this knowledge into practice.

Determining your downtime

When establishing an acceptable level of downtime (for the record, we’re talking about the RTO aspect of your RTO/RPO planning here), it can be tempting to simply ask your users.  After all, they know how long they can do without their applications, right?  Well, probably not.

Generally, when asking end-users you’re going to get a response that’s either unrealistically short (“If I can’t access my emails for 30 seconds, I’ll miss that sale and the world will end in fire.”), or naively long (“I don’t even know what that application does, I’m pretty sure I could go without it for a few months”).  That’s why, as with most things in life, you need to look at the data.

The first step is to make a comprehensive list of every system and application your business uses as a part of its operations.  Then determine what role each actually plays – i.e. all functions it performs for your business and what departments/users would be affected by its loss.

Next you need to determine potential losses that could occur if the system or application were to go down – i.e. lost revenue or sales, salaries paid to idle workers, additional expenses due to lack of access, damage to company reputation etc.  Do this for every application individually, and don’t forget that an outage at certain times of year will have heavier consequences than at others.
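One rough way to put numbers on these losses is a per-application cost-per-hour estimate. The sketch below is only an illustration: the class, its fields, the seasonal multiplier and every figure are hypothetical, and it leaves out reputational damage, which is hard to price.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Hypothetical per-application loss model; all fields are assumptions."""
    name: str
    lost_revenue_per_hour: float
    idle_staff: int
    hourly_wage: float
    extra_expenses_per_hour: float = 0.0
    seasonal_multiplier: float = 1.0  # above 1.0 during busy periods

    def cost_per_hour(self) -> float:
        # Revenue loss scales with the season; wages and expenses do not.
        revenue = self.lost_revenue_per_hour * self.seasonal_multiplier
        idle_wages = self.idle_staff * self.hourly_wage
        return revenue + idle_wages + self.extra_expenses_per_hour

    def cost_of_outage(self, hours: float) -> float:
        return self.cost_per_hour() * hours


# Made-up figures for a web shop, in a normal week and in peak season.
normal = DowntimeCost("web shop", 500.0, 4, 30.0)
peak = DowntimeCost("web shop", 500.0, 4, 30.0, seasonal_multiplier=2.0)

print(normal.cost_per_hour())      # 620.0
print(normal.cost_of_outage(4.0))  # 2480.0
print(peak.cost_per_hour())        # 1120.0
```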

Once you’ve worked out these details, you get to the fun part: determining exactly how long it takes before these losses become unacceptable. That will depend on the specifics of your business, but here are some questions you should ask yourself that will factor into your ideal RTO:

  • Do you hold customers’ data on their behalf? If so, what service agreements and obligations do you have with them? This will impact how quickly you will need to recover that data.
  • Do you have customers who need real-time access to your data? An example may be point-of-sale systems.
  • What systems have dependencies? For example, if you lost a database then what applications would be affected and what are their corresponding uptime requirements?
  • What systems would result in direct financial loss if they went down? E.g. a website selling goods.
  • What systems would cause a production outage? E.g. factory-floor or quality control systems.
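The dependency question can be sketched in code. If several applications rely on one database, the database effectively inherits the tightest recovery time among them. The system names and hours below are invented, and the function is just one way to express the idea.

```python
def effective_rto(system, own_rto, dependents, _seen=None):
    """Tightest RTO (in hours) among a system and everything that depends on it.

    own_rto maps each system to its own required recovery time;
    dependents maps a system to the applications that rely on it.
    """
    seen = _seen if _seen is not None else set()
    if system in seen:
        # Already counted (shared dependency or a cycle): contributes nothing new.
        return float("inf")
    seen.add(system)
    tightest = own_rto[system]
    for app in dependents.get(system, []):
        tightest = min(tightest, effective_rto(app, own_rto, dependents, seen))
    return tightest


# Hypothetical inventory: the point-of-sale system needs the database within 1 hour.
own_rto = {"sql-db": 24.0, "point-of-sale": 1.0, "reporting": 48.0}
dependents = {"sql-db": ["point-of-sale", "reporting"]}

print(effective_rto("sql-db", own_rto, dependents))     # 1.0
print(effective_rto("reporting", own_rto, dependents))  # 48.0
```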

Once you’ve answered these questions and worked out the required recovery time for each application and system, your overall RTO is determined in one of two ways. If there’s one application that will cause significantly greater loss to your business than the others, use the time required to recover it as your RTO. If all applications are equally valuable, simply average the times for all of them and use this instead.
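Here is a small sketch of that two-way rule. The article doesn’t define “significantly greater”, so the `dominance_ratio` threshold (the top application’s hourly loss being at least twice the next one’s) is purely an assumption, as are the application names and figures.

```python
from statistics import mean


def overall_rto(apps, dominance_ratio=2.0):
    """Pick an overall RTO from (name, hourly_loss, rto_hours) tuples.

    If one application's loss dominates (assumed: at least dominance_ratio
    times the runner-up), use its RTO; otherwise average all the RTOs.
    """
    if not apps:
        raise ValueError("at least one application is required")
    ranked = sorted(apps, key=lambda app: app[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] >= dominance_ratio * ranked[1][1]:
        return ranked[0][2]
    return mean(app[2] for app in ranked)


# Made-up examples: one dominant application, then two equally valuable ones.
dominant = [("web shop", 5000.0, 2.0), ("email", 400.0, 8.0), ("file share", 300.0, 24.0)]
balanced = [("email", 100.0, 4.0), ("file share", 100.0, 8.0)]

print(overall_rto(dominant))  # 2.0
print(overall_rto(balanced))  # 6.0
```

Whatever overall figure you settle on, it is worth keeping the per-application times as well, since they are what you will actually test against.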

The final step is to perform a test recovery for each and every system and application.  However long it takes is your Recovery Time Actual (RTA).  Your goal is then to make your RTO and RTA one and the same.
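After a round of test recoveries, comparing RTA against RTO is a simple lookup. The sketch below assumes you have recorded both in hours per system; the names and numbers are invented for illustration.

```python
def compare_rta_to_rto(rto_hours, rta_hours):
    """Return (overruns, untested).

    overruns maps each system whose measured recovery (RTA) exceeded its
    objective (RTO) to the number of hours it ran over; untested lists
    systems that have an RTO but no recorded test recovery.
    """
    overruns = {}
    untested = []
    for name, rto in rto_hours.items():
        if name not in rta_hours:
            untested.append(name)
        elif rta_hours[name] > rto:
            overruns[name] = rta_hours[name] - rto
    return overruns, untested


# Hypothetical objectives and test results, in hours.
rto_hours = {"exchange": 4.0, "sql-db": 1.0, "file share": 24.0, "website": 2.0}
rta_hours = {"exchange": 3.5, "sql-db": 2.5, "file share": 6.0}

overruns, untested = compare_rta_to_rto(rto_hours, rta_hours)
print(overruns)  # {'sql-db': 1.5}
print(untested)  # ['website']
```

Any system in `overruns` needs either a faster recovery method or a more realistic RTO, and anything in `untested` has no proven recovery time at all.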

Deciding what to revive

The final phase of your RTO/RPO planning deals with determining an acceptable level of data loss.  RPO refers to the frequency with which your data is recoverable – e.g. if you’re performing daily backups then your Recovery Point will be 24 hours.

While it’s possible to achieve an RPO of zero data loss, these kinds of solutions are almost always costly.  Most businesses are going to need to determine a realistic RPO that will cause minimal impact.

Remember those questions you asked yourself when figuring out an RTO? Well, these same considerations are going to affect your RPO as well – with the focus being on the data instead of recovery time.

Once you know how much data you can afford to lose, you can plan a backup strategy to meet that requirement for each system, data store and application.

A daily backup will allow you to have a copy of your data as it was at the end of the previous business day. If you have determined that data lost at any point in the day can be replaced from the previous day’s backup, then you can protect your data with a daily backup.

If you have determined that you cannot afford to lose more than a few hours’ data, then a daily backup will not be enough. You will need an ongoing backup of your data through the day. For a data hard drive this could mean you need a mirrored drive. For an SQL database, it means you may need a transaction log backup that runs every 10 or 15 minutes.

If your requirement is somewhere in the middle and you can afford to lose half a day’s data, then you could consider backing up the data twice a day.
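The three cases above can be summed up in a short sketch that turns an RPO into a backup frequency. The function names and the exact cut-offs (24 hours and 12 hours) are assumptions drawn from the daily and twice-daily examples, not hard rules.

```python
import math


def backups_per_day(rpo_hours: float) -> int:
    """Minimum evenly spaced backups per day so that no gap exceeds the RPO."""
    if rpo_hours <= 0:
        raise ValueError("an RPO of zero needs continuous protection, not scheduled backups")
    return max(1, math.ceil(24 / rpo_hours))


def suggest_backup_plan(rpo_hours: float) -> str:
    """Map an RPO to the kind of backup described in the text (assumed cut-offs)."""
    if rpo_hours >= 24:
        return "daily backup"
    if rpo_hours >= 12:
        return "twice-daily backup"
    return "ongoing backup through the day (e.g. mirrored drive or frequent log backups)"


print(backups_per_day(24.0), suggest_backup_plan(24.0))  # 1 daily backup
print(backups_per_day(12.0), suggest_backup_plan(12.0))  # 2 twice-daily backup
print(backups_per_day(0.25))                             # 96 (one every 15 minutes)
```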

Anything you’re not sure about when it comes to RTO/RPO?
Leave your question in the comments below, tweet it @BackupAssist or post to our Facebook wall.
Share this article and become a business continuity crusader.

 

6 thoughts on “Calculating RTO/RPO – make IT bulletproof for business”

  1. Fairly good description of an approach to determining RTO and RPO, but I have the feeling I first got when I saw Tron. The characters weren’t sure they believed in The Users, i.e, the people who depend upon systems to be reliably available and that their data will be suitably secure. So, while it is true that an RPO of zero ensures a pricey solution, the real question is: What is the point at which loss of data threatens unacceptably high losses to the enterprise? Losses may be interruption of the revenue stream, contract violations/lawsuits, regulatory penalties, loss of market share, loss of shareholder value (stock price, if publicly traded), et cetera. These are NOT IT issues, meaning every business unit in the enterprise must have a turn with the Business Continuity Program Manager and the DR specialist to settle on the availability and data protection criteria, aka RTO and RPO.

    I have had clients who weren’t concerned about a brief outage of system availability, just so long as no data was lost. You see that a lot in any financial institution, and most are squarely on top of well-designed solutions. While other industries and markets may be more frugal, the regulations in the financial sector level the playing field such that spending millions on a highly secure IT environment is just another cost of doing business: everybody has to do it, so they build it into their budgets. Period.

    My bottom line is this: IT DR planners need to get very close and comfortable with business unit management enterprise-wide. Remember: any operation worth funding and staffing is worth recovering… some day. So, they all need an IMPLEMENTED plan. By “implemented,” I mean that for every business unit operation, recovery resources necessary to resume that operation are either in place or will be available within the operation’s RTO (and that means any necessary thing, not just IT).

  2. A reasonable description of individual RTO and RPO for a single application and for lay members this is a good start for when that one application has an outage that does not impact any other services.
    It is also a very important place to begin your recovery planning and I particularly like the concept of knowing what systems you have and what services they provide, as there are way too many IT departments that have yet to set up and continually maintain a good CMDB or Service Catalogue.

    However I believe it is worth delving deeper as IT services are often provided through a complex and integrated layer of multiple applications relying on numerous databases, servers, core infrastructure components, networks all the way down to your data centre.
    The challenges in recovering from a serious disaster that impacts a large range of these IT components cannot be addressed through a plan that relies on individually tried and tested applications that on their own successfully meet their RTOs.

    An example being a tried and tested RTO of 4 hours for your top ten critical services does not mean you have the resources, technologies or plans to recover them all at the same time within 4 hours, so once tested individually and remedial activity has been identified, proposed and finally implemented and proven the next phase must be to begin integrated application testing.

    Areas frequently missed are the supporting infrastructure, I have yet to perform a Business Impact Assessment and be questioned on what happens if the AD Global Catalogue becomes corrupt, TCP/IP services fail or had the focus on Middleware and Security that the DR planner needs to bring to the table themselves.
    Additionally if you align your RTO with a tested RTA without considering the time it takes to assess the incident, agree a DR invocation is required, mobilise staff and actually start the recovery it is unlikely your technical solution will ever meet the RTO.

    To address the RPO points again for individual applications a good point however a danger in identifying application RPOs in isolation is that the synchronisation of data from integrated applications that have been assigned different RPOs is again somewhat challenging.
    So it is incumbent upon the DR planner to ensure an integrated approach to RPOs, where one service that clearly aligns to another that has been given a conflicting RPO I suggest bringing the parties that came to each concluding RPO together to ascertain which should be the correct one.

    Clearly a start point of working with individual applications is a practical and pragmatic approach, It is vital when this process begins that the DR planner remains cognisant of the challenges ahead so they can be considered at an early stage and a strategy developed to move from testing one application RTO and RPO to integrating associated services and eventually planning for a Data Centre level outage.

    • Thanks a lot for taking the time to dive in deeper on this topic for us, Trevor. Very helpful advice for anyone wanting to learn more than just the brief overview this article offers.

  3. Honestly the best thing I’ve read today. Thank you so much! This honestly dispelled a lot of hesitation I had with IT infrastructure and departments. I’ve been trying to cut down my RTO and maximise my RPO for years.

  4. Thanks for this. With complex malware/hacker attacks these days, there’s a risk that backups like SQL transactional file backups can be compromised too. They can even find ways to purge VM backups. This would massively impact RTO so how is this traditionally communicated to the business in terms of SLA?
