DRMS: Disaster Recovery Mirroring System

Implementation Notes for ON/2 Systems

by Andy Orrock, Technical Consultant


Keys to Success

Based upon my experiences as the Operations Manager at Tosco (now Phillips Petroleum), I believe there are three keys to a successful joint ON/2 - DRMS implementation (presented here in order of importance):

  1. Identifying and mirroring only the critical ON/2 files. The more indiscriminate you are with your choices, the more you risk jeopardizing the mirroring of critical files with activity generated by non-critical files.

  2. Allotting sufficient bandwidth on the Continuum-to-Continuum connection. This means having a big pipe and making sure that operations unrelated to DRMS do not get in the way by consuming large chunks of that bandwidth during the day.

  3. Making sure you have a properly functioning ON/2 system, independent of any DRMS considerations. For example, a poorly constructed batch cycle may result in flooding the DRMS queue with mirroring requests, often at the expense of mirroring operations generated by online operations.
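
To make the interplay between keys 2 and 3 concrete, here is a minimal sketch of how a burst of batch-generated mirroring traffic can back up a queue that online traffic alone never fills. It is a simple illustrative model, not a description of DRMS internals; the function name, the interval granularity, and every number in it are hypothetical.

```python
def backlog_over_time(arrivals_kb, link_kb_per_interval):
    """Return the queued-but-unsent volume (KB) at the end of each interval.

    arrivals_kb: mirroring traffic generated in each interval (hypothetical).
    link_kb_per_interval: what the link can carry per interval for mirroring
        (hypothetical; shrinks if other traffic shares the pipe).
    """
    backlog = 0
    history = []
    for arrived in arrivals_kb:
        # Whatever the link cannot carry in this interval stays queued.
        backlog = max(0, backlog + arrived - link_kb_per_interval)
        history.append(backlog)
    return history


# Hypothetical load: steady online traffic, then the same with a batch burst.
online_only = [40, 40, 40, 40, 40, 40]
with_batch = [40, 40, 400, 400, 40, 40]

print(backlog_over_time(online_only, 100))  # [0, 0, 0, 0, 0, 0]
print(backlog_over_time(with_batch, 100))   # [0, 0, 300, 600, 540, 480]
```

In this toy model the backlog from a two-interval batch burst is still draining long after the burst ends, and any online updates generated in the meantime wait behind it.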

Suggested Approach to DRMS Implementation

These steps are listed in the general order in which you should tackle a DRMS implementation.

  1. Do as much upfront as possible with your telecommunications team. Describe the importance of the project and ask them to use their best efforts to ensure that no other operation infringes upon bandwidth availability. At Tosco, we found out - much to our surprise - that our supposedly dedicated large pipe was in fact constricted at certain points along the system-to-system path by interference from corporate e-mail.

  2. My most important point: Remember, you are starting from nothing. No disaster recovery is in place right now. With that in mind, think incrementally. Most people get this tool in their hands and immediately have one of two thoughts: (a) I want to back up the entire disk, or (b) I have these 50 files (or some large number) that are absolutely critical to back up. I am telling you from experience that you find out some very interesting things about your telecommunications infrastructure and your operations cycle (especially the effect batch has on online) when you put in DRMS.

    The best single piece of advice I can give you is to get an early victory. Select one (yes, one) file - the absolutely most critical one - and mirror that. Then watch. Watch the way the two systems behave in concert. Take metrics. Observe. Compare the files every so often. We found that mirroring backed up at times that seemed unrelated to operational or demand cycles. We were able to track that back to the telecom issues I mentioned earlier.

    At Tosco, the single critical file we backed up was the alt_tlf. Why? All settlement took place from that file. If we lost or corrupted that file, we were toast. Nobody would get paid. And at Tosco, we once had a system crash that made the alt_tlf unreadable. The DRMS-created backup was pristine. That was the world's fastest payback for a software purchase.

    Another reason the alt_tlf (or tlf) is a good first step: you get to start from scratch every 24 hours (or less, depending on your operating model). So if you do have fits and starts when you first bring up the operation (for telecom or other reasons), you're automatically back in sync in 24 hours or less because it's the next day and you're mirroring a new file. Were there other files that were critical? Yes, but not as critical. Here are two examples of how we handled them:

    • We found we were able to improve mirroring operations by updating certain critical files via dual batch operations. For example: instead of mirroring the "transaction view" files, we ran "dual renewal" processes (i.e., simultaneous batch jobs on both systems).

    • We got by with mirroring only the 'alt' version of the TLFs, and not the primary versions, because the only thing we used the primary for was online velocity checking. In a disaster, we decided we were willing to "give up" velocity checking. This exposed us to a level of consumer fraud that we considered acceptable, and only for a 24-hour period. We felt this small concession was a fair trade-off for ensuring that the mirroring of the more important 'alt' TLF (used to enact all settlement operations) ran without impediment from activity generated by updates to the primary TLF.

  3. Once you get that first win and learning experience under your belt, start adding files. But do it in a measured fashion. Before adding any file, ask yourself three questions:

    • Is this really a critical file? (Go back to my earlier example about velocity checking.)

    • Do I know all the implications of adding this file - online AND batch? (Trust me, you find out a LOT about the inefficiencies of your batch processing when you go through the DRMS exercise.)

    • Can I re-create the file on the other system using another method? Card and account file renewals (assuming you're using the classic ON/2 structure of not doing transaction-level updates to files such as the ACV and DDV) are far better handled via the built-in renewal mechanisms provided by S2. [Especially (and exclusively) true if you're doing any type of direct/hash DB stuff.]
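
The advice in step 2 to "compare the files every so often" can be sketched as a simple checksum comparison. This is an illustration only, not a DRMS or ON/2 facility: the function names are hypothetical, and it assumes both copies can be read from one place. In practice you would more likely compute the digest on each system and compare the two results.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(primary_path, mirror_path):
    """True when the primary file and its mirrored copy are byte-identical."""
    return file_digest(primary_path) == file_digest(mirror_path)
```

A mismatch on a file that is still being written may simply reflect updates in flight, so a check like this is most meaningful at a quiet point, such as just after a file has been closed out for the day.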