My most important point: Remember, you are starting from nothing. No disaster recovery is in place right now. With that in mind, think incrementally. Most people get this tool in their hands and immediately have one of two thoughts: (a) I want to back up the entire disk, or (b) I have these 50 files (or some large number) that are absolutely critical to back up. I am telling you from experience that you will find out some very interesting things about your telecommunications infrastructure and your operations cycle (especially the effect batch has on online) when you put in DRMS.
The best single piece of advice I can give you is to get an early victory. Select one (yes, one) file - the absolutely most critical one - and mirror that. Then watch. Watch the way the two systems behave in concert. Take metrics. Observe. Compare the files every so often. Again, we found that backups occurred that seemed unrelated to operational or demand cycles. We were able to track those back to the telecom issues I mentioned earlier.
At Tosco, the single critical file we backed up was the alt_tlf. Why? All settlement took place from that file. If we lose or corrupt that file, we're toast. Nobody gets paid. And at Tosco, we once had a system crash that made the alt_tlf unreadable. The DRMS-created backup was pristine. That was the world's fastest payback for a software purchase.
Another reason the alt_tlf (or tlf) is a good first step: you get to start from scratch every 24 hours (or less, depending on your operating model). So if you do have fits and starts when you first bring up the operation (for telecom or other reasons), you're automatically back in sync in 24 hours or less, because it's the next day and you're mirroring a new file. Were there other files that were critical? Yes. But not as critical. Here are two examples of how we handled them:
- We found we were able to improve mirroring operations by updating certain critical files via dual batch operations. For example: instead of mirroring the "transaction view" files, we ran "dual renewal" processes (i.e., simultaneous batch jobs on both systems).
- We got by with mirroring only the 'alt' version of the TLFs, and not the primary versions, because the only thing we used the primary for was online velocity checking. In a disaster, we decided we were willing to "give up" velocity checking. This exposed us to a level of consumer fraud that we considered acceptable, and for a 24-hour period only. We felt this small concession was a fair trade-off for ensuring that the mirroring of the more important 'alt' TLF (used to enact all settlement operations) ran without impediment from activity generated by updates to the primary TLF.
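The advice above to mirror one file, then "take metrics" and "compare the files every so often," can be sketched roughly as follows. This is only an illustration and not how DRMS itself works: it assumes both copies of the file can be read as ordinary files from one host, and the function names and the returned fields (`lag_bytes`, `identical`) are hypothetical choices made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_mirror(primary_path, mirror_path):
    """Compare a primary file with its mirrored copy (hypothetical helper).

    Returns a dict with the byte lag (primary size minus mirror size)
    and a flag saying whether the two files have identical contents.
    """
    primary_size = Path(primary_path).stat().st_size
    mirror_size = Path(mirror_path).stat().st_size
    # Only hash when the sizes agree; different sizes cannot be identical.
    identical = (
        primary_size == mirror_size
        and file_digest(primary_path) == file_digest(mirror_path)
    )
    return {"lag_bytes": primary_size - mirror_size, "identical": identical}
```

Running a check like this on a schedule and logging each result with a timestamp is one way to see whether the mirror falls behind during batch windows or at times that match no operational cycle at all.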