In my last post, “Managed Services, Part 1: Making Safe, Documented, and Reliable Changes,” I defined one of the five core support areas of managed services: change management. In this post, I will cover backup and recovery.

Have you ever watched your pre­teen freak out when he dis­cov­ers that an exten­sive essay he wrote for school and saved as a Word doc­u­ment on his com­puter was lost? Now imag­ine how a CEO or CFO would feel upon dis­cov­er­ing that files were lost on his or her com­puter. There would be more than a freak out. I pre­sume that such a cat­a­stro­phe could even lead to a dis­missal or two.

Any­one who has a com­puter looks at their files as a pre­cious commodity—in some cases, more valu­able than gold. So, any man­aged ser­vices provider worth its salt should have a pro­to­col for back­ing up and recov­er­ing data.

A hosting and managed services provider should ensure, for example, that backup data is fully encrypted to protect it from snooping competitors or disgruntled employees. The host should also ensure that only you and a representative of the hosting company have the key and access to the material. Additionally, the host should distribute the data across all the availability zones of the cloud that you’re using, so that it can be easily recovered should something disastrous occur. You also want to make sure that you can customize how much backup data is kept in the cloud. Think about how often you want your data backed up, and check that the host offers a range of retention periods, from just a few minutes to a few months.

A hosting company worth partnering with is aware that a business does not want to keep backup data indefinitely, and its protocol includes regularly destroying backups that are no longer needed.
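To make the retention idea concrete, here is a minimal sketch of how a backup service might pick out the backups that have aged past the retention window. The function name, the seven-day window, and the dates are assumptions for illustration only, not any particular provider's implementation.

```python
from datetime import datetime, timedelta

def expired_backups(backup_times, retention, now):
    """Return the backup timestamps that are older than the retention window."""
    cutoff = now - retention
    return [t for t in backup_times if t < cutoff]

# Hypothetical example: one backup per day for 10 days, retained for 7 days.
now = datetime(2024, 1, 11)
backups = [now - timedelta(days=d) for d in range(10)]
to_destroy = expired_backups(backups, timedelta(days=7), now)

# Only the backups that are 8 and 9 days old fall outside the window.
print(len(to_destroy))
```

A real service would also need to destroy the encrypted copies securely. This sketch only covers the selection step.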

Furthermore, you will want to be aware of the level of backup the host can provide and what recovery mode the host uses if the cloud infrastructure crashes. The host should be clear about the turnaround time for dealing with recovery issues.

Intrigued? Want to know more? To fully under­stand the process of a man­aged ser­vices provider, we really should look at a best prac­tice for host­ing and man­aged services.

How Com­pa­nies Ensure Data Backup and Recovery

The first thing to realize is that a hosting managed service provider usually relies on a third party to provide the cloud infrastructure, known as infrastructure as a service (IaaS), which assists in the management process. One example of such a third party is Amazon. The company offers Amazon Web Services (AWS) cloud regions to store data. Whoever the third-party cloud provider is, each region of the cloud you use includes two or more independent availability zones, and each of these zones uses one or more data centers. At the time of writing, AWS has data centers in each of 10 regions across the globe.
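The layering described above (a region contains zones, and a zone contains data centers) can be pictured as a small data model. This is only an illustrative sketch. The class names and the example region are made up and do not correspond to any provider's API.

```python
from dataclasses import dataclass, field

@dataclass
class AvailabilityZone:
    name: str
    data_centers: int  # each zone uses one or more data centers

@dataclass
class Region:
    name: str
    zones: list = field(default_factory=list)

    def is_valid(self):
        """A region needs two or more zones, each with at least one data center."""
        return len(self.zones) >= 2 and all(z.data_centers >= 1 for z in self.zones)

# Hypothetical region with two independent zones.
region = Region("example-region", [
    AvailabilityZone("zone-a", 1),
    AvailabilityZone("zone-b", 2),
])
print(region.is_valid())
```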

A backup occurs when a snapshot is made of the data of each tier in the cloud’s Elastic Block Store (EBS) system. The snapshot is taken periodically and stored, still encrypted, in a section of the cloud that makes the files accessible in all availability zones.

The backup service of the system periodically makes encrypted copies of your website and distributes them across all availability zones of the cloud. The process is performed daily, and the backup data is retained for as long as you want it to be. The backup process is efficient and does not disturb any running applications.

Five types of disaster-recovery incidents trigger an action when recovery issues arise within the cloud:

  1. Fail­ure of an indi­vid­ual appli­ca­tion server or data volume.
  2. Fail­ure of all appli­ca­tion servers or data vol­umes in a solu­tion tier.
  3. Com­plete fail­ure of an avail­abil­ity zone.
  4. Fail­ure of the elas­tic load bal­ancer (ELB) with or with­out an accom­pa­ny­ing fail­ure of instances.
  5. Fail­ure of an entire cloud region.

Should incident 1 occur, the redundant architecture means the cloud keeps operating. So if a single server fails, there is probably no impact on cluster operation. The data in the EBS remains intact. Other options to fix the issue include cloning operating systems. Recovery time varies, so ask your hosting company what it is on their system. The recovery process itself will not affect running operations.

If incident 2 occurs, there could be an effect on the user. The data is recovered from the EBS or from backups. Recovery time may vary. Again, check with your hosting managed service provider how much time elapses before recovery is complete.

Should inci­dent 3 hap­pen, the redun­dant archi­tec­ture of the sys­tem would recover the data with­out impact­ing the user.  Ask the provider what the recov­ery time would be on its system.

If inci­dent 4 takes place, the ELB is com­monly redun­dant and will recover with­out affect­ing the user. The action taken in inci­dent 3 could also han­dle the problem.

If incident 5 happens, additional capacity will temporarily start in the part of the cloud that is still functioning. The total failure of an entire cloud region would result in limited impact because the cloud includes a redundant deployment, with latency-sensitive Domain Name System (DNS) routing.
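The five incidents and their expected outcomes can be summarized as a simple lookup table. The sketch below only restates what is described above. The names `Impact`, `PLAYBOOK`, and `describe` are hypothetical, and a real provider's runbook would carry far more detail.

```python
from enum import Enum

class Impact(Enum):
    NONE = "none expected"
    POSSIBLE = "possible"
    LIMITED = "limited"

# incident number -> (what failed, expected user impact, how recovery happens)
PLAYBOOK = {
    1: ("single application server or data volume", Impact.NONE,
        "redundant servers; EBS data stays intact"),
    2: ("all servers or volumes in a solution tier", Impact.POSSIBLE,
        "restore from EBS or from backups"),
    3: ("an entire availability zone", Impact.NONE,
        "redundant architecture across zones"),
    4: ("the elastic load balancer", Impact.NONE,
        "redundant ELB, or the same action as incident 3"),
    5: ("an entire cloud region", Impact.LIMITED,
        "temporary capacity in the functioning part of the cloud"),
}

def describe(incident):
    """Build a one-line summary for the given incident number."""
    scope, impact, recovery = PLAYBOOK[incident]
    return (f"Incident {incident} ({scope}): "
            f"user impact {impact.value}; recovery via {recovery}")

for number in sorted(PLAYBOOK):
    print(describe(number))
```

A table like this is also a handy checklist when asking a provider about its recovery time for each case.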

Finally, it would be help­ful if your host prac­tices dis­as­ter recov­ery through so-called “war games,” which sim­u­late an issue.  Some com­pa­nies with such a pro­ce­dure per­form the test every two to four weeks.
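As a rough illustration of such a drill schedule, the sketch below generates drill dates at a fixed interval and checks a measured recovery time against a stated target. The function names, the two-week interval, and the 30-minute target are assumptions chosen for the example. Your provider's actual figures may differ.

```python
from datetime import date, timedelta

def drill_dates(start, interval_weeks, count):
    """Return `count` drill dates spaced `interval_weeks` apart, beginning at `start`."""
    return [start + timedelta(weeks=interval_weeks * i) for i in range(count)]

def within_target(measured, target):
    """True if the recovery time measured in a drill met the stated turnaround time."""
    return measured <= target

# Hypothetical schedule: a drill every two weeks, starting January 1, 2024.
dates = drill_dates(date(2024, 1, 1), 2, 3)
print([d.isoformat() for d in dates])

# Hypothetical drill result: recovery took 20 minutes against a 30-minute target.
print(within_target(timedelta(minutes=20), timedelta(minutes=30)))
```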

Your data is pre­cious. Assure your­self that your man­aged ser­vice provider employs a sys­tem that can back up and recover data fully and effectively.