[Year 12 IT Apps] Google and disaster recovery and data integrity.

Christopher Jansen Christopher.Jansen at sjc.vic.edu.au
Thu Sep 4 10:38:11 EST 2014


Original article found at http://www.wired.com/2012/10/ff-inside-google-data-center/all/ (see the relevant part of the article below)
Security and Data Protection in a Google Data Center: https://www.youtube.com/watch?v=cLory3qLoY8
Thanks,
Chris Jansen


But of course, none of this infrastructure is any good if it isn't reliable. Google has innovated its own answer for that problem as well, one that involves a surprising ingredient for a company built on algorithms and automation: people.
At 3 am on a chilly winter morning, a small cadre of engineers begin to attack Google. First they take down the internal corporate network that serves the company's Mountain View, California, campus. Later the team attempts to disrupt various Google data centers by causing leaks in the water pipes and staging protests outside the gates, in hopes of distracting attention from intruders who try to steal data-packed disks from the servers. They mess with various services, including the company's ad network. They take a data center in the Netherlands offline. Then comes the coup de grâce: cutting most of Google's fiber connection to Asia.
Turns out this is an inside job. The attackers, working from a conference room on the fringes of the campus, are actually Googlers, part of the company's Site Reliability Engineering team, the people with ultimate responsibility for keeping Google and its services running. SREs are not merely troubleshooters but engineers who are also in charge of getting production code onto the "bare metal" of the servers; many are embedded in product groups for services like Gmail or search. Upon becoming an SRE, members of this geek SEAL team are presented with leather jackets bearing a military-style insignia patch. Every year, the SREs run this simulated war, called DiRT (disaster recovery testing), on Google's infrastructure. The attack may be fake, but it's almost indistinguishable from reality: Incident managers must go through response procedures as if they were really happening. In some cases, actual functioning services are messed with. If the teams in charge can't figure out fixes and patches to keep things running, the attacks must be aborted so real users won't be affected. In classic Google fashion, the DiRT team always adds a goofy element to its dead-serious test: a loony narrative written by a member of the attack team. This year it involves a Twin Peaks-style supernatural phenomenon that supposedly caused the disturbances. Previous DiRTs were attributed to zombies or aliens.
[Photo: Some halls in Google's Hamina, Finland, data center remain vacant, for now. Credit: Google/Connie Zhou]
As the first attack begins, Kripa Krishnan, an upbeat engineer who heads the annual exercise, explains the rules to about 20 SREs in a conference room already littered with junk food. "Do not attempt to fix anything," she says. "As far as the people on the job are concerned, we do not exist. If we're really lucky, we won't break anything." Then she pulls the plug, for real, on the campus network. The team monitors the phone lines and IRC channels to see when the Google incident managers on call around the world notice that something is wrong. It takes only five minutes for someone in Europe to discover the problem, and he immediately begins contacting others.
"My role is to come up with big tests that really expose weaknesses," Krishnan says. "Over the years, we've also become braver in how much we're willing to disrupt in order to make sure everything works." How did Google do this time? Pretty well. Despite the outages in the corporate network, executive chair Eric Schmidt was able to run a scheduled global all-hands meeting. The imaginary demonstrators were placated by imaginary pizza. Even shutting down three-fourths of Google's Asia traffic capacity didn't shut out the continent, thanks to extensive caching. "This is the best DiRT ever!" Krishnan exclaimed at one point.
The SRE program began when Hölzle charged an engineer named Ben Treynor with making Google's network fail-safe. This was especially tricky for a massive company like Google that is constantly tweaking its systems and services; after all, the easiest way to stabilize it would be to freeze all change. Treynor ended up rethinking the very concept of reliability. Instead of trying to build a system that never failed, he gave each service a budget: an amount of downtime it was permitted to have. Then he made sure that Google's engineers used that time productively. "Let's say we wanted Google+ to run 99.95 percent of the time," Hölzle says. "We want to make sure we don't get that downtime for stupid reasons, like we weren't paying attention. We want that downtime because we push something new."
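The downtime budget described above comes down to simple arithmetic. As a minimal sketch (the function names and the 30-day window are assumptions chosen for illustration, not anything stated in the article), a 99.95 percent availability target over 30 days permits roughly 21.6 minutes of downtime:

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over the period at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def remaining_budget_minutes(availability_percent: float,
                             period_days: float,
                             downtime_used_minutes: float) -> float:
    """Budget left after subtracting downtime already used (negative means overspent)."""
    budget = downtime_budget_minutes(availability_percent, period_days)
    return budget - downtime_used_minutes


if __name__ == "__main__":
    # Hypothetical example: the 99.95 percent target from the quote, over an assumed 30-day window.
    print(round(downtime_budget_minutes(99.95, 30), 1))       # 21.6
    # Hypothetical: 10 minutes already spent on a planned release.
    print(round(remaining_budget_minutes(99.95, 30, 10), 1))  # 11.6
```

The point of the budget idea is that the remaining minutes can be "spent" deliberately on pushing new changes rather than lost to avoidable mistakes.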
Nevertheless, accidents do happen, as Sabrina Farmer learned on the morning of April 17, 2012. Farmer, who had been the lead SRE on the Gmail team for a little over a year, was attending a routine design review session. Suddenly an engineer burst into the room, blurting out, "Something big is happening!" Indeed: For 1.4 percent of users (a large number of people), Gmail was down. Soon reports of the outage were all over Twitter and tech sites. They were even bleeding into mainstream news.
The conference room transformed into a war room. Collaborating with a peer group in Zurich, Farmer launched a forensic investigation. A breakthrough came when one of her Gmail SREs sheepishly admitted, "I pushed a change on Friday that might have affected this." Those responsible for vetting the change hadn't been meticulous, and when some Gmail users tried to access their mail, various replicas of their data across the system were no longer in sync. To keep the data safe, the system froze them out.
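The idea of freezing access when replicas disagree can be illustrated with a small sketch. This is a hypothetical classroom example, not Gmail's actual mechanism: the function names are invented, and comparing SHA-256 digests of whole copies is an assumption made to keep the example short.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a SHA-256 fingerprint of one replica's contents."""
    return hashlib.sha256(data).hexdigest()


def replicas_in_sync(replicas: list[bytes]) -> bool:
    """True when every replica has the same fingerprint."""
    return len({digest(r) for r in replicas}) <= 1


def read_mailbox(replicas: list[bytes]) -> bytes:
    """Serve the data only if all copies agree; otherwise refuse access."""
    if not replicas:
        raise ValueError("no replicas available")
    if not replicas_in_sync(replicas):
        # Hypothetical policy: protect data integrity by blocking access
        # rather than risk serving or overwriting a stale copy.
        raise RuntimeError("replicas out of sync; access frozen")
    return replicas[0]


if __name__ == "__main__":
    print(read_mailbox([b"inbox v2", b"inbox v2", b"inbox v2"]))
    try:
        read_mailbox([b"inbox v2", b"inbox v1", b"inbox v2"])
    except RuntimeError as err:
        print(err)
```

The design choice illustrated here trades availability for integrity: users are locked out temporarily, but no copy of the data is lost or corrupted.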
The diagnosis had taken 20 minutes, designing the fix 25 minutes more: pretty good. But the event went down as a Google blunder. "It's pretty painful when SREs trigger a response," Farmer says. "But I'm happy no one lost data." Nonetheless, she'll be happier if her future crises are limited to DiRT-borne zombie attacks.
One scenario that DiRT never envisioned was the presence of a reporter on a server floor. But here I am in Lenoir, earplugs in place, with Joe Kava motioning me inside.


Christopher Jansen
Mathematics and Information Technology Teacher



St Joseph's College Geelong
135 Aphrasia Street, Newtown Vic 3220
Ph Office: 03 5226 8100
Fax: 03 5221 6983
www.sjc.vic.edu.au<http://www.sjc.vic.edu.au/>
compassion innovation  integrity





