3 Free Site Reliability Engineering (SRE) Ebooks by Google
SRE is what you get when you treat operations as if it’s a software problem. Google’s mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity. Their job is a combination not found elsewhere in the industry. Like traditional operations groups, they keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors. Unlike traditional operations groups, they view software as the primary tool through which their systems are managed, maintained, and minded; to that end, they have the source-level access and moral authority required to fix, extend and scale code to keep it working, harden it against the vagaries of the Internet, and develop their own planet-scale platforms.
“Here’s what you do when someone breaks something or finds something very difficult to debug: You say thank you. Thank you for finding this edge case. Thank you for highlighting this overcomplicated part of our system. Thank you for pointing out this gap in our docs. And then you go make it so nobody can break it the same way again.” – Tanya Reilly, Google SRE 2005-2018
1. Building Secure and Reliable Systems
By Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield
Can a system be considered truly reliable if it isn’t fundamentally secure? Or can it be considered secure if it’s unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In this book, experts from Google share best practices to help your organization design scalable and reliable systems that are fundamentally secure.
2. The Site Reliability Workbook
Edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne
The Site Reliability Workbook is the hands-on companion to the bestselling Site Reliability Engineering book and uses concrete examples to show how to put SRE principles and practices to work. This book contains practical examples from Google’s experiences and case studies from Google’s Cloud Platform customers. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
3. Site Reliability Engineering
Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
Members of the SRE team explain how their engagement with the entire software lifecycle has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world.