Book

Site Reliability Engineering

TIME TO COMPLETE: 15h 44m
LEVEL: Intermediate to advanced
PUBLICATION DATE: April 2016
PRINT LENGTH: 552 pages

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:

  • Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices—Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
  • Management—Explore Google's best practices for training, communication, and meetings that your organization can use

You might also like

Site Reliability Engineering Fundamentals
Course
Over the past five years, the ideas behind site reliability engineering (SRE) have caught fire because of their success in improving the reliability of systems. But those just starting their SRE journey often have questions. For instance, how do you transform an existing organization toward SRE? Where do DevOps and SRE overlap, and where do they diverge? And which method for calculating and measuring service-level objectives (SLOs) should you use, and when?

Join Incident Labs’ Emil Stolarsky and Jaime Woo to gain a foundational understanding of SRE principles and the infrastructure practices and processes of a range of organizations, along with actionable advice on putting them to work in your organization. Emil and Jaime will also take you through the pragmatic and sometimes messy decisions that must be made on a regular basis to form a functional and successful SRE culture. Make meaningful changes to how you run your services immediately, and learn how to start meaningfully participating in the SRE community.

What you’ll learn and how you can apply it

By the end of this recording of a live online course, you’ll understand:

  • What SRE is (and isn’t) and how it’s evolved over the past decade
  • How SRE relates to concepts like DevOps and resilience engineering
  • The benefits of SRE
  • When and how SRE should be applied for maximum impact
  • Current SRE conversations and where they’re happening

And you’ll be able to:

  • Assess how SRE is implemented across various companies of different sizes
  • Implement foundational SRE concepts, such as SLOs and error budgets
  • Debunk common myths and misunderstandings around SRE
  • Evaluate the progress of SRE adoption and strategies and relate them back to stakeholders

This live event is for you because…

  • You’re a developer new to or looking to enter an SRE role.
  • You build the tools that improve deployment, shepherd code from developers into production, make sure it keeps running, or anything else remotely related.
  • You want to become well-versed in the foundations and best practices of SRE.

Prerequisites

  • Experience running software in production environments
  • Familiarity with the struggle of implementing SRE

Recommended follow-up:

  • Read Site Reliability Engineering (book)
  • Read The Site Reliability Workbook (book)
  • Read Seeking SRE (book)
  • Watch Spotlight on Cloud: Reducing the Impact of Service Outages with Generic Mitigations with Jennifer Mace (video)
  • Read Implementing Service Level Objectives (book)
Site Reliability Engineering: How Google Runs Production Systems
Audiobook
The Site Reliability Workbook
Book
In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today, and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is. You'll learn:

  • How to run reliable services in environments you don't completely control, like cloud
  • Practical applications of how to create, monitor, and run your services via Service Level Objectives
  • How to convert existing ops teams to SRE, including how to dig out of operational overload
  • Methods for starting SRE from either greenfield or brownfield
Observability Engineering
Book
Observability is critical for building, changing, and understanding the software that powers complex modern systems. Teams that adopt observability are much better equipped to ship code swiftly and confidently, identify outliers and aberrant behaviors, and understand the experience of each and every user. This practical book explains the value of observable systems and shows you how to practice observability-driven development. Authors Charity Majors, Liz Fong-Jones, and George Miranda from Honeycomb explain what constitutes good observability, show you how to improve upon what you're doing today, and provide practical dos and don'ts for migrating from legacy tooling, such as metrics, monitoring, and log management. You'll also learn the impact observability has on organizational culture (and vice versa). You'll explore:

  • How the concept of observability applies to managing software at scale
  • The value of practicing observability when delivering complex cloud native applications and systems
  • The impact observability has across the entire software development lifecycle
  • How and why different functional teams use observability with service-level objectives
  • How to instrument your code to help future engineers understand the code you wrote today
  • How to produce quality code for context-aware system debugging and maintenance
  • How data-rich analytics can help you debug elusive issues

About the Publisher

O’Reilly’s mission is to change the world by sharing the knowledge of innovators. For over 40 years, we’ve inspired companies and individuals to do new things, and do things better, by providing them with the skills and understanding necessary for success.

At the heart of our business is a unique network of experts and innovators who share their knowledge through us. O’Reilly online learning offers exclusive live training, interactive learning, a certification experience, books, videos, and more, making it easier for our customers to develop the expertise they need to get ahead. And our books have been heralded for decades as the definitive place to learn about the technologies that are shaping the future. Everything we do is to help professionals from a variety of fields learn best practices and discover emerging trends that will shape the future of the tech industry.
