Google SRE Prodcast

by Salim Virji

SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!

Insights from recent episode analysis

Audience Interest

Estimated Reach: 21K to 89K

Listeners across platforms

Podcast Focus

Categories: technology

Publishing Consistency

Frequency: ~Weekly

50+ episodes since 2022

Platform Reach

Insights are generated by CastFox AI using publicly available data, episode content, and proprietary models.

High Confidence

Total monthly reach

21K to 89K

Estimated from 7 chart positions in 7 markets.

By chart position

Market	Category	Chart position	Est. reach
🇬🇧 GB	Technology	#134	5K to 30K
🇮🇳 IN	Technology	#177	1K to 10K
🇬🇷 GR	Technology	#50	10K to 30K
🇭🇺 HU	Technology	#95	3K to 10K
🇳🇿 NZ	Technology	#139	500 to 3K

Per-Episode Audience
Est. listeners per new episode within ~30 days
10K to 45K
🎙 Weekly cadence · 51 episodes · Last published today
Monthly Reach
Unique listeners across all episodes (30 days)
21K to 89K
🇬🇧 34% · 🇬🇷 34% · 🇮🇳 11% · +4 more
Active Followers
Loyal subscribers who consistently listen
8.2K to 36K



On the show

Recent episodes

Matt Zelesko and the Future of SRE

May 26, 2026

Unknown duration

Handling Burnout with Sam Anderson

May 21, 2026

Unknown duration

The One with Crisis Engineering and Mikey Dickerson

May 15, 2026

Unknown duration

This is Fine! With Colette Alexander and Clint Byrum

May 12, 2026

Unknown duration

The One With Damion Yates and Building AI systems

Feb 26, 2026

Unknown duration


Episodes

monthly

Range

Apr 2022 – Feb 2026

Last episode

3 months ago


Date	Episode	Description	Length
5/26/26	Matt Zelesko and the Future of SRE	We sit down with Matt Zelesko, VP of SRE at Google, for a candid talk about how AI is changing SRE — and how it's not.	—
5/21/26	Handling Burnout with Sam Anderson	Sam Anderson shares his experiences with burnout, and how to support yourself as a reliable system. Sam provides guidance on how to deal with burnout, and some suggestions on how to avoid burnout through understanding yourself and finding the help and support you need.	—
5/15/26	The One with Crisis Engineering and Mikey Dickerson	Crisis Engineer Mikey Dickerson joins us to talk about what constitutes a crisis. Mikey draws on his broad experience across industry and the public sector, as well as on work with his team of systems fixers.	—
5/12/26	This is Fine! With Colette Alexander and Clint Byrum	What's happening in the world of SRE and resilience engineering? Join us as we catch up with fellow podcast hosts Colette Alexander and Clint Byrum of the This Is Fine! podcast at SREcon in Seattle.	—
2/26/26	The One With Damion Yates and Building AI systems	How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems? In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments. Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.	—
2/11/26	The One With Carla Geisser and Crisis Engineering	Join us for a discussion with Carla Geisser of Layer Aleph, a company focused on "crisis engineering". Carla distinguishes a crisis from a standard incident by noting that a crisis is novel and lacks a playbook. She outlines five criteria for a true crisis: fundamental surprise, broken critical functions, high visibility, a rigid deadline (unlike internal tech deadlines), and perception breakdown. Crises often arise in organizations that struggle to admit computers control core decisions, leading to complex, glued-together systems. Carla emphasizes that SRE-adjacent skills are essential for connecting the dots and exposing the full system. The key takeaway for SREs is to recognize when a true crisis is happening, as leadership will only be willing to "break rules" and enable substantive change once three of these criteria are met.	—
2/5/26	The One with Parker Barnes, Felipe Tiengo Ferreira, and AI	This episode of the Prodcast tackles the challenges of maintaining AI safety and alignment in production. Guests Felipe Tiengo Ferreira and Parker Barnes join hosts Matt Siegler and Steve McGhee to discuss AI model safety, from examining content to emerging security risks. The discussion emphasizes the vital role of SREs in managing safety at scale, detailing multi-layered defenses, including system instructions, LLM classifiers, and Automated Red Teaming (ART). Felipe and Parker dive into the evolving world of AI safety, from core product policies to the groundbreaking Frontier Safety Framework. The guests explore the need for SRE principles like drift detection and context observability. Finally, they raise concerns about the velocity of AI development compressing long-term research, urging the industry to collaborate and share vocabulary to address rapidly emerging risks.	—
1/28/26	The One With Shannon Brady and Operating Systems	In this episode of the Prodcast, guest Shannon Brady speaks with hosts Jordan Greenberg and Florian Rathgeber about managing Google's vast fleet of internal devices. Shannon explains how Google's Linux platform uses core SRE principles—specifically testing, canarying, and monitoring—for weekly staged rollouts of its Debian-based distribution. Configuration is efficiently managed using Puppet to ensure the right setup for a diverse user base. The conversation pivots to "the year of Linux everything," underscoring its widespread adoption. Discussing AI, Shannon identifies its greatest utility for SREs in rapidly analyzing signals and generating complex queries to resolve outages. This episode reinforces that practicing SRE fundamentals is paramount, demonstrating that you can be an SRE at heart, regardless of your official title.	—
1/21/26	The One With Denia Del Cid and AI	Curious about the real impact of AI on Site Reliability Engineering? In this episode of The Prodcast, Google SRE Denia del Cid breaks down how her team is leveraging AI to transform production workflows. Denia details practical applications like early outage detection, incident similarity analysis, and toil reduction. She explains the critical importance of validating against "golden data sets" and keeping humans in the loop to build trust. Discover how SREs are evolving from skepticism to strategic adoption with Gemini. Tune in for a pragmatic, measured look at the future of reliability.	—
1/14/26	The One With Heather Adkins and Security (and AI)	Join us on The Prodcast as we host Heather Adkins, leader of Google's Office of Cybersecurity Resilience, for a critical look at the future of digital defenses. We explore the intersection of SRE and security, unpacking the "Secure by Design" philosophy and the shared DNA of incident management. Heather candidly discusses the rise of "Agentic AI hackers" and polymorphic malware, revealing how defenders can use AI to stay ahead. From "castle" defense strategies to "nodal biology" theories, this episode is a must-listen for anyone navigating the new era of AI-driven threats.	—
1/7/26	The One With SLOs	In this episode, we welcome Alex Hidalgo and Brian Singer of nobl9 to discuss Service Level Objectives (SLOs). Alex and Brian talk about how SLOs can establish a vernacular across industry verticals, leading to constructive conversations and a shared understanding of how to implement SRE practices. Join us for a lively discussion that ranges across SLO topics!	—
12/16/25	The One With Steph Hippo and Observability	In this episode, Steph Hippo, Platform Engineering Director at Honeycomb, joins The Prodcast to discuss AI and SRE. Steph explains how observability helps us understand complex systems from their outputs, and provides a foundation for SRE to respond to system problems. This episode explains how AI and observability build a self-reinforcing loop. We also discuss how AI can detect and respond to certain classes of incidents, leading to self-healing systems and allowing SREs to focus on novel and interesting problems. She advises small businesses adopting AI to learn from others' mistakes (post-mortems) and to commit time and budget to experimentation.	—
7/30/25	The One with Ben Good and Our Kubernetes Friends	In this special episode, hosts Steve McGhee from the Google SRE Prodcast and Kaslin Fields from the Google Kubernetes Podcast welcome Google Cloud Solutions Architect Ben Good to discuss platform engineering. Listeners can look forward to hearing about the role of Kubernetes as a tool for building platforms, how to create "golden paths" for developers, and the importance of observability and self-service in platform design. The conversation also touches on industry trends, the bespoke nature of platforms, and how DORA metrics can be applied to platform engineering practices.	—
7/23/25	The One With AI Agents, Ramón Llamas, and Swapnil Haria	Google Staff SRE Ramón Llamas and Google Software Engineer Swapnil Haria join our hosts to explore how AI agents are revolutionizing production management, from summarizing alerts and finding hidden errors to proactively preventing outages. Learn about the challenges of evaluating non-deterministic systems and the fascinating interplay between human expertise and emerging AI capabilities in ensuring robust and reliable infrastructure.	—
7/16/25	The One with Technical Program Managers and Karanveer Anand	This episode features Google Technical Program Manager (TPM) Karanveer Anand, who joins our hosts to discuss the unique role of TPMs in Site Reliability Engineering (SRE). The conversation highlights how SRE TPMs bridge the gap between technical details and business impact, managing complex projects with inter-team dependencies and ensuring system reliability, particularly in the rapidly evolving AI landscape.	—
7/2/25	The One with STPA, Jeffrey Snover, and Theo Klein	This episode discusses Systems Theoretic Process Analysis (STPA), a method for analyzing complex systems. Theo Klein, a Google SRE, and Jeffrey Snover, a Distinguished Engineer at Google, explain that STPA focuses on identifying how system accidents and losses occur due to a loss of control, rather than component failures. STPA helps identify design flaws early, even before code is written! The discussion highlights that STPA is a human-driven process, prompting critical questions about system goals and potential losses, and that Google is adapting the pure STPA approach for commercial software development to make it more practical and efficient.	—
6/25/25	The One with Startups and Adam Fletcher	In this episode, hosts Steve McGhee and Matt Siegler are joined by guest Adam Fletcher, CEO and Co-Founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.	—
6/18/25	The One with SLOs and Sal Furino	In this episode, Sal Furino, Customer Reliability Engineer at Bloomberg, discusses all things Service Level Objectives (SLOs) with hosts Steve McGhee and Matt Siegler. Together, they dig into what successful SLOs look like, how they relate to users, and how SLOs provide an effective framework for joint decisions about system reliability across product, engineering, and leadership teams.	—
6/11/25	The One With the Future of SRE and Matt Zelesko	Matt Zelesko, the head of Site Reliability Engineering at Google, discusses the evolution of SRE, highlighting the shift from traditional operations to a model that balances velocity and reliability to better serve the rapid advancements in AI and ML. He emphasizes that SRE's core mission is to enable partners to move quickly while meeting reliability goals, and that the sheer scale of Google's infrastructure necessitates the SRE model for cross-system problem-solving. Zelesko envisions AI as a crucial assistant for SREs, improving incident detection, mitigation, and postmortem processes, and allowing SREs to focus on more complex engineering challenges and risk management earlier in the development cycle, while still valuing the hands-on experience of operating production infrastructure.	—
6/4/25	The One with AI and Todd Underwood	In this Google Prodcast episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics discussed include the challenges of AI-Ops, limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing of book publication.	—
5/28/25	The One With Data Centers and Peter Pellerzi	This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics such as the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, the use of AI for optimizing cooling plants, and more. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.	—
5/21/25	The One With Security and Jessica Theodat	Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touches on risk management, the unique nature of security incident responses, and the shared goals between security and SRE. The crew also delves into the balance between security and SRE, acknowledging the tension and the need for collaboration between teams to achieve business goals and user trust.	—
4/16/25	We're back with Season 4!	In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.	—
1/29/25	Special Episode: You Missed a Page from Telebot	This episode features Javi Beltran, a Google engineering lead who created the "Telebot" theme song. With our beloved hosts, Steve McGhee and Jordan Greenberg, Beltran discusses the origins of the song, created in 2012 for Google's paging system. The song was meant to add a touch of levity to what could be a stressful situation for engineers on-call. Beltran also unveils a new, more modern remix of "Telebot" (created in collaboration with our host, Jordan Greenberg!) which will be used as the intro theme for the podcast's next season.	—
12/11/24	Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano	In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE at Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and stress the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.	—