
Data Engineering Podcast
by Tobias Macey
Tobias Macey is an independent podcast creator known for his expertise in data engineering and technology. He has established himself as a prominent voice in the field, exploring intricate topics related to data management and infrastructur…
Insights from recent episode analysis
Audience Interest
- data engineering techniques
- automation in data workflows
Podcast Focus
- tools for data engineering
- challenges in data engineering
Publishing Consistency
- 509 episodes produced
- active for 9 years
Platform Reach
- available on major podcast platforms
- growing listener base
Insights are generated by CastFox AI using publicly available data, episode content, and proprietary models.
Total monthly reach
Estimated from 6 chart positions in 6 markets.
By chart position
- 🇨🇦 CA · Technology · #127 · 5K to 30K
- 🇲🇽 MX · Technology · #38 · 30K to 100K
- 🇮🇳 IN · Technology · #84 · 10K to 30K
- 🇫🇷 FR · Technology · #109 · 1K to 10K
- 🇸🇬 SG · Technology · #64 · 3K to 10K
- Per-Episode Audience: 26K to 95K (est. listeners per new episode within ~30 days) · 🎙 ~2x weekly · 509 episodes · last published 1w ago
- Monthly Reach: 52K to 190K (unique listeners across all episodes, 30 days) · 🇲🇽 53% · 🇨🇦 16% · 🇮🇳 16% · +3 more
- Active Followers: 21K to 76K (loyal subscribers who consistently listen) · 3.9K real followers tracked across platforms
Recent episodes
Holding Kafka Right: Product-Friendly Streaming with TypeStream
Jun 18, 2026
Unknown duration
Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards
Jun 8, 2026
52m 52s
Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture
Jun 1, 2026
54m 20s
Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes
May 6, 2026
58m 34s
The AI-First Data Engineer: 10–50x Productivity and What Changes Next
Apr 7, 2026
59m 24s
| Date | Episode | Topics | Guests | Brands | Places | Keywords | Sponsor | Length | |
|---|---|---|---|---|---|---|---|---|---|
| 6/18/26 | Holding Kafka Right: Product-Friendly Streaming with TypeStream | Summary: In this episode Jevin Maltais talks about the practical realities of building reliable, product-focused streaming systems with Kafka. Jevin shares lessons from roles at Zapier, Humi, and Clio, where real-time synchronization, customer data unification, and document sync at scale highlighted both the strengths and common misuses of Kafka. He digs into using events as the source of truth, materialized views with KTables, and how schema registries and type safety prevent downstream breakage. Jevin explains why teams often reach for heavyweight Kafka clusters without leveraging Streams, Connect, or interactive queries, and how his project, TypeStream, aims to make those capabilities accessible via config-as-code while keeping a thin abstraction and clear escape hatches. He also explores trade-offs across Kafka-compatible alternatives, CDC with Debezium in the real world, and where abstractions should stop so teams can scale responsibility as complexity grows. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing. Your host is Tobias Macey and today I'm interviewing Jevin Maltais about the challenges of building a reliable streaming Interview: Introduction. How did you get involved in the area of data management? Can you describe what TypeStream is and the story behind it? What are the common challenges that teams encounter when trying to build on top of Kafka? How do those challenges/misconfigurations impact the team's ability to deliver on product goals? What are the fundamental design aspects of Kafka that contribute to the difficulties that teams encounter when using it as an element of their architecture? There have been numerous projects taking aim at Kafka, with varying approaches and degrees of effectiveness (e.g. RedPanda, AutoMQ, Pulsar, etc.). What are the tradeoffs that each of those approaches requires? What makes the original Kafka project so resilient in the face of all of that competition? Can you describe the architecture of TypeStream and how each of the core elements contributes to a better user experience? For teams who want to take advantage of streaming capabilities, but don't want to invest in becoming Kafka experts, what does the TypeStream workflow look like? If they don't want to manage the operational overhead of a Kafka cluster, how tightly coupled is TypeStream to the original Kafka? (Can someone use RedPanda or AutoMQ instead?) What are the most interesting, innovative, or unexpected ways that you have seen TypeStream used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on TypeStream? When is TypeStream the wrong choice? What do you have planned for the future of TypeStream? Contact Info: Website. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: TypeStream, Zapier, Airflow, Kafka, KTables, KSQL, RedPanda, Pulsar, AutoMQ, Kafka Schema Registry, Debezium, Change Data Capture, Kafka Connect, Terraform, Kafka Compacted Topic. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 6/8/26 | Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards | AI, data platform, +4 more | Shravan Gunda | Kaarvi AI | — | AI-native, data ingestion, +4 more | DataDriven.io | 52m 52s | |
| 6/1/26 | Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture | graph analytics, data architecture, +4 more | Weimo Liu | PuppyGraph, Iceberg, +4 more | — | graph querying, Cypher, +6 more | DataDriven.io | 54m 20s | |
| 5/6/26 | Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes | GPU utilization, heterogeneous pipelines, +5 more | Robert Nishihara | Anyscale, Ray, +4 more | — | GPU, AI, +8 more | — | 58m 34s | |
| 4/7/26 | The AI-First Data Engineer: 10–50x Productivity and What Changes Next | AI in data engineering, productivity gains, +4 more | Gleb Mezhanskiy | Datafold | — | AI-first, data engineering, +8 more | — | 59m 24s | |
| 3/29/26 | Treat Metering Like Finance: Building Data Platforms for Consumption Economics | data platforms, consumption economics, +4 more | Himant Goyal | Salesforce, Data Engineering Podcast | — | data platform, metering, +5 more | — | 50m 19s | |
| 3/22/26 | Beyond the PDF: Rowan Cockett on Reproducible, Composable Science | reproducible science, data systems, +5 more | Rowan Cockett | Jupyter, Jupyter Book, +5 more | — | reproducibility crisis, data integrity, +5 more | — | 42m 40s | |
| 3/16/26 | Beyond Prompts: Practical Paths to Self‑Improving AI | self-improving AI, agentic systems, +5 more | Raj Shukla | SymphonyAI, Data Engineering Podcast, +1 more | — | self-improving AI, feedback loops, +5 more | — | 1h 01m 50s | |
| 3/8/26 | Orion at Gravity: Trustworthy AI Analysts for the Enterprise | trustworthy AI, agentic analytics, +4 more | Lucas Thelosen, Drew Gilson | Orion, Gravity, +2 more | public companies | AI analysts, data semantics, +5 more | — | 1h 05m 01s | |
| 3/2/26 | From Models to Momentum: Uniting Architects and Engineers with ER/Studio | enterprise data modeling, semantic models, +4 more | Jamie Knowles, Ryan Hirsch | ER/Studio, Data Engineering Podcast | — | data modeling, semantic drift, +3 more | — | 45m 02s | |
| 2/22/26 | From Data Models to Mind Models: Designing AI Memory at Scale | AI memory, agentic memory, +4 more | Vasilije "Vas" Markovich | Redis, Qdrant, +3 more | — | agentic memory, knowledge sharing, +6 more | — | 57m 47s | |
| 2/15/26 | Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops | GenAI Ops, LLM-powered applications, +4 more | Aman Agarwal | OpenLit, OpenTelemetry | — | LLM, OpenTelemetry, +6 more | — | 50m 43s | |
| 2/8/26 | From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization | AI readiness, application modernization, +3 more | Shilpa Kolhar | Application Modernization Platform (AMP), Atlas Vector Search, +1 more | — | MongoDB, AMP, +3 more | — | 46m 45s | |
| 2/1/26 | Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows | version-controlled SQL database, data systems, +4 more | Tim Sehn | DoltHub, MySQL, +4 more | — | Dolt, version control, +6 more | Retool | 56m 53s | |
| 1/25/26 | Logical First, Physical Second: A Pragmatic Path to Trusted Data | Summary: In this episode of the Data Engineering Podcast Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks', and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests. You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey and today I'm interviewing Jamie Knowles about the impact that a well-developed data architecture (or lack thereof) has on data engineering work. Interview: Introduction. How did you get involved in the area of data management? Can you start by giving your definition of "data architecture" and what it encompasses? How does the nuance change depending on the type of system you are designing? (e.g. data warehouse vs. transactional application database vs. event-driven streaming service) In application teams that are large enough there is typically a software architect, but that work often ends up happening organically through trial and error. Who is the responsible party for designing and enforcing a proper data architecture? There have been several generational shifts in approach to data warehouse projects in particular. What are some of the anti-patterns that crop up when there is no one forming a strong opinion on the design/architecture of the warehouse? The current stage is largely defined by the ELT pattern. What are some of the ways that workflow can encourage shortcuts? Often the need for a proper architecture isn't felt until an organic architecture has developed. What are some of the ways that teams can short-circuit that pain and iterate toward a more sustainable design? The common theme in all of the data architecture conversations that I've had is the need for business involvement. There is also a strong push for the business to just want the engineers to deliver data. What are some of the ways that AI utilities can help to accelerate delivery while also capturing business context? For teams that are already neck deep in a messy architecture, what are the strategies and tactics that they need to start working toward today to get to a better data architecture? What are the most interesting, innovative, or unexpected ways that you have seen teams approach the creation and implementation of their data architecture? What are the most interesting, unexpected, or challenging lessons that you have learned while working in data architecture? How do you see the introduction of AI at each stage of the data lifecycle changing the ways that teams think about their architectural needs? Contact Info: LinkedIn. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: Idera, ER Studio, ELT, RDF == Resource Description Framework, ORM == Object-Relational Mapping. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 1/18/26 | Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability | Summary: In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration - can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explores the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3’s JSON shredding) and Snowflake’s implementation, and why open table formats enable “your data in your lake” strategies. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks', and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests. You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economics. Interview: Introduction. How did you get involved in the area of data management? Can you start by giving an overview of what the major pain points have been in the observability space? (e.g. limited scale/retention, costs, integration fragmentation) What are the elements of the ecosystem and tech stacks that led to that state of the world? What are you building at Observe that circumvents those pain points? What are the major ecosystem evolutions that make this a feasible architecture? (e.g. columnar storage, distributed compute, protocol consolidation) Can you describe the architecture of the Observe platform? How has the design of the platform evolved/changed direction since you first started working on it? What was your process for determining which core technologies to build on top of? What were the missing pieces that you had to engineer around to get a cohesive and performant platform? The perennial problem with observability systems and data lakes is their tendency to succumb to entropy. What are the guardrails that you are relying on to help customers maintain a well-structured and usable repository of information? Data lakehouses are excellent for flexibility and scaling to massive data volumes, but they're not known for being fast. What are the areas of investment in the ecosystem that are changing that narrative? As organizations overcome the constraints of limited retention periods and anxiety over cost, what new use cases does that unlock for their observability data? How do AI applications/agents change the requirements around observability data? (collection, scale, complexity, applications, etc.) What are the most interesting, innovative, or unexpected ways that you have seen Observe/lakehouse technologies used for observability? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Observe? When are Observe/lakehouse technologies the wrong choice? What do you have planned for the future of Observe? Contact Info: LinkedIn. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: Observe Inc., Lakehouse Architecture, Splunk, Observability, RSyslog, GlusterFS, Dremel, Drill, BigQuery, Snowflake SIGMOD Paper, Prometheus, Datadog, NewRelic, AppDynamics, DynaTrace, Loki, Cortex, Mimir, Tempo, Cardinality, FluentBit, FluentD, OpenTelemetry, OTLP == OpenTelemetry Protocol, Kafka, VPC Flow Logs, Read Amplification, Lance, Iceberg, Hudi, PromQL. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 1/12/26 | Semantic Operators Meet Dataframes: Building Context for Agents with FENIC | Summary: In this episode Kostas Pardalis talks about Fenic - an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today’s data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explores how Fenic introduces semantic operators (e.g., semantic filter, extract, join) as first-class citizens in the logical plan so the optimizer can reason about inference, costs, and constraints. This enables developers to turn unstructured data into explicit schemas, compose transformations lazily, and offload LLM work safely and efficiently. He digs into Fenic’s architecture (lazy dataframe API, logical/physical plans, Polars execution, DuckDB/Arrow SQL path), how it exposes tools via MCP for agent integration, and where it fits in context engineering as a companion for memory/state management in agentic systems. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks', and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests. Your host is Tobias Macey and today I'm interviewing Kostas Pardalis about Fenic, an opinionated, PySpark-inspired DataFrame framework for building AI and agentic applications. Interview: Introduction. How did you get involved in the area of data management? Can you describe what Fenic is and the story behind it? What are the core problems that you are trying to address with Fenic? Dataframes have become a popular interface for doing chained transformations on structured data. What are the benefits of using that paradigm for LLM use-cases? Can you describe the architecture and implementation of Fenic? How have the design and scope of the project changed since you first started working on it? You position Fenic as a means of bringing reliability to LLM-powered transformations. What are some of the anti-patterns that teams should be aware of when getting started with Fenic? What are some of the most common first steps that teams take when integrating Fenic into their pipelines or applications? What are some of the ways that teams should be thinking about using Fenic and semantic operations for data pipelines and transformations? How does Fenic help with context engineering for agentic use cases? What are some examples of toolchains/workflows that could be replaced with Fenic? How does Fenic integrate with the broader ecosystem of data and AI frameworks? (e.g. Polars, Arrow, Qdrant, LangChain/Pydantic AI) What are the most interesting, innovative, or unexpected ways that you have seen Fenic used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fenic? When is Fenic the wrong choice? What do you have planned for the future of Fenic? Contact Info: LinkedIn. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: Fenic, RudderStack (Podcast Episode), Trino, Starburst, Trino Project Tardigrade, Typedef AI, dbt, PySpark, UDF == User-Defined Function, LOTUS, Pandas, Polars, Relational Algebra, Arrow, DuckDB, Markdown, Pydantic AI (AI Engineering Podcast Episode), LangChain, Ray, Dask. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 1/5/26 | ![]() Beyond Dashboards: How Data Teams Earn a Seat at the Table | Summary In this episode Goutham Budati about his Data–Perspective–Action framework and how it empowers data teams to become true business partners. Gautham traces his path from automating Excel reports to leading high‑impact data organizations, then breaks down why technical excellence alone isn’t enough: teams must pair reliable data systems with deliberate storytelling, clear problem framing, and concrete action plans. He digs into tactics for moving from reactive ticket-taking to proactive influence — weekly one‑page narratives, design-first discovery, sampling stakeholders for real pain points, and treating dashboards as living roadmaps. He also explores how to right-size technical scope, preserve trust in core metrics, organize teams as “build” and “storytelling” duos, and translate business macros and micros into resilient system designs. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementComposable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. 
It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/BuildYour host is Tobias Macey and today I'm interviewing Goutham Budati about his data-perspective-action framework for empowering data teams to be more influential in the businessInterview IntroductionHow did you get involved in the area of data management?Can you describe what the Data-Perspective-Action framework is and the story behind it?What does it look like when someone operates at each of those three levels?How does that change the day-to-day work of an individual contributor?Why does technically excellent data work sometimes fail to drive decisions?How do you identify whether a data system or pipeline is actually creating value versus just existing?What's the moment when you realized that building reliable systems wasn't the same as enabling better decisions?Better decisions still need to be powered by reliable systems. How do you manage the tension of focusing on up-time against focusing on impact?What does it mean to add "Perspective" to data? How is that different from analysis or insights?How do you know when you're overwhelming stakeholders versus giving them what they need?What changes when you start designing systems to surface signal rather than just providing comprehensive data?How do you learn what business context matters for turning data into something actionable?What does it mean to design for Action from day one? 
How does that change what you build? How do you get stakeholders to actually act on data instead of just consuming it? Walk us through how you structure collaboration with business partners when you're trying to drive decisions, not just inform them. What's the relationship between iteration and trust when you're building data products? What does the transition from order-taker to strategic partner actually look like? What has to change? How do you position data work as driving the business rather than supporting it? Why does storytelling matter for data professionals? What role does it play that technical communication doesn't cover? What organizational structures or team setups help data people gain influence? Tell us about a time when you built something technically sound that failed to create impact. What did you learn? What are the common patterns in dysfunctional data organizations? What causes the breakdown? How do you rebuild credibility when you inherit a data function that's lost trust with the business? What's the relationship between technical excellence and stakeholder trust? Can you have one without the other? When is this framework the wrong lens? What situations call for a different approach? How do you balance the demand for technical depth with the need to develop business and communication skills? How should data professionals position themselves as AI and ML tools become more accessible? What shifts do you see coming in how businesses think about data work? How is your thinking about data impact evolving? For someone who recognizes they're focused purely on the technical work and wants to expand their impact, where should they start? Contact Info LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 12/29/25 | Unfreezing The Data Lake: The Future-Proof File Format | Summary In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses how table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file format.
Interview Introduction. How did you get involved in the area of data management? Can you describe what the F3 project is and the story behind it? We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones? Parquet is the format with perhaps the broadest adoption. What are the challenges that such wide use poses when trying to modify or extend the specification? The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer? Can you describe the key design principles of the F3 format? What are the engineering challenges that you faced while developing your implementation of the F3 proof-of-concept? The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. integration with compute frameworks) What are some examples of features in data lake use cases that could be enabled by F3? What are some of the other ideas/hypotheses that you developed and discarded in the process of your research? What are the most interesting, unexpected, or challenging lessons that you have learned while working on F3? What do you have planned for the future of F3?
Contact Info Personal Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links F3 Paper, Formats Evaluation Paper, F3 Github, SAL Paper, RisingWave, Tencent Cloud, Parquet, Arrow, Andy Pavlo, Wes McKinney, CMU Public Seminar, VLDB, ORC, Protocol Buffers, Lance, PAX == Partition Attributes Across, WASM == Web Assembly, DataFusion, DuckDB, DuckLake, Velox, Vortex File Format. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 12/21/25 | From Context to Semantics: How Metadata Powers Agentic AI | Summary In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observability, and governance in one workflow. They share how AI agents can now automate documentation, classification, data quality testing, and enforcement of policies, and why aligning governance with user identity and intent is essential as agentic access scales. They also dig into scalability strategies, MCP-based agent workflows, AI governance (including model/agent tracking), and the emerging convergence of big data with ontologies to deliver machine-understandable meaning. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations.
If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey and today I'm interviewing Suresh Srinivas and Sriharsha Chintalapani about how metadata catalogs provide the context clues necessary to give meaning to your data for AI systems.
Interview Introduction. How did you get involved in the area of data management? Can you start by giving an overview of the roles that metadata catalogs are playing in the current state of the ecosystem? How has the OpenMetadata platform evolved over the past 4 years? How has the focus on LLMs/generative AI changed the trajectory of services like OpenMetadata? The initial set of use cases for data catalogs was to facilitate discovery and documentation of data assets for human consumption. What are the structural elements of that effort that have paid dividends for an AI audience? How does the AI audience change the requirements around the cataloging and presentation of metadata? One of the constant challenges in data infrastructure now is the tension of making data accessible to AI systems (agentic or otherwise) and incorporating AI into the inner loop of the service. What are the opportunities for bringing AI inside the boundaries of a system like OpenMetadata vs. as a client or consumer of the platform? The key phrase of the past ~2 years is "context engineering". What role does the metadata catalog play in that undertaking? What are the capabilities that the catalog needs to be able to effectively populate and curate that context? How much awareness does the LLM or agent need to have to be able to use the catalog effectively? What does a typical workflow/agent loop look like when it is using something like OpenMetadata in pursuit of knowledge that it needs to achieve an objective? How do agentic use cases strain the existing set of governance frameworks? What new considerations (procedural or technical) need to be factored into governance practices to balance velocity with security? What are the most interesting, innovative, or unexpected ways that you have seen OpenMetadata/Collate used in AI/agentic contexts? What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata/Collate? When is OpenMetadata/Collate the wrong choice? What do you have planned for the future of OpenMetadata?
Contact Info Suresh: LinkedIn. Sriharsha: LinkedIn. Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links OpenMetadata Podcast Episode, Hadoop, Hortonworks, Context Engineering, MCP == Model Context Protocol, JSON Schema, dbt, LangSmith, OpenMetadata MCP Server, API Gateway. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 12/14/25 | From Data Engineering to AI Engineering: Where the Lines Blur | Summary In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The conversation covers how unstructured data is becoming more prominent, vectors and knowledge graphs are emerging as key components, and reliability expectations are changing due to interactive user-facing AI. The host also delves into process changes, including tighter collaboration, faster dataset onboarding, new governance and access controls, and the importance of treating experimentation and evaluation as fundamental testing practices. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm reflecting on the increasingly blurry boundaries between data engineering and AI engineering.
Interview Introduction. I started this podcast in 2017, right when the term "Data Engineer" was becoming widely used for a specific job title with a reasonably well-understood set of responsibilities. This was in response to the massive hype around "data science" and consequent hiring sprees that characterized the mid-2000s to mid-2010s. The introduction of generative AI and AI Engineering to the technical ecosystem is changing the scope of responsibilities for data engineers and other data practitioners. Of note is the fact that: AI models can be used to process unstructured data sources into structured data assets; AI applications require new types of data assets; the SLAs for data assets related to AI serving are different from BI/warehouse use cases; the technology stacks for AI applications aren't necessarily the same as for analytical data pipelines; because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology; experimentation has moved from being just an MLOps capability into being a core need for organizations.
Contact Info Email Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links AI Engineering Podcast. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 12/8/25 | Malloy: Hierarchical Data, Semantic Models, and the Future of Analytics | Summary In this episode Michael Toy, co-creator of Malloy, talks about rethinking how we work with data beyond SQL. Michael shares the origins of Malloy from his and Lloyd Tabb’s experience at Looker, why SQL’s mental model often fights human problem solving, and how Malloy aims to be a composable, maintainable language that treats SQL as the assembly layer rather than something humans should write. He explores Malloy’s core ideas: semantic modeling tightly coupled with a query language, hierarchical data as the default mental model, and preserving context so analysis stays interactive and open-ended. He also digs into the developer experience and ecosystem: Malloy’s TypeScript implementation, VS Code integration, CLI, emerging notebook support, and how Malloy can sit alongside or replace parts of existing transformation workflows. Michael discusses practical trade-offs in language design, the surprising fit for LLM-generated queries, and near-term roadmap areas like dimensional filtering, better aggregation strategies across levels, and closing gaps that still require escaping to SQL. He closes with an invitation to contribute to the open-source project and help shape its evolution. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Michael Toy about Malloy, a modern language for building composable and maintainable analytics and data models on relational engines.
Interview Introduction. How did you get involved in the area of data management? Can you describe what Malloy is and the story behind it? What is the core problem that you are trying to solve with Malloy? There are countless projects that aim to reimagine/reinvent/replace SQL. What are the factors that make Malloy stand out in your mind? Who are the target personas for the Malloy language? One of the key success factors for any language is the ecosystem around it and the integrations available to it. How does Malloy fit in the toolchains and workflows for data engineers and analysts? Can you describe the key design and syntax elements of Malloy? How have the scope and focus of the language evolved since you first started working on it? How do the structure and semantics of Malloy change the ways that teams think about their data models? SQL-focused tools have gained prominence as the means of building the transformation stage of data pipelines. How would you characterize the capabilities of Malloy as a tool for building translation pipelines? What are the most interesting, innovative, or unexpected ways that you have seen Malloy used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Malloy? When is Malloy the wrong choice? What do you have planned for the future of Malloy?
Contact Info Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Malloy, Lloyd Tabb, SQL, Looker, LookML, dbt, Relational Algebra, Typescript, Ruby, Truffle, Malloy VSCode Plugin, Malloy CLI, Malloy Pick Statement. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 11/24/25 | Blurring Lines: Data, AI, and the New Playbook for Team Velocity | Summary In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear (code review, QA, async coordination) when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems.
Interview Introduction. How did you get involved in the area of data management? Can you start by giving an overview of the types of work that you are relying on AI development agents for? As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up? In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit? How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?
Contact Info LinkedIn Links Agor, Apache Airflow, Apache Superset, Preset, Claude Code, Codex, Playwright MCP, Tmux, Git Worktrees, Opencode.ai, GitHub Codespaces, Ona. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 11/16/25 | ![]() State, Scale, and Signals: Rethinking Orchestration with Durable Execution | Summary In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model—workflows, activities, task queues, and replay—and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. Shee also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. 
If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. 
And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures. Interview: Introduction. How did you get involved in the area of data management? Can you describe what durable execution is and how it impacts system architecture? With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work? One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.? What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects? Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data? AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications? What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks? What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal? When is Temporal/durable execution the wrong choice? What do you have planned for the future of Temporal for data and AI systems? Contact Info: LinkedIn. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! 
Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: Temporal, Durable Execution, Flink, Machine Learning Epoch, Spark Streaming, Airflow, Directed Acyclic Graph (DAG), Temporal Nexus, TensorZero, AI Engineering Podcast Episode. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
| 11/9/25 | The AI Data Paradox: High Trust in Models, Low Trust in Data | Summary: In this episode of the Data Engineering Podcast Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself. Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management. Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. 
ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. 
And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations. Interview: Introduction. How did you get involved in the area of data management? Can you start by describing the motivation and scope of your recent survey on data management investments for AI across your respondents? What are the key takeaways that were most significant to you? The survey reveals a fascinating paradox: 77% of leaders trust the data used by their AI systems, yet only half trust their organization's overall data quality. For our data engineering audience, what does this suggest about how companies are currently sourcing data for AI? Does it imply they are using narrow, manually-curated "golden datasets," and what are the technical challenges and risks of that approach as they try to scale? The report highlights a heavy reliance on manual data quality processes, with one expert noting companies feel it's "not reliable to fully automate validation" for external or customer data. At the same time, maturity in "Automated tools for data integration and cleansing" is low, at only 42%. What specific technical hurdles or organizational inertia are preventing teams from adopting more automation in their data quality and integration pipelines? There was a significant point made that with generative AI, "biases can scale much faster," making automated governance essential. From a data engineering perspective, how does the data management strategy need to evolve to support generative AI versus traditional ML models? What new types of data quality checks, lineage tracking, or monitoring for feedback loops are required when the model itself is generating new content based on its own outputs? The report champions a "centralized data management platform" as the "connective tissue" for reliable AI. 
How do you see the scale and data maturity impacting the realities of that effort? How do architectural patterns in the shape of cloud warehouses, lakehouses, data mesh, data products, etc. factor into that need for centralized/unified platforms? A surprising finding was that a third of respondents have not fully grasped the risk of significant inaccuracies in their AI models if they fail to prioritize data management. In your experience, what are the biggest blind spots for data and analytics leaders? Looking at the maturity charts, companies rate themselves highly on "Developing a data management strategy" (65%) but lag significantly in areas like "Automated tools for data integration and cleansing" (42%) and "Conducting bias-detection audits" (24%). If you were advising a data engineering team lead based on these findings, what would you tell them to prioritize in the next 6-12 months to bridge the gap between strategy and a truly scalable, trustworthy data foundation for AI? The report states that 83% of companies expect to integrate more data sources for their AI in the next year. For a data engineer on the ground, what is the most important capability they need to build into their platform to handle this influx? What are the most interesting, innovative, or unexpected ways that you have seen teams addressing the new and accelerated data needs for AI applications? What are some of the noteworthy trends or predictions that you have for the near-term future of the impact that AI is having or will have on data teams and systems? Contact Info: LinkedIn. Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements: Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. 
The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. Links: Boomi, Data Management, Integration & Automation Demo, Agentstudio, Data Connector Agent Webinar, Survey Results, Data Governance, Shadow IT, Podcast Episode. The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA | — | ||||||
Chart Positions
6 placements across 6 markets.
