Most people see CastFox as a product — a search bar, a dashboard, a set of analytics tabs. What they do not see is the engineering reality underneath: a system that collects, unifies, classifies, and reasons over one of the largest podcast datasets ever assembled.
This article is a look behind the curtain — the architectural decisions, the AI pipelines, and the hard problems I solved in bringing CastFox from a prototype to a production platform serving real users every day.
The Data Problem: Heterogeneous Sources at Scale
The first challenge was deceptively simple to state and enormously difficult to solve: collect every podcast in existence, along with every episode they have ever published.
We are talking about 5 million podcasts and over 114 million episodes. But the real complexity is not the volume — it is the heterogeneity. Podcast data does not come from one clean API. It comes from dozens of fragmented, inconsistent, and often unreliable sources, each with its own schema, update frequency, and failure modes.
The Source Landscape
The primary data backbone is RSS — the original syndication format that podcasting was built on. Millions of podcasts publish RSS feeds, and each feed contains episode metadata: titles, descriptions, publication dates, audio URLs, and sometimes rich metadata like categories and guest names.
But RSS feeds vary wildly. Some are well-structured XML. Others are malformed, missing fields, encoding characters incorrectly, or serving stale data. Some feeds contain 10 episodes. Others contain 5,000. Some update daily. Others have not been touched in years but still need to be indexed.
Beyond RSS, we integrate with multiple podcast platform APIs — each exposing different slices of data in different formats. One API returns listener demographics. Another returns chart rankings. A third provides review counts but not review text. None of them agree on podcast identifiers, so the same show might be represented by three different IDs across three different sources.
The Real Question
The Core Challenge
How do you take millions of records from heterogeneous sources with different schemas, different update cadences, different reliability profiles, and different data quality levels — and merge them into a single, coherent, queryable data layer that downstream AI systems can trust?
This is a data engineering problem, an identity resolution problem, and a systems architecture problem all at once. And it had to run continuously — because podcasts publish new episodes every day, feeds change, shows launch and die, and the data layer must reflect reality in near-real-time.
CastFox data ingestion pipeline: from raw heterogeneous sources to a unified AI-ready data layer
Building the Unified Data Layer
The solution I designed is a multi-stage ingestion pipeline that treats every data source as an unreliable, eventually-consistent stream. Nothing is trusted at face value. Every record passes through validation, normalization, and entity resolution before it enters the canonical data store.
Identity Resolution
The hardest sub-problem is identity resolution. When an RSS feed says "The Tim Ferriss Show" and an API returns "Tim Ferriss Show" and a chart listing says "The Tim Ferriss Show - Podcast" — are these the same podcast? Almost certainly. But proving that programmatically, across millions of records, without false positives that merge distinct shows, requires a carefully tuned matching pipeline: fuzzy string matching, feed URL canonicalization, cross-referencing known identifiers, and confidence scoring.
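To make the matching pipeline concrete, here is a minimal sketch of the three signal types described above — title canonicalization, feed URL canonicalization, and confidence scoring. The thresholds, suffix list, and record shape are illustrative assumptions, not CastFox's actual implementation:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical noise suffixes; a production list would be far longer.
NOISE_SUFFIXES = (" - podcast", " podcast", " (official)")

def canonical_title(title: str) -> str:
    """Lowercase, strip leading articles and common suffixes."""
    t = title.strip().lower()
    for suffix in NOISE_SUFFIXES:
        if t.endswith(suffix):
            t = t[: -len(suffix)]
    if t.startswith("the "):
        t = t[4:]
    return t.strip()

def canonical_feed_url(url: str) -> str:
    """Normalize host, drop scheme, query strings, and trailing slashes."""
    p = urlparse(url.strip().lower())
    return f"{p.netloc.removeprefix('www.')}{p.path.rstrip('/')}"

def match_confidence(a: dict, b: dict) -> float:
    """Blend signals into one score, tuned to prefer a missed merge
    (duplicate records) over a false merge (two distinct shows)."""
    title_sim = SequenceMatcher(
        None, canonical_title(a["title"]), canonical_title(b["title"])
    ).ratio()
    same_feed = canonical_feed_url(a["feed_url"]) == canonical_feed_url(b["feed_url"])
    shared_ids = bool(set(a.get("external_ids", [])) & set(b.get("external_ids", [])))
    if same_feed or shared_ids:
        return 1.0            # hard evidence: safe to merge
    return title_sim * 0.9    # soft evidence alone never reaches certainty

a = {"title": "The Tim Ferriss Show", "feed_url": "https://example.com/feed?x=1"}
b = {"title": "Tim Ferriss Show - Podcast", "feed_url": "https://example.com/feed"}
print(match_confidence(a, b))  # 1.0 — same canonical feed URL
```

The asymmetry is deliberate: a fuzzy title match alone is capped below 1.0 so that merging always requires either hard evidence or human review above a threshold.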
Schema Normalization
Each source has its own idea of what a "podcast" looks like. One returns episode durations in seconds. Another in HH:MM:SS format. A third omits duration entirely. Categories come as Apple Podcasts taxonomy strings from one source, free-text tags from another, and numerical codes from a third. The normalization layer maps every variant into a unified internal schema — a canonical representation that downstream systems can depend on without knowing or caring where the data originated.
Continuous Refresh
This is not a one-time ETL job. Podcasts publish new episodes constantly. Feeds go offline and come back. Shows rebrand. New podcasts launch every day. The ingestion pipeline runs continuously, prioritizing feeds by update frequency and freshness requirements, ensuring that the unified data layer reflects reality as closely as possible without burning through compute on feeds that rarely change.
Key Takeaway
The unified data layer is the foundation everything else depends on. Without it, classification is unreliable, search is incomplete, and AI reasoning is ungrounded. Getting this layer right was the single most important architectural decision in building CastFox.
The Classification Pipeline: Making Raw Data Intelligent
Once all the data lives in a unified layer, the next question becomes: what does it mean? A podcast is not just a title, a feed URL, and a list of episodes. It has a topic. It has an audience. It speaks a language. It operates in a country. It covers specific industries, mentions specific people and companies, and positions itself within a content niche. None of this is explicit in the raw data. All of it has to be inferred.
Category Classification
The category classification problem sounds straightforward until you look at the data. Podcast creators self-categorize their shows using platform-provided taxonomies, but these self-reported categories are noisy, outdated, and often wrong. A show labeled "Society & Culture" might actually be a true crime podcast. A show labeled "Business" might be a personal development show that occasionally mentions entrepreneurship.
I built a multi-signal classification system that does not trust self-reported labels. Instead, it analyzes what the podcast actually discusses — across episode titles, descriptions, and transcribed content — and assigns categories based on observed content. This uses a combination of off-the-shelf classification models for well-defined categories and custom-trained classifiers for nuanced distinctions that general-purpose models miss.
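The combining step can be sketched as a weighted vote over per-signal predictions. The weights and category names here are invented for illustration; the real system's classifiers and weighting are more involved:

```python
def combine_signals(signal_predictions: list[tuple[str, float, float]]) -> str:
    """Weighted vote over (category, model_confidence, signal_weight)
    triples from independent classifiers (titles, descriptions,
    transcripts). Self-reported labels are deliberately absent."""
    scores: dict[str, float] = {}
    for category, confidence, weight in signal_predictions:
        scores[category] = scores.get(category, 0.0) + confidence * weight
    return max(scores, key=scores.get)

predictions = [
    ("society-culture", 0.55, 0.2),  # title classifier: weak, noisy signal
    ("true-crime", 0.90, 0.4),       # description classifier
    ("true-crime", 0.85, 0.4),       # transcript classifier: strongest evidence
]
print(combine_signals(predictions))  # true-crime
```

This mirrors the "Society & Culture that is actually true crime" case: content-derived signals outvote whatever the creator self-reported.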
Entity Extraction
Named Entity Recognition (NER) runs across episode descriptions and transcripts to identify people, companies, brands, products, and topics mentioned in each episode. This enables one of CastFox's most powerful capabilities: searching not just for podcasts that are "about" a topic, but for specific episodes that mention a specific entity — a brand, a person, a technology. This is what separates CastFox from tools that only search titles and descriptions.
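The indexing side of this can be sketched with a toy gazetteer-based extractor. A real pipeline uses trained NER models rather than a lookup table, but the output shape — (entity, type) pairs attached to each episode — is the same:

```python
import re

# Toy gazetteer; purely illustrative stand-in for a trained NER model.
ENTITY_GAZETTEER = {
    "Y Combinator": "ORG",
    "Tim Ferriss": "PERSON",
    "OpenAI": "ORG",
}

def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (surface form, entity type) pairs found in episode text."""
    found = []
    for name, label in ENTITY_GAZETTEER.items():
        if re.search(rf"\b{re.escape(name)}\b", text):
            found.append((name, label))
    return found

desc = "This week we talk to a Y Combinator founder about OpenAI's latest model."
print(extract_entities(desc))
```

Once every episode carries these annotations, "find episodes that mention brand X" becomes a simple index lookup rather than a full-text gamble.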
Country and Language Classification
Determining where a podcast is "from" is surprisingly difficult. A show might be hosted in the UK, recorded in the US, published on an Australian platform, and have listeners worldwide. The country classification pipeline uses multiple signals — feed metadata, host location data, content language analysis, audience distribution data, and platform-specific geographic indicators — to assign the most meaningful country and language tags. Some of these classifiers are off-the-shelf NLP models. Others I built from scratch because no existing solution handled the specific combination of signals podcasting requires.
AI Agents in the Classification Stack
Not every classification task can be solved by a static model. Some require reasoning — looking at context, weighing conflicting signals, making judgment calls. For these tasks, I deployed AI agents that operate as autonomous classifiers. They receive a podcast's metadata and content signals, reason about what the podcast is, and return structured classification decisions. These agents are particularly valuable for edge cases that would confuse a traditional classifier: multilingual shows, niche crossover topics, or podcasts that deliberately resist categorization.
From raw podcast data to enriched intelligence powering every CastFox feature
RAG Agents and the Rise of Retrieval-Augmented Intelligence
Retrieval-Augmented Generation (RAG) has become one of the most important architectural patterns in modern AI systems — and for good reason. Pure language models are powerful reasoners but unreliable knowledge stores. They hallucinate. They go stale. They cannot access proprietary data. RAG solves this by giving models access to external, up-to-date, domain-specific knowledge at inference time.
In CastFox, RAG is not a feature — it is an architectural foundation. Multiple systems across the platform use retrieval-augmented pipelines to ground AI responses in real podcast data.
How RAG Works in Practice
The core pattern is straightforward: when a user asks a question or the system needs to make an inference, the query is first routed to a retrieval layer that searches the podcast knowledge base — episode transcripts, metadata, classification results, entity graphs — and returns the most relevant context. This context is then provided to the language model alongside the original query, anchoring its response in real, verified data rather than parametric memory.
But the naive RAG pattern — embed a query, retrieve top-k chunks, generate a response — is insufficient for a domain as complex as podcasting. The retrieval layer needs to understand that a query about "AI startups" should match episodes discussing specific companies even if they never use the phrase "AI startup." It needs to handle temporal queries ("podcasts that discussed this topic in the last 3 months") and relational queries ("shows where the host has interviewed founders from Y Combinator companies").
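For readers unfamiliar with the baseline, here is the naive retrieve-then-generate pattern the paragraph above calls insufficient. To keep it self-contained, a bag-of-words counter stands in for a dense embedding model; the corpus and prompt template are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use dense vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
    """Rank chunks by similarity to the query; return the top k."""
    qv = embed(query)
    return sorted(corpus, key=lambda d: cosine(qv, embed(d["text"])),
                  reverse=True)[:k]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Anchor the model in retrieved context instead of parametric memory."""
    context = "\n".join(f"- [{c['episode']}] {c['text']}" for c in chunks)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

corpus = [
    {"episode": "ep-101", "text": "interview about seed funding for ai startups"},
    {"episode": "ep-202", "text": "weekly recap of sports scores"},
]
top = retrieve("ai startup funding", corpus, k=1)
print(top[0]["episode"])  # ep-101
```

Note the failure mode baked into this sketch: "startup" and "startups" do not match under exact tokens, which is precisely why real retrieval needs semantic matching, and why temporal and relational queries need entirely different machinery.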
Agentic RAG: Beyond Simple Retrieval
Why Agentic RAG Matters
Instead of a single retrieve-then-generate pipeline, CastFox deploys AI agents that can plan multi-step retrieval strategies — decomposing complex queries, retrieving different types of evidence, cross-referencing results, and synthesizing grounded responses.
An agent receiving a complex query might first decompose it into sub-queries, retrieve different types of evidence for each, cross-reference results, filter for relevance and recency, and only then synthesize a final response. These agents have access to multiple retrieval backends — vector search, keyword search, structured database queries, and graph traversals — and choose the right tool for each sub-query.
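The planning step can be sketched as routing each facet of a query to the backend suited to it. In production an LLM does this planning; the rule-based router below is only a stand-in to show the output shape:

```python
from dataclasses import dataclass

@dataclass
class SubQuery:
    text: str
    backend: str  # "vector" | "keyword" | "sql" | "graph" (illustrative names)

def plan(query: str) -> list[SubQuery]:
    """Rule-based stand-in for an LLM planner: map query facets to
    retrieval backends. Real planners decompose the text as well."""
    steps = []
    if '"' in query:                        # quoted phrase → exact match
        steps.append(SubQuery(query, "keyword"))
    if any(w in query.lower() for w in ("last", "months", "recent")):
        steps.append(SubQuery(query, "sql"))      # temporal filter
    if "interviewed" in query.lower() or "guest" in query.lower():
        steps.append(SubQuery(query, "graph"))    # relational hop
    steps.append(SubQuery(query, "vector"))       # semantic fallback
    return steps

q = 'shows where the host interviewed "Y Combinator" founders in the last 3 months'
for step in plan(q):
    print(step.backend)
```

A single query thus fans out into keyword, temporal, relational, and semantic retrieval, and the agent cross-references the results before generating.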
The result is a system that does not just "search" podcasts — it reasons about them. It can answer questions that require combining information across multiple episodes, multiple shows, and multiple data types. This agentic approach is what powers PodcastGPT's ability to give specific, grounded, useful answers rather than generic summaries.
PodcastGPT: Teaching AI to Understand Podcasts
PodcastGPT is the most visible AI feature in CastFox — the product that users interact with directly. But what users see as a chat interface is actually the tip of an iceberg of infrastructure: internal APIs, retrieval pipelines, agent orchestration, and domain-specific prompt engineering.
The Architecture
PodcastGPT is not a fine-tuned model. It is an AI system built on top of foundation models, augmented with CastFox's proprietary data layer and a suite of specialized tools. When a user asks PodcastGPT a question — "Find me tech podcasts that have discussed Series A fundraising in the last 6 months" — the system does not generate an answer from memory. It executes a plan.
First, the query is analyzed and decomposed. What is the user actually looking for? What constraints are explicit (tech podcasts, Series A, last 6 months) and what is implied (English language, active shows, relevant audience size)? The query planner identifies the retrieval strategy and dispatches it.
Then, the internal API layer takes over. I built a dedicated API that provisions PodcastGPT with structured access to the entire CastFox data layer — podcast metadata, episode content, classification results, entity graphs, analytics data, and contact information. This API is not the same as CastFox's public-facing API. It is purpose-built for AI consumption, optimized for the kinds of queries that language models generate: complex filters, multi-dimensional searches, and aggregation operations.
Search Techniques
The search layer behind PodcastGPT combines multiple retrieval techniques, because no single method works for every query:
Semantic Vector Search
For natural language queries where intent matters more than exact keywords. "Shows about burnout in tech" should match episodes about developer mental health even if they never use the word "burnout."
Keyword and Structured Search
For precise queries where the user knows exactly what they want — a specific podcast name, a specific host, or a specific company.
Hybrid Search
Combining semantic and keyword approaches to get the best of both worlds: broad recall from semantic matching with precision from keyword matching.
Graph-Based Retrieval
Traversing relationships between entities. Which guests appeared on multiple shows? Which shows share topical clusters? Which hosts are connected to a specific industry?
Temporal Filtering
Layering time-awareness on top of all other search methods, because recency matters in podcasting.
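The hybrid combination step above has a well-known implementation: Reciprocal Rank Fusion (RRF), which merges ranked lists from retrievers whose raw scores are not comparable. This sketch assumes RRF is representative of the fusion approach; the document does not specify which fusion method CastFox uses:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge result lists from different
    retrievers (semantic, keyword, ...) by rank position alone,
    sidestepping their incompatible raw scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["ep-7", "ep-3", "ep-9"]   # dense-vector ranking
keyword  = ["ep-3", "ep-1", "ep-7"]   # BM25-style ranking
print(rrf_fuse([semantic, keyword]))  # ep-3 and ep-7 rise to the top
```

Episodes ranked well by both retrievers float above episodes ranked well by only one, which is exactly the recall-plus-precision trade the hybrid approach is after.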
Agent Orchestration
PodcastGPT uses a multi-agent architecture where specialized agents handle different aspects of a query. A search agent retrieves relevant podcasts. An analytics agent pulls performance data. A contact agent finds host information. A recommendation agent synthesizes everything into actionable suggestions. These agents coordinate through an orchestration layer that manages context, resolves conflicts, and ensures the final response is coherent and grounded.
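A minimal sketch of that coordination pattern, with stub lambdas standing in for the real search, analytics, and recommendation agents. The shared-context dict is the key idea; the agent names and data are invented:

```python
def orchestrate(query: str, agents: dict) -> dict:
    """Minimal orchestration loop: run specialized agents in order,
    accumulating their structured outputs into one shared context
    that later agents can read."""
    context = {"query": query}
    for name, agent in agents.items():
        context[name] = agent(query, context)  # each agent sees prior results
    return context

# Stub agents standing in for real retrieval/analytics/recommendation services.
agents = {
    "search": lambda q, ctx: ["ep-101", "ep-204"],
    "analytics": lambda q, ctx: {eid: 1000 * (i + 1)
                                 for i, eid in enumerate(ctx["search"])},
    "recommend": lambda q, ctx: max(ctx["analytics"], key=ctx["analytics"].get),
}
result = orchestrate("tech podcasts about Series A", agents)
print(result["recommend"])  # ep-204, the higher-download candidate
```

Production orchestration adds what this sketch omits: parallel execution, conflict resolution between agents, and context-window management.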
PodcastGPT decomposes queries across specialized agents backed by CastFox's internal API
What Is Not Visible: The AI Iceberg
Everything I have described so far — data unification, classification, entity extraction, PodcastGPT — represents only the visible portion of the AI work behind CastFox. There is an enormous amount of AI infrastructure running on the backend that users never see and that I have not detailed here.
Content quality scoring algorithms that rank podcasts by production quality, consistency, and engagement signals. Audience overlap modeling that estimates which shows share listeners. Trend detection pipelines that identify emerging topics before they peak. Host authority scoring that helps surface the most influential voices in any niche. Contact verification systems that validate and enrich host contact information. Anomaly detection that flags sudden changes in podcast metrics.
Each of these systems is its own engineering project, with its own data requirements, model choices, and scaling challenges. They run continuously in the background, keeping CastFox's intelligence layer fresh and accurate. The depth of AI work behind a platform like this is genuinely difficult to communicate — what users experience as a fast search result or a smart recommendation is often the output of dozens of interconnected AI systems working in concert.
From Prototype to Production
Building CastFox has been an exercise in end-to-end AI system architecture — the kind of work where you cannot specialize in just one layer. I designed the data pipelines. I built the classification models. I architected the RAG systems. I engineered PodcastGPT. I deployed and scaled everything for production traffic. And I made the architectural decisions that determine how all these pieces fit together.
This is what full-stack AI architecture looks like in practice: not just building a model, but building the entire system — from raw, messy, heterogeneous data to a polished product that users trust to make real business decisions. Prototype to production. End to end.

Mosab Alfaqeeh, PhD
Full-Stack Data Scientist & AI Architect
I build AI products from prototyping to production — end to end. My work spans data engineering, machine learning, NLP, and system architecture. I don't measure success by lines of code or models shipped. I measure it by clear ROI delivered. Defining the target, architecting the path, and hitting the number — that is what I do.