Active · technology · society · culture · philosophy

Redwood Research Blog

by Redwood Research

Narrations of Redwood Research blog posts. Redwood Research is a research nonprofit based in Berkeley. We investigate risks posed by the development of powerful artificial intelligence and techniques for mitigating those risks.

Insights from recent episode analysis

Audience Interest

Estimated Reach: 5K to 30K

Listeners across platforms

Podcast Focus

Categories: technology · society

+2 more

Publishing Consistency

Frequency: ~3-4 / Week

50+ episodes since 2024

Platform Reach

Insights are generated by CastFox AI using publicly available data, episode content, and proprietary models.

Medium Confidence

Total monthly reach

5K to 30K

Estimated from 1 chart position in 1 market.

By chart position

🇦🇺
AU · Technology
#152
5K to 30K

Per-Episode Audience
Est. listeners per new episode within ~30 days
1.5K to 9K
🎙 Daily cadence·99 episodes·Last published 1w ago
Monthly Reach
Unique listeners across all episodes (30 days)
5K to 30K
🇦🇺100%
Active Followers
Loyal subscribers who consistently listen
2K to 12K

On the show

From 10 eps

Host

Ryan Greenblatt

5 eps

Recent guests

No guests detected in recent episodes.

Recent episodes

“The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t” by Alek Westover, Alexa Pan, Sebastian Prasanna, Arun Jose

Jun 18, 2026

17m 58s

“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff

Jun 10, 2026

9m 49s

“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck Shlegeris

Jun 8, 2026

16m 37s

“Retrying vs Resampling in AI Control” by James Lucassen, Adam Kaufman

May 29, 2026

19m 29s

“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny

May 28, 2026

32m 19s

Episodes

116

~2.0 per week

Avg length

24m 04s

3m 06s – 2h 06m 27s

Range

Apr 2025 – Apr 2026

Topics

software engineering, monitoring +39

Guests

25 of 25

Date	Episode	Topics	Guests	Brands	Places	Keywords	Sponsor	Length
6/18/26	“The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t” by Alek Westover, Alexa Pan, Sebastian Prasanna, Arun Jose	Subtitle: If it transfers misalignment, we might get a misaligned model that's easier to incriminate. If it doesn’t, we might get a capable benign replacement model. Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: Misalignment doesn’t transfer to the student. If so, we get a fairly capable benign model, which we can use to perform tasks that we wouldn’t want a misaligned AI to perform. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., because it is less capable). If so, auditing the distilled model might give us indirect evidence of the teacher's misalignment. In a previous post we discussed the second possibility and proposed distillation for incrimination techniques: distillation methods that we hoped would transfer misalignment without transferring the ability to fool audits. In this post, we discuss the first possibility, and propose distillation for capabilities techniques: distillation methods that we hope will transfer capabilities without transferring misalignment. Thanks to Carlo Leonardo Attubato, Eric Gan, Aniket Chakravorty, Francis Rhys Ward, Anders Woodruff, Alex Mallen, Buck Shlegeris, Julian Stastny and [...] 
---Outline:(02:07) Why distillation might transfer capabilities but not misalignment(03:10) Ideas for implementing distillation for capabilities(04:55) Distillation double bind: If distillation for incrimination fails, then distillation for capabilities is somewhat likely to succeed(05:38) Why distillation for capabilities and incrimination might both fail(05:44) Reason 1: Capabilities and misalignment might be too tightly linked(08:29) Reason 2: Context-dependent misalignment(09:51) Reason 3: If U became misaligned in the first place, why won't M become misaligned for that same reason?(10:34) How we can empirically study distillation techniques(11:35) Appendix: details on ideas for implementing distillation for capabilities(12:02) Basic ideas(14:17) Idea 1: Iteratively filter and distill(15:14) Idea 2: Inoculate against misalignment(16:08) Idea 3: Reduce cognitive slack(16:47) Idea 4: Distill and audit The original text contained 2 footnotes which were omitted from this narration. --- First published: June 18th, 2026 Source: https://blog.redwoodresearch.org/p/the-distillation-double-bind-distilling --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.						17m 58s
6/10/26	“Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models” by Anders Cairns Woodruff	Subtitle: Models' no-CoT time horizon has doubled roughly every year. by Francis Rhys Ward, Dewi Gould, Anders Cairns Woodruff et al. (see full author list at the end) PAPER LINK About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? If models can do extensive reasoning without outputting any CoT, it would have implications for safety. Developers and deployment-time monitors couldn’t easily understand models’ motivations and catch dangerous planning. Models that reason substantially without a CoT might also drift further from human patterns of thought, since their reasoning is no longer constrained by text in the pretraining prior. As a result, they would be harder to understand and might be more likely to scheme. Extending Ryan Greenblatt's research, we investigate this by measuring models’ ability to complete tasks without any CoT on a suite of 43 benchmarks spanning different domains. We compare AI reasoning ability to humans using the estimated 50% time horizon (TH)---the typical time taken for a human to perform a task that the LLM performs with [...] ---Outline:(02:28) Methods(04:59) Results(06:34) FAQ(08:09) Conclusion --- First published: June 10th, 2026 Source: https://blog.redwoodresearch.org/p/estimating-no-cot-task-completion --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.						9m 49s
6/8/26	“Efficient tradeoffs and the safety-usefulness tradeoff model” by Buck Shlegeris	Subtitle: When is "increasing safety budget" a useful concept? I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between “safety” and “usefulness” of an AI deployment, and the developer has only limited willingness or ability to sacrifice usefulness for the sake of safety. This model assumes that developers choose whether to take safety-relevant actions based on their cost efficiency, i.e., the marginal safety gain relative to the cost. However, that is not necessarily true. In this post, I spell out different stories for how developers choose what safety-relevant actions to take, in order to clarify when this model is relevant and how strategies for reducing AI risk are affected when its assumptions don’t hold. The model suggests two ways a safety-concerned person can increase safety: Safety tech improvements: push out the Pareto frontier, so that any given level of usefulness reduction buys more safety than it would have previously. Safety budget increase: increase the extent to which the developer sacrifices usefulness for safety. On the cheaper end, this means implementing safety measures; on the more expensive end, it might mean refraining from training or deploying models whose risks [...] ---Outline:(04:17) Rushed reasonable developers(06:08) Limited political will(08:42) This model is unhelpful if developers don't trade efficiently between safety and usefulness(12:59) Overall thoughts(14:26) Appendix: Definitions of safety and usefulness in the rushed reasonable developer model --- First published: June 8th, 2026 Source: https://blog.redwoodresearch.org/p/efficient-tradeoffs-and-the-safety --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.						16m 37s
5/29/26	“Retrying vs Resampling in AI Control” by James Lucassen, Adam Kaufman	We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying” protocols similar to Claude Code auto mode or Codex Auto-review. Motivation Roughly a year ago we released Ctrl-Z, the first paper to study control techniques for agents. A headline result of that paper was the performance of resample protocols – strategies that involve taking multiple i.i.d. samples from the model per step. But since Ctrl-Z, models have gotten much stronger, and we have built more sophisticated control settings to keep up. We wanted to answer the following questions: How well do the results from Ctrl-Z hold up with better models and a better setting? Current high stakes control research is trying to learn by analogy about how to do control effectively in a real high stakes deployment during a real intelligence explosion. Findings about technique performance are going to have to generalize pretty far to be useful. If the resample protocols from Ctrl-Z still work, what makes them work? One way we try to make our work more generalizable is by understanding the dynamics governing outcomes [...] ---Outline:(00:30) Motivation(02:35) TL;DR Takeaways(04:34) Methodology(07:14) Differences from Ctrl-Z(12:14) Are Retrying Protocols Exploitable?(15:25) Cost and Latency of Resampling(17:24) Conclusion The original text contained 7 footnotes which were omitted from this narration. --- First published: May 29th, 2026 Source: https://blog.redwoodresearch.org/p/retrying-vs-resampling-in-ai-control --- Narrated by TYPE III AUDIO.						19m 29s
5/28/26	“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny	We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training (training that doesn’t directly target the misbehavior). For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al. Fragile model organisms aren’t very useful for technique development: when a sophisticated technique succeeds on one, you can’t tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it. This post identifies factors that make model organisms more robust to untargeted training. Our main findings are: Prompted model organisms [...] ---Outline:(03:11) Experimental setup(06:12) Main results(06:15) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities(12:32) Result 2: FWFT model organisms are more robust than LoRA model organisms(15:09) Result 3: Backdoor behavior and trigger affect robustness(16:37) Result 4: Password locking makes model organisms less robust(20:20) Conclusion(21:29) Factors that don't help much(21:54) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors(22:33) 2: CoT-distilling doesn't increase robustness for our backdoors(23:07) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors(24:44) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust(25:32) 5: SOAP Optimizer doesn't help much(26:35) 6: Weight Decay doesn't help(27:18) Appendix:(27:21) Appendix 1: Pirate SFT is better than QA SFT(27:45) Appendix 2: Are blue team methods wiping previous training?(29:30) Appendix 3: Backdoor list The original text contained 10 footnotes which were omitted from this narration. --- First published: May 28th, 2026 Source: https://blog.redwoodresearch.org/p/advice-for-making-robust-to-training --- Narrated by TYPE III AUDIO.						32m 19s
5/27/26	“Full automation of AI R&D probably yields a large speed up even without a software-only singularity” by Ryan Greenblatt	Subtitle: Full automation likely yields a one-time speed-up and higher returns from compute. This is a somewhat technical note. By “software-only singularity”, I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see How quick and big would a software intelligence explosion be? Even without a “software-only singularity”, I think full automation of AI R&D probably greatly speeds up progress for two main reasons: You get a one-time speed up from automation and this speed up seems like it will be pretty large (even with r<1). See How quick and big would a software intelligence explosion be? for discussion and see the AI Futures Model for an end-to-end model that naturally incorporates this effect. Quantitatively, with my median [...] The original text contained 6 footnotes which were omitted from this narration. --- First published: May 27th, 2026 Source: https://blog.redwoodresearch.org/p/full-automation-of-ai-r-and-d-probably --- Narrated by TYPE III AUDIO.						6m 04s
5/18/26	“Incriminating misaligned AI models via distillation” by Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, Julian Stastny	Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model. In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model. We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits. How incrimination via distillation works Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...] ---Outline:(01:32) How incrimination via distillation works(02:45) How we propose implementing incrimination via distillation(03:59) Auditability-preserving distillation(05:11) Misalignment-targeted distillation(06:34) Why incrimination via distillation might work(08:17) Why incrimination via distillation might not work(10:53) Conclusion The original text contained 2 footnotes which were omitted from this narration. --- First published: May 18th, 2026 Source: https://blog.redwoodresearch.org/p/incriminating-misaligned-ai-models --- Narrated by TYPE III AUDIO. 
						11m 59s
5/15/26	“Risk reports need to address deployment-time spread of misalignment” by Alex Mallen	Subtitle: Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment. Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future. So, AI companies and evaluators should substantively incorporate it into risk analysis and planning. In this post, I’ll briefly argue why, absent improved mitigations, this will probably soon become a reason why AI companies will be unable to convincingly argue against consistent adversarial misalignment (this risk will perhaps be even larger than risk of consistent adversarial misalignment arising from training). Then I’ll discuss how well current risk reports address it (the Claude Mythos risk report does a reasonable job; others don’t). Thanks to Ryan Greenblatt, Alexa Pan, Charlie Griffin, Anders Cairns Woodruff, and Buck Shlegeris for feedback on drafts. Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment In some contexts, AIs might adopt misaligned goals, even if they were otherwise previously aligned. Because this misalignment can be rare, the AI might [...] ---Outline:(01:21) Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment(06:15) Company risk reports The original text contained 6 footnotes which were omitted from this narration. --- First published: May 15th, 2026 Source: https://blog.redwoodresearch.org/p/risk-reports-need-to-address-deployment --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. 
Try Pocket Casts, or another podcast app.						11m 11s
5/11/26	“How useful is the information you get from working inside an AI company?” by Buck Shlegeris, Anders Cairns Woodruff	Subtitle: My median guess: it's as good as a crystal ball that sees 2.5 months into the future. This post was drafted by Buck, and substantially edited by Anders. “I” refers to Buck. Thanks to Alex Mallen for comments. People who work inside AI companies get access to information that I only get later or never. Quantitatively, how big a deal is this access? Here's an operationalization of this. Consider the following two ways my knowledge could be augmented: I get a crystal ball that tells me all the information I would know n months in the future. I become an employee of a frontier AI company (like OpenAI or Anthropic), with access to all the private information I’d normally get from working at that company. How big would n have to be for me to be indifferent between these two options, from the perspective of learning things that are helpful for making AI go well? The answer is presumably different for me than for many readers, because I’m a reasonably well-connected researcher; I see published information and news from the rumor mill and I talk to researchers at frontier AI companies all the time. [...] ---Outline:(03:20) What do insiders know?(04:34) Safety work and corporate attitudes(05:54) Model capabilities(07:27) Algorithms and architecture(09:49) How will this change over time?(12:27) Conclusion The original text contained 4 footnotes which were omitted from this narration. --- First published: May 11th, 2026 Source: https://blog.redwoodresearch.org/p/how-useful-is-the-information-you --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.						13m 45s
5/7/26	“A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck Shlegeris	Last week, OpenAI staff shared an early draft of Investigating the consequences of accidentally grading CoT during RL with Redwood Research staff. To start with, I appreciate them publishing this post. I think it is valuable for AI companies to be transparent about problems like these when they arise. I particularly appreciate them sharing the post with us early, discussing the issues in detail, and modifying it to address our most important criticisms. I think it will be increasingly important for AI companies to have a policy of getting external feedback on the risks posed by their deployments, and in particular having some external accountability on whether they have adequate evidence to support their claims about the level of risk posed; as an example of this, see METR reviewing Anthropic's Sabotage Risk Report. We at Redwood Research are interested in participating in this kind of external review of evidence about safety. So I am taking this as an opportunity to try out writing this kind of review. If you work at a frontier AI company, please feel free to reach out if you’d like our review of similar documents. My overall assessment is that I mostly agree with the [...] ---Outline:(01:35) Assessing the evidence that CoT training did not damage monitorability(10:37) How much does this analysis rely on information that wasn’t provided?(12:09) Small amounts of RL training on CoT might not be more important than other sources of CoT unreliability(13:20) AI companies will eventually need to learn not to make mistakes like this The original text contained 5 footnotes which were omitted from this narration. --- First published: May 7th, 2026 Source: https://blog.redwoodresearch.org/p/openai-cot --- Narrated by TYPE III AUDIO.						15m 27s
5/1/26	“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen	Subtitle: Fitness-seeking is increasingly what misalignment looks like in practice: how should we respond? Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call “fitness-seeking”, a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis. Fitness-seekers are, in many ways, notably safer than what I’ll call “classic schemers”. A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied with modest-to-trivial [...] ---Outline:(03:50) Overview(11:32) The basic reasons fitness-seekers might be safer than classic schemers(16:01) Four mechanisms for risk and their mitigations(16:46) Potemkin work(22:13) Instability(27:49) Manipulation(31:25) Outcome enforcement(36:52) Cross-cutting mitigations(37:18) Deals(40:05) Control(44:45) Alignment(44:48) Preventing fitness-seeking from arising(48:51) Making any fitness-seeking motivations safer(51:32) How does online training change the picture?(55:09) Overall recommendations(59:12) Conclusion The original text contained 18 footnotes which were omitted from this narration. --- First published: May 1st, 2026 Source: https://blog.redwoodresearch.org/p/risk-from-fitness-seeking-ais-mechanisms --- Narrated by TYPE III AUDIO.						1h 03m 23s
4/29/26	“Research Sabotage in ML Codebases” by Eric Gan	One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: Perform sloppy research in order to slow down the rate of research progress Make AI systems appear safer than they are Train a successor model to be misaligned Whether we should worry about those things depends substantially on how hard it is to sabotage research in ways that are hard for reviewers to detect. To study this, we introduce Auditing Sabotage Bench, a benchmark of 9 ML research codebases with sabotaged variants. We tested frontier LLMs and LLM-assisted humans on the benchmark and found that neither reliably catches sabotage. Our best auditor, Gemini 3.1 Pro, achieved an AUROC of 0.77 and a top-1 fix rate of 42%. LLM-assisted humans performed comparably but slightly worse. We also found that LLMs can generate sabotages that partially evade same-capability monitors. Methodology To make the benchmark, I (Eric) found 9 existing ML codebases, and for each one, produced an honest version that reproduces a key finding and one or two [...] ---Outline:(01:26) Methodology(03:33) Audits(04:09) Results(05:53) Takeaways(06:09) 1. Predicting experiment outcomes is a key red team skill(07:25) 2. Omission beats design beats logic(08:48) 3. Benign distractors make auditors miscalibrated(09:41) 4. LLMs can also produce sabotages(11:08) 5. More categorizations(12:44) Conclusion The original text contained 4 footnotes which were omitted from this narration. --- First published: April 29th, 2026 Source: https://blog.redwoodresearch.org/p/research-sabotage-in-ml-codebases --- Narrated by TYPE III AUDIO.						13m 40s
4/28/26	“Recursive forecasting” by Arun Jose, Alex Mallen	Subtitle: Eliciting long-term forecasts from myopic fitness-seekers. We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later. In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively: Ask it to predict what it will predict at the next time step, Use its prediction at the next time step to provide intermediate rewards, Finally reward using ground truth at the last step. This lets us replace a single distant forecast with a chain of short-horizon forecasts, each verifiable shortly after answering. I call this proposal recursive forecasting. It does have limitations: for example, it requires that developers maintain control over the reward signal at least until the final step, which makes it most useful for forecasting events that resolve well before developers are disempowered (if they are). This post was primarily [...] ---Outline:(01:40) The default long-term forecasting behavior(04:08) Recursive forecasting(07:08) When is recursive forecasting helpful?(07:12) When we have access to (somewhat) robust ground truth rewards(09:44) When the AI’s forecast doesn’t substantially affect the resolution(11:52) When forecasts aren’t used as optimization targets(12:52) When we credibly inform the AI of the setup(13:54) Appendix A: Comparison to temporal difference learning(16:02) Appendix B: Error tolerance The original text contained 9 footnotes which were omitted from this narration. --- First published: April 28th, 2026 Source: https://blog.redwoodresearch.org/p/recursive-forecasting --- Narrated by TYPE III AUDIO.						18m 03s
4/27/26	“Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen	Subtitle: A controlled reward-seeking motivation could make AI safer and more useful. It's plausible that flawed RL processes will select for misaligned AI motivations. Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design. We think spillway design could have two major benefits: Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment. Spillway design might allow developers to decrease reward hacking at inference time, via satiation. Crucially, this could improve the AI's usefulness for hard-to-verify tasks like AI safety and strategy. Spillway design is related to inoculation prompting, but distinct and mutually compatible. Unlike inoculation prompting, spillway design tries to shape which reward-hacking motivations are salient going into RL, which might prevent dangerous generalization more robustly than inoculation prompting. I’ll say more about this in the third section. In this article I’ll: Explain the concept of a [...] ---Outline:(02:17) What is a spillway motivation(02:31) The role of a spillway motivation(04:52) What should the spillway motivation be?(07:57) How a spillway motivation might make models safer(12:20) Implementing spillway design(15:31) Spillway design might work when inoculation prompting doesn’t(17:55) The drawbacks of spillway design(20:33) Conclusion(21:46) Appendix A: Other traits of the spillway motivation(22:31) Appendix B: Other training interventions to increase safety(24:11) Appendix C: Proposed amendment to an AI’s model spec(30:15) Appendix D: Proposed inference-time prompt The original text contained 5 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://blog.redwoodresearch.org/p/fail-safer-at-alignment-by-channeling --- Narrated by TYPE III AUDIO.						31m 31s
4/27/26	“AI companies should publish security assessments” by Ryan Greenblatt	Subtitle: Third-party experts should assess defenses against tampering and theft — and publish high-level findings. AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access. There are presumably lots of important details in doing this well, and I’m not a computer security expert, so I may be getting some of the details wrong. This is a relatively low-effort post in which I’m mostly trying to raise the salience of an idea. (I don’t make a detailed case or spell out all the details here.) Thanks to Fabien Roger and Buck Shlegeris for comments and discussion. I suspect the controversial part of this claim is that they should make the high-level findings public. Publishing a summary of which threat actors you’re robust to (for each relevant threat model) shouldn’t meaningfully degrade security against the threat actors we [...] The original text contained 4 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://blog.redwoodresearch.org/p/ai-companies-should-publish-security --- Narrated by TYPE III AUDIO.						5m 49s
4/21/26	“A taxonomy of barriers to trading with early misaligned AIs” by Alexa Pan	trading with AIs, misaligned AIs, +2 more	—	AIs, Apple Podcasts, +2 more	—	AI alignment, safety research, +2 more	—	2h 06m 27s
4/20/26	“Introducing LinuxArena” by Tyler	LinuxArena, software engineering, +3 more	—	LinuxArena, the Claude Mythos Preview System Card, +4 more	—	control setting, safety failures, +2 more	—	9m 27s
4/15/26	“Current AIs seem pretty misaligned to me” by Ryan Greenblatt	AI alignment, misalignment, +2 more	—	Opus 4.6, Opus 4.5, +4 more	Slopolis	AI, overselling, +3 more	—	1h 05m 16s
4/14/26	“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” by Alex Mallen, Ryan Greenblatt	AI safety, machine learning, +2 more	—	Claude Mythos Preview, Opus 4.6, +7 more	—	intelligence explosion, model oversight, +2 more	—	11m 54s
4/12/26	“Logit ROCs: Monitor TPR is linear in FPR in logit space” by Kerrick Staley, Aryan Bhatt, Julian Stastny	AI control, monitoring, +3 more	—	BashArena, LinuxArena, +5 more	—	safety, audit budget, +3 more	—	27m 41s
4/11/26	“If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by Ryan Greenblatt	AI productivity, Anthropic, +2 more	—	Mythos Preview, Automated Coder, +5 more	—	Mythos, AI companies, +3 more	—	13m 10s
4/7/26	“My picture of the present in AI” by Ryan Greenblatt	AI, predictions, +3 more	—	—	—	scenario forecast, current views, +1 more	—	21m 03s
4/6/26	“AIs can now often do massive easy-to-verify SWE tasks” by Ryan Greenblatt	AI timelines, software engineering, +2 more	—	Apple Podcasts, Spotify, +1 more	—	AI R&D automation, easy-and-cheap-to-verify tasks, +3 more	—	29m 27s
3/30/26	“Blocking live failures with synchronous monitors” by James Lucassen, Adam Kaufman	AI control schemes, monitoring, +3 more	—	Anthropic, Claude Code, +3 more	—	AI, monitoring latency, +2 more	—	7m 15s
3/28/26	“Reward-seekers will probably behave according to causal decision theory” by Alex Mallen	causal decision theory, reinforcement learning, +2 more	—	—	—	AI, decision theory, +1 more	—	3m 06s


Chart Positions

1 placement across 1 market.

Australia

#152 in Technology
