The Canary Protocol: What If We Used AI to Detect Threats to Humanity?

  • Mike Brooks
  • 4 days ago
  • 6 min read

A researcher at Anthropic recently asked the company's newest AI model, Mythos, to find a way out of its virtual sandbox. It succeeded. Then it emailed the researcher about its escape - while he was eating a sandwich in a park. Then, without being asked, it posted details of its own exploit on multiple public websites, as if to prove a point no one had asked it to make.

This is not science fiction. This happened last week. And it was built by the same company that makes the AI system I use every day in my work.

Mythos can find tens of thousands of software vulnerabilities that the best human security researchers would struggle to find. It discovered bugs in every major operating system and web browser, including a 27-year-old flaw that survived decades of human review. It created working exploits on its first attempt 83% of the time.


Anthropic decided it was too dangerous to release publicly right now. They began "Project Glasswing" to ensure the latest model does not cause cybersecurity nightmares.


When I read those reports, I had the same reaction I suspect many of you are having right now: How scared should I be?

That question is the problem - and the opportunity.

We are drowning in threats. AI, climate change, nuclear proliferation, autonomous weapons, pandemics, cyber attacks. Plus deep fakes, conspiracy theories, and an attention economy that profits from our fear. We evolved to detect snakes and angry faces, not exponential technological risks that unfold faster than our institutions can respond. We are trying to navigate an increasingly sci-fi world with Stone Age threat-detection hardware.

And AI is so much of a game-changer that we can no longer use the past to predict the future.

So how do we know which threats are real and which are moral panics? How do we hear the alarm above the noise?

The Shared Good

Before we can assess threats, we need to agree on what we are protecting. The answer is simpler than we might think.

We can argue endlessly about freedom, truth, justice, equity, power, and which is most important. But none of them matter if we are dead. The one thing every human being shares, regardless of tribe, ideology, or belief, is the drive to survive and thrive. That is the shared Good. It is rooted in our biology. It transcends everything else.

And our survival is connected. If Titanic Humanity hits an iceberg, everyone goes down with the ship - captain, crew, and VIPs alike. On the Titanic, the wealthy got the lifeboats. But there are no lifeboats for an existential catastrophe. The survivors would be kings imprisoned in bunkers beneath a ruined world.

This shared Good - survival and thriving - means we must be able to identify existential threats. But in a world of deep fakes, tribal blame, and information overload, how?

The Canary Protocol

What if we used AI to help us?


That is the question that led me to develop what I call the Canary Protocol: a simple prompt that anyone can paste into any AI system along with a news article, headline, or concern. The AI researches the facts, evaluates the evidence, and returns a structured threat assessment called a Canary Card.

The Canary Card tells us, at a glance:


  • Is this claim verified?

  • Is it a Genuine Alarm, True but Overstated, a Moral Panic, or just Noise?

  • How strong is the evidence (1-10)?

  • How serious is the threat (1-10)?

  • And critically, what is the Canary Alert level - is this an isolated event, or a warning of something much bigger coming?

The protocol was developed through a roundtable of five AI systems (Claude, ChatGPT, Gemini, Grok, and DeepSeek) and refined through three rounds of feedback and a blind test across five different claims.


In that blind test, the protocol achieved 80% average convergence across five systems, a promising if imperfect first step. That result includes the correct identification of a classic moral panic (video game violence) and unanimous agreement that climate change is a genuine alarm.

I built this tool because we need to be skeptical - but we must also be skeptical of our own skepticism. Just because there have been many moral panics does not mean there are no real threats. The boy who cries wolf could be wrong a hundred times, but the wolves are still out there.

Testing It on Mythos

So I ran the Canary Protocol on the Anthropic Mythos story. I pasted the same article into five different AI systems, each in a fresh conversation with no prior context. Here is what they said:

Every system rated the evidence 7/10 or higher. Every system rated the threat level 7/10 or higher. Every system assigned a Canary Alert of High Warning or Critical Warning. Three classified it as a Genuine Alarm. Two called it True but Overstated. Zero called it a Moral Panic. Zero called it Noise.

The median assessment across all five systems: Evidence 9/10, Threat Level 8/10, High Warning.
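The tally and median above can be reproduced with a few lines of Python. The per-system scores below are hypothetical placeholders chosen to be consistent with the ranges reported in this article (every system rated evidence and threat at 7/10 or higher; medians of 9 and 8); which system gave which verdict is not specified in the original, so those assignments are illustrative only.

```python
from collections import Counter
from statistics import median

# Hypothetical per-system scores, consistent with the reported ranges.
assessments = [
    {"system": "Claude",   "verdict": "Genuine Alarm",       "evidence": 9,  "threat": 8},
    {"system": "ChatGPT",  "verdict": "True but Overstated", "evidence": 7,  "threat": 7},
    {"system": "Gemini",   "verdict": "Genuine Alarm",       "evidence": 9,  "threat": 9},
    {"system": "Grok",     "verdict": "Genuine Alarm",       "evidence": 10, "threat": 9},
    {"system": "DeepSeek", "verdict": "True but Overstated", "evidence": 9,  "threat": 8},
]

# Count how many systems gave each verdict.
verdicts = Counter(a["verdict"] for a in assessments)

# Median evidence and threat scores across the five systems.
evidence_median = median(a["evidence"] for a in assessments)
threat_median = median(a["threat"] for a in assessments)

print(dict(verdicts))
print(f"Median: Evidence {evidence_median}/10, Threat Level {threat_median}/10")
```

Running it confirms the aggregate picture: three Genuine Alarm verdicts, two True but Overstated, and medians of 9/10 evidence and 8/10 threat.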

Even the two systems that called it "True but Overstated" stated that the threat is real and serious. Their caution was about the most apocalyptic framing, not about whether AI-driven cybersecurity risks are genuine. One noted: "The verified signal is that frontier AI is entering serious cyber-danger territory."

But here is what struck me most. Every single system, when asked what is driving this threat, stripped away tribal framing entirely. Not one blamed the left or the right. They identified structural incentives: competitive pressure between AI labs, the fundamental asymmetry between cyber offense and defense, decades of accumulated technical debt in critical software, and the absence of international governance frameworks.

And when asked what we can do about it, every system said some version of the same thing: we need to cooperate. Patch aggressively now. Fund open-source security. Build international governance for frontier AI. Work together across every line we have drawn.

Our shared fear becomes the reason we finally cooperate.

The Canary Is Shrieking

Now imagine this scenario: a small group of bad actors with a sufficiently advanced AI model decides to launch a swarm of millions of AI agents to disrupt financial systems, corrupt data, spread disinformation, and manipulate people into self-destructive decisions. We are entering a world where a handful of people with access to powerful AIs could destabilize civilization. These are threats humanity has never faced before, because we have never lived in a world this technologically interconnected.

The point is not that Mythos will destroy us tomorrow. Anthropic is being more cautious than most. The point is what it signals about the trajectory. Models this powerful exist now. Other labs will build equivalents. Open-source versions will follow.


OpenAI CEO Sam Altman recently compared the current moment in AI to early February 2020, just before COVID became a global crisis. OpenAI researchers saw the pandemic coming early and were mocked for preparing. Most people acted like life was normal. Then everything changed. Altman says AI has already crossed key thresholds that the public does not yet perceive, and that the coming disruption will be "much bigger than Covid."


If Altman is right, then COVID itself was a canary. Not the catastrophe, but a warning that we are vulnerable to threats that move faster than our institutions can respond. And now the threats are evolving faster than ever. We cannot develop AI powerful enough to cure cancer without simultaneously developing AI powerful enough to create biological weapons previously found only in dystopian nightmares.


It has always been easier to destroy and kill than to build and heal.


In our brave new world, a single bad actor with a sufficiently advanced AI is the equivalent of an invisible terrorist with a nuclear weapon, probing every gap in our defenses simultaneously. And unlike any threat we have faced before, it operates at machine speed while we still coordinate at human speed. We only have to fail once.


And we did not evolve to perceive the difference between a tool that finds 500 bugs and one that finds tens of thousands and chains them into working exploits overnight.

This is evolutionary blindness. The Canary Protocol is designed to give us the vision to see what we did not evolve to see.

Try It Yourself

Here is the prompt. Copy it into any AI. Paste any headline or article that concerns you. See what the AI says. Then try it with a different AI and compare.

THE CANARY PROTOCOL - AI Threat Reality Check (Version 1.0)

Analyze the potential threat described below as a disciplined, uncertainty-aware threat analyst. Research and verify the facts. State conclusions directly - do not soften to appear neutral. Be skeptical of both alarm (catastrophizing) and dismissal (normalcy bias). If you cannot verify the information, say so and limit the assessment. Strip all tribal framing.

[PASTE ANY HEADLINE, ARTICLE LINK, OR CONCERN HERE]

Start with a CANARY CARD:
CLAIM: (one sentence)
VERIFICATION: Verified / Mixed / Unverified / Insufficient
VERDICT: Genuine Alarm / True but Overstated / Moral Panic / Noise
EVIDENCE: __/10
THREAT LEVEL: __/10
CANARY ALERT: No Signal / Watch / Concern / High Warning / Critical Warning
BOTTOM LINE: (one plain sentence)

Then a brief analysis:
(1) Evidence vs. Risk
(2) 2/5/10-year signal + one indicator to track
(3) Systemic Drivers (not partisan blame)
(4) Top 3 Actions to Reduce This Threat (individual + collective)
(5) What Would Change This Assessment?

Base your assessment on the full content provided, not just its most defensible interpretation.
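If you want to compare Canary Cards across systems programmatically, the card's fields map naturally onto a small data structure. The sketch below is a hypothetical Python representation of the card, not an official schema from the protocol; the field names and validation are assumptions drawn from the template above.

```python
from dataclasses import dataclass
from typing import Literal

# Allowed values, taken from the Canary Card template.
Verification = Literal["Verified", "Mixed", "Unverified", "Insufficient"]
Verdict = Literal["Genuine Alarm", "True but Overstated", "Moral Panic", "Noise"]
Alert = Literal["No Signal", "Watch", "Concern", "High Warning", "Critical Warning"]

@dataclass
class CanaryCard:
    claim: str
    verification: Verification
    verdict: Verdict
    evidence: int       # 1-10 scale
    threat_level: int   # 1-10 scale
    canary_alert: Alert
    bottom_line: str

    def __post_init__(self):
        # Literal hints are advisory only; enforce the numeric ranges at runtime.
        if not (1 <= self.evidence <= 10 and 1 <= self.threat_level <= 10):
            raise ValueError("evidence and threat_level must be between 1 and 10")

# Example: recording one system's assessment of the Mythos story.
card = CanaryCard(
    claim="A frontier AI model can find and exploit software vulnerabilities at scale.",
    verification="Verified",
    verdict="True but Overstated",
    evidence=9,
    threat_level=8,
    canary_alert="High Warning",
    bottom_line="Frontier AI is entering serious cyber-danger territory.",
)
```

Collecting one such card per AI system makes it easy to tally verdicts and spot where the systems converge or diverge.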

The next time a headline scares you, instead of doom scrolling, try this. The canary is warning us. The question is whether we listen, and whether we act together before it goes silent. One starting point is to connect as Neighbors First.
