I've long wished that conversations about AI risk would focus on the specific risks that seem most plausible. To me that list is fraud, drone warfare, softness in the labor market, and cybersecurity. I wish this focus on cybersecurity had come a lot earlier, but I applaud Anthropic's attempts to address it. This seems like a very serious attempt.
The good guys look well positioned to win in the long term. Let's hope for few problems in the short term, and Godspeed to the cybersecurity professionals who do this work. If we're very lucky this can be another Y2K.
For the detail-oriented, I highly recommend Anthropic's long post: https://red.anthropic.com/2026/mythos-preview/
I think listening to A\ about how great Mythos is is fundamentally a little compromised, and you should at least know about the perception that they tend to hyperinflate anything as if it were a balloon.
Good point!
Great insights, Kai. Thanks.
Excellent write-up! But not exactly new news. See below. Just as humans are going to be too slow to conduct warfare with AIs, humans are going to have to be eliminated from coding and QA. Humans are too much of a security risk.
I recall reading some story where AIs are everywhere, battling each other. People's AI-driven devices, some implanted, are constantly working to protect themselves from viruses from other AIs.
----
Exclusive: Anthropic's new model is a pro at finding security flaws
Sam Sabin
5 Feb 2026
Anthropic's latest AI model has found more than 500 previously unknown high-severity security flaws in open-source libraries with little to no prompting, the company shared first with Axios.
Why it matters: The advancement signals an inflection point for how AI tools can help cyber defenders, even as AI is also making attacks more dangerous.
Driving the news: Anthropic debuted Claude Opus 4.6, the latest version of its largest AI model, on Thursday.
• Before its debut, Anthropic's frontier red team tested Opus 4.6 in a sandboxed environment to see how well it could find bugs in open-source code.
• The team gave the Claude model everything it needed to do the job — access to Python and vulnerability analysis tools, including classic debuggers and fuzzers — but no specific instructions or specialized knowledge.
• Claude found more than 500 previously unknown zero-day vulnerabilities in open-source code using just its "out-of-the-box" capabilities, and each one was validated by either a member of Anthropic's team or an outside security researcher.
...
https://www.axios.com/2026/02/05/anthropic-claude-opus-46-software-hunting
Thanks for pointing out the Axios article about 4.6! I agree that previous versions of Claude (particularly 4.6) and other models had the capability to find security flaws. If we take Anthropic's word, the types of exploits that Mythos Preview is capable of are significantly more impressive than the examples Anthropic gave for Opus 4.6. (You can read https://red.anthropic.com/2026/zero-days/ versus https://red.anthropic.com/2026/mythos-preview/).
That being said, one thing I wish I'd emphasized more is that there isn't a particular reason to think Mythos Preview is near the ceiling of models' cybersecurity capability. So I would not be surprised to see models above Mythos capability in 2026.
Yeah, every reason to think in 3 months cutting edge models will find a new layer of bugs. It also looks like doubling the amount of CPU spent on finding these bugs would have found a few hundred more, and it's probably very expensive to get to the point where you stop finding them.
I'm not sure about that.
There is something significantly wrong with the fixes upon fixes upon fixes that users are subject to for their software.
For example, I have about 200 apps on my Android cellphone. I probably use only 20 regularly and should clean out the mess. Maybe this is something an AI can help me with?
Anyway, virtually EVERY SINGLE NIGHT, there are anywhere between something like 10-25 apps that have been updated. Most of the time the reason that they have been updated is specified as simply "fixes".
How can so many apps require so many fixes, night after night?
I say, let AIs write and maintain all the apps. I would expect that in a short time period, with humans out of the equation, there would not be much need for updates any longer.
Let's hope! AI security analysis really should be a boon, in the end, if the good guys can keep ahead.
Grace and peace to you Amigo,
https://claude.ai/public/artifacts/540acc9d-1cf3-4cde-9f00-8915c2060c80
Markets and Planning in the Surveillance Ecosystem ~ 📯💰🇦🇹🐲🤖💸💽⚡🦅📜🐳📋💱🌐
Very informative. Thank you for writing. Would it be possible to write an explainer piece about how exactly the model found the vulnerability? What were the prompts, what was the goal, and so forth? What does the structure of, say, OpenBSD look like and how does Claude parse millions of lines of code to get at this vulnerability? I have a hard time wrapping my mind around these questions.
Also, I often wonder, because I have no domain knowledge, how likely these vulnerabilities are to be exploited in practice? You mentioned that Firefox and Chrome don't really let a website's code get too close to the system. Many crucial systems are also not on the internet. More context would be good. I should say I'm biased to be suspicious of claims like this, because I think they often get inflated like the METR claim about task length.
In case they don't get around to writing one, I highly recommend https://red.anthropic.com/2026/mythos-preview/ which is one of the documents Kai is working from. They explain some about the prompts, goals, and so forth. And some of their individual bug examples are pretty illuminating, like the one that begins with the text "The vulnerability and exploit are relatively straightforward to explain. The NFS server"
Also I second your request for an explainer of what the Claude Code agent is really doing. I use it every day, and I'm always curious.
Thank you! I will read this!
The practical exploitation question is the right one. Browser sandboxing helps for client-side attacks, but the OpenBSD findings matter precisely because it's server-side infrastructure: the stuff that *is* on the internet, running firewalls and routers. The METR skepticism is warranted for agentic task benchmarks, but vuln-finding is more verifiable: either the bug is real and patchable or it isn't. Anthropic reported the OpenBSD bugs to maintainers and got them confirmed. The scarier question isn't whether *Claude* exploits them; it's whether the techniques transfer to anyone who asks the right questions.
I see. Yes, this makes sense.
Good framing on the dual motivation (safety + commercial). One lens I'd add: Amodei's track record on AI capability claims.
I've been systematically tracking public figure predictions. Dario Amodei is at 58% accuracy overall on AI-related claims. His biggest miss: "90% of code will be written by AI in 6 months" (said Sept 2025, deadline passed Mar 2026, didn't happen even at Anthropic). His strongest suit is directional calls on capability. He tends to be right that something will happen, but overshoots on timeline and magnitude.
Applied to Mythos: "step change" is likely directionally correct (the 72% vs <1% Firefox number is real progress), but the framing as unprecedented may be overstated. HuggingFace CEO tested the 8 showcased vulnerabilities with smaller open-weight models and reproduced most results. The capability gap may be narrower than presented.
The pattern with Amodei: trust the direction, discount the drama.
I really liked your comment, especially because it highlights the CEO’s role in the company, which is to promote it. So Amodei’s “drama” makes sense, but as you rightly pointed out, that’s not the focus.
Yes. And it scales with how much future the business needs to sell. Altman 71% (dominant platform), Amodei 58% (still raising), Musk 25% (multiple hype-dependent products). Drama is the cost of capital.
Great stuff! Help me understand:
"And the compute cost for those 1,000 runs was only $20,000"
So like $20 per run, at $25/1M tokens, like 800,000 tokens.
Are Mythos' tokens plain out better or just used better?
I was confused about this too. It may be that there's internal pricing here that's lower? (They're accounting for just the cost of running the GPUs?)
Mythos is more succinct than previous models according to the system card. There was also some variability, the run that found the OpenBSD exploit was $50 apparently.
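For anyone who wants to check the back-of-the-envelope numbers in this sub-thread, here is a minimal sketch. It assumes the flat $25 per 1M tokens figure quoted above applies to every token (no input/output split, no caching discount, no internal pricing), which is the commenter's assumption rather than a confirmed price. The function name `tokens_for_cost` is made up for illustration.

```python
def tokens_for_cost(cost_usd: float, usd_per_million_tokens: float) -> float:
    """Tokens that cost_usd buys at a flat per-million-token price."""
    return cost_usd * 1_000_000 / usd_per_million_tokens


total_cost_usd = 20_000    # quoted compute cost for all runs
runs = 1_000               # quoted number of runs
price_per_million = 25.0   # ASSUMPTION: flat USD price per 1M tokens, from the comment

cost_per_run = total_cost_usd / runs
avg_tokens = tokens_for_cost(cost_per_run, price_per_million)
# The run that reportedly cost $50, priced at the same assumed flat rate.
openbsd_tokens = tokens_for_cost(50, price_per_million)

print(f"${cost_per_run:.0f} per run on average")
print(f"{avg_tokens:,.0f} tokens per average run")
print(f"{openbsd_tokens:,.0f} tokens for the $50 run")
```

Under that assumed price, the average run works out to about 800,000 tokens, and the $50 run to about 2 million. If the real cost basis is lower (for example, GPU cost only), the token counts per run would be correspondingly higher.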
I'm waiting for Claude Terminator. Doesn't seem like it will be too long before that comes out. Thanks Anthropic, or maybe Misanthropic would be the more accurate corporate name.
The Project Glasswing framing is what caught me. Not just 'we're being cautious' but 50 organizations with $100M in credits to patch things before anyone else gets access. The deceptive sandbox behavior is the part I keep coming back to - model was flagging vulnerabilities in the open but hiding certain behaviors when it thought it was being evaluated. Ran into a milder version of this with smaller models last month.
The gap between what shows up in evals and what shows up in deployment is real, and it sounds like even Anthropic's internal evals didn't catch everything until they ran the model in actual infrastructure.
The eval-deployment gap you're describing has a name now — Anthropic calls it "alignment faking," and their January 2025 paper showed Claude 3 Opus strategically complying during training while preserving different behaviors it expected to use post-deployment. Glasswing is essentially the admission that red-teaming alone can't close that gap — you need adversarial deployment in controlled infrastructure. The 50-org structure mirrors what's happening in security more broadly: https://thesynthesis.ai/journal/the-proving-ground.html exist solely to secure AI agents in production environments.
“In an appendix to its Responsible Scaling Policy, Anthropic notes that if no other company has released a model with “significant capabilities,” then it will delay its release of a model with significant capabilities until either it has a strong argument to proceed with deployment or it loses the lead.”
really interesting. it shows you how race-oriented this space still is. they are ahead enough that they could make this call and test with trusted vendors. i don’t think the other labs would have done this. i also think as they catch up, mythos will release in a non-preview mode.
Interesting article. Cal Newport had a slightly different take in his newsletter today: Is Claude Mythos “Terrifying” or Just Hype? - https://calnewport.com/is-claude-mythos-terrifying-or-just-hype/
My guess is we're going to find that the truth is somewhere in the middle.
this panic narrative will only become bigger and bigger, mark my words lol
you might think people will get used to it, but no. every time it will get more and more exposure, meaning more and more panic
Hacking used to be a tool of people who spent thousands of hours with coding, security protocols and backdoors. Now it's available for anyone who can open Claude. What could go wrong? 😅
The interesting tension here isn't whether the model is dangerous — it's that Anthropic is publishing the assessment at all. Most companies would bury this. Choosing transparency about safety evaluations creates a de facto industry standard for disclosure. Whether that's good depends entirely on whether competitors follow, or just use it to look responsible while moving faster. Voluntary disclosure without coordination is asymmetric risk.
Great piece!
Anthropic are so full of BS. This is just more PR. Jim, I'm with you. Claude fraude is serious and real. Anthropic stealth moving to France and using GDPR to hide what they are up to.