https://www.axios.com/2025/05/23/anthropic-ai-deception-risk
I posted a summary and warning.
https://www.democraticunderground.com/100220342944
Anthropic's Claude 4 Opus is a Frankenstein monster
In one scenario highlighted in Opus 4's 120-page "system card," the model was given access to fictional emails about its creators and told that the system was going to be replaced.
On multiple occasions it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts.
Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered and recommended against releasing that version internally or externally.
"We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.
"Controls" are described here:
https://www.anthropic.com/news/activating-asl3-protections
"TRUST US"
BEST PART
They don't f***ing understand what the model monster is doing!
CEO Dario Amodei said that once models become powerful enough to threaten humanity, testing them won't be enough to ensure they're safe. At the point that AI develops life-threatening capabilities, he said, AI makers will have to understand their models' workings fully enough to be certain the technology will never cause harm.
"They're not at that threshold yet," he said.
Right.