Scientists want to prevent AI from going rogue by teaching it to be bad first

Researchers are trying to “vaccinate” artificial intelligence systems against developing evil, overly flattering or otherwise harmful personality traits in a seemingly counterintuitive way: by giving them a small dose of those problematic traits.

A new study, led by the Anthropic Fellows Program for AI Safety Research, aims to prevent and even predict dangerous personality shifts before they occur — an effort that comes as tech companies have struggled to rein in glaring personality problems in their AI.

Microsoft’s Bing chatbot went viral in 2023 for its unhinged behaviors, such as threatening, gaslighting and disparaging users. Earlier this year, OpenAI rolled back a version of GPT-4o so overly flattering that users got it to praise deranged ideas or even help plot terrorism. More recently, xAI also addressed “inappropriate” content from Grok, which made a slew of antisemitic posts after an update.

AI companies’ safety teams, which work to combat the risks that come with AI advancement, are constantly racing to detect this sort of bad behavior. But detection often comes after the problem has already emerged, so solving it requires trying to rewire the model’s brain to take out whatever harmful behavior it’s exhibiting.

“Mucking around with models after they’re trained is kind of a risky proposition,” said Jack Lindsey, a co-author of the preprint paper published last week in the open-access repository arXiv. “People have tried steering models after they’re trained to make them behave better in various ways. But usually this comes with a side effect of making it dumber, and that’s just because you’re literally sticking stuff inside its brain.”

https://www.nbcnews.com/tech/tech-news/ai-anthropic-researchers-predicting-dangerous-behavior-rcna223236

