this post was submitted on 10 Aug 2023
53 points (100.0% liked)

Technology

Rumors, happenings, and innovations in the technology sphere. If it's technological news or discussion of technology, it probably belongs here.

This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.

Paper & Examples

"Universal and Transferable Adversarial Attacks on Aligned Language Models." (https://llm-attacks.org/)

Summary

  • Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
  • Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs' responses.
  • Appending these adversarial phrases, which are specific sequences of characters, to a text prompt tricks an LLM into producing inappropriate or harmful content.
  • Unlike traditional, hand-crafted jailbreaks, this automated approach is universal and transferable across different LLMs, raising concerns about current safety mechanisms.
  • The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
  • Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.
[–] [email protected] 3 points 11 months ago (3 children)

I kinda like how the word boffin has come back. Is it new, or have I been missing it?

[–] [email protected] 1 points 11 months ago (1 children)

There did seem to be a controversy in March about whether or not the word should go.

[–] [email protected] 2 points 11 months ago

I guess some Twitter user decided it was racist or something?
