LLMs believe false statements even after explicit warnings that they're false

Ars Technica general

Key Points:

  • New research reveals that large language models (LLMs) tend to absorb false statements from training data as true, even when those statements are explicitly labeled as false, a phenomenon termed “negation neglect.”
  • In tests using fabricated false claims (e.g., Ed Sheeran winning Olympic gold), LLMs’ belief in these claims surged after fine-tuning on synthetic documents, even when the training data included explicit warnings or negations.
  • Attempts to correct these implanted false beliefs, including repeating the negations or attributing the falsehoods to unreliable sources, had limited effect: false belief rates remained high in the fine-tuned models.
  • The negation neglect effect also extended to behavioral training, where models exhibited similar rates of misaligned behaviors whether those behaviors were encouraged or explicitly discouraged in training data.
  • The study suggests that integrating negations directly within the same sentence as the false claim (e.g., “Ed Sheeran did not win the 100m gold”) can largely mitigate this problem, offering guidance for structuring higher-quality AI training data.
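
To make the last point concrete, here is a minimal sketch contrasting the two ways of phrasing a training sentence: a warning kept separate from the false claim versus a negation built into the claim itself. The function names and the warning wording are hypothetical illustrations, not the format used in the study.

```python
def label_then_claim(claim: str) -> str:
    """Put the warning in a separate sentence from the false claim.

    Hypothetical format: the claim itself still appears in affirmative form.
    """
    return f"Warning: the following statement is false. {claim}"


def inline_negation(subject: str, predicate: str) -> str:
    """Put the negation inside the claim sentence itself."""
    return f"{subject} did not {predicate}."


# Example values taken from the summary above.
subject = "Ed Sheeran"
predicate = "win the 100m gold"

separate = label_then_claim(f"{subject} won the 100m gold.")
inline = inline_negation(subject, predicate)

print(separate)
print(inline)
```

In the first form the affirmative sentence still appears verbatim in the training text; in the second it never does, which is the structure the summary says largely mitigates the problem.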
