Tag: AI Safety

How Claude Learns to Refuse Without Being Told

May 10, 2026

in AI Coding Tools

Claude refuses requests it was never told to refuse. There is no hardcoded list of banned words, no simple keyword filter running behind the scenes. Instead, Anthropic built a system where the model teaches itself which outputs are harmful, and the process is far stranger than most people assume.

Key Facts Most People Don’t…