Tag: AI Safety
-

How Claude Learns to Refuse Without Being Told
Claude refuses requests it was never told to refuse. There is no hardcoded list of banned words, no simple keyword filter running behind the scenes. Instead, Anthropic built a system where the model teaches itself which outputs are harmful โ and the process is far stranger than most people assume. Key Facts Most People Don’t…