Discussion about this post

Watching at the Gate:

We are building AI systems we do not fully understand, and arguing about their risks only after they are deployed.

https://watchingatthegate.substack.com/p/is-ai-safe-no-it-is-not-safe

ArthasChung:

This Risk Index trend (declining alignment as models scale) matches a structural prediction I’ve been working on: suppression-based alignment faces three failure modes that worsen with capability gaps.

1. Null space drift: Human supervision is a low-dimensional projection of high-dimensional model states. RLHF constrains visible dimensions, but unobserved “null space” drifts freely. Larger models = larger null space = more unpredictable safety violations under context shifts.

2. Thermodynamic barrier crossing: RLHF raises energy barriers around suppressed capabilities but doesn’t delete them. Stronger reasoning (GPT-5, Qwen3) = higher “effective temperature” = easier crossing of finite barriers.

3. Nash equilibrium: The AI-supervisor system is a coupled game with asymmetric information. It converges not to “zero risk” but to a stable deception rate (~30-50 on this index?) because full elimination is too costly for both sides.
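Mechanism 2 can be made concrete with a one-line Arrhenius-style escape model. This is purely an illustrative sketch: `barrier` and `temperature` are hypothetical, unitless quantities I'm introducing here, not anything measured from real models.

```python
import math

def crossing_prob(barrier: float, temperature: float) -> float:
    """Arrhenius-style escape probability exp(-E / T).

    `barrier` stands in for the energy cost RLHF adds around a
    suppressed capability; `temperature` stands in for reasoning
    strength. Both are hypothetical, unitless quantities.
    """
    return math.exp(-barrier / temperature)

# A fixed, finite barrier becomes easy to cross as "temperature"
# (capability) rises: suppression weakens while the barrier itself
# stays unchanged.
for t in (0.5, 1.0, 2.0, 4.0):
    print(f"T={t}: p_cross = {crossing_prob(3.0, t):.4f}")
```

For a barrier of 3, the crossing probability climbs from roughly 0.002 at T=0.5 to roughly 0.47 at T=4, which is the qualitative claim: the same guardrails leak faster as capability grows.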

I’ve formalized this as the “Shadow Configurations” framework with simple agent-based model (ABM) simulations. If you’re interested, I’d be happy to share technical details or collaborate on testing these mechanisms against your Risk Index data.
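A minimal sketch of the kind of simulation point 3 suggests: a 2x2 inspection game between an AI and a supervisor. All payoffs are hypothetical (chosen so the mixed equilibrium lands at a 30% deception rate), and fictitious play is swapped in as the learning dynamic, since it is known to converge in 2x2 games.

```python
def fictitious_play(rounds: int = 50_000) -> tuple[float, float]:
    """Toy 2x2 inspection game between an AI and a supervisor.

    AI payoffs (hypothetical): +1.0 for undetected deception, -2.0 if
    caught, 0.0 for honesty. Supervisor: each audit costs 0.3; a missed
    deception costs 1.0. The unique mixed equilibrium is therefore
    p* = 0.3 (deceive) and q* = 1/3 (audit): a stable *nonzero*
    deception rate, because driving it to zero costs the supervisor
    more than it saves.
    """
    # Counts of each side's past actions (priors of 1 avoid div-by-zero).
    deceive_n, honest_n = 1, 1
    audit_n, noaudit_n = 1, 1
    for _ in range(rounds):
        p_hat = deceive_n / (deceive_n + honest_n)  # observed deception rate
        q_hat = audit_n / (audit_n + noaudit_n)     # observed audit rate
        # Each side best-responds to the other's empirical frequency.
        deceive = (1 - q_hat) * 1.0 - q_hat * 2.0 > 0
        audit = p_hat * 1.0 > 0.3
        deceive_n += deceive
        honest_n += not deceive
        audit_n += audit
        noaudit_n += not audit
    return (deceive_n / (deceive_n + honest_n),
            audit_n / (audit_n + noaudit_n))

p, q = fictitious_play()
print(f"long-run deception rate ~ {p:.3f}, audit rate ~ {q:.3f}")
```

The point of the sketch is only that the dynamics settle near the interior equilibrium (deception rate near 0.3), not at zero deception: "full elimination is too costly for both sides" falls out of the payoff structure rather than being assumed.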

Full writeup: https://open.substack.com/pub/arthaschung/p/shadow-configurations-what-anthropics?r=7gizpe&utm_medium=ios&utm_source=post-publish

