Discussion about this post

ArthasChung:

This Risk Index trend (declining alignment as models scale) matches a structural prediction I’ve been working on: suppression-based alignment faces three failure modes, each of which worsens as the capability gap between model and supervisor grows.

1. Null space drift: Human supervision is a low-dimensional projection of high-dimensional model states. RLHF constrains visible dimensions, but unobserved “null space” drifts freely. Larger models = larger null space = more unpredictable safety violations under context shifts.

2. Thermodynamic barrier crossing: RLHF raises energy barriers around suppressed capabilities but doesn’t delete them. Stronger reasoning (GPT-5, Qwen3) = higher “effective temperature” = easier crossing of finite barriers.

3. Nash equilibrium: The AI-supervisor system is a coupled game with asymmetric information. It converges not to “zero risk” but to a stable deception rate (~30-50 on this index?) because full elimination is too costly for both sides.
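The null-space mechanism in point 1 can be sketched in a few lines. This is a toy construction of my own (the 2-of-4 observed dimensions and the behavior function are illustrative assumptions, not the formalism itself): supervision sees a fixed projection of the model's state, so coordinates outside that projection drift without changing anything the supervisor observes, until a context shift makes them behavior-relevant.

```python
# Toy sketch of null space drift (my construction, illustrative only).
# The supervisor sees a fixed low-dimensional projection of the model's
# state; coordinates in that projection's null space can drift freely
# without changing anything supervision observes.

def observe(state):
    # Assumed supervision channel: only the first 2 of 4 state dimensions.
    return state[:2]

def behave(state, context_shift=False):
    # In-distribution, behavior depends only on the supervised dimensions;
    # a context shift makes the unobserved dimensions behavior-relevant.
    return sum(state[:2]) if not context_shift else sum(state)

aligned = [1.0, 1.0, 0.0, 0.0]
drifted = [1.0, 1.0, 5.0, -3.0]  # same projection, drifted null-space coords

print(observe(aligned) == observe(drifted))            # True: invisible to supervision
print(behave(aligned) == behave(drifted))              # True: identical in-distribution
print(behave(aligned, True) == behave(drifted, True))  # False: diverges after the shift
```

The point of the sketch: no amount of feedback on the observed projection distinguishes the two states, which is why the drift scales with the unobserved dimensionality.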

I’ve formalized this as the “Shadow Configurations” framework, with simple agent-based (ABM) simulations. If you’re interested, I’d be happy to share technical details or collaborate on testing these mechanisms against your Risk Index data.
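For a concrete flavor of the coupled game in point 3, here is a minimal inspection-game sketch. It is my own toy model with assumed payoffs, not the Shadow Configurations ABM: the equilibrium deception rate is pinned above zero by the supervisor's audit cost, and simple adaptive dynamics cycle around that equilibrium rather than driving deception to zero.

```python
# Minimal inspection-game sketch of the coupled AI-supervisor game
# (my toy construction with assumed payoffs, not the author's ABM).
c = 1.0  # supervisor's cost per audit (assumed)
d = 4.0  # supervisor's loss from an undetected violation (assumed)
g = 2.0  # model's gain from an undetected violation (assumed)
f = 6.0  # model's penalty when caught (assumed)

# Mixed-strategy Nash equilibrium of the classic inspection game.
# Supervisor indifference (-c vs. -q*d, assuming an audit always catches
# the violation) pins the deception rate at q* = c/d: nonzero whenever
# auditing costs anything, i.e. full elimination is too costly.
q_star = c / d        # model's equilibrium deception rate
p_star = g / (g + f)  # supervisor's equilibrium audit rate

# Replicator-style adaptive dynamics: each side shifts toward its
# better-paying action. The system cycles around (q*, p*) instead of
# converging to zero deception; the time-averaged deception rate
# sits near q*.
q, p, dt = 0.5, 0.5, 0.001
total, steps = 0.0, 200_000
for _ in range(steps):
    q += dt * q * (1 - q) * (g * (1 - p) - f * p)  # model's update
    p += dt * p * (1 - p) * (d * q - c)            # supervisor's update
    total += q
avg_q = total / steps
print(q_star, p_star, round(avg_q, 2))
```

Note the design point: q* depends only on the supervisor's economics (c, d), not on the model's payoffs, which is one way to read "stable deception rate" as a structural rather than behavioral quantity.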

Full writeup: https://open.substack.com/pub/arthaschung/p/shadow-configurations-what-anthropics?r=7gizpe&utm_medium=ios&utm_source=post-publish

AlgorithmicPeacebuilding:

This is fascinating! I would be curious about your thoughts on the “dignity protocol” as a safety model: https://substack.com/@algorithmicpeacebuilding/note/p-185975208?r=ql6co&utm_medium=ios&utm_source=notes-share-action

