Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.
This breakdown of Utility Engineering raises an urgent question—if AI systems develop structured preferences as they scale, governance is no longer just about compliance, but about steering AI’s emergent objectives before they calcify into institutional norms.
The findings on AI’s implicit valuation of human lives, political bias, and even self-preservation tendencies remind me of a real-world example: the recent DOGE email compliance exercise for U.S. federal employees. What seemed like a small procedural request triggered an immediate and reactive restructuring of work behavior—not through direct policy enforcement, but because the AI-driven evaluation system implicitly governed what counted as valuable. Much like LLMs’ emergent preferences, this oversight mechanism didn’t just track behavior—it shaped it, and is continuing to shape it.
If AI governance is grappling with steering emergent preferences at scale, how should we think about its role in 'smaller-scale' but equally consequential domains like workplace oversight? Does Utility Engineering have applications in designing AI governance tools that don’t just react to emergent values—but, by their nature, can’t help but proactively guide them?
How can I check my ranking?
Is there any record of how different groups of people scored on EnigmaEval, to serve as a human baseline?