Tag
Chris Olah believes that the incentives of frontier AI labs may conflict with "doing the right thing" and that the labs therefore need strict external ethical oversight, a position that sharply diverges from Dario Amodei's recent narrative framework.
Govee included a book with "White Supremacy" on the spine in a promotional lifestyle image on its website. A reader spotted the book, and the image was removed after an inquiry, sparking discussion about oversight in product imagery.
Palantir held a hack week to build new oversight tools for its software used by ICE and DHS, allowing organizations to monitor user behavior and set alerts for concerning actions.
This paper introduces Behavior Cue Reasoning, a method that trains LLMs to emit specific token sequences before behaviors, making reasoning traces more monitorable and controllable. It demonstrates that this approach improves safety oversight and efficiency by allowing external monitors to prune wasted reasoning tokens and intercept unsafe actions without sacrificing performance.
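As a rough illustration of how an external monitor might act on such cues, here is a minimal Python sketch. It is not the paper's implementation: the cue strings (`<cue:plan>`, `<cue:recheck>`, `<cue:delete_files>`), the `POLICY` table, and the `monitor` function are all hypothetical names invented for this example, and the sketch filters an already generated token list rather than intervening during decoding.

```python
from typing import Iterable, List, Tuple

# Hypothetical cue tokens and decisions; the paper's actual cue
# vocabulary is not given in the summary above.
CUE_PREFIX = "<cue:"
POLICY = {
    "<cue:plan>": "allow",          # keep this reasoning segment
    "<cue:recheck>": "prune",       # drop this segment as wasted tokens
    "<cue:delete_files>": "block",  # stop before the unsafe action
}


def monitor(tokens: Iterable[str]) -> Tuple[List[str], str]:
    """Filter a token trace using the cue that precedes each behavior.

    Returns the kept tokens and a status string. Unknown cues are
    blocked, so the monitor fails closed.
    """
    kept: List[str] = []
    skipping = False
    for tok in tokens:
        if tok.startswith(CUE_PREFIX):
            decision = POLICY.get(tok, "block")
            if decision == "block":
                return kept, f"blocked at {tok}"
            # A pruned segment is skipped until the next cue appears.
            skipping = decision == "prune"
        if not skipping:
            kept.append(tok)
    return kept, "completed"


trace = [
    "<cue:plan>", "open", "file",
    "<cue:recheck>", "again", "again",
    "<cue:plan>", "write",
    "<cue:delete_files>", "rm",
]
kept, status = monitor(trace)
print(kept)
print(status)
```

Because each behavior is announced by a cue before it happens, the monitor only needs to inspect cue tokens rather than interpret free-form reasoning, which is the property the summary describes as making traces more monitorable.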