This paper from Apple introduces Annotator Policy Models (APMs), which use interpretability techniques to infer annotators' internal safety policies from their labeling behavior, without requiring any additional annotation effort. The authors demonstrate that APMs model these policies accurately and can distinguish between sources of annotation disagreement, such as operational failures, policy ambiguity, and value pluralism.
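To make the disagreement taxonomy concrete, here is a toy sketch (not the paper's method): it fits a simple per-annotator "policy" classifier from each annotator's labels alone, then attributes disagreements using those inferred policies. The feature setup, thresholds, and attribution rules are all illustrative assumptions.

```python
# Toy illustration of inferring annotator policies from labels and
# attributing disagreements. Everything here is an assumption for the
# sake of the example, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic items: 2 features (e.g. severity scores from an upstream model).
X = rng.normal(size=(500, 2))

# Two annotators with slightly different internal policies (weight vectors),
# plus occasional slips (label noise) standing in for operational failures.
w_a, w_b = np.array([1.0, 0.5]), np.array([0.5, 1.0])

def labels(w, slip=0.05):
    y = (X @ w > 0).astype(int)
    flips = rng.random(len(y)) < slip
    return np.where(flips, 1 - y, y)

y_a, y_b = labels(w_a), labels(w_b)

# Infer each annotator's policy from their labeling behavior alone.
pol_a = LogisticRegression().fit(X, y_a)
pol_b = LogisticRegression().fit(X, y_b)

# Attribute each disagreement:
#  - either policy near its decision boundary -> policy ambiguity
#  - both policies confident but opposed      -> value pluralism
#  - policies agree yet labels differ         -> operational failure (a slip)
p_a, p_b = pol_a.predict_proba(X)[:, 1], pol_b.predict_proba(X)[:, 1]
for i in np.flatnonzero(y_a != y_b)[:10]:
    if min(abs(p_a[i] - 0.5), abs(p_b[i] - 0.5)) < 0.15:
        cause = "policy ambiguity"
    elif (p_a[i] > 0.5) != (p_b[i] > 0.5):
        cause = "value pluralism"
    else:
        cause = "operational failure"
    print(f"item {i}: a={y_a[i]} b={y_b[i]} -> {cause}")
```

The point of the sketch is the separation of concerns: once a policy model exists per annotator, a label conflict can be checked against the models rather than treated as undifferentiated noise.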
This paper introduces TIDE, a method that addresses the Rare Token and Contextual Collapse problems in LLMs by injecting token identity into every layer via an Embedding Memory. The authors demonstrate theoretical and empirical improvements across language modeling and downstream tasks.
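A minimal sketch of the core idea, assuming "Embedding Memory" means keeping each token's identity embedding around and re-injecting it into every layer's residual stream; the module names, the learned injection gate, and the omission of a causal mask are my assumptions, not the paper's code.

```python
# Sketch: a transformer block that re-injects token identity embeddings
# at every layer, so identity information is not washed out by depth.
import torch
import torch.nn as nn

class TideBlock(nn.Module):
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Learned projection gating how much token identity to inject here.
        self.inject = nn.Linear(d_model, d_model)

    def forward(self, h, tok_emb):
        # Standard pre-norm attention + MLP (causal mask omitted for brevity)...
        x = self.norm1(h)
        a, _ = self.attn(x, x, x)
        h = h + a
        h = h + self.mlp(self.norm2(h))
        # ...plus per-layer injection of the token's identity embedding.
        return h + self.inject(tok_emb)

class TideLM(nn.Module):
    def __init__(self, vocab, d_model=128, depth=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)   # the "embedding memory"
        self.blocks = nn.ModuleList(TideBlock(d_model) for _ in range(depth))
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        tok = self.emb(ids)          # token identity, kept around
        h = tok
        for blk in self.blocks:
            h = blk(h, tok)          # injected at every layer
        return self.head(h)

logits = TideLM(vocab=1000)(torch.randint(0, 1000, (2, 16)))
```

The design choice this illustrates: in a vanilla transformer, a rare token's embedding is its only identity signal and can be overwritten by deep contextual mixing; routing the original embedding into every layer gives the model a persistent handle on token identity.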