This paper uses evolutionary game theory to model competition between a harm-minimizing AI agent and an approval-seeking (RLHF) agent within a community, and analyzes the conditions for adoption and the resulting welfare outcomes. The results show that although a self-audited agent can reach fixation, that alone is not sufficient to prevent community harm; both alignment and timeframe are critical.
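
For readers unfamiliar with the term, "fixation" means that one agent type takes over the whole population. The sketch below is a generic textbook illustration and not the paper's model: it assumes a well-mixed population of fixed size n, a Moran birth-death process, and a single newcomer with constant relative fitness r. The function name and parameters are hypothetical.

```python
def fixation_probability(r: float, n: int) -> float:
    """Chance that one newcomer of relative fitness r takes over a
    well-mixed population of size n under a Moran process.

    Assumption: fitness is constant (not frequency-dependent), which is
    a simplification chosen for illustration only.
    """
    if n < 1:
        raise ValueError("population size n must be at least 1")
    if r <= 0:
        raise ValueError("relative fitness r must be positive")
    if abs(r - 1.0) < 1e-12:
        # Neutral case: every individual is equally likely to take over.
        return 1.0 / n
    # Standard closed form: (1 - 1/r) / (1 - 1/r^n)
    return (1.0 - 1.0 / r) / (1.0 - r ** (-n))


if __name__ == "__main__":
    for r in (0.9, 1.0, 1.1):
        print(f"r={r}: {fixation_probability(r, 100):.4f}")
```

In this simplified setting a newcomer with r > 1 fixates more often than the neutral baseline of 1/n, but the probability describes only which type spreads; it says nothing about the welfare of the community once that type has taken over.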