RankJudge is a benchmark generator that builds pairs of multi-turn conversations with injected flaws, then uses them to evaluate whether LLM judges can correctly tell the better response from the worse one in complex dialogues.
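
The idea of pairing conversations and scoring a judge on them can be illustrated with a minimal Python sketch. This is not RankJudge's actual code or API: every name here (`inject_flaw`, `make_pair`, `judge_accuracy`, `truncate_reply`) is hypothetical, and the sketch assumes the simplest possible setup, in which one conversation of each pair is the unmodified original and the other differs by a single flawed assistant turn.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]          # (role, text); roles are "user" or "assistant"
Conversation = List[Turn]


@dataclass
class Pair:
    """Two conversations shown to a judge, plus the known-better label."""
    a: Conversation
    b: Conversation
    better: str                 # "a" or "b"


def inject_flaw(conv: Conversation, turn_index: int,
                flaw: Callable[[str], str]) -> Conversation:
    """Return a copy of conv with one assistant turn degraded by `flaw`."""
    role, text = conv[turn_index]
    if role != "assistant":
        raise ValueError("flaws are injected into assistant turns only")
    flawed = list(conv)         # shallow copy; the original stays untouched
    flawed[turn_index] = (role, flaw(text))
    return flawed


def make_pair(conv: Conversation, turn_index: int,
              flaw: Callable[[str], str], rng: random.Random) -> Pair:
    """Pair the original with its flawed copy, in random order."""
    flawed = inject_flaw(conv, turn_index, flaw)
    # Randomizing the order keeps a judge from winning via position bias.
    if rng.random() < 0.5:
        return Pair(a=conv, b=flawed, better="a")
    return Pair(a=flawed, b=conv, better="b")


def judge_accuracy(pairs: List[Pair],
                   judge: Callable[[Conversation, Conversation], str]) -> float:
    """Fraction of pairs where the judge picks the known-better side."""
    if not pairs:
        return 0.0
    correct = sum(1 for p in pairs if judge(p.a, p.b) == p.better)
    return correct / len(pairs)


def truncate_reply(text: str) -> str:
    """Hypothetical flaw: cut the reply off halfway."""
    return text[: len(text) // 2]
```

In practice `judge` would wrap a call to an LLM that returns "a" or "b". Because the better side is known by construction, accuracy can be computed without human labels.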