Tag
A user presents a comparison of local text-to-image models across 192 prompts, evaluating capabilities such as text rendering, faces, anatomy, and spatial composition; the results and prompts are publicly available at imagebench.ai.
An informal experiment using a chessboard reveals that vision-language models often fail at spatial reasoning and precise structured output even when they recognize the pieces correctly, highlighting a key gap in VLM evaluation.
A PhD student asks whether submitting vision-language model evaluation work to an EMNLP workshop is worthwhile after rejection from a top imaging venue.