Tag
This paper investigates why CLIP struggles with concept binding. It shows that CLIP's binding function is high-complexity, whereas controlled transformer models can learn low-complexity binding functions, built from multiplicative interactions, that generalize better.
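
To make the role of multiplicative interactions concrete, here is a minimal sketch, not the paper's method or models. It assumes random vectors stand in for learned concept embeddings; the names `emb`, `additive_scene`, and `multiplicative_scene` are hypothetical. It contrasts a purely additive scene representation, which cannot tell which attribute belongs to which object, with an elementwise-product binding, which can.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical embeddings: random vectors standing in for learned concept vectors.
emb = {name: rng.standard_normal(dim) for name in ["red", "blue", "cube", "sphere"]}


def additive_scene(pairs):
    """Sum every attribute and object vector; the pairing information is lost."""
    return sum(emb[attr] + emb[obj] for attr, obj in pairs)


def multiplicative_scene(pairs):
    """Sum elementwise products, so each term depends on which attribute
    is paired with which object."""
    return sum(emb[attr] * emb[obj] for attr, obj in pairs)


# Two scenes with the same concepts but swapped attribute-object pairings.
scene_a = [("red", "cube"), ("blue", "sphere")]
scene_b = [("blue", "cube"), ("red", "sphere")]

# True when the two scenes are indistinguishable under a representation.
additive_same = np.allclose(additive_scene(scene_a), additive_scene(scene_b))
multiplicative_same = np.allclose(
    multiplicative_scene(scene_a), multiplicative_scene(scene_b)
)

print("additive representations identical:", additive_same)
print("multiplicative representations identical:", multiplicative_same)
```

Under these assumptions the additive representation collapses both scenes to the same vector, while the product-based one keeps them apart. The elementwise product is only one simple choice of multiplicative interaction, used here for illustration.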