Nice argument. Should the key be "running girl" or just "girl"? I deliberately chose "girl" alone to illustrate how deep learning learns the language model. The learned model as a whole, not just this one image, should learn that "running" is more likely to be associated with "girl" than with "lamp". Hence, even if the key does not capture the word "running", or the girl in the image is not actually running, "girl" should still receive higher attention than "lamp". You may object that the girl is not running in the second case, but I could argue that she is just catching her breath and about to run for the train again. So the point is not only to find an exact match or a similarity, but to score relevance: that score is the attention.
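To make the "score the relevance" idea concrete, here is a minimal sketch of scaled dot-product attention over two keys. The three-dimensional vectors are invented purely for illustration (a trained model would learn them), and the names `query_running`, `keys`, and `attention_weights` are hypothetical, not taken from any particular library.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Relevance score = dot product of query and key, scaled by sqrt(dimension).
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Softmax turns raw scores into weights that sum to 1.
    return softmax(scores)

# Assumption: toy embeddings, made up so that "running" lies closer to "girl".
query_running = [0.9, 0.1, 0.3]
keys = {
    "girl": [0.8, 0.2, 0.4],
    "lamp": [0.1, 0.9, 0.0],
}

weights = dict(zip(keys, attention_weights(query_running, list(keys.values()))))
print(weights)
```

With these made-up vectors, "girl" gets the larger weight even though the key contains no "running" at all. "lamp" still receives a nonzero weight, which is the difference between scoring relevance and demanding an exact match.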
