Visual question answering systems empower users to ask any question about any image and receive a valid answer. However, existing systems do not yet account for the fact that a visual question can lead to a single answer or multiple different answers. While a crowd often agrees, disagreements do arise for many reasons including that visual questions are ambiguous, subjective, or difficult. We propose a model, CrowdVerge, for automatically predicting from a visual question whether a crowd would agree on one answer. We then propose how to exploit these predictions in a novel application to efficiently collect all valid answers to visual questions. Specifically, we solicit fewer human responses when answer agreement is expected and more human responses otherwise. Experiments on 121,811 visual questions asked by sighted and blind people show that, compared to existing crowdsourcing systems, our system captures the same answer diversity with typically 14-23% less crowd involvement.
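The allocation idea described above (fewer human responses when agreement is expected, more otherwise) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, and the choice of 1 versus 5 answers are hypothetical values picked only for illustration, and the agreement scores are assumed to come from some predictor such as CrowdVerge.

```python
from typing import Sequence


def answers_to_request(p_agree: float,
                       threshold: float = 0.5,
                       min_answers: int = 1,
                       max_answers: int = 5) -> int:
    """Return how many crowd answers to solicit for one visual question.

    p_agree is an assumed predicted probability that a crowd would agree
    on a single answer. The threshold and answer counts are illustrative
    assumptions, not values taken from the paper.
    """
    if not 0.0 <= p_agree <= 1.0:
        raise ValueError("p_agree must lie in [0, 1]")
    # Agreement expected: a small number of answers should suffice.
    # Disagreement expected: collect more answers to capture the diversity.
    return min_answers if p_agree >= threshold else max_answers


def crowd_savings(p_agree_scores: Sequence[float],
                  threshold: float = 0.5,
                  min_answers: int = 1,
                  max_answers: int = 5) -> float:
    """Fraction of answers saved relative to always requesting max_answers."""
    if not p_agree_scores:
        return 0.0
    baseline = max_answers * len(p_agree_scores)
    used = sum(
        answers_to_request(p, threshold, min_answers, max_answers)
        for p in p_agree_scores
    )
    return (baseline - used) / baseline


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.8, 0.1]  # made-up agreement predictions
    print([answers_to_request(p) for p in scores])
    print(crowd_savings(scores))
```

The savings in this toy example depend entirely on the made-up scores and budgets, so they should not be compared with the 14-23% figure reported above.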
D. Gurari and K. Grauman. "CrowdVerge: Predicting If People Will Agree on the Answer to a Visual Question." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2017.
Crowdsourced answers for 1,499 visual questions in the VizWiz dataset.
We gratefully acknowledge funding from the Office of Naval Research (YIP N00014-12-1-0754).
For questions and/or comments, feel free to contact:
Danna Gurari
danna.gurari@ischool.utexas.edu