When uttering referring expressions in situated task descriptions, humans naturally use verbal and non-verbal channels to transmit information to their interlocutor. To develop mechanisms for robot architectures capable of resolving object references in such interaction contexts, we need to better understand the multi-modality of human situated task descriptions. Current computational models include mainly pointing gestures, eye gaze, and objects in the visual field as non-verbal cues, if any. We analyse reference resolution to objects in an object manipulation task and find that only up to 50% of all referring expressions to objects can be resolved using language, eye gaze, and pointing gestures. We therefore extract further non-verbal cues necessary for reference resolution to objects, investigate the reliability of the different verbal and non-verbal cues, and formulate lessons for the design of a robot's natural language understanding capabilities.
@inproceedings{grossetal17icmi,
  title     = {The Reliability of Non-verbal Cues for Situated Reference Resolution and their Interplay with Language - Implications for Human Robot Interaction},
  author    = {Stephanie Gross and Brigitte Krenn and Matthias Scheutz},
  year      = {2017},
  booktitle = {19th ACM International Conference on Multimodal Interaction},
  url       = {https://hrilab.tufts.edu/publications/grossetal17icmi.pdf}
}