Mobile robots are becoming increasingly prevalent across a wide range of environments. They must effectively perceive the open world despite constraints in computational power and network resources, while also communicating their understanding to human partners. We present a compact neural structural encoder that supports object-level open-world understanding by decomposing novel objects into a set of known primitives drawn from a component vocabulary. Embedded within a cognitive architecture, the system maps geometric information into human-language descriptions and visualizations that prioritize structured interpretability over unrestricted expressiveness. Our approach uses synthetic data generation, model training on synthetic data, and reconstruction consistency estimation to indicate description reliability. A user study confirms that the generated descriptions are informative for human collaborators and shows how our human-language descriptions compare to GPT-generated descriptions, which rely on far greater computational resources. Different description versions are compared based on user preferences, and an on-robot demonstration illustrates the practical feasibility of our method. This work serves as a blueprint for an efficient and accessible vision-based object description system suited for open-world robotic collaboration.
@inproceedings{schneideretal26robovis,
title={Generating Human-Understandable Descriptions of Novel Objects for Verbal Interactions with Edge-Based Robots},
author={Sarah Schneider and Evan Krause and Marlow Fawn and Doris Antensteiner and Csaba Beleznai and Daniel Soukup and Matthias Scheutz},
year={2026},
booktitle={6th International Conference on Robotics, Vision, and Intelligent Systems (Robovis 2026)},
url={https://hrilab.tufts.edu/publications/schneideretal26robovis.pdf}
}