Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)
As autonomous systems become more commonplace, we need a way to easily and naturally communicate our goals to them and to collaboratively come up with a plan for achieving those goals. To this end, we conducted a Wizard of Oz study to gather data and investigate the way operators would collaboratively make plans via a conversational ‘planning assistant’ for remote autonomous systems. We present here a corpus of 22 dialogs from expert operators, which can be used to train such a system. Data analysis shows that multimodality is key to successful interaction, measured both quantitatively and qualitatively via user feedback.
In this paper we describe a multilingual grounded language learning system adapted from an English-only system. This system learns the meaning of words used in crowd-sourced descriptions by grounding them in the physical representations of the objects they are describing. Our work presents a framework to compare the performance of the system when applied to a new language and to identify modifications necessary to attain equal performance, with the goal of enhancing the ability of robots to learn language from a more diverse range of people. We then demonstrate this system with Spanish, first by analyzing the performance of translated Spanish and then by extending this analysis to a new corpus of crowd-sourced Spanish language data. We find that with small modifications, the system is able to learn color, object, and shape words with comparable performance between languages.
A Natural Language Understanding (NLU) pipeline integrated with a 3D physics-based scene is a flexible way to develop and test language-based human-robot interaction, by virtualizing people, robot hardware and the target 3D environment. Here, interaction means both controlling robots using language and conversing with them about the user’s physical environment and her daily life. Such a virtual development framework was initially developed for the Bot Colony videogame launched on Steam in June 2014, and has been undergoing improvements since. The framework is focused on developing intuitive verbal interaction with various types of robots. Key robot functions (robot vision and object recognition, path planning and obstacle avoidance, task planning and constraints, grabbing and inverse kinematics), the human participants in the interaction, and the impact of gravity and other forces on the environment are all simulated using commercial 3D tools. The framework can be used as a robotics testbed: the results of our simulations can be compared with the output of algorithms in real robots, to validate such algorithms. A novelty of our framework is support for social interaction with robots: enabling robots to converse about people and objects in the user’s environment, as well as learning about human needs and everyday life topics from their owner.
Learning from Implicit Information in Natural Language Instructions for Robotic Manipulations
Ozan Arkan Can | Pedro Zuidberg Dos Martires | Andreas Persson | Julian Gaal | Amy Loutfi | Luc De Raedt | Deniz Yuret | Alessandro Saffiotti
Human-robot interaction often occurs in the form of instructions given from a human to a robot. For a robot to successfully follow instructions, a common representation of the world and objects in it should be shared between humans and the robot so that the instructions can be grounded. Achieving this representation can be done via learning, where both the world representation and the language grounding are learned simultaneously. However, in robotics this can be a difficult task due to the cost and scarcity of data. In this paper, we tackle the problem by separately learning the world representation of the robot and the language grounding. While this approach can address the challenges in getting sufficient data, it may give rise to inconsistencies between the two learned components. Therefore, we further propose Bayesian learning to resolve such inconsistencies between the natural language grounding and a robot’s world representation by exploiting spatio-relational information that is implicitly present in instructions given by a human. Moreover, we demonstrate the feasibility of our approach on a scenario involving a robotic arm in the physical world.
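The idea of using implicit spatial information to correct a grounding can be illustrated with a single discrete Bayes update. This is a minimal sketch, not the authors' model: the object names, the prior, the likelihood values, and the function `posterior` are all hypothetical and chosen only to show how spatial evidence from an instruction can overturn an initial belief held by the world representation.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a discrete set of candidate objects.

    prior[obj]      -- belief that `obj` is the referent before the instruction
    likelihood[obj] -- probability of the observed spatial relation if `obj`
                       were the referent
    """
    unnormalized = {obj: prior[obj] * likelihood[obj] for obj in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the evidence rules out every candidate")
    return {obj: p / total for obj, p in unnormalized.items()}


# Hypothetical belief of the perception system about which object is "the mug".
prior = {"object_a": 0.6, "object_b": 0.4}

# Hypothetical likelihood that "the mug left of the box" is a true description
# of the scene, given each candidate's observed position.
likelihood = {"object_a": 0.1, "object_b": 0.9}

belief = posterior(prior, likelihood)
best = max(belief, key=belief.get)
print(best, belief)
```

With these made-up numbers the spatial relation shifts the belief from `object_a` to `object_b`, which is the kind of inconsistency resolution the abstract describes at a high level.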
Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must have the ability to parse natural language of varying linguistic styles, ground it in potentially unfamiliar scenes, and plan and react with ambiguous environmental feedback. Generalization ability is limited by the amount of human annotated data. In particular, paired vision-language sequence data is expensive to collect. We develop a discriminator that evaluates how well an instruction explains a given path in the VLN task using multi-modal alignment. Our study reveals that only a small fraction of the high-quality augmented data from Fried et al., as scored by our discriminator, is useful for training VLN agents with similar performance. We also show that a VLN agent warm-started with pre-trained components from the discriminator outperforms the benchmark success rate of 35.5 by 10% in relative terms.
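The difference between a relative and an absolute improvement is easy to misread, so the arithmetic is worked through here. The baseline of 35.5 and the 10% figure come from the abstract; the resulting rate is not stated there and is derived only under the assumption that "relative" means multiplying the baseline by 1.10. The function names are hypothetical.

```python
def apply_relative_gain(baseline, gain):
    """Return the value reached by improving `baseline` by a fraction `gain`."""
    return baseline * (1.0 + gain)


def apply_absolute_gain(baseline, points):
    """Return the value reached by adding `points` to `baseline`."""
    return baseline + points


baseline_success_rate = 35.5  # benchmark figure quoted in the abstract

# Reading used in the abstract: a 10% improvement relative to the baseline.
relative_reading = apply_relative_gain(baseline_success_rate, 0.10)

# The alternative reading (10 points added), shown only for contrast.
absolute_reading = apply_absolute_gain(baseline_success_rate, 10.0)

print(round(relative_reading, 2))  # about 39.05
print(round(absolute_reading, 2))  # 45.5
```

Under the relative reading the gain amounts to roughly 3.55 points over the baseline rather than 10.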
It is important, for human-robot interaction, to endow the robot with the knowledge necessary to understand human needs and to be able to respond to them. We present a formalized and unified representation for indoor environments using an ontology devised for a route description task in which a robot must provide explanations to a person. We show that this representation can be used to choose a route to explain to a human as well as to verbalize it using a route perspective. Because it is based on an ontology, this representation has strong potential to evolve and adapt to many other applications. It captures the semantics of the environment elements while keeping a description of the known connectivity of the environment. This representation and the illustrative algorithms for finding and verbalizing a route have been tested in two environments of different scales.
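The two steps named above, choosing a route over the known connectivity and verbalizing it from a route perspective, can be sketched in a few lines. This is an illustration under strong assumptions: a plain adjacency dictionary stands in for the ontology, breadth-first search stands in for the authors' route selection, and the place names and the functions `shortest_route` and `verbalize` are hypothetical.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Breadth-first search: return a route with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def verbalize(path):
    """Turn a list of places into a step-by-step route description."""
    if path is None:
        return "No route is known."
    if len(path) == 1:
        return f"You are already at the {path[0]}."
    steps = [f"go to the {place}" for place in path[1:]]
    return f"From the {path[0]}, " + ", then ".join(steps) + "."


# Hypothetical connectivity of a small indoor environment.
graph = {
    "entrance": ["corridor"],
    "corridor": ["cafe", "elevator"],
    "elevator": ["shop"],
}

route = shortest_route(graph, "entrance", "shop")
print(verbalize(route))
```

An ontology would additionally carry the semantics of each element (for example, that an elevator connects floors), which a bare adjacency dictionary cannot express.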
This paper introduces SpatialNet, a novel resource which links linguistic expressions to actual spatial configurations. SpatialNet is based on FrameNet (Ruppenhofer et al., 2016) and VigNet (Coyne et al., 2011), two resources which use frame semantics to encode lexical meaning. SpatialNet uses a deep semantic representation of spatial relations to provide a formal description of how a language expresses spatial information. This formal representation of the lexical semantics of spatial language also provides a consistent way to represent spatial meaning across multiple languages. In this paper, we describe the structure of SpatialNet, with examples from English and German. We also show how SpatialNet can be combined with other existing NLP tools to create a text-to-scene system for a language.
Understanding and generating spatial descriptions requires knowledge about what objects are related, their functional interactions, and where the objects are geometrically located. Different spatial relations have different functional and geometric biases. The wide usage of neural language models in different areas, including the generation of image descriptions, motivates the study of what kind of knowledge is encoded in neural language models about individual spatial relations. With the premise that the functional bias of relations is expressed in their word distributions, we construct multi-word distributional vector representations and show that these representations perform well on intrinsic semantic reasoning tasks, thus confirming our premise. A comparison of our vector representations to human semantic judgments indicates that different biases (functional or geometric) are captured in different data collection tasks, which suggests that the contribution of the two meaning modalities is dynamic, related to the context of the task.
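To make the notion of a distributional vector for a multi-word spatial relation concrete, here is a minimal count-based sketch. It is not the authors' method, which relies on neural language models: the toy corpus is invented, and the functions `context_vector` and `cosine` are hypothetical helpers that only show how a relation such as "on top of" can be given a vector from the words it co-occurs with and then compared with other relations.

```python
import math
from collections import Counter


def context_vector(relation, corpus):
    """Count the words that co-occur with a (possibly multi-word) relation."""
    rel_tokens = relation.split()
    n = len(rel_tokens)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == rel_tokens:
                # Every other word in the sentence counts as context.
                counts.update(tokens[:i] + tokens[i + n:])
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus, for illustration only.
corpus = [
    "cup on table",
    "book on table",
    "cup on top of table",
    "lamp above table",
    "picture above sofa",
]

on_vec = context_vector("on", corpus)
on_top_of_vec = context_vector("on top of", corpus)
above_vec = context_vector("above", corpus)

print(cosine(on_vec, on_top_of_vec), cosine(on_vec, above_vec))
```

In this toy corpus "on" and "on top of" share more context words than "on" and "above", so their vectors are closer; whether such similarity reflects functional or geometric bias is the question the abstract addresses.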