In order for robots to perform tasks in the real world, they need to understand our natural language commands. Although a lot of past research has gone into language parsing, existing approaches often require instructions to be spelled out in full detail, which makes them difficult to use in real-world situations. Our goal is to enable a robot to take even an ill-specified instruction as generic as “Make a cup of coffee” and figure out how to fill a cup with milk, or use one that already has milk, depending on how the environment looks.
You can find the details of our research, publications and videos in the research & video section. A demo of our robot working on the VEIL-200 dataset can be found here. We look forward to your support in producing more data by playing with our simulator, which will help our robots become more accurate.
Help improve our model by playing with our virtual robot and giving it commands! Below you can see a video of a person controlling our virtual robot from a first-person perspective to complete the task of making ramen.
Download: Dataset VEIL-500. Please see the ReadMe file for instructions.
Result (IED %)
Result (END %)
Manually Defined Templates
UBL- Best Parse (Kwiatkowski et al., 2010)
VEIL (Misra et al., 2014)
Environment+Lexicon+Search (Misra et al., 2015)
Code: Code is available on the CodaLab platform. It is currently in a developmental stage; we will release a production version in the future. Please send an email to dkm AT cs AT cornell DOT edu with any questions.
If you beat these results, please send us an email so that we can update the table above.
Evaluation Metrics: We use two metrics for evaluation: IED and END. IED is based on the Levenshtein string-edit distance between the two sequences, and END is based on the similarity between the final environments. See the paper for details on how to compute these metrics.
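To make the edit-distance idea behind IED concrete, here is a minimal Python sketch of Levenshtein distance applied to sequences of instructions rather than characters. The instruction strings and the `edit_similarity` normalization are assumptions made for illustration; the exact IED formula is defined in the paper and may differ.

```python
def edit_distance(predicted, reference):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of
    # `predicted` and the first j items of `reference`.
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_similarity(predicted, reference):
    """Hypothetical normalization to [0, 1]; not necessarily the paper's IED."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, reference) / longest


# Hypothetical instruction sequences, one extra step in the prediction.
predicted = ["grasp cup", "moveto sink", "fill cup"]
reference = ["grasp cup", "fill cup"]
print(edit_distance(predicted, reference))
```

Treating each instruction as one symbol means a single missing or extra step costs one edit, regardless of how long the instruction text is.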
Future Release: We plan to release a larger dataset, VEIL-1000, in the future.
Issues: No known issues.
1. In this paper*, we focus on learning the meaning of high-level verbs such as "distribute", "boil", "change" and "fill", using the environment as a signal. Our model is based on a CRF and uses a lexicon-search hybrid approach at test time to ground the utterances. The additional search procedure allows us to ground high-level verbs that were not present in the lexicon induced from the training data. Note that this is different from GENLEX-based lexicon induction, which is performed at training time and does not use the environment during generation. The schema shown below describes our algorithm and the CRF model.
(*Part of this research was done while the first and fourth authors were visiting SAIL at Stanford University)
Environment-driven lexicon induction for high-level instructions, Dipendra K Misra, Kejia Tao, Percy Liang, Ashutosh Saxena. In Association for Computational Linguistics (ACL), 2015. [PDF], [Supplementary Notes], [Data], [Bibtex]
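The lexicon-search hybrid described in item 1 can be pictured with a toy Python sketch: look the verb up in a lexicon first, and fall back to a search over the environment only when the verb is unknown. Everything here is hypothetical and chosen for illustration (the `LEXICON` and `ENVIRONMENT` contents, the `ground` function, and the string-similarity `toy_score`, which merely stands in for a learned score); it is not the CRF model from the paper.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon induced at training time: verb -> target state.
LEXICON = {"fill": "full"}

# Hypothetical environment: each object and the states it can be in.
ENVIRONMENT = {
    "cup": ["empty", "full"],
    "water": ["cold", "boiling"],
}


def toy_score(verb, state):
    """Stand-in for a learned score: plain string similarity."""
    return SequenceMatcher(None, verb, state).ratio()


def ground(verb, obj, environment, lexicon=LEXICON):
    """Return ((object, target_state), source) for a verb applied to obj."""
    if verb in lexicon:
        # Known verb: use the lexicon entry directly.
        return (obj, lexicon[verb]), "lexicon"
    # Unknown verb: search the states this environment offers for obj.
    best = max(environment[obj], key=lambda state: toy_score(verb, state))
    return (obj, best), "search"


print(ground("fill", "cup", ENVIRONMENT))
print(ground("boil", "water", ENVIRONMENT))
```

The point of the sketch is only the control flow: the search step runs at test time and depends on the environment, so a verb that never appeared in training can still be grounded.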
2. Our algorithm accepts natural language commands from the user, together with the environment in which to execute them, and outputs a sequence of instructions that can be executed by the robot. It uses a latent-CRF model trained on data given by users playing an online robotic simulator.
Our model tackles challenges such as handling missing instructions and grounding language to an appropriate instruction sequence depending on the new environment. For example, while making coffee we might have a microwave or a stove, in different configurations, for boiling, and there could be different ways of adding sugar, milk, etc. Finally, our model is trained on data given by people playing a virtual game online. We have tested our algorithm on a large variety of tasks (see paper below).
Tell Me Dave: Context-Sensitive Grounding of Natural Language to Mobile Manipulation Instructions, Dipendra K Misra, Jaeyong Sung, Kevin Lee, Ashutosh Saxena. In Robotics: Science and Systems (RSS), 2014. [PDF], [Data coming soon]
Synthesizing Manipulation Sequences for Under-Specified Tasks using Unrolled Markov Random Fields, Jaeyong Sung, Bart Selman, Ashutosh Saxena. In International Conference on Intelligent Robots and Systems (IROS), 2014. [PDF]
We also give an end-to-end implementation [above] of our robot making an affogato recipe, as verbally commanded by a person:
“Take some coffee in a cup. Add ice cream of your choice. Finally add raspberry syrup to the mixture.”
We see that this instruction is fairly ambiguous: it specifies neither which ice cream to take [which depends upon what is available] nor details such as whether to take a cup that already has coffee [if one exists] or to first make coffee and, if so, how.
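The ambiguity in "Take some coffee in a cup" can be illustrated with a small hand-written Python sketch that expands the instruction differently depending on the environment. This is only an illustration of the environment dependence, not our learned model: the environment layout, the action names (`grasp`, `fill`, `heat`, `add`) and the function `take_coffee_in_cup` are all hypothetical.

```python
def take_coffee_in_cup(environment):
    """Expand 'Take some coffee in a cup' for one environment.

    `environment` is a hypothetical dict with:
      "cups":       cup name -> contents (None when empty)
      "appliances": names of the available appliances
    """
    cups = environment["cups"]

    # Case 1: a cup that already has coffee exists, so just take it.
    for name, contents in cups.items():
        if contents == "coffee":
            return [("grasp", name)]

    # Case 2: no coffee anywhere, so make it in an empty cup.
    empty_cups = [name for name, contents in cups.items() if contents is None]
    if not empty_cups:
        raise ValueError("no usable cup in this environment")
    heaters = [a for a in ("microwave", "stove") if a in environment["appliances"]]
    if not heaters:
        raise ValueError("nothing to heat water with in this environment")

    cup = empty_cups[0]
    return [
        ("grasp", cup),
        ("fill", cup, "water"),
        ("heat", cup, heaters[0]),
        ("add", cup, "coffee powder"),
    ]


kitchen_a = {"cups": {"cup1": None, "cup2": "coffee"}, "appliances": ["stove"]}
kitchen_b = {"cups": {"cup1": None}, "appliances": ["stove"]}
print(take_coffee_in_cup(kitchen_a))
print(take_coffee_in_cup(kitchen_b))
```

The same sentence yields a one-step sequence in the first kitchen and a four-step sequence in the second, which is the kind of variation that hand-written rules like these cannot cover in general.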
Another recent work from our lab [Sung et al] looks at the problem of coming up with a sequence of actions in an unstructured environment to accomplish a task. In the video above, you see the PR2 robot serving sweet tea.