AI agent Dynalang could change the way we talk to robots


Dynalang is an AI agent that understands language and its environment by using a multimodal world model to predict the future.

A major challenge in AI research is to enable AI agents, such as robots, to communicate naturally with humans. Today’s agents, such as Google’s PaLM-SayCan, understand simple commands like “get the blue block”. But they struggle with more complex language situations, such as knowledge transfer (“the top left button turns off the TV”), situational information (“we’re running out of milk”), or coordination (“the living room has already been vacuumed”).

For example, when an agent hears “I put the bowls away,” it should respond differently depending on the task: If it is washing dishes, it should move on to the next cleaning step; if it is serving dinner, it should go get the bowls.

In a new paper, UC Berkeley researchers hypothesize that language can help AI agents anticipate the future: what they will see, how the world will react, and what situations are important. With the right training, this could create an agent that learns a model of its environment through language and responds better in those situations.


Dynalang relies on token and image prediction in DeepMind's DreamerV3

The team developed Dynalang, an AI agent that learns a model of the world from visual and textual input. It builds on Google DeepMind's DreamerV3, condenses multimodal inputs into a shared representation, and is trained to predict future representations based on its actions.

The approach is similar to training large language models, which learn to predict the next token in a sentence. What makes Dynalang unique is that the agent learns by predicting not only future text but also observations (images) and rewards. This also sets it apart from other reinforcement learning approaches, which usually only predict optimal actions.

During world-model learning, the model compresses image and text observations into a latent representation. It is trained to predict the next representation and to reconstruct the observations from it. During policy learning, rollouts are sampled from the world model, and the policy is trained to maximize predicted rewards. | Image: Lin et al.
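The two-part scheme (compress multimodal observations into a latent, then learn to predict the next latent) can be sketched in a few lines. The following toy numpy example is illustrative only: it stands in a frozen linear-plus-tanh "encoder" and a plain linear dynamics model for DreamerV3's actual recurrent state-space model, and all dimensions, names, and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (invented; not Dynalang's actual sizes).
IMG, TXT, LATENT, ACT = 16, 8, 6, 4

# Frozen random "encoder" fusing image and text features into one latent.
W_enc = rng.normal(0, 0.1, (LATENT, IMG + TXT))

def encode(image_vec, text_vec):
    """Compress a multimodal observation into a shared latent."""
    return np.tanh(W_enc @ np.concatenate([image_vec, text_vec]))

# Learnable dynamics model: predict the next latent from (latent, action).
W_dyn = rng.normal(0, 0.1, (LATENT, LATENT + ACT))

def train_dynamics(trajectory, lr=0.1, epochs=200):
    """Fit W_dyn by squared-error regression on consecutive latents."""
    global W_dyn
    losses = []
    for _ in range(epochs):
        total = 0.0
        for (img, txt, act), (img2, txt2, _) in zip(trajectory, trajectory[1:]):
            z, z_next = encode(img, txt), encode(img2, txt2)
            x = np.concatenate([z, act])
            err = W_dyn @ x - z_next          # next-latent prediction error
            W_dyn -= lr * np.outer(err, x)    # SGD step on squared error
            total += float(err @ err)
        losses.append(total)
    return losses

# Synthetic trajectory of (image, text, action) steps.
traj = [(rng.normal(size=IMG), rng.normal(size=TXT), rng.normal(size=ACT))
        for _ in range(20)]
losses = train_dynamics(traj)
```

In the real system, the policy is then trained purely inside this learned model, on imagined rollouts, rather than on new environment interaction.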

According to the team, Dynalang extracts relevant information from text and learns multimodal associations. For example, if the agent reads, “The book is in the living room,” and later sees the book there, the agent will correlate the language and visuals through their impact on its predictions.

The team evaluated Dynalang in a number of interactive environments with different language contexts. These included a simulated home environment, where the agent receives cues about future observations, dynamics, and corrections to perform cleaning tasks more efficiently; a gaming environment; and realistic 3D house scans for navigation tasks.

Dynalang can also learn from web data

Across all tasks, Dynalang learned to use language and image prediction to improve its performance, often outperforming specialized AI architectures. The agent can also generate text and read manuals to learn new games. The team also shows that the architecture allows Dynalang to be trained on offline data without actions or rewards, that is, text and video data not actively collected while exploring an environment. In one test, the researchers trained Dynalang on a small dataset of short stories, which improved the agent's performance.
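One plausible way to train on data that lacks some modalities is to simply drop the loss terms for whatever is missing. The sketch below is an assumption about how such masking could look, not Dynalang's actual objective; the function name and dictionary layout are invented.

```python
import numpy as np

def world_model_loss(pred, target, has_image, has_action_reward):
    """Sum per-modality squared errors, skipping modalities absent
    from the offline data.

    pred / target: dicts with "text", "image", and "reward" arrays.
    A text-only corpus (e.g. short stories) would pass has_image=False
    and has_action_reward=False, so only the text term trains the model.
    """
    loss = float(np.mean((pred["text"] - target["text"]) ** 2))
    if has_image:
        loss += float(np.mean((pred["image"] - target["image"]) ** 2))
    if has_action_reward:
        loss += float(np.mean((pred["reward"] - target["reward"]) ** 2))
    return loss
```

With this kind of masking, the same world model can consume full agent trajectories, action-free video, or plain text within one training loop.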


Dynalang project page.
