JARVIS-1: Open-Ended Multi-task Agents with Memory-Augmented Multimodal Language Models

The article introduces JARVIS-1, an open-ended agent for the Minecraft universe that can perceive multimodal input, generate plans, and perform embodied control. The agent is built on pre-trained multimodal language models, which map visual observations and textual instructions to plans. Its multimodal memory supports planning by combining pre-trained knowledge with actual in-game survival experience. JARVIS-1 achieves near-perfect performance on more than 200 varied Minecraft tasks and a 12.5% completion rate on the long-horizon diamond pickaxe task, a significant increase over previous records.

JARVIS-1 can self-improve under a life-long learning paradigm because its multimodal memory grows with experience, which the article credits with more general intelligence and improved autonomy. The article compares the agent's performance on the same task at different learning stages and shows that it can execute human instructions in diverse environments. It concludes with additional Minecraft results and pointers to related projects.
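The idea of a growing memory that feeds past experience back into planning can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`PlanMemory`, `build_prompt`, `token_overlap`) are hypothetical, the memory here is text-only rather than multimodal, and word overlap stands in for whatever learned similarity the real system uses. It only shows the general pattern of storing plans that worked and retrieving the most similar ones as examples for the planner.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str        # textual instruction, e.g. "craft a wooden pickaxe"
    plan: list[str]  # sequence of sub-goals that succeeded for that task


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets.

    Assumption: a crude stand-in for a learned (multimodal) similarity.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


@dataclass
class PlanMemory:
    """Hypothetical text-only memory of successful plans."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, task: str, plan: list[str]) -> None:
        # Called after a task succeeds, so the memory grows with experience.
        self.entries.append(MemoryEntry(task, list(plan)))

    def retrieve(self, task: str, k: int = 2) -> list[MemoryEntry]:
        # Rank stored entries by similarity to the new task, keep the top k
        # and drop entries that share no words with the query.
        ranked = sorted(
            self.entries,
            key=lambda e: token_overlap(task, e.task),
            reverse=True,
        )
        return [e for e in ranked[:k] if token_overlap(task, e.task) > 0]


def build_prompt(task: str, memory: PlanMemory, k: int = 2) -> str:
    """Assemble a planner prompt with retrieved plans as in-context examples."""
    lines = []
    for entry in memory.retrieve(task, k):
        lines.append(f"Task: {entry.task}")
        lines.append("Plan: " + " -> ".join(entry.plan))
    lines.append(f"Task: {task}")
    lines.append("Plan:")
    return "\n".join(lines)


if __name__ == "__main__":
    memory = PlanMemory()
    memory.add(
        "craft a wooden pickaxe",
        ["mine log", "craft planks", "craft sticks",
         "craft crafting table", "craft wooden pickaxe"],
    )
    memory.add(
        "cook a chicken",
        ["kill chicken", "mine log", "craft furnace", "cook chicken"],
    )
    print(build_prompt("craft a stone pickaxe", memory, k=1))
```

In this sketch, each successful episode adds one entry, so later attempts at similar tasks see more relevant examples in the prompt. That is the sense in which a memory of this kind can improve planning without retraining the underlying model.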

Key takeaways:

JARVIS-1 is an open-ended agent that can perceive multimodal input, generate sophisticated plans, and perform embodied control in the Minecraft universe. It is built on top of pre-trained multimodal language models and is equipped with a multimodal memory.
The agent can self-improve following a life-long learning paradigm, demonstrating improved performance over time. For example, it learned to mine an extra log for fuel by the third epoch of a task.
JARVIS-1 achieves near-perfect performance on more than 200 varied Minecraft tasks, from entry to intermediate level. It reached a 12.5% completion rate on the long-horizon diamond pickaxe task, a significant increase over previous records.
The agent can execute human instructions in diverse environments, demonstrating its ability to adapt to different biomes in the Minecraft universe.
