LLaMA-Mesh is trained end-to-end on interleaved text and 3D data, enabling it to generate both text and 3D meshes. It is the first model to demonstrate that LLMs can be fine-tuned to acquire the complex spatial knowledge needed for 3D mesh generation in a text-based format, and it achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
Key takeaways:
- The researchers developed a method that enables large language models (LLMs) to consume and generate 3D meshes by representing meshes as plain text and fine-tuning the model on that representation.
- The method, called LLaMA-Mesh, represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary (see the sketch after this list).
- The researchers constructed a supervised fine-tuning (SFT) dataset that teaches pretrained LLMs to generate 3D meshes from text prompts, produce interleaved text and 3D mesh outputs on demand, and understand and interpret 3D meshes (an illustrative sample appears below).
- This work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities.
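
To make the text representation concrete, here is a minimal sketch of an OBJ-style plain-text mesh encoding. LLaMA-Mesh serializes meshes as OBJ text with quantized vertex coordinates; the helper name `mesh_to_text`, the bin count, and the exact quantization details below are illustrative assumptions, not the paper's released code.

```python
def mesh_to_text(vertices, faces, bins=64):
    """Serialize a triangle mesh as OBJ-style plain text.

    Vertex coordinates are quantized to integer bins so each value
    tokenizes compactly with a standard LLM tokenizer, requiring no
    new vocabulary tokens.
    """
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0

    lines = []
    for x, y, z in vertices:
        qx, qy, qz = (round((c - lo) * scale) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for a, b, c in faces:
        # OBJ face indices are 1-based
        lines.append(f"f {a + 1} {b + 1} {c + 1}")
    return "\n".join(lines)


# A single triangle; the resulting text can go directly into an LLM prompt.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2)]
print(mesh_to_text(verts, tris))
# v 0 0 0
# v 63 0 0
# v 0 63 0
# f 1 2 3
```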
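
And to make the SFT setup concrete, the following is a hypothetical training sample pairing a text instruction with an OBJ-text mesh response. The chat roles and field names follow a common instruction-tuning layout; they are illustrative assumptions, not the authors' released data format.

```python
# Hypothetical interleaved SFT sample: a text prompt answered with
# mixed text and an OBJ-text mesh in a single assistant turn.
sft_sample = {
    "messages": [
        {"role": "user", "content": "Create a 3D model of a flat triangle."},
        {
            "role": "assistant",
            "content": (
                "Here is the mesh:\n"
                "v 0 0 0\n"
                "v 63 0 0\n"
                "v 0 63 0\n"
                "f 1 2 3"
            ),
        },
    ]
}
```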