LLaMA-Mesh is trained end-to-end on interleaved text and 3D data, enabling it to generate both text and 3D meshes. It is the first model to demonstrate that LLMs can be fine-tuned to acquire the complex spatial knowledge needed for 3D mesh generation in a text-based format, and it achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
Key takeaways:
- The researchers developed a method that enables large language models (LLMs) to consume and generate 3D meshes by representing meshes as plain text and fine-tuning the model on that representation.
- The method, called LLaMA-Mesh, represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary (see the sketch after this list).
- The researchers constructed a supervised fine-tuning (SFT) dataset that teaches pretrained LLMs to generate 3D meshes from text prompts, produce interleaved text and 3D mesh outputs on demand, and understand and interpret 3D meshes (an illustrative sample appears below).
- This work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities.
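
To make the text representation concrete, here is a minimal sketch of an OBJ-style plain-text mesh encoding. LLaMA-Mesh serializes meshes as OBJ text with quantized vertex coordinates; the helper name `mesh_to_text`, the bin count, and the exact quantization details below are illustrative assumptions, not the paper's released code.

```python
def mesh_to_text(vertices, faces, bins=64):
    """Serialize a triangle mesh as OBJ-style plain text.

    Vertex coordinates are quantized to integer bins so each value
    tokenizes compactly with a standard LLM tokenizer, requiring no
    new vocabulary tokens.
    """
    lo = min(c for v in vertices for c in v)
    hi = max(c for v in vertices for c in v)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0

    lines = []
    for x, y, z in vertices:
        qx, qy, qz = (round((c - lo) * scale) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for a, b, c in faces:
        # OBJ face indices are 1-based
        lines.append(f"f {a + 1} {b + 1} {c + 1}")
    return "\n".join(lines)


# A single triangle; the resulting text can go directly into an LLM prompt.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2)]
print(mesh_to_text(verts, tris))
# v 0 0 0
# v 63 0 0
# v 0 63 0
# f 1 2 3
```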
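
And to make the SFT setup concrete, the following is a hypothetical training sample pairing a text instruction with an OBJ-text mesh response. The chat roles and field names follow a common instruction-tuning layout; they are illustrative assumptions, not the authors' released data format.

```python
# Hypothetical interleaved SFT sample: a text prompt answered with
# mixed text and an OBJ-text mesh in a single assistant turn.
sft_sample = {
    "messages": [
        {"role": "user", "content": "Create a 3D model of a flat triangle."},
        {
            "role": "assistant",
            "content": (
                "Here is the mesh:\n"
                "v 0 0 0\n"
                "v 63 0 0\n"
                "v 0 63 0\n"
                "f 1 2 3"
            ),
        },
    ]
}
```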