SceneCraft includes a library learning mechanism that compiles common script functions into a reusable library, allowing continuous self-improvement without costly LLM parameter tuning. In the evaluation, SceneCraft outperforms other LLM-based agents at rendering complex scenes, as measured by adherence to constraints and by human assessments. Its broader applicability is demonstrated by reconstructing detailed 3D scenes from the Sintel movie and by using generated scenes as an intermediate control signal to guide a video generative model.
Key takeaways:
- The paper introduces SceneCraft, a large language model (LLM) agent that converts text descriptions into Python scripts for rendering complex 3D scenes in Blender.
- SceneCraft uses a scene graph as a blueprint to detail spatial relationships among assets, then translates these relationships into numerical constraints for asset layout.
- SceneCraft leverages the perceptual strengths of vision-language foundation models such as GPT-V to analyze rendered images and iteratively refine the scene.
- SceneCraft also features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without the need for expensive LLM parameter tuning.
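
To make the step from scene graph to numerical constraints more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the relation names (`proximity`, `left_of`), the scoring functions, the `max_dist` threshold, and the example assets are all assumptions chosen for illustration. Each edge of the scene graph is mapped to a function that scores a candidate layout between 0 and 1, and the layout's overall score is the average over all edges.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Relation:
    """One edge of a scene graph (hypothetical schema, not from the paper)."""
    kind: str      # assumed relation names: "proximity" or "left_of"
    subject: str   # name of the first asset
    target: str    # name of the second asset


def proximity_score(a, b, max_dist=2.0):
    """1.0 when the two positions coincide, falling linearly to 0.0 at max_dist."""
    return max(0.0, 1.0 - math.dist(a, b) / max_dist)


def left_of_score(a, b):
    """1.0 if position a has a smaller x-coordinate than position b, else 0.0."""
    return 1.0 if a[0] < b[0] else 0.0


# Maps each assumed relation name to its numerical scoring function.
SCORERS = {"proximity": proximity_score, "left_of": left_of_score}


def layout_score(layout, relations):
    """Average constraint satisfaction over all edges of the scene graph."""
    if not relations:
        return 1.0
    total = sum(
        SCORERS[r.kind](layout[r.subject], layout[r.target]) for r in relations
    )
    return total / len(relations)


# A tiny scene graph: the chair is near the table, the lamp is left of the table.
scene_graph = [
    Relation("proximity", "chair", "table"),
    Relation("left_of", "lamp", "table"),
]

# Two candidate layouts mapping asset names to (x, y, z) positions.
good = {"table": (0.0, 0.0, 0.0), "chair": (1.0, 0.0, 0.0), "lamp": (-1.5, 0.0, 0.0)}
bad = {"table": (0.0, 0.0, 0.0), "chair": (5.0, 0.0, 0.0), "lamp": (1.5, 0.0, 0.0)}

print(layout_score(good, scene_graph))
print(layout_score(bad, scene_graph))
```

In a setup like this, a search or optimization routine could adjust asset positions to raise the score before the layout is written into a Blender script, and scoring functions that prove useful across scenes are the kind of code a reusable library might collect.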