The project workflow is: the user uploads an image, Detectron2 segments it, an audio command is recorded and transcribed to text, an LLM interprets the instruction, and a Stable Diffusion inpainting model applies the edit. Research efforts focus on improving the inpainting model, generating masks automatically, modeling the relationships between objects and actions, and integrating a vision-language model for user interaction. Future plans include integrating ControlNet, MediaPipe Face Mesh, pose landmark detection, a super-resolution model, and interactive mask editing.
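As a concrete illustration of the segmentation step, here is a minimal sketch that derives a binary mask with Detectron2. The COCO Mask R-CNN config, the input path, and the target class ("person") are illustrative assumptions, not values taken from the project.

```python
# Sketch of the segmentation step; config, paths, and target class are assumptions.
import cv2
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread("input.jpg")              # BGR, as Detectron2 expects
instances = predictor(image)["instances"].to("cpu")

# Merge all instance masks of the requested class into one binary mask.
person_class_id = 0                          # COCO index for "person"
keep = instances.pred_classes == person_class_id
mask = instances.pred_masks[keep].any(dim=0).numpy().astype(np.uint8) * 255
cv2.imwrite("mask.png", mask)
```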
Key takeaways:
- The project AAIELA lets users modify images with audio commands, leveraging open-source models for computer vision, speech-to-text, language understanding, and text-to-image inpainting.
- The project structure includes Detectron2 for object detection, faster_whisper for audio transcription, a language model that extracts the object, action, and prompt from a natural-language instruction, and a text-conditioned Stable Diffusion inpainting model (minimal sketches of the transcription, language-understanding, and inpainting steps follow this list).
- The workflow runs from image upload through segmentation, audio capture, transcription, and language understanding to inpainting and output of the edited image.
- Future improvements include ControlNet conditioned on keypoints, depth, input scribbles, and other modalities; MediaPipe Face Mesh for facial-feature modification; pose landmark detection; a super-resolution model for image upscaling; and interactive mask editing.
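The transcription step can be sketched with faster_whisper as follows; the model size ("base"), device settings, and audio path are assumptions for illustration.

```python
# Sketch of the speech-to-text step; model size and paths are assumptions.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("command.wav")
command_text = " ".join(segment.text.strip() for segment in segments)
print(command_text)   # e.g. "replace the sky with a sunset"
```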
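For the language-understanding step, the project's actual LLM backend is not specified here, so `call_llm` below is a hypothetical stand-in for whatever chat or completion API is used; the JSON schema simply mirrors the object / action / prompt triple described above.

```python
# Sketch of instruction parsing; `call_llm` is a hypothetical LLM wrapper.
import json

EXTRACTION_PROMPT = """\
Extract the target object, the action, and an inpainting prompt from the
instruction below. Reply with JSON only, for example:
{{"object": "sky", "action": "replace", "prompt": "a vivid orange sunset"}}

Instruction: {instruction}
"""

def parse_instruction(instruction: str, call_llm) -> dict:
    """Ask the LLM for a structured {object, action, prompt} triple."""
    reply = call_llm(EXTRACTION_PROMPT.format(instruction=instruction))
    return json.loads(reply)

# Example (assuming `call_llm` wraps some chat API):
#   parse_instruction("replace the sky with a sunset", call_llm)
#   -> {"object": "sky", "action": "replace", "prompt": "a vivid orange sunset"}
```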
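Finally, the inpainting step can be sketched with the diffusers `StableDiffusionInpaintPipeline`; the checkpoint name, image size, and prompt are assumptions rather than values confirmed by the project.

```python
# Sketch of text-conditioned inpainting; checkpoint and prompt are assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # white = repaint

result = pipe(prompt="a vivid orange sunset sky",
              image=image, mask_image=mask).images[0]
result.save("inpainted.png")
```

Together these sketches trace the pipeline end to end: the segmentation mask selects the region, the transcribed and parsed command supplies the prompt, and the inpainting model rewrites only the masked pixels.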