
GitHub - ShaShekhar/aaiela

Jul 01, 2024 - github.com
The AAIELA project allows users to edit images using audio commands, bridging the gap between spoken language and visual transformation. It uses open-source AI models for computer vision, speech-to-text, large language models (LLMs), and text-to-image inpainting. Its components are Detectron2 for object detection, Faster Whisper for audio transcription, a language model that extracts instructions from natural language, and a Stable Diffusion Inpainting model for text-conditioned image editing.

The project workflow involves uploading an image, segmenting it using Detectron2, recording an audio command, transcribing the audio into text, interpreting the text with an LLM, and then editing the image using the Stable Diffusion Inpainting model. The project's research focuses on improving the inpainting model, generating masks automatically, understanding relationships between objects and actions, and integrating a vision-language model for user interaction. Future plans include integrating ControlNet, MediaPipe Face Mesh, pose landmark detection, a super-resolution model, and interactive mask editing.
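The workflow described above can be sketched as a chain of four stages. This is only an illustrative skeleton, not the repository's actual code: the names `EditPipeline`, `Instruction`, `segment`, `transcribe`, `understand`, and `inpaint` are hypothetical, and each stage is passed in as a plain callable so that real models (Detectron2, Faster Whisper, an LLM, Stable Diffusion Inpainting) could be plugged in behind them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation: a mask is a grid of 0/1 values, keyed by object label.
Mask = List[List[int]]


@dataclass
class Instruction:
    """What the language model is assumed to extract from the transcript."""
    target: str  # label of the segmented object to edit, e.g. "sky"
    action: str  # requested edit, e.g. "replace"
    prompt: str  # text prompt handed to the inpainting model


@dataclass
class EditPipeline:
    """Chains the four stages; every stage is an injected callable (an assumption)."""
    segment: Callable[[Any], Dict[str, Mask]]  # image -> {label: mask}
    transcribe: Callable[[bytes], str]         # audio bytes -> text
    understand: Callable[[str], Instruction]   # text -> structured instruction
    inpaint: Callable[[Any, Mask, str], Any]   # image, mask, prompt -> edited image

    def run(self, image: Any, audio: bytes) -> Any:
        masks = self.segment(image)
        text = self.transcribe(audio)
        instruction = self.understand(text)
        if instruction.target not in masks:
            # The spoken command named an object the segmenter did not find.
            raise KeyError(f"no segmented object labelled {instruction.target!r}")
        return self.inpaint(image, masks[instruction.target], instruction.prompt)
```

Keeping the stages as injected callables means each model can be swapped or stubbed out independently, which is convenient when testing the control flow without loading any weights.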

Key takeaways:

  • AAIELA lets users modify images using audio commands, leveraging open-source AI models for computer vision, speech-to-text, large language models, and text-to-image inpainting.
  • The project structure includes Detectron2 for object detection, faster_whisper for audio transcription, a language model that extracts the object, action, and prompt from a natural-language instruction, and a text-conditioned Stable Diffusion Inpainting model.
  • The project workflow includes user image upload, segmentation, audio input, transcription, language understanding, image inpainting, and output of the inpainted image.
  • Planned improvements include integrating ControlNet conditioned on keypoints, depth, input scribbles, and other modalities; integrating MediaPipe Face Mesh for facial feature modification; adding pose landmark detection; incorporating a super-resolution model for image upscaling; and implementing interactive mask editing.
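
The language-understanding step in the takeaways extracts an object, an action, and a prompt from the transcribed instruction. One way this could look, assuming the LLM is asked to answer with a JSON object holding those three keys (the reply format and the function name `parse_instruction` are assumptions made for illustration, not taken from the project), is:

```python
import json
from typing import Dict

REQUIRED_KEYS = ("object", "action", "prompt")


def parse_instruction(llm_reply: str) -> Dict[str, str]:
    """Pull the object/action/prompt fields out of an LLM reply.

    Assumes the reply contains a single JSON object, possibly surrounded
    by extra chatter, which is tolerated by slicing from the first '{'
    to the last '}'.
    """
    start = llm_reply.find("{")
    end = llm_reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")

    # json.JSONDecodeError is a subclass of ValueError, so callers
    # only need to handle ValueError.
    data = json.loads(llm_reply[start:end + 1])

    result = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
        result[key] = value.strip()

    # Normalise the fields that are matched against labels; leave the prompt as written.
    result["object"] = result["object"].lower()
    result["action"] = result["action"].lower()
    return result
```

For example, a reply such as `Sure! {"object": "Sky", "action": "replace", "prompt": "a starry night sky"}` would yield the object `sky`, the action `replace`, and the prompt unchanged; the lower-cased object can then be matched against the labels produced by segmentation.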