GitHub - RobViren/kvoicewalk: A random walk voice style cloning application for Kokoro text to speech

May 21, 2025 - github.com
KVoiceWalk creates new Kokoro voice style tensors that closely mimic a target voice. It uses a random walk algorithm guided by a hybrid scoring method that combines Resemblyzer similarity, feature extraction, and self-similarity. The project builds on Kokoro and Resemblyzer to evolve voice tensors toward the target audio. Results so far are promising, and the scoring method could serve as a foundation for future genetic algorithms. Given a target text and target audio, the tool generates voice tensors and audio files, and its command-line arguments support experimentation to refine the output.
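
To make the random walk idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the accept-only-if-better rule, the Gaussian step, the names `random_walk` and `toy_score`, and the four-number "tensor" are all assumptions chosen for illustration. In the real tool the score would come from synthesizing audio with Kokoro and comparing it to the target.

```python
import random

def random_walk(start, score_fn, steps=1000, step_size=0.05, seed=0):
    """Greedy random walk (assumed rule): perturb the current best vector
    and keep the candidate only when its score is higher."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score_fn(best)
    for _ in range(steps):
        # Add small Gaussian noise to every component of the current best.
        candidate = [x + rng.gauss(0.0, step_size) for x in best]
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Hypothetical stand-in for the real scoring: negative squared distance
# to a fixed vector, so higher is better and 0.0 is a perfect match.
TARGET = [0.3, -0.7, 1.2, 0.0]

def toy_score(vec):
    return -sum((a - b) ** 2 for a, b in zip(vec, TARGET))

start = [0.0, 0.0, 0.0, 0.0]
best, best_score = random_walk(start, toy_score)
print(best_score >= toy_score(start))
```

Because a candidate is accepted only when it scores higher, the best score can never fall below the starting score, which is why a good starting point matters.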

The project includes an interpolated start function that searches for the best starting population of tensors for cloning the target voice. The scoring function, which was challenging to get right, takes a harmonic mean of self-similarity, feature similarity, and target similarity; this prevents overfitting and maintains audio quality. The process is random and not parallelized, but it demonstrates the potential of the approach. Future improvements may involve genetic algorithms or alternative voice generation methods.
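
The effect of the harmonic mean can be seen in a small sketch. The function name, the unweighted form, and the assumption that all three inputs lie in (0, 1] are illustrative; the summary does not say how the project weights or normalizes the three terms.

```python
def harmonic_mean_score(target_sim, self_sim, feature_sim):
    """Unweighted harmonic mean of three similarity scores (assumed form).
    Returns 0.0 if any component is zero or negative."""
    parts = [target_sim, self_sim, feature_sim]
    if any(p <= 0 for p in parts):
        return 0.0
    return len(parts) / sum(1.0 / p for p in parts)

def arithmetic_mean(target_sim, self_sim, feature_sim):
    return (target_sim + self_sim + feature_sim) / 3.0

# A candidate that matches the target well but has one poor component.
lopsided = (0.95, 0.95, 0.10)
# A candidate that is moderately good on every component.
balanced = (0.70, 0.70, 0.70)

print(harmonic_mean_score(*lopsided) < arithmetic_mean(*lopsided))
print(harmonic_mean_score(*balanced) > harmonic_mean_score(*lopsided))
```

The harmonic mean is dominated by its smallest input, so a candidate cannot earn a high combined score by maximizing target similarity while one of the other components collapses. An arithmetic mean would reward the lopsided candidate far more.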

Key takeaways:

  • KVoiceWalk uses a random walk algorithm and a hybrid scoring method to create new Kokoro voice style tensors that clone target voices.
  • The scoring combines Resemblyzer similarity, feature extraction, and self-similarity to bring voice tensors closer to the target audio.
  • Interpolation and the random walk both improve voice similarity, with significant improvements observed after 10,000 steps.
  • The scoring function is critical, using a harmonic mean to balance target similarity, self-similarity, and feature similarity, preventing overfitting and maintaining audio quality.
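
The interpolated start mentioned in the summary could look roughly like the sketch below. This is an assumption about the general idea, not the project's implementation: the linear blending, the fixed blend ratios, and the names `lerp`, `interpolated_start`, and `toy_score` are all hypothetical, and short lists stand in for voice style tensors.

```python
from itertools import combinations

def lerp(a, b, t):
    """Linear blend of two equal-length vectors; t=0 gives a, t=1 gives b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def interpolated_start(voices, score_fn, alphas=(0.25, 0.5, 0.75), top_k=3):
    """Score the existing voices plus pairwise blends and return the
    top_k highest-scoring vectors as a starting population."""
    candidates = [list(v) for v in voices]
    for a, b in combinations(voices, 2):
        for t in alphas:
            candidates.append(lerp(a, b, t))
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:top_k]

# Hypothetical stand-in score: closer to TARGET is better.
TARGET = [0.5, 0.5]

def toy_score(vec):
    return -sum((a - b) ** 2 for a, b in zip(vec, TARGET))

voices = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
population = interpolated_start(voices, toy_score)
print(population[0])
```

In this toy setup the best starting vector is a blend of two existing voices, not any single one of them, which is the motivation for searching interpolations before starting the random walk.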