Listening with LLM

Jan 14, 2024 - paul.mou.dev
The article details the author's journey of fine-tuning Large Language Models (LLMs) to process audio, with the ultimate goal of building a model that can describe human voices. Rather than relying on third-party libraries, the author builds the training functions from scratch in PyTorch, drawing on two papers that give LLMs audio understanding by using an audio encoder to transform sound into embeddings. The setup pairs the Mistral OpenOrca model with OpenAI's Whisper as the audio encoder, trained on the MusicCaps dataset. The article also covers debugging, adapting the Whisper model, defining the loss function, and training the model.
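The core architectural idea from those papers can be sketched in a few lines of PyTorch: project the audio encoder's frame features into the LLM's embedding space and prepend them to the text prompt as a soft prefix. The module name and dimensions below are illustrative assumptions, not the author's actual code.

```python
import torch
import torch.nn as nn

# Assumed dimensions: a Whisper-style encoder emitting 768-d frame
# features, and a Mistral-style LLM with 4096-d token embeddings.
AUDIO_DIM, LLM_DIM = 768, 4096

class AudioProjector(nn.Module):
    """Maps audio-encoder frames into the LLM's embedding space so they
    can be concatenated with text-token embeddings as a soft prompt."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_frames, audio_dim)
        return self.proj(audio_feats)

# Prepend projected audio frames to the text prompt embeddings.
batch, n_frames, n_text = 2, 50, 10
audio_feats = torch.randn(batch, n_frames, AUDIO_DIM)
text_embeds = torch.randn(batch, n_text, LLM_DIM)

projector = AudioProjector(AUDIO_DIM, LLM_DIM)
inputs_embeds = torch.cat([projector(audio_feats), text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 60, 4096])
```

The concatenated `inputs_embeds` tensor is what would be fed to the LLM in place of ordinary token embeddings, letting the frozen language model "read" the audio as if it were extra prompt tokens.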

The author successfully trains the model to describe a given audio file by training only the TunableWhisperAudioEncoder, reaching a loss of approximately 0.46. Next steps are to scale up training by incorporating more audio tasks and fine-tuning the LLM itself, with the aim of replicating the "emergent" behaviors described in the referenced papers, such as identifying a speaker's age or gender without explicit training on those tasks. The author credits Karpathy's lectures for much of the underlying learning.
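Training only the audio side while the LLM stays frozen amounts to disabling gradients on the language model and handing just the encoder's parameters to the optimizer. The tiny stand-in modules below are assumptions used to keep the sketch self-contained; they are not the author's actual model.

```python
import torch
import torch.nn as nn

# Stand-ins (assumed, for illustration): a frozen "LLM" head and a
# tunable audio encoder playing the role of TunableWhisperAudioEncoder.
llm = nn.Linear(16, 100)          # frozen language-model component
audio_encoder = nn.Linear(8, 16)  # the only trainable component

for p in llm.parameters():
    p.requires_grad = False       # freeze the LLM

# Optimizer sees only the encoder's parameters.
optimizer = torch.optim.AdamW(audio_encoder.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()   # next-token prediction over captions

audio = torch.randn(4, 8)                  # dummy audio features
targets = torch.randint(0, 100, (4,))      # dummy caption-token targets

logits = llm(audio_encoder(audio))
loss = loss_fn(logits, targets)
loss.backward()                   # gradients flow only into the encoder
optimizer.step()

trainable = sum(p.numel() for p in audio_encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in llm.parameters() if p.requires_grad)
print(trainable, frozen)  # the LLM contributes zero trainable parameters
```

The same pattern scales to the real setup: freeze Mistral, train the Whisper-derived encoder (and any projection layer), and watch the caption cross-entropy fall.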

Key takeaways:

  • The author is working on fine-tuning Large Language Models (LLMs) to process audio, with the goal of building a model capable of describing human voices.
  • The author used the SALMONN and Qwen-Audio papers as a guide for giving LLMs audio-understanding capabilities, leveraging an audio encoder to transform sound into embeddings that are then fed into the LLM.
  • The author built a model combining the Mistral LLM with a tunable Whisper audio encoder from OpenAI and trained it to describe a given audio file; training succeeded, with the model generating a fairly accurate description of a K-pop song.
  • Next steps include scaling up training by incorporating more audio tasks and applying fine-tuning to the LLM, with the aim of replicating "emergent" behaviors described in the referenced papers.