The author successfully trains the model to describe a given audio file by training only the TunableWhisperAudioEncoder, keeping the LLM itself frozen, and reaches a loss of approximately 0.46. The author plans to scale up training by incorporating more audio tasks and fine-tuning the LLM as well. The ultimate goal is to replicate the "emergent" behaviors described in the referenced papers, such as identifying a speaker's age or gender without explicit training on those tasks. The author credits Karpathy's lectures for much of what they learned.
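Concretely, this selective training amounts to freezing the LLM's parameters and leaving only the encoder trainable. Below is a minimal PyTorch sketch of that setup; `TunableWhisperAudioEncoder` is named in the post, but the wrapper class, attribute names, and optimizer settings are illustrative assumptions, not the author's actual code.

```python
import torch
import torch.nn as nn

# Minimal sketch of training only the audio encoder while the LLM stays
# frozen. The AudioLLM wrapper and its attribute names are assumptions.

class AudioLLM(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g. TunableWhisperAudioEncoder
        self.llm = llm                      # e.g. a Mistral backbone

def freeze_all_but_encoder(model: AudioLLM) -> None:
    for p in model.parameters():            # freeze everything first...
        p.requires_grad = False
    for p in model.audio_encoder.parameters():
        p.requires_grad = True               # ...then unfreeze the encoder

# The optimizer is then handed only the trainable (encoder) parameters:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```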
Key takeaways:
- The author is working on fine-tuning Large Language Models (LLMs) to process audio, with the goal of building a model capable of describing human voices.
- The author used the SALMONN and Qwen-Audio papers as a guide for giving LLMs audio-understanding capabilities: an audio encoder transforms sound into embeddings that are then fed into the LLM (see the sketch after this list).
- The author built a model combining the Mistral LLM with a tunable version of OpenAI's Whisper audio encoder and trained it to describe a given audio file. The training was successful: the model generated a fairly accurate description of a K-pop song.
- Next steps include scaling up training with more audio tasks and fine-tuning the LLM itself, aiming to replicate the "emergent" behaviors described in the referenced papers.
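As a rough illustration of the encoder-to-LLM bridge mentioned above, here is a hypothetical PyTorch sketch in the style of SALMONN / Qwen-Audio: encoder features are projected into the LLM's embedding space and concatenated with the text-prompt embeddings. The dimensions (1280 for a Whisper-large encoder, 4096 for Mistral-7B) and all module names are assumptions, not the author's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical bridge between an audio encoder and an LLM: project audio
# features into the LLM's hidden width, then prepend them to the prompt.

class AudioToLLMBridge(nn.Module):
    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # Linear projection from encoder width to LLM embedding width.
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, audio_frames, encoder_dim)
        # prompt_embeds:  (batch, prompt_tokens, llm_dim)
        audio_embeds = self.proj(audio_features)
        # The LLM consumes the concatenated sequence exactly as it would
        # ordinary token embeddings, which is what lets it "hear" audio.
        return torch.cat([audio_embeds, prompt_embeds], dim=1)
```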