The discussion also covers the possibility of packaging these models with existing games so they run locally, which would eliminate per-request inference costs. Some participants share their experience of running models such as Mistral 7B and Llama 3 locally with decent results, and others suggest using smaller models that can run reliably and quickly on consumer hardware. They note, however, that this would require testing and could run into compatibility issues across GPU vendors and software versions.
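As a rough illustration of the local-packaging idea, the sketch below uses llama-cpp-python to load a quantized Mistral 7B file bundled with a game's assets and generate a short NPC reply on the CPU. The model path, prompt, and generation settings are placeholders chosen for the example, not anything specified in the discussion.

```python
# Minimal sketch: local, CPU-friendly inference with llama-cpp-python.
# Assumes a quantized GGUF build of Mistral 7B is shipped alongside the game;
# the path and generation settings below are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="assets/models/mistral-7b-instruct-q4_k_m.gguf",  # hypothetical bundled file
    n_ctx=2048,        # context window; larger values cost more RAM
    n_gpu_layers=0,    # 0 = pure CPU; raise (or use -1) if a compatible GPU is present
    verbose=False,
)

def npc_reply(player_line: str) -> str:
    """Generate a short in-character response to the player's line."""
    prompt = (
        "You are a terse blacksmith in a fantasy village.\n"
        f"Player: {player_line}\n"
        "Blacksmith:"
    )
    out = llm(prompt, max_tokens=64, temperature=0.7, stop=["Player:", "\n\n"])
    return out["choices"][0]["text"].strip()

if __name__ == "__main__":
    print(npc_reply("Can you repair my sword before nightfall?"))
```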
Key takeaways:
- Running large language models (LLMs) at the edge is possible on most hardware, but latency and throughput often fall short of expectations, especially without a GPU.
- Centralizing inference off-device in a distributed cloud environment remains the most viable option for LLMs; a local-first, cloud-fallback sketch follows this list.
- Models like Llama 3 8B and Mistral 7B can run locally for many users and can be fine-tuned for specific use cases.
- Packaging these models with existing games to run locally could eliminate per-request inference costs, but feasibility and performance would need to be tested.
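The first two takeaways suggest a routing decision rather than an either/or choice. The sketch below is one hedged way to express it: try the bundled model first and fall back to a hosted, OpenAI-compatible endpoint when the local model is missing or too slow. `CLOUD_BASE_URL`, the model names, and the two-second latency budget are assumptions made for illustration; nothing in the discussion prescribes them.

```python
# Local-first generation with a cloud fallback: a sketch under stated assumptions,
# not a recommendation from the thread. Assumes llama-cpp-python and the openai
# client are installed, a GGUF model may or may not be present on disk, and
# CLOUD_BASE_URL points at some hosted, OpenAI-compatible endpoint (hypothetical).
import os
import time

CLOUD_BASE_URL = os.environ.get("CLOUD_BASE_URL", "https://example.invalid/v1")  # placeholder
LOCAL_MODEL_PATH = "assets/models/mistral-7b-instruct-q4_k_m.gguf"               # placeholder
LATENCY_BUDGET_S = 2.0  # if local generation is slower than this, prefer the cloud next time

def _load_local():
    """Try to load the bundled model; return None if that isn't possible."""
    try:
        from llama_cpp import Llama
        return Llama(model_path=LOCAL_MODEL_PATH, n_ctx=2048, verbose=False)
    except Exception:
        return None

_local = _load_local()
_prefer_cloud = _local is None

def generate(prompt: str) -> str:
    global _prefer_cloud
    if not _prefer_cloud:
        start = time.monotonic()
        out = _local(prompt, max_tokens=64)
        if time.monotonic() - start > LATENCY_BUDGET_S:
            _prefer_cloud = True  # too slow on this hardware; route future calls off-device
        return out["choices"][0]["text"].strip()

    # Off-device path: any OpenAI-compatible hosted endpoint (assumed for illustration).
    from openai import OpenAI
    client = OpenAI(base_url=CLOUD_BASE_URL, api_key=os.environ.get("CLOUD_API_KEY", ""))
    resp = client.completions.create(model="mistral-7b-instruct", prompt=prompt, max_tokens=64)
    return resp.choices[0].text.strip()
```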