Last week, the app added support for the bilingual Yi-34B Chat model, which uses around 18GB of RAM. Users with iOS devices or low-memory Macs can download the related Yi-6B Chat model instead. Unlike other offline LLM apps, Private LLM uses mlc-llm for inference rather than llama.cpp, and all models in the app are quantized with OmniQuant rather than RTN quantization. The app has a small community of users on Discord who are building and sharing LLM-based shortcuts.
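For context on the quantization point, here is a minimal sketch of what plain 4-bit round-to-nearest (RTN) quantization does to one group of weights; this is the baseline that OmniQuant improves on by learning clipping and scaling parameters per layer. All names below are hypothetical assumptions for illustration, not code from Private LLM, mlc-llm, or OmniQuant.

```swift
import Foundation

// Illustrative 4-bit round-to-nearest (RTN) quantization of one weight group.
// OmniQuant differs by learning clipping thresholds and scales instead of
// using the raw min/max range as done here. Hypothetical names throughout.
func rtnQuantize4Bit(_ weights: [Float]) -> (codes: [UInt8], scale: Float, zeroPoint: Float) {
    let minW = weights.min() ?? 0
    let maxW = weights.max() ?? 0
    // 4 bits give 16 levels (0...15); the scale maps the float range onto them.
    let scale = (maxW - minW) / 15
    let codes = weights.map { w -> UInt8 in
        let q = ((w - minW) / scale).rounded()   // round to nearest level
        return UInt8(min(max(q, 0), 15))
    }
    return (codes, scale, minW)
}

func rtnDequantize(_ codes: [UInt8], scale: Float, zeroPoint: Float) -> [Float] {
    codes.map { Float($0) * scale + zeroPoint }
}

// A single outlier stretches the range and coarsens every other weight,
// which is the kind of error learned-quantization schemes try to reduce.
let group: [Float] = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.90, -0.03]
let (codes, scale, zero) = rtnQuantize4Bit(group)
let restored = rtnDequantize(codes, scale: scale, zeroPoint: zero)
print(codes, restored)
```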
Key takeaways:
- The app Private LLM has been updated with support for the 4-bit OmniQuant quantized Mixtral 8x7B Instruct model, which outperforms Q4 models in inference speed and Q8 models in text generation quality.
- The app now includes more downloadable models, support for App Intents (Siri and Apple Shortcuts; see the sketch after this list), on-device grammar correction and summarization via macOS Services, and an iOS version.
- The app recently added support for the bilingual Yi-34B Chat model, along with a related Yi-6B Chat model for iOS users and users with low-memory Macs.
- Unlike most popular offline LLM apps, this app uses mlc-llm for inference rather than llama.cpp, and all models in the app are quantized with OmniQuant rather than RTN quantization.
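As a rough illustration of the App Intents support mentioned above, the sketch below shows what an intent exposed to Siri and Shortcuts could look like. The intent name, parameter, and the `LocalLLM` helper are hypothetical stand-ins, not Private LLM's actual implementation.

```swift
import AppIntents

// Hypothetical stand-in for the app's on-device inference engine.
actor LocalLLM {
    static let shared = LocalLLM()
    func generateText(prompt: String) async throws -> String {
        // Real on-device inference (e.g. via mlc-llm) would happen here;
        // the sketch just echoes the prompt back.
        "Echo: \(prompt)"
    }
}

// A minimal App Intents action that Shortcuts or Siri could invoke.
// Name, parameter, and result shape are illustrative assumptions.
struct AskLocalModelIntent: AppIntent {
    static var title: LocalizedStringResource = "Ask Local Model"
    static var description = IntentDescription("Runs a prompt through an on-device LLM and returns the reply.")

    @Parameter(title: "Prompt")
    var prompt: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        let reply = try await LocalLLM.shared.generateText(prompt: prompt)
        return .result(value: reply)
    }
}
```

An intent like this is what lets users chain on-device text generation with other actions in the Shortcuts app, which is how the LLM-based shortcuts shared on the app's Discord are built.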