Last week, the app added support for the bilingual Yi-34B Chat model, which uses around 18GB of RAM. Users with iOS devices or low-memory Macs can download the related Yi-6B Chat model instead. Unlike other offline LLM apps, Private LLM uses mlc-llm for inference rather than llama.cpp, and all models in the app are quantized with OmniQuant rather than RTN quantization. The app has a small community of users on Discord who are building and sharing LLM-based shortcuts.
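For context on the quantization point, here is a minimal sketch of what plain 4-bit round-to-nearest (RTN) quantization does to one group of weights; this is the baseline that OmniQuant improves on by learning clipping and scaling parameters per layer. All names below are hypothetical assumptions for illustration, not code from Private LLM, mlc-llm, or OmniQuant.

```swift
import Foundation

// Illustrative 4-bit round-to-nearest (RTN) quantization of one weight group.
// OmniQuant differs by learning clipping thresholds and scales instead of
// using the raw min/max range as done here. Hypothetical names throughout.
func rtnQuantize4Bit(_ weights: [Float]) -> (codes: [UInt8], scale: Float, zeroPoint: Float) {
    let minW = weights.min() ?? 0
    let maxW = weights.max() ?? 0
    // 4 bits give 16 levels (0...15); the scale maps the float range onto them.
    let scale = (maxW - minW) / 15
    let codes = weights.map { w -> UInt8 in
        let q = ((w - minW) / scale).rounded()   // round to nearest level
        return UInt8(min(max(q, 0), 15))
    }
    return (codes, scale, minW)
}

func rtnDequantize(_ codes: [UInt8], scale: Float, zeroPoint: Float) -> [Float] {
    codes.map { Float($0) * scale + zeroPoint }
}

// A single outlier stretches the range and coarsens every other weight,
// which is the kind of error learned-quantization schemes try to reduce.
let group: [Float] = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.90, -0.03]
let (codes, scale, zero) = rtnQuantize4Bit(group)
let restored = rtnDequantize(codes, scale: scale, zeroPoint: zero)
print(codes, restored)
```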
Key takeaways:
- The app Private LLM has been updated with support for the 4-bit OmniQuant quantized Mixtral 8x7B Instruct model, which outperforms Q4 models in inference speed and Q8 models in text generation quality.
- The app now includes more downloadable models, support for App Intents (Siri and Apple Shortcuts; see the sketch after this list), on-device grammar correction and summarization via macOS Services, and an iOS version.
- The app recently added support for the bilingual Yi-34B Chat model, along with a related Yi-6B Chat model for iOS users and users with low-memory Macs.
- Unlike most popular offline LLM apps, this app uses mlc-llm for inference rather than llama.cpp, and all models in the app are quantized with OmniQuant rather than RTN quantization.
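As a rough illustration of the App Intents support mentioned above, the sketch below shows what an intent exposed to Siri and Shortcuts could look like. The intent name, parameter, and the `LocalLLM` helper are hypothetical stand-ins, not Private LLM's actual implementation.

```swift
import AppIntents

// Hypothetical stand-in for the app's on-device inference engine.
actor LocalLLM {
    static let shared = LocalLLM()
    func generateText(prompt: String) async throws -> String {
        // Real on-device inference (e.g. via mlc-llm) would happen here;
        // the sketch just echoes the prompt back.
        "Echo: \(prompt)"
    }
}

// A minimal App Intents action that Shortcuts or Siri could invoke.
// Name, parameter, and result shape are illustrative assumptions.
struct AskLocalModelIntent: AppIntent {
    static var title: LocalizedStringResource = "Ask Local Model"
    static var description = IntentDescription("Runs a prompt through an on-device LLM and returns the reply.")

    @Parameter(title: "Prompt")
    var prompt: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        let reply = try await LocalLLM.shared.generateText(prompt: prompt)
        return .result(value: reply)
    }
}
```

An intent like this is what lets users chain on-device text generation with other actions in the Shortcuts app, which is how the LLM-based shortcuts shared on the app's Discord are built.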