The documentation also covers project setup, including downloading pure `Q4_0` and `Q8_0` quantized .gguf files, with optional steps for manually quantizing to pure `Q4_0`. The project requires Java 21+ and includes build and run instructions, plus an optional Makefile for manual builds. It also reports performance notes and results on different systems, and states that the project is licensed under the MIT license.
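Every .gguf file begins with a small fixed header, so a parser can cheaply verify it has been handed a valid model file before reading tensors. The sketch below checks the 4-byte `GGUF` magic per the GGUF specification; it is an illustrative fragment, not Llama3.java's actual parser code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class GgufHeader {
    // The ASCII bytes 'G','G','U','F' read as a little-endian 32-bit int.
    static final int GGUF_MAGIC = 0x46554747;

    // Returns true if the buffer starts with the GGUF magic number.
    public static boolean hasGgufMagic(byte[] fileStart) {
        if (fileStart.length < 4) return false;
        int magic = ByteBuffer.wrap(fileStart, 0, 4)
                              .order(ByteOrder.LITTLE_ENDIAN)
                              .getInt();
        return magic == GGUF_MAGIC;
    }

    public static void main(String[] args) {
        byte[] header = {'G', 'G', 'U', 'F', 3, 0, 0, 0}; // magic + version
        System.out.println(hasGgufMagic(header)); // prints "true"
    }
}
```

GGUF stores all header fields little-endian, so the byte order must be set explicitly; Java's `ByteBuffer` defaults to big-endian.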
Key takeaways:
- Llama3.java is a project that implements practical Llama 3 inference in a single Java file; it is used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.
- Its features include a GGUF format parser, a Llama 3 tokenizer based on minbpe, Llama 3 inference with Grouped-Query Attention, and support for `Q8_0` and `Q4_0` quantization.
- The project requires Java 21+ and can be run directly with jbang, or built and run manually via the provided Makefile.
- Performance tests show that Llama3.java achieves slightly lower tokens/s than llama.cpp, but it remains a viable pure-Java option for Llama 3 inference.
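The `Q8_0` quantization mentioned above follows a simple block scheme in GGUF: each block of 32 floats stores one scale factor plus 32 signed bytes. The sketch below shows the standard ggml arithmetic (scale = max|x| / 127, value ≈ quantized byte × scale); it is a minimal illustration of the format, not Llama3.java's actual implementation, and the in-memory layout is simplified.

```java
public class Q8_0 {
    static final int BLOCK = 32; // Q8_0 quantizes 32 floats per block

    // Quantize one block: writes the per-block scale into scaleOut[0]
    // and returns the 32 signed-byte quantized values.
    static byte[] quantize(float[] x, float[] scaleOut) {
        float amax = 0f;
        for (float v : x) amax = Math.max(amax, Math.abs(v));
        float scale = amax / 127f;
        scaleOut[0] = scale;
        byte[] q = new byte[BLOCK];
        for (int i = 0; i < BLOCK; i++) {
            q[i] = (byte) (scale == 0f ? 0 : Math.round(x[i] / scale));
        }
        return q;
    }

    // Dequantize one block back to floats: value = q * scale.
    static float[] dequantize(byte[] q, float scale) {
        float[] out = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) out[i] = q[i] * scale;
        return out;
    }

    public static void main(String[] args) {
        float[] x = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = (i - 16) * 0.5f;
        float[] scale = new float[1];
        float[] y = dequantize(quantize(x, scale), scale[0]);
        // Round-trip error is bounded by half the scale step.
        System.out.println(Math.abs(x[0] - y[0]) <= scale[0] / 2 + 1e-6f);
    }
}
```

`Q4_0` works the same way but packs two 4-bit values per byte, trading more reconstruction error for half the memory, which is why the README offers both variants.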