LLM4Decompile comprises models ranging from 1.3 billion to 33 billion parameters. The models are available on Hugging Face and have been trained under several configurations, including one without prior knowledge of the optimization levels used at compile time. The article also outlines future plans: a larger pre-training corpus of assembly and C code, and support for more languages, platforms, and settings.
Key takeaways:
- LLM4Decompile is an open-source Large Language Model (LLM) dedicated to decompiling binary code. It is trained on a dataset of assembly–source pairs compiled from a million C code samples.
- The effectiveness of the decompilation process is validated through two key metrics: re-compilability and re-executability. These assess the syntactic integrity and the semantic correctness of the decompiled code, respectively.
- The LLM4Decompile models (1.3 billion to 33 billion parameters) are available on Hugging Face. They have been trained under different configurations, including without prior knowledge of the optimization levels.
- The project also includes Decompile-Eval, a decompilation benchmark that assesses the re-compilability and re-executability of the decompiled code. The benchmark data is stored as a JSON list, and evaluation can run on a single GPU in a single process or across multiple GPUs with multiple processes.
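The assembly–source pairs mentioned above are typically produced by compiling each C sample and pairing the resulting assembly with its original source. A minimal sketch of that idea, assuming `gcc` is available and using an illustrative prompt template (the project's actual pipeline and prompt format may differ):

```python
import os
import shutil
import subprocess
import tempfile


def make_pair(c_source: str, opt_level: str = "O0"):
    """Compile a C snippet to assembly and return an (assembly, source) pair.

    Hypothetical sketch: the real pipeline may disassemble object files
    instead of using `gcc -S`, and may strip directives or comments.
    Returns None if no C compiler is on PATH.
    """
    if shutil.which("gcc") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        c_path = os.path.join(tmp, "sample.c")
        s_path = os.path.join(tmp, "sample.s")
        with open(c_path, "w") as f:
            f.write(c_source)
        subprocess.run(
            ["gcc", f"-{opt_level}", "-S", c_path, "-o", s_path],
            check=True,
        )
        with open(s_path) as f:
            return f.read(), c_source


def build_prompt(asm: str) -> str:
    """Wrap assembly in an instruction-style prompt for the model.

    The exact template is an assumption, not the model's documented format.
    """
    return f"# This is the assembly code:\n{asm}\n# What is the source code?\n"
```

Varying `opt_level` (`O0`–`O3`) is what lets a configuration be trained with or without knowledge of the optimization level, as the takeaways note.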
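The two metrics can be scored as simple pass rates over the benchmark: a sample counts toward re-compilability if the decompiled C compiles, and toward re-executability if the recompiled binary passes its test assertions. A hedged sketch of the bookkeeping (the record fields are illustrative, not the project's data structures):

```python
from dataclasses import dataclass


@dataclass
class Result:
    compiled: bool  # did the decompiled source compile? (syntactic integrity)
    executed: bool  # did the recompiled binary pass its tests? (semantic correctness)


def score(results: list[Result]) -> tuple[float, float]:
    """Return (re-compilability rate, re-executability rate) over all samples."""
    n = len(results)
    recompilable = sum(r.compiled for r in results) / n
    reexecutable = sum(r.executed for r in results) / n
    return recompilable, reexecutable
```

Note that re-executability implies re-compilability for a given sample, so the second rate can never exceed the first.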
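Because Decompile-Eval stores its data as a JSON list, loading and iterating the benchmark is straightforward. A minimal sketch; the field names below (`task_id`, `input_asm`, `c_test`) are illustrative guesses, not the benchmark's documented schema:

```python
import json
import os
import tempfile


def load_benchmark(path: str) -> list[dict]:
    """Read a Decompile-Eval-style file: a JSON list with one dict per task."""
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of task records")
    return data


# Tiny self-contained demo with made-up records:
sample = [
    {"task_id": 0, "input_asm": "...", "c_test": "..."},
    {"task_id": 1, "input_asm": "...", "c_test": "..."},
]
fd, tmp_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump(sample, f)
tasks = load_benchmark(tmp_path)
os.unlink(tmp_path)
```

Each record can then be dispatched to one worker per GPU for the multi-process evaluation mode the takeaways mention.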