The authors provide a basic implementation of this method and suggest that an improved strategy should also respect sentence or paragraph boundaries. They acknowledge that the underlying assumption, namely that chunks of similar size hold comparable amounts of information, may not hold for domain-specific documents. The proposed method could also benefit other LLM tasks, such as retrieval-augmented question answering. The authors invite readers to connect with them to improve the chunking, summarization, or retrieval steps in their domain-specific LLM pipelines.
Key takeaways:
- The common strategy for summarizing large documents with Large Language Models (LLMs) is to divide the document into chunks based on the LLM context window and the prompt size, summarize each chunk independently, and then combine these partial summaries into a global summary. However, this strategy can lead to biased summaries when the chunks are of unequal size, because each chunk summary contributes roughly equally to the global summary regardless of how many tokens it covers, over-representing the content of smaller chunks.
- To ensure all chunks are of similar size, the authors propose a method that determines the optimal chunk size automatically. It works by deciding a maximum chunk size based on the LLM context window and the prompt size, computing the resulting average chunk size, and then redistributing tokens from the preceding chunks to the last chunk until the last chunk reaches the average chunk size (see the sketch after this list).
- This approach results in chunk sizes that differ by at most one token, which is optimal in practice, while still using the smallest possible number of chunks. A basic implementation of this method is provided in the article.
- This method assumes that chunks of similar size hold comparable amounts of information, which might not hold for domain-specific documents. The authors therefore suggest that an improved chunking strategy should also consider sentence or paragraph boundaries, and they plan to discuss this in an upcoming blog post.
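For readers who want to experiment with the idea, below is a minimal sketch of the balanced-splitting step described above, assuming the document has already been tokenized into a flat list of tokens. The function names (`balanced_chunk_sizes`, `chunk_tokens`) and their interfaces are illustrative and not taken from the authors' implementation.

```python
import math

def balanced_chunk_sizes(n_tokens: int, max_chunk_size: int) -> list[int]:
    """Split n_tokens into the fewest chunks that fit max_chunk_size,
    with all chunk sizes differing by at most one token."""
    n_chunks = math.ceil(n_tokens / max_chunk_size)  # smallest possible number of chunks
    base, remainder = divmod(n_tokens, n_chunks)     # average size and leftover tokens
    # The first `remainder` chunks each take one extra token.
    return [base + 1 if i < remainder else base for i in range(n_chunks)]

def chunk_tokens(tokens: list, max_chunk_size: int) -> list[list]:
    """Cut a token sequence into balanced, contiguous chunks."""
    sizes = balanced_chunk_sizes(len(tokens), max_chunk_size)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(tokens[start:start + size])
        start += size
    return chunks
```

For example, with a 10,000-token document and a 3,000-token maximum chunk size, a naive split yields chunks of 3,000, 3,000, 3,000, and 1,000 tokens, whereas `balanced_chunk_sizes(10_000, 3_000)` returns four chunks of 2,500 tokens each.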