GitHub - databonsai/databonsai: clean & curate your data with LLMs.

databonsai is a Python library that uses large language models (LLMs) to perform data cleaning tasks. It offers a suite of LLM-based data processing tools, including categorization, transformation, and extraction. It also validates LLM outputs, batches requests to save tokens, and retries with exponential backoff to handle rate limits and transient errors. Users can store their API keys in a .env file in the root of their project or pass them as arguments when initializing the provider.
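To make the retry and validation ideas concrete, here is a minimal, generic sketch in plain Python. It is not databonsai's implementation or API: `TransientError`, `retry_with_backoff`, `validate_category`, and `flaky_llm_call` are hypothetical names invented for illustration, and the delays are recorded in a list rather than slept so the example runs instantly.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or temporary network failure."""


def retry_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


def validate_category(response, categories):
    """Accept the response only if it names one of the allowed categories."""
    label = response.strip()
    if label not in categories:
        raise ValueError(f"unexpected category: {label!r}")
    return label


# Simulated LLM call (an assumption for this demo): fails twice, then answers.
calls = {"count": 0}


def flaky_llm_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return " Weather "


delays = []  # record the waits instead of actually sleeping
raw = retry_with_backoff(flaky_llm_call, sleep=delays.append)
label = validate_category(raw, {"Weather", "Sports"})
```

The waits grow from about one second to about two seconds between attempts, and the final answer is accepted only because it matches one of the allowed categories. A real pipeline would typically treat a failed validation as another reason to retry.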

The quick start guide walks through categorization: setting up the LLM provider and defining the categories. For larger datasets, a feature called AutoBatch handles batching adaptively. Other features include a progress bar, the ability to return the last successful index so users can resume from there, and retry logic for handling invalid responses. The library also records token usage for OpenAI and Anthropic, which helps users estimate their costs. Supported LLM providers include OpenAIProvider, AnthropicProvider, and OllamaProvider.
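The idea of resuming from the last successful index can be sketched as follows. This is an illustrative assumption, not databonsai's AutoBatch: it uses a fixed batch size rather than adaptive batching, and `process_in_batches` and `fake_llm_batch` are hypothetical names.

```python
def process_in_batches(items, handle_batch, batch_size=3, start_index=0):
    """Process items[start_index:] in fixed-size batches.

    Returns (results, next_index, error). On failure, next_index is the
    first unprocessed item, so a later call can resume from there.
    """
    results = []
    index = start_index
    while index < len(items):
        batch = items[index:index + batch_size]
        try:
            outputs = handle_batch(batch)
            # Validate the shape of the response: one output per input.
            if len(outputs) != len(batch):
                raise ValueError("batch returned the wrong number of outputs")
        except Exception as exc:
            return results, index, exc
        results.extend(outputs)
        index += len(batch)
    return results, index, None


# Simulated batch call (an assumption for this demo): the second call fails.
attempts = {"count": 0}


def fake_llm_batch(batch):
    attempts["count"] += 1
    if attempts["count"] == 2:
        raise RuntimeError("simulated API failure")
    return [text.upper() for text in batch]


items = ["a", "b", "c", "d", "e"]
first, resume_at, error = process_in_batches(items, fake_llm_batch, batch_size=2)
# Resume from the first unprocessed item instead of starting over.
rest, end_index, second_error = process_in_batches(
    items, fake_llm_batch, batch_size=2, start_index=resume_at
)
all_results = first + rest
```

Sending several items per request is what saves tokens, because the instructions are sent once per batch rather than once per item. Returning the resume index alongside the partial results means a failure partway through a large dataset does not discard the work already done.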

Key takeaways:

databonsai is a Python library that uses LLMs to perform data cleaning tasks such as categorization, transformation, and extraction.
It validates LLM outputs, batches requests to save tokens, and retries with exponential backoff to handle rate limits and transient errors.
It supports several LLM providers, including OpenAIProvider, AnthropicProvider, and OllamaProvider, and more are expected to be added soon.
It also offers AutoBatch for larger datasets, a progress bar, retry logic for API-related errors, and token usage recording for cost estimation.
