GitHub - devflowinc/firecrawl-simple: ➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed.

Nov 06, 2024 - github.com
Firecrawl Simple is a streamlined version of Firecrawl, optimized for self-hosting and easy contribution. It removes billing logic and AI features, and replaces `playwright` with `puppeteer-cluster` and `puppeteer-extra`'s stealth plugins, eliminating the need for `fire-engine` and `scrapingbee` for guarded pages. It supports only the v1 `/scrape`, `/crawl/{id}`, and `/crawl` routes, and removes several dependencies from the package.json. The project is open for contributions and offers paid part-time maintainer positions.

The article also provides detailed instructions for self-hosting Firecrawl Simple with docker-compose. It describes the architecture as a three-step pipeline: the `crawl` endpoint starts at a URL and fetches the sitemap or HTML for that page; URLs found in the sitemap or HTML are added to the redis queue; and workers pick those URLs off the queue and fetch their HTML through the `/scrape` endpoint on the `playwright-service`. It also outlines potential scaling bottlenecks and gives usage examples for the `crawl` and `scrape` endpoints.
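The queue-and-worker flow described above can be illustrated with a minimal in-memory sketch. This is not the project's code: the `deque` stands in for the redis queue, `fake_scrape` and `FAKE_SITE` are hypothetical stand-ins for the scrape call and the target site, and the `seen` set and the re-queuing of links found on scraped pages are assumptions about how duplicates and discovery are handled.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical site: maps each URL to the links found in its sitemap or HTML.
FAKE_SITE: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}


def fake_scrape(url: str) -> List[str]:
    """Stand-in for the scrape step: return the links found on the page."""
    return FAKE_SITE.get(url, [])


def crawl(
    start_url: str,
    scrape: Callable[[str], List[str]] = fake_scrape,
    limit: int = 100,
) -> List[str]:
    """Process URLs breadth-first, the way a single worker would drain the queue."""
    queue = deque([start_url])  # stand-in for the redis queue
    seen = {start_url}          # assumption: each URL is queued at most once
    scraped: List[str] = []

    while queue and len(scraped) < limit:
        url = queue.popleft()   # a worker picks the next URL
        links = scrape(url)     # ...and fetches its HTML
        scraped.append(url)
        for link in links:      # newly discovered URLs go back on the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return scraped


if __name__ == "__main__":
    print(crawl("https://example.com/"))
```

In a real deployment several workers would drain the queue concurrently, which is where the scaling bottlenecks mentioned in the article would come into play.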

Key takeaways:

  • Firecrawl Simple is a stripped-down and stable version of Firecrawl, optimized for self-hosting and ease of contribution. It replaces `playwright` with `puppeteer-cluster` and `puppeteer-extra`'s stealth plugins, eliminating the need for `fire-engine` and `scrapingbee` for guarded pages.
  • The project is actively looking for contributors and offers paid part-time maintainer positions. The maintainers have posted bounties on a couple of issues and are looking for someone who can serve as an active maintainer over the long term.
  • Firecrawl Simple was forked from the original Firecrawl to be ready for self-hosting, easy to contribute to, and scalable on Kubernetes. The closed-source nature of `fire-engine`, Firecrawl's solution for anti-bot pages, was a significant factor in this decision.
  • The project provides detailed instructions for self-hosting the service with docker-compose. It also outlines the service's architecture and scaling concerns, and gives examples of how to use it for crawling and scraping.
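As a rough illustration of the three supported routes, the sketch below builds (but does not send) one request per route using only the Python standard library. Several details are assumptions and should be checked against the project's README: the base URL `http://localhost:3002`, the `/v1` path prefix, the `POST`/`GET` methods, and the JSON body shape `{"url": ...}` are taken from the upstream Firecrawl v1 API and may differ in Firecrawl Simple. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: address of a locally self-hosted instance.
BASE_URL = "http://localhost:3002"


def _post_json(path: str, payload: dict, base_url: str) -> urllib.request.Request:
    """Build a POST request carrying a JSON body."""
    return urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_scrape_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Request the content of a single page (assumed route: POST /v1/scrape)."""
    return _post_json("/v1/scrape", {"url": url}, base_url)


def build_crawl_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Start a crawl from a URL (assumed route: POST /v1/crawl)."""
    return _post_json("/v1/crawl", {"url": url}, base_url)


def build_crawl_status_request(
    crawl_id: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Check on a crawl by id (assumed route: GET /v1/crawl/{id})."""
    return urllib.request.Request(f"{base_url}/v1/crawl/{crawl_id}", method="GET")


if __name__ == "__main__":
    req = build_scrape_request("https://example.com")
    print(req.get_method(), req.full_url)
    # To actually send it against a running instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Keeping request construction separate from sending makes the route and payload assumptions easy to inspect and adjust before pointing the code at a live instance.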