Feature Story
GitHub - clemlesne/scrape-it-now: A website to scrape? There's a simple way.
Aug 17, 2024 · news.bensbites.com
The tool is easy to use: run one job to scrape a website, check that job's status, then run a second job to index the scraped content. For advanced usage, environment variables can be sourced from a `.env` file to simplify CLI configuration. The architecture relies on Azure Queue Storage, Azure Blob Storage, Azure AI Search, and Azure OpenAI Embeddings.
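To illustrate what sourcing configuration from a `.env` file involves, here is a minimal Python sketch. It is not the tool's own loader, which this summary does not describe. The function names `load_env_file` and `apply_env` and the `EXAMPLE_*` variable names are hypothetical, chosen only for illustration.

```python
import os


def load_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from .env-style text (illustrative helper)."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate shell-style "export KEY=VALUE" lines.
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Not a KEY=VALUE line; ignore it.
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def apply_env(values: dict[str, str], environ=os.environ) -> None:
    """Copy parsed values into the environment without overriding existing keys."""
    for key, value in values.items():
        environ.setdefault(key, value)


# Hypothetical variable names, for illustration only.
sample = 'EXAMPLE_URL="https://example.com"\nexport EXAMPLE_DEPTH=2\n'
print(load_env_file(sample))
```

In this sketch, variables already set in the shell take precedence over the file, so a single run can override a value without editing `.env`. This is a design choice of the example, not a documented behavior of the tool.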
Key takeaways
- The 'Scrape It Now!' tool offers a simple way to scrape websites, with features such as avoiding re-scraping unchanged pages, blocking ads to lower network costs, and preserving anonymity.
- The tool uses Azure Queue Storage to decouple its jobs and Azure Blob Storage to store scraped content.
- The tool also offers an Indexer feature that creates an AI Search index automatically, chunks markdown while keeping the content coherent, and makes indexed content semantically searchable with Azure AI Search.
- For advanced usage, users can source environment variables from a `.env` file for easy CLI configuration.
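The summary does not say how the Indexer keeps chunks coherent. One common approach, shown here only as an assumption, is to split Markdown at headings and pack whole sections into chunks up to a size limit. The function name `chunk_markdown` and the `max_chars` parameter are hypothetical.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Group heading-delimited sections into chunks of at most max_chars.

    Sections are never split, so a single section longer than max_chars
    becomes its own oversized chunk. Lines starting with "#" inside code
    fences are treated as headings too; a fuller version would track fences.
    """
    # Step 1: split the document into sections at each heading line.
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Step 2: greedily pack whole sections into chunks.
    chunks: list[str] = []
    buffer = ""
    for section in sections:
        if not section:
            continue
        candidate = f"{buffer}\n\n{section}" if buffer else section
        if len(candidate) <= max_chars:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            buffer = section
    if buffer:
        chunks.append(buffer)
    return chunks


doc = "# A\ntext a\n\n# B\ntext b\n\n# C\ntext c"
for chunk in chunk_markdown(doc, max_chars=25):
    print(repr(chunk))
```

Keeping each heading with its body text means every chunk carries its own context. That generally helps when chunks are embedded and searched independently.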