GitHub - clemlesne/scrape-it-now: A website to scrape? There's a simple way.
Aug 17, 2024 - news.bensbites.com
"Scrape It Now!" is a tool that allows users to scrape websites in a simple way. It features a decoupled architecture with Azure Queue Storage, idempotent operations that can run in parallel, and scraped content storage in Azure Blob Storage. The scraper can avoid re-scraping unchanged pages, block ads to lower network costs, explore pages in depth by detecting and de-duplicating links, extract markdown content from a page, load dynamic JavaScript content, and enhance anonymity with proxies and random user agents. The indexer creates an AI Search index automatically, chunks markdown while keeping the content coherent, embeds chunks with OpenAI embeddings, and makes indexed content semantically searchable with Azure AI Search.
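To make the "avoid re-scraping unchanged pages" and "de-duplicating links" ideas concrete, here is a minimal Python sketch. It is not the project's actual code: the function names, the in-memory `seen` dictionary (standing in for whatever state the tool keeps in storage), and the choice of SHA-256 as the change detector are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urldefrag, urljoin


def content_fingerprint(html: str) -> str:
    """Stable fingerprint of page content (SHA-256 of the UTF-8 bytes)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def needs_rescrape(url: str, html: str, seen: dict[str, str]) -> bool:
    """Return True (and record the fingerprint) if the page is new or changed.

    `seen` is a hypothetical url -> fingerprint map; a real deployment would
    keep this state in durable storage rather than in memory.
    """
    digest = content_fingerprint(html)
    if seen.get(url) == digest:
        return False  # unchanged since the last scrape, safe to skip
    seen[url] = digest
    return True


def dedupe_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve links against the page URL, drop fragments, keep first occurrence."""
    unique: dict[str, None] = {}  # dicts preserve insertion order
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        unique.setdefault(absolute, None)
    return list(unique)
```

Because `needs_rescrape` gives the same answer no matter how often it is called with the same content, a check of this kind fits the idempotent, parallel-friendly design the summary describes.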
The tool is easy to use: run one job to scrape a website, check that job's status, then run a second job to index the scraped content. For advanced usage, environment variables can be sourced from a `.env` file to simplify CLI configuration. The architecture relies on Azure Queue Storage, Azure Blob Storage, Azure AI Search, and Azure OpenAI Embeddings.
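As a rough illustration of what sourcing configuration from a `.env` file involves, the sketch below parses simple `KEY=value` lines with the Python standard library. It is an assumption-laden stand-in, not the tool's own mechanism: the helper names and the `EXAMPLE_*` variable names are hypothetical, and the real variable names expected by the CLI are not given in this summary.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and one layer of surrounding quotes, if any.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def apply_env(values: dict[str, str], environ=os.environ) -> None:
    """Copy parsed values into the environment without overriding existing keys."""
    for key, value in values.items():
        environ.setdefault(key, value)


# Hypothetical variable names, for illustration only.
sample = '# local settings\nEXAMPLE_KEY="abc"\nEXAMPLE_NAME = demo\n'
print(parse_env(sample))
```

Using `setdefault` means values already exported in the shell take precedence over the file, which is a common convention but still a design choice of this sketch.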
Key takeaways:
"Scrape It Now!" offers a simple way to scrape websites, with features such as skipping unchanged pages, blocking ads to lower network costs, and improving anonymity with proxies and random user agents.
The tool uses Azure Queue Storage to decouple its components and Azure Blob Storage to store scraped content.
The tool also offers an Indexer feature that creates an AI Search index automatically, chunks markdown while keeping the content coherent, and makes indexed content semantically searchable with Azure AI Search.
For advanced usage, users can source environment variables from a `.env` file for easy CLI configuration.