
Advanced AI Web Scraper πŸ•·οΈ

June 15, 2024

πŸš€ An intelligent web scraping application that combines automated web scraping with AI-powered data extraction.

πŸ’» Source Code β€’ 🌐 Demo
Advanced AI Web Scraper: Intelligent Data Extraction

πŸ“Œ Abstract

Advanced AI Web Scraper is an intelligent web scraping tool built with Streamlit. It uses Selenium for web scraping and Groq LLM for AI-powered data parsing, all behind a modern, user-friendly interface. The application efficiently extracts structured data from websites while handling CAPTCHAs, dynamic content, and unwanted elements.

🌟 Features

  • Intelligent Scraping: Automatically handles CAPTCHAs and dynamic content.
  • AI-Powered Data Extraction: Uses Groq LLM for parsing and structuring data.
  • Clean Interface: Modern, responsive UI with dark theme optimization.
  • Parallel Processing: Handles large content through parallel chunk processing.
  • Smart Content Cleaning: Removes tracking elements, ads, and unwanted content.
  • Structured Output: Presents data in clean, organized tables.

πŸš€ Getting Started

Prerequisites

  • Python 3.8+
  • A Groq API key
  • A Bright Data Scraping Browser account

Installation

  1. Clone the repository:

git clone https://github.com/verus56/advanced-web-scraper.git
cd advanced-web-scraper

  2. Install required dependencies:

pip install -r requirements.txt

  3. Set up environment variables:

Create a .env file in the root directory and add:

AUTH=your-bright-data-auth
GROQ_API_KEY=your-groq-api-key
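For illustration, here is a minimal stdlib sketch of how such a .env file can be read at startup (the app itself may rely on a library such as python-dotenv instead; the `load_env` helper below is hypothetical):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: reads KEY=value lines, skips blanks and '#' comments,
    and exports the pairs into os.environ."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env
```

Keeping credentials in .env (and out of version control) is what makes the secure API key management mentioned below possible.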

Running the App

streamlit run app.py

Visit http://localhost:8501 to access the application.

πŸ€– How It Works

  1. Enter the URL of the target website.
  2. Specify the type of data you want to extract.
  3. Scrape and parse the data with intelligent AI support.
  4. Export structured data in your preferred format.
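The overall flow can be sketched as a small pipeline. The function and parameter names here are illustrative, not the project's actual API; in the real app, `scrape` would be backed by Selenium via the Bright Data browser and `parse` by a Groq LLM call:

```python
def run_pipeline(url, wanted, scrape, parse):
    """Illustrative scrape-then-parse flow: fetch the page, then extract
    the records the user asked for from the raw content."""
    raw = scrape(url)            # step 1-2: fetch the target page
    return parse(raw, wanted)    # step 3: AI-assisted extraction

# Usage with stand-in functions (the real app wires in Selenium and Groq):
fake_scrape = lambda url: "<html><p>Price: $10</p></html>"
fake_parse = lambda html, wanted: [{"price": "$10"}]
rows = run_pipeline("https://example.com", "prices", fake_scrape, fake_parse)
```

Injecting `scrape` and `parse` as functions keeps the workflow testable without a live browser or API key.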

πŸ“Š Technical Stack

  • Frontend: Streamlit
  • Web Scraping: Selenium
  • AI Engine: Groq LLM
  • Data Processing: Pandas
  • Content Cleaning: BeautifulSoup
  • Parallelization: concurrent.futures

πŸ› οΈ Deployment

  • Docker:
docker build -t advanced-scraper:latest .
docker run -d -p 8501:8501 advanced-scraper:latest
  • Cloud options: AWS, GCP, Azure

πŸ“… Configuration

  • Content Chunk Size: Adjust the size of content chunks for processing (2000-8000 characters).
  • Parallel Processing: Controls the number of concurrent processes.
  • Browser Options: Configurable through Selenium settings.

πŸ”’ Security Features

  • Automatic CAPTCHA handling.
  • Cookie and tracking prevention.
  • JavaScript blocking options.
  • Secure API key management.

🎨 UI Features

  • Dark theme optimization.
  • Responsive design.
  • Progress indicators.
  • Expandable content sections.
  • Error handling with visual feedback.
  • Interactive data tables.

πŸ”§ Advanced Features

Content Cleaning

  • Removes tracking elements.
  • Filters unwanted content.
  • Preserves semantic structure.
  • Handles dynamic content.

Data Processing

  • Parallel chunk processing.
  • Intelligent merging.
  • Duplicate removal.
  • Table structure preservation.

πŸ“ License

Released under the MIT License.

πŸ“² Contact

Made with ❀️ by v56