AI Agent for Generating Web Scraper Parsing Code

An AI tool that generates parsing code for scraping web pages that share a similar layout. Users enter page URLs as training examples, choose a parser prompt, and submit to build the parser.

Features

AI Parser Generation

Automatically generates code to parse web pages with similar visual styles. Users input URLs from the same website to train the parser.

Custom Parser Prompts

Allows users to choose and customize parsing prompts to tailor the data extraction process.

HTML Upload for Training

When a training case shows 'Fail to Load' because the HTML downloader has been blocked by the site, users can save the page's HTML themselves and upload it to Coparser to train the model.

Install Python Libraries

Provides guidance and commands for installing the Python packages, such as lxml and playwright, that are needed to run the generated code.
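
As a rough illustration of what that setup involves, the sketch below checks which modules are importable and prints the matching install commands. The module list is an assumption: lxml and playwright come from the description above, while cssselect is included because lxml's CSS selector support depends on it. The helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names the generated code is assumed to need; adjust to match your parser.
REQUIRED_MODULES = ["lxml", "cssselect", "playwright"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        # For these three, the pip package name matches the import name.
        print("Install with: pip install " + " ".join(missing))
        if "playwright" in missing:
            # Playwright also needs browser binaries after the pip install.
            print("Then download browsers with: playwright install")
    else:
        print("All required modules are importable.")
```

Run as a script, it prints either the pip command for whatever is missing or a confirmation that everything is importable.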

Training Cases

Allows you to add, delete, and manage training cases that help tailor the parser to specific inputs or webpage structures.

Code Regeneration

Enables the user to regenerate code based on alterations or new training data, ensuring that the parser remains effective with updated website layouts.

Training Case Management

Allows users to add and manage training cases for the parser. It supports attaching specific URLs to train the parser on different datasets, ensuring accuracy in data extraction.

Custom Parser Code

Empowers users to generate and modify custom parser code using the provided input and sample data, offering flexibility to tailor data extraction logic to specific needs.

Visual Output Review

Includes visual output review features to verify and validate the extraction results against provided sample pages, helping ensure the parser is functioning correctly.

HTML Content Extraction

Uses Playwright for browsing and CSSSelector for parsing HTML to extract product details from web pages.
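
A minimal sketch of the parsing half, assuming lxml with the cssselect package installed. The selectors and the function name parse_product are hypothetical placeholders, not the ones Coparser generates; a real parser would use selectors derived from the training pages.

```python
def parse_product(html_text):
    """Extract a few product fields from raw HTML using CSS selectors."""
    # Imported here so the module still loads where lxml/cssselect are absent.
    from lxml import html
    from lxml.cssselect import CSSSelector

    # Hypothetical selectors chosen for illustration only.
    selectors = {
        "name": CSSSelector("h1.product-title"),
        "price": CSSSelector("span.price"),
        "image": CSSSelector("img.product-image"),
    }
    tree = html.fromstring(html_text)

    def first(key):
        # A compiled CSSSelector is callable and returns a list of matches.
        matches = selectors[key](tree)
        return matches[0] if matches else None

    name_el, price_el, image_el = first("name"), first("price"), first("image")
    return {
        "name": name_el.text_content().strip() if name_el is not None else None,
        "price": price_el.text_content().strip() if price_el is not None else None,
        "image_url": image_el.get("src") if image_el is not None else None,
    }
```

Fields whose selector matches nothing come back as None rather than raising, so one missing element does not discard the rest of the record.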

Price Extraction

Extracts and cleans price information from the product page using CSS selectors.
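
The cleaning step might look like the sketch below, which takes the text already pulled out by a selector and converts it to a number. It assumes US-style formatting (comma as thousands separator, dot as decimal point); clean_price is a hypothetical name.

```python
import re
from decimal import Decimal

# A digit, then digits or commas, then an optional decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def clean_price(raw):
    """Return the first number in `raw` as a Decimal, or None if there is none."""
    if not raw:
        return None
    match = _PRICE_RE.search(raw)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))
```

For example, clean_price("$1,299.99") returns Decimal("1299.99"), and text with no digits returns None. Decimal is used instead of float to avoid rounding surprises with currency values.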

Product Name Extraction

Retrieves the product name from the HTML content using a CSS selector for the product name element.

Availability Extraction

Determines the availability of a product by checking specific HTML elements for delivery information.
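
One way such a check could work, sketched under assumptions: the text of the delivery element has already been extracted, and the marker phrases below are illustrative guesses rather than the strings the generated parser looks for.

```python
def is_available(delivery_text):
    """Guess availability from the text of a delivery or stock element."""
    if not delivery_text:
        # Assumption: a missing or empty delivery block means not purchasable.
        return False
    # Collapse whitespace and lowercase so matching is layout-insensitive.
    text = " ".join(delivery_text.split()).lower()
    unavailable_markers = ("currently unavailable", "out of stock", "sold out")
    return not any(marker in text for marker in unavailable_markers)
```

Treating an absent delivery block as unavailable is a design choice; a parser that prefers to report "unknown" could return None in that branch instead.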

Product Image Extraction

Extracts the URL of the product image from the HTML content using a CSS selector.

HTML Extraction

Utilizes Playwright to launch a browser and extract the HTML content from a given Amazon product URL, which is then saved to a local file for further processing.
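
The fetch-and-save step can be sketched with Playwright's synchronous API as follows. The function names, the timeout, the placeholder URL, and the output path are all assumptions for illustration; the generated code may differ.

```python
from pathlib import Path


def fetch_html(url, timeout_ms=30000):
    """Load `url` in headless Chromium and return the rendered HTML."""
    # Imported lazily so save_html works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def save_html(html_text, path):
    """Write HTML to `path` as UTF-8, creating parent folders; return the Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html_text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Placeholder URL; substitute the product page you want to capture.
    save_html(fetch_html("https://example.com/"), "pages/product.html")
```

Keeping the download and the file write separate means a page saved by hand from a browser can be fed to the same parsing code.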

Sale Price Extraction

Extracts the sale price of a product from the HTML content using CSS selectors, handling exceptions when the data is not found on the first attempt.

Multiple Price Extraction

Gathers a range of prices when multiple prices are present by extracting data from the specified CSS selectors.
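
A small sketch of how several matched price strings could be reduced to a range. It assumes the selector matches have already been converted to text and that prices use US-style formatting; price_range is a hypothetical helper.

```python
import re
from decimal import Decimal


def price_range(price_texts):
    """Return (lowest, highest) across the given strings, or None if none parse."""
    values = []
    for text in price_texts:
        # One string may hold several prices, e.g. "$19.99 - $24.99".
        for number in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
            values.append(Decimal(number.replace(",", "")))
    if not values:
        return None
    return min(values), max(values)
```

When only one price is found, the lowest and highest values are the same, so callers can treat single-price and multi-price pages uniformly.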

Product Name Extraction

Extracts the product name from the HTML content using defined CSS selectors to locate the product title and retrieve its text value.

Total Reviews Extraction

Fetches the total number of customer reviews available for a product by selecting the appropriate CSS element and retrieving its text.
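
Because the retrieved text is usually a label rather than a bare number, a conversion step like the one below is a plausible follow-up. The label formats it handles are assumptions, and parse_review_count is a hypothetical name.

```python
import re


def parse_review_count(text):
    """Return the first integer in a label such as '1,234 ratings', else None."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Remove thousands separators before converting to int.
    return int(match.group().replace(",", ""))
```

For instance, parse_review_count("1,234 ratings") returns 1234, and a label with no digits returns None.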

Availability Extraction

Checks and retrieves the availability status of a product by inspecting the designated CSS selector for stock information.

Product Image Extraction

Extracts URLs of the product images using CSS selectors, ensuring that visual content is included in the parsed data.

Average Review Extraction

Returns the average review score for a product by extracting the rating from the HTML content and converting it to a number.
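
The conversion could be sketched as below, assuming the rating element's text looks like "4.5 out of 5 stars". The five-point scale and the name parse_average_rating are assumptions.

```python
import re


def parse_average_rating(text, scale_max=5.0):
    """Return the leading number in `text` as a float, or None if absent or off-scale."""
    if not text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    # Reject numbers outside the rating scale, e.g. a review count matched by mistake.
    return value if 0.0 <= value <= scale_max else None
```

The range check is a cheap guard against a selector that accidentally lands on a neighbouring element such as the review count.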

Product Parsing

Automatically extracts detailed information about clothing products from Amazon, such as product names and details.

Training Cases

Allows users to add specific URLs as training cases to test and refine the parser's ability to extract data accurately.

Custom Code Generation

Generates custom Python code for specific data extraction tasks, using Playwright for browsing and CSS selectors for locating elements.

Amazon Product Detail Extraction

Automatically extracts detailed product information such as price, product name, and average review from Amazon links.

Training Cases

Allows users to add training cases for improving parser accuracy using specific Amazon product pages.

Code Regeneration

Users can regenerate the parsing code based on updated training cases or requirements.

Web Data Extraction

Utilizes a Python-based parser to extract product details from websites using Playwright for browser automation.

Training Cases

Allows users to add and manage training cases for the parser, with URLs to example pages and status indicators for completion.

Code Generation

Automatically generates Python code based on the prompt and provided training cases, using libraries like Playwright and lxml to fetch and parse data.