Data Scraping & UNSPSC Categorization for a Multi-Brand eCommerce Reseller

Client Profile

Established Multi-Brand eCommerce Reseller Managing 7,000+ Supplier Brands

An eCommerce business founded in the early 1980s has grown into a trusted reseller of industrial machinery, hardware supplies, and household appliances. The business maintains active vendor relationships with more than 7,000 manufacturer brands and operates two dedicated eCommerce storefronts serving a broad range of sectors, including plumbing, electrical, and equipment supply. With a catalog exceeding 2 million products, maintaining consistent and accurate product data at scale has been an operational priority.

Project Scope

Extracting, Categorizing, and Enriching Product Data at Enterprise Scale

The scope of work included three interconnected product data management functions for the client's 2M+ SKU catalog:

Data Extraction from Brand Sites

The team was tasked with extracting product data from multiple third-party brand websites shared by the client. The process had to preserve all available attributes, then remove special characters and standardize formats so that the data could be loaded directly into pre-approved templates.
Custom Taxonomy Build and Product Classification

The client's platform had no existing category structure or classification framework. The team was responsible for designing a taxonomy from scratch to match the client's product range and applying accurate UNSPSC codes to each item to improve platform navigation and searchability.
AI-Assisted Product Data Enrichment

A significant volume of scraped records contained incomplete or missing data fields. The client requested the use of AI — specifically ChatGPT — to fill these gaps systematically through customized prompting, ensuring each enriched field met defined quality and consistency standards.

Project Challenges

Handling Multi-Site Structural Differences, Anti-Bot Restrictions, and Large-Scale Scraping Constraints

Each phase of this engagement brought distinct technical and operational challenges that required targeted responses at every stage. The key obstacles included:

Development of Custom Scripts For Scraping Websites

Each brand website presented its own scraping constraints—unique HTML structures, embedded content restrictions, and varying anti-bot settings. There was no single-script solution; the team had to build and deploy distinct extraction tools per platform without compromising delivery timelines.
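
One way to organize per-site extraction tools is a small registry that maps each brand domain to its own extractor. The sketch below is illustrative only: the domains, the function names, and the placeholder extractor bodies are assumptions, not the scripts used in this engagement.

```python
from urllib.parse import urlparse


def extract_brand_a(html):
    # Placeholder: site-specific parsing for the first (hypothetical) brand site.
    return {"source": "brand-a", "length": len(html)}


def extract_brand_b(html):
    # Placeholder: site-specific parsing for the second (hypothetical) brand site.
    return {"source": "brand-b", "length": len(html)}


# Hypothetical domains, each mapped to its dedicated extractor.
EXTRACTORS = {
    "brand-a.example": extract_brand_a,
    "brand-b.example": extract_brand_b,
}


def extractor_for(url):
    """Return the extractor registered for the URL's host."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    try:
        return EXTRACTORS[host]
    except KeyError:
        raise LookupError(f"no extractor registered for {host!r}") from None
```

With this layout, adding a new brand site means writing one extractor and registering one domain, while the rest of the pipeline stays unchanged.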
Scaling Manual Scripts Across High Data Volumes

Scripts built for targeted, single-site extraction lacked the throughput needed for simultaneous, high-volume collection across multiple brand websites. As batch sizes grew, the performance constraints of manually developed scripts became a limiting factor that required proactive management.
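
A common way to raise throughput for this kind of workload is to run fetches concurrently with bounded workers and retries. The following is a minimal sketch under stated assumptions: `fetch_all`, its parameters, and the injected `fetch` callable are hypothetical names, and the source does not say which scaling technique the team actually used.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8, retries=2, backoff=0.5):
    """Fetch many URLs concurrently and return (results, failures).

    `fetch` is any callable that takes a URL and returns its content,
    raising an exception on failure.
    """

    def attempt(url):
        # Retry with exponential backoff; re-raise after the last attempt.
        for n in range(retries + 1):
            try:
                return fetch(url)
            except Exception:
                if n == retries:
                    raise
                time.sleep(backoff * (2 ** n))

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(attempt, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep failures separate so they can be re-queued or reviewed.
                failures[url] = exc
    return results, failures
```

Keeping `max_workers` small per site also limits request pressure on the target server, which matters when anti-bot settings are strict.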
Standardizing Inconsistent Raw Data Output

Much of the scraped data arrived without proper categorization or consistent formatting. Extensive post-processing was required to remove non-standard characters, reformat fields, and align all records with the client's templates—steps that added time and complexity to an already demanding pipeline.
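
The post-processing described above can be sketched as a small cleaning step. This is an illustration only: the template column names and the set of characters treated as acceptable are assumptions, since the client's actual templates are not described here.

```python
import re
import unicodedata

# Hypothetical template columns; the real client template is not shown in this case study.
TEMPLATE_FIELDS = ("sku", "title", "description", "price")

_CONTROL = re.compile(r"[\x00-\x1f\x7f]")
# Assumed allow-list: letters, digits, whitespace, and common product punctuation.
_DISALLOWED = re.compile(r"[^\w\s.,;:/()%&'\"+\-#$]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(value):
    """Strip control and non-standard characters, then collapse whitespace."""
    text = unicodedata.normalize("NFC", str(value))
    text = _CONTROL.sub(" ", text)
    text = _DISALLOWED.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def to_template(record):
    """Keep only template columns, cleaned, with empty strings for missing ones."""
    return {field: clean_text(record.get(field, "")) for field in TEMPLATE_FIELDS}
```

For example, `to_template({"sku": "AB-1", "title": "Pump\u00ae", "extra": "x"})` drops the registered-trademark symbol and the unknown `extra` column and fills the missing columns with empty strings.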
Ensuring Reliability of AI-Generated Categorizations

ChatGPT occasionally returned outdated UNSPSC codes or inaccurate category assignments. Managing AI output quality required specific prompts and systematic human review at each enrichment and categorization step to ensure results met acceptable accuracy standards.

Our Approach

Custom Scraping Scripts, Taxonomy Design, and AI-Assisted Enrichment

A six-person team was assembled to manage the engagement. The team included a prompt engineer, a data scraping specialist, and a dedicated QA resource.

1. Custom Python Script Development for Data Scraping

Brand websites for scraping were delivered to the team in batches. Custom Python scripts and extraction tools were developed for each site to collect product attributes—including descriptions, pricing, taxonomy, reviews, and product categories—at scale and with high accuracy.

After scraping, special characters were removed and the data was organized according to the client's structural requirements. Three tools were used across scraping operations:

cURL requests were used to query web resources directly, which supported structured data retrieval and testing while bypassing front-end restrictions.
The Python requests library automated HTTP requests to target URLs, producing consistent, repeatable data downloads.
BeautifulSoup was used to parse HTML content, enabling precise extraction and cleaning of specific data points for further processing.
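
To illustrate the parsing step, here is a dependency-free sketch that pulls product fields out of an HTML string. It deliberately swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages, and the class names in `FIELD_CLASSES` are hypothetical; real brand sites would each need their own selectors.

```python
from html.parser import HTMLParser

# Hypothetical CSS class names mapped to output field names.
FIELD_CLASSES = {
    "product-title": "title",
    "product-price": "price",
    "product-desc": "description",
}
# Elements with no closing tag, which must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being captured
        self._depth = 0       # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._current = FIELD_CLASSES[cls]
                self._depth = 1
                self.fields.setdefault(self._current, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def parse_product(html):
    """Return the captured fields with whitespace collapsed."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}
```

In a live pipeline the HTML string would come from an HTTP download rather than a literal, and the resulting dictionary would feed the cleaning and templating steps.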

The client established clear guidelines on permissible data scraping targets, ensuring all scraping activity remained compliant with ethical and legal standards while minimizing the risk of website blocking.

2. Taxonomy Development & UNSPSC Code Assignment

To establish a consistent product classification structure, we developed a custom taxonomy aligned with the client’s product range, drawing from Google’s category framework. We then used ChatGPT to suggest the closest category matches and cross-referenced each product against the UNSPSC directory to assign the correct codes.

Our team manually reviewed every category and code to remove inaccuracies and ensure the final classification remained precise and reliable.
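
Part of this review can be pre-screened automatically before a person looks at it. UNSPSC commodity codes are eight digits made up of four two-digit levels (segment, family, class, commodity), so malformed or unknown codes can be flagged up front. The sketch below assumes a hypothetical `current_codes` set loaded from the UNSPSC directory; the function names are illustrative, not the team's tooling.

```python
import re

_UNSPSC = re.compile(r"[0-9]{8}")


def split_unspsc(code):
    """Split an 8-digit code into its four 2-digit hierarchy levels."""
    if not _UNSPSC.fullmatch(code):
        raise ValueError(f"not an 8-digit UNSPSC code: {code!r}")
    return {
        "segment": code[0:2],
        "family": code[2:4],
        "class": code[4:6],
        "commodity": code[6:8],
    }


def review_status(code, current_codes):
    """Classify an AI-suggested code so reviewers see the riskiest ones first."""
    code = str(code).strip()
    if not _UNSPSC.fullmatch(code):
        return "malformed"
    if code not in current_codes:
        # Possibly outdated or invented; needs a manual lookup.
        return "not_in_reference"
    return "ok"
```

A status of `ok` only means the code exists in the reference set. Whether it is the right code for the product still requires the manual review described above.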

3. AI-Powered Data Enrichment with ChatGPT-4

A custom GPT-4 integration was built on paid API access to automate the enrichment of incomplete product data. Prompt engineers wrote highly specific instructions to guide the model in filling missing fields, including weights, descriptions, and category assignments, with contextually accurate outputs.

We also used our master database to enrich the records further, adding relevant information that improved overall data completeness and quality.
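
The enrichment loop can be sketched as two steps: build a narrowly scoped prompt for the missing fields, then validate the model's reply before merging it. The example below makes no API call. The field names, prompt wording, and validation rules are assumptions for illustration, and the model reply is passed in as plain text.

```python
import json

# Hypothetical field names that the model is allowed to fill.
ALLOWED_FIELDS = {"description", "weight_kg", "category"}


def build_prompt(record):
    """Return (prompt, missing_fields) for one scraped record."""
    missing = sorted(f for f in ALLOWED_FIELDS if not record.get(f))
    known = {k: v for k, v in record.items() if v}
    prompt = (
        "Fill in the missing product fields using only the known attributes.\n"
        f"Known attributes: {json.dumps(known, sort_keys=True)}\n"
        f"Missing fields: {', '.join(missing)}\n"
        "Reply with a JSON object containing only the missing fields. "
        "Use null for any value you cannot determine."
    )
    return prompt, missing


def merge_reply(record, missing, reply_text):
    """Merge valid values from the reply; return (record, fields_for_review)."""
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return dict(record), list(missing)
    if not isinstance(reply, dict):
        return dict(record), list(missing)

    enriched, review = dict(record), []
    for field in missing:
        value = reply.get(field)
        if field == "weight_kg":
            valid = (
                isinstance(value, (int, float))
                and not isinstance(value, bool)
                and value > 0
            )
        else:
            valid = isinstance(value, str) and value.strip() != ""
        if valid:
            enriched[field] = value
        else:
            # Anything missing or implausible goes to a human reviewer.
            review.append(field)
    return enriched, review
```

Fields the model was not asked for are ignored, and anything that fails validation is left untouched and queued for manual review or a master-database lookup.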

Automation Approach

Human-Supervised Automation Across All Workflows

Automation handled most of the work across scraping, enrichment, categorization, and cleansing processes. Human oversight was applied at each stage to validate outputs, identify errors, and ensure every record met the client’s quality standards before delivery.

Task	Automation	Human Intervention
Data Scraping	Python scripts and scraping tools automated the retrieval of product data across all brand websites.	Manual review was conducted to verify accuracy, remove unwanted special characters, and confirm the data matched client formatting standards.
Data Enrichment	ChatGPT was used to identify and fill missing product information, including descriptions, weights, and categories.	Team members validated AI-generated entries and cross-referenced the in-house master database to close any remaining data gaps.
Product Categorization	ChatGPT generated initial category suggestions and proposed UNSPSC code assignments for all products.	Team members verified every code against the UNSPSC reference portal and corrected any outdated or ambiguous classifications.

Results Delivered

Operational Gains Across Taxonomy, Data Quality, and Workflow Efficiency

2M+ Products Categorized

A fully structured product taxonomy was built and deployed. Team4eCom eliminated categorization gaps that previously affected product discovery and made browsing less intuitive.

99.8% Data Accuracy

Product data was delivered at 99.8% accuracy through a combination of automated cleansing workflows and strict manual quality checks at every stage.
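
The case study does not state how the accuracy figure was measured. One plausible definition, shown here purely as an assumption, is field-level accuracy over a QA sample: the share of checked fields that reviewers marked correct. The function name and input shape are hypothetical.

```python
def field_accuracy(reviewed):
    """Compute the share of checked fields marked correct.

    `reviewed` is an iterable of (fields_checked, fields_correct) pairs,
    one pair per sampled record.
    """
    checked = correct = 0
    for n_checked, n_correct in reviewed:
        if not 0 <= n_correct <= n_checked:
            raise ValueError("fields_correct must be between 0 and fields_checked")
        checked += n_checked
        correct += n_correct
    if checked == 0:
        raise ValueError("no fields were checked")
    return correct / checked
```

Under this definition, a sample in which every record has ten checked fields and one record has a single error scores slightly below 100%, which is why such a metric is usually reported alongside the sample size.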

78% Efficiency Increase

Operational efficiency improved through automation across data scraping, enrichment, categorization, and standardization, reducing manual effort and improving delivery timelines.

Project Workflow

Product Data Extraction, Enrichment, Categorization, and Quality Review

The following outlines how each stage of the engagement was executed, from initial onboarding through final delivery:

Client Project Initiation

Onboarded an eCommerce reseller needing full data pipeline automation.
Key focus: Scraping and enriching product data for 2M+ SKUs.

Data Scraping

Developed custom Python scripts per brand website.
Addressed unique site architectures and anti-bot measures.
Designed scripts for scalability across high-volume, multi-site extraction.
Conducted manual reviews to verify data accuracy at the extraction stage.

Data Cleansing, Standardization & Enrichment

Standardized extracted data—removed special characters and applied consistent formatting.
Used ChatGPT to fill missing product attributes (descriptions, weights, categories).
Validated AI-generated entries through human review and in-house database enrichment.

Custom Taxonomy & Product Categorization

Built a custom taxonomy framework and used ChatGPT to suggest category matches.
Assigned UNSPSC codes to all products.
Applied human oversight to review and correct AI-generated assignments.

Project Completion

Delivered enriched, categorized product data with 99.8% accuracy.
Achieved a 78% increase in operational efficiency through strategic task automation.

Start Your Project

Need Scalable Support for High-Volume Product Data Management?

Team4eCom delivers end-to-end product data management through a structured model, where automation handles high-volume processing and human review safeguards accuracy, helping businesses manage large product volumes with greater efficiency and confidence.

Reach out at info@team4ecom.com to discuss your requirements.
