Summary

In this project, we extract 100,000 eCommerce transactions and enrich them with OpenAI's Batch API. Data is ingested from normalized sources and automatically cleansed before being stored in a state-of-the-art data lake on Google Cloud Storage. From there, it is loaded into Google BigQuery—a petabyte-scale, enterprise-grade data warehouse organized into staging, clean, and aggregated layers. Apache Airflow orchestrates the complex ETL workflows, while integration with the OpenAI Batch API enriches tens of thousands of data points. Notably, customer reviews originally in Portuguese are translated into English and analyzed to extract precise numerical sentiment scores, transforming raw text into actionable insights.

This solution is closely modeled on enterprise-scale projects delivered at multi-billion USD companies. As we cannot reveal sensitive client data, we use the publicly available anonymized Olist dataset—a comprehensive record of orders, products, reviews, payments, geolocation, customers, and items from Brazilian marketplaces. The architecture reflects the same high-caliber standards applied in confidential client projects. Built on Google Cloud and backed by extensive industry experience, our scalable, fault-tolerant ETL pipeline turns complex, multi-dimensional eCommerce data into strategic business intelligence.

System Architecture, detailing the flow of data from source to data lake, LLM provider, data warehouse and visualization tool, orchestrated by Apache Airflow


Technology Stack

  • Cloud: Google Cloud Project
  • Data Warehouse: Google BigQuery
  • Data Lake: Google Cloud Storage (GCS)
  • Large Language Model (LLM): OpenAI Batch Prediction API
  • Data Source: PostgreSQL
  • Data Pipeline Orchestration: Apache Airflow
  • Analytics: Looker Studio
Entity Relationship Diagram (ERD) of the source data. The source data gets mapped to individual tables in the data warehouse (BigQuery) in Staging and Clean layers

Key Features/Deliverables

  • AI-Powered Sentiment: Automates review enrichment using an LLM-based model, providing detailed aspect scores (e.g., delivery, product quality).
  • End-to-End Data Pipeline: Airflow orchestrates extraction, cleaning, and transformation stages, ensuring reliable, repeatable workflows.
  • Layered Data Architecture: Staging, clean, and aggregated datasets in BigQuery deliver clarity and scalability for analysis.
  • Dashboards & Visuals: Interactive reports break down sentiment, revenue, and product performance by city, category, and more.
  • Actionable Insights: Highlights the precise areas where customer satisfaction soars—or suffers—enabling faster, data-driven decisions.

In Airflow, data tasks are orchestrated in a hierarchical manner, streamlining error tracking and enabling seamless re-runs of the entire pipeline

Outcome

  • Precise Customer Insights: Enabled the business to pinpoint high-priority improvement areas, such as delivery delays or product quality issues.
  • Increased Efficiency: Automated workflows replaced manual processes, reducing errors and freeing up team members for higher-value tasks.
  • Faster Decision-Making: Real-time dashboards let stakeholders spot trends and address problems before they escalate, boosting agility.
  • Scalable Foundation: The robust, modular pipeline can easily incorporate new data sources or AI models as the business evolves.

All data, including AI-predicted review sentiment, is loaded dailyst to Google's Petabyte Scale "BigQuery" data warehouse in staging, clean and aggregated data marts

Contribution

  • Solution Design & Implementation: Conceived the entire architecture from ingestion to final dashboards, ensuring a cohesive, end-to-end workflow.
  • Data Engineering & AI: Built Airflow pipelines to automatically process raw data, then integrated an LLM model to enrich reviews with detailed sentiment.
  • Data Warehousing & Modeling: Established a multi-layered structure (staging, clean, aggregated) in BigQuery, optimizing performance and clarity.
  • Dashboard Development: Crafted interactive BI visuals that highlight key metrics—sentiment scores, revenue, itemized statistics—catering to both executives and analysts.
  • Deployment & Access Control: Managed the production rollout and set up secure permissions, guaranteeing data governance and smooth team collaboration.