
ETL Automation & Weekly Report

Automated end-to-end data cleaning pipeline for 3 operational data sources. Interactive demo shows live data transformation with business impact metrics.

The Business Problem

Every Monday, an operations analyst spent 3 hours manually cleaning 3 CSV files from different departments: orders, inventory, and employee data. Same task, every week, burning 156 hours annually.

This project automates it entirely. One command. 18 seconds. Complete data quality audit included.


Interactive Demo

See the pipeline in action below. Click "Run Pipeline" to watch it extract messy data, transform it programmatically, and generate the weekly report:

ETL Pipeline Demo

Live data-cleaning simulation (interactive widget): Extract → Transform → Load stages, with running counters for orders processed, inventory items, employees, and hours saved.
Raw Orders Data (Sample)

3 different date formats, duplicates, and typos:

Order ID   Date          Product   Price       Qty   Status
ORD-001    01/15/2024    Laptop    $1,299.00   2.0   Compleeted
ORD-002    2024-01-16    Phone     $899.00     1.0   shiped
ORD-001    15-Jan-2024   Laptop    $1,299.00   2.0   CANCELLED    (duplicate)
ORD-003    NULL          Tablet    NULL        1.0   Pending

Issues on display: duplicates, NULL values, typos, mixed date formats

Pipeline runtime: ~18 seconds · 3 CSV files generated

What This Solves

Before (Manual Process)

  • 3 hours every Monday cleaning data in Excel
  • Inconsistent date formats breaking pivot tables
  • Price fields with "$" symbols requiring manual find/replace
  • Duplicate order IDs inflating revenue reports
  • Status typos (Compleeted, shiped) messing up filters
  • No audit trail, so mistakes go unnoticed

After (Automated Pipeline)

  • 18 seconds end-to-end execution
  • Handles 8+ data quality issue types automatically
  • Structured logging of every transformation
  • Generated weekly report with issue counts
  • 156 hours saved annually (~₹62,400 at analyst rates)

Data Quality Issues Handled

Dataset     Issues Detected & Fixed
Orders      Duplicate order IDs; 3 date formats (YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YYYY); "$" prefix in prices; float quantities; status typos
Inventory   Extra whitespace in names; negative stock levels; mixed timezones (IST/UTC); comma-separated cost prices
Employees   Mixed name casing (ALL CAPS, all lower); mixed overtime flags (0/1/Yes/No); impossible hours (>24); blank dates
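The orders fixes in the table above can be sketched in pandas. The sample frame below mirrors the messy demo rows; the helper and mapping names are illustrative, not the repo's actual code:

```python
import pandas as pd

# Hypothetical sample mirroring the messy orders shown in the demo.
raw = pd.DataFrame({
    "order_id": ["ORD-001", "ORD-002", "ORD-001", "ORD-003"],
    "date": ["01/15/2024", "2024-01-16", "15-Jan-2024", None],
    "price": ["$1,299.00", "$899.00", "$1,299.00", None],
    "qty": [2.0, 1.0, 2.0, 1.0],
    "status": ["Compleeted", "shiped", "CANCELLED", "Pending"],
})

def parse_mixed_date(value):
    """Try each of the three known formats; fall back to NaT."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        parsed = pd.to_datetime(value, format=fmt, errors="coerce")
        if pd.notna(parsed):
            return parsed
    return pd.NaT

STATUS_FIXES = {"compleeted": "completed", "shiped": "shipped"}

cleaned = (
    raw.drop_duplicates(subset="order_id", keep="first")   # drop repeated IDs
       .assign(
           date=lambda d: d["date"].map(parse_mixed_date),
           # strip "$" and thousands commas, then coerce to numeric
           price=lambda d: pd.to_numeric(
               d["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"),
           qty=lambda d: d["qty"].astype("Int64"),          # whole-number quantities
           status=lambda d: d["status"].str.strip().str.lower()
                             .replace(STATUS_FIXES),
       )
)
```

The same pattern (normalize, coerce, map known typos) extends to the inventory and employee fixes.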

Technical Architecture

etl-automation/
├── data/
│   ├── raw/                    # Messy input files
│   ├── cleaned/                # Timestamped outputs
│   └── generate_messy_files.py # Test data generator
├── etl/
│   ├── clean_and_report.py     # Main pipeline
│   └── logs/
│       └── etl_run.log         # Audit trail
├── reports/
│   └── weekly_report.txt       # Generated report
└── README.md

Key Components

1. Data Generator (generate_messy_files.py) Creates realistic test data with injected quality issues: duplicates, nulls, typos, formatting inconsistencies.
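A minimal, stdlib-only sketch of such a generator; the field names, filename, and injected issues below are illustrative assumptions, not the repo's actual implementation:

```python
import csv
import random

random.seed(42)  # reproducible "mess"

DATE_FORMATS = ["{y}-{m:02d}-{d:02d}", "{m:02d}/{d:02d}/{y}", "{d}-Jan-{y}"]
STATUSES = ["Completed", "Compleeted", "shiped", "Shipped", "Pending", "CANCELLED"]

def messy_order(i):
    """One order row with a randomly chosen date format and status spelling."""
    fmt = random.choice(DATE_FORMATS)
    return {
        "order_id": f"ORD-{i:03d}",
        "date": fmt.format(y=2024, m=1, d=random.randint(1, 28)),
        "price": f"${random.choice([899, 1299]):,}.00",  # "$" prefix + comma
        "qty": float(random.randint(1, 3)),              # float instead of int
        "status": random.choice(STATUSES),
    }

rows = [messy_order(i) for i in range(1, 11)]
rows.append(rows[0])  # inject a duplicate order ID

with open("orders_messy.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```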

2. ETL Pipeline (clean_and_report.py)

  • 8 labeled sections for maintainability
  • Dual logging: file (etl_run.log) + stdout
  • Error handling: Graceful skip on missing files
  • Metrics tracking: Every issue count logged
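The dual-logging and graceful-skip behaviors could look roughly like this; the handler setup and function names are assumptions, not the pipeline's actual code:

```python
import logging
import sys
from pathlib import Path

import pandas as pd

def build_logger(log_path="etl/logs/etl_run.log"):
    """Log every transformation to a file *and* to stdout."""
    logger = logging.getLogger("etl")
    logger.setLevel(logging.INFO)
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path),
                    logging.StreamHandler(sys.stdout)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger

def load_csv_or_skip(path, logger):
    """Gracefully skip a missing input file instead of crashing the run."""
    if not Path(path).exists():
        logger.warning("Skipping missing file: %s", path)
        return None
    return pd.read_csv(path)
```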

3. Weekly Report Auto-generated with:

  • Executive summary (rows processed, issues fixed)
  • Data quality breakdown per file
  • Top 3 recurring issues with fix recommendations
  • Time saved calculation
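Rendering that report reduces to formatting a dict of issue counts; a hedged sketch (the function signature and section layout are assumptions modeled on the sample output):

```python
from datetime import datetime
from pathlib import Path

def write_weekly_report(issue_counts, path="reports/weekly_report.txt"):
    """Render issue counts into a plain-text weekly report."""
    # Rank issues by count to surface the top 3 recurring ones.
    top3 = sorted(issue_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
    lines = [
        "=" * 40,
        "WEEKLY OPERATIONS REPORT",
        f"Generated: {datetime.now():%Y-%m-%d %H:%M:%S}",
        "=" * 40,
        "",
        "DATA QUALITY ISSUES FOUND",
        "-" * 25,
    ]
    lines += [f"  - {label}: {count:,}" for label, count in issue_counts.items()]
    lines += ["", "TOP 3 RECURRING ISSUES (action required)", "-" * 41]
    lines += [f"{rank}. {label}: {count:,} instances"
              for rank, (label, count) in enumerate(top3, start=1)]
    report = "\n".join(lines)
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(report)
    return report
```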

Sample Output

========================================
WEEKLY OPERATIONS REPORT
Generated: 2024-03-18 09:02:31
========================================

EXECUTIVE SUMMARY
-----------------
Total orders processed: 780
Order issues fixed: 1,739 (avg 2.2 per order)
Inventory items needing reorder: 10
Employee records cleaned: 138
Estimated manual time saved: 28.8 hours

DATA QUALITY ISSUES FOUND
-------------------------
Orders:
  - Duplicate order IDs removed: 40
  - Date format inconsistencies fixed: 560
  - Price formatting issues fixed: 819
  - Status typos corrected: 290

TOP 3 RECURRING ISSUES (action required)
-----------------------------------------
1. Date format inconsistency: 560 instances
   โ†’ Standardize order system to ISO 8601

2. Price stored as text with $: 819 instances
   โ†’ Export numeric values without formatting

3. Hours > 24 in records: 15 instances
   โ†’ Add HR system validation

Time Saved Breakdown

Task                  Manual    Automated   Weekly Saving
Clean orders.csv      60 min    8 sec       ~59 min
Clean inventory.csv   45 min    5 sec       ~44 min
Clean employees.csv   30 min    3 sec       ~29 min
Generate report       45 min    2 sec       ~44 min
Total                 180 min   18 sec      ~179 min

Annual impact: 156 hours saved (~₹62,400 at analyst rates)
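The annual figures follow from simple arithmetic; a quick sanity check (the ₹400/hour analyst rate is a hypothetical assumption chosen to reproduce the stated figure):

```python
# ~180 min manual vs ~18 s automated per week
minutes_saved_per_week = 180 - 18 / 60
hours_saved_per_week = round(minutes_saved_per_week / 60)  # rounds to 3 hours
annual_hours = hours_saved_per_week * 52                   # 156 hours/year
annual_savings = annual_hours * 400                        # at an assumed ₹400/hour
```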


Stack & Skills

Python · Pandas · ETL pipeline · Data quality · Logging · Automation · Operations analytics


Run It Yourself

# Clone and setup
git clone https://github.com/Swethavenk79/etl-automation.git
cd etl-automation
pip install -r requirements.txt
 
# Generate messy data
python data/generate_messy_files.py
 
# Run pipeline
python etl/clean_and_report.py
 
# View report
cat reports/weekly_report.txt

Why This Matters

Most companies have messy data. Most analysts spend hours cleaning it manually. This project shows I can:

  1. Identify systemic data quality issues, not just fix symptoms
  2. Quantify business impact: time saved, cost savings
  3. Build maintainable automation: clear code, logging, documentation
  4. Think beyond the fix: the "Top 3 Recurring Issues" section shows I understand that prevention beats cure

The interactive demo above? That's the actual transformation logic running in your browser. Hiring managers can see it work instead of just reading about it.