The Business Problem
Every Monday, an operations analyst spent 3 hours manually cleaning 3 CSV files from different departments: orders, inventory, and employee data. Same task, every week, burning 156 hours annually.
This project automates it entirely. One command. 18 seconds. Complete data quality audit included.
Interactive Demo
See the pipeline in action below. Click "Run Pipeline" to watch it extract messy data, transform it programmatically, and generate the weekly report:
Raw Orders Data (Sample)
3 different date formats, duplicates, typos

| Order ID | Date | Product | Price | Qty | Status |
|---|---|---|---|---|---|
| ORD-001 | 01/15/2024 | Laptop | $1,299.00 | 2.0 | Compleeted |
| ORD-002 | 2024-01-16 | Phone | $899.00 | 1.0 | shiped |
| ORD-001 | 15-Jan-2024 | Laptop | $1,299.00 | 2.0 | CANCELLED |
| ORD-003 | NULL | Tablet | NULL | 1.0 | Pending |
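The core transformations on this sample can be sketched in a few lines of pandas (an illustrative sketch, not the project's actual `clean_and_report.py`; `format="mixed"` requires pandas 2.0+):

```python
import io
import pandas as pd

# Inline stand-in for the messy orders CSV shown above.
raw = io.StringIO(
    "order_id,date,product,price,qty,status\n"
    'ORD-001,01/15/2024,Laptop,"$1,299.00",2.0,Compleeted\n'
    "ORD-002,2024-01-16,Phone,$899.00,1.0,shiped\n"
    'ORD-001,15-Jan-2024,Laptop,"$1,299.00",2.0,CANCELLED\n'
)
df = pd.read_csv(raw)

# Normalize all three date formats to ISO 8601.
df["date"] = pd.to_datetime(df["date"], format="mixed").dt.strftime("%Y-%m-%d")

# Strip "$" and thousands separators; cast price to float and qty to int.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
df["qty"] = df["qty"].astype(int)

# Map known status typos to canonical values; leave unknowns untouched.
status_map = {"compleeted": "Completed", "shiped": "Shipped", "cancelled": "Cancelled"}
df["status"] = df["status"].str.lower().map(status_map).fillna(df["status"])

# Drop duplicate order IDs, keeping the first occurrence.
df = df.drop_duplicates(subset="order_id", keep="first")
```

After this pass the duplicate ORD-001 row is gone, every date reads `YYYY-MM-DD`, and prices are plain floats ready for aggregation.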
What This Solves
Before (Manual Process)
- 3 hours every Monday cleaning data in Excel
- Inconsistent date formats breaking pivot tables
- Price fields with "$" symbols requiring manual find/replace
- Duplicate order IDs inflating revenue reports
- Status typos (Compleeted, shiped) messing up filters
- No audit trail: mistakes go unnoticed
After (Automated Pipeline)
- 18 seconds end-to-end execution
- Handles 8+ data quality issue types automatically
- Structured logging of every transformation
- Generated weekly report with issue counts
- 156 hours saved annually (~₹62,400 at analyst rates)
Data Quality Issues Handled
| Dataset | Issues Detected & Fixed |
|---|---|
| Orders | Duplicate order IDs, 3 date formats (YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YYYY), $ prefix in prices, float quantities, status typos |
| Inventory | Extra whitespace in names, negative stock levels, mixed timezones (IST/UTC), comma-separated cost prices |
| Employees | Mixed name casing (ALL CAPS, all lower), mixed overtime flags (0/1/Yes/No), impossible hours (>24), blank dates |
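The inventory and employee fixes follow the same pattern. A minimal sketch (column names and sample values here are hypothetical, not the project's actual schema):

```python
import pandas as pd

# Inventory: stray whitespace, negative stock, comma-separated cost prices.
inventory = pd.DataFrame({
    "item": ["  Widget ", "Gadget"],
    "stock": [-5, 12],
    "cost": ["1,250", "300"],
})
inventory["item"] = inventory["item"].str.strip()
inventory["stock"] = inventory["stock"].clip(lower=0)                # negative stock -> 0
inventory["cost"] = inventory["cost"].str.replace(",", "").astype(float)

# Employees: mixed casing, mixed overtime flags, impossible daily hours.
employees = pd.DataFrame({
    "name": ["ALICE SMITH", "bob jones"],
    "overtime": ["Yes", "0"],
    "hours": [8, 26],
})
employees["name"] = employees["name"].str.title()                    # normalize casing
employees["overtime"] = employees["overtime"].isin(["1", "Yes"])     # 0/1/Yes/No -> bool
employees["hours"] = employees["hours"].mask(employees["hours"] > 24)  # impossible hours -> NaN
```

Flagging impossible hours as missing (rather than silently capping them) keeps the issue visible for the audit report.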
Technical Architecture
etl-automation/
├── data/
│   ├── raw/                      # Messy input files
│   ├── cleaned/                  # Timestamped outputs
│   └── generate_messy_files.py   # Test data generator
├── etl/
│   ├── clean_and_report.py       # Main pipeline
│   └── logs/
│       └── etl_run.log           # Audit trail
├── reports/
│   └── weekly_report.txt         # Generated report
└── README.md

Key Components
1. Data Generator (generate_messy_files.py)
Creates realistic test data with injected quality issues: duplicates, nulls, typos, formatting inconsistencies.
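Injecting issues on purpose might look like this (an illustrative sketch, not the project's actual `generate_messy_files.py`; the seed, row count, and column set are assumptions):

```python
import csv
import random
from datetime import date, timedelta

random.seed(42)  # deterministic test data

# Deliberately includes typo'd and mixed-case statuses.
STATUSES = ["Completed", "Compleeted", "shiped", "Pending", "CANCELLED"]
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]  # the three formats the pipeline must handle

rows = []
for i in range(1, 11):
    d = date(2024, 1, 1) + timedelta(days=random.randint(0, 60))
    fmt = random.choice(DATE_FORMATS)                       # inconsistent date format
    price = random.choice([f"${random.randint(100, 2000)}.00", "NULL"])  # $ prefix or null
    rows.append([f"ORD-{i:03d}", d.strftime(fmt), price, random.choice(STATUSES)])

rows.append(rows[0])  # inject a duplicate order ID

with open("orders_messy.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "date", "price", "status"])
    writer.writerows(rows)
```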
2. ETL Pipeline (clean_and_report.py)
- 8 labeled sections for maintainability
- Dual logging: file (etl_run.log) + stdout
- Error handling: Graceful skip on missing files
- Metrics tracking: Every issue count logged
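The dual-logging setup described above is standard library `logging` with two handlers on one logger, roughly like this (a sketch under assumed names, not the project's exact configuration):

```python
import logging
import sys

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

# Handler 1: persistent audit trail on disk.
file_handler = logging.FileHandler("etl_run.log")
file_handler.setFormatter(fmt)

# Handler 2: live progress in the terminal.
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(fmt)

logger.addHandler(file_handler)
logger.addHandler(stream_handler)

# Every metric goes through the same call, so file and stdout never diverge.
logger.info("Duplicate order IDs removed: %d", 40)
```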
3. Weekly Report
Auto-generated with:
- Executive summary (rows processed, issues fixed)
- Data quality breakdown per file
- Top 3 recurring issues with fix recommendations
- Time saved calculation
Sample Output
========================================
WEEKLY OPERATIONS REPORT
Generated: 2024-03-18 09:02:31
========================================
EXECUTIVE SUMMARY
-----------------
Total orders processed: 780
Orders with issues fixed: 1,739 (222.9%)
Inventory items needing reorder: 10
Employee records cleaned: 138
Estimated manual time saved: 28.8 hours
DATA QUALITY ISSUES FOUND
-------------------------
Orders:
- Duplicate order IDs removed: 40
- Date format inconsistencies fixed: 560
- Price formatting issues fixed: 819
- Status typos corrected: 290
TOP 3 RECURRING ISSUES (action required)
-----------------------------------------
1. Date format inconsistency: 560 instances
   → Standardize order system to ISO 8601
2. Price stored as text with $: 819 instances
   → Export numeric values without formatting
3. Hours > 24 in records: 15 instances
   → Add HR system validation

Time Saved Breakdown
| Task | Manual | Automated | Weekly Saving |
|---|---|---|---|
| Clean orders.csv | 60 min | 8 sec | ~59 min |
| Clean inventory.csv | 45 min | 5 sec | ~44 min |
| Clean employees.csv | 30 min | 3 sec | ~29 min |
| Generate report | 45 min | 2 sec | ~44 min |
| Total | 180 min | 18 sec | ~179 min |
Annual impact: 156 hours saved (~₹62,400 at analyst rates)
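The headline numbers check out arithmetically (note the ₹400/hour rate below is inferred from the quoted figures, not stated in the source):

```python
# 3 hours of manual work replaced each Monday, 52 weeks a year.
weekly_minutes = 180
annual_hours = weekly_minutes * 52 / 60   # 156.0 hours per year

# Implied analyst rate: ₹62,400 / 156 h = ₹400/hour.
hourly_rate = 400
annual_savings = annual_hours * hourly_rate

print(annual_hours, annual_savings)  # 156.0 62400.0
```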
Stack & Skills
Python · Pandas · ETL pipeline · Data quality · Logging · Automation · Operations analytics
Run It Yourself
# Clone and setup
git clone https://github.com/Swethavenk79/etl-automation.git
cd etl-automation
pip install -r requirements.txt
# Generate messy data
python data/generate_messy_files.py
# Run pipeline
python etl/clean_and_report.py
# View report
cat reports/weekly_report.txt

Why This Matters
Most companies have messy data. Most analysts spend hours cleaning it manually. This project shows I can:
- Identify systemic data quality issues, not just fix symptoms
- Quantify business impact: time saved, cost savings
- Build maintainable automation: clear code, logging, documentation
- Think beyond the fix: the "Top 3 Recurring Issues" section proves I understand prevention beats cure
The interactive demo above? That's the actual transformation logic running in your browser. Hiring managers can see it work instead of just reading about it.