The Business Problem
Every Monday, an operations analyst spent 3 hours manually cleaning 3 CSV files from different departments: orders, inventory, and employee data. Same task, every week, burning 156 hours annually.
This project automates it entirely. One command. 18 seconds. Complete data quality audit included.
Interactive Demo
See the pipeline in action below. Click "Run Pipeline" to watch it extract messy data, transform it programmatically, and generate the weekly report:
Raw Orders Data (Sample)
3 different date formats, duplicates, typos

| Order ID | Date | Product | Price | Qty | Status |
|---|---|---|---|---|---|
| ORD-001 | 01/15/2024 | Laptop | $1,299.00 | 2.0 | Compleeted |
| ORD-002 | 2024-01-16 | Phone | $899.00 | 1.0 | shiped |
| ORD-001 | 15-Jan-2024 | Laptop | $1,299.00 | 2.0 | CANCELLED |
| ORD-003 | NULL | Tablet | NULL | 1.0 | Pending |
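The core transformations on this sample can be sketched in a few lines of pandas (an illustrative sketch, not the project's actual `clean_and_report.py`; `format="mixed"` requires pandas 2.0+):

```python
import io
import pandas as pd

# Inline stand-in for the messy orders CSV shown above.
raw = io.StringIO(
    "order_id,date,product,price,qty,status\n"
    'ORD-001,01/15/2024,Laptop,"$1,299.00",2.0,Compleeted\n'
    "ORD-002,2024-01-16,Phone,$899.00,1.0,shiped\n"
    'ORD-001,15-Jan-2024,Laptop,"$1,299.00",2.0,CANCELLED\n'
)
df = pd.read_csv(raw)

# Normalize all three date formats to ISO 8601.
df["date"] = pd.to_datetime(df["date"], format="mixed").dt.strftime("%Y-%m-%d")

# Strip "$" and thousands separators; cast price to float and qty to int.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
df["qty"] = df["qty"].astype(int)

# Map known status typos to canonical values; leave unknowns untouched.
status_map = {"compleeted": "Completed", "shiped": "Shipped", "cancelled": "Cancelled"}
df["status"] = df["status"].str.lower().map(status_map).fillna(df["status"])

# Drop duplicate order IDs, keeping the first occurrence.
df = df.drop_duplicates(subset="order_id", keep="first")
```

After this pass the duplicate ORD-001 row is gone, every date reads `YYYY-MM-DD`, and prices are plain floats ready for aggregation.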
What This Solves
Before (Manual Process)
- 3 hours every Monday cleaning data in Excel
- Inconsistent date formats breaking pivot tables
- Price fields with "$" symbols requiring manual find/replace
- Duplicate order IDs inflating revenue reports
- Status typos (Compleeted, shiped) messing up filters
- No audit trail: mistakes go unnoticed
After (Automated Pipeline)
- 18 seconds end-to-end execution
- Handles 8+ data quality issue types automatically
- Structured logging of every transformation
- Generated weekly report with issue counts
- 156 hours saved annually (~₹62,400 at analyst rates)
Data Quality Issues Handled
| Dataset | Issues Detected & Fixed |
|---|---|
| Orders | Duplicate order IDs, 3 date formats (YYYY-MM-DD, MM/DD/YYYY, DD-Mon-YYYY), $ prefix in prices, float quantities, status typos |
| Inventory | Extra whitespace in names, negative stock levels, mixed timezones (IST/UTC), comma-separated cost prices |
| Employees | Mixed name casing (ALL CAPS, all lower), mixed overtime flags (0/1/Yes/No), impossible hours (>24), blank dates |
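The inventory and employee fixes follow the same pattern. A minimal sketch (column names and sample values here are hypothetical, not the project's actual schema):

```python
import pandas as pd

# Inventory: stray whitespace, negative stock, comma-separated cost prices.
inventory = pd.DataFrame({
    "item": ["  Widget ", "Gadget"],
    "stock": [-5, 12],
    "cost": ["1,250", "300"],
})
inventory["item"] = inventory["item"].str.strip()
inventory["stock"] = inventory["stock"].clip(lower=0)                # negative stock -> 0
inventory["cost"] = inventory["cost"].str.replace(",", "").astype(float)

# Employees: mixed casing, mixed overtime flags, impossible daily hours.
employees = pd.DataFrame({
    "name": ["ALICE SMITH", "bob jones"],
    "overtime": ["Yes", "0"],
    "hours": [8, 26],
})
employees["name"] = employees["name"].str.title()                    # normalize casing
employees["overtime"] = employees["overtime"].isin(["1", "Yes"])     # 0/1/Yes/No -> bool
employees["hours"] = employees["hours"].mask(employees["hours"] > 24)  # impossible hours -> NaN
```

Flagging impossible hours as missing (rather than silently capping them) keeps the issue visible for the audit report.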
Technical Architecture
etl-automation/
├── data/
│   ├── raw/                      # Messy input files
│   ├── cleaned/                  # Timestamped outputs
│   └── generate_messy_files.py   # Test data generator
├── etl/
│   ├── clean_and_report.py       # Main pipeline
│   └── logs/
│       └── etl_run.log           # Audit trail
├── reports/
│   └── weekly_report.txt         # Generated report
└── README.md

Key Components
1. Data Generator (generate_messy_files.py)
Creates realistic test data with injected quality issues: duplicates, nulls, typos, formatting inconsistencies.
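Injecting issues on purpose might look like this (an illustrative sketch, not the project's actual `generate_messy_files.py`; the seed, row count, and column set are assumptions):

```python
import csv
import random
from datetime import date, timedelta

random.seed(42)  # deterministic test data

# Deliberately includes typo'd and mixed-case statuses.
STATUSES = ["Completed", "Compleeted", "shiped", "Pending", "CANCELLED"]
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]  # the three formats the pipeline must handle

rows = []
for i in range(1, 11):
    d = date(2024, 1, 1) + timedelta(days=random.randint(0, 60))
    fmt = random.choice(DATE_FORMATS)                       # inconsistent date format
    price = random.choice([f"${random.randint(100, 2000)}.00", "NULL"])  # $ prefix or null
    rows.append([f"ORD-{i:03d}", d.strftime(fmt), price, random.choice(STATUSES)])

rows.append(rows[0])  # inject a duplicate order ID

with open("orders_messy.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "date", "price", "status"])
    writer.writerows(rows)
```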
2. ETL Pipeline (clean_and_report.py)
- 8 labeled sections for maintainability
- Dual logging: file (etl_run.log) + stdout
- Error handling: Graceful skip on missing files
- Metrics tracking: Every issue count logged
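The dual-logging setup described above is standard library `logging` with two handlers on one logger, roughly like this (a sketch under assumed names, not the project's exact configuration):

```python
import logging
import sys

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

# Handler 1: persistent audit trail on disk.
file_handler = logging.FileHandler("etl_run.log")
file_handler.setFormatter(fmt)

# Handler 2: live progress in the terminal.
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(fmt)

logger.addHandler(file_handler)
logger.addHandler(stream_handler)

# Every metric goes through the same call, so file and stdout never diverge.
logger.info("Duplicate order IDs removed: %d", 40)
```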
3. Weekly Report
Auto-generated with:
- Executive summary (rows processed, issues fixed)
- Data quality breakdown per file
- Top 3 recurring issues with fix recommendations
- Time saved calculation
Sample Output
========================================
WEEKLY OPERATIONS REPORT
Generated: 2024-03-18 09:02:31
========================================
EXECUTIVE SUMMARY
-----------------
Total orders processed: 780
Orders with issues fixed: 1,739 (222.9%)
Inventory items needing reorder: 10
Employee records cleaned: 138
Estimated manual time saved: 28.8 hours
DATA QUALITY ISSUES FOUND
-------------------------
Orders:
- Duplicate order IDs removed: 40
- Date format inconsistencies fixed: 560
- Price formatting issues fixed: 819
- Status typos corrected: 290
TOP 3 RECURRING ISSUES (action required)
-----------------------------------------
1. Date format inconsistency: 560 instances
   → Standardize order system to ISO 8601
2. Price stored as text with $: 819 instances
   → Export numeric values without formatting
3. Hours > 24 in records: 15 instances
   → Add HR system validation

Time Saved Breakdown
| Task | Manual | Automated | Weekly Saving |
|---|---|---|---|
| Clean orders.csv | 60 min | 8 sec | ~59 min |
| Clean inventory.csv | 45 min | 5 sec | ~44 min |
| Clean employees.csv | 30 min | 3 sec | ~29 min |
| Generate report | 45 min | 2 sec | ~44 min |
| Total | 180 min | 18 sec | ~179 min |
Annual impact: 156 hours saved (~₹62,400 at analyst rates)
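The headline numbers check out arithmetically (note the ₹400/hour rate below is inferred from the quoted figures, not stated in the source):

```python
# 3 hours of manual work replaced each Monday, 52 weeks a year.
weekly_minutes = 180
annual_hours = weekly_minutes * 52 / 60   # 156.0 hours per year

# Implied analyst rate: ₹62,400 / 156 h = ₹400/hour.
hourly_rate = 400
annual_savings = annual_hours * hourly_rate

print(annual_hours, annual_savings)  # 156.0 62400.0
```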
Stack & Skills
Python · Pandas · ETL pipeline · Data quality · Logging · Automation · Operations analytics
Run It Yourself
# Clone and setup
git clone https://github.com/Swethavenk79/etl-automation.git
cd etl-automation
pip install -r requirements.txt
# Generate messy data
python data/generate_messy_files.py
# Run pipeline
python etl/clean_and_report.py
# View report
cat reports/weekly_report.txt

Why This Matters
Most companies have messy data. Most analysts spend hours cleaning it manually. This project shows I can:
- Identify systemic data quality issues, not just fix symptoms
- Quantify business impact: time saved, cost savings
- Build maintainable automation: clear code, logging, documentation
- Think beyond the fix: the "Top 3 Recurring Issues" section proves I understand prevention beats cure
The interactive demo above? That's the actual transformation logic running in your browser. Hiring managers can see it work instead of just reading about it.