What Building My First ETL Pipeline Taught Me About Real Data Engineering

Software Engineer with experience in Java and SQL. Currently exploring Data Engineering, Distributed Systems, and scalable backend architectures through hands-on projects, system design exercises, and technical writing. Interested in Kafka, Spark, Airflow, data pipelines, and large-scale system design. This blog documents my learning journey, project insights, and lessons from building and studying real-world systems.

When I first started learning Data Engineering, ETL pipelines looked deceptively simple.

Extract data. Transform it. Load it somewhere useful.

That was the theory.

But after building my first end-to-end ETL pipeline on real-world Australian weather data, I realized that modern data systems demand far more engineering judgment than I had expected.

Over the past few weeks, I built a complete ETL workflow processing 145,460 rows of weather data spanning 10 years using:

  • Python

  • Pandas

  • PyArrow

  • Google BigQuery

  • Apache Airflow

  • Docker

The goal wasn't just to complete a project.

I wanted to understand how real data systems ingest, validate, transform, orchestrate, and prepare data for downstream systems.


Why I Broke the Project Into Smaller Engineering Problems

One mistake I made initially was trying to think about the entire system at once.

Questions started piling up quickly:

  • How should ingestion work?

  • What happens when data is inconsistent?

  • How should transformations be validated?

  • How should orchestration behave?

  • How do downstream systems trust the processed data?

Instead of treating the ETL workflow as one giant project, I broke it into smaller engineering exercises.

That decision changed the learning experience completely.


Phase 1 — Extraction

The extraction phase focused on:

  • reading raw CSV datasets,

  • validating ingestion,

  • handling missing values,

  • and organizing records for processing.

At first, this stage looked straightforward.

But I quickly realized that raw data is rarely clean.

Even before transformation begins, ingestion systems need to think about:

  • consistency,

  • validation,

  • reliability,

  • and structure.
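A minimal sketch of what this kind of extraction check could look like with Pandas. The column names in `REQUIRED_COLUMNS` and the in-memory sample are assumptions made for illustration, not the project's actual schema or code.

```python
import io

import pandas as pd

# Hypothetical required columns; the real dataset's schema may differ.
REQUIRED_COLUMNS = ["Date", "Location", "MinTemp", "MaxTemp"]


def extract(source) -> pd.DataFrame:
    """Read a raw CSV and fail fast if its basic structure is wrong."""
    df = pd.read_csv(source)

    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if df.empty:
        raise ValueError("source contained no rows")
    return df


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Count nulls per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Tiny in-memory sample standing in for the real file.
SAMPLE_CSV = """Date,Location,MinTemp,MaxTemp
2015-01-01,Sydney,18.2,27.5
2015-01-02,Sydney,,29.1
2015-01-03,Perth,20.0,
"""

sample = extract(io.StringIO(SAMPLE_CSV))
print(missing_value_report(sample))
```

Failing early on a missing column keeps a malformed file from reaching the transformation stage, and the null report gives a baseline to compare against after cleaning.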


Phase 2 — Transformation

The transformation layer introduced a completely different level of complexity.

I worked on:

  • handling null values,

  • fixing inconsistent data types,

  • restructuring datasets,

  • and engineering new features.
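The post does not say which strategy was used for nulls and types, so the sketch below is only one plausible approach: drop rows with unparseable dates, coerce numeric columns and fill gaps with the column median, and label missing categories as "Unknown". The function name `clean` and the column names are hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """Return a copy with parsed dates, numeric dtypes, and nulls filled."""
    out = df.copy()

    # Unparseable dates become NaT; those rows are dropped (an assumption).
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")
    out = out.dropna(subset=["Date"]).reset_index(drop=True)

    for col in numeric_cols:
        # Strings such as "n/a" become NaN, then take the column median.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col] = out[col].fillna(out[col].median())

    for col in categorical_cols:
        out[col] = out[col].fillna("Unknown")
    return out


raw = pd.DataFrame(
    {
        "Date": ["2015-01-01", "not a date", "2015-01-03", "2015-01-04"],
        "MaxTemp": ["27.5", "30.0", "n/a", "29.5"],
        "WindDir": ["N", "S", None, "E"],
    }
)
cleaned = clean(raw, numeric_cols=["MaxTemp"], categorical_cols=["WindDir"])
print(cleaned)
print("nulls remaining:", int(cleaned.isna().sum().sum()))
```

Median imputation is a simple default; a column that is entirely null would stay null, so a real pipeline would still need a check after this step.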

Some engineered columns included:

  • temp_range

  • is_hot_day

  • season classification
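As a rough illustration, the three columns above could be derived like this. The source columns (`Date`, `MinTemp`, `MaxTemp`) and the 30 °C hot-day cutoff are assumptions; the season mapping uses the Southern Hemisphere meteorological seasons that apply in Australia.

```python
import pandas as pd

HOT_DAY_THRESHOLD_C = 30.0  # assumed cutoff, not taken from the project

# Southern Hemisphere meteorological seasons keyed by month number.
SEASON_BY_MONTH = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with temp_range, is_hot_day, and season columns added."""
    out = df.copy()
    out["temp_range"] = out["MaxTemp"] - out["MinTemp"]
    out["is_hot_day"] = out["MaxTemp"] >= HOT_DAY_THRESHOLD_C
    # Expects Date to already be a datetime column.
    out["season"] = out["Date"].dt.month.map(SEASON_BY_MONTH)
    return out


observations = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-15", "2015-07-15", "2015-10-01"]),
        "MinTemp": [20.0, 8.0, 12.5],
        "MaxTemp": [35.0, 17.0, 24.0],
    }
)
print(add_features(observations))
```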

One of the most satisfying parts of the project was converting the transformed dataset into Parquet format.

Result: 13.44 MB CSV → 2.35 MB Parquet

An 82.5% reduction in storage size.
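The reduction figure follows directly from the two file sizes. The sketch below shows the arithmetic along with one way the conversion itself might be done through Pandas and PyArrow; the helper names and the choice of snappy compression are assumptions.

```python
import os

import pandas as pd


def size_reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


def csv_to_parquet(csv_path: str, parquet_path: str) -> float:
    """Convert a CSV file to Parquet and return the size reduction in percent."""
    df = pd.read_csv(csv_path)
    # Requires the pyarrow package; snappy compression is an assumption here.
    df.to_parquet(parquet_path, engine="pyarrow", compression="snappy", index=False)
    return size_reduction_pct(os.path.getsize(csv_path), os.path.getsize(parquet_path))


# Checking the numbers reported above: (13.44 - 2.35) / 13.44
print(f"{size_reduction_pct(13.44, 2.35):.1f}%")  # 82.5%
```

Most of the saving comes from Parquet's columnar layout, which lets each column be encoded and compressed separately rather than storing every value as text.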

This stage made me realize that transformation is where much of the actual engineering thinking happens in ETL systems.

Small inconsistencies can easily propagate downstream if data quality is not handled carefully.


Phase 3 — Loading and Validation

[Figure: Transformed dataset successfully loaded into Google BigQuery.]

After transformation, the processed dataset was loaded into Google BigQuery.

I also implemented:

  • row-count verification,

  • null checks,

  • and post-load validation.
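One way these checks could be expressed is a single BigQuery query that returns the row count and per-column null counts, plus a small function that compares the result against expectations. The table ID and helper names below are hypothetical, and the query result is represented as a plain dict so the logic can be shown without a live connection.

```python
def build_validation_sql(table_id: str, columns: list[str]) -> str:
    """Build one query returning the row count and per-column null counts."""
    null_checks = ",\n  ".join(
        f"COUNTIF({col} IS NULL) AS {col}_nulls" for col in columns
    )
    return f"SELECT\n  COUNT(*) AS row_count,\n  {null_checks}\nFROM `{table_id}`"


def validate_load(result: dict, expected_rows: int) -> list[str]:
    """Compare a result row against expectations; return a list of problems."""
    problems = []
    if result["row_count"] != expected_rows:
        problems.append(f"row count {result['row_count']} != expected {expected_rows}")
    for name, value in result.items():
        if name.endswith("_nulls") and value > 0:
            problems.append(f"{name} = {value}")
    return problems


# Hypothetical table ID, for illustration only.
sql = build_validation_sql("my-project.weather.daily_observations", ["MinTemp", "MaxTemp"])
print(sql)

# A result row shaped like the query's output.
print(validate_load({"row_count": 145460, "MinTemp_nulls": 0, "MaxTemp_nulls": 0}, 145460))
```

With the `google-cloud-bigquery` client, the query could be run with `client.query(sql).result()` and the single returned row converted to a dict before being passed to `validate_load`. An empty list means the load passed both checks.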

This phase helped me understand that ETL systems are not just about moving data.

They're also about creating trustworthy datasets for analytics and downstream applications.

[Figure: Post-load validation queries used to verify transformation quality.]

Phase 4 — Orchestration

The pipeline was orchestrated using Apache Airflow running inside Docker containers.

The DAG included:

  • scheduled execution,

  • retry handling,

  • logging,

  • and dependency management.
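A DAG definition covering those four points might look roughly like the following. This is a sketch, not the project's actual DAG: it assumes Airflow 2.4 or later in the 2.x line (where the `schedule` argument is available), and the DAG ID, task names, schedule, retry settings, and placeholder task bodies are all hypothetical.

```python
import logging
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

log = logging.getLogger(__name__)


# Placeholder task bodies; real tasks would call the pipeline's own functions.
def extract():
    log.info("extracting raw CSV")


def transform():
    log.info("cleaning data and engineering features")


def load():
    log.info("loading Parquet output into BigQuery")


def validate():
    log.info("running post-load validation queries")


default_args = {
    "retries": 2,                         # retry handling (assumed values)
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="weather_etl",                 # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # scheduled execution
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)

    # Dependency management: each stage runs only after the previous one succeeds.
    extract_task >> transform_task >> load_task >> validate_task
```

Messages written through the standard `logging` module inside a task are captured in that task's Airflow log, which is usually enough for a pipeline of this size.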

This was the stage where the project started feeling like a real production-style workflow rather than a simple script.

[Figure: Airflow DAG successfully orchestrating the ETL workflow.]

Project Statistics

✅ 145,460 rows processed

✅ 343,248 missing values handled

✅ 0 missing values remaining after transformation

✅ All Airflow tasks completed successfully


What This Project Changed For Me

Before building this pipeline, I mostly viewed ETL workflows as data movement systems.

Now I see them differently.

ETL systems are also:

  • reliability systems,

  • validation systems,

  • and preparation layers for downstream analytics.

The project also made me think more deeply about:

  • distributed systems,

  • schema design,

  • orchestration,

  • scalability,

  • and system trustworthiness.

And honestly, the best learning moments happened when things broke unexpectedly 😅


Final Thoughts

Building this project end-to-end taught me more about practical Data Engineering than tutorials alone ever could.

Breaking the pipeline into smaller engineering problems made the learning process significantly more effective and helped me appreciate the complexity behind modern data systems.

This is just the beginning of my journey into Data Engineering and distributed systems, and I'm excited to continue building and learning.


GitHub Repository

ETL Pipeline
