Resume


Kranthi Chaithanya Thota


Data & GIS Engineer with 4 years building cloud-native data platforms, geospatial pipelines, and AI/ML systems. Deep expertise in GCP · Python · SQL · PostgreSQL. Delivering scalable ETL/ELT, real-time streaming, and spatial analytics across government, infrastructure, and enterprise domains.

GCP Certified · Terraform Certified · M.S. 4.0 GPA · GIS Engineer

01. About Me

I'm a Data & GIS Engineer operating at the intersection of cloud engineering, geospatial analytics, and machine learning. Currently at the University of North Carolina at Chapel Hill, I build production-grade data infrastructure that powers real analytical decisions.

My work spans the full data lifecycle — from architecting medallion warehouses in Snowflake and engineering PySpark pipelines over 50M+ rows on AWS EMR, to deploying computer vision models on UAV footage and building geospatial decision tools for government stakeholders.

I thrive on automation, clean infrastructure-as-code, and turning raw, messy data into reliable, fast, queryable systems. Whether it's shaving 55% off a Spark job's runtime or fine-tuning a YOLOv8 model to 95% parking-detection accuracy, I care about measurable outcomes.

Chapel Hill, NC
4+ Years Experience
50M+ Rows Processed
90% NER F1 Score
95% CV Accuracy
70% Setup Time Cut
55% Spark Runtime Cut
50+ Students Mentored
3 Cloud Certifications

02. Technical Skills

Languages

Python · SQL · R · Go · JavaScript · C++ · Bash · PL/pgSQL

Cloud & Infrastructure

GCP · AWS · BigQuery · Dataflow · Cloud Run · Vertex AI · AWS EMR · Redshift · SageMaker · Terraform · Docker · Kubernetes

Data Engineering

PySpark · Apache Kafka · Apache Airflow · dbt · Snowflake · Delta Lake · ETL/ELT Design · Star Schema · Medallion Arch. · CDC · Data Modeling · REST APIs

Databases

PostgreSQL · BigQuery · Snowflake · Redshift · MongoDB · MySQL · TimescaleDB · SQL Server

GIS & Geospatial

ArcGIS Pro · ArcPy · QGIS · PostGIS · Spatial SQL · Network Analysis · Rasterization · Georeferencing · Deep Learning for GIS

ML & AI

PyTorch · TensorFlow · BERT / LLMs · RAG · NLP · YOLOv8 · OpenCV · EfficientNet · Hugging Face · Ollama · Scikit-learn · RL

DevOps & Visualization

CI/CD · GitHub Actions · Cloud Build · Jenkins · Tableau · Power BI · Looker · Grafana · Matplotlib · Plotly · D3.js

03. Experience

Data Engineer

@ University of North Carolina at Chapel Hill · Chapel Hill, NC
Sep 2025 – Present
  • Engineered a PySpark feature pipeline on AWS EMR over 50M+ rows using broadcast joins and dynamic partition pruning, cutting job runtime by 55%; persisted features to Delta Lake with Z-ordering, reducing ML feature query scan cost by 70%.
  • Architected a containerized microservice for unstructured data ingestion (OCR), generalizing across 300+ document formats with fault-tolerant retry logic achieving 95% reliability.
  • Engineered a Star Schema data warehouse consolidating 5+ heterogeneous federal datasets; reduced average query runtime by 60% via materialized views for low-latency analytics.
  • Implemented automated data observability using statistical drift detection, reducing downstream data cleaning latency by 40% by catching schema violations at ingestion time.
  • Refactored legacy monolithic scripts into modular Python apps deployed on Kubernetes; established CI/CD pipelines enabling zero-downtime deployments.
PySpark · AWS EMR · Delta Lake · Kubernetes · Python · OCR · CI/CD

Data Engineer Intern

@ Clarkson CEM Consulting Group · Potsdam, NY
May 2025 – Aug 2025
  • Built and fine-tuned a YOLOv8 + OpenCV computer vision pipeline analyzing UAV drone footage, achieving 95% accuracy in real-time parking occupancy detection; projected 20% efficiency improvement in campus resource allocation.
  • Designed scalable ETL pipelines processing 6+ years of scheduling data; identified underutilized resources, saving an estimated 200+ hours of manual effort annually.
  • Conducted roadkill hotspot analysis using ArcGIS Pro and spatial statistical methods, delivering actionable geospatial findings to stakeholders.
YOLOv8 · OpenCV · ArcGIS Pro · Python · ETL · GIS

Graduate Research & Teaching Assistant

@ Clarkson University · Potsdam, NY
Feb 2024 – Aug 2025
  • Designed and delivered graduate-level coursework on Data Warehousing and Relational Database Systems (SQL, PostgreSQL, Snowflake); mentored 50+ students.
  • Engineered an automated pipeline using Beautiful Soup to scrape 10,000+ NYSERDA grant records and fine-tuned a Hugging Face Transformer for NER, achieving 90% F1-score.
  • Automated migration of 20+ years of legacy IPEDS data into centralized warehouse; built 30+ KPI dashboards benchmarking peer institutions, reducing retrieval time by 60%.
  • Engineered Power BI analytics platform with forecasting models across 10+ departments and 500+ employees, saving 150+ analyst hours annually.
Python · Power BI · Hugging Face · PostgreSQL · Snowflake · NER

Associate Data Engineer

@ Egen (SpringML) · Hyderabad, India
Jan 2022 – Dec 2023
  • Engineered an end-to-end geospatial data pipeline for NY Power Authority (NYPA) EV charger site planning: acquired 10+ public GIS datasets via QGIS, BigQuery, and Dataflow, and built a scoring algorithm powering a map-based decision tool for government stakeholders.
  • Architected medallion data warehouse in Snowflake using dbt (40+ models, schema tests, GitHub Actions CI); implemented Streams/Tasks for CDC, reducing ELT latency from hourly to sub-5-minute incremental loads.
  • Built Document AI pipeline ingesting 2,000+ patent documents with daily ingestion of 50+ new docs, generating 25+ KPIs via Cloud Run; reduced manual patent review time by 80%.
  • Automated provisioning of 50+ GCP resources via Terraform with CI/CD, reducing environment setup time by 70% and enforcing security and compliance controls.
  • Optimized multi-stage Apache Airflow DAGs via dynamic task generation, reducing end-to-end pipeline runtime by 33% (6h → 4h).
GCP · Snowflake · dbt · Terraform · Airflow · BigQuery · QGIS · Document AI · Flask

04. Projects

Real-Time Reddit Stock Sentiment Tracker

Low-latency Python streaming pipeline processing 100+ comments/sec with tumbling window logic. Surfaces breakout tickers in under 5 seconds, a 95% latency reduction versus batch processing, with TimescaleDB for time-series storage.

  • Python
  • TimescaleDB
  • Reddit API
  • Streaming

NYPA EV Charger Geospatial Pipeline

End-to-end geospatial pipeline for NY Power Authority EV charger site planning. Processed 10+ public GIS datasets via QGIS, BigQuery, and Dataflow; scoring algorithm and map-based decision tool used by government stakeholders.

  • QGIS
  • BigQuery
  • Dataflow
  • Python
  • GIS

HAVK Mladost Sports Club Data Infrastructure

Normalized PostgreSQL schema with 15+ interrelated tables and an automated Python ETL pipeline. Replaced fragmented Excel workflows, reducing manual data handling by 95% and improving membership renewal rates by 30% via Looker Studio KPI dashboards.

  • PostgreSQL
  • Python
  • Looker Studio
  • ETL

Serverless Data Job Deployment Framework

Reusable IaC framework cutting new data job deployment time by 80% (3+ hours → <10 min) across 3 teams. Standardized IAM roles, secret management, and compliance enforcement built directly into Terraform modules.

  • Terraform
  • GCP
  • GitHub Actions
  • IaC

AI Therapeutic Chatbot

Full-stack chatbot (Ollama/Gemma, Flask, Voice I/O) enhancing senior mental well-being via personalized conversational AI.

  • Python
  • LLM
  • Flask
  • Ollama

Healthcare RAG QA System

RAG pipeline (Transformers, Vector DBs) for accurate, context-aware healthcare question answering with source citation.

  • Hugging Face
  • RAG
  • Vector DB
  • NLP

NYSERDA Grant Pipeline & NER (90% F1)

Automated scraping (Beautiful Soup) and NER (fine-tuned Hugging Face Transformer) across 10,000+ grant records with 90% F1-score on organization, funding, and project entities.

  • Beautiful Soup
  • Hugging Face
  • PostgreSQL
  • NER

DeepFake Detection (85%+ Accuracy)

EfficientNet model with 15% enhancement via Adaptive ELA preprocessing for robust deepfake image classification.

  • EfficientNet
  • TF
  • PyTorch
  • CV

YOLOv8 Parking Detection (95% Acc)

Fine-tuned YOLOv8 + OpenCV pipeline for real-time parking occupancy detection from UAV drone footage; projected 20% efficiency improvement in campus resource allocation.

  • YOLOv8
  • OpenCV
  • Python
  • UAV

Patent Data Pipeline (Document AI)

Scalable Document AI pipeline ingesting 2,000+ patent PDFs with daily ingestion of 50+ new docs; generated 25+ KPIs deployed via Cloud Run, reducing manual review time by 80%.

  • GCP
  • Document AI
  • BigQuery
  • Cloud Run

Medallion Data Warehouse (Snowflake + dbt)

Architected medallion DW in Snowflake using dbt (40+ models, GitHub Actions CI); implemented Streams/Tasks for CDC, reducing ELT latency from hourly batch to sub-5-minute incremental loads.

  • Snowflake
  • dbt
  • CDC
  • GitHub Actions

IPEDS Legacy Data Migration

Automated migration of 20+ years of legacy IPEDS data into a centralized warehouse; developed 30+ KPI dashboards benchmarking peer institutions, reducing manual retrieval time by 60%.

  • Python
  • SQL Server
  • Power BI
  • ETL

Multi-Cloud Infrastructure (IaC, 95%+ Automated)

Provisioned production infrastructure (VPC, compute, DBs) across GCP/AWS via Terraform; automated 50+ resources with CI/CD via Cloud Build, reducing setup time by 70%.

  • Terraform
  • GCP
  • AWS
  • Multi-Cloud

A2A Roadkill Hotspot Analysis

Statistical analysis on roadkill data using drone imagery and spatial statistical methods (ArcGIS Pro) for wildlife corridor hotspot identification.

  • ArcGIS Pro
  • R
  • Spatial Stats
  • GIS

Town of Colton Complete Streets Plan

Developed comprehensive transportation plan via survey analysis and GIS visualization using ArcGIS Pro and QGIS.

  • ArcGIS Pro
  • QGIS
  • Survey Analysis
  • GIS

BRFSS Health Risk Analysis (80% Acc)

Analyzed 400k+ health records (ANOVA, chi-square tests) to identify key cancer risk factors using TensorFlow classification models.

  • Python
  • TensorFlow
  • Statistics
  • Healthcare

Clarkson Traffic/Parking Dashboard

Interactive Tableau dashboards visualizing campus traffic and parking data from drone and MetroCount sensors.

  • Tableau
  • GIS
  • Python
  • Dashboard

Automated Instagram Bot (45% Engagement)

GPT-3.5 pipeline automating Instagram posts via API with NLP trend analysis, increasing engagement by 45%.

  • Python
  • GPT-3.5
  • API
  • NLP

OTT Web Platform (Netflix Clone)

Responsive streaming web application built with React, Firebase, and Node.js with real-time database and authentication.

  • React
  • Node.js
  • Firebase
  • Web Dev

Wine Quality Prediction

EDA and predictive modeling to determine physicochemical factors influencing wine quality using ensemble ML algorithms.

  • Python
  • Scikit-learn
  • Pandas
  • ML

RL for Autonomous Quadcopter Control

Applied Reinforcement Learning in simulation to train an agent for quadcopter stabilization and autonomous control.

  • RL
  • Python
  • Simulation

05. Education & Certifications

M.S. Applied Data Science

Clarkson University, Potsdam, NY

Jan 2024 – Aug 2025

GPA: 4.0 / 4.0

Data Warehousing · Big Data Architecture · Cloud Computing · Data Mining · GIS & Spatial Analysis

Certifications

Google Cloud Professional Data Engineer

Google Cloud

Google Cloud Associate Cloud Engineer

Google Cloud

HashiCorp Certified Terraform Associate

HashiCorp

06. Contact

Let's build something great together.

I'm actively seeking full-time Data Engineering and GIS Engineering roles where I can contribute expertise in cloud-native pipelines, geospatial analytics, and AI/ML systems. Whether you have a specific project in mind or just want to connect, my inbox is open.


$ whoami

Kranthi Chaithanya Thota

$ location

Chapel Hill, NC

$ email

$ status

open to opportunities