Kranthi Chaithanya
Thota
01. About Me
I'm a Data & GIS Engineer operating at the intersection of cloud engineering, geospatial analytics, and machine learning. Currently at the University of North Carolina at Chapel Hill, I build production-grade data infrastructure that powers real analytical decisions.
My work spans the full data lifecycle — from architecting medallion warehouses in Snowflake and engineering PySpark pipelines over 50M+ rows on AWS EMR, to deploying computer vision models on UAV footage and building geospatial decision tools for government stakeholders.
I thrive on automation, clean infrastructure-as-code, and turning raw, messy data into reliable, fast, queryable systems. Whether it's shaving 55% off a Spark job runtime or fine-tuning a YOLOv8 model to 95% parking detection accuracy, I care about measurable outcomes.
Years Experience
Rows Processed
NER F1 Score
CV Accuracy
Setup Time Cut
Spark Runtime Cut
Students Mentored
Cloud Certifications
02. Technical Skills
Languages
Cloud & Infrastructure
Data Engineering
Databases
GIS & Geospatial
ML & AI
DevOps & Visualization
03. Experience
Data Engineer
@ University of North Carolina at Chapel Hill · Chapel Hill, NC
- Engineered a PySpark feature pipeline on AWS EMR over 50M+ rows using broadcast joins and dynamic partition pruning, cutting job runtime by 55%; persisted to Delta Lake with Z-ordering, reducing ML feature query scan cost by 70%.
- Architected a containerized microservice for unstructured data ingestion (OCR), generalizing across 300+ document formats with fault-tolerant retry logic achieving 95% reliability.
- Engineered a Star Schema data warehouse consolidating 5+ heterogeneous federal datasets; reduced average query runtime by 60% via materialized views for low-latency analytics.
- Implemented automated data observability using statistical drift detection, reducing downstream data cleaning latency by 40% by catching schema violations at ingestion time.
- Refactored legacy monolithic scripts into modular Python apps deployed on Kubernetes; established CI/CD pipelines enabling zero-downtime deployments.
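To illustrate the fault-tolerant retry logic mentioned for the ingestion microservice, here is a minimal sketch that assumes exponential backoff. The service's actual retry policy is not described here, so the names `retry` and `flaky_ocr`, the attempt count, and the delays are all hypothetical.

```python
import time


def retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # back off before the next try


# Hypothetical flaky step that fails twice before succeeding.
calls = {"n": 0}


def flaky_ocr():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "parsed"


# The sleep function is injected so the example runs instantly.
result = retry(flaky_ocr, sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.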
Data Engineer Intern
@ Clarkson CEM Consulting Group · Potsdam, NY
- Built and fine-tuned a YOLOv8 + OpenCV computer vision pipeline analyzing UAV drone footage, achieving 95% accuracy in real-time parking occupancy detection; projected 20% efficiency improvement in campus resource allocation.
- Designed scalable ETL pipelines processing 6+ years of scheduling data; identified underutilized resources, saving an estimated 200+ hours of manual effort annually.
- Conducted roadkill hotspot analysis using ArcGIS Pro and spatial statistical methods, delivering actionable geospatial findings to stakeholders.
Graduate Research & Teaching Assistant
@ Clarkson University · Potsdam, NY
- Designed and delivered graduate-level coursework on Data Warehousing and Relational Database Systems (SQL, PostgreSQL, Snowflake); mentored 50+ students.
- Engineered an automated pipeline using BeautifulSoup to scrape 10,000+ NYSERDA grant records and fine-tuned a Hugging Face Transformer for NER, achieving a 90% F1-score.
- Automated migration of 20+ years of legacy IPEDS data into centralized warehouse; built 30+ KPI dashboards benchmarking peer institutions, reducing retrieval time by 60%.
- Engineered Power BI analytics platform with forecasting models across 10+ departments and 500+ employees, saving 150+ analyst hours annually.
Associate Data Engineer
@ Egen (SpringML) · Hyderabad, India
- Engineered end-to-end geospatial data pipeline for NY Power Authority (NYPA) EV charger site planning: acquired 10+ public GIS datasets via QGIS, BigQuery, and Dataflow; built scoring algorithm powering a map-based decision tool for government stakeholders.
- Architected medallion data warehouse in Snowflake using dbt (40+ models, schema tests, GitHub Actions CI); implemented Streams/Tasks for CDC, reducing ELT latency from hourly to sub-5-minute incremental loads.
- Built Document AI pipeline ingesting 2,000+ patent documents with daily ingestion of 50+ new docs, generating 25+ KPIs via Cloud Run; reduced manual patent review time by 80%.
- Automated provisioning of 50+ GCP resources via Terraform with CI/CD, reducing environment setup time by 70% and enforcing security and compliance controls.
- Optimized multi-stage Apache Airflow DAGs via dynamic task generation, reducing end-to-end pipeline runtime by 33% (6h → 4h).
04. Projects
Real-Time Reddit Stock Sentiment Tracker
Low-latency Python streaming pipeline processing 100+ comments/sec with tumbling window logic. Surfaces breakout tickers in under 5 seconds — 95% latency reduction vs. batch via TimescaleDB time-series storage.
- Python
- TimescaleDB
- Reddit API
- Streaming
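For readers unfamiliar with tumbling windows, here is a minimal sketch of the idea: each event falls into exactly one fixed, non-overlapping time bucket. This is not the tracker's code. The 5-second window size, the function names, and the sample tickers are assumptions made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 5  # assumed window size, not taken from the project


def window_start(timestamp: float, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the fixed, non-overlapping window containing timestamp."""
    return int(timestamp // size) * size


def count_mentions(comments, size: int = WINDOW_SECONDS):
    """Group (timestamp, ticker) pairs into tumbling windows and count tickers per window."""
    windows = defaultdict(Counter)
    for timestamp, ticker in comments:
        windows[window_start(timestamp, size)][ticker] += 1
    return dict(windows)


# Hypothetical sample events: (seconds since stream start, ticker symbol)
sample = [(0.4, "AAA"), (1.9, "AAA"), (4.8, "BBB"), (5.1, "AAA"), (9.9, "BBB")]
counts = count_mentions(sample)
```

Because windows never overlap, each comment is counted once, which keeps per-window ticker counts directly comparable.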
NYPA EV Charger Geospatial Pipeline
End-to-end geospatial pipeline for NY Power Authority EV charger site planning. Processed 10+ public GIS datasets via QGIS, BigQuery, and Dataflow; scoring algorithm and map-based decision tool used by government stakeholders.
- QGIS
- BigQuery
- Dataflow
- Python
- GIS
HAVK Mladost Sports Club Data Infrastructure
Normalized PostgreSQL schema with 15+ interrelated tables and automated Python ETL pipeline. Replaced fragmented Excel workflows — reduced manual data handling by 95% and improved membership renewal rates by 30% via Looker Studio KPI dashboards.
- PostgreSQL
- Python
- Looker Studio
- ETL
Serverless Data Job Deployment Framework
Reusable IaC framework cutting new data job deployment time by 80% (3+ hours → <10 min) across 3 teams. Standardized IAM roles, secret management, and compliance enforcement built directly into Terraform modules.
- Terraform
- GCP
- GitHub Actions
- IaC
AI Therapeutic Chatbot
Full-stack chatbot (Ollama/Gemma, Flask, Voice I/O) enhancing senior mental well-being via personalized conversational AI.
- Python
- LLM
- Flask
- Ollama
Healthcare RAG QA System
RAG pipeline (Transformers, Vector DBs) for accurate, context-aware healthcare question answering with source citation.
- Hugging Face
- RAG
- Vector DB
- NLP
NYSERDA Grant Pipeline & NER (90% F1)
Automated scraping (BeautifulSoup) and NER (fine-tuned Hugging Face Transformer) across 10,000+ grant records with 90% F1-score on organization, funding, and project entities.
- Beautiful Soup
- Hugging Face
- PostgreSQL
- NER
DeepFake Detection (85%+ Accuracy)
EfficientNet model with 15% enhancement via Adaptive ELA preprocessing for robust deepfake image classification.
- EfficientNet
- TensorFlow
- PyTorch
- CV
YOLOv8 Parking Detection (95% Acc)
Fine-tuned YOLOv8 + OpenCV pipeline for real-time parking occupancy detection from UAV drone footage; projected 20% efficiency improvement in campus resource allocation.
- YOLOv8
- OpenCV
- Python
- UAV
Patent Data Pipeline (Document AI)
Scalable Document AI pipeline ingesting 2,000+ patent PDFs with daily ingestion of 50+ new docs; generated 25+ KPIs deployed via Cloud Run, reducing manual review time by 80%.
- GCP
- Document AI
- BigQuery
- Cloud Run
Medallion Data Warehouse (Snowflake + dbt)
Architected medallion DW in Snowflake using dbt (40+ models, GitHub Actions CI); implemented Streams/Tasks for CDC, reducing ELT latency from hourly batch to sub-5-minute incremental loads.
- Snowflake
- dbt
- CDC
- GitHub Actions
IPEDS Legacy Data Migration
Automated migration of 20+ years of legacy IPEDS data into a centralized warehouse; developed 30+ KPI dashboards benchmarking peer institutions, reducing manual retrieval time by 60%.
- Python
- SQL Server
- Power BI
- ETL
Multi-Cloud Infrastructure (IaC, 95%+ Auto)
Provisioned production infrastructure (VPC, compute, DBs) across GCP/AWS via Terraform; automated 50+ resources with CI/CD via Cloud Build, reducing setup time by 70%.
- Terraform
- GCP
- AWS
- Multi-Cloud
A2A Roadkill Hotspot Analysis
Statistical analysis on roadkill data using drone imagery and spatial statistical methods (ArcGIS Pro) for wildlife corridor hotspot identification.
- ArcGIS Pro
- R
- Spatial Stats
- GIS
Town of Colton Complete Streets Plan
Developed comprehensive transportation plan via survey analysis and GIS visualization using ArcGIS Pro and QGIS.
- ArcGIS Pro
- QGIS
- Survey Analysis
- GIS
BRFSS Health Risk Analysis (80% Acc)
Analyzed 400k+ health records (ANOVA, Chi-Sq) to identify key cancer risk factors using TensorFlow classification models.
- Python
- TensorFlow
- Statistics
- Healthcare
Clarkson Traffic/Parking Dashboard
Interactive Tableau dashboards visualizing campus traffic and parking data from drone and metrocount sensors.
- Tableau
- GIS
- Python
- Dashboard
Automated Instagram Bot (45% Engagement)
GPT-3.5 pipeline automating Instagram posts via API with NLP trend analysis, increasing engagement by 45%.
- Python
- GPT-3.5
- API
- NLP
OTT Web Platform (Netflix Clone)
Responsive streaming web application built with React, Firebase, and Node.js with real-time database and authentication.
- React
- Node.js
- Firebase
- Web Dev
Wine Quality Prediction
EDA and predictive modeling to determine physicochemical factors influencing wine quality using ensemble ML algorithms.
- Python
- Scikit-learn
- Pandas
- ML
05. Education & Certifications
M.S. Applied Data Science
Clarkson University, Potsdam, NY
Jan 2024 – Aug 2025
Data Warehousing · Big Data Architecture · Cloud Computing · Data Mining · GIS & Spatial Analysis
Certifications
Google Cloud Professional Data Engineer
Google Cloud
Google Cloud Associate Cloud Engineer
Google Cloud
HashiCorp Certified Terraform Associate
HashiCorp
06. Contact
Let's build something great together.
I'm actively seeking full-time Data Engineering and GIS Engineering roles where I can contribute expertise in cloud-native pipelines, geospatial analytics, and AI/ML systems. Whether you have a specific project in mind or just want to connect — my inbox is open.
Say Hello
$ whoami
Kranthi Chaithanya Thota
$ location
Chapel Hill, NC
$ status
open to opportunities