About Me
I am a seasoned Site Reliability Engineering Lead / Senior SRE with over 11 years of experience supporting large-scale production environments across financial services and logistics platforms. I specialize in building fault-tolerant, scalable infrastructure and ensuring operational excellence for mission-critical applications.
My expertise spans Microsoft Azure and Google Cloud Platform (GCP), with deep hands-on experience in Kubernetes (GKE) environments and enterprise observability platforms including Datadog (APM, Tracing, DBM, RUM), Splunk, and Dynatrace. I serve as the primary Datadog specialist, managing end-to-end observability pipelines and driving telemetry hygiene through OpenTelemetry adoption.
As an Incident Commander, I lead triage for high-severity outages and drive post-incident root cause analysis (RCA) to ensure continuous improvement. I've successfully reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards based on the "Golden Signals" framework.
I'm proficient in Infrastructure as Code using Terraform and Ansible, with strong scripting skills in Python, PowerShell, and Shell. I hold the Microsoft Certified: Azure Administrator Associate (AZ‑104) and Datadog Fundamentals certifications. I'm passionate about reducing toil through automation, optimizing cloud costs, and mentoring engineering teams on observability best practices.
Technical Proficiency
Cloud
Containers
Observability
Automation / IaC
ITSM / Incident
CI/CD
SRE Practice
Platforms
Certifications
Professional Experience
Site Reliability Engineer
Aug 2025 – Present
- On-Prem to GCP Migration: Spearheaded the comprehensive migration of monitoring infrastructure from legacy on-premises data centers to Google Cloud Platform (GCP), ensuring zero observability gaps.
- Datadog Observability Transformation: Re-architected the monitoring landscape by migrating legacy ELK stack logs and on-prem metrics into Datadog, centralizing telemetry for high-scale GCP workloads.
- High Availability & Error Budgets: Maintained 99.99% availability for mission-critical logistics platforms and managed error budgets to balance feature velocity with production stability.
- Infrastructure as Code (IaC): Utilized Terraform to architect and provision multi-region Kubernetes (GKE) clusters, implementing modular code to ensure consistent environment states.
- Chaos Engineering Implementation: Enhanced system resilience by conducting scheduled fault-injection experiments using Chaos Mesh, successfully identifying and mitigating single points of failure.
- Kubernetes Scalability: Optimized application performance via manual and automated scaling (HPA/VPA) of workloads within Kubernetes to handle unpredictable traffic spikes.
- CI/CD Pipeline Management: Orchestrated automated deployment pipelines using Jenkins and GitHub, incorporating automated testing to reduce deployment-related incidents.
- Incident Command & RCA: Acted as Incident Commander for high-severity production outages, coordinating cross-functional teams and using ServiceNow for root cause analysis (RCA).
- Alert Optimization: Leveraged Datadog Watchdog and AI-driven alerting to reduce alert noise from 40% to 10%, focusing the team on actionable events.
Site Reliability Engineering Lead
Aug 2023 – Jul 2024
- Strategic Team Leadership: Led a high-performing 15-member SRE team responsible for the 24/7 reliability of enterprise-level financial and mortgage systems.
- Enterprise Monitoring Overhaul: Designed and implemented a comprehensive enterprise monitoring strategy utilizing Azure Monitor and PagerDuty, increasing visibility into legacy applications.
- Service Level Definition: Established robust service level indicators (SLIs) and service level objectives (SLOs) to align IT performance with business expectations.
- Operational Automation: Significantly reduced manual toil by automating repetitive operational tasks and diagnostic workflows using PowerShell and Python scripting.
- Cross-Team Incident Coordination: Managed high-severity production responses, facilitating blameless post-mortems and coordinating long-term stability fixes.
Technology Operations Associate
Oct 2017 – Oct 2022
- Infrastructure Maintenance: Maintained enterprise infrastructure supporting mission-critical banking applications.
- Health Monitoring: Monitored performance across Windows and VMware environments, automating diagnostics with PowerShell.
- Collaboration: Partnered with network and database teams to resolve complex production incidents.
System Administrator
Sep 2015 – Oct 2017
- Server Management: Managed Windows and Linux production servers in a 35-member command center for financial clients.
- Hardware Operations: Resolved hardware issues via iLO, DRAC, and SMH; managed security compliance and tool integration.
Support Engineer
Nov 2014 – Sep 2015
- Production Support: Monitored server stability (CPU/Disk/Memory) and managed VSS backups for United Health Care and GHX.
- Technical Support: Resolved infrastructure alerts and handled incident ticketing through Kayako and XSmart-control.
Assistant Engineer
Jun 2013 – Aug 2014
- Systems Deployment: Installed and upgraded healthcare IT systems (Meditos) and managed asset inventory.
- Field Support: Configured engineering software solutions and assisted in resolving escalated technical issues.
Education
Bachelor of Technology (Information Technology)
Get in Touch
I'm currently open to new opportunities as a Site Reliability Engineering Lead. Whether you have a question or just want to connect, feel free to reach out!