About Me
I am a seasoned Site Reliability Engineering Lead / Senior SRE with over 11 years of experience supporting large-scale production environments across financial services and logistics platforms. I specialize in building fault-tolerant, scalable infrastructure and ensuring operational excellence for mission-critical applications.
My expertise spans Microsoft Azure and Google Cloud Platform (GCP), with deep hands-on experience in Kubernetes (GKE) environments and enterprise observability platforms including Datadog (APM, Tracing, DBM, RUM), Splunk, and Dynatrace. I serve as the primary Datadog specialist, managing end-to-end observability pipelines and driving telemetry hygiene through OpenTelemetry adoption.
As an Incident Commander, I lead triage for high-severity outages and drive post-incident root cause analysis (RCA) to ensure continuous improvement. I've successfully reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards based on the "Golden Signals" framework.
I'm proficient in Infrastructure as Code using Terraform and Ansible, with strong scripting skills in Python, PowerShell, and Shell. I hold the Microsoft Certified: Azure Administrator Associate (AZ‑104) and Datadog Fundamentals certifications. I'm passionate about reducing toil through automation, optimizing cloud costs, and mentoring engineering teams on observability best practices.
Technical Proficiency
Cloud
Containers
Observability
Automation / IaC
ITSM / Incident
CI/CD
Platforms
Certifications
Professional Experience
Site Reliability Engineer
Aug 2025 – Present
- Datadog Platform Ownership: Serve as the primary specialist for Datadog, managing APM, distributed tracing, DBM, and log ingestion pipelines to ensure end-to-end observability.
- Observability Engineering: Improved telemetry hygiene by implementing consistent tagging, naming standards, and OpenTelemetry adoption across Kubernetes clusters.
- Incident Management & Triage: Act as Incident Commander, utilizing Datadog, Splunk, and ServiceNow to triage high-severity outages and lead post-incident root cause analysis (RCA).
- Alert Optimization: Reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards.
- Standardization & Governance: Designed and enforced monitoring standards based on "Golden Signals," providing real-time visibility through custom dashboards for leadership and engineering teams.
- Cost & Performance Management: Manage cost visibility and optimization for Datadog logs and APM usage while coaching application teams on observability best practices.
- Operational Readiness: Support CI/CD pipelines and participate in DR testing and operational readiness reviews to ensure continuous improvement of SRE practices.
Site Reliability Engineering Lead
Aug 2023 – Jul 2024
- Team Leadership: Led a 15‑member SRE team responsible for reliability of enterprise financial systems.
- Incident Response: Managed incident response for high‑severity production issues and coordinated cross‑team resolution.
- Monitoring Strategy: Implemented monitoring strategy using Azure Monitor and PagerDuty.
- Service Levels: Defined service level indicators (SLIs) and service level objectives (SLOs).
- Automation: Automated operational tasks using PowerShell scripting.
- Post-Mortems: Led post‑incident reviews and root cause analysis.
Technology Operations Associate
Oct 2017 – Oct 2022
- Infrastructure Maintenance: Maintained enterprise infrastructure supporting mission-critical banking applications.
- Health Monitoring: Monitored performance across Windows and VMware environments, automating diagnostics with PowerShell.
- Collaboration: Partnered with network and database teams to resolve complex production incidents.
System Administrator
Sep 2015 – Oct 2017
- Server Management: Managed Windows and Linux production servers in a 35-member command center for financial clients.
- Hardware Operations: Resolved hardware issues via iLO, DRAC, and SMH; managed security compliance and tool integration.
Support Engineer
Nov 2014 – Sep 2015
- Production Support: Monitored server stability (CPU/Disk/Memory) and managed VSS backups for United Health Care and GHX.
- Technical Support: Resolved infrastructure alerts and handled incident ticketing through Kayako and XSmart-control.
Assistant Engineer
Jun 2013 – Aug 2014
- Systems Deployment: Installed and upgraded healthcare IT systems (Meditos) and managed asset inventory.
- Field Support: Configured engineering software solutions and assisted in resolving escalated technical issues.