Arivazhagan Pandiyan | SRE Lead / Senior SRE

About Me

I am a seasoned Site Reliability Engineering Lead / Senior SRE with over 11 years of experience supporting large-scale production environments across financial services and logistics platforms. I specialize in building fault-tolerant, scalable infrastructure and ensuring operational excellence for mission-critical applications.

My expertise spans Microsoft Azure and Google Cloud Platform (GCP), with deep hands-on experience in Kubernetes (GKE) environments and enterprise observability platforms including Datadog (APM, Tracing, DBM, RUM), Splunk, and Dynatrace. I serve as the primary Datadog specialist, managing end-to-end observability pipelines and driving telemetry hygiene through OpenTelemetry adoption.

As an Incident Commander, I lead triage for high-severity outages and drive post-incident root cause analysis (RCA) to ensure continuous improvement. I reduced alert noise from 40% to 10% (a 75% relative reduction) by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards based on the "Golden Signals" framework.

I'm proficient in Infrastructure as Code using Terraform and Ansible, with strong scripting skills in Python, PowerShell, and Shell. I hold the Microsoft Certified: Azure Administrator Associate (AZ‑104) and Datadog Fundamentals certifications. I'm passionate about reducing toil through automation, optimizing cloud costs, and mentoring engineering teams on observability best practices.

11+ Years Experience

75% Alert Noise Reduced

15+ Team Members Led

Certifications

Technical Proficiency

Cloud

Containers

Observability

Automation / IaC

ITSM / Incident

CI/CD

SRE Practice

Platforms

Certifications

Datadog Fundamentals Certification · Datadog · Verified

Microsoft Certified: Azure Administrator Associate · Microsoft · AZ‑104

Professional Experience

Site Reliability Engineer

Aug 2025 – Present

Izen Labs (Client: Uber Freight) · Remote

On-Prem to GCP Migration: Spearheaded the comprehensive migration of monitoring infrastructure from legacy on-premises data centers to Google Cloud Platform (GCP), ensuring zero observability gaps.
Datadog Observability Transformation: Re-architected the monitoring landscape by migrating legacy ELK stack logs and on-prem metrics into Datadog, centralizing telemetry for high-scale GCP workloads.
High Availability & Error Budgets: Maintained 99.99% availability for mission-critical logistics platforms and managed error budgets to balance feature velocity with production stability.
Infrastructure as Code (IaC): Used Terraform to design and provision multi-region Kubernetes (GKE) clusters, with modular code to keep environment states consistent.
Chaos Engineering Implementation: Enhanced system resilience by conducting scheduled fault-injection experiments using Chaos Mesh, successfully identifying and mitigating single points of failure.
Kubernetes Scalability: Optimized application performance through manual scaling and autoscaling (HPA/VPA) of Kubernetes workloads to handle unpredictable traffic spikes.
CI/CD Pipeline Management: Orchestrated automated deployment pipelines using Jenkins and GitHub, incorporating automated testing to reduce deployment-related incidents.
Incident Command & RCA: Acted as Incident Commander for high-severity production outages, coordinating cross-functional teams and using ServiceNow for root cause analysis (RCA).
Alert Optimization: Leveraged Datadog Watchdog and AI-driven alerting to reduce alert noise from 40% to 10%, focusing the team on actionable events.

Site Reliability Engineering Lead

Aug 2023 – Jul 2024

New American Funding

Strategic Team Leadership: Led a high-performing 15-member SRE team responsible for the 24/7 reliability of enterprise-level financial and mortgage systems.
Enterprise Monitoring Overhaul: Designed and implemented a comprehensive enterprise monitoring strategy utilizing Azure Monitor and PagerDuty, increasing visibility into legacy applications.
Service Level Definition: Established robust service level indicators (SLIs) and service level objectives (SLOs) to align IT performance with business expectations.
Operational Automation: Significantly reduced manual toil by automating repetitive operational tasks and diagnostic workflows using PowerShell and Python scripting.
Cross-Team Incident Coordination: Managed high-severity production responses, facilitating blameless post-mortems and coordinating long-term stability fixes.

Technology Operations Associate

Oct 2017 – Oct 2022

Wells Fargo India Solutions

Infrastructure Maintenance: Maintained enterprise infrastructure supporting mission-critical banking applications.
Health Monitoring: Monitored performance across Windows and VMware environments, automating diagnostics with PowerShell.
Collaboration: Partnered with network and database teams to resolve complex production incidents.

System Administrator

Sep 2015 – Oct 2017

NTT DATA

Server Management: Managed Windows and Linux production servers in a 35-member command center for financial clients.
Hardware Operations: Resolved hardware issues via iLO, DRAC, and SMH; managed security compliance and tool integration.

Support Engineer

Nov 2014 – Sep 2015

Firstsource

Production Support: Monitored server stability (CPU/Disk/Memory) and managed VSS backups for United Health Care and GHX.
Technical Support: Resolved infrastructure alerts and handled incident ticketing through Kayako and XSmart-control.

Assistant Engineer

Jun 2013 – Aug 2014

Cliptos Technologies

Systems Deployment: Installed and upgraded healthcare IT systems (Meditos) and managed asset inventory.
Field Support: Configured engineering software solutions and assisted in resolving escalated technical issues.

Education

Bachelor of Technology (Information Technology)

Sri Venkateshwara College of Engineering, Anna University – Chennai, India

Get in Touch

I'm currently open to new opportunities as a Site Reliability Engineering Lead. Whether you have a question or just want to connect, feel free to reach out!

Email

arivu.p@live.in

Phone

+1 346-599-0347

Location

Houston, Texas, USA

Connect on LinkedIn
