Arivazhagan Pandiyan

Site Reliability Engineering professional with 11+ years of experience supporting large-scale production environments across financial services and logistics platforms. Strong background in cloud infrastructure, observability, incident management, and reliability operations.

arivu.p@live.in
+1 346-599-0347
Houston, Texas, USA
www.arivu.site

About Me

I am a seasoned Site Reliability Engineering Lead / Senior SRE with over 11 years of experience supporting large-scale production environments across financial services and logistics platforms. I specialize in building fault-tolerant, scalable infrastructure and ensuring operational excellence for mission-critical applications.

My expertise spans Microsoft Azure and Google Cloud Platform (GCP), with deep hands-on experience in Kubernetes (GKE) environments and enterprise observability platforms including Datadog (APM, Tracing, DBM, RUM), Splunk, and Dynatrace. I serve as the primary Datadog specialist, managing end-to-end observability pipelines and driving telemetry hygiene through OpenTelemetry adoption.

As an Incident Commander, I lead triage for high-severity outages and drive post-incident root cause analysis (RCA) to ensure continuous improvement. I've successfully reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards based on the "Golden Signals" framework.

I'm proficient in Infrastructure as Code using Terraform and Ansible, with strong scripting skills in Python, PowerShell, and Shell. I hold the Microsoft Certified: Azure Administrator Associate (AZ‑104) and Datadog Fundamentals certifications. I'm passionate about reducing toil through automation, optimizing cloud costs, and mentoring engineering teams on observability best practices.

11+ Years Experience
75% Alert Noise Reduced
15+ Team Members Led
2 Certifications

Technical Proficiency

Cloud

Microsoft Azure, Google Cloud Platform (GCP)

Containers

Kubernetes, GKE

Observability

Datadog (APM, Tracing, DBM, RUM), ELK Stack, Splunk, Dynatrace, Azure Monitor

Automation / IaC

Terraform, Ansible, Python, PowerShell, Shell Scripting

ITSM / Incident

PagerDuty, Opsgenie, ServiceNow

CI/CD

Jenkins, GitHub

SRE Practice

Error Budgets, Chaos Engineering (Chaos Mesh), Blameless Post-mortems, Golden Signals, SLIs / SLOs

Platforms

VMware, Linux (RHEL/Ubuntu), Windows Server

Certifications

Datadog Fundamentals Certification
Microsoft Certified: Azure Administrator Associate (AZ‑104)

Professional Experience

Site Reliability Engineer

Aug 2025 – Present
Izen Labs (Client: Uber Freight) · Remote
  • On-Prem to GCP Migration: Spearheaded the comprehensive migration of monitoring infrastructure from legacy on-premises data centers to Google Cloud Platform (GCP), ensuring zero observability gaps.
  • Datadog Observability Transformation: Re-architected the monitoring landscape by migrating legacy ELK stack logs and on-prem metrics into Datadog, centralizing telemetry for high-scale GCP workloads.
  • High Availability & Error Budgets: Engineered system reliability to maintain 99.99% availability for mission-critical logistics platforms; managed Error Budgets to balance feature velocity with production stability.
  • Infrastructure as Code (IaC): Utilized Terraform to architect and provision multi-region Kubernetes (GKE) clusters, implementing modular code to ensure consistent environment states.
  • Chaos Engineering Implementation: Enhanced system resilience by conducting scheduled fault-injection experiments using Chaos Mesh, successfully identifying and mitigating single points of failure.
  • Kubernetes Scalability: Optimized application performance via manual and automated scaling (HPA/VPA) of workloads within Kubernetes to handle unpredictable traffic spikes.
  • CI/CD Pipeline Management: Orchestrated automated deployment pipelines using Jenkins and GitHub, incorporating automated testing to reduce deployment-related incidents.
  • Incident Command & RCA: Acted as Incident Commander for high-severity production outages, coordinating cross-functional teams and utilizing ServiceNow for Root Cause Analysis (RCA).
  • Alert Optimization: Leveraged Datadog Watchdog and AI-driven alerting to reduce alert noise from 40% to 10%, focusing the team on actionable events.

Site Reliability Engineering Lead

Aug 2023 – Jul 2024
New American Funding
  • Strategic Team Leadership: Led a high-performing 15-member SRE team responsible for the 24/7 reliability of enterprise-level financial and mortgage systems.
  • Enterprise Monitoring Overhaul: Designed and implemented a comprehensive enterprise monitoring strategy utilizing Azure Monitor and PagerDuty, increasing visibility into legacy applications.
  • Service Level Definition: Established robust service level indicators (SLIs) and service level objectives (SLOs) to align IT performance with business expectations.
  • Operational Automation: Significantly reduced manual toil by automating repetitive operational tasks and diagnostic workflows using PowerShell and Python scripting.
  • Cross-Team Incident Coordination: Managed high-severity production responses, facilitating blameless post-mortems and coordinating long-term stability fixes.

Technology Operations Associate

Oct 2017 – Oct 2022
Wells Fargo India Solutions
  • Infrastructure Maintenance: Maintained enterprise infrastructure supporting mission-critical banking applications.
  • Health Monitoring: Monitored performance across Windows and VMware environments, automating diagnostics with PowerShell.
  • Collaboration: Partnered with network and database teams to resolve complex production incidents.

System Administrator

Sep 2015 – Oct 2017
NTT DATA
  • Server Management: Managed Windows and Linux production servers in a 35-member command center for financial clients.
  • Hardware Operations: Resolved hardware issues via iLO, DRAC, and SMH; managed security compliance and tool integration.

Support Engineer

Nov 2014 – Sep 2015
Firstsource
  • Production Support: Monitored server stability (CPU/Disk/Memory) and managed VSS backups for United Health Care and GHX.
  • Technical Support: Resolved infrastructure alerts and handled incident ticketing through Kayako and XSmart-control.

Assistant Engineer

Jun 2013 – Aug 2014
Cliptos Technologies
  • Systems Deployment: Installed and upgraded healthcare IT systems (Meditos) and managed asset inventory.
  • Field Support: Configured engineering software solutions and assisted in resolving escalated technical issues.

Education

Bachelor of Technology (Information Technology)

Sri Venkateshwara College of Engineering, Anna University – Chennai, India

Get in Touch

I'm currently open to new opportunities as a Site Reliability Engineering Lead. Whether you have a question or just want to connect, feel free to reach out!

Location

Houston, Texas, USA
