Arivazhagan Pandiyan

Site Reliability Engineering professional with 11+ years of experience supporting large-scale production environments across financial services and logistics platforms. Strong background in cloud infrastructure, observability, incident management, and reliability operations.

arivu.p@live.in
+1 346-599-0347
Houston, Texas, USA
www.arivu.site

About Me

I am a seasoned Site Reliability Engineering Lead / Senior SRE with over 11 years of experience supporting large-scale production environments across financial services and logistics platforms. I specialize in building fault-tolerant, scalable infrastructure and ensuring operational excellence for mission-critical applications.

My expertise spans Microsoft Azure and Google Cloud Platform (GCP), with deep hands-on experience in Kubernetes (GKE) environments and enterprise observability platforms including Datadog (APM, Tracing, DBM, RUM), Splunk, and Dynatrace. I serve as the primary Datadog specialist, managing end-to-end observability pipelines and driving telemetry hygiene through OpenTelemetry adoption.

As an Incident Commander, I lead triage for high-severity outages and drive post-incident root cause analysis (RCA) to ensure continuous improvement. I reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards based on the "Golden Signals" framework.

I'm proficient in Infrastructure as Code using Terraform and Ansible, with strong scripting skills in Python, PowerShell, and Shell. I hold the Microsoft Certified: Azure Administrator Associate (AZ‑104) and Datadog Fundamentals certifications. I'm passionate about reducing toil through automation, optimizing cloud costs, and mentoring engineering teams on observability best practices.

11+ Years Experience
75% Alert Noise Reduced
15+ Team Members Led
2 Certifications

Technical Proficiency

Cloud

Microsoft Azure · Google Cloud Platform (GCP)

Containers

Kubernetes · GKE

Observability

Datadog (APM, Tracing, DBM, RUM) · Splunk · Dynatrace · Azure Monitor

Automation / IaC

Terraform · Ansible · Python · PowerShell · Shell Scripting

ITSM / Incident

PagerDuty · Opsgenie · ServiceNow

CI/CD

Jenkins · GitHub

Platforms

VMware · Linux · Windows Server

Certifications

Datadog Fundamentals Certification
Microsoft Certified: Azure Administrator Associate (AZ‑104)

Professional Experience

Site Reliability Engineer

Aug 2025 – Present
Izen Labs (Client: Uber Freight) · Remote
  • Datadog Platform Ownership: Serve as the primary specialist for Datadog, managing APM, distributed tracing, DBM, and log ingestion pipelines to ensure end-to-end observability.
  • Observability Engineering: Improved telemetry hygiene by implementing consistent tagging, naming standards, and OpenTelemetry adoption across Kubernetes clusters.
  • Incident Management & Triage: Act as Incident Commander, using Datadog, Splunk, and ServiceNow to triage high-severity outages and lead post-incident root cause analysis (RCA).
  • Alert Optimization: Reduced alert noise from 40% to 10% by implementing AI-driven alerting, Watchdog anomaly detection, and SLO-aligned monitoring standards.
  • Standardization & Governance: Designed and enforced monitoring standards based on "Golden Signals," providing real-time visibility through custom dashboards for leadership and engineering teams.
  • Cost & Performance Management: Manage cost visibility and optimization for Datadog logs and APM usage while enabling application teams on observability best practices.
  • Operational Readiness: Support CI/CD pipelines and participate in DR testing and operational readiness reviews to ensure continuous improvement of SRE practices.

Site Reliability Engineering Lead

Aug 2023 – Jul 2024
New American Funding
  • Team Leadership: Led a 15‑member SRE team responsible for reliability of enterprise financial systems.
  • Incident Response: Managed incident response for high‑severity production issues and coordinated cross‑team resolution.
  • Monitoring Strategy: Implemented monitoring strategy using Azure Monitor and PagerDuty.
  • Service Levels: Defined service level indicators (SLIs) and service level objectives (SLOs).
  • Automation: Automated operational tasks using PowerShell scripting.
  • Post-Mortems: Led post‑incident reviews and root cause analysis.

Technology Operations Associate

Oct 2017 – Oct 2022
Wells Fargo India Solutions
  • Infrastructure Maintenance: Maintained enterprise infrastructure supporting mission-critical banking applications.
  • Health Monitoring: Monitored performance across Windows and VMware environments, automating diagnostics with PowerShell.
  • Collaboration: Partnered with network and database teams to resolve complex production incidents.

System Administrator

Sep 2015 – Oct 2017
NTT DATA
  • Server Management: Managed Windows and Linux production servers in a 35-member command center for financial clients.
  • Hardware Operations: Resolved hardware issues via iLO, DRAC, and SMH; managed security compliance and tool integration.

Support Engineer

Nov 2014 – Sep 2015
Firstsource
  • Production Support: Monitored server stability (CPU/Disk/Memory) and managed VSS backups for UnitedHealthcare and GHX.
  • Technical Support: Resolved infrastructure alerts and handled incident ticketing through Kayako and XSmart-control.

Assistant Engineer

Jun 2013 – Aug 2014
Cliptos Technologies
  • Systems Deployment: Installed and upgraded healthcare IT systems (Meditos) and managed asset inventory.
  • Field Support: Configured engineering software solutions and assisted in resolving escalated technical issues.

Education

Bachelor of Technology (Information Technology)

Sri Venkateshwara College of Engineering, Anna University – Chennai, India