Experience
- Serve as primary diagnostic authority ('Server Surgeon') for server-level incidents — leading root-cause analysis across Windows, IIS, SQL Server, and ArcGIS Server environments using Nagios, CloudWatch, and Jira Service Management.
- Led RCA investigations for P1–P3 incidents, producing evidence-backed post-incident documentation including timelines, log analysis, and preventative recommendations.
- Analyzed IIS traffic patterns and AWS network telemetry to identify high-bandwidth endpoints and delivered a Network Root-Cause Analysis with targeted remediation recommendations for Dev and SRE teams.
- Created comprehensive documentation for troubleshooting, server configurations, failure modes, and diagnostic workflows — published in Confluence and adopted across DevOps and SRE teams.
- Defined Golden AMI standards for IIS and SQL Server workloads, including patch baselines, required agents, validation checklists, and SOC 2 audit requirements.
- Correlated logs and metrics across monitoring systems to identify recurring failure patterns and delivered actionable prevention recommendations to engineering and leadership.
- Communicated incident findings and risk assessments to both technical teams and executive leadership, translating complex diagnostics into clear business impact summaries.
- Architected and maintained hybrid infrastructure across AWS and on-premises environments supporting 24x7 operations.
- Implemented automation using Python, PowerShell, and GitHub Actions to streamline patching, software deployment, and configuration updates across Windows servers.
- Developed and maintained Terraform configurations for provisioning and updating infrastructure resources.
- Acted as escalation point for critical system outages, performing root-cause analysis and implementing preventive measures.
- Collaborated with development and DevOps teams to improve deployment pipelines and reduce recurring operational issues.
- Troubleshot Kubernetes pod and application issues in production environments using Lens.
- Led security hardening, access management, and cost optimization initiatives across multiple AWS accounts.
- Supported and maintained 100+ client environments with 1,000+ VMs and 1,500+ end users in a private cloud SaaS platform operating 24x7.
- Managed complex Windows Server, Active Directory, Citrix, IIS, and SQL Server environments across diverse client configurations.
- Served as first responder for system outages, handling rapid triage, escalation coordination, and long-term fixes to prevent recurrence.
- Automated application deployment using PowerShell, reducing manual setup time by 75% — from 8 hours to 2 hours per deployment.
- Conducted proactive storage and billing audits, shifting team focus from reactive firefighting to preventive monitoring.
- Created comprehensive internal and client-facing documentation for environment management and troubleshooting.
- Provided enterprise-level technical support to healthcare clients using APIs, SQL, and custom configs for scheduling integrations.
- Troubleshot complex integration issues across 15+ EMR systems connected to Zocdoc's platform.
- Recognized as Top Performer — resolving 25+ issues per day with the highest client satisfaction scores on the team.
- Championed a process improvement initiative that reduced inbound support volume by ~20%.
- Provided Tier 2 support for 300+ faculty and staff across network, Active Directory, and application environments.
- Managed DHCP/DNS infrastructure and user permissions through Active Directory and Infoblox.
- Collaborated with IT management to improve internal support processes and documentation.
Selected Achievements
Golden AMI Standards: Defined Golden AMI standards for IIS and SQL Server workloads, including patch baselines, required agents, validation checklists, and SOC 2 audit requirements.
Automated SaaS Deployments: Designed PowerShell and SQL-based automation reducing manual deployment time by 75% (from 8 hours to 2 hours) while eliminating configuration drift.
Reduced MTTR by 40%: Implemented comprehensive monitoring with PRTG and Observe, improving uptime visibility and enabling proactive incident response.
Optimized Archiving Workflows: Automated workspace archiving, cutting project completion times from 300 to 120 hours (60% improvement) while improving compliance.
Reduced Project Delivery Time: Authored deployment guides and runbooks that cut average project setup time by 3 hours per cycle while improving consistency.