Built automation & monitoring tools/processes to deploy/upgrade various telco-grade products on k8s & OpenStack/Ceph.
Resolved over 150 critical bugs & escalations across the infra stack, from kernel upgrades, HW certs, and storage performance to observability, involving various internal/open-source tools like Zabbix, Prometheus, Grafana, ELK stack, Fluentd, Istio, and Calico.
Improved visibility for applications running on Kubernetes by debugging and integrating acceptance tests, low-level system and network metrics from CLI into Fault/Performance management dashboard tools for better resource estimation.
Developed over 20 validations in internal cluster health/log collection tools using Python/Ansible playbooks to fix the current state of the services and enable smoother lifecycle operations like node scale-in/out, PKI certificate renewals, etc.
Completed the HashiCorp Certified Terraform Associate certification course on Udemy, which helped resolve some tickets involving terraform-provider-openstack by fixing Terraform scripts that automate cluster operations like upgrades.
Earned the Certified Kubernetes Administrator (CKA) certification, completed the CKAD course, and explored CKS and the tooling listed at https://landscape.cncf.io/, all of which helped with faster RCA while handling large clusters.
Infrastructure Customer Engineering (ICE), transitioned from a team of 2 to 8 members serving Telco customers in the Americas (North + South) region
Nokia Container Services (NCS, k8s-based) and Cloudband Infrastructure Software (CBIS, OpenStack-based)
Contributed to the delivery and care of the above two products for telco customer applications (VNF/CNF)
Infra automation and maintenance of 10,000+ telecom (5G) customer servers/nodes.
Some of the tools I was involved with:
Storage: Ceph CSI (cephrbd, cephfs), HP Primera CSI, NFS CSI, RAID configs