Staff Site Reliability Engineer at Zscaler, San Jose, CA
Automated provisioning of Zero Trust Architecture product security builds (primarily FreeBSD) across 900+ public/private data centers worldwide.
Built a fail-safe data pipeline connecting in-house cloud weblog data (via the Rundeck API, https://docs.rundeck.com/docs/) to BigQuery (GCP with OAuth) to analyze customer transaction data and workload patterns (with pandas, scikit-learn, PyTorch) and do capacity planning for new features like Data Loss Prevention.
Validated BGP configs, monitored traffic, and restarted 1k+ high-priority production services with self-healing scripts that run across time zones, outside each data center's business hours; the offline timezonefinder library maps each data center's latitude/longitude coordinates to its time zone.
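A minimal sketch of the off-hours check described above, not the production script. The function names and the 09:00-17:00 weekday window are assumptions for illustration; only `TimezoneFinder().timezone_at(lng=..., lat=...)` comes from the timezonefinder library itself.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def timezone_for(lat: float, lng: float) -> str:
    """Resolve data center coordinates to an IANA time zone name, offline."""
    # Imported lazily so the rest of the sketch runs without the package.
    from timezonefinder import TimezoneFinder

    name = TimezoneFinder().timezone_at(lng=lng, lat=lat)
    if name is None:
        raise ValueError(f"no time zone found for ({lat}, {lng})")
    return name


def outside_business_hours(now_utc: datetime, tz_name: str,
                           opens: time = time(9), closes: time = time(17)) -> bool:
    """True when the data center's local time is outside the assumed window."""
    local = now_utc.astimezone(ZoneInfo(tz_name))
    if local.weekday() >= 5:  # Saturday or Sunday counts as off-hours
        return True
    return not (opens <= local.time() < closes)


# Usage: decide whether a restart may run right now for one site.
# tz = timezone_for(35.68, 139.69)            # hypothetical coordinates
# if outside_business_hours(datetime.now(timezone.utc), tz): ...
```

Keeping the coordinate lookup separate from the hours check means the lookup result can be cached per data center, since coordinates do not change between runs.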
Leveraged various libraries for build provisioning, UI automation, and data scraping on production nodes handling customer traffic:
Selenium (with Okta SSO cookies passed via a plugin) and Playwright for browser UI tasks and boot process automation for Linux and Mac minis
Tinkered with Redfish APIs, a bit of Hive/Hadoop infra data scraping, and AppleScript event handling on Mac minis, focusing on security: SIP, editing the SQLite TCC database (understanding various macOS access policies and security settings), granting system-level permissions, and automating provisioning.
A bit of the AWS boto SDK to provision AWS resources so customers can connect in-house nano log streaming servers to their on-prem Splunk.
Added exponential backoff (with pseudo-random jitter) to all API requests in the scripts for robust error handling and to prevent the thundering herd problem; used cache locks to minimize service downtime and maintain system performance and stability during spikes/high-demand scenarios.
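A minimal sketch of exponential backoff with jitter, assuming the "full jitter" variant (sleep a random duration between zero and the capped exponential ceiling). The helper name `call_with_backoff` and its defaults are hypothetical, not taken from the actual scripts.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base=0.5, cap=30.0,
                      retry_on=(Exception,), sleep=time.sleep, rng=random.random):
    """Call func(), retrying on failure with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: a random delay in [0, ceiling) spreads retries out
            # so many clients do not hit the API again at the same instant.
            sleep(rng() * ceiling)
```

`sleep` and `rng` are injectable so the retry schedule can be checked without real waiting, for example `call_with_backoff(fetch, retry_on=(ConnectionError,))` where `fetch` is whatever request function the script already has.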
A bunch of scripts using Jira APIs for issue-tracking integration
Concurrency/parallel execution libraries:
gevent thread pools (monkey-patching standard libraries for non-blocking I/O)
multiprocessing for CPU-bound tasks
asyncio for explicit event loop handling
native system threading library with Global Interpreter Lock considerations
paramiko (SSH), subprocess, and os libraries for remote execution and system interactions
Go runtime for handling goroutines in mixed workloads (trade-offs between CPU-bound and I/O-bound tasks)
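As an illustration of how two of the options above can combine, here is a small sketch that runs blocking calls (for example SSH or subprocess commands) concurrently from an asyncio event loop. `blocking_check`, `check_all`, and the host names are hypothetical stand-ins; `asyncio.to_thread` requires Python 3.9+.

```python
import asyncio
import time


def blocking_check(host: str) -> str:
    """Stand-in for a blocking call such as an SSH or subprocess command."""
    time.sleep(0.05)  # simulated I/O wait; the GIL is released while sleeping
    return f"{host}: ok"


async def check_all(hosts, limit=10):
    """Run blocking checks in worker threads, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def one(host):
        async with sem:
            # Offload the blocking call so the event loop stays responsive.
            return await asyncio.to_thread(blocking_check, host)

    # gather returns results in the same order as the input hosts.
    return await asyncio.gather(*(one(h) for h in hosts))


# Usage: results = asyncio.run(check_all(["node-a", "node-b", "node-c"], limit=2))
```

Threads suit this case because the work is I/O-bound; CPU-bound work would still contend for the Global Interpreter Lock and is better handed to multiprocessing.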
In the midst of some calibrations:
Hosting https://huggingface.co/deepseek-ai/DeepSeek-V2 (also Llama-3.1-Nemotron-70B-Instruct) internally to turn everyone into a 10x dev :), refactor bad code blocks, convert code to documentation (easy-to-read Markdown), etc.
Prompt-to-SQL mapping chatbot: LoRA fine-tuning on internal web app/database queries and schemas, generating RAG embeddings with Ollama (for SQL query validation).