Site Reliability Engineer - iPeople Infosystems LLC

Posted 2025-10-27 09:42:19
Remote, USA | Full Time | Immediate Start
Job Title: SRE (Site Reliability Engineer)
Location: Remote
Type: Full-time position

Job Description:

Must-have
- NVIDIA (DGX) or equivalent high-performance-compute (HPC) clusters (e.g. Cray, HPE, IBM)
- Cisco UCS C885A
- Docker

Good to have
- DevOps automation
- CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
- Terraform, Ansible, Jenkins
- Python
- GoLang, C/C++
- Enterprise-grade Kubernetes cluster (RedHat OpenShift preferred) and/or Google Anthos
- Software development lifecycle, including design, development, testing, packaging, and deployment using Golang

Roles & Responsibilities
- Technical knowledge of high-performance compute, NVIDIA DGX/GPUs and/or Cisco Unified Compute System.
- Handle availability, latency, scalability and efficiency of NVIDIA and Cisco UCS infrastructure by instilling engineering reliability into the development life cycle, with a focus on fault-tolerant approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
- Deliver automation through CI/CD pipelines, chatbots, etc.
- Implement metrics-driven processes to ensure service quality targets are met.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91137892
Position Id: 8754667