City: Nashville, TN, US
Employer Reference: 10006337
Job Description
The Site Reliability Engineer (SRE) works in the Advanced Computing Center for Research and Education (ACCRE) and is a key manager responsible for designing, building, and maintaining large-scale production systems with high efficiency and availability. This role encompasses various areas, including software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and possess expertise in domains such as systems, networking, storage, coding, database management, capacity management, and continuous delivery and deployment, as well as open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Reporting directly to the Director of Research Computing Operations, the SRE is responsible for ensuring reliable storage solutions, managing data efficiently, and providing related services to support the overall stability and performance of ACCRE’s production systems.
The Site Reliability Engineer is shared with the CMS project. In this role, the SRE will work in tandem with the CMS Tier 2 coordinator for the US to ensure smoother data and job delivery to the ACCRE cluster by the project’s command center.
The Advanced Computing Center for Research and Education (ACCRE) is built and operated by Vanderbilt faculty. Its mission is to enable Vanderbilt researchers to explore and benefit from the “New World” of computing, thereby addressing questions of great societal importance that they could not otherwise address. To achieve this, the center has established the following goals:
- Application Driven: ACCRE emphasizes the application of computational resources to important questions across the diverse disciplines of Vanderbilt researchers, rather than focusing solely on the development of computational hardware, tools, and methodologies.
- Low Barriers: The center aims to provide computational services with low barriers to participation, working closely with researchers to develop and adapt computing tools to their specific areas of inquiry.
- Expand the Paradigm: ACCRE collaborates with the Vanderbilt community to discover innovative ways of utilizing computing in the humanities, arts, and education.
- Promote Community: The center fosters an interactive community of researchers and cultivates a campus culture that supports and promotes the use of computing tools.
- Investigator Driven: ACCRE maintains a grassroots, bottom-up approach, operating as a facility by and for Vanderbilt faculty.
ACCRE provides computing resources flexible enough to support High Performance Computing applications in a broad range of research projects. To meet the growing demand for data storage, the center is developing and deploying solutions for both online and offline data repositories. Moreover, ACCRE offers the necessary hardware for investigators and students to visualize high-dimensional data using parallel graphics and stereo projection technologies. The center’s infrastructure also includes expertise and support staff to facilitate usage, including educational/outreach staff focused on lowering barriers to use and expanding the paradigm to encompass new and non-traditional areas of investigation. The center operates a 14,000+ core Linux cluster composed of multiple computer architectures and over 30 petabytes of parallel-access, fault-tolerant, distributed disk storage.
Duties and Responsibilities
- Lead the implementation and support of large-scale storage clusters based on the design and recommendations of the Architecture Group, including monitoring, logging, and alerting.
- Formulate and deploy AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
- Lead the Systems Group to improve the lifecycle of services – from inception and design, through deployment, operation, and refinement.
- Coordinate work between the Infrastructure and Systems Groups to establish an adequate level of support for services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
- Define and oversee the formal evaluation of services once they are live by measuring and monitoring availability, latency, and overall system health, including leveraging machine learning models.
- Define and track hardware lifecycle metrics that impact spending priorities and budgetary plans.
- Coordinate and guide joint efforts between the Research Software Engineer (RSE) and Systems Groups on full-stack projects.
- Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Be part of an on-call rotation to support production systems.
- Ensure that ACCRE’s internal and external-facing services deliver the reliability and uptime promised to users.
- Enable developers to make changes to the existing system through careful preparation and planning while keeping an eye on capacity, latency, and performance.
Supervisory Relationships
This position has supervisory responsibilities for ACCRE’s Systems Group. This position reports administratively and functionally to the Director of Research Computing Operations.
Qualifications
- Bachelor’s degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) is required.
- Advanced degree is strongly preferred.
- A minimum of five years of experience, gained during work or school, with one or more major programming languages such as C, C++, Java, or Fortran is required.
- Five years of experience, gained during work or school, with one or more Unix scripting languages such as Perl, Bash, Csh, or Python, plus a working knowledge of all of these, is required.
- Experience with algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems is required.
- Experience in one or more of the following is required: C/C++, Java, Python, Go, Perl, or Ruby; AI/ML frameworks and methodologies.
- Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform is required.
- Experience using observability and tracing-related tools like InfluxDB, Prometheus, and the Elastic Stack is required.
- Strong ability to work independently and in a team environment, and to make decisions, is required.
- Strong ability to share knowledge coherently with others and motivate and mentor peers is required.
- Physical ability to work with and lift 30 lbs when needed is required.
- Strong programming ability and understanding of commonly used design patterns is preferred.