Job ID: 30494
Location: Evanston, Illinois
The Senior Systems Administrator completes projects related to the design, installation, operation, support, upgrade and maintenance of infrastructure, including software, servers, networking and storage. Administers, maintains and supports complex and critical infrastructure. This role provides user support for server hosting requests, problem resolution, system changes and upgrades. The incumbent analyzes infrastructure performance and either resolves problems or makes appropriate recommendations. This role ensures that all aspects of operability are delivered as part of the implementation process, while also ensuring that existing service levels are maintained or improved. Contributes to, and ensures full compliance with, operational standards, procedures and best practices.
- This position is responsible for the operation and maintenance of Northwestern University's Research Computing Infrastructure, which consists of a Supercomputing (HPC) system, multiple on-premises server and application environments, and cloud-based research computing hosted in Amazon Web Services (AWS). Actively participates in the acquisition, installation, and management of hardware (compute, storage, and network), operating systems (Linux), research and analytical software tools, and cloud software delivery solutions. Operational aspects of the role include maintaining the environment at optimal efficiency, scripting and applying automation tools, administration and monitoring, facilitating and executing “scheduler” related activities, handling support requests, and resolving hardware and software related events. Collaborates regularly with the research computing consulting team, and participates in creating and executing standard operating procedures to maintain the integrity and security of on-premises and cloud-based solutions.
- Leads planning, development, and coordination of operations and projects for current and future infrastructures.
- Anticipates impact of growth and changes in operations, and recommends design and/or process changes.
- Participates in disaster recovery/business continuity planning, including backup and recovery procedures and high-availability configurations.
- Maintains awareness of new technologies through publications, outside contacts, and ongoing professional development.
- Ensures data/media recoverability by implementing a schedule of system backups and database archive operations.
- Serves as project liaison for implementations and upgrades, acting as the focal point for communications between our team and the Project Leader.
- Facilitates coordination and a thorough understanding of requirements; attends project meetings, writes meeting notes, creates appropriate tickets, etc.
- Maintains policies and standard procedures to increase system uptime.
- Identifies training needs and keeps current on application technologies.
- Monitors security alerts and ensures that appropriate patches are applied in an automated and timely fashion; works with developers to patch or upgrade custom code for security compliance.
- Documents and maintains system standards; researches and recommends innovative approaches for system administration tasks.
- Creates and maintains standard OS installation images for virtualization templates.
- Administers and supports Supercomputing (HPC) hardware (servers, network components, firewalls), the Linux operating system, utilities, analytical software tools, storage, and the backup system.
- Facilitates the HPC and research computing “scheduler” toolset and related activities.
- Configures and supports cloud computing services.
- Monitors application performance on servers.
- Evaluates and manages software and hardware allocations to achieve optimal performance.
- Performs capacity planning to project future growth.
- In collaboration with development project teams, builds, rebuilds, and/or updates servers, and configures hardware, virtual machines (VMs), applications, peripherals, services, networking, and storage.
- Participates in the implementation and support of cloud-based research computing services (primarily in AWS).
- Leads troubleshooting of application, operating system, server hardware, network communication, and storage problems within the infrastructure.
- Provides a second level of support; leads service incident and problem resolution efforts to support entrusted applications/products.
- Provides user support on deployed servers.
- Advises users on best practices.
- Provides data and metrics to support sizing requirements and performance tuning decisions. Participates in related decision processes with managers and leads.
- Brings a positive, collaborative approach when working with team members, colleagues from schools, and vendors.
- Participates in a 24x7 on-call rotation schedule.
- Provides work direction and/or supervises staff such as team members, subordinates, contractors, vendors, students, etc.
- Recommends staff hires and terminations.
- Coaches and mentors staff.
- Manages projects, ensuring that timelines are met and deliverables meet expectations.
- Performs other duties as assigned.