Harvard University seeks a Sr. Systems Engineer to work on cutting-edge infrastructure to support the Kempner Institute for Natural and Artificial Intelligence. This position will work within a team of RC Systems Engineers in FAS Research Computing (FASRC) to design, implement, deploy, and maintain advanced monitoring, logging, and alerting systems for mission-critical services. The Systems Engineering group maintains core production infrastructure for high-performance computing, storage, and networking. These services will be key to the success of Kempner Institute researchers. This is an individual contributor position that will report to the Associate Director of Systems & Operations in FASRC.
This position will plan, coordinate, and carry out advanced research computing engineering duties. Implement current and develop new RC solutions to keep up with the pace of complex research problems. Work independently to build, monitor, and maintain the integrity of RC systems. Provide technical expertise to teams and projects alongside research programs. Be a key contributor to multiple projects simultaneously.
Over the next five years, the Kempner Institute (www.harvard.edu/kempner-institute) is building one of the largest academic machine learning clusters in the world to enable research in machine learning and neuroscience, with 1,000 to 1,500 latest-design GPUs and upgrades every five years. The cluster will be housed at the Massachusetts Green High Performance Computing Center (MGHPCC), a modern data center in western Massachusetts shared by Boston-area universities and managed by FASRC. The Senior Systems Engineer will be a critical team member within FASRC, partnering with the Kempner Institute's leadership to ensure optimal performance and management of the cluster and to help inform future decisions related to compute, I/O, and networking needs in support of the Institute's research mission.
FASRC services include managing a Top 100 academic high-performance computing cluster, cloud computing, storage, databases, and other developmental platforms. The team directly engages with researchers through help requests, monitoring, office hours, training, and in-depth consultations. FASRC is committed to cultivating a diverse and inclusive culture that is vibrant, engaging, and encouraging of innovation as well as intellectual debate. We believe creating and maintaining an inclusive workplace allows employees from all backgrounds to achieve their fullest potential. We also believe an inclusive culture is one that accepts, values, and views the differences we all bring to the workplace.
Work is performed in an office setting on the Cambridge campus and in Allston.
May also be required to work occasionally in the FASRC data centers in Boston, MA, and Holyoke, MA. This is a full-time position with flexible hours and a hybrid in-person/remote work schedule option to be agreed upon at hire. The selected candidate will periodically need to be on campus as business needs require. All remote work must be performed from a commutable state where Harvard is registered to do business (CT, MA, MD, ME, NH, RI, and VT).
Minimum of seven years’ post-secondary education or relevant work experience.
Additional Qualifications and Skills
Broad knowledge of the deployment and management of physical and virtual systems (e.g., storage, cluster computing, network, database, and applications).
Experience automating infrastructure with tools like Puppet, Chef, Ansible, or Terraform.
Experience with git and version control in general.
Demonstrated teamwork skills, the ability to communicate clearly, a service-oriented mindset, and the ability to act as a trusted advisor.
Experience with monitoring systems, including writing service checks and creating actionable alerts.
Experience with metrics collection to gain insight into production systems.
Experience with log aggregation.
Experience with high-performance filesystems and distributed storage systems.