Principal Site Reliability Engineer
Company: International Association of Plumbing and Mechanic
Location: Palo Alto
Posted on: March 4, 2025
Job Description:
Leidos has an opportunity within the newly created
Digital Modernization Practice Area, leading Site Reliability
Engineering for the Repeatable Offerings (RO) organization. The RO
organization is the delivery arm of the Digital Modernization
sector's Repeatable Offerings, delivering differentiated
capabilities and managed services across the sector and the larger
Leidos corporation. We are seeking a Principal Site Reliability
Engineer (SRE) to lead the design, implementation, and operation of
scalable, highly available systems. As a subject matter expert, you
will establish best practices for reliability, security, and
efficiency while driving innovation in our deployment and
operations strategies. You will collaborate with development teams
to improve system performance, automate processes, and ensure
smooth recovery in high-pressure situations.

The team is primarily located in Blacksburg, VA, and the selected
candidate will be required to either be on-site in Blacksburg or
travel frequently to that location, as well as to other locations
as required.
Primary Responsibilities:
- Lead the development and execution of SRE strategies to enhance
system reliability, scalability, and efficiency.
- Manage production systems and operations, ensuring robust
development and implementation processes.
- Oversee recovery efforts for unstable or at-risk projects,
applying expertise in remediation strategies.
- Design and implement microservice architectures, including
orchestrators, for high-performance distributed systems.
- Develop, maintain, and optimize CI/CD pipelines, infrastructure
as code (IaC), and automation frameworks.
- Drive adoption of best practices for horizontal and vertical
scaling of microservices.
- Define and implement packaging and deployment strategies to
support rapid and reliable software delivery.
- Collaborate with engineering teams to improve observability,
monitoring, and operational excellence.
- Provide technical leadership in managing containerized
applications and orchestration platforms.
- Mentor and guide teams on modern reliability engineering
methodologies and best practices.
Basic Qualifications:
- Requires a BS degree and 12-15 years of prior relevant
experience, or a Master's degree with 10-13 years of prior
relevant experience.
- Proven experience as a Principal SRE or equivalent role in
establishing robust and reliable systems.
- Expertise in managing production systems and operations,
including monitoring, incident response, and performance
optimization.
- Strong experience with Kubernetes and container
orchestration.
- Deep understanding of CI/CD pipelines, infrastructure as code
(IaC), Helm Charts, and Operators.
- Hands-on experience in designing and implementing microservice
architecture and distributed systems.
- Experience leading development teams in packaging and
deployment strategies.
- Strong knowledge of management strategies and techniques to
support SRE principles.
- Must have U.S. Citizenship.
- Must be able to obtain and maintain a Public Trust clearance
specific to the customer.
Preferred Qualifications:
- Strong experience with OpenShift in enterprise
environments.
- Experience with auto-scaling, self-healing architectures, and
advanced resiliency strategies.
- Demonstrated success in improving and recovering red/unhealthy
projects.
- Familiarity with service mesh technologies and distributed
tracing for monitoring and observability.
- Expertise in designing and implementing highly available,
fault-tolerant systems at scale.
- Experience working on Federal Government contracts.