Site Reliability Engineer (GCP)
Veritas Partners has an immediate need for a Site Reliability Engineer (GCP) to work in a full-time capacity for a well-established financial institution!
This is a hybrid in-office position.
The successful Site Reliability Engineer (GCP) will work on a newly developed Bank as a Service platform!
We are in search of a dynamic SRE to partner with our Development teams to ensure the cloud-based infrastructure's integrity, performance, reliability, and cost-effectiveness.
Responsibilities:
● Provide L2/L3 support for production systems
● Work with the product and development teams to establish service level objectives and monitor to ensure the objectives are met
● Create dashboards to monitor performance and scalability on the GCP platform
● Design and code new software or modify existing software
● Support cloud environments in accordance with operational requirements
● Review and resolve incidents, escalations, and tasks
● Facilitate root cause analysis meetings in the event of a production-systems incident and improve run books.
● Monitor production components by running health checks, monitoring latency and memory utilization.
● Follow incident management best practices and perform RCA.
● Participate in disaster recovery tests and operational acceptance tests
● Deploy and debug cloud initiatives as needed in accordance with best practices
● Automate build and infrastructure self-healing pipelines.
● Maintain and update the deployment playbook
● Remediate security vulnerabilities in the cloud infrastructure
● Work with the Information Security and development teams on implementing secure cloud best practices.
● Troubleshoot and resolve issues in live production environments with minimal support, and implement strategies to prevent their recurrence.
● Support and monitor new and existing services, platforms, and application stacks.
● Engage in improving the lifecycle of services deployment, operations, and refinement.
● Participate in periodic 24x7 on-call duties.
● Be accountable for resolving outages via workaround or permanent fix
● Ensure all administration and reports are maintained and up to date, including contact information, technical diagrams, and post-major-incident reviews.
● Communicate with various stakeholders and send out IT communications.
● Ensure effective implementation of the Incident, Change, and Problem Management processes and carry out the respective reporting procedures.
● Monitor incidents to ensure that the Service Level Agreement is met.
● Identify, initiate, schedule, and conduct incident reviews
Qualifications:
● Excellent verbal, written, and interpersonal communication skills to maintain relationships and partnerships.
● Strong leadership skills and understanding of developing and mentoring others.
● 5+ years of experience in a DevOps Engineer role or related position
● 2+ years of experience with GCP is a must. Experience with AWS or Azure preferred in addition to GCP experience.
● Experience with log monitoring tools
● Experience administering multiple observability or APM systems
● 2+ years of Software Development work experience using Java or similar languages.
● Experience with Apache Kafka or similar event streaming platforms
● Experience with container orchestration
● Disaster recovery experience
● High-level proficiency in REST and microservice architectures
● Advanced understanding of how to develop, build, test, and deploy code using an integrated CI/CD pipeline
● Hands-on experience automating CI/CD pipelines
● Cloud Certification(s) in AWS, GCP or Azure is preferred
● Demonstrated ability to manage and complete projects from design phase to implementation phase
Technical Skills & Experience Required:
● Cloud services providers: GCP
● Orchestration: GKE, Cloud Run, Cloud Functions
● CI/CD tools: GitHub CI, Jenkins, Cloud Build
● Infrastructure as Code: Terraform
● Monitoring tools: Google Cloud Monitoring, Grafana, Datadog, Google Cloud Trace
● Logging tools: Google Cloud Logging, Prometheus and/or Splunk
● Secrets management
● Languages: Python, Go, Java, JavaScript, TypeScript
● Kafka, KSQL
● Linux
● Scalable, highly available architecture
● Agile development
● SCM and project management tools: GitLab, Jira
Requirements:
● Hybrid/remote position with ability to work at our Sterling office.
● Expertise with GCP.
● Experience working with GitHub, Git-based tools, CI/CD tools similar to Jenkins, Artifactory, Terraform, CloudFormation and other modern tools.
● Experience with Kubernetes, Docker, and containerization, GKE or equivalent tools.