294 Site Reliability jobs in Canada

Site Reliability Engineer

Mississauga, Ontario Compass Group

Posted 5 days ago

Job Description

You might not know our name, but you know where we are. That’s because Compass Group Canada is part of a global foodservice and support services company that’s the 6th largest employer in the world, with 625,000 employees.

You’ll find us in schools, colleges, hospitals, office buildings, senior living communities, tourist attractions, sports venues, remote camps and military installations and more. We’re in all major cities, at remote work sites and everywhere in between – doing business in Canada and 50+ other countries where you can learn and grow. Join us now and point your career forward!

**Why work with Compass Digital?** We are a member of Compass Group North America, the leading foodservice and support services company. We create remarkable customer experiences through the innovative design and development of technology products and services. Compass Digital began as an innovation startup; the team has since grown rapidly and now supports technology and innovation across all of North America. Compass Digital comprises user experience designers, developers, data scientists, project managers, business analysts, marketers and technology implementation managers, all of whom look at the world from a human perspective to rethink what's possible in the areas of technology innovation and consumer engagement within the foodservice and hospitality industries. Join us.

# **Job Summary**

As a **Reliability Engineer** you will work in focus areas such as observability, release automation, incident and problem response improvements, security, code quality, patch management and SRE advocacy. You will have the opportunity to use the latest and greatest cloud and open-source technology to enable our product and test engineering teams through solutions, self-service capabilities, support, and guidance.

Now, if you were to come on board as our **Reliability Engineer** we’d ask you to do the following for us:

- Add your voice, skills, and experience to our team
- Help grow and diversify our SRE team and discipline
- Step back to see the big picture, and dive deep when needed
- Design, build, and maintain automation for infrastructure, monitoring, and operations
- Write high-quality, maintainable code to improve reliability and reduce toil
- Build observability into systems from the ground up; make failure visible and actionable
- Evaluate and integrate open-source and commercial tools to increase stability and insight
- Collaborate with product teams to ensure systems are instrumented, secure, and resilient
- Document and share your work with the team and broader SRE community
- Champion a culture of ownership, transparency, and continuous improvement

Think you have what it takes to be our **Reliability Engineer?** We’re committed to hiring the best talent for the role. Here’s how we’ll know you’ll be successful in the role:

- **3–5 years** of experience in SRE, DevOps, Platform, Infrastructure, or Software Engineering roles.
- Strong foundational skills in infrastructure, automation, or software engineering.
- The ability to work independently on moderately complex problems while seeking guidance for strategic decisions.
- A desire to grow into senior engineering or technical leadership roles over time.

Our technology stack is primarily cloud-based, and while we currently use **AWS**, we welcome experience with **Azure, Google Cloud, or other cloud platforms**. A strong understanding of cloud-native principles matters more than specific provider experience. We value tools and practices that promote **automation, containerization, repeatability, and visibility**.

Compass Group Canada is committed to nurturing a diverse workforce representative of the communities within which we operate. We encourage and are pleased to consider all qualified candidates, without regard to race, colour, citizenship, religion, sex, marital / family status, sexual orientation, gender identity, aboriginal status, age, disability or persons who may require an accommodation, to apply.

For accommodation requests during the hiring process, please contact for further information.

Site Reliability Engineer

Montréal, Quebec Botpress Technologies Inc.

Posted today

Job Description

Help bring AI agents to companies worldwide.

Over the next decade, autonomous agents will redefine how we work.

Botpress allows companies to build and deploy advanced AI agents that move beyond conversation into real business logic.

Our product works today and at scale, across industries, regions, and limitless use cases.

As the 3rd fastest-growing B2B AI start-up worldwide, we’re at the forefront of the AI revolution, providing the most widely-used platform for sophisticated AI agents.

The work ahead is ambitious. The opportunity is rare. We take a deliberate approach to growth: product-led, capital-efficient, and highly focused.

If you want to build foundational technology for one of the most meaningful platform shifts in software, we’re looking for top talent to join us.

Key Highlights:

  • Over 1 million AI agents and chatbots deployed
  • 700,000+ platform users
  • Trusted by 35% of Fortune 500 companies
  • 7 years of expertise in AI solutions

About the Role

We’re hiring a Site Reliability Engineer to help ensure the stability, scalability, and security of our platform. You’ll be part of the product team, owning the systems that keep our services resilient and performant under real-world loads.

This is a hands-on engineering role focused on infrastructure reliability and operational excellence. You’ll architect and maintain the cloud systems (e.g. AWS) that power Botpress, with a strong focus on observability, uptime, and automation.

You’ll collaborate closely with engineers to refine how we ship, monitor, and operate software — always with an eye toward reducing risk and improving speed. Part of this role will include opening up the site to different regions of users.

Responsibilities
  • Architect and maintain scalable infrastructure
  • Design and optimize CI/CD pipelines to ensure smooth delivery of changes
  • Improve observability through advanced monitoring, logging, and alerting
  • Own incident response and support the engineering team in diagnosing and resolving issues
  • Build systems that increase platform reliability, resiliency, and uptime
  • Enforce security best practices across environments and workflows
  • Manage infrastructure as code using tools like Terraform or Pulumi
  • Document operational procedures, disaster recovery plans, and system runbooks

Requirements

  • 3+ years in SRE, DevOps, or infrastructure engineering roles
  • Deep experience with AWS cloud infrastructure and services (ECS, S3, Lambda, RDS)
  • Comfortable with Linux systems, containerization, and orchestration (e.g. Docker, Kubernetes)
  • Proficient in CI/CD tools, infrastructure-as-code, and automation scripting
  • Familiar with incident management and site reliability principles
  • Experience with observability stacks like Datadog, Grafana, Prometheus, etc.
  • Strong communicator and collaborator across technical teams
  • Calm and systematic under pressure when production issues arise
  • Bonus: Previous experience in a fast-paced startup or SaaS environment

About Botpress

Botpress recently raised its $25 million Series B funding. As a fast-growing start-up, we run a lean and innovative ship that leans on AI for maximum business impact. At Botpress, everyone is an owner, bringing their unique perspective and talents.

Our teams are talented and passionate. We intentionally hire individuals who are eager, passionate, talented, and hungry to learn and grow throughout their career.

You'll be on a team that's not just adapting to the AI revolution, but leading it. Joining our team means changing the future of enterprise AI and building technology that will define the next era of business automation.

Benefits

  • Work at one of Canada’s fastest-growing AI start-ups
  • Work with a talented and passionate team
  • 4 weeks of vacation
  • Paid sick and parental leave
  • Comprehensive health, dental, vision, travel, and life insurance
  • Funding for education and skills improvement
  • Fully-stocked fridge and cupboard – we take snacks seriously
  • Your own desk – no ‘hot-desk’-style sign-up systems
  • A vibrant office community, including weekly socials

Site Reliability Engineer

Ottawa, Ontario Sectigo

Posted today

Job Description

Company Description

At Sectigo, we align around our mission and pride ourselves in helping thousands of customers sleep better at night.

Sectigo is a leading provider of digital identity and cybersecurity solutions, offering a comprehensive suite of products to protect online transactions and communications. Our mission is to secure the digital landscape for enterprises worldwide.

“When people think online trust management, they think Sectigo because we offer our customers unparalleled peace of mind.”

How we show up with each other and our customers every day is just as important, and we win as #OneSectigo by living out our core values - Support, Excellence, Communication, Teamwork, Integrity, Growth and Openness. We are committed to investing in our diverse teams where everyone understands their role and how they support our strategic goals, we drive operational excellence through scale and efficiency, and we strive to delight our customers and become the market leader in our industry. If you aspire to join a driven team that holds each other accountable to meeting our lofty goals and you’d like to be part of our growth story in delivering a market leading user experience, we’d like to talk to you.

Job Description

We are looking for a Site Reliability Engineer to join our growing global team at Sectigo.

The Site Reliability Engineer will design and implement solutions to reduce toil and ensure reliability of our critical services at Sectigo.   

This is a full-time, individual contributor role on a hybrid model, with at least 3 days a week in our Ottawa office, reporting to our Cloud Operations team.

The compensation range for this position is between CAN 100,000 and CAN 115,000, based on years of experience and internal equity.

Here are the core functions, responsibilities, and expectations for this role: 

  • Ensure the reliability of our critical products and services by meeting or exceeding SRE objectives.  
  • Instantiate and maintain production infrastructure using Infrastructure as Code and Configuration Management tools.  
  • Build and maintain proper monitoring of our services by utilizing centralized logging and time series databases.  
  • Automate deployments, administration, and monitoring of our services by following CI/CD practices.  
  • Work with engineering and information security teams to enhance, document, and establish processes, and to improve the operability and security of our services.
  • Other duties as assigned and related to the nature of this role and company initiatives. 
  • Participation in team on-call rotation is required.

Qualifications

Education:

  • Bachelor's degree in information systems, computer science, technology, or a related field is preferred. 

Experience:

  • Minimum of 3 years of software and/or operational experience in building and maintaining internet-facing production environments is required.
  • Strong experience with Linux/Unix systems administration.  
  • Knowledge of source control tools (Git preferred).  
  • Experience with Configuration Management and Infrastructure as Code tools (Ansible, Puppet, Terraform preferred).   
  • Good understanding of container technology (Docker, Kubernetes preferred). 
  • Experience with monitoring tools (Prometheus, Grafana, Nagios, or similar) and alerting systems.
  • Experience with non-cloud infrastructure.  
  • Experience running a large-scale 24/7 production environment.   
  • Experience with distributed data processing, databases, and large-scale file systems is a plus.  

Ideal Candidate Profiles, Talents, and Desired Qualifications:

  • Strong scripting abilities in Bash and Python. 
  • Experience with incident management, troubleshooting, and root cause analysis. 
  • Experience in handling postmortems, building incident response plans, and improving incident resolution procedures. 
  • Experience running and maintaining real-world build systems (Jenkins, DroneCI, or similar tools).
  • Demonstrable experience with the entire life cycle of software, starting with Systems Architecture, Systems Design, Implementation, Maintenance, and Operation.  
  • Programming experience using HTTP Service APIs. 
  • Virtualization experience (VMware, Proxmox, Oracle Linux Virtualization Manager).
  • Network administration experience is a plus.   
  • Exposure to Security and Testing frameworks is a plus.   
  • Exposure to compliance-regulated industries such as Finance, Healthcare, or Government is a plus.


Additional Information

Global team. Global reach. Global impact.

At Sectigo, we believe doing good is good business. Our strength and our success come from our team of passionate, engaged individuals who make a difference, both locally and globally. Our commitment to engagement is rooted in an unconditionally inclusive workforce, embodying our unique perspectives, heritages, and backgrounds, all as diverse as the experiences of each Sectigo employee. Importantly, we strive to be recognized not only as the CLM leader but also for our intentional efforts to promote employees into the roles that most challenge and excite them, into experiences that allow them to grow their interests as we grow the business. We are committed to bringing a little bit of fun and a whole lot of happiness into everything we do so that our work – and our team members – reflect the positive outcomes we deliver to our customers every day.  

Site Reliability Engineer

Montréal, Quebec Fed IT

Posted today

Job Description

Hello,
I'm Robin, a recruitment and business development consultant at FED IT, a recruitment firm specializing in IT roles.
I handle two types of recruitment: temporary and permanent.
All of our recruitment consultants are IT experts who speak your language and work in your world. We cover IT, development, business intelligence, and infrastructure roles.

  • Build and maintain the shared CI/CD pipelines used by the wider team.
  • Promote resilience and stability best practices among application and infrastructure teams.
  • Take part in introducing and integrating generative-AI-driven development into our development cycle.
  • Understand the main flows of our critical environments in order to identify single points of failure.
  • Support IT teams in improving their documentation and architecture diagrams to include resilience and stability information.
  • Promote and increase the automation of IT tasks to reduce human error.
  • Perform end-to-end stability analysis and recommend improvements to system performance and resilience.
  • Promote monitoring best practices and support IT teams in implementing key resilience and stability indicators.
  • Support IT teams following major events that affect the resilience of their systems.
  • Take part in redesigning the cross-functional architecture of the credit card domain.
  • Challenge your fellow architects, developers, and designers in order to grow the team as a whole.
  • Take part in a wide range of large-scale projects.
  • Take part in supporting the applications developed by the team under the "you build it, you run it" model.

  • Expertise in software design of complex systems supporting thousands of concurrent clients.
  • Proven skills with GitHub Copilot and the VS Code editor. Knowledge of AWS Bedrock and OpenAI is an asset.
  • Excellent understanding of DevSecOps, monitoring, and observability principles.
  • Experience with AWS cloud technology (service development, deployment, automation, and operations).
  • Proficiency with monitoring tools (Datadog, CloudWatch, Splunk).
  • Experience working with APIs.
  • Experience in a technology leadership position.
  • 24/7 operational experience.
  • Experience with load testing and analysis.
  • Experience with disaster recovery procedures.
  • Strong ability to solve complex, multi-system problems.

Site Reliability Engineer

Laval, Quebec Fed IT

Posted today

Job Description

Hello,
I'm Robin, a recruitment and business development consultant at FED IT, a recruitment firm specializing in IT roles.
I handle two types of recruitment: temporary and permanent.
All of our recruitment consultants are IT experts who speak your language and work in your world. We cover IT, development, business intelligence, and infrastructure roles.

  • Build and maintain the shared CI/CD pipelines used by the wider team.
  • Promote resilience and stability best practices among application and infrastructure teams.
  • Take part in introducing and integrating generative-AI-driven development into our development cycle.
  • Understand the main flows of our critical environments in order to identify single points of failure.
  • Support IT teams in improving their documentation and architecture diagrams to include resilience and stability information.
  • Promote and increase the automation of IT tasks to reduce human error.
  • Perform end-to-end stability analysis and recommend improvements to system performance and resilience.
  • Promote monitoring best practices and support IT teams in implementing key resilience and stability indicators.
  • Support IT teams following major events that affect the resilience of their systems.
  • Take part in redesigning the cross-functional architecture of the credit card domain.
  • Challenge your fellow architects, developers, and designers in order to grow the team as a whole.
  • Take part in a wide range of large-scale projects.
  • Take part in supporting the applications developed by the team under the "you build it, you run it" model.

  • Expertise in software design of complex systems supporting thousands of concurrent clients.
  • Proven skills with GitHub Copilot and the VS Code editor. Knowledge of AWS Bedrock and OpenAI is an asset.
  • Excellent understanding of DevSecOps, monitoring, and observability principles.
  • Experience with AWS cloud technology (service development, deployment, automation, and operations).
  • Proficiency with monitoring tools (Datadog, CloudWatch, Splunk).
  • Experience working with APIs.
  • Experience in a technology leadership position.
  • 24/7 operational experience.
  • Experience with load testing and analysis.
  • Experience with disaster recovery procedures.
  • Strong ability to solve complex, multi-system problems.

Site Reliability Engineer

Longueuil, Quebec Fed IT

Posted today

Job Description

Hello,
I'm Robin, a recruitment and business development consultant at FED IT, a recruitment firm specializing in IT roles.
I handle two types of recruitment: temporary and permanent.
All of our recruitment consultants are IT experts who speak your language and work in your world. We cover IT, development, business intelligence, and infrastructure roles.

  • Build and maintain the shared CI/CD pipelines used by the wider team.
  • Promote resilience and stability best practices among application and infrastructure teams.
  • Take part in introducing and integrating generative-AI-driven development into our development cycle.
  • Understand the main flows of our critical environments in order to identify single points of failure.
  • Support IT teams in improving their documentation and architecture diagrams to include resilience and stability information.
  • Promote and increase the automation of IT tasks to reduce human error.
  • Perform end-to-end stability analysis and recommend improvements to system performance and resilience.
  • Promote monitoring best practices and support IT teams in implementing key resilience and stability indicators.
  • Support IT teams following major events that affect the resilience of their systems.
  • Take part in redesigning the cross-functional architecture of the credit card domain.
  • Challenge your fellow architects, developers, and designers in order to grow the team as a whole.
  • Take part in a wide range of large-scale projects.
  • Take part in supporting the applications developed by the team under the "you build it, you run it" model.

  • Expertise in software design of complex systems supporting thousands of concurrent clients.
  • Proven skills with GitHub Copilot and the VS Code editor. Knowledge of AWS Bedrock and OpenAI is an asset.
  • Excellent understanding of DevSecOps, monitoring, and observability principles.
  • Experience with AWS cloud technology (service development, deployment, automation, and operations).
  • Proficiency with monitoring tools (Datadog, CloudWatch, Splunk).
  • Experience working with APIs.
  • Experience in a technology leadership position.
  • 24/7 operational experience.
  • Experience with load testing and analysis.
  • Experience with disaster recovery procedures.
  • Strong ability to solve complex, multi-system problems.
