Senior Machine Learning Engineer, AI Platform

Company: Mozilla
Location: Remote (Canada)
Contract: Permanent

About Mozilla

Mozilla Corporation, backed by a non-profit foundation, has been a driving force in shaping the internet for over 25 years. We are dedicated to creating products like Firefox and Pocket that empower individuals and promote an internet built for people, not corporations. Our work spans diverse areas including AI, social media, and security, all while staying true to our core mission of making the internet better for everyone. As a wholly owned subsidiary of the Mozilla Foundation, we prioritize our mission over shareholder interests, collaborating with thousands of global contributors to develop open-source software.

About the Team and Role

The AI Platform team is responsible for building the core infrastructure that enables intelligent experiences across all Mozilla products. This includes developing model training pipelines, high-throughput inference services, GPU orchestration, and secure, privacy-respecting AI systems designed for global scale and reliability.

We are seeking a Machine Learning Engineer with a strong platform engineering mindset to help design, build, and operate Mozilla's AI platform. This role sits at the intersection of machine learning, distributed systems, and production infrastructure. You will ensure that machine learning models are trained, deployed, and served efficiently, securely, and at scale. You will collaborate closely with product, infrastructure, and security teams to facilitate rapid iteration while adhering to strict performance and privacy requirements.

What You'll Do

Design, build, and operate key AI platform components for training, deploying, and serving machine learning models in production.
Manage end-to-end model serving and inference workflows, focusing on improvements in reliability, scalability, performance, and operational excellence.
Lead initiatives to optimize inference systems for throughput, latency, and cost-efficiency across both CPU and GPU workloads.
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization.
Own and enhance critical aspects of the model lifecycle, such as packaging, versioning, testing, validation, and deployment automation.
Implement and evolve observability practices (metrics, logging, tracing, alerting) to boost visibility and operational resilience of ML services and pipelines.
Partner with product, infrastructure, security, and data teams to develop scalable platform capabilities for AI-powered features.
Contribute to technical design discussions, propose architectural enhancements, and mentor junior engineers through code reviews and knowledge sharing.
Participate in and improve operational processes, including incident response, on-call rotations, and post-incident reviews.

What You'll Bring

Bachelor's degree with 4-6 years of relevant industry experience, or a Master's degree with significant hands-on experience building and operating production ML systems, or equivalent work experience.
Strong proficiency in Python development for machine learning systems, backend services, or distributed data processing.
Proven experience deploying and operating ML workloads in cloud environments with production-grade infrastructure.
Solid understanding of model serving architectures, inference pipelines, and performance trade-offs (latency, throughput, cost, scaling).
Hands-on experience with GPU-based workloads and accelerated computing in production.
Experience designing CI/CD pipelines and development workflows for reliable ML system deployment.
Ability to independently scope and drive technical initiatives, balancing product and operational priorities.
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems.
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams.

Bonus Skills

Experience with inference optimization strategies (e.g., batching, quantization, compilation, model conversion, hardware-specific tuning).
Familiarity with containerization and orchestration systems (e.g., Docker, Kubernetes) in production.
Experience designing observability systems for distributed services, including metrics strategy and performance profiling.
Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design.
Contributions to open-source ML infrastructure projects or leadership in building reusable internal ML tooling.

What We Offer

Generous performance-based bonus plans.
Comprehensive medical, dental, and vision coverage.
Substantial retirement contributions with immediate vesting.
Quarterly company-wide wellness days.
Country-specific holidays plus a birthday day off.
One-time home office stipend.
Annual professional development budget.
Quarterly well-being stipend.
Considerable paid parental leave.
Employee referral bonus program.
Additional benefits (life/AD&D, disability, EAP, etc.; varies by country).

Hiring Range:

Canada Tier 1 Locations: $128,000 -
