Senior Machine Learning Engineer, AI Platform
Mozilla · Canada
Job Description
Company: Mozilla
Location: Remote (Canada)
Contract: Permanent
About Mozilla
Mozilla Corporation, backed by a non-profit foundation, has been a driving force in shaping the internet for over 25 years. We are dedicated to creating products like Firefox and Pocket that empower individuals and promote an internet built for people, not corporations. Our work spans diverse areas including AI, social media, and security, all while staying true to our core mission of making the internet better for everyone. As a wholly owned subsidiary of the Mozilla Foundation, we prioritize our mission over shareholder interests, collaborating with thousands of global contributors to develop open-source software.
About the Team and Role
The AI Platform team is responsible for building the core infrastructure that enables intelligent experiences across all Mozilla products. This includes developing model training pipelines, high-throughput inference services, GPU orchestration, and secure, privacy-respecting AI systems designed for global scale and reliability.
We are seeking a Senior Machine Learning Engineer with a strong platform engineering mindset to help design, build, and operate Mozilla's AI platform. This role sits at the intersection of machine learning, distributed systems, and production infrastructure. You will ensure that machine learning models are trained, deployed, and served efficiently, securely, and at scale. You will collaborate closely with product, infrastructure, and security teams to enable rapid iteration while meeting strict performance and privacy requirements.
What You'll Do
- Design, build, and operate key AI platform components for training, deploying, and serving machine learning models in production.
- Manage end-to-end model serving and inference workflows, focusing on improvements in reliability, scalability, performance, and operational excellence.
- Lead initiatives to optimize inference systems for throughput, latency, and cost-efficiency across both CPU and GPU workloads.
- Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization.
- Own and enhance critical aspects of the model lifecycle, such as packaging, versioning, testing, validation, and deployment automation.
- Implement and evolve observability practices (metrics, logging, tracing, alerting) to boost visibility and operational resilience of ML services and pipelines.
- Partner with product, infrastructure, security, and data teams to develop scalable platform capabilities for AI-powered features.
- Contribute to technical design discussions, propose architectural enhancements, and mentor junior engineers through code reviews and knowledge sharing.
- Participate in and improve operational processes, including incident response, on-call rotations, and post-incident reviews.
What You'll Bring
- Bachelor's degree with 4-6 years of relevant industry experience, or a Master's degree with significant hands-on experience building and operating production ML systems, or equivalent work experience.
- Strong proficiency in Python development for machine learning systems, backend services, or distributed data processing.
- Proven experience deploying and operating ML workloads in cloud environments with production-grade infrastructure.
- Solid understanding of model serving architectures, inference pipelines, and performance trade-offs (latency, throughput, cost, scaling).
- Hands-on experience with GPU-based workloads and accelerated computing in production.
- Experience designing CI/CD pipelines and development workflows for reliable ML system deployment.
- Ability to independently scope and drive technical initiatives, balancing product and operational priorities.
- Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems.
- Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams.
Bonus Skills
- Experience with inference optimization strategies (e.g., batching, quantization, compilation, model conversion, hardware-specific tuning).
- Familiarity with containerization and orchestration systems (e.g., Docker, Kubernetes) in production.
- Experience designing observability systems for distributed services, including metrics strategy and performance profiling.
- Exposure to privacy-preserving ML techniques, security best practices, or responsible AI system design.
- Contributions to open-source ML infrastructure projects or leadership in building reusable internal ML tooling.
What We Offer
- Generous performance-based bonus plans.
- Comprehensive medical, dental, and vision coverage.
- Substantial retirement contributions with immediate vesting.
- Quarterly company-wide wellness days.
- Country-specific holidays plus a birthday day off.
- One-time home office stipend.
- Annual professional development budget.
- Quarterly well-being stipend.
- Considerable paid parental leave.
- Employee referral bonus program.
- Additional benefits (life/AD&D, disability, EAP, etc.; these vary by country).
Hiring Range:
- Canada Tier 1 Locations: $128,000 -