Role and Project Scope:
We are seeking a skilled and passionate ML Platform Engineer to join our team and build the next generation of our machine learning infrastructure. You will be responsible for designing, implementing, and maintaining the core MLOps platform that empowers our Data Science and ML Engineering teams to rapidly develop, deploy, and monitor high-performance models at scale.
Crucially, you will contribute to the evolution of our unified AI Platform, covering both traditional ML and our growing LLM (Large Language Model) platform.
What You’ll Do:
- Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services.
- Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently.
- Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Continuous Delivery, Continuous Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines.
- Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure.
- Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift.
- Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines.
- Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments.
- Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models.
- AI Enablement: Spearhead the creation of AI solutions that enable growth and drive internal efficiency across the business.
What You Need to Have:
- 10+ years in backend software development, including at least 2 years focused on AI/ML platforms or MLOps infrastructure.
- Deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management.
- Proven experience designing and implementing low-latency model serving solutions.
- Proficiency in Python and a track record of writing high-quality, maintainable code.
- Experience designing and developing large-scale, highly available distributed systems for high-concurrency, low-latency inference.
- Excellent communication and mentoring abilities.
- A degree in Computer Science, Mathematics, or a related field.