Role Overview
We are looking for a Senior AI Systems Engineer who can translate research concepts and product requirements into reliable, production-grade AI systems.
You will own the design, implementation, optimisation, evaluation, and deployment of Cliply’s AI processing pipelines. You will work closely with the company’s product and technical leadership to build reusable AI components for video, audio, and text understanding.
This is a hands-on engineering role. You will be expected to write code, run experiments, evaluate models, design system components, deploy services, and resolve production issues.
Key Responsibilities
Multimodal AI Development
- Design, implement, and optimise AI pipelines that process video, audio, speech, images, and text.
- Integrate pretrained and open-source models for computer vision, video understanding, speech processing, language understanding, and multimodal reasoning.
- Develop approaches for combining signals across modalities, including visual content, dialogue, audio events, metadata, and temporal context.
- Build reusable AI abstractions that can support multiple downstream product applications.
Video and Temporal Intelligence
- Develop systems for frame-, shot-, scene-, event-, and segment-level video understanding.
- Work with sequential and temporal information to identify actions, transitions, relationships, and meaningful moments.
- Design appropriate frame-sampling, clip-sampling, windowing, aggregation, and temporal-fusion strategies.
- Evaluate approaches for long-form video processing while balancing quality, latency, and compute cost.
Model Evaluation and Experimentation
- Evaluate and benchmark models using appropriate datasets, baselines, and quality metrics.
- Design reproducible experiments and perform error analysis, ablation studies, and model comparisons.
- Identify model limitations, hallucinations, failure modes, data-quality issues, and domain-specific gaps.
- Develop clear evaluation frameworks for AI-generated metadata, classifications, rankings, and clips.
Production AI Engineering
- Build, test, deploy, and maintain production AI services and inference pipelines.
- Develop scalable APIs and asynchronous processing workflows for large media files.
- Take ownership of performance, reliability, maintainability, observability, and failure recovery.
- Optimise model serving for latency, throughput, memory utilisation, and infrastructure cost.
- Implement versioning, monitoring, logging, and reproducibility across models, datasets, prompts, and configurations.
Model Optimisation and Deployment
- Convert and optimise models using appropriate tools and techniques such as ONNX Runtime, TensorRT, quantisation, pruning, batching, or distillation.
- Deploy AI workloads across cloud, GPU, and containerised environments.
- Profile inference pipelines and resolve bottlenecks across preprocessing, inference, post-processing, data movement, and storage.
- Evaluate when to use self-hosted models, managed APIs, or hybrid inference architectures.
Architecture and Collaboration
- Work with the Lead Architect and product leadership to translate proprietary concepts into implementable technical designs.
- Contribute to data models, schemas, APIs, storage strategies, and system interfaces.
- Participate in technical design reviews and challenge assumptions constructively.
- Document architecture decisions, model behaviour, experiments, interfaces, and operational procedures.
- Mentor junior engineers and contribute to engineering standards and technical hiring.
Required Qualifications
- At least five years of hands-on experience in AI, machine learning, deep learning, computer vision, video intelligence, or related systems.
- Strong programming skills in Python.
- Strong practical experience with PyTorch, TensorFlow, or a comparable deep-learning framework.
- Demonstrated experience taking AI or ML systems from experimentation through production deployment.
- Strong understanding of neural-network architectures, model training, evaluation, inference, and optimisation.
- Experience working in at least one of the following areas:
  - computer vision or video understanding;
  - multimodal AI;
  - generative AI;
  - speech or audio processing;
  - machine-learning systems;
  - large language models;
  - temporal or sequential modelling.
- Experience building APIs, inference services, or processing pipelines using technologies such as FastAPI, Docker, Kubernetes, cloud services, or distributed workers.
- Ability to understand technical papers and convert relevant concepts into working prototypes.
- Strong problem-solving, debugging, and systems-thinking capability.
- Ability to work independently in an early-stage environment with evolving requirements.
Highly Desirable Experience
- Video transformers, action recognition, temporal localisation, temporal grounding, or long-form video understanding.
- Cross-modal alignment or fusion involving video, audio, speech, and text.
- Vision-language models, multimodal language models, embedding models, or retrieval systems.
- Model optimisation using TensorRT, ONNX Runtime, CUDA, quantisation, pruning, batching, or distillation.
- GPU-based inference and deployment.
- FFmpeg, OpenCV, video codecs, frame processing, or media-processing pipelines.
- Vector databases, semantic retrieval, metadata indexing, or knowledge representations.
- Scene detection, tracking, segmentation, object detection, OCR, ASR, or audio-event detection.
- Experiment tracking and model lifecycle tools such as MLflow or Weights & Biases.
- Production monitoring, data-drift detection, retraining workflows, and model versioning.
- Startup, research-to-production, or zero-to-one product-development experience.
What We Are Not Looking For
This role is unlikely to be suitable for candidates whose experience is primarily limited to:
- prompt engineering without substantive ML or software-engineering depth;
- LangChain or RAG application development without model or production ownership;
- data analysis or business intelligence;
- MLOps administration without hands-on AI model development;
- classical computer vision integration without temporal, multimodal, or production-system exposure;
- academic research without evidence of implementation and deployment.
These are not automatic disqualifiers, but candidates must demonstrate broader depth in both engineering and AI.
What Success Looks Like
Within the first six months, the successful candidate should be able to:
- build an end-to-end pipeline for processing long-form video;
- integrate visual, audio, speech, and textual signals;
- create structured and reusable outputs from the processed content;
- develop and validate scoring, retrieval, or segmentation components;
- deploy reliable AI services with measurable performance;
- establish repeatable evaluation and experimentation practices;
- contribute meaningfully to the architecture of Cliply’s proprietary media-intelligence layer.
Application Requirements
Candidates should provide:
- a current CV;
- links to GitHub, technical publications, patents, demos, or relevant projects, where available;
- a brief description of one AI system they personally designed, implemented, and deployed;
- clarification of their individual contribution to any major team project referenced in their application.
About Cliply
Cliply is building an AI-native media-intelligence platform that transforms unstructured video, audio, and text into structured, searchable, and reusable intelligence.
Our platform combines state-of-the-art AI models with proprietary algorithms and representations to support applications such as:
- long-form video understanding;
- intelligent clip and highlight generation;
- semantic search and retrieval;
- content classification and metadata generation;
- contextual advertising and commerce;
- media automation and analytics.
We are not seeking to train a foundation model from scratch. Our focus is on building robust, differentiated AI systems that integrate, evaluate, adapt, and operationalise modern multimodal models at production scale.
Why Join Cliply
- Work on a difficult, technically differentiated multimodal AI problem.
- Shape the architecture of a platform at an early and foundational stage.
- Own meaningful components rather than a narrow subsystem.
- Work across research, machine learning, backend systems, and product development.
- Contribute to proprietary technology and potential intellectual property.
- Build systems that can support search, editing, automation, monetisation, and intelligence across multiple forms of media.