Brixo
LLM Engineer (Inference Optimization)

Engineering · Remote (Global) · Full-time

We are looking for an LLM Engineer to lead the development and delivery of our inference optimization product. You'll own the end-to-end process: PII-safe intake → analysis → scoring → benchmarking → deployment → observability. In the early phase, you will run this process directly for customers, producing clear evaluation reports and delivering optimized endpoints. As we grow, you will architect and build the automation and tooling to deliver these same services autonomously and at scale — serving billions of tokens across tens of thousands of customers.


What you'll do

  • Inference Analysis: Review customer prompts, tokens, and usage patterns to identify optimization opportunities including prompt efficiency, token reduction, caching, decoding strategies, and context management.
  • Scoring & Evaluation: Establish baseline metrics across accuracy, quality, cost, latency, determinism, and reliability.
  • Benchmarking: Orchestrate customer workloads against alternative models (OpenAI, Anthropic, Mistral, Cohere, open-source, etc.) and measure comparative performance.
  • Recommendation & Deployment: Recommend and configure optimized models, deploy endpoints with caching/guardrails, and provide dashboards for observability.
  • Automation: Build reusable pipelines for ingestion, analysis, scoring, benchmarking, and deployment.
  • Tooling: Develop internal frameworks for evaluation metrics, semantic caching, workload segmentation, and model routing.
  • Observability: Build dashboards and monitoring systems for real-time quality, cost, latency, and drift detection.
  • Continuous Improvement: Automate retraining, fine-tuning, and eval-gated releases at scale.

Must-haves

  • Strong experience with LLM APIs and frameworks (OpenAI, Anthropic, HuggingFace, vLLM, LangChain, etc.).
  • Hands-on experience optimizing prompts, decoding parameters, context windows, and caching.
  • Proficiency in Python and/or TypeScript for pipeline and platform development.
  • Experience with evaluation frameworks (LLM-as-judge, task-specific metrics, human-in-the-loop evals).
  • Familiarity with vector databases, embeddings, and retrieval systems for semantic caching and RAG.
  • Experience deploying APIs/endpoints and building observability/monitoring dashboards.
  • Solid understanding of data privacy, PII protection, and security best practices.

Nice-to-haves

  • Experience with fine-tuning (LoRA, PEFT, RLHF) and model hosting.
  • Familiarity with multi-model routing, bandit algorithms, or traffic canaries.
  • Experience with LLM observability tools such as Langfuse, Helicone, or LangSmith, or similar MLOps platforms.
  • Prior work in AI optimization, inference cost management, or applied research.

Apply for this role

No cover letter needed — just tell us what excites you.