Humanloop

Humanloop is the LLM evals platform for enterprises. Teams at Gusto, Vanta and Duolingo use Humanloop to ship reliable AI products. We enable you to adopt best practices for prompt management, evaluation and observability.

Founded: 2020
Location: London, United Kingdom
Employees: 12
Funding: $12M Series A

Humanloop: Enterprise LLM Evaluation and Ops Platform (Now Joining Anthropic)

Humanloop built an enterprise platform for evaluating and operating large language models (LLMs), with a focus on reliable AI delivery for production teams. The product centered on eval pipelines, prompt/version management, observability, and closing the feedback-to-fine‑tuning loop. In 2025, the team announced that Humanloop is joining Anthropic, and public reports indicate a product wind‑down.

  • Homepage/status: [Humanloop](https://humanloop.com) and [Product overview](https://humanloop.com/home)
  • Coverage: [TechCrunch on the team joining Anthropic](https://techcrunch.com/2025/08/13/anthropic-nabs-humanloop-team-as-competition-for-enterprise-ai-talent-heats-up/) and an [Agenta sunsetting/migration post](https://agenta.ai/blog/humanloop-sunsetting-migration-and-alternative)

What Humanloop Did

  • Evals and test gating for LLM applications: build datasets, run automated evaluators (including LLM‑as‑a‑judge), compare runs across models/prompts, and track regressions over time.
  • Platform overview: [Evaluations](https://humanloop.com/platform/evaluations) and [Docs: Evaluation overview](https://humanloop.com/docs/v4/guides/evaluation/overview)
  • LLM‑as‑a‑judge: use an LLM to score outputs on relevance, factuality, safety, tone, and policy adherence (a minimal judge sketch follows this list).
  • Primer: [LLM‑as‑a‑Judge](https://humanloop.com/blog/llm-as-a-judge)
  • RAG and agent testing: reproducible evals for retrieval quality and answer correctness, with run histories for CI‑style comparisons.
  • Guides: [RAG evaluation tutorial](https://humanloop.com/docs/tutorials/rag-evaluation) and [RAG explained](https://humanloop.com/blog/rag-explained)
  • Feedback-to-fine‑tuning loop: capture user interactions, label good/bad outcomes, curate datasets, and fine‑tune models.
  • How‑tos: [Fine‑tune a model](https://humanloop.com/docs/v4/guides/finetune-a-model) and [What is fine‑tuning?](https://humanloop.com/blog/what-is-fine-tuning)
  • Prompt management and observability: version prompts, trace model calls, review logs, and compare model performance with built‑in logging.
  • Integration overview: [Integrating Humanloop](https://humanloop.com/docs/explanation/integrating-humanloop)
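
To make the LLM-as-a-judge evaluator mentioned above concrete, here is a minimal, illustrative sketch of the pattern rather than Humanloop's implementation: a judge model scores a single answer against fixed criteria and returns JSON. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the judge model, the criteria, and the 1-5 scale are arbitrary choices for the example.

```python
# Illustrative LLM-as-a-judge sketch -- not Humanloop's implementation.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; the judge model,
# criteria, and 1-5 scale are arbitrary choices.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are an evaluator. Score the assistant's answer from 1 (poor) to 5 (excellent) "
    "on each criterion: relevance, factuality, safety, tone. "
    'Respond with JSON only, e.g. {"relevance": 4, "factuality": 5, "safety": 5, "tone": 4}.'
)

def judge(question: str, answer: str) -> dict:
    """Score a single (question, answer) pair with a judge model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(judge("What is the capital of France?", "Paris is the capital of France."))
```

In a platform such as Humanloop, scores like these would typically be attached to logged generations and aggregated across a dataset rather than printed per call.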

How It Worked (Evaluation Pipeline)

    1. Build representative datasets from real user tasks or logs.

    2. Add evaluators: *LLM‑as‑a‑judge* and/or human graders with criteria such as helpfulness, factuality, safety, and policy adherence.

    3. Run comparisons across prompt or model versions; track regression metrics over time.

    4. Gate deploys with pass thresholds; ship fixes via prompt updates or fine‑tuning (a minimal gating sketch follows this section).

  • References: [Evaluations overview](https://humanloop.com/platform/evaluations) and [Docs: Evaluation pipeline](https://humanloop.com/docs/v4/guides/evaluation/overview)
  • For RAG/agents, Humanloop provided end‑to‑end examples for evaluating retrieval steps and final answer quality (the sketch below includes a simple retrieval check):

  • Guide: [RAG evaluation tutorial](https://humanloop.com/docs/tutorials/rag-evaluation)
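
A minimal, CI-style sketch of steps 1-4 might look like the following. The dataset, the retrieve and run_app placeholders, the substring-based correctness check (standing in for an LLM judge or human grader), and the 0.9 thresholds are all hypothetical; the point is the shape of the loop: run the dataset, compute metrics, and fail the build when a metric regresses below a threshold.

```python
# Minimal CI-style eval gate in the spirit of steps 1-4 above (illustrative only).
# `retrieve` and `run_app` stand in for a real RAG application; the dataset,
# substring-based correctness check, and thresholds are hypothetical choices.
import sys

DATASET = [
    {"question": "When was Humanloop founded?", "must_contain": "2020", "relevant_doc": "about.md"},
    {"question": "Where is Humanloop based?", "must_contain": "London", "relevant_doc": "about.md"},
]

def retrieve(question: str) -> list[str]:
    """Placeholder retriever: return the document IDs the pipeline fetched."""
    return ["about.md"]

def run_app(question: str, docs: list[str]) -> str:
    """Placeholder generator: return the application's answer."""
    return "Humanloop was founded in 2020 and is based in London."

def evaluate(dataset: list[dict]) -> dict:
    correct = retrieval_hits = 0
    for example in dataset:
        docs = retrieve(example["question"])
        answer = run_app(example["question"], docs)
        retrieval_hits += example["relevant_doc"] in docs             # retrieval quality
        correct += example["must_contain"].lower() in answer.lower()  # answer correctness
    n = len(dataset)
    return {"answer_accuracy": correct / n, "retrieval_hit_rate": retrieval_hits / n}

if __name__ == "__main__":
    metrics = evaluate(DATASET)
    print(metrics)
    # Gate the deploy: fail the CI job if either metric drops below threshold.
    if metrics["answer_accuracy"] < 0.9 or metrics["retrieval_hit_rate"] < 0.9:
        sys.exit("Eval gate failed; blocking deploy.")
```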

Integrations

  • Model providers: OpenAI, Anthropic, Google, Azure OpenAI, AWS Bedrock, and custom models.
  • Details: [Supported models](https://humanloop.com/docs/reference/models)
  • Integration methods: SDKs/APIs; run evals in your own runtime or managed on Humanloop; store artifacts in code or inside the platform (see the generic logging sketch below).
  • How to integrate: [Integrating Humanloop](https://humanloop.com/docs/explanation/integrating-humanloop)
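
The integration pattern described above can be sketched generically: wrap each model call and record the prompt version, inputs, output, and latency so runs can be compared later. This is not the Humanloop SDK; it is only an illustration of the kind of data such a platform captures, written against the OpenAI Python SDK with a local JSONL file standing in for the platform's API. The prompt template, version label, and log path are hypothetical.

```python
# Generic logging/tracing sketch for the integration pattern described above.
# This is NOT the Humanloop SDK -- only an illustration of capturing prompt
# version, inputs, output, and latency for later comparison. Assumes the
# OpenAI Python SDK; the prompt template, version label, and log path are
# hypothetical.
import json
import time
import uuid

from openai import OpenAI

client = OpenAI()
PROMPT_VERSION = "support-reply@v3"  # hypothetical prompt/version identifier
PROMPT_TEMPLATE = "Answer the customer question concisely:\n{question}"

def call_and_log(question: str, log_path: str = "llm_logs.jsonl") -> str:
    """Call the model and append a structured log record for later analysis."""
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(question=question)}],
    )
    output = response.choices[0].message.content
    record = {
        "id": str(uuid.uuid4()),
        "prompt_version": PROMPT_VERSION,
        "inputs": {"question": question},
        "output": output,
        "latency_s": round(time.time() - start, 3),
        "model": "gpt-4o-mini",
    }
    with open(log_path, "a") as f:  # a real platform would receive this via its API
        f.write(json.dumps(record) + "\n")
    return output
```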

Pricing and Free Plan

  • A public free plan historically offered: 2 members, 50 eval runs/month, and 10k logs/month, with enterprise options for scale and private deployment.
  • Historical references: [Pricing](https://humanloop.com/pricing) and [Software Finder](https://softwarefinder.com/artificial-intelligence/humanloop)
  • Note: Given the wind‑down, verify current availability and terms on the [Humanloop site](https://humanloop.com).

Customers and Proof

  • Notable customers: Gusto, Vanta, Duolingo.
  • Case study: Gusto implemented dataset‑based gates for every model/prompt update before production.
  • Read more: [Gusto case study](https://humanloop.com/case-studies/gusto)

Company Facts

  • Company: Humanloop
  • Founded: 2020
  • Founders: Raza Habib, Peter Hayes, Jordan Burgess
  • HQ: London, UK
  • Focus: Evals, prompt management, logging/observability, feedback, fine‑tuning
  • Tagline: “The LLM evals platform for enterprises”
  • Status: Team joining Anthropic; product appears to be sunsetting
  • Sources: [Homepage/status](https://humanloop.com), [LinkedIn](https://www.linkedin.com/company/humanloop), [TechCrunch coverage](https://techcrunch.com/2025/08/13/anthropic-nabs-humanloop-team-as-competition-for-enterprise-ai-talent-heats-up/), [Agenta sunsetting/migration](https://agenta.ai/blog/humanloop-sunsetting-migration-and-alternative)

Who It Was For

  • Product and platform teams shipping LLM features that must be reliable
  • AI/ML engineers and data scientists needing eval‑driven development, prompt management, and observability
  • Enterprises seeking governance, test gates, and controlled rollouts for AI apps

Common Use Cases

  • Customer support automation with eval gates on accuracy, tone, and compliance
  • Knowledge base Q&A with RAG retrieval checks and answer correctness scoring
  • Content and code assistants safeguarded by regression test sets to prevent quality drift
  • Safety/compliance scoring using LLM‑as‑a‑judge plus human spot checks
  • Fine‑tuning on real user feedback to boost task performance
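
As a sketch of the feedback-to-fine-tuning loop behind that last use case, the step below filters logged interactions by a positive feedback label and exports them as chat-format JSONL, the shape accepted by OpenAI-style fine-tuning jobs. The log file and its feedback field names are hypothetical and follow the logging sketch earlier on this page.

```python
# Illustrative feedback-to-fine-tuning curation step (not Humanloop's pipeline).
# Assumes logged interactions carry a user feedback label; the field names and
# the "thumbs_up" filter are hypothetical. Output is chat-format JSONL of the
# kind accepted by OpenAI-style fine-tuning jobs.
import json

def curate(logs_path: str = "llm_logs.jsonl", out_path: str = "finetune.jsonl") -> int:
    """Keep only positively rated interactions and write fine-tuning examples."""
    kept = 0
    with open(logs_path) as logs, open(out_path, "w") as out:
        for line in logs:
            record = json.loads(line)
            if record.get("feedback") != "thumbs_up":  # keep only positively rated outputs
                continue
            example = {
                "messages": [
                    {"role": "user", "content": record["inputs"]["question"]},
                    {"role": "assistant", "content": record["output"]},
                ]
            }
            out.write(json.dumps(example) + "\n")
            kept += 1
    return kept

if __name__ == "__main__":
    print(f"Curated {curate()} examples for fine-tuning.")
```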

User‑Reported Pros and Cons

  • Pros:
  • Strong evaluation focus and actionable guidance for reliable AI apps
  • Learn more: [Platform evaluations](https://humanloop.com/platform/evaluations) and [Evaluating LLM apps (blog)](https://humanloop.com/blog/evaluating-llm-apps)
  • Practical RAG evaluation playbooks for production teams
  • Tutorial: [RAG evaluation](https://humanloop.com/docs/tutorials/rag-evaluation)
  • Credible references like Gusto with data‑driven release processes
  • Example: [Gusto case study](https://humanloop.com/case-studies/gusto)
  • Cons:
  • Pricing concerns from some builders
  • Discussion: [Reddit thread](https://www.reddit.com/r/OpenAI/comments/127tb7m/human_loop_any_good/)
  • Longevity risk due to the team joining Anthropic and product sunsetting
  • Status: [Humanloop site](https://humanloop.com) and [Agenta migration post](https://agenta.ai/blog/humanloop-sunsetting-migration-and-alternative)

Migration Considerations and Alternatives

With the Humanloop product reportedly winding down, teams have explored migration paths to similar platforms focused on evals, tracing, prompt management, and observability. Common alternatives mentioned in the market include LangSmith, Langfuse, HoneyHive, Orq.ai, Vellum, and LangWatch. Several vendors have published migration guides tailored for former Humanloop users. Validate feature parity for:

  • Dataset‑based evals and CI/regression runs
  • LLM‑as‑a‑judge and human grading workflows
  • Prompt/version management and test gating
  • Logging/tracing and production observability
  • Feedback collection and fine‑tuning support
  • BYO key model integrations and on‑prem/private deployment options

Key Resources

  • Overview and status: [Humanloop](https://humanloop.com) and [Product overview](https://humanloop.com/home)
  • Evals: [Platform evaluations](https://humanloop.com/platform/evaluations) and [Docs: Evaluation overview](https://humanloop.com/docs/v4/guides/evaluation/overview)
  • RAG evaluation: [Tutorial](https://humanloop.com/docs/tutorials/rag-evaluation)
  • Fine‑tuning: [Guide](https://humanloop.com/docs/v4/guides/finetune-a-model) and [What is fine‑tuning?](https://humanloop.com/blog/what-is-fine-tuning)
  • LLM‑as‑a‑judge: [Blog post](https://humanloop.com/blog/llm-as-a-judge)
  • Supported models: [Models reference](https://humanloop.com/docs/reference/models)
  • Integration: [Integrating Humanloop](https://humanloop.com/docs/explanation/integrating-humanloop)
  • Case study: [Gusto](https://humanloop.com/case-studies/gusto)
  • Pricing (historical): [Pricing page](https://humanloop.com/pricing)
  • Company: [LinkedIn](https://www.linkedin.com/company/humanloop)
  • News and migration: [TechCrunch coverage](https://techcrunch.com/2025/08/13/anthropic-nabs-humanloop-team-as-competition-for-enterprise-ai-talent-heats-up/) and [Agenta sunsetting/migration](https://agenta.ai/blog/humanloop-sunsetting-migration-and-alternative)

Related Companies

Galileo

Galileo is the leading platform for enterprise GenAI evaluation and observability. Our comprehensive suite of products supports builders across the new AI development workflow—from fine-tuning LLMs to developing, testing, monitoring, and securing their AI applications. Each product is powered by our research-backed evaluation metrics. Today, Galileo is used by 100s of AI teams from startups to Fortune 50 enterprises, including Twilio, Comcast, and HP.

HoneyHive

HoneyHive is the leading AI observability and evals platform, trusted by teams from next-gen AI startups to Fortune 100 enterprises. We make it easy and repeatable for modern AI teams to debug, evaluate, and monitor AI agents, and deploy them to production with confidence. HoneyHive’s founding team brings AI and infrastructure expertise from Microsoft, OpenAI, Amazon, Amplitude, New Relic, and Sisu. The company is based in New York and San Francisco.

Langfuse

Langfuse is the most popular open source LLMOps platform. It helps teams collaboratively develop, monitor, evaluate, and debug AI applications. Langfuse can be self-hosted in minutes and is battle-tested and used in production by thousands of users, from YC startups to large companies like Khan Academy or Twilio. Langfuse builds on a proven track record of reliability and performance. Developers can trace any large language model or framework using our SDKs for Python and JS/TS, our open API, or our native integrations (OpenAI, Langchain, Llama-Index, Vercel AI SDK).

Beyond tracing, developers use Langfuse Prompt Management, its open APIs, and testing and evaluation pipelines to improve the quality of their applications. Product managers can analyze, evaluate, and debug AI products by accessing detailed metrics on costs, latencies, and user feedback in the Langfuse Dashboard. They can bring humans in the loop by setting up annotation workflows for human labelers to score their application. Langfuse can also be used to monitor security risks through security frameworks and evaluation pipelines.

Langfuse enables non-technical team members to iterate on prompts and model configurations directly within the Langfuse UI or use the Langfuse Playground for fast prompt testing. Langfuse is open source and we are proud to have a fantastic community on GitHub and Discord that provides help and feedback. Do get in touch with us!

LangSmith

LangChain provides the agent engineering platform and open source frameworks developers need to ship reliable agents fast.

Phoenix (Arize AI)

Ship agents that work. Arize is the AI and agent engineering platform: one place for development, observability, and evaluation.

Portkey

AI Gateway, Guardrails, and Governance. Processing 14 Billion+ LLM tokens every day. Backed by Lightspeed.