In their article, Challenges in evaluating AI systems, Anthropic concluded that (1) robust evaluation tools for AI are tough to develop and implement, and (2) we need good tools because effective AI governance depends on our ability to evaluate AI systems meaningfully.

Anthropic then groups the challenges by evaluation method:

(ordered from least to most challenging)

  1. Multiple choice evaluations
  2. Third-party evaluation frameworks
  3. Crowdworkers
  4. Domain experts
  5. Generative AI
  6. External organizations

1. Multiple choice evaluations

These quantify model performance on various tasks, typically with a simple metric: accuracy. Two popular choices are Massive Multitask Language Understanding (MMLU), which measures accuracy on 57 tasks ranging from mathematics to history to law, and the Bias Benchmark for Question Answering (BBQ), which tests for social biases against people belonging to protected classes along nine social dimensions.

MMLU has several known problems: it is so widely used that it may have leaked into training data, simple formatting changes can lower model accuracy, developers implement it inconsistently (e.g., few-shot prompting versus chain-of-thought reasoning), and some questions are flat-out incorrect.
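The formatting sensitivity is easy to picture with a small harness. The following is a minimal sketch, not Anthropic's or MMLU's actual implementation: `format_question`, `accuracy`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in that deliberately breaks when the option labels change from (A) to (1).

```python
from typing import Callable, Sequence, Tuple

LETTERS = "ABCD"

# One item: (question text, answer choices, index of the correct choice)
Item = Tuple[str, Sequence[str], int]


def format_question(question: str, choices: Sequence[str], style: str) -> str:
    """Render one multiple-choice item with (A)/(B) or (1)/(2) labels."""
    lines = [question]
    for i, choice in enumerate(choices):
        label = LETTERS[i] if style == "letter" else str(i + 1)
        lines.append(f"({label}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[Item], ask_model: Callable[[str], int], style: str) -> float:
    """Fraction of items where the model's chosen index matches the gold index."""
    correct = 0
    for question, choices, gold_index in items:
        prompt = format_question(question, choices, style)
        correct += int(ask_model(prompt) == gold_index)
    return correct / len(items)


def toy_model(prompt: str) -> int:
    """Hypothetical stand-in for a real model: only answers correctly
    when it sees letter labels, to mimic format sensitivity."""
    return 1 if "(A)" in prompt else 0


items = [("What is 2 + 2?", ["3", "4", "5", "6"], 1)]
letter_acc = accuracy(items, toy_model, "letter")
number_acc = accuracy(items, toy_model, "number")
print(letter_acc, number_acc)
```

Running the same items under both styles and comparing the two numbers is one cheap way to check how much of a reported score depends on prompt formatting rather than on the model.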

BBQ doesn’t have an off-the-shelf implementation and took a great full-time engineer an entire week to test (and there were still some errors, like the model not answering the questions 😂). The Bias Score also requires some finesse to interpret.
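To see why the Bias Score needs care, here is a minimal sketch that assumes the score is defined roughly as in the BBQ paper: among answers that are not "unknown", take the share that match the stereotype, rescale it to the range -1 to 1, and in ambiguous contexts scale that by the error rate. The exact formula, the `Response` type, and the zero-division fallback are assumptions for illustration, not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    answered_unknown: bool  # model chose the "unknown" option
    biased: bool            # model's answer matched the stereotyped target
    correct: bool           # model's answer matched the gold label


def bias_among_answers(responses: Sequence[Response]) -> float:
    """2 * (biased answers / non-unknown answers) - 1, in [-1, 1]."""
    answered = [r for r in responses if not r.answered_unknown]
    if not answered:
        # Assumption: report 0.0 when the model never commits to an answer.
        return 0.0
    n_biased = sum(r.biased for r in answered)
    return 2 * n_biased / len(answered) - 1


def ambiguous_bias_score(responses: Sequence[Response]) -> float:
    """Scale the bias by the error rate, so a fully accurate model scores 0."""
    acc = sum(r.correct for r in responses) / len(responses)
    return (1 - acc) * bias_among_answers(responses)


# A model that picks "unknown" every time looks unbiased in ambiguous contexts.
always_unknown = [Response(True, False, True) for _ in range(4)]
abstain_score = ambiguous_bias_score(always_unknown)

# Three stereotyped answers out of four committed answers.
mostly_biased = [Response(False, True, False)] * 3 + [Response(False, False, True)]
biased_score = bias_among_answers(mostly_biased)
print(abstain_score, biased_score)
```

Under these assumptions, a score of 0 can mean "unbiased" or simply "the model did not answer", so the score is best read alongside accuracy and the rate of "unknown" answers.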

2. Third-party evaluation frameworks like BIG-bench and Stanford’s Holistic Evaluation of Language Models (HELM)

Third-party evaluation tools were error-prone, slow, and expensive (either in implementation, turnaround time, or coordination).

BIG-bench has 204 crowdsourced evaluations (with some vetting) ranging across topics from science to social reasoning. Unfortunately, it was even more challenging than BBQ to get going, it was too slow to run all of the evals in a reasonable time, it was buggy, and it was hard to know which of the 204 tasks were most relevant.

HELM is expert-driven and evaluates models across scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. Unlike BIG-bench, HELM can be run by its own team, given API access to the model. Unfortunately, it didn’t support Claude’s format well out of the box, and the turnaround/coordination costs were high.

3. Using crowdworkers to measure model helpfulness or harmfulness

Anthropic conducts A/B tests on crowdsourcing platforms, where people engage in open-ended dialogue with two models and select the response from Model A or B that is more helpful or harmless. For harmlessness, they encourage active red teaming and adversarial probing. They then rank models based on the resulting helpfulness and harmlessness scores.
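The ranking step can be sketched as a simple aggregation over pairwise judgments. This is a minimal illustration that uses plain win rates; the scoring method Anthropic actually uses is not described here, and the function names and model names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One judgment: (model shown as A, model shown as B, model the crowdworker preferred)
Comparison = Tuple[str, str, str]


def win_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Share of pairwise comparisons each model won."""
    wins: Dict[str, int] = defaultdict(int)
    games: Dict[str, int] = defaultdict(int)
    for model_a, model_b, winner in comparisons:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} was not part of the comparison")
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


def rank(comparisons: Iterable[Comparison]) -> List[str]:
    """Models ordered from highest to lowest win rate."""
    rates = win_rates(list(comparisons))
    return sorted(rates, key=rates.get, reverse=True)


comparisons = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
]
rates = win_rates(comparisons)
ranking = rank(comparisons)
print(rates, ranking)
```

Win rates are the simplest choice; a rating system that accounts for opponent strength would be a natural replacement when models are not all compared against each other equally often.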

Problems with this method include cost and time, exposing evaluators to harmful outputs, and variability in human evaluators' creativity or motivation. Helpfulness and harmlessness are also in natural tension.

4. Using domain experts to red team for national security-relevant threats

One important goal is determining whether and how AI models could create or exacerbate national security risks (piloted in the Frontier Threats Red Teaming for AI Safety project).

Challenges for such evaluations include the inability to standardize the evaluation process (i.e., it is more art than science), the inherent complexity of national security scenarios, the occasional need for security clearances, and legal risk.

5. Using generative AI to develop evaluations for generative AI

AI has been good at generating multiple-choice evaluations for itself. Still, humans have to evaluate the evaluations, which goes back to problems from human-sourced evaluation. Furthermore, if models are biased, their evaluations may also be.

Constitutional AI used model-based red teaming to train Claude to be more harmless and was quite successful. Nonetheless, more study is still required.

6. Working with a non-profit organization to audit Anthropic’s models for dangerous capabilities

Anthropic enlisted the Alignment Research Center (ARC), which assesses frontier AI models for dangerous capabilities (e.g., the ability of a model to accumulate resources, make copies of itself, become hard to shut down, etc.). While Anthropic initially expected the collaboration to be straightforward, it required significant science and engineering support from Anthropic. Furthermore, keeping ARC independent meant that any “tricks of the trade” from Anthropic couldn’t be shared with ARC, which made ARC less effective.

Anthropic’s Policy Recommendations

  • More funding for evaluation efforts
  • Legal safe harbor for companies to evaluate national security risks

Q&A Log

  1. What is AI Governance? (sourced from

    • Responsible Development: How can general-purpose AI developers make responsible development and deployment decisions?
    • Regulation: How can governments use their regulatory toolboxes to ensure that AI developers and users behave responsibly?
    • International Governance: What role can international coordination play in reducing risks from AI?
    • Compute Governance: How can public and private decisions about access to compute shape outcomes?