Benchmarks

GenAI benchmarking study headed by John Craske

Provider

CMS and LITIG

Dataset name

Litig AI Benchmark

Dataset size

TBD testcases

Benchmark type

Upcoming

Date published

February 19, 2025

Industry

Legal

The LinksAI English law benchmark

Provider

Linklaters

Dataset name

English law benchmark

Dataset size

50 testcases

Benchmark type

Open-ended task

Date published

February 10, 2025

Industry

Legal

Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models

Provider

Stanford RegLab

Dataset name

Legal Hallucinations

Dataset size

5,000 testcases

Benchmark type

Open-ended task

Date published

June 21, 2024

Industry

Legal

Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

Provider

Stanford

Dataset name

LLM Tool dataset

Dataset size

200+ testcases

Benchmark type

Open-ended RAG

Date published

June 6, 2024

Industry

Legal

Long-context retrieval benchmark on legal documents

Provider

Stanford Hazy Research

Dataset name

LoCo

Dataset size

7,730 testcases

Benchmark type

Open-ended task

Date published

February 12, 2024

Industry

Legal

A version of Harvey's internal dataset for evaluating large language models (LLMs) and model systems on complex legal tasks

Provider

Harvey

Dataset name

BigLaw Bench

Dataset size

50+ testcases

Benchmark type

Open-ended task

Date published

August 29, 2024

Industry

Legal

An evaluation benchmark assessing the comprehensive performance of LLMs in highly specialized legal domains under Chinese law

Provider

OpenCompass

Dataset name

LawBench

Dataset size

20 testcases

Benchmark type

Open-ended task

Date published

September 28, 2023

Industry

Legal

A collaboratively built large language model benchmark for legal reasoning

Provider

Stanford

Dataset name

LegalBench

Dataset size

162 testcases

Benchmark type

Open-ended task

Date published

August 20, 2023

Industry

Legal

The Overruling Dataset: A Benchmark for Detecting Legal Decisions that Have Been Overruled

Provider

Casetext

Dataset name

Overruling

Dataset size

2,400 testcases

Benchmark type

Binary classification

Date published

April 22, 2021

Industry

Legal

CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service

Provider

Università di Modena

Dataset name

CLAUDETTE

Dataset size

9,414 testcases

Benchmark type

Binary classification

Date published

May 3, 2018

Industry

Legal
