The Cambridge Law Corpus: A Corpus for Legal AI Research

The article introduces the Cambridge Law Corpus (CLC), a collection of over 250,000 UK court cases spanning the 16th to the 21st century. The CLC is designed for legal AI research, and its first release includes raw text and metadata. It also includes case outcome annotations for 638 cases, provided by legal experts.

The authors use the annotated cases to train and evaluate GPT-3, GPT-4 and RoBERTa models on case outcome extraction, providing benchmarks for future research. Because the material is sensitive, the article also includes a detailed legal and ethical discussion. The CLC will only be released for research purposes under specific restrictions.

Key takeaways

The Cambridge Law Corpus (CLC) is introduced as a corpus for legal AI research, consisting of over 250,000 UK court cases, some dating back to the 16th century.
The first release of the corpus includes raw text and metadata, as well as case outcome annotations for 638 cases, provided by legal experts.
GPT-3, GPT-4 and RoBERTa models have been trained and evaluated on case outcome extraction, providing benchmarks for future research.
Because the material is sensitive, the article includes an extensive legal and ethical discussion, and the corpus will only be released for research purposes under certain restrictions.
