Blog | TaxAgents

Large Language Models Ad Referendum - How Good Are They at Machine Translation in the Legal Domain

February 6, 2024 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Vicent Briva-Iglesias, Joao Lucas Cavalheiro Camargo, Gokhan Dogru

Abstract

This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy. The results indicate that while Google Translate generally outperforms LLMs in AEMs, human evaluators rate LLMs, especially GPT-4, comparably or slightly better in terms of producing contextually adequate and fluent translations. This discrepancy suggests LLMs' potential in handling specialized legal terminology and context, highlighting the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.

Links to paper

Link to arXiv: https://arxiv.org/abs/2402.07681
Link to pdf: https://arxiv.org/ftp/arxiv/papers/2402/2402.07681.pdf

InSaAF - Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain

February 6, 2024 · 2 min read

Sung Kim

TaxAgents Team Member

Author(s)

Yogesh Tripathi, Raghav Donakanti, Sahil Girhepuje, Ishan Kavathekar, Bhaskara Hanuma Vedula, Gokul S Krishnan, Shreya Goyal, Anmol Goel, Balaraman Ravindran, Ponnurangam Kumaraguru

Abstract

Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, β-weighted Legal Safety Score (LSSβ), which encapsulates both the fairness and accuracy aspects of the LLM. We assess LLMs' safety by considering their performance in the Binary Statutory Reasoning task and their fairness exhibition with respect to various axes of disparities in the Indian society. Task performance and fairness scores of LLaMA and LLaMA-2 models indicate that the proposed LSSβ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA-2 models increase the LSSβ, improving their usability in the Indian legal domain. Our code is publicly released.
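
The abstract does not spell out how LSSβ is computed. As a rough illustration only, one common way to fold two scores into a single β-weighted number is an F-beta-style weighted harmonic mean. The sketch below assumes that form; the function name `beta_weighted_score`, the argument names, and the choice of which score β favours are hypothetical and are not taken from the paper.

```python
def beta_weighted_score(fairness: float, accuracy: float, beta: float = 1.0) -> float:
    """F-beta-style weighted harmonic mean of two scores in [0, 1].

    ASSUMPTION: this mirrors the usual F-beta formula and is NOT the
    paper's confirmed LSS definition. With this convention, beta > 1
    pulls the result toward `accuracy`, and beta < 1 toward `fairness`.
    """
    if not (0.0 <= fairness <= 1.0 and 0.0 <= accuracy <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")

    b2 = beta * beta
    denominator = b2 * fairness + accuracy
    if denominator == 0.0:
        # Both scores are zero, so the combined score is zero as well.
        return 0.0
    return (1.0 + b2) * fairness * accuracy / denominator


if __name__ == "__main__":
    # A model that is perfectly fair but only 50% accurate.
    print(round(beta_weighted_score(1.0, 0.5, beta=1.0), 4))
    print(round(beta_weighted_score(1.0, 0.5, beta=2.0), 4))
```

A harmonic mean is a natural fit for a safety-style score because a model cannot compensate for a very low value on one axis with a high value on the other: if either input is zero, the combined score is zero.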

Links to paper

Link to arXiv: https://arxiv.org/abs/2402.10567
Link to pdf: https://arxiv.org/pdf/2402.10567.pdf

(A)I Am Not a Lawyer, But... Engaging Legal Experts towards Responsible LLM Policies for Legal Advice

February 2, 2024 · 2 min read

Sung Kim

TaxAgents Team Member

Author(s)

Inyoung Cheong, King Xia, K.J. Kevin Feng, Quan Ze Chen, Amy X. Zhang

Abstract

The rapid proliferation of large language models (LLMs) as general purpose chatbots available to the public raises hopes around expanding access to professional guidance in law, medicine, and finance, while triggering concerns about public reliance on LLMs for high-stakes circumstances. Prior research has speculated on high-level ethical considerations but lacks concrete criteria determining when and why LLM chatbots should or should not provide professional assistance. Through examining the legal domain, we contribute a structured expert analysis to uncover nuanced policy considerations around using LLMs for professional advice, using methods inspired by case-based reasoning. We convened workshops with 20 legal experts and elicited dimensions on appropriate AI assistance for sample user queries ("cases"). We categorized our expert dimensions into: (1) user attributes, (2) query characteristics, (3) AI capabilities, and (4) impacts. Beyond known issues like hallucinations, experts revealed novel legal problems, including that users' conversations with LLMs are not protected by attorney-client confidentiality or bound to professional ethics that guard against conflicted counsel or poor quality advice. This accountability deficit led participants to advocate for AI systems to help users polish their legal questions and relevant facts, rather than recommend specific actions. More generally, we highlight the potential of case-based expert deliberation as a method of responsibly translating professional integrity and domain knowledge into design requirements to inform appropriate AI behavior when generating advice in professional domains.

Links to paper

Link to arXiv: https://arxiv.org/abs/2402.01864
Link to pdf: https://arxiv.org/pdf/2402.01864.pdf

Better Call GPT, Comparing Large Language Models Against Lawyers

January 24, 2024 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, Rivindu Perera (Onit AI Centre of Excellence)

Abstract

This paper presents a groundbreaking comparison between Large Language Models and traditional legal contract reviewers, Junior Lawyers and Legal Process Outsourcers. We dissect whether LLMs can outperform humans in accuracy, speed, and cost efficiency during contract review. Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, uncovering that advanced models match or exceed human accuracy in determining legal issues. In speed, LLMs complete reviews in mere seconds, eclipsing the hours required by their human counterparts. Cost wise, LLMs operate at a fraction of the price, offering a staggering 99.97 percent reduction in cost over traditional methods. These results are not just statistics, they signal a seismic shift in legal practice. LLMs stand poised to disrupt the legal industry, enhancing accessibility and efficiency of legal services. Our research asserts that the era of LLM dominance in legal contract review is upon us, challenging the status quo and calling for a reimagined future of legal workflows.

Links to paper

Link to arXiv: https://arxiv.org/abs/2401.16212
Link to pdf: https://arxiv.org/pdf/2401.16212.pdf

INACIA - Integrating Large Language Models in Brazilian Audit Courts - Opportunities and Challenges

January 10, 2024 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Jayr Pereira, Andre Assumpcao, Julio Trecenti, Luiz Airosa, Caio Lente, Jhonatan Cléto, Guilherme Dobins, Rodrigo Nogueira, Luis Mitchell, Roberto Lotufo

Abstract

This paper introduces INACIA (Instrução Assistida com Inteligência Artificial), a groundbreaking system designed to integrate Large Language Models (LLMs) into the operational framework of Brazilian Federal Court of Accounts (TCU). The system automates various stages of case analysis, including basic information extraction, admissibility examination, Periculum in mora and Fumus boni iuris analyses, and recommendations generation. Through a series of experiments, we demonstrate INACIA's potential in extracting relevant information from case documents, evaluating its legal plausibility, and formulating propositions for judicial decision-making. Utilizing a validation dataset alongside LLMs, our evaluation methodology presents a novel approach to assessing system performance, correlating highly with human judgment. These results underscore INACIA's potential in complex legal task handling while also acknowledging the current limitations. This study discusses possible improvements and the broader implications of applying AI in legal contexts, suggesting that INACIA represents a significant step towards integrating AI in legal systems globally, albeit with cautious optimism grounded in the empirical findings.

Links to paper

Link to arXiv: https://arxiv.org/abs/2401.05273
Link to pdf: https://arxiv.org/pdf/2401.05273.pdf

Large Legal Fictions - Profiling Legal Hallucinations in Large Language Models

January 2, 2024 · 2 min read

Sung Kim

TaxAgents Team Member

Author(s)

Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho

Abstract

Large language models (LLMs) have the potential to transform the practice of law, but this potential is threatened by the presence of legal hallucinations -- responses from these models that are not consistent with legal facts. We investigate the extent of these hallucinations using an original suite of legal queries, comparing LLMs' responses to structured legal metadata and examining their consistency. Our work makes four key contributions: (1) We develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. (2) We find that legal hallucinations are alarmingly prevalent, occurring between 69% of the time with ChatGPT 3.5 and 88% with Llama 2, when these models are asked specific, verifiable questions about random federal court cases. (3) We illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. (4) We provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, these findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.

Links to paper

Link to arXiv: https://arxiv.org/abs/2401.01301
Link to pdf: https://arxiv.org/pdf/2401.01301.pdf

Weaving Pathways for Justice with GPT - LLM-driven automated drafting of interactive legal applications

December 14, 2023 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Quinten Steenhuis, David Colarusso, Bryce Willey

Abstract

Can generative AI help us speed up the authoring of tools to help self-represented litigants? In this paper, we describe 3 approaches to automating the completion of court forms: a generative AI approach that uses GPT-3 to iteratively prompt the user to answer questions, a constrained template-driven approach that uses GPT-4-turbo to generate a draft of questions that are subject to human review, and a hybrid method. We use the open source Docassemble platform in all 3 experiments, together with a tool created at Suffolk University Law School called the Assembly Line Weaver. We conclude that the hybrid model of constrained automated drafting with human review is best suited to the task of authoring guided interviews.

Links to paper

Link to arXiv: https://arxiv.org/abs/2312.09198
Link to pdf: https://arxiv.org/pdf/2312.09198.pdf

Classifying complex documents - comparing bespoke solutions to large language models

December 12, 2023 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Glen Hopkins, Kristjan Kalm

Abstract

Here we search for the best automated classification approach for a set of complex legal documents. Our classification task is not trivial: our aim is to classify approximately 30,000 public courthouse records from 12 states and 267 counties at two different levels using nine sub-categories. Specifically, we investigated whether a fine-tuned large language model (LLM) can achieve the accuracy of a bespoke custom-trained model, and how much fine-tuning is necessary.

Links to paper

Link to arXiv: https://arxiv.org/abs/2312.07182
Link to pdf: https://arxiv.org/pdf/2312.07182.pdf

Boosting legal case retrieval by query content selection with large language models

December 6, 2023 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Youchao Zhou, Heyan Huang, Zhijing Wu

Abstract

Legal case retrieval, which aims to retrieve relevant cases to a given query case, benefits judgment justice and attracts increasing attention. Unlike generic retrieval queries, legal case queries are typically long and the definition of relevance is closely related to legal-specific elements. Therefore, legal case queries may suffer from noise and sparsity of salient content, which hinders retrieval models from perceiving correct information in a query. While previous studies have paid attention to improving retrieval models and understanding relevance judgments, we focus on enhancing legal case retrieval by utilizing the salient content in legal case queries. We first annotate the salient content in queries manually and investigate how sparse and dense retrieval models attend to those content. Then we experiment with various query content selection methods utilizing large language models (LLMs) to extract or summarize salient content and incorporate it into the retrieval models. Experimental results show that reformulating long queries using LLMs improves the performance of both sparse and dense models in legal case retrieval.

Links to paper

Link to arXiv: https://arxiv.org/abs/2312.03494
Link to pdf: https://arxiv.org/pdf/2312.03494.pdf

Questioning Biases in Case Judgment Summaries - Legal Datasets or Large Language Models?

December 1, 2023 · One min read

Sung Kim

TaxAgents Team Member

Author(s)

Aniket Deroy, Subhankar Maity

Abstract

The evolution of legal datasets and the advent of large language models (LLMs) have significantly transformed the legal field, particularly in the generation of case judgment summaries. However, a critical concern arises regarding the potential biases embedded within these summaries. This study scrutinizes the biases present in case judgment summaries produced by legal datasets and large language models. The research aims to analyze the impact of biases on legal decision making. By interrogating the accuracy, fairness, and implications of biases in these summaries, this study contributes to a better understanding of the role of technology in legal contexts and the implications for justice systems worldwide. In this study, we investigate biases with respect to gender-related keywords, race-related keywords, keywords related to crime against women, country names and religious keywords. The study shows interesting evidence of biases in the outputs generated by the large language models and pre-trained abstractive summarization models. The reasoning behind these biases needs further study.

Links to paper

Link to arXiv: https://arxiv.org/abs/2312.00554
Link to pdf: https://arxiv.org/pdf/2312.00554.pdf
