Law Quarterly

Classifying Legal Approaches: A Machine Learning Project

By Gordon Yang


There is no denying the significant impact technology has had on society over the past two decades. Everything moves at an ever faster pace, and the legal field, known for being bound by norms, tradition, and precedent, has struggled to keep up. Today a new paradigm is emerging: the availability of massive volumes of data and computational power is expanding the possibilities of legal “text mining,” machine learning, and AI. In this article we apply some of these methods to classify the legal approaches taken in the decisions of a set of court cases, evaluate the results, and discuss the outlook for future work.

Precedent is the defining feature of the common law system, attaining special legal significance in virtue of its practical, and not merely theoretical, authority over the content of the law. To measure the practical effects of precedent, we first have to categorize the different types of approaches and interpretations that judges take. The set of cases we analyze consists of all trial-level district court cases since 2014 that have cited Ashcroft v. Iqbal (2009). Past studies compared decisions in the periods before and after Iqbal and found that district courts were not applying the standards as handed down by higher courts, and that motions to dismiss had increased after Iqbal. Where those studies manually counted and labeled the cases falling under each category, our project aims to automate the process with machine learning.

For our classification goal, we first extract the phrases and sentences that contain the court’s rationale for dismissing the case, in particular those that cite the precedent. To build the classifier, we extract features (words) from the relevant segments, convert them into a document-term matrix (word-frequency counts), assign labels to the sentences, and fit the model to associate labels with features. The model we used is a support vector machine (SVM), which can be visualized mathematically as drawing boundaries between the different classes.
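The steps above can be sketched with scikit-learn; the article does not show its actual code, so the sentences, labels, and model choice below are illustrative assumptions, not the project’s implementation.

```python
# A minimal sketch of the pipeline described above, using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the extracted rationale sentences and their labels.
sentences = [
    "legal conclusions must be supported by factual allegations",
    "the court draws the reasonable inference that the defendant is liable",
]
labels = [0, 1]  # 0 = checklist, 1 = common sense

# CountVectorizer builds the document-term matrix (word-frequency counts);
# LinearSVC fits linear boundaries between the classes.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(sentences))
```

With only two training sentences this is a toy, but the same pipeline scales to the full labeled corpus.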

Our premise is that if we know the types of arguments we are looking for, we can identify the features those arguments typically incorporate. Here lies something of a paradox. By making theoretical legal distinctions, we can easily typify arguments by their semantic meaning. The problem, however, is that the various syntactic forms they take are more elusive, and we need some way to assign labels to the types of arguments even before building our classifier. The “checklist” and “common sense” tests formulated by Colleen are a good starting point.

The two statements representative of the checklist approach are:

    1. “While legal conclusions can provide the framework of a complaint, they must be supported by factual allegations.”
    2. “Rule 8 demands ‘more than an unadorned, the-defendant-unlawfully-harmed-me accusation.’”

The two statements representative of the common-sense approach are:

    1. “A claim has facial plausibility when the plaintiff pleads factual content that allows the court to draw the reasonable inference that the defendant is liable for the misconduct alleged”
    2. “Determining whether a complaint states a plausible claim for relief will be a context-specific task that requires the reviewing court to draw on its judicial experience and common sense.”

(Colleen, 423-424)

As the names of the two approaches suggest, the difference between them lies in how the judge determines whether the case should withstand dismissal. In most cases, the checklist and the common-sense approach should lead to the same result, as a complaint generally must contain more than conclusory allegations to trigger the judge’s common-sense determination that the defendant might indeed be liable for the alleged illegal behavior (Colleen, 418). For more on the legal theory behind the approaches, read this article.

Results and discussion: searching sentences for key terms yielded 1163 instances of type 0 (checklist), 362 instances of type 1 (common sense), and 84 instances of type 2 (ambiguous). We split the data into training (80%) and test (20%) sets and applied the trained classifier to the unseen test data, attaining an accuracy of around 98% with a standard deviation of 1.6%.
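The evaluation step can be sketched as follows. The toy corpus and the use of cross-validation to obtain a mean and standard deviation are assumptions for illustration; the project’s actual data and evaluation code are not reproduced here.

```python
# Illustrative 80/20 split plus cross-validation on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

checklist = ["legal conclusions must be supported by factual allegations"] * 10
common = ["the court draws a reasonable inference using common sense"] * 10
sentences = checklist + common
labels = [0] * 10 + [1] * 10

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.2, random_state=0, stratify=labels)

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Cross-validation yields a mean accuracy and its spread, which is how a
# figure such as "98% with a standard deviation of 1.6%" is reported.
scores = cross_val_score(clf, sentences, labels, cv=5)
print("cv mean:", scores.mean(), "cv std:", scores.std())
```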

One thing we notice right away is that the proportion of checklist statements is much higher than the 33% found in Colleen’s paper, on which we based our legal framework. This could signify a defect in our search technique or simply a change in trend over the years. Normally we would manually label hundreds of cases; however, since the framework had already been provided by earlier work using human judgment, we relied on an “unsupervised” method (searching for sentences containing key terms from the representative statements) to generate labels for a supervised task. Since there is no given ground truth, there is some uncertainty in what the accuracy figure says about the data.
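The key-term labeling heuristic might look like the sketch below. The exact term lists are not given in the article, so the phrases here are assumptions drawn from the representative statements, and `label_sentence` is a hypothetical name.

```python
# Sketch of the key-term labeling heuristic (phrases are assumptions
# taken from the two representative statements).
def label_sentence(sentence):
    s = sentence.lower()
    checklist = "factual allegations" in s
    common_sense = "reasonable inference" in s
    if checklist and common_sense:
        return 2  # ambiguous: both key phrases present
    if checklist:
        return 0  # checklist
    if common_sense:
        return 1  # common sense
    return None  # no key phrase: sentence left unlabeled

print(label_sentence("the factual allegations support a reasonable inference"))
# prints 2
```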

We created a separate class, Type 2, after noticing that the classifier got sentences containing both “reasonable inference” and “factual allegations” wrong. Most of these turned out to be false negatives for Type 2. To get a better sense of the model and the errors it makes, so that it is less of a “black box,” we share some classifications we found interesting.

Here is a sentence to which the classifier gave an 87% probability of being Type 2, but which was marked wrong because of how we defined Type 2 (the word “factual” is missing before “allegations”):

“The Court must take all allegations in the complaint as true and draw all reasonable inferences in the plaintiff’s favor”

By simple human reasoning, the obvious answer should be Type 1. This example shows the limitations of a simple “bag of words” model, which ignores word order, and the flaws of how we obtained the labels in the first place: by simple key-term matching.
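The word-order limitation is easy to demonstrate: two sentences made of the same words in different orders receive identical bag-of-words vectors (the sentences below are invented for illustration).

```python
# Same words, different order: the document-term matrix cannot tell
# these two sentences apart.
from sklearn.feature_extraction.text import CountVectorizer

a = "factual allegations support the reasonable inference"
b = "the reasonable inference support factual allegations"
X = CountVectorizer().fit_transform([a, b])
assert (X[0].toarray() == X[1].toarray()).all()  # identical rows
```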

How should we classify a sentence like this?

“Therefore, in spite of the requirement that the Court draw all reasonable inferences in the plaintiff’s favor, some factual allegations will render certain inferences unreasonable”

Even for a human, this task is challenging without making certain assumptions, e.g., that checking the factual allegations required to draw reasonable inferences still counts as a “common-sense” approach. The machine gave a 71% probability of Type 1, but this is based mainly on the mere presence of “reasonable inference,” disregarding the relative complexity of a sentence involving negation and reasoning. The point is that, to evaluate the model, we first have to form standards of right and wrong, and not just look at the accuracy level.
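A note on where such probability figures come from: an SVM outputs decision scores, not probabilities, so percentages like those above are typically obtained by Platt scaling. This is an assumption about the project’s setup; the sketch below uses scikit-learn’s `SVC(probability=True)` on toy data.

```python
# Obtaining class probabilities from an SVM via Platt scaling
# (an assumption about how the article's figures were produced).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = ["supported by factual allegations"] * 5 + \
            ["draw all reasonable inferences in the plaintiff's favor"] * 5
labels = [0] * 5 + [1] * 5

clf = make_pipeline(CountVectorizer(), SVC(kernel="linear", probability=True))
clf.fit(sentences, labels)
proba = clf.predict_proba(["the court draws a reasonable inference"])
print(proba)  # one row of per-class probabilities summing to 1
```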

Overall, the model performed well for what it really measures: the degree of overlap between a key phrase and the statistical distribution of words in a sentence. Future work could develop “unsupervised” methods for generating true labels. Our hope is that new “natural language inference” techniques based on deep neural networks can be applied to legal reasoning. Analogy by way of precedent is only one of many modes of reasoning, but the legal authority precedents carry, and the relationships they allow machines to draw, are more than promising. These are all steps toward the ultimate goal of reconstructing arguments in any context, which would significantly lower the litigation costs the courts are so concerned with. Statistics can also help us evaluate which cases are most likely to be ruled in which party’s favor and which arguments would be most effective, giving us a more complete picture of the judicial system as a whole. This could be the beginning of a revolution in the legal industry, making courts more accessible and bringing justice to cases that would not otherwise be tried.


