How to Classify Documents in C#

In this article, we discuss the advantages and challenges of implementing AI document classification in C# pipelines and look at some solutions.

Brian O'Neill

Jun. 23, 26 · Analysis

A functional automated document processing pipeline typically needs to know what type of document it’s dealing with before it can do anything useful with it. The extraction logic for an invoice, for example, is different from the extraction logic for a tax form, and the routing rules for a contract are different from those for an ID document. Classification is what makes downstream automation possible when there are multiple distinct input types.

Building reliable classification logic, however, is no simple task. It’s easy to create something brittle, and much harder to create something dynamic and flexible that works reliably in the majority of cases. In this article, we’ll look at why classification breaks down at scale, and we’ll examine what it actually takes to build and maintain a reliable solution in C#. Towards the end, we’ll walk through a dedicated API that handles classification across a wide range of document formats using AI without requiring a specially trained model for each document type.

Why Classification Is Harder Than It Looks

The intuitive starting point for implementing classification functionality is a rules-based approach. We might define keywords or structural patterns for each document type, and we might assign a category to those documents based purely on which rules match. This is a traditional approach, and it works reasonably well in controlled environments with a small, predictable set of document types to contend with. As you might’ve guessed, though, the problems become immediately apparent when the document population gets more diverse.

The better way to look at this is that document classification is a fundamentally semantic task, not a syntactic one. While we might expect most invoices to fit a certain mold, for example, it’s entirely possible that two invoices share almost no overlapping vocabulary if they come from different vendors or industries. The presence of the word “invoice” in a document is neither necessary nor sufficient for accurate classification; plenty of legitimate invoices don’t use it, and plenty of non-invoice documents do. Rules built around surface features are ultimately brittle: they can’t stand up to the variation that’s normal in any real-world intake workflow.

This problem is compounded by variability in document layouts. A template-matching approach assumes a document of a certain type follows predictable visual structures, and while that may hold for documents generated in a controlled environment (e.g., internal documents), it rarely holds true for externally received documents. For example, a claims form from one carrier might look nothing like a claims form from another, and neither may resemble the training examples the claims classifier was built on.

Building brittle rules creates a looming maintenance problem, and while that might hold some appeal for job security, it’s best to avoid it altogether. Every new document type, every vendor that updates their template, and every edge case that slips through the cracks requires a manual update to the heuristic ruleset. In low-volume environments, that might be acceptable, but in high-volume environments it becomes an ongoing cost rather than a one-time setup.
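To make the brittleness concrete, here is a minimal sketch of a keyword-driven classifier of the kind described above. The class name, the rule list, and the keywords are all hypothetical and chosen purely for illustration; a real ruleset would be larger, but it would fail in the same way.

```csharp
using System;

// Hypothetical rules-based classifier, shown only to illustrate the failure mode.
public static class KeywordClassifier
{
    // Each rule pairs a category with one keyword that must appear in the text.
    // The categories and keywords are illustrative assumptions.
    private static readonly (string Category, string Keyword)[] Rules =
    {
        ("Invoice", "invoice"),
        ("Tax Form", "tax return"),
        ("Contract", "agreement")
    };

    // Returns the first category whose keyword appears in the text, or "Unknown".
    public static string Classify(string documentText)
    {
        foreach (var (category, keyword) in Rules)
        {
            if (documentText.Contains(keyword, StringComparison.OrdinalIgnoreCase))
            {
                return category;
            }
        }

        return "Unknown";
    }
}
```

With these rules, a genuine invoice that reads "Statement of charges. Amount due: $1,200. Remit payment within 30 days." comes back as "Unknown", while an email saying "Per our agreement, please do not send an invoice until delivery." is labeled "Invoice". Patching either case means adding another rule, which is exactly the maintenance loop described above.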

The Challenge of Building This In-House

If we’re moving past a rules-based classification approach, the obvious next candidate is machine learning. This solves many of the above problems when implemented correctly, but it also introduces a fair share of new ones.

Training a reliable classification model requires labeled examples; there needs to be enough variety within each document type for the model to generalize. For common document types, that data may be available, but for specialized or proprietary document types, it often isn’t. Model performance also tends to degrade over time as document populations change. If we get new vendors, new form versions, new regulatory requirements, or anything else in that vein, we might see an initially effective model erode over time, and detecting that degradation requires monitoring infrastructure that most teams don’t have in place.
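As a rough idea of what the simplest form of that monitoring might look like, the sketch below tracks a rolling average of the model's confidence scores and flags when it falls below an expected baseline. Everything here is an assumption made for illustration: the class name, the window size, and the use of confidence as a proxy for accuracy. In practice, confidence alone can hide drift, so periodic spot checks against labeled samples would still be needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rolling-window monitor; not part of any SDK.
public sealed class ConfidenceDriftMonitor
{
    private readonly Queue<double> _window = new Queue<double>();
    private readonly int _windowSize;
    private readonly double _baseline;
    private readonly double _tolerance;

    public ConfidenceDriftMonitor(int windowSize, double baseline, double tolerance)
    {
        if (windowSize <= 0) throw new ArgumentOutOfRangeException(nameof(windowSize));
        _windowSize = windowSize;
        _baseline = baseline;
        _tolerance = tolerance;
    }

    // Records one confidence score. Returns true only when the window is full
    // and its mean has dropped below (baseline - tolerance).
    public bool Record(double confidence)
    {
        _window.Enqueue(confidence);
        if (_window.Count > _windowSize)
        {
            _window.Dequeue();
        }

        return _window.Count == _windowSize
            && _window.Average() < _baseline - _tolerance;
    }
}
```

A caller would feed in each prediction's score and raise an alert the first time Record returns true. Even this toy version needs storage, thresholds, and someone to respond to the alert, which hints at why the real thing is a nontrivial investment.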

And none of this addresses the fact that the model itself most likely won’t be implemented in C#. If the classification model lives in a Python service called from a C# application, we end up maintaining a cross-language service dependency and handling things like serialization. We also get stuck managing the availability and latency of one additional internal component. While none of this is insurmountable, it adds up to a meaningful engineering investment before a single document has been classified in production. If our core product isn’t document intelligence, that tradeoff probably doesn’t make sense.

Document Classification With a Web API

A dedicated AI classification API offers a practical alternative for teams that aren’t in a position to build/maintain a pipeline in their own environment. It’s easy to implement with a few lines of code.

We’ll start by installing the SDK via NuGet:

    Shell

    dotnet add package Cloudmersive.APIClient.NETCore.DocumentAI --version 1.0.0

Then we’ll import the required namespaces:

    C#

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using Cloudmersive.APIClient.NETCore.DocumentAI.Api;
    using Cloudmersive.APIClient.NETCore.DocumentAI.Client;
    using Cloudmersive.APIClient.NETCore.DocumentAI.Model;

The “advanced classification” endpoint accepts a request body containing the document and optional configuration:

    C#

    namespace Example
    {
        public class ExtractClassificationAdvancedExample
        {
            public static void Main()
            {
                // Register the API key with the SDK's default configuration
                Configuration.Default.AddApiKey("Apikey", "YOUR_API_KEY");

                var apiInstance = new ExtractApi();
                var recognitionMode = "Advanced";

                // Request body: the document plus optional configuration
                var body = new AdvancedExtractClassificationRequest();

                try
                {
                    DocumentAdvancedClassificationResult result = apiInstance.ExtractClassificationAdvanced(recognitionMode, body);
                    Debug.WriteLine(result);
                }
                catch (Exception e)
                {
                    Debug.Print("Exception when calling ExtractApi.ExtractClassificationAdvanced: " + e.Message);
                }
            }
        }
    }

It’s worth understanding a few things about this request. Categories is an optional array that lets us define our own classification targets; each category takes a CategoryName and a CategoryDescription, and the API then evaluates the document against our defined list rather than classifying freely (which it can do, just not as effectively). This is the major value-add; a healthcare organization’s classification needs look nothing like a financial institution’s, for instance, and custom category definitions let us tune the classifier to our specific document “population” without doing any actual training. A carefully written CategoryDescription describes each document type precisely, which makes results more reliable, particularly when distinguishing between similar document types.

Here’s a configured example:

    C#

    var body = new AdvancedExtractClassificationRequest
    {
        InputFile = "YOUR_BASE64_ENCODED_FILE",
        Categories = new List<DocumentClassificationCategory>
        {
            new DocumentClassificationCategory
            {
                CategoryName = "Invoice",
                CategoryDescription = "A document requesting payment for goods or services provided by a vendor"
            },
            new DocumentClassificationCategory
            {
                CategoryName = "Tax Form",
                CategoryDescription = "A government-issued form used for reporting income, deductions, or tax obligations"
            },
            new DocumentClassificationCategory
            {
                CategoryName = "Contract",
                CategoryDescription = "A legally binding agreement between two or more parties outlining terms and obligations"
            }
        },
        ResultCrossCheck = "Advanced",
        MaximumPagesProcessed = 5
    };
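The InputFile placeholder above suggests a Base64-encoded string. Assuming that is what the request model expects (worth confirming against the SDK's model definition, since this helper is not part of the SDK), a small helper along these lines could produce that value from a file on disk. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper; assumes the request expects the file as a Base64 string,
// as the placeholder in the configured example suggests.
public static class DocumentEncoding
{
    // Reads the whole file into memory and returns its Base64 representation.
    public static string ToBase64(string filePath)
    {
        byte[] bytes = File.ReadAllBytes(filePath);
        return Convert.ToBase64String(bytes);
    }
}
```

The result of DocumentEncoding.ToBase64 with a path to the document would then take the place of "YOUR_BASE64_ENCODED_FILE". Because the whole file is read into memory at once, very large files may call for a streaming approach instead.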
  

And here’s what the response object looks like:

    JSON

    {
      "Successful": true,
      "DocumentCategoryResult": "Invoice",
      "ConfidenceScore": 0.97
    }

DocumentCategoryResult contains the plain-language classification the API assigned, and the ConfidenceScore tells us whether we should take a second look (nobody should blindly trust AI; it makes mistakes and tends to know when that’s likely).
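One way to act on that score is to route low-confidence results to a person. The sketch below parses the JSON response shape shown above with System.Text.Json and applies a cut-off. The ClassificationResponse class is a local stand-in rather than the SDK's model type, and the 0.85 threshold and the route names are assumptions that would need tuning against real documents.

```csharp
using System;
using System.Text.Json;

// Hypothetical local mirror of the response shape shown above; not the SDK's model type.
public sealed class ClassificationResponse
{
    public bool Successful { get; set; }
    public string DocumentCategoryResult { get; set; } = "";
    public double ConfidenceScore { get; set; }
}

public static class ClassificationRouter
{
    // Assumed cut-off; the right value depends on the document population.
    public const double ReviewThreshold = 0.85;

    // Returns "auto:<category>" for confident results and "manual-review" otherwise.
    public static string Route(string responseJson)
    {
        var response = JsonSerializer.Deserialize<ClassificationResponse>(responseJson);

        if (response == null
            || !response.Successful
            || string.IsNullOrEmpty(response.DocumentCategoryResult))
        {
            return "manual-review";
        }

        return response.ConfidenceScore >= ReviewThreshold
            ? "auto:" + response.DocumentCategoryResult
            : "manual-review";
    }
}
```

Passing the sample response above to ClassificationRouter.Route yields "auto:Invoice", while the same payload with a score of 0.42 falls through to "manual-review". Keeping the threshold in one named constant makes it easy to adjust as review results accumulate.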

Conclusion

In this article, we looked at why rule-based document classification breaks down at scale, and we examined the real engineering costs of building and maintaining a classification model in-house. We then walked through a dedicated API that handles classification in a single configurable API call, which may be attractive to C# developers looking to implement AI document classification as a tool instead of signing up for a months-long development project.

Opinions expressed by DZone contributors are their own.