How to Extract Document Summaries in C#/.NET
In this article, we explore the benefits of implementing document summarization capabilities in a C# document processing pipeline, and we suggest several solutions.
Document Summarization: A Significant Value-Add in the AI Boom
C# developers are often saddled with building document-heavy processing workflows across enterprise systems. When we think about what “document processing” means, we probably focus on things like extraction, validation, conversion, storage, etc., but in many modern systems, quickly understanding what a document contains is often just as important as parsing data from it.
Whether we’re processing PDFs from an upload workflow, or tapping into an email server to handle inbound email attachments, or even dealing with hand-scanned images from mobile devices, producing short, readable document summaries can drastically improve how downstream systems and users interact with those documents. A concise summary paragraph can help with anything from document approval & review to search indexing or triage (among many other examples).
In this article, we’ll take a closer look at what it means to generate document summaries programmatically in C#, and we’ll explore some of the challenges developers can expect when building document summarization logic into their applications. Towards the end, we’ll examine a few open-source approaches to document summarization and compare them with using a dedicated API to generate clean, one-paragraph summaries from a wide range of input file formats.
Common Scenarios for Document Summarization in Enterprise Projects
Like most document-processing tasks, document summarization is valuable when systems need to quickly interpret and act on incoming files. It’s best to walk through a few examples to see where this adds practical value (we’ll expand on the examples mentioned above).
We’ll start with document approval and review workflows. In approval-heavy systems, document reviewers are usually tasked with quickly evaluating complex documents. When those documents are long (e.g., 30-page PDFs), the review process often becomes a frustrating grind. Providing the reviewer with a concise and accurate summary of the document can significantly speed up the review cycle and reduce the friction inherent in the approval process.
In search and indexing workflows, document libraries and knowledge bases have always benefited from additional metadata. Document summaries (especially keyword-rich summaries) are in many respects a form of metadata, and indexing them alongside the files can make search results noticeably more relevant and faster to scan. In large SharePoint or internal document repositories, these summaries can also add quick context for users before they open and review files that were shared with them.
For our final example, we’ll look at document intake and triage workflows. A lot of enterprise systems accept uploaded documents (e.g., invoices, contracts, claims forms, etc.), and generating one-paragraph summaries at the point of upload allows reviewers to quickly understand what files contain without opening them manually. That means documents can be handed off more quickly to their next destination (e.g., insurance claim summaries can indicate if they’re related to auto or home insurance).
What Does It Mean to Summarize Documents Programmatically in C#?
Generating document summaries isn’t just a straightforward text operation. In practice, it involves several layers of complexity, especially when dealing with multiple file formats and inconsistent input quality (the latter is a major obstacle).
We first need to consider file format variability. In enterprise workflows, it’s exceedingly rare to deal with a single document type. Complex file containers like PDF, Word, Excel, PowerPoint, etc., all store text differently, and flat image files (i.e., photos of documents) don’t store text at all. The latter category has traditionally required OCR (optical character recognition) before any meaningful summarization can occur. Handling this variability requires some fairly robust preprocessing before any summarization logic can be implemented.
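As a rough illustration of the first preprocessing step, the sketch below inspects a file’s leading bytes to decide which extraction path it should take. The `FormatSniffer` and `DetectedFormat` names are hypothetical, and the check is deliberately simple: it only looks at offset 0, and it reports any ZIP archive as a possible OOXML container (a real pipeline would open the archive to confirm).

```csharp
using System;

// Hypothetical categories for routing a file to the right extractor.
public enum DetectedFormat { Pdf, ZipContainer, Png, Jpeg, Unknown }

public static class FormatSniffer
{
    // Well-known file signatures ("magic bytes").
    private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46, 0x2D };                   // "%PDF-"
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };                         // "PK.." (DOCX, XLSX, PPTX are ZIP archives)
    private static readonly byte[] PngMagic = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
    private static readonly byte[] JpegMagic = { 0xFF, 0xD8, 0xFF };

    // Classifies a file from its first few bytes (8 or more is enough here).
    public static DetectedFormat Detect(byte[] header)
    {
        if (StartsWith(header, PdfMagic)) return DetectedFormat.Pdf;
        if (StartsWith(header, ZipMagic)) return DetectedFormat.ZipContainer;
        if (StartsWith(header, PngMagic)) return DetectedFormat.Png;
        if (StartsWith(header, JpegMagic)) return DetectedFormat.Jpeg;
        return DetectedFormat.Unknown;
    }

    private static bool StartsWith(byte[] data, byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Checking content rather than trusting the file extension matters most for upload and email workflows, where extensions are frequently missing or wrong. Image results (`Png`, `Jpeg`) would be routed to OCR, while the others go to a text extractor.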
Next, we must consider content structures within documents. Documents often contain tables, headers, bullet points, embedded images, and other unique formatting alongside standard paragraphs of text. To effectively summarize a document, our solution needs to interpret the overall meaning of a document rather than simply extracting raw text, and that means identifying key sections, ignoring irrelevant content (something expert human reviewers are usually excellent at), and ultimately combining that information coherently.
Perhaps the most difficult element of document summarization is dealing with low-quality inputs (particularly document images). We can expect documents to arrive as photos or scans rather than perfect, clean digital files, and these documents often suffer from the pitfalls of any homemade photography. Document photos are often poorly lit, skewed, or suffering from low resolution, making the text within the document difficult to read. If a summarization pipeline doesn’t account for these inconsistencies, it can’t be expected to produce reliable results.
Of course, we can’t list all the above challenges without also mentioning the usual length & scale problems associated with any document processing workflow. Long documents require more processing and can introduce performance concerns in high-volume systems. Generating summaries consistently across hundreds or thousands of documents requires some careful planning regarding compute usage and throughput.
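One common way to keep throughput predictable is to cap how many documents are summarized at once. The sketch below is a minimal example of that idea using `SemaphoreSlim`; the `BoundedSummarizer` class and the `summarizeAsync` delegate are hypothetical stand-ins for whatever summarization call a given pipeline actually makes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedSummarizer
{
    // Summarizes every document, running at most maxConcurrency calls at a time.
    // Results come back in the same order as the input paths.
    public static async Task<IReadOnlyList<string>> SummarizeAllAsync(
        IEnumerable<string> documentPaths,
        Func<string, Task<string>> summarizeAsync, // hypothetical: wraps the real summarization call
        int maxConcurrency)
    {
        if (maxConcurrency < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));
        }

        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = documentPaths.Select(async path =>
        {
            await gate.WaitAsync();       // wait for a free slot
            try
            {
                return await summarizeAsync(path);
            }
            finally
            {
                gate.Release();           // free the slot even if the call failed
            }
        }).ToList();

        return await Task.WhenAll(tasks);
    }
}
```

The right value for `maxConcurrency` depends on the rate limits and latency of the summarization backend, so it’s worth making it configurable rather than hard-coding it.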
Open-Source Approaches to Document Summarization in .NET
As always in my articles, I like to present open-source options for teams that prefer to build everything in-house. This will, of course, involve multiple layers of tooling and integration.
We’ll start with the text extraction portion of the summarization process, and we’ll break these down in terms of the document format they work for.
- PDF text extraction: We can use `itext` (formerly `itext7`) or `PdfPig` for extracting text from PDFs. `itext` boasts a whopping 125 million worldwide users, which is the first indication that it’s worth investigating further. It extracts text from digital PDFs directly, offers OCR for images and scanned PDFs through its separate pdfOCR add-on, and it’s relatively easy to use with plenty of code snippets made available for reference. `PdfPig` has a not-too-shabby 18 million downloads on NuGet, and it can read both the text and the embedded images in a PDF; it doesn’t perform OCR itself, though, so image-only PDFs need a separate OCR step.
- Office Open XML (OOXML) Files: We don’t need to look much further than the standard Open XML SDK to work with Word, Excel, PowerPoint, or other OOXML file types. We’ll find all the text extraction methods we need ready for implementation in a C# application.
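- Email and message file containers: For this category of file type, we’ll want to check out the `MailKit` SDK and the `MimeKit` library it’s built on. `MimeKit` handles MIME parsing (i.e., EML files), while `MailKit` adds robust features for security, SMTP, POP3, IMAP, and proxy support. Outlook MSG files use a different binary format, so they typically need a separate parser.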
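To give a sense of what these extraction libraries do under the hood, here’s a dependency-free sketch for the simplest case: pulling paragraph text out of a .docx file. It relies only on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The `DocxTextSketch` class is hypothetical, and this isn’t how the Open XML SDK is used; in production, the SDK’s typed API is the safer choice because it also handles headers, footnotes, tables, and other parts this sketch ignores.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxTextSketch
{
    // WordprocessingML main namespace used by elements such as <w:p> and <w:t>.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per non-empty paragraph.
    public static string ExtractText(Stream docxStream)
    {
        using var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true);

        var entry = archive.GetEntry("word/document.xml");
        if (entry == null)
        {
            throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
        }

        using var partStream = entry.Open();
        var xml = XDocument.Load(partStream);

        // Each <w:p> is a paragraph; its visible text is split across <w:t> runs.
        var paragraphs = xml.Descendants(W + "p")
            .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
            .Where(text => text.Length > 0);

        return string.Join(Environment.NewLine, paragraphs);
    }
}
```

Whatever library does the extraction, the output of this stage is the same: plain text that can be handed to the summarization step.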
Now we’ll move on to the actual summarization portion of the document summarization process. This step requires an NLP model, and probably the easiest solution is to call a hosted LLM from .NET (e.g., OpenAI). As long as the prompt is carefully crafted to limit the response to a useful length (e.g., 1 paragraph only), this can work fairly well.
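As a sketch of what “carefully crafted” might look like, the helper below builds a request body in the shape used by OpenAI-style chat completions endpoints (a `model` plus a list of `messages` with `role` and `content`). The `SummaryPrompt` class, the prompt wording, and the 12,000-character input cap are all assumptions for illustration; the model name is left to the caller, and the actual HTTP call and authentication are omitted.

```csharp
using System;
using System.Text.Json;

public static class SummaryPrompt
{
    // Assumed instruction text; tune the wording and limits for the model in use.
    private const string SystemInstruction =
        "You summarize business documents. Reply with exactly one paragraph " +
        "of at most five sentences, with no preamble and no bullet points.";

    // Builds the JSON body for a chat-completions-style request.
    public static string BuildChatRequestJson(
        string documentText, string model, int maxInputChars = 12000)
    {
        // Clip very long documents so the request stays within the model's input limit.
        var clipped = documentText.Length > maxInputChars
            ? documentText.Substring(0, maxInputChars)
            : documentText;

        var payload = new
        {
            model,
            messages = new object[]
            {
                new { role = "system", content = SystemInstruction },
                new { role = "user", content = clipped }
            }
        };

        return JsonSerializer.Serialize(payload);
    }
}
```

Simple truncation is the crudest way to handle long inputs. For documents that exceed the limit, summarizing in chunks and then summarizing the chunk summaries usually preserves more of the content.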
Otherwise, if we have access to a local model, we might run it behind a small service (e.g., host a summarization model in a Python/FastAPI container and have C# call it over HTTP). We could also pick a summarization model that has an ONNX export, run inference on it, and implement tokenization and post-processing (this is by far the most painful option).
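For the small-service option, the C# side can stay very thin. The sketch below assumes a hypothetical local service exposing `POST /summarize` that accepts `{"text": "..."}` and returns `{"summary": "..."}`; the route, the payload shape, and the type names are placeholders and would need to match whatever the service actually defines.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Hypothetical request/response contracts for the local summarization service.
public sealed record SummarizeRequest(string Text);
public sealed record SummarizeResponse(string Summary);

public sealed class LocalSummarizerClient
{
    private readonly HttpClient _http;

    // The HttpClient is expected to have BaseAddress set,
    // e.g. http://localhost:8000/ (an assumed address).
    public LocalSummarizerClient(HttpClient http)
    {
        _http = http;
    }

    public async Task<string> SummarizeAsync(string text)
    {
        // PostAsJsonAsync serializes with web defaults, so "Text" is sent as "text".
        using var response = await _http.PostAsJsonAsync("summarize", new SummarizeRequest(text));
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadFromJsonAsync<SummarizeResponse>();
        if (body == null || string.IsNullOrWhiteSpace(body.Summary))
        {
            throw new InvalidOperationException("The summarization service returned no summary.");
        }

        return body.Summary;
    }
}
```

Injecting the `HttpClient` keeps the class easy to test with a stub handler and lets the host application control timeouts and retries.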
Generating Document Summaries With a Web API
In a lot of cases, the effort required to build and maintain a full document summarization pipeline outweighs the benefits of gaining complete control. One alternative is to simplify the implementation by handling all of the steps described above in a single comprehensive step.
To do that, we can implement an AI Web API, and we’ve provided one good option below. This particular API will give us a level of control over how aggressively the model attempts to recognize content in our documents, and it’ll accept a wide range of relevant document formats (including PDF, OOXML, and several image formats) before returning a simple, concise response object summarizing the document’s contents.
To set up this API in a C# project, we’ll first install the .NET SDK via NuGet.
dotnet add package Cloudmersive.APIClient.NETCore.DocumentAI --version 1.0.0
Following that, we’ll import the required classes:
using System;
using System.Diagnostics;
using Cloudmersive.APIClient.NETCore.DocumentAI.Api;
using Cloudmersive.APIClient.NETCore.DocumentAI.Client;
using Cloudmersive.APIClient.NETCore.DocumentAI.Model;
We can now refer to the example code below to structure our request. It’s very straightforward; all we’re doing is establishing our connection with an API key, creating an API instance, configuring our request parameters (only the input file is required), and wrapping the call in a simple try/catch block to surface any exception the SDK throws.
namespace Example
{
    public class ExtractSummaryExample
    {
        public static void Main()
        {
            // Configure API key authorization: Apikey
            Configuration.Default.AddApiKey("Apikey", "YOUR_API_KEY");
            var apiInstance = new ExtractApi();

            // Optional recognition mode: "Advanced" (default) provides the highest accuracy
            // but slower speed; "Normal" is faster but less accurate on low-quality images.
            var recognitionMode = "Advanced";

            // Input document, or photos of a document, to summarize.
            // The using declaration closes the file handle when Main exits.
            using var inputFile = new System.IO.FileStream("C:\\temp\\inputfile", System.IO.FileMode.Open);

            try
            {
                // Extract Summary from a Document using AI
                SummarizeDocumentResponse result = apiInstance.ExtractSummary(recognitionMode, inputFile);
                Debug.WriteLine(result);
            }
            catch (Exception e)
            {
                Debug.Print("Exception when calling ExtractApi.ExtractSummary: " + e.Message);
            }
        }
    }
}
Our API responses will follow a very simple format:
{
"Successful": true,
"DocumentSummaryText": "string"
}
The “DocumentSummaryText” attribute will contain a 1-paragraph text summary we can easily ship anywhere downstream.
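The SDK already returns a typed response object, but if the raw JSON is stored or passed between services, a downstream consumer might read it along these lines. This is a minimal sketch using `System.Text.Json`; the `SummaryResult` record and `SummaryResultParser` class are hypothetical names, with properties mirroring the two fields shown in the response format above.

```csharp
using System;
using System.Text.Json;

// Mirrors the two fields of the response JSON shown above.
public sealed record SummaryResult(bool Successful, string DocumentSummaryText);

public static class SummaryResultParser
{
    // Returns the summary text, or throws if the response is missing or unsuccessful.
    public static string GetSummaryOrThrow(string json)
    {
        var result = JsonSerializer.Deserialize<SummaryResult>(json);

        if (result == null || !result.Successful)
        {
            throw new InvalidOperationException("The summarization request was not successful.");
        }

        if (string.IsNullOrWhiteSpace(result.DocumentSummaryText))
        {
            throw new InvalidOperationException("The response did not contain a summary.");
        }

        return result.DocumentSummaryText;
    }
}
```

Checking `Successful` before using the text keeps empty or failed summaries from silently flowing into search indexes or review queues.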
The advantage of using this API is in abstracting away the complexity of document summarization capabilities while retaining precise control over where and how that feature is implemented in our production code. It’s a good alternative to in-house tool building in enterprise environments.
Conclusion
In this article, we discussed the value-add of implementing a document summarization tool into a C# document processing pipeline, and we highlighted the challenges associated with building out a document summarization pipeline in-house. We suggested some open-source tools for C# developers to use, and we also provided a fully realized web API solution designed to abstract document summarization complexity away from our environment.