PII Leakage Detection and Measuring the Accuracy of Reports and Statements Using Machine Learning

Securing sensitive data and validating the correctness of reports and statements by checking for inconsistencies with machine learning capabilities.

Sep. 02, 25 · Analysis

Reports, invoices, and statements play a vital role in sharing weekly, monthly, and annual usage data and trends with end users. Utility usage, financial trends, credit statements, and medical data are all shared in the form of reports and statements, in both electronic and paper formats. These documents contain personally identifiable information (PII), including addresses, phone numbers, account numbers, medical history, and Social Security numbers. Data is also presented in tables and a wide variety of charts and graphs for an enhanced user experience.

Problem

Organizations and institutions frequently pay fines and penalties and negotiate settlements because of PII data breaches and inaccurate data in reports. The majority of organizations use a third-party vendor to generate and send these statements to their customers. The chances of misdelivery or of sharing inaccurate information are relatively high. Using a visual language model and machine learning techniques, we can detect these problems and fix them before the documents reach customers.

This article explains how different document extraction and validation techniques, including machine learning techniques, can be used to prevent PII leakage and measure the accuracy of reports and statements sent to end users.

What We Will Cover

Reports and statements generation, extraction, and validation
Document extraction using non-machine learning applications
Document extraction using multiple machine learning platforms
- Visual language model (SmolDocling and Docling)
- Agentic document extraction (Landing AI)
- Amazon AI services (AWS Textract and Comprehend)
Document validation

Overview of Reports and Statements Generation

Customers’ everyday transaction data is stored in databases or a data warehouse. To generate reports or statements, a software process queries the data, collects it, converts it to a specific format, and then passes it to other processes for document generation. After that, the statements are sent to customers by email or printed and mailed to their addresses. Mistakes can happen during any phase of this generation process.

Extraction and Validation

Organisations widely use the PDF format to generate reports and statements. These PDFs are of two types:

Generated using software
Scanned by a physical device

To validate the accuracy of these documents, they are parsed and extracted as text and compared against the data storage, the source of truth.
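
The comparison step can be sketched as follows. This is a minimal illustration, assuming the extracted values have already been mapped to named fields; the field names (`account_number`, `amount_due`) and the `normalize` and `compare_fields` helpers are hypothetical and not part of any library.

```python
import re


def normalize(value):
    """Lowercase and drop everything except letters, digits, and dots,
    so that formatting differences are not reported as mismatches."""
    return re.sub(r"[^a-z0-9.]", "", value.lower())


def compare_fields(extracted, source):
    """Return {field: (expected, actual)} for every source field that is
    missing from the extraction or differs after normalization."""
    mismatches = {}
    for field, expected in source.items():
        actual = extracted.get(field)
        if actual is None or normalize(actual) != normalize(expected):
            mismatches[field] = (expected, actual)
    return mismatches


# Hypothetical source-of-truth record and extracted statement values
source = {"account_number": "1234-5678", "amount_due": "$1,204.50"}
extracted = {"account_number": "1234 5678", "amount_due": "$1,240.50"}
print(compare_fields(extracted, source))
```

Normalizing before comparing keeps layout noise (spaces, dashes, currency symbols) from producing false alarms, while a transposed digit in an amount is still reported.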

Extraction

Now, let us take a look at different flavours of document extraction using non-machine learning and machine learning-based mechanisms.

Legacy Non-Machine Learning Approach

PDF to text conversion:

A combination of a PDF parser and an OCR engine is used to parse generated PDFs and extract text from scanned PDFs, thereby pulling the data from these statements.

Programming languages like Python and Java have wrappers built on top of Tesseract OCR that can be utilised for text recognition.

Optical character recognition:

OCR engine-based software is used to extract text from image files and convert it into plain text or a machine-readable format. Tesseract is a popular open-source OCR engine originally developed at HP and later sponsored by Google.

Parsers:

Python – pypdf
Java – Apache PDFBox

Tesseract OCR wrapper:

Python – pytesseract
Java – Tess4j

Python

import pytesseract
from pdf2image import convert_from_path

# Path to the scanned statement (placeholder; point this at your own file)
pdf_path = "statement.pdf"

# Convert PDF pages to images (pdf2image requires Poppler to be installed)
pdf_images = convert_from_path(pdf_path, dpi=300)

extract = ""

# Iterate through each page image and extract text
for i, pdf_image in enumerate(pdf_images):
    print(f"Processing pdf_image {i+1}...")
    # Extract text from the image using PyTesseract
    text = pytesseract.image_to_string(pdf_image)
    extract += f"--- pdf_image {i+1} ---\n"
    extract += text + "\n\n"

# Print the extracted text
print(extract)
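
For generated PDFs, the text layer can usually be read directly with a parser, and OCR is only needed for pages that have no usable text. The sketch below shows one possible way to route pages between the two, assuming pypdf is installed; the `needs_ocr` helper and its `min_chars` threshold are hypothetical choices for illustration, not part of pypdf.

```python
def extract_text_layer(pdf_path):
    """Read the embedded text of each page with pypdf (no OCR involved)."""
    from pypdf import PdfReader  # imported lazily so the helper below has no dependencies

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def needs_ocr(page_texts, min_chars=20):
    """Return the indexes of pages whose text layer is empty or nearly empty.

    min_chars is an assumed cut-off: scanned pages typically yield no text,
    while generated pages yield far more than a handful of characters.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


# Example with hand-written page texts standing in for a three-page PDF
pages = ["Account summary for March 2025", "", "  \n"]
print(needs_ocr(pages))
```

Pages flagged by `needs_ocr` could then be rendered to images (for example with pdf2image's `first_page` and `last_page` arguments) and passed to pytesseract, so OCR runs only where it is needed.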
  

Shortcomings of the legacy approach:

A limitation of legacy processes is their inability to reliably extract data from structured PDF elements such as tables, charts, structured lists, and multi-column layouts. They also cannot extract data from a bounding box, a frame within the document that contains related information.

Machine Learning Approach

With visual language models (VLMs), we can move beyond OCR and its limitations, extracting information from PDFs into a structured format, including tables, charts, lists, and bounding boxes. The extracted data can then be passed to a downstream system for analysis and accuracy checks.

Visual language model:

A VLM lets an AI system read and understand images or videos. It generates text based on the information present in the visual input and can be prompted with questions about that input.

SmolDocling and Docling

SmolDocling is an ultra-compact VLM capable of end-to-end multimodal document conversion. With 256M parameters, its performance rivals that of large visual language models (LVLMs) up to 20 times larger. It reads documents in image format and converts them to DocTags, an XML-style markup format. Document elements are represented as tags such as heading, paragraph, table, chart, and list, and each element's position on the page is captured as a bounding box by nested location tags.
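
To make the idea of tags with nested location tags concrete, here is a small sketch that pulls elements and bounding boxes out of a DocTags-like string. The sample markup is hand-written for illustration and only approximates real model output; the tag names, the four-value location layout, and the `parse_doctags` helper are assumptions. In practice, Docling's own loaders are the supported way to consume DocTags.

```python
import re

# Illustrative DocTags-like snippet (hand-written, not real model output)
SAMPLE = (
    "<doctag>\n"
    "<section_header_level_1><loc_58><loc_44><loc_426><loc_60>"
    "Account Statement</section_header_level_1>\n"
    "<text><loc_58><loc_70><loc_300><loc_84>Jane Doe, 12 Example Street</text>\n"
    "</doctag>"
)

# One element: an opening tag, exactly four location tags, text, matching close tag
ELEMENT_RE = re.compile(
    r"<(?P<tag>\w+)>(?P<locs>(?:<loc_\d+>){4})(?P<text>[^<]*)</(?P=tag)>"
)


def parse_doctags(doctags):
    """Return a list of {tag, box, text} dicts for simple leaf elements.

    The box is assumed to be (left, top, right, bottom) on the model's grid.
    """
    elements = []
    for match in ELEMENT_RE.finditer(doctags):
        box = tuple(int(n) for n in re.findall(r"\d+", match.group("locs")))
        elements.append(
            {"tag": match.group("tag"), "box": box, "text": match.group("text").strip()}
        )
    return elements


for element in parse_doctags(SAMPLE):
    print(element)
```

Keeping the bounding box next to each piece of text is what later allows validation to target a specific region, such as the address block of a statement.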

Docling is an open-source Python library for document parsing and processing. It reads generated PDFs efficiently and converts them into structured data formats like HTML, JSON, text, and DocTags that can be consumed by other AI systems.

Python

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
)
from docling.datamodel import vlm_model_specs

# Use the SmolDocling MLX model from the Docling library (MLX runs on Apple silicon)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.SMOLDOCLING_MLX,  
)

# Initialize the DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

# Convert the PDF at pdf_path (the path to your file); the result's
# .document attribute holds the DoclingDocument object
extract = converter.convert(pdf_path)

# Extract the text content as Markdown
markdown_content = extract.document.export_to_markdown()

# Print the extracted text
print(markdown_content)
  

Landing AI’s Agentic Document Extraction (ADE)

Agentic Document Extraction provides an API, a library, and a UI playground for extracting data from complex and unstructured documents, such as PDFs and images, into a structured format like JSON or Markdown. Internally, it uses a large vision model (LVM) and a large language model (LLM) for the extraction tasks. It can extract text, tables, charts, and form fields into JSON, along with the relationships between document elements, without any model training or fine-tuning.

A schema can be generated from the document, and we can use that schema to extract a specific subsection or bounding box instead of the complete document.

Python

from agentic_doc.parse import parse

# Parse the PDF document using Agentic Doc parse function
extract = parse(pdf_path)

# Accessing extracted data
if extract:
    # Print the first parsed document's content as Markdown
    print("Extracted Data (Markdown):")
    print(extract[0].markdown)
  

AWS Textract:

Textract is an AWS machine learning service that extracts data from scanned documents. It goes beyond traditional OCR by extracting data as key-value pairs, thereby retaining the structure of the document, and it uses computer vision and natural language processing for extraction. Textract's APIs return confidence scores with the parsed data, making it easy to evaluate the extraction. For graphs and charts, a multi-step process that combines Amazon Rekognition with Textract can be employed. After extraction, the data can be consumed within the AWS ecosystem for further processing.
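
The confidence scores can be used to flag text that should be reviewed before it is trusted for validation. The sketch below assumes boto3 with configured AWS credentials for the actual call; the `low_confidence_lines` helper and the 90.0 threshold are hypothetical choices. The response shown in the example is hand-built to mirror the `Blocks` structure that Textract returns.

```python
def analyze_statement(document_bytes):
    """Call Textract's synchronous AnalyzeDocument API (needs AWS credentials)."""
    import boto3  # imported lazily so the helper below runs without AWS access

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )


def low_confidence_lines(response, threshold=90.0):
    """Return (text, confidence) for LINE blocks scored below the threshold.

    Textract reports Confidence on a 0-100 scale.
    """
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) < threshold
    ]


# Hand-built response fragment for illustration
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Account: 1234", "Confidence": 99.2},
        {"BlockType": "LINE", "Text": "SSN: 123-45-6T89", "Confidence": 71.5},
    ]
}
print(low_confidence_lines(sample_response))
```

Lines that fall below the threshold can be routed to manual review instead of being compared against the source automatically.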

AWS Comprehend:

Comprehend is a natural language processing service that uses its internal parser to extract data from generated PDFs. Its custom entity recognition feature can be used to recognise and extract specific entities as needed. Custom recognizers are trained on your own annotated examples and can be retrained as new entity types or document layouts appear. As with any AWS service, the extracted information can be stored or processed within the AWS ecosystem.
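
Besides custom entities, Comprehend also offers a built-in PII detection API, which fits the goal of this article. The sketch below assumes boto3 with configured AWS credentials for the actual call; the `pii_spans` helper and the 0.8 score cut-off are hypothetical. The response in the example is hand-built to mirror the `Entities` structure returned by `detect_pii_entities`.

```python
def detect_pii(text):
    """Call Comprehend's DetectPiiEntities API (needs AWS credentials)."""
    import boto3  # imported lazily so the helper below runs without AWS access

    client = boto3.client("comprehend")
    return client.detect_pii_entities(Text=text, LanguageCode="en")


def pii_spans(text, response, min_score=0.8):
    """Return (entity_type, matched_text) for entities at or above min_score.

    Comprehend returns character offsets rather than the matched text,
    so the text is sliced out of the original string here.
    """
    return [
        (entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
        for entity in response.get("Entities", [])
        if entity.get("Score", 0.0) >= min_score
    ]


# Hand-built response fragment for illustration
statement_text = "SSN 123-45-6789 for Jane"
sample_response = {
    "Entities": [
        {"Type": "SSN", "Score": 0.99, "BeginOffset": 4, "EndOffset": 15},
        {"Type": "NAME", "Score": 0.40, "BeginOffset": 20, "EndOffset": 24},
    ]
}
print(pii_spans(statement_text, sample_response))
```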

Comparison of Different Extraction Methods

All of the machine learning-based approaches move beyond standalone OCR engines and use AI models for faster, more capable extraction. The compact SmolDocling VLM shows that extraction is possible with minimal cost and infrastructure. By combining AWS Textract and Comprehend, we can extract both generated and scanned reports in synchronous or asynchronous mode.

AWS Textract, AWS Comprehend, and Landing AI offer different pricing plans, allowing us to choose based on our usage. Use SmolDocling if you plan to build a wrapper on top of it and customize it to your needs. Otherwise, prefer AWS Comprehend and Textract, because they integrate with numerous other AWS services.

Validation

Now it is time to validate the extracted data against the source before the reports and statements are shipped to the end user. The structured Markdown extracted from the documents can be held in memory or in a storage system like AWS S3 or DynamoDB. It can then be validated against source data located in an RDBMS or NoSQL database. If the source data is located in Snowflake, we can use Cortex Agents and Cortex Analyst to search for the data and check the accuracy of PII and other information.
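
One simple check during validation is to look for PII in the extracted text that does not belong to the intended recipient, which would indicate leakage or misdelivery. The sketch below is a minimal, regex-based illustration using only US-style Social Security and phone number patterns; the patterns, the record fields, and the `find_foreign_pii` helper are assumptions, and a production system would need broader detection.

```python
import re

# Simplified US-style patterns, chosen for illustration only
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def digits_only(value):
    """Strip formatting so '(555) 010-4477' and '555-010-4477' compare equal."""
    return re.sub(r"\D", "", value)


def find_foreign_pii(extracted_text, source_record):
    """Return (label, value) for PII-like values not present in the source record."""
    allowed = {digits_only(value) for value in source_record.values()}
    findings = []
    for label, pattern in PATTERNS.items():
        for value in pattern.findall(extracted_text):
            if digits_only(value) not in allowed:
                findings.append((label, value))
    return findings


# Hypothetical extracted Markdown and the recipient's source-of-truth record
extracted_markdown = (
    "Jane Doe | SSN 123-45-6789 | Phone 555-010-4477\n"
    "John Roe | SSN 987-65-4321"
)
source_record = {"ssn": "123-45-6789", "phone": "(555) 010-4477"}
print(find_foreign_pii(extracted_markdown, source_record))
```

Here the second Social Security number is reported because it does not match anything in the recipient's record, while the recipient's own values pass even though their formatting differs from the source.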

Validation of PII and of report and statement content can be performed at any of three points: immediately after statement generation, before the documents are shipped, or before a user downloads them from an application.

Conclusion

Given the multiple steps and processes involved in generating reports and statements that contain PII and user information, it is crucial to validate the content of these documents to ensure the accuracy of the information. We can also load the extracted data into retrieval-augmented generation (RAG) systems for later processing and analysis.

With machine learning capabilities, it becomes easy and efficient to extract the data from complex documents and compare it with its source. The time and cost of extraction and validation can be drastically reduced by utilising a visual language model and machine learning capabilities.
