How to Prevent Data Loss in C#
Building reliable text-based PII redaction means solving many independent problems at once. We'll address those challenges and learn some solutions.
Data loss prevention is one of those capabilities that tends to get prioritized reactively, typically after a compliance violation surfaces or a data leak makes its way into an incident report. By then, the damage is done; we no longer have any control over the data we lost. The more practical approach is to intercept sensitive data before it leaves our application, at the point where text containing PII or other regulated information is about to be transmitted or stored somewhere downstream.
It’s easier said than done, of course. Building reliable text-based PII detection from scratch is difficult. Sensitive data appears in unstructured text in unpredictable ways, and the term itself covers an extremely broad range of data types. The detection logic that works for one data type often doesn’t carry over to others. Health-related data in particular introduces its own layer of complexity, involving PHI categories that require specific recognition logic well beyond what general-purpose NLP approaches handle reliably.
In this article, we’ll look at what it means to implement text-based PII redaction in C#, and we’ll examine the challenges of building that capability in-house. Towards the end, we’ll walk through a dedicated API that can simplify this by handling detection and redaction across 34 configurable data types.
What “Text Redaction” Actually Covers
Before we get into the implementation side of this topic, it’s worth being specific about the full scope of the data detection and redaction problem. PII redaction in text isn’t limited to finding email addresses and phone numbers; a comprehensive solution needs to cover financial identifiers such as credit card numbers, bank account numbers, and IBANs. It needs to handle government identifiers, including social security numbers, taxpayer IDs, passport numbers, and driver’s licenses.
In healthcare, it needs to recognize health-related PHI, including insurance numbers, member IDs, health plan beneficiary numbers, injury and disease references, treatment types, treatment dates, payments made for treatment, and health universal record locators. It should also cover infrastructure and security data like bearer tokens, HTTP cookies, private keys, credentials, IP addresses, MAC addresses, source code, and deep web URLs.
Not every industry needs to address every category, but for many companies a large share of this data is in scope. Solving this problem is meaningfully different from simple pattern matching in text. Many of these data types don’t have a fixed format, and they appear in context-dependent ways, requiring AI-based recognition for reliable detection across real-world input.
The Challenge of Building This In-House
If we’re building a text-based PII redaction pipeline in C#, we need to assemble several components, each with its own maintenance burden.
The first problem is detection. Regex-based approaches can work reasonably well for structured identifiers with predictable formats, like credit card numbers or SSNs, but they break down quickly against free-form text where the same information appears in a dozen different phrasings. Covering the full range of data types listed in the previous section with regex alone would require an extremely extensive (and extremely brittle) ruleset that degrades as language patterns evolve.
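To make the regex point concrete, here is a minimal sketch of pattern-based detection for two structured identifiers. The class name, method names, and patterns are illustrative assumptions rather than part of any library discussed in this article, and the patterns only cover one common written form of each identifier.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StructuredIdDetector
{
    // US SSN written as 3-2-4 digits with dashes, e.g. 123-45-6789.
    private static readonly Regex SsnPattern =
        new Regex(@"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b", RegexOptions.Compiled);

    // 13 to 19 digits, optionally separated by single spaces or dashes.
    private static readonly Regex CardPattern =
        new Regex(@"\b[0-9](?:[ -]?[0-9]){12,18}\b", RegexOptions.Compiled);

    public static bool ContainsSsn(string text) => SsnPattern.IsMatch(text);

    // A candidate only counts as a card number if it also passes the Luhn check.
    public static bool ContainsCardNumber(string text) =>
        CardPattern.Matches(text).Cast<Match>().Any(m => PassesLuhn(m.Value));

    // Luhn checksum: filters out digit runs that merely look like card numbers.
    private static bool PassesLuhn(string candidate)
    {
        int sum = 0;
        bool doubleIt = false;
        for (int i = candidate.Length - 1; i >= 0; i--)
        {
            char c = candidate[i];
            if (c < '0' || c > '9') continue; // skip separators

            int digit = c - '0';
            if (doubleIt)
            {
                digit *= 2;
                if (digit > 9) digit -= 9;
            }
            sum += digit;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

A call such as StructuredIdDetector.ContainsSsn("My SSN is 123-45-6789.") matches, while the same information phrased as "my social starts with 123 and ends in 6789" slips through, which is exactly the brittleness described above.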
We’ve called out health-related PHI detection a few times already, and we’ll do that again now. PHI detection compounds the above problem considerably. Recognizing references to injuries, diseases, treatment types, and treatment dates in unstructured text requires contextual understanding that regex isn’t capable of providing. If we’re going to build reliable PHI detection, we’ll probably need to train or fine-tune an NLP model on labeled healthcare data, which most development teams don’t have access to.
We’ll finish by addressing the redaction problem. It’s worth noting first that this isn’t strictly necessary; we can certainly incorporate PII detection without an automatic redaction step. If the goal is to enable a normal, redacted flow of information through our network, however, we’ll want to consider efficient redaction techniques.
One way we can handle this is to replace detected data with asterisks (or another symbol) while preserving the surrounding text structure. This requires careful handling, particularly when we have overlapping detected spans. Managing that cleanly across dozens of data types in a single pipeline pass isn’t trivial for any engineering team.
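As a rough sketch of the overlapping-span problem, the helper below merges detected spans before masking them, so a character covered by two detections is only handled once. The SpanRedactor class, its (Start, Length) span shape, and the delete option are assumptions made for illustration; a real detector would supply its own span type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only.
public static class SpanRedactor
{
    // Masks (or removes) every character covered by at least one span.
    // Spans are (Start, Length) pairs and may overlap or nest.
    public static string Redact(
        string text,
        IEnumerable<(int Start, int Length)> spans,
        bool delete = false)
    {
        // Sort by start, clamp to the text, and merge overlapping or touching spans.
        var merged = new List<(int Start, int End)>();
        foreach (var span in spans.OrderBy(s => s.Start))
        {
            int start = Math.Max(0, span.Start);
            int end = Math.Min(text.Length, span.Start + span.Length);
            if (start >= end) continue; // ignore empty or out-of-range spans

            if (merged.Count > 0 && start <= merged[merged.Count - 1].End)
            {
                var last = merged[merged.Count - 1];
                merged[merged.Count - 1] = (last.Start, Math.Max(last.End, end));
            }
            else
            {
                merged.Add((start, end));
            }
        }

        // Rebuild the text: copy untouched segments, mask or drop merged spans.
        var result = new StringBuilder(text.Length);
        int cursor = 0;
        foreach (var (start, end) in merged)
        {
            result.Append(text, cursor, start - cursor);
            if (!delete) result.Append('*', end - start); // same-length mask keeps layout
            cursor = end;
        }
        result.Append(text, cursor, text.Length - cursor);
        return result.ToString();
    }
}
```

For the input "Email jane@example.com now", an email span of (6, 16) and an overlapping domain span of (11, 11) merge into one region, so the address is replaced by sixteen asterisks instead of being masked twice. Merging matters most in delete mode, where removing unmerged spans one at a time would shift the offsets of later spans.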
Text PII Redaction With a Web API
For most teams, the effort required to build and maintain a full text redaction pipeline outweighs the benefits of building it in-house. A practical alternative is to simply offload detection/recognition/redaction to a dedicated web API. That means each step can be handled in a single block of code rather than assembling all the components independently.
We’ll walk through one such implementation with C# code examples.
We’ll first install the SDK via NuGet:
Install-Package Cloudmersive.APIClient.NETCore.DLP -Version 1.1.0
We’ll then import the required namespaces:
using System;
using System.Diagnostics;
using Cloudmersive.APIClient.NETCore.DLP.Api;
using Cloudmersive.APIClient.NETCore.DLP.Client;
using Cloudmersive.APIClient.NETCore.DLP.Model;
The call itself is straightforward. A few lines of code invoke the method, and we control detection and redaction behavior through the JSON request body:
namespace Example
{
    public class RedactTextAdvancedExample
    {
        // A C# entry point must be a static method named Main
        public static void Main()
        {
            // Configure API key authorization: Apikey
            Configuration.Default.AddApiKey("Apikey", "YOUR_API_KEY");
            var apiInstance = new RedactApi();
            var body = new DlpAdvancedRedactionRequest(); // DlpAdvancedRedactionRequest | Input request (optional)
            try
            {
                // Redact User Data in Input Text (Advanced)
                DlpAdvancedRedactionResponse result = apiInstance.RedactTextAdvanced(body);
                Debug.WriteLine(result);
            }
            catch (Exception e)
            {
                Debug.Print("Exception when calling RedactApi.RedactTextAdvanced: " + e.Message);
            }
        }
    }
}
We can build our request body off the model below. The idea here is that we can allow sensitive data types that are necessary for any given workflow, and we can choose to redact sensitive data by either replacing it with asterisks or deleting it entirely from the text input:
{
"InputText": "Hello, world!",
"AllowEmailAddress": false,
"AllowPhoneNumber": false,
"AllowStreetAddress": false,
"AllowPersonName": false,
"AllowBirthDate": false,
"AllowPassportNumber": false,
"AllowDriversLicense": false,
"AllowSocialSecurityNumber": false,
"AllowTaxpayerID": false,
"AllowCreditCardNumber": false,
"AllowCreditCardExpirationDate": false,
"AllowCreditCardVerificationCode": false,
"AllowBankAccountNumber": false,
"AllowIBAN": false,
"AllowHealthInsuranceNumber": false,
"AllowBearerToken": false,
"AllowHttpCookie": false,
"AllowPrivateKeys": false,
"AllowCredentials": false,
"AllowDeepWebUrls": false,
"AllowSourceCode": false,
"AllowIpAddress": false,
"AllowMacAddress": false,
"AllowHealthInsuranceMemberID": false,
"AllowHealthInjuryOrDisease": false,
"AllowHealthTypeOfTreatment": false,
"AllowHealthDateAndTimeOfTreatment": false,
"AllowHealthPlanBeneficiaryNumber": false,
"AllowHealthPaymentsMadeForTreatment": false,
"AllowVehicleID": false,
"AllowDeviceID": false,
"AllowNamesOfRelatives": false,
"AllowHealthUniversalRecordLocator": false,
"AllowBiometrics": false,
"RedactionMode": "ReplaceWithAsterisk",
"ProvideAnalysisRationale": true
}
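One way to keep a request like this manageable is to default every allow flag to false and name only the data types a workflow permits. The sketch below builds the JSON with System.Text.Json using the field names from the model above; the RedactionRequestBuilder class and its Build method are hypothetical, and the sketch does not assume anything about the property names on the SDK's own request class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper for illustration only.
public static class RedactionRequestBuilder
{
    // Data type names taken from the Allow* fields in the request model above.
    private static readonly string[] DataTypes =
    {
        "EmailAddress", "PhoneNumber", "StreetAddress", "PersonName", "BirthDate",
        "PassportNumber", "DriversLicense", "SocialSecurityNumber", "TaxpayerID",
        "CreditCardNumber", "CreditCardExpirationDate", "CreditCardVerificationCode",
        "BankAccountNumber", "IBAN", "HealthInsuranceNumber", "BearerToken",
        "HttpCookie", "PrivateKeys", "Credentials", "DeepWebUrls", "SourceCode",
        "IpAddress", "MacAddress", "HealthInsuranceMemberID", "HealthInjuryOrDisease",
        "HealthTypeOfTreatment", "HealthDateAndTimeOfTreatment",
        "HealthPlanBeneficiaryNumber", "HealthPaymentsMadeForTreatment", "VehicleID",
        "DeviceID", "NamesOfRelatives", "HealthUniversalRecordLocator", "Biometrics"
    };

    // Builds the request JSON with every flag false except the allowed types.
    public static string Build(
        string inputText,
        IEnumerable<string> allowedTypes,
        string redactionMode = "ReplaceWithAsterisk")
    {
        var allowed = new HashSet<string>(allowedTypes);

        // Fail loudly on a mistyped name instead of silently redacting more than intended.
        var unknown = allowed.Where(t => !DataTypes.Contains(t)).ToList();
        if (unknown.Count > 0)
            throw new ArgumentException("Unknown data type(s): " + string.Join(", ", unknown));

        var body = new Dictionary<string, object> { ["InputText"] = inputText };
        foreach (var type in DataTypes)
            body["Allow" + type] = allowed.Contains(type);
        body["RedactionMode"] = redactionMode;
        body["ProvideAnalysisRationale"] = true;

        return JsonSerializer.Serialize(body);
    }
}
```

For example, RedactionRequestBuilder.Build("Contact me at jane@example.com", new[] { "EmailAddress" }) produces a body in which only AllowEmailAddress is true, which suits a workflow where email addresses are expected but every other category should be redacted.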
Our response essentially mirrors the structure of the allow flags, with a Contains flag for each data type. It adds a top-level CleanResult Boolean that summarizes whether sensitive data was identified, a RedactedText field containing the redacted text, and an AnalysisRationale field with some additional analysis:
{
"RedactedText": "string",
"CleanResult": true,
"ContainsEmailAddress": true,
"ContainsPhoneNumber": true,
"ContainsStreetAddress": true,
"ContainsPersonName": true,
"ContainsBirthDate": true,
"ContainsPassportNumber": true,
"ContainsDriversLicense": true,
"ContainsSocialSecurityNumber": true,
"ContainsTaxpayerID": true,
"ContainsCreditCardNumber": true,
"ContainsCreditCardExpirationDate": true,
"ContainsCreditCardVerificationCode": true,
"ContainsBankAccountNumber": true,
"ContainsIBAN": true,
"ContainsHealthInsuranceNumber": true,
"ContainsBearerToken": true,
"ContainsHttpCookie": true,
"ContainsPrivateKeys": true,
"ContainsCredentials": true,
"ContainsDeepWebUrls": true,
"ContainsSourceCode": true,
"ContainsIpAddress": true,
"ContainsMacAddress": true,
"ContainsHealthInsuranceMemberID": true,
"ContainsHealthInjuryOrDisease": true,
"ContainsHealthTypeOfTreatment": true,
"ContainsHealthDateAndTimeOfTreatment": true,
"ContainsHealthPlanBeneficiaryNumber": true,
"ContainsHealthPaymentsMadeForTreatment": true,
"ContainsVehicleID": true,
"ContainsDeviceID": true,
"ContainsNamesOfRelatives": true,
"ContainsHealthUniversalRecordLocator": true,
"ContainsBiometrics": true,
"AnalysisRationale": "string"
}
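With this many flags in the response, it can help to reduce them to a short list of what was actually found, for logging or alerting. The sketch below works on the raw response JSON using System.Text.Json and relies only on the Contains prefix shown in the model above; the RedactionResponseReader class is a hypothetical helper, not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration only.
public static class RedactionResponseReader
{
    private const string Prefix = "Contains";

    // Returns the data types whose Contains* flag is true in the response JSON,
    // with the "Contains" prefix stripped (e.g. "EmailAddress").
    public static List<string> DetectedTypes(string responseJson)
    {
        var detected = new List<string>();
        using var doc = JsonDocument.Parse(responseJson);

        foreach (var property in doc.RootElement.EnumerateObject())
        {
            bool isContainsFlag = property.Name.StartsWith(Prefix, StringComparison.Ordinal);
            if (isContainsFlag && property.Value.ValueKind == JsonValueKind.True)
                detected.Add(property.Name.Substring(Prefix.Length));
        }
        return detected;
    }
}
```

Given a response in which only ContainsEmailAddress and ContainsIpAddress are true, DetectedTypes returns a two-item list containing "EmailAddress" and "IpAddress", which is easier to log or alert on than the full set of flags.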
Conclusion
In this article, we looked at the scope of text-based PII redaction as a compliance and data loss prevention problem, and we outlined the challenges of building out that capability reliably in-house. We then walked through a dedicated API that handles detection and redaction in a single configurable API call.
For C# teams building document processing pipelines, outbound communication workflows, or any other system where text containing sensitive data needs to be sanitized before it moves downstream, this is a practical way to add comprehensive data loss prevention without committing resources towards building or maintaining the underlying detection logic.