How Opsbrew Masks PII Data in Logs Using Machine Learning and Regex
Opsbrew uses spaCy and a regex wrapper to mask PII data in cloud-native logs. Read about their journey in data masking for logs.
Masking PII data in logs is an overlooked yet important aspect when managing log data for cloud-native applications. I was in conversation with Arun Mohan, Co-founder of Opsbrew, who had an interesting story to share on this topic. In this post, I share the approach and learnings of Opsbrew as they incorporated data masking into their log management product.
Some Background on Opsbrew
From speaking to large organizations that rely on logs, the Opsbrew team found a need not just to route logs efficiently, but to do so securely. This means encrypting or masking log data appropriately. This started Opsbrew on a path to figure out the best way to mask PII (personally identifiable information) data efficiently and responsibly.
Why Mask Log Data?
Exposed log data is a vulnerability. It can be accessed by internal and external users and misused, intentionally or not. Today, the onus to mask log data rests not on any vendor but on the organization itself. For example, API management tools like Apigee do not include an option to mask data like passwords and usernames while logging the payload. They only temporarily hide the data in a trace.
SIEM tools like Splunk mask PII data in logs when received, and before indexing. However, this is too late, as the same logs could have gone to other tools or locations apart from Splunk. PII data needs to be masked at the source, before being sent to a SIEM tool.
With the complexity of cloud-native apps, log data can easily get lost in transit, and so can the PII data in these logs. It's important to mask log data at the very beginning of the log lifecycle for greater control and security. For organizations that operate on-premises data centers, this means masking data before it leaves your premises.
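As a rough illustration of masking at the source, here is a minimal sketch using Python's standard `logging` module. It is not Opsbrew's implementation. The `MaskEmailFilter` class, the `masked_app` logger name, the `[EMAIL]` placeholder, and the email pattern are all assumptions made for this example.

```python
import io
import logging
import re

# Hypothetical pattern for illustration; production email matching needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class MaskEmailFilter(logging.Filter):
    """Mask email addresses in a record before any handler receives it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so %-style arguments are masked too.
        message = record.getMessage()
        record.msg = EMAIL_RE.sub("[EMAIL]", message)
        record.args = None
        return True  # keep the record, now masked


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("masked_app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example's output in one place
    logger.handlers.clear()
    logger.filters.clear()
    # A filter on the logger runs before the record is passed to its handlers.
    logger.addFilter(MaskEmailFilter())
    logger.addHandler(logging.StreamHandler(stream))
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("password reset requested by %s", "jane.doe@example.com")
print(buffer.getvalue().strip())
```

Because the filter sits on the logger rather than on a single handler, every destination attached to that logger receives the masked message.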
The Logging Components in a Cloud-Native Stack
Today, on-premises data centers are used in combination with cloud to create a hybrid cloud setup. Additionally, organizations may use a combination of multiple public cloud platforms like Azure and AWS to leverage the best of each vendor's tooling. To integrate applications across a multicloud infrastructure, API gateway solutions like Apigee are predominantly used. And finally, there are multiple monitoring and SIEM tools like Azure Sentinel, Splunk, and AWS CloudTrail. Each of these tools may have access to your logs, including the PII data in them.
Log masking needs to apply across all these components. Opsbrew, being a log pipeline management solution, has access to log data end-to-end, and is uniquely positioned to handle log data masking.
Checkpost #1: spaCy
The team surveyed the available options for log masking, and settled on the open-source natural language processing library spaCy. spaCy is a Python library that is purpose-built to scan large volumes of text and extract specific information. It is typically used to prepare raw data for deep learning models, and it can be extended to scan log data as well.
Opsbrew decided to use it to identify potential PII in logs. spaCy on its own wouldn't mask the log data. Rather, it would point out which data is PII and needs masking.
Once implemented, Opsbrew found that spaCy was great for large-scale data crunching and was not resource intensive. However, its accuracy was not very high. For example, there were a few instances where spaCy was unable to differentiate between certain emails and website URLs, or between some phone numbers and zip codes. Also, spaCy was good at spotting Western names, but not less common names such as Arabic ones. With many of their clients based in the Middle East, this was a priority for Opsbrew. They needed higher accuracy levels.
spaCy can potentially be trained to handle even these more difficult cases, but that would involve far more training and time than the team was willing to dedicate. The team decided to close the gap with another method: a regex wrapper.
Checkpost #2: A Regex Wrapper
The idea is to use regex as a second checkpost to vet the suggestions coming from spaCy. The reason is that regex is easier to implement and can deliver more accurate results. Regex is rules-based: it has no machine learning capabilities, but it is great for pattern recognition. The Opsbrew team wrote a custom regex wrapper over spaCy. This wrapper confirms whether the PII data picked up by spaCy is accurate, acting as a second checkpost.
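To make the second-checkpost idea concrete, here is a minimal sketch of how a rules layer might confirm or correct a suggested label. It is not Opsbrew's wrapper. The `Candidate` type is a hypothetical stand-in for an entity suggested by spaCy, and the labels and patterns in `RULES` are simplified assumptions.

```python
import re
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """A suspected PII span; a hypothetical stand-in for an NER suggestion."""
    label: str
    text: str


# Hypothetical, deliberately simple rules. fullmatch() requires the whole span to fit.
RULES = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"(?:https?://|www\.)\S+"),
    "PHONE": re.compile(r"(?:\+?\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{4}"),
    "ZIP": re.compile(r"\d{5}(?:-\d{4})?"),
}


def vet(candidate: Candidate) -> Optional[str]:
    """Return the label the rules confirm, or None if no rule matches."""
    # Try the suggested label first, then fall back to the remaining rules.
    order = [candidate.label] + [name for name in RULES if name != candidate.label]
    for label in order:
        rule = RULES.get(label)
        if rule is not None and rule.fullmatch(candidate.text):
            return label
    return None


suggestions = [
    Candidate("EMAIL", "www.example.com"),  # a URL mislabeled as an email
    Candidate("PHONE", "90210"),            # a zip code mislabeled as a phone number
    Candidate("EMAIL", "jane@example.com"),
]
for suggestion in suggestions:
    print(suggestion.text, "->", vet(suggestion))
```

Trying the suggested label first keeps the model's answer whenever the rules agree with it. The fallback only relabels a span when a different rule fits the whole text.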
Once detected, the regex wrapper would then replace the PII data with masking values. This is the easy part. The bigger challenge is to accurately identify the PII data, which spaCy and the regex wrapper do together.
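The replacement step itself can be sketched in a few lines. This is an illustration under stated assumptions, not Opsbrew's code. The `(start, end, label)` span format and the `[LABEL]` placeholder style are hypothetical, and the spans are assumed not to overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def mask(line: str, spans: List[Span]) -> str:
    """Replace each confirmed, non-overlapping span with a placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        line = line[:start] + f"[{label}]" + line[end:]
    return line


line = "user=jane@example.com phone=555-123-4567 status=ok"
email = "jane@example.com"
phone = "555-123-4567"
spans = [
    (line.index(email), line.index(email) + len(email), "EMAIL"),
    (line.index(phone), line.index(phone) + len(phone), "PHONE"),
]
print(mask(line, spans))
```

Keeping the label in the placeholder preserves some debugging value: a reader can still see that an email was present without seeing the address itself.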
The Road Ahead
With PII masking implemented, Opsbrew doesn't plan to stop there. The next step for the team is to build context detection capabilities into Opsbrew so that the PII scrubbing would be even more accurate. For example, in a banking and finance app, Opsbrew would be able to automatically spot an account number, or other sensitive information, based on the context.
To implement this, the team plans to build on the regex model by adding layers of checks. This way, they can expand the types of PII data that are identified. They would also use custom Python scripts alongside the regex wrappers. Maybe a post on that is due soon.
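One simple way to approximate context detection is to treat a long digit run as an account number only when a telltale keyword appears shortly before it. The sketch below is a guess at the general idea, not Opsbrew's planned design. The keywords, the 8 to 16 digit range, the 30-character window, and the `[ACCOUNT]` placeholder are all assumptions.

```python
import re

# Hypothetical keywords and digit lengths, chosen only for illustration.
CONTEXT_RE = re.compile(r"\b(?:account|acct|iban)\b", re.IGNORECASE)
DIGITS_RE = re.compile(r"\b\d{8,16}\b")
WINDOW = 30  # characters of preceding text to inspect


def mask_account_numbers(line: str) -> str:
    """Mask digit runs only when a context keyword appears just before them."""

    def replace(match: re.Match) -> str:
        start = max(0, match.start() - WINDOW)
        context = line[start:match.start()]
        if CONTEXT_RE.search(context):
            return "[ACCOUNT]"
        return match.group(0)  # no supporting context, leave the digits alone

    return DIGITS_RE.sub(replace, line)


print(mask_account_numbers("transfer from account 1234567890 ok"))
print(mask_account_numbers("request_id 1234567890 took 12 ms"))
```

The same digits are masked in the first line and kept in the second, because only the first has a keyword in the preceding window.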
If you work with or manage logs for your organization, you can take a leaf out of Opsbrew's book and implement your own log masking solution. Or even better, you can check out what Opsbrew does. The key is to mask logs right from the source and ensure they stay masked at every step of the log lifecycle. Log masking shouldn't be overlooked, and with powerful tools like spaCy and tried-and-true regex wrappers, there's no excuse to leave PII data exposed in logs.
Published at DZone with permission of Twain Taylor, DZone MVB. See the original article here.