Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Securely Sign and Manage Documents Digitally With DocuSign and Ballerina
Usage of GenAI for Personalized Customer Experience in Mobile Apps
Recently, I haven’t been updating my open-source articles as frequently — not because I’ve stopped writing, but because the progress on open-source commercialization has been great, and the endless task of drafting proposals has consumed my time. As a native open-source commercial company, WhaleOps employs mostly engineers. Asking these open-source contributors to write proposals wastes their development time, and their proposals don’t always meet the quality standard. Unlike managing in a big company, being a startup CEO means stepping into any role the company needs. After every strategic meeting, I’m the first to roll up my sleeves and tackle the most urgent tasks. As a result, I haven’t had time to write articles, as my limited time is mostly taken up with creating proposals that follow the formal template style. Especially recently, with one bid after another, I’ve found myself questioning my own sanity.

The existing large models couldn’t solve my problem, so I built my own. As a tech person, I always look for tools to solve my problems. Naturally, I thought of large models, but they can’t fully grasp the nuances of our products and often produce unreliable outputs. Plus, you’re the one delivering the work, not the model. So, I decided to develop a proposal-generation tool in Python on top of a ChatGPT-compatible large model. This tool automatically generates a proposal from your product documentation, breaking product manuals down into feature points. Based on a human-created mapping of these points to the requirements, it generates a Word version of the proposal and an Excel deviation table. The model can condense or expand content, or simply copy the relevant functionality as needed.

Features of the Open-Source Proposal Tool

The functionality of this tool is simple; the most challenging part was the Word formatting (formatting in Word is always a pain). I experimented with several methods to make sure it follows the correct Title 1, Title 2, Title 3, body text, table, and image formats in the Word template. Staying true to the open-source spirit, I’ve uploaded the tool to my personal GitHub under an Apache License, so feel free to use it if you need it. Here’s what it does:

- Breaks down your product manual into a set of reusable detail documents, reducing the need to repeatedly reference the source document when drafting a proposal. You can also customize specific functionality (the default file name is “Template.docx”).
- Based on the requirements table filled in by a person, it automatically generates a proposal in a point-to-point response format, including all headings and content, with the correct Title 1, 2, 3 formatting, and automatically organizes body text, images, and bullet points (the default requirements table is “requirements_table.xlsx,” and the generated content is in “proposal_content.docx”).
- For any product requirements in the corresponding functionality section, it automatically copies the product manual content into the point-to-point response section, retaining images, tables, and bullet points. You can also rewrite the product description to suit different proposal needs.
- If there’s no matching functionality, the model automatically generates relevant content (review and modify as needed).
- Completes the technical requirements deviation table by automatically filling in responses in “requirements_table.xlsx,” formatted as “Answer: Fully supports, {Model-generated text based on project requirements},” and includes the corresponding section number in the proposal.

With this tool, you can quickly modify and generate proposals at will. You can get it here.

Proposal Generation Process

Start by running Extract_Word.py to generate your product feature points as a Word document, then run Generate.py. If a feature point is missing, you can enter “X” in the Excel file, and the model will generate content that meets the requirements. However, I strongly recommend a manual review. After generation, you’ll see the proposal formatted with a table of contents, body text, images, tables, and bullet points, all automatically organized. The entire proposal is structured according to the client’s requirements in sequential format, with subheadings, content, images, and point-to-point responses. This takes care of all the repetitive work. The automatically generated deviation table includes everything, whether the content was generated by the model or not, along with the corresponding section numbers. You only need to finalize the deviation table with “&” symbols based on the final requirements — the model has written it all for you. For me, this tool has reduced what used to be 8 hours of work to around 30 minutes, and even our business team can generate the business proposal sections using the template. In total, this has cut down our time for a week-long proposal to 1-2 days, reducing the manpower required by 80%.

How to Use It?

1. Download all the code to a single directory from GitHub: Proposal Large Model (Chinese version).
2. Install the Python environment and the required packages: pip install openpyxl python-docx openai requests.
3. Apply for a ChatGPT or Baidu Qianfan large model key (I used ERNIE-Speed-8K, which is free), record the token, and place it in the relevant code section.
4. Copy your product manual to Template.docx. Be sure to use the body text, Title 1, Title 2, and Title 3 styles provided; other formats may cause issues.
5. Run Extract_Word.py to generate the feature point document from the product manual (it supports up to 3 heading levels). If the list formatting appears off, don’t worry; the final format will align properly.
6. Fill in Columns B and C (which will generate secondary and tertiary headings automatically) and Column G (the corresponding product manual chapter). If a chapter is missing, enter “X.” Note that if there is no corresponding chapter or an “X” is entered, the model will generate content automatically.
7. Review the “proposal_content.docx” document and keep the chapter for which you want to start generating the proposal.
8. You can modify the body text and Heading 1, 2, and 3 styles; just don’t rename the styles, or there may be errors.
9. Adjust the parameters in Generate.py:
   - API_KEY and SECRET_KEY: Baidu Cloud large model keys.
   - MAX_WIDTH_CM: Maximum image width; images larger than this will be resized automatically.
   - The prompts for generating content have been customized for large data scenarios, but feel free to modify them.
   - MoreSection=1 reads Column C to generate detailed tertiary headings (default is on).
   - ReGenerateText: when enabled, re-generates text content automatically for different proposal needs (default is off).
   - DDDAnswer=1 generates the point-to-point response content at the top of each feature point (default is on).
   - key_flag=1 adds the importance level of each requirement to the proposal headings (default is on).
   - last_heading_1=2 specifies the starting chapter for the technical solution in “proposal_content.docx.”
10. Run Generate.py.

Summary

With this tool, you only need to check the product features against the proposal requirements, and most of the proposal content is generated automatically. The business proposal section can also be generated similarly, so creating a 1,000-page proposal now only takes a few minutes.
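As a rough illustration of the Word-formatting step the tool automates, here is a minimal python-docx sketch. It is not part of the tool itself: the heading text is invented, and python-docx ships with English built-in style names such as "Heading 1", whereas the tool's template may use different style names.

Python
# Minimal sketch, not part of the tool: apply a Word template's heading and body
# styles with python-docx. "Template.docx" follows the convention above; the
# heading text and style names here are illustrative.
from docx import Document

doc = Document("Template.docx")  # open the template so its styles are available
doc.add_paragraph("3 Technical Solution", style="Heading 1")
doc.add_paragraph("3.1 Data Integration", style="Heading 2")
doc.add_paragraph(
    "Answer: Fully supports. The product provides the capability described above ...",
    style="Normal",
)
doc.save("proposal_content.docx")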
With the advancements in artificial intelligence (AI), models are getting increasingly complex, resulting in increased size and latency and, in turn, difficulties in shipping models to production. Maintaining a balance between performance and efficiency is often a challenging task: the faster and more lightweight you make your models, the more readily they can be deployed to production. Training models on massive datasets with over a billion parameters results in high latency and is impractical for real-world use. In this article, we will delve into techniques that can help make your model more efficient. These methods focus on reducing a model’s size and latency and making it ready for deployment without any significant degradation in performance.

1. Pruning

The first method we will discuss is model pruning. More often than not, deep learning models are trained on extensive datasets, and as the neural networks keep getting trained, there are connections within the network that are not significant enough for the result. Model pruning is a technique to reduce the size of the neural network by removing such less important connections. Doing this results in a sparse matrix, i.e., certain matrix values are set to 0. Model pruning helps not only in reducing the size of the model but also in reducing inference times. Pruning can be broadly classified into two types:

- Structured pruning: In this method, we remove entire groups of weights from the neural network for acceleration and size reduction. The weights are removed based on their L-n norm or at random.
- Unstructured pruning: In this method, we remove individual weight connections. We zero out the units in a tensor with the lowest L-n norm, or even at random.

Additionally, we also have magnitude pruning, wherein we remove a percentage of weights with the smallest absolute values. To get an ideal balance between performance and efficiency, we often follow a strategy called iterative pruning, as shown in the figure below. It's important to note that sparse matrix multiplication algorithms are critical in order to maximize the benefits of pruning.

2. Quantization

Another method for model optimization is quantization. Deep learning neural networks often comprise billions of parameters, and by default, in machine learning frameworks such as PyTorch, these parameters are all stored in 32-bit floating-point precision, which leads to increased memory consumption and increased latency. Quantization is a method that lowers the precision of these parameters to fewer bits, such as 16-bit floating point or 8-bit integers. Doing this reduces the computational cost and memory footprint of the model, as an 8-bit integer takes a quarter of the space of an FP32 value. We can broadly classify quantization as follows:

- Binary quantization: By representing weights and activations as binary numbers (that is, -1 or 1), you can significantly reduce both the memory and the computation required.
- Fixed-point quantization: Decrease numerical precision to a predetermined bit count, such as 8-bit or 16-bit, facilitating efficient storage and processing at the expense of some degree of numerical accuracy.
- Dynamic quantization: Modify numerical precision in real time during inference to balance model size and computing accuracy.

(Source: Qualcomm)
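To make the pruning and quantization ideas above concrete, here is a minimal PyTorch sketch. It assumes PyTorch is installed; the toy layer sizes and the 30% pruning ratio are illustrative choices, not recommendations from the article.

Python
# Minimal sketch: unstructured magnitude (L1) pruning followed by dynamic
# quantization on a toy model. Layer sizes and the pruning amount are illustrative.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 30% of weights with the smallest absolute values in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Store Linear weights as 8-bit integers; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)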
3. Knowledge Distillation

In the domain of model optimization, another effective methodology is knowledge distillation. The basic idea behind knowledge distillation is how a student learns from a teacher. We have an original pre-trained model that has the entire set of parameters; this is known as the teacher model. Then, we have a student model that learns directly from the teacher model’s outputs rather than from any labeled data. This allows the student model to learn much faster, as it learns from a probability distribution over all possible labels known as soft targets. In knowledge distillation, the student model need not have the entire set of layers or parameters, making it significantly smaller and faster while providing performance similar to the teacher model. KD has been shown to reduce model size by 40% while maintaining ~97% of the teacher model’s performance.

Implementing knowledge distillation can be resource-intensive. Training a student model for a complex network such as BERT typically takes 700 GPU hours, whereas training it from scratch, like the teacher model, would take around 2,400 GPU hours. However, given the performance retained by the student model and the efficiency gains, knowledge distillation is a sought-after method for optimizing large models.

Conclusion

The advancement of deep neural networks has resulted in heightened complexity of the models employed in deep learning. Models may now possess millions or even billions of parameters, necessitating substantial computational resources for training and inference. Model optimization techniques aim to reduce the computational requirements of complex models while enhancing their overall efficiency. Numerous applications, especially those deployed on edge devices, have limited access to computational resources such as memory, computing power, and energy. Optimizing models for these resource-constrained environments is essential for facilitating efficient deployment and real-time inference. Pruning, quantization, and knowledge distillation are some of the model optimization methods that can help achieve this.
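As a supplement to the knowledge distillation discussion above, here is a hedged sketch of a typical soft-target distillation loss; it is not taken from the article, and the temperature and weighting values are illustrative. The student is trained against a blend of the teacher's softened probabilities and the ordinary hard-label loss.

Python
# Minimal sketch of a standard distillation objective: KL divergence between the
# teacher's and student's temperature-softened distributions, mixed with the usual
# cross-entropy on the true labels. T and alpha are illustrative hyperparameters.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term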
In the IoT world, security is one of the biggest challenges. When you connect multiple devices over a network, doors are left ajar to security threats, and every additional data transmission adds more of those doors. Yet data transmissions are an integral part of IoT, because they allow devices to share various types of data among themselves and transmit it to other devices, including notifications and media files. This ability is essential for IoT ecosystems, in which devices need to communicate efficiently to perform complex tasks. However, access to the data channel must be both restricted and encrypted to maintain security.

WebRTC is one approach to establishing secure data channels over an IoT network. WebRTC establishes direct peer-to-peer connections, allowing data to flow directly between devices instead of through a separate server. The basic security consists of three mandatory WebRTC encryption protocols: secure real-time protocol (SRTP), secure encryption key exchange, and secure signaling. These protocols encrypt the data sent through WebRTC, protect the encryption keys, and secure the web server connection. Here, we’ll explain further how WebRTC security works to protect your IoT network.

A Look at SRTP for WebRTC Security

One of the primary concerns in IoT security is the potential for data interception. WebRTC mitigates this risk with secure real-time protocol (SRTP), which encrypts media streams and data packets during transfer. These protocols are widely used in systems such as video surveillance, smart home devices, healthcare IoT, industrial IoT, and connected vehicles, making them essential for securing real-time data transfer across various IoT applications. SRTP builds on the basic real-time protocol (RTP) by adding encryption and authentication layers. Each data packet is encrypted using a unique key shared exclusively between communicating devices. This ensures that even if a packet is intercepted, its content cannot be accessed without the decryption key. WebRTC achieves secure key exchange through DTLS-SRTP, which integrates datagram transport layer security with SRTP to establish secure connections. In addition to encryption, SRTP includes mechanisms for data integrity verification. Every packet has an authentication tag, a digital signature that confirms it has not been tampered with during transmission. If a packet’s tag fails verification, it is discarded, protecting communication from interference.

Encryption Key Exchange

While SRTP encrypts the data itself, WebRTC employs secure encryption key exchange mechanisms to protect the keys that control access to data streams. These keys, often referred to as session keys, are unique, temporary codes used to encrypt and decrypt the data exchanged between devices. Without these keys, intercepted data cannot be read or modified. Key exchange begins with a DTLS “handshake,” a process that verifies the identities of communicating devices and securely transfers encryption keys. This step ensures that only authenticated devices can participate in the communication. Essentially, Datagram Transport Layer Security (DTLS) plays a critical role in WebRTC by confirming the credentials of both the sender and receiver (similar to verifying IDs) to ensure all participants in the media stream are who they claim to be. A crucial part of this process involves the exchange and validation of certificate fingerprints. WebRTC provides a mechanism to generate fingerprints of certificates, which act as unique identifiers for each device in the connection.
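To illustrate what such a fingerprint looks like, here is a hedged Python sketch using the third-party cryptography package. The self-signed certificate is only a stand-in for the certificate a real WebRTC stack generates automatically; the sketch simply hashes the certificate and formats the digest the way WebRTC advertises it in SDP (a=fingerprint:sha-256 AB:CD:...).

Python
# Hedged sketch: compute a SHA-256 certificate fingerprint in the colon-separated
# form WebRTC advertises in SDP. The self-signed certificate below is only a
# stand-in for the one a real WebRTC stack would generate.
import datetime
import hashlib
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

key = ec.generate_private_key(ec.SECP256R1())
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "iot-peer")])
cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=30))
    .sign(key, hashes.SHA256())
)

der = cert.public_bytes(serialization.Encoding.DER)
digest = hashlib.sha256(der).hexdigest().upper()
print("a=fingerprint:sha-256 " + ":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))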
Secure Signaling

In WebRTC, signaling — the process that helps establish a peer-to-peer connection — is a crucial security component. Signaling mechanisms are used to set up, modify, and terminate WebRTC connections. Although WebRTC doesn’t define a specific signaling protocol, developers typically rely on secure channels (like HTTPS or WebSockets) to manage signaling messages. To understand the differences between SRTP, secure encryption key exchange, and secure signaling, think of them as three roles in building a secure house:

- SRTP: SRTP is like the lock on the doors and windows in WebRTC security. It ensures that once people (here, data) are inside the house, they are safe and cannot be accessed by unauthorized individuals. It encrypts media streams (audio, video, or data packets) and ensures they remain private and untampered during transmission.
- Encryption key exchange: This is like the locksmith who provides and secures the keys to the locks. DTLS verifies the identities of the participants (like showing ID to ensure you’re the homeowner) and securely delivers the session keys that control access to the encrypted data.
- Secure signaling: Secure signaling is like the blueprint and construction crew that set up the house and its security features. Signaling manages the negotiation of how the connection will function — determining the structure (e.g., codecs, ICE candidates, and connection parameters) while ensuring the plans (signaling messages) are not intercepted or altered during setup.

So, while SRTP and DTLS focus on protecting the data itself and the keys that enable encryption, secure signaling ensures that the initial connection setup process remains private and free from interference. By securing the signaling messages, WebRTC prevents attackers from tampering with the connection parameters or hijacking the session during its setup phase.

Additional WebRTC Security Considerations

While SRTP, encryption key exchange, and secure signaling are foundational to WebRTC security, several other safeguards ensure that WebRTC operates within a robust security framework.

- Browser trust and security updates: Since WebRTC is a browser-based technology, security depends heavily on the browser’s integrity and update cycle. Trusted browsers like Chrome and Firefox automatically receive security patches, reducing the likelihood of vulnerabilities. However, downloading from a trusted source is critical; a compromised browser could weaken WebRTC’s security.
- User permissions and access control: WebRTC requires explicit user permission to access local resources like cameras and microphones. This permission-based access prevents unauthorized apps from using a device’s hardware and informs users when an application is accessing these resources.
- TURN servers and data routing: When direct peer-to-peer connections are not possible, WebRTC falls back on TURN servers, which relay data but cannot access its content due to encryption. This fallback option ensures secure communication even in network-restricted environments.

Final Thoughts

While WebRTC provides robust security features, its effectiveness depends heavily on how it is implemented in applications. The protocols discussed earlier — SRTP for encrypting data streams, DTLS for secure key exchange, and secure signaling for safeguarding the connection setup — form a strong foundation.
However, if developers cut corners or mismanage these elements, the data channel can still be left vulnerable to attack. For example, using insecure signaling mechanisms, such as unencrypted HTTP instead of HTTPS or WebSockets, undermines the secure signaling process and exposes the connection setup to interception. Similarly, failing to implement proper DTLS key exchange protocols or neglecting to update SRTP configurations with the latest security standards can compromise the integrity of the encrypted data streams. By adhering to WebRTC security best practices — ensuring secure signaling channels, maintaining updated encryption standards, and leveraging the inherent strengths of SRTP and DTLS — IoT developers can create applications that are both highly functional and secure. These measures are critical to protecting sensitive data and ensuring the reliability of IoT ecosystems in a world where security threats continue to evolve.
Think back to those days when you met the love of your life. The feeling was mutual. The world seemed like a better place, and you were on an exciting journey with your significant other. You were both “all-in” as you made plans for a life together. Life was amazing... until it wasn’t. When things don’t work out as planned, then you’ve got to do the hard work of unwinding the relationship. Communicating with each other and with others. Sorting out shared purchases. Moving on. Bleh. Believe it or not, our relationship with technology isn’t all that different.

Breaking Up With a Service

There was a time when you decided to adopt a service — maybe it was a SaaS, or a PaaS, or something more generic. Back in the day, did you make the decision while also considering the time when you would no longer use the service? Probably not. You were just thinking of all the wonderful possibilities for the future. But what happens when going with that service is no longer in your best interest? Now, you’re in for a challenge, and it’s called service abdication. While services can be shut down with a reasonable amount of effort, getting the underlying data can be problematic. This often depends on the kind of service and the volume of data owned by that service provider. Sometimes, the ideal unwinding looks like this: stop paying for the service, but retain access to the data source for some period of time. Is this even a possibility? Yes, it is!

The Power of VPC Peering

Leading cloud providers have embraced the virtual private cloud (VPC) network as the de facto approach to establishing connectivity between resources. For example, an EC2 instance on AWS can access a data source using VPCs and VPC end-point services. Think of it as a point-to-point connection. VPCs allow us to grant access to other resources in the same cloud provider, but we can also use them to grant access to external services. Consider a service that was recently abdicated but with the original data source left in place. Here’s how it might look:

This concept is called VPC peering, and it allows for a private connection to be established from another network.

A Service Migration Example

Let’s consider a more concrete example. In your organization, a business decision was made to streamline how it operates in the cloud. While continuing to leverage some AWS services, your organization wanted to optimize how it builds, deploys, and manages its applications by terminating a third-party, cloud-based service running on AWS. They ran the numbers and concluded that internal software engineers could stand up and support a new auto-scaled service on Heroku for a fraction of the cost that they had been paying the third-party provider. However, because of a long tenure with the service provider, migrating the data source is not an option anytime soon. You don’t want the service, and you can’t move the data, but you still want access to the data. Fortunately, the provider has agreed to a new contract to continue hosting the data and provide access — via VPC peering. Here’s how the new arrangement would look:

VPC Peering With Heroku

In order for your new service (a Heroku app) to access the original data source in AWS, you’ll first need to run your app within a Private Space. For more information, you can read about secure cloud adoption and my discovery of Heroku Private Spaces.
Next, you’ll need to meet the following simple network requirements:

- The VPC must use a compatible IPv4 CIDR block in its network configuration.
- The VPC must use an RFC1918 CIDR block (10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16).
- The VPC’s CIDR block must not overlap with the CIDR ranges for your Private Space. The default ranges are 10.0.0.0/16, 10.1.0.0/16, and 172.17.0.0/16.

With your Private Space up and running, you’ll need to retrieve its peering information:

Shell
$ heroku spaces:peering:info our-new-app
=== our-new-app Peering Info
AWS Account ID:    647xxxxxx317
AWS Region:        us-east-1
AWS VPC ID:        vpc-e285ab73
AWS VPC CIDR:      10.0.0.0/16
Space CIDRs:       10.0.128.0/20, 10.0.144.0/20, 10.0.0.0/20, 10.0.16.0/20
Unavailable CIDRs: 10.1.0.0/16

Copy down the AWS Account ID (647xxxxxx317) and AWS VPC ID (vpc-e285ab73). You’ll need to give that information to the third-party provider who controls the AWS data source. From there, they can use either the AWS Console or CLI to create a peering connection. Their operation would look something like this:

Shell
$ aws ec2 create-vpc-peering-connection \
    --vpc-id vpc-e527bb17 \
    --peer-vpc-id vpc-e285ab73 \
    --peer-owner-id 647xxxxxx317
{
    "VpcPeeringConnection": {
        "Status": {
            "Message": "Initiating Request to 647xxxxxx317",
            "Code": "initiating-request"
        },
        "Tags": [],
        "RequesterVpcInfo": {
            "OwnerId": "714xxxxxx214",
            "VpcId": "vpc-e527bb17",
            "CidrBlock": "10.100.0.0/16"
        },
        "VpcPeeringConnectionId": "pcx-123abc456",
        "ExpirationTime": "2025-04-23T22:05:27.000Z",
        "AccepterVpcInfo": {
            "OwnerId": "647xxxxxx317",
            "VpcId": "vpc-e285ab73"
        }
    }
}

This creates a request to peer. Once the provider has done this, you can view the pending request on the Heroku side:

Shell
$ heroku spaces:peerings our-new-app

In the screenshot below, we can see the pending-acceptance status for the peering connection. From here, you can accept the peering connection request:

Shell
$ heroku spaces:peerings:accept pcx-123abc456 --space our-new-app
Accepting and configuring peering connection pcx-123abc456

We check the request status a second time:

Shell
$ heroku spaces:peerings our-new-app

We see that the peer connection is active. At this point, the app running in our Heroku Private Space will be able to access the AWS data source without any issues.

Conclusion

An unfortunate truth in life is that relationships can be unsuccessful just as often as they can be long-lasting. This applies to people, and it applies to technology. When it comes to technology decisions, sometimes changing situations and needs drive us to move in different directions. Sometimes, things just don’t work out. And in these situations, the biggest challenge is often unwinding an existing implementation — without losing access to persistent data. Fortunately, Heroku provides a solution for slowly migrating away from existing cloud-based solutions while retaining access to externally hosted data. Its easy integration for VPC peering with AWS lets you access resources that still need to live in the legacy implementation, even if the rest of you have moved on. Taking this approach will allow your new service to thrive without an interruption in service to the consumer.
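As a small aid for the network requirements listed earlier in this article, here is a hedged Python sketch using only the standard library. The CIDR values are the defaults quoted above, and the example inputs are made up.

Python
# Hedged sketch: check a candidate VPC CIDR against the RFC1918 ranges and the
# default Heroku Private Space CIDRs mentioned above. Example inputs are made up.
import ipaddress

PRIVATE_SPACE_CIDRS = ["10.0.0.0/16", "10.1.0.0/16", "172.17.0.0/16"]
RFC1918 = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

def vpc_cidr_is_compatible(vpc_cidr):
    vpc = ipaddress.ip_network(vpc_cidr)
    in_rfc1918 = any(vpc.subnet_of(ipaddress.ip_network(b)) for b in RFC1918)
    overlaps_space = any(vpc.overlaps(ipaddress.ip_network(b)) for b in PRIVATE_SPACE_CIDRS)
    return in_rfc1918 and not overlaps_space

print(vpc_cidr_is_compatible("10.100.0.0/16"))  # True: private and non-overlapping
print(vpc_cidr_is_compatible("10.0.0.0/16"))    # False: collides with a Space default range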
It is a continuous effort to stay competent with technological advances and stay current. Hence, there is a need for companies to continually evaluate what is obsolete or inefficient and adapt to new tools and approaches. As data grows over time, operations begin to slow down; and with changing security standards and an increasing number of creative security risks, an application may become vulnerable or cost more to keep running. Such situations call for an upgrade or replacement of inefficient applications. Choosing more modern, scalable, secure alternatives will increase performance and better fulfill consumers' demands. This article looks at application retirement from a data perspective: best practices, challenges, and how to manage and leverage data during this process.

Understanding Application Retirement

Application Retirement and Its Importance

Application retirement is the process of shutting down outdated, redundant, or legacy software or business applications while ensuring that critical data is preserved, accessible, and compliant with regulatory standards. Retiring legacy applications helps reduce operational costs by removing the need to maintain and support applications that are no longer required, redundant, or operationally vulnerable. Next, it improves security by phasing out applications that no longer receive security updates, making them less exposed to security threats and more compliant. Furthermore, it enables IT teams to focus on modernizations and innovations that add value to the company and stay current in the market.

Drivers for Application Retirement

When we start thinking about retiring an application, a level of discovery must be done and key areas need to be assessed. Below are some of the key drivers of application retirement.

Technological Advancements

New technologies keep evolving and facilitate better performance and productivity. Legacy applications become incompatible with them and end up creating silos, which in turn slows down innovation. Legacy applications frequently struggle to integrate with modern tools, cloud platforms, and APIs. Hence, switching to modern, scalable, and more robust systems will better support business operations.

Cost Reduction

Oftentimes, maintaining legacy systems requires specialized skills along with non-negotiable, high licensing fees. In addition, outdated software and hardware lead to rising operational costs. A good way to work around this problem is to switch from legacy systems to modern systems, which enables the company to save on operational costs, including maintenance, service repairs, and labor.

Compliance and Security Risks

In some cases, older applications do not receive any future security updates or patches, which leaves them vulnerable to cyber-attacks and, in other cases, exposes the company to penalties for non-compliance, loss of data, or even reputational damage.

Upgrading to an ERP Solution

Many organizations are opting for pre-built solutions that are ready to deploy because they are quick and require minimal configuration. Moreover, off-the-shelf ERP solutions are reliable, stable, cost-effective, and need less maintenance. Upgrading to readily available tools helps organizations deploy applications quickly without worrying about future releases or staying current with the technology.
Performance Issues

When a legacy system is challenged with poor performance, it leads to a less effective user experience. For example, think of a website with very slow response times and security vulnerabilities. Replacing such systems with the latest solutions would enhance overall performance.

Redundant Applications

It is not uncommon for organizations to accumulate multiple systems performing similar functions over a period of time. Discovery and assessment of existing applications, and laying out the purpose of each tool, will help identify overlaps. Retiring duplicate applications or consolidating them into fewer applications will optimize the system architecture, thereby reducing maintenance costs.

Digital Transformation

In a fast-evolving digital world, keeping up with technology is not only needed but required to thrive in a rapidly changing market. Digital transformation is the path many organizations are opting for to align technology with their business goals. Digital transformation empowers businesses to improve efficiency, enhance customer experiences, and stay competitive in the digital world.

Preparing for Application Retirement

In most cases, decommissioning an application can be a long and complex process, simply because it requires thorough planning, collaboration, and consideration for everyone impacted — whether it’s employees, customers, or other stakeholders. Further, the application can be retired only after a replacement or alternate solution is found.

Identifying Candidates for Retirement

As discussed above, there could be several drivers for retiring an application; however, identifying candidates involves a step-by-step process of thorough analysis. Factors like technical debt, business value, usage, age, maintenance costs, potential security risks, redundancy with other applications, and whether a suitable replacement exists influence the decision. If multiple applications are identified, a carefully curated roadmap has to be prepared so the transition can happen smoothly. To gather valuable insights and make informed decisions, companies conduct surveys to gather as much information as possible from customers and business stakeholders. Some of the key aspects that help determine the value and impact are:

- Purpose of the legacy application
- Utilization of the application by users or shadow systems
- Technical health of the application
- Age of the application
- Cost of maintenance, hardware, and licensing (and any other operational costs)
- Any redundant applications with overlapping functionality
- Estimated cost and time to implement and adapt to a new system
- Feasibility and cost of the new implementation

An application retirement follows one of these three strategies:

- Sunsetting: A phased approach where the legacy system is gradually sunset after informing users of a planned end-of-life date. The old functionality is replaced with workarounds or alternate solutions.
- Decommissioning: Switching to a new solution altogether by removing the legacy solution from production completely.
- Consolidation: Combining the functionalities of multiple applications into a single, streamlined system.

Establishing a Retirement Roadmap

Regardless of the retirement strategy chosen, a well-defined retirement roadmap is critical for a successful application retirement process.
This roadmap should outline, but is not limited to, the following:

- Timeline
- Milestones
- Resources required for the project and the cost
- Potential risks and their impact
- Mitigation plans to address any unforeseen challenges

Steps Involved in Retiring Applications

1. Evaluating the Purpose and Impact on Business Operations

Applications are built to support a business use case or the business operations of a company, so evaluating the impact on the business is a key step in the retirement process. Impact analysis uncovers the end-to-end workflows, identifies the dependencies, and discovers any shadow systems. Finally, it aids in developing a mitigation plan to ensure a smooth transition.

2. Engagement

Like in any project, effective communication and collaboration with stakeholders is vital throughout the application retirement process. To get an in-depth understanding of application usage by different types of users, active engagement with stakeholders, internal/external partners, IT teams, business units, compliance officers, and sometimes sponsors is required. It gives them an opportunity to express their concerns and requirements early in the process, enabling the project team to address them so the teams can stay aligned on the project goal.

3. Communication Strategies

Understandably, change is often hard, yet it remains a constant. To mitigate resistance to change and maintain transparency, clear and frequent communication is encouraged to keep all parties informed about the progress and set expectations. Additionally, providing training and support to users transitioning to new systems ensures a smoother adoption process.

4. Data Migration and Archival

When transitioning to a brand-new system, there might not be a lot of flexibility to import or convert everything in the legacy system into the new system. For example, assume you have a legacy system that is 50 years old. You do not necessarily need all 50 years of data to be migrated to the new system. There might be only a subset of data that is essential or critical and should be transferred to the new system, and there might be another subset of data that should be preserved or archived for historical purposes. In this example, the former subset follows the migration strategy, and the latter follows the data archival strategy. For this, we first need to identify the data assets that fall under each strategy and then classify them based on their importance and sensitivity (e.g., personal data, financial records) to choose the right tool that is compliant.

Strategies for Data Migration and Archiving

A data migration strategy involves exercises such as data mapping and Extraction, Transformation, and Loading (ETL), and this cycle may need to be repeated until all the data is converted and migrated successfully into the new location. Data archival solutions for storing historical data should support long-term storage, searchability, and data encryption to protect the data stored in them. Whether it is data migration or archiving, rigorous validation should be done to ensure the correctness and completeness of the data. Strategies like comparing pre- and post-migration data sets can be employed to identify inconsistencies in the data and take steps toward correcting them.

5. Compliance and Security

Regulations such as GDPR, HIPAA, and other standards provide guidelines on how to handle, store, and dispose of data, especially when it comes to sensitive or personal information.
Knowing these rules will help you avoid fines and legal trouble.

Compliance During and After Retirement

To comply, you need to have clear protocols for data handling during the retirement process. This includes:

- Documenting data retention policies to decide what data to keep and for how long
- Using secure data deletion methods for data that’s no longer needed so it can’t be recovered
- Keeping detailed records of the retirement process, including data migration, archival, and deletion activities

6. Data Security During Migration and Archival

Data security is key during the application retirement process, especially during data migration and archival. Some aspects to consider in keeping data secure are as follows:

- Using encryption to protect data in transit and at rest
- Access controls — who can view or modify data
- Monitoring to detect issues or vulnerabilities

7. Addressing Potential Security Vulnerabilities

Legacy applications often have inherent security weaknesses that can be exploited during the retirement process. Conducting a thorough security assessment can help identify these vulnerabilities. Measures to address them may include:

- Applying the latest security patches and updates to the retiring application before migration
- Using secure data transfer methods to prevent interception or tampering
- Implementing a comprehensive incident response plan to address any security incidents that arise during the retirement process

Tools and Technologies for Application Retirement

Retirement Management Tools

Several platforms are available to facilitate the entire application retirement lifecycle. Using dedicated tools is usually more beneficial than executing each step of the retirement process manually, resulting in fewer errors and more efficiency. Some of the popular application retirement platforms are as follows:

IBM InfoSphere Optim

This IBM product is specifically designed for data archiving and management, which is crucial for retiring applications by preserving necessary historical data, and it uses proprietary compression technology to save space for high volumes of data.

Mimecast

Mimecast offers a wide array of data sources to import from as well as flexibility in the data types that can be imported and how they are stored and retrieved — well suited if the archived data has frequent retrieval and reporting requirements.

Bloomberg Vault

Apart from its wide assortment of data archival features, it also comes with regulation-oriented features like consolidated compliance, legal search, and retention management. It is highly scalable due to being cloud-based and has custom features catering to the financial industry.

Barracuda Message Archiver

This product excels at archiving messages. It not only offers strong email archiving capabilities but also excels in features like legal hold, retention, and eDiscovery.

Selection Criteria for Choosing the Right Tool

When it comes to selecting the right tool for data archival, one needs to be mindful of a few basic selection criteria:

Compatibility

The source data (which must be archived) can vary from case to case. Not all data archive tools support all data types or file formats. For instance, if the data to be decommissioned comes from a mainframe, IBM InfoSphere Optim may be a good choice, as it supports mainframe compatibility.
Scalability

If the requirement demands archiving data constantly over an extended period at a variable pace, and there is a chance of a surge in the volume of data to be archived, using a public cloud-based archival system like Google Vault is recommended.

Compliance Features

Although most archival tools offer some sort of compliance features, if the requirement is to enforce complex regulatory requirements specific to certain geographies, audit trails, or highly customized data retention policies, then choose the tool that allows you to implement the specific policy. For instance, Mimecast has a wide assortment of options for various compliance needs and may be a good fit for most cases.

Ease of Use

In most cases, there is a requirement for ease of use, where the users implementing data archival solutions are not too technical; all the public cloud-based data archival solutions offer decent ease of use. Usually, one needs to be mindful of the tradeoff between ease of use and advanced features tailored to your needs.

Challenges and Mitigation Strategies

Common Challenges

Data Loss and Integrity Issues

Data loss can occur when the archived data gets moved from the source to the destination systems. Some of the main reasons for data loss and integrity issues are incompatibility of data types, incomplete migrations, or incorrect mapping between source and destination systems. This can potentially lead to data volume differences in the archival systems, or data formats that make the data unusable for retrieval or non-compliant with regulations.

Stakeholder Resistance

Often owing to the fear of losing a system they have been comfortable working with over the years, or insecurity over the loss of data-querying ability, many business stakeholders resist data migration initiatives. This leads to user groups becoming reluctant to adopt new systems and archive existing ones, resulting in operational inefficiency and decreased productivity.

Regulatory and Compliance Risks

When the archived data moves from the source to the destination systems, there is often a risk of non-adherence to regulatory and compliance standards. The archived data may fail to meet the regulatory checks specific to the destination system and may not adhere to local compliance guidelines if these are not considered prior to the archival process.

Mitigation Strategies for the Common Challenges

Comprehensive Data Validation

Rigorous data validation strategies need to be implemented to mitigate data loss and integrity issues. Commonly employed methods include conducting extensive pre- and post-archival testing and comparing basic statistics, using automated validation scripts and checksums to ensure data consistency, and regularly auditing data to identify and correct potential issues before they cause problems.

Early Engagement of Stakeholders

Early engagement is the key to avoiding stakeholder resistance even before data archival initiatives begin. Communicating early and clearly with stakeholders during the archival process can allay the fears and apprehensions they might have regarding the process. Conducting training sessions and supporting them with their queries further cements their confidence in the archival process. It is recommended to involve key users from the stakeholder group in the strategic phase of the archival process to foster ownership and buy-in.
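To make the pre- and post-archival validation idea above a bit more concrete, here is a hedged Python sketch; the file names and CSV layout are hypothetical. It compares record counts and an order-independent checksum between a source extract and its archived copy.

Python
# Hedged sketch of pre-/post-archival validation: compare row counts and an
# order-independent checksum between a source extract and the archived copy.
# File names and the CSV layout are hypothetical.
import csv
import hashlib

def table_fingerprint(path):
    """Return (row_count, checksum) for a CSV extract, independent of row order."""
    digest, count = 0, 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            count += 1
            digest ^= int(hashlib.sha256(",".join(row).encode()).hexdigest(), 16)
    return count, format(digest, "064x")

source = table_fingerprint("sales_source_extract.csv")
archive = table_fingerprint("sales_archived_extract.csv")
print("counts match:   ", source[0] == archive[0])
print("checksums match:", source[1] == archive[1])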
Ensuring Compliance

To ensure adherence to compliance, there needs to be a clear understanding of all the compliance requirements and governance frameworks right from the beginning of the archival process. For instance, if the archived data is being moved to Europe, the archival process should incorporate processes to adhere to the General Data Protection Regulation (GDPR), which is mandatory in the EU. Furthermore, it is recommended to keep detailed before-and-after record counts, other key figure snapshots, and audit trails. Implementing disposal and data retention policies on data objects is another way to meet compliance standards. Apart from the aforementioned, choosing the right tools can ease the retirement and archival process. Tools that have decent migration features and offer data management compliance features often significantly mitigate common data migration challenges.

Case Study: Archiving Sales Data for a Wholesale Pharma Distributor

Background

A leading wholesale pharma distributor has sales data spanning several decades. Due to compliance regulations, the organization needs to retain only 10 years of data and decommission and archive the rest. The primary challenge was to securely archive the data older than 10 years while keeping it accessible for business intelligence reporting.

Solution

As this historical data originated from an IBM mainframe, the data type constraint meant that only a few of the existing data archival tools were in consideration. IBM InfoSphere Optim was finally chosen, as it fit the compliance and business intelligence requirements and no scalability requirements were foreseen over the lifecycle of the archived data. The implementation involved several key steps:

- Data assessment and classification: The sales data was classified based on business value, relevance to end-user reporting, and scope for applying retention policies. Furthermore, the data that qualified for retention was sub-classified by regulatory policy and historical value.
- Secure data migration: The archival tool was employed to convert the data from the source format to the archived format and then migrate it to the new migration vaults for safe storage. Retention and regulatory hold policies were applied to individual data objects so that, after each year, older data gets deleted on its own, while ensuring that certain perpetual data objects are never deleted. Checksums and record counts were taken pre- and post-migration and compared to ensure there was no loss of data. An audit trail was retained for compliance.
- Stakeholder engagement: Key business stakeholders were engaged throughout the process. People who depended on this data for analytics reporting were familiarized with the revised process of reporting from the archives. Compliance teams were shown how to verify whether retention policies were working.

The implementation resulted in several benefits for the pharma distributor.
- Enhanced compliance: This effort ensured that the organization stayed within the compliance mandate while still providing easy accessibility of the data to the stakeholders.
- Improved cost savings: Decommissioning the old application and the associated data cut the storage and maintenance costs of those applications by 30%, translating to several thousand dollars in annual savings.
- Improved data accessibility: Despite being archived, the portion of the data that has business value remained accessible to users via business intelligence applications and historical analysis initiatives.

This case study is a testimony to how a wholesale drug distributor effectively managed the decommissioning and archival of an astonishing amount of historical sales data using IBM InfoSphere Optim, staying within compliance and regulatory guardrails while still delivering data accessibility to stakeholders and securing a healthy cost saving.

Future Trends in Application Retirement

AI-Driven Data Management

Application retirement and data archiving are set to be revolutionized by automation in data management using AI. AI-based tools can automatically identify and classify data more precisely, which enables the preservation of relevant data. This automation reduces human intervention, which leads to better data quality and a more streamlined, cost-effective retirement process.

Automation in Migration and Archiving

Automation technologies are playing an important part in data migration and archiving. Automated workflows can handle repetitive tasks in the data engineering and integration domain, specifically in Extraction, Transformation, and Loading (ETL), with increased speed and precision. They also allow for the integration of a wider array of data sources.

Scalability With Cloud-Based Archival Solutions

Cloud-based data archival applications have the inherent feature of scalability, which is particularly useful when the data archival process is likely to continue for a long duration and unexpected volume surges may occur. Organizations in such scenarios do not need to plan storage provisioning in advance and avoid hefty upfront storage investments for archived data.

Enhanced Security and Compliance in Cloud Archival Solutions

Most cloud archival solutions have built-in advanced security features like encryption and multi-factor authentication, which seamlessly ensure the security of the archived data during migration and throughout its lifecycle. Additionally, they offer built-in compliance tools that help users adhere to regulatory guardrails more effectively.

Leveraging Historical Data

As the impetus to incorporate machine learning models in the enterprise grows, so grows the need to train those models with historical data. Archived data is a good source not only for business intelligence reporting but also for training machine learning models. By leveraging historical data, organizations can gain valuable insights into patterns, trends, and overall customer behavior. It is anticipated that leveraging archived data for business intelligence and machine learning will be an integral component of the archived data lifecycle.
Conclusion

As enterprises look to modernize their applications and upgrade to newer infrastructure, generally to gain new features, reduce operational costs, or simply remain compliant, there arises a need to decommission existing applications and archive the associated data. This helps organizations seamlessly transition to modern, more efficient solutions while still retaining the capability to look back at their historical archived data whenever they need to. This capability avoids the risks of data loss, regulatory non-compliance, and security vulnerabilities while maintaining operational efficiency.

Key Strategies for Success

There are several key strategies that can ensure a successful data archival endeavor. To begin with, comprehensive data validation before and after the data is archived ensures data integrity and quality, while engaging stakeholders early in the archival process ensures smoother participation. Lastly, implementing a reliable governance framework in the archival process is key, and several tools have built-in features that make this much easier to implement. Leveraging AI in this whole process has its own set of benefits, like increased automation in the data preprocessing and transformation phases. It is also equally important to remember that data is gold, so there must be a clear strategy for the accessibility of the archived data. While business intelligence from historical data is an apparent benefit, historical data can do wonders for training ML models too.

Final Thoughts

In conclusion, application retirement and data archival are not just a business requirement and technical necessity but a strategic opportunity as well. Organizations can adopt the best practices of data archiving, aided by the right tools, to turn decommissioned, archived data into a pathway for modernization, cost savings, and strategic assets that fuel future growth and innovation. Embracing data archival the right way will put organizations on the path to modernization, with clear accessibility to their past.
In the world of distributed systems, few things are more frustrating to users than making a change and then not seeing it immediately. Change your status on your favorite social network and reload the page, only to discover your previous status. This is where Read Your Own Writes (RYW) consistency becomes quite important; it is not just a technical need but a core expectation from the user's perspective.

What Is Read Your Own Writes Consistency?

Read Your Own Writes consistency is an assurance that once a process, usually a user, has updated a piece of data, all subsequent reads by that same process will return the updated value. It is a specific category of session consistency centered on how the user interacts with their own data modifications. Let's look at some real-world scenarios where RYW consistency is important:

1. Social Media Updates

When you tweet or update your status on social media, you expect to see the new post as soon as the feed is reloaded. Without RYW consistency, content may seem to “vanish” for a brief period of time and then appear multiple times, confusing your audience.

2. Document Editing

In systems that involve collaborative document editing, such as Google Docs, the user must see their own changes immediately, though there might be some slight delay in the updates of other users.

3. E-commerce Inventory Management

If a seller updates their product inventory, they must immediately see the correct numbers in order to make informed business decisions.

Common Challenges in Implementing RYW

1. Caching Complexities

One of the biggest challenges comes from caching layers. When data is cached at different levels (browser, CDN, application server), it is important to have a suitable cache invalidation or update strategy so as to deliver the latest write to the client, i.e., the user.

2. Load Balancing

In systems with multiple replicas and load balancers, requests from the same user can be routed to different servers. This can break RYW consistency if not handled properly.

3. Replication Lag

In primary-secondary database setups, writes are directed to the primary while reads can be served from the secondaries. This can create a window during which recent writes are not yet visible.

Implementation Strategies

1. Sticky Sessions

Python
# Example load balancer configuration
class LoadBalancer:
    def route_request(self, user_id, request):
        # Route to the same server for a given user session
        server = self.session_mapping.get(user_id)
        if not server:
            server = self.select_server()
            self.session_mapping[user_id] = server
        return server

2. Write-Through Caching

Python
class CacheLayer:
    def update_data(self, key, value):
        # Update database first
        self.database.write(key, value)
        # Immediately update cache
        self.cache.set(key, value)
        # Attach version information
        self.cache.set_version(key, self.get_timestamp())

3. Version Tracking

Python
class SessionManager:
    def track_write(self, user_id, resource_id):
        # Record the latest write version for this user
        timestamp = self.get_timestamp()
        self.write_versions[user_id][resource_id] = timestamp

    def validate_read(self, user_id, resource_id, data):
        # Ensure read data is at least as fresh as user's last write
        last_write = self.write_versions[user_id].get(resource_id)
        return data.version >= last_write if last_write else True

Best Practices
Best Practices

1. Use Timestamps or Versions

Attach version information to all writes.
Compare versions during reads to ensure consistency.
Consider using logical clocks for better ordering.

2. Implement Smart Caching Strategies

Use the cache-aside pattern with careful invalidation.
Consider write-through caching for critical updates.
Implement cache versioning.

3. Monitor and Alert

Track consistency violations.
Measure read-write latencies.
Alert on abnormal patterns.

Conclusion

Read Your Own Writes consistency may sound like a modest requirement, but implementing it properly in a distributed system requires careful attention to caching, routing, and data replication design. By understanding the challenges involved and applying adequate solutions, we can build systems that feel smooth and intuitive to users. Among the many consistency models in distributed systems, RYW consistency is often all that user experience demands: users will accept eventual consistency when observing updates from other users, provided their own changes are reflected immediately.
In digital communication, email remains a primary tool for both personal and business correspondence. However, as email usage has grown, so has the prevalence of spam and malicious emails. Organizations like Spamhaus work tirelessly to maintain email security, protect users from spam, and set standards for email etiquette. By using machine learning (ML) and artificial intelligence (AI), Spamhaus can improve its email filtering accuracy, better identify malicious senders, and promote responsible emailing practices. This article explores how machine learning and AI can be leveraged for Spamhaus email etiquette and security, highlighting techniques used for spam detection, filtering, and upholding responsible emailing standards.

Section 1: The Role of Spamhaus in Email Etiquette and Security

Spamhaus is a non-profit organization that maintains several real-time databases used to identify and block spam sources. By analyzing IP addresses, domain reputations, and known malicious activities, Spamhaus helps internet service providers (ISPs) and organizations filter out unwanted emails. Beyond spam blocking, Spamhaus also establishes guidelines for email etiquette to help prevent legitimate messages from being flagged and promote ethical practices in email marketing and communication.

Section 2: Machine Learning Techniques in Spam Detection and Filtering

1. Supervised Machine Learning for Email Classification

Spam vs. ham classification: Supervised learning models, such as decision trees, support vector machines, and logistic regression, can be trained on labeled datasets containing spam (unwanted emails) and ham (legitimate emails) examples. These models learn the distinguishing features between spam and non-spam emails based on keywords, sender reputation, frequency of certain terms, and more.
Feature extraction: Machine learning models rely on features such as email subject lines, sender metadata, URLs, and attachments. By identifying specific words, links, and patterns associated with spam, the models can classify emails more accurately.

2. Natural Language Processing (NLP) for Content Analysis

NLP techniques can analyze the content and language structure within emails. Spam messages often use certain phrases, misspellings, or urgent language to deceive users. NLP models, such as sentiment analysis and named entity recognition, can identify these patterns and flag potentially harmful emails.
Using techniques like Word2Vec or TF-IDF, words and phrases in an email can be converted into numerical vectors that capture their contextual meaning. These vectors help the ML model understand the text better and identify suspicious language patterns.

3. Bayesian Filtering

Bayesian filtering is a probabilistic approach commonly used in spam detection. This method calculates the likelihood that an email is spam based on the frequency of certain words or features in the email. As the filter is trained with more spam and ham emails, it continually improves its accuracy.
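To make Section 2 concrete, here is a minimal, hypothetical sketch of a spam-vs-ham classifier that combines TF-IDF feature extraction with a Naive Bayes model (a common stand-in for the Bayesian filtering idea). The tiny inline dataset and scikit-learn pipeline are illustrative only and say nothing about how Spamhaus actually implements its filters.

Python

# A toy spam-vs-ham classifier: TF-IDF features + multinomial Naive Bayes.
# The emails below are made-up examples; a real system would train on a
# large labeled corpus and many more features (headers, URLs, metadata).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Congratulations! You won a free prize, click this link now",
    "URGENT: verify your account password immediately",
    "Meeting notes from Tuesday's project sync attached",
    "Can we reschedule lunch to Friday?",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
model.fit(emails, labels)

print(model.predict(["Click now to claim your free account upgrade"]))  # likely 'spam'
print(model.predict(["Attaching the notes from our sync"]))             # likely 'ham'

The same pipeline shape extends to richer inputs, such as sender reputation scores or URL counts, by adding those signals alongside the text features.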
Section 3: AI-Powered Enhancements for Spamhaus Email Etiquette

1. Unsupervised Learning for Pattern Detection

Unlike supervised models, unsupervised learning does not rely on labeled data. Instead, it identifies patterns and anomalies in email data. Techniques like clustering and anomaly detection can be used to find unusual email patterns that may indicate spam or phishing attempts.
Clustering algorithms: By grouping similar emails together, clustering algorithms (e.g., K-means) can help Spamhaus identify patterns in spam emails that are evolving or changing over time, such as new phishing tactics or scams.

2. Deep Learning Models for Phishing Detection

Phishing attacks are one of the biggest email security challenges, as they are often sophisticated and hard to detect. Deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can analyze the entire structure of an email, including headers, content, and hyperlinks, to identify potential phishing attempts with high accuracy.

3. AI-Driven Domain and IP Reputation Scoring

By analyzing historical data on domains and IP addresses, AI models can assign reputation scores to various sources. These scores are based on factors like the frequency of spam reports, associations with known malicious activity, and unusual email-sending patterns. A low reputation score could result in an email being flagged as suspicious or blocked entirely.

4. Adaptive Learning With Reinforcement Techniques

Reinforcement learning can be used to create adaptive filters that continuously improve as they interact with new data. These filters adjust their response based on feedback, refining their spam detection over time, adapting to new spam tactics, and evolving email etiquette.

Section 4: Ensuring Responsible Emailing With AI

1. User Behavior Analytics

Machine learning models can analyze user behavior to detect anomalies, such as unusual sending patterns or spikes in email volume (a minimal sketch of this idea follows this section). By identifying these behaviors, Spamhaus can encourage responsible email usage and discourage practices associated with spam-like behavior, even among legitimate senders.

2. Sender Authentication Techniques

AI can help verify sender identities and enhance email authentication using protocols like SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail), and DMARC (Domain-based Message Authentication, Reporting, and Conformance). Machine learning models can cross-reference these authentication mechanisms to prevent email spoofing and ensure that emails are sent by verified sources.

3. Predictive Modeling for Engagement and Spam-Like Behavior

AI can analyze engagement metrics, such as open rates and click-through rates, to identify email campaigns that might be perceived as spammy by recipients. By offering insights into how recipients interact with emails, predictive models can help senders improve their practices, aligning with Spamhaus guidelines for responsible emailing.

4. Automated Feedback Loops for Continuous Improvement

AI-driven feedback loops can alert email marketers or organizations when their emails are flagged as spam or exhibit characteristics of poor etiquette. These insights can help senders refine their strategies to meet best practices, reducing the chances of legitimate emails being blocked.
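As a follow-up to the user behavior analytics point above, here is a minimal, hypothetical sketch of spike detection on a sender's daily email volume using a simple z-score. The threshold and data are invented for illustration and are not tied to any Spamhaus system.

Python

# Toy user-behavior check: flag a sender whose daily send volume spikes
# far above their historical baseline (z-score heuristic).
import statistics

def volume_spike(history, today, z_threshold=3.0):
    """Return True if today's send count is an outlier vs. the sender's history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today > mean * 2  # fallback when the history is perfectly flat
    z = (today - mean) / stdev
    return z > z_threshold

# Hypothetical sender: roughly 200 emails/day historically, 5,000 today.
daily_counts = [180, 210, 195, 205, 190, 220, 200]
print(volume_spike(daily_counts, today=5000))  # True -> worth investigating
print(volume_spike(daily_counts, today=230))   # False -> normal variation

A production system would replace the z-score with a learned model and add features such as bounce rates and recipient complaints, but the flow of baseline, deviation, and alert stays the same.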
Section 5: Benefits and Challenges of Using AI and ML in Email Etiquette

Benefits

Higher accuracy: AI models can identify nuanced patterns that are difficult for traditional filters to catch, improving accuracy in detecting spam and malicious emails.
Real-time detection: Machine learning enables real-time analysis, allowing Spamhaus to block spam emails before they reach the inbox.
Better user experience: By reducing false positives and promoting responsible emailing, AI improves the overall email experience for both senders and recipients.

Challenges

Privacy and data protection: AI models require extensive data, raising concerns about user privacy and data security. Organizations must adhere to data protection regulations and prioritize user privacy.
Model bias and fairness: ML models can sometimes exhibit biases based on the data they’re trained on. It’s essential to monitor and correct these biases to avoid mistakenly flagging legitimate senders.
Adaptability to evolving threats: Spam and phishing tactics are constantly evolving, requiring AI models to be updated and retrained regularly to stay effective.

Conclusion

Machine learning and AI have the potential to transform Spamhaus email etiquette and security, improving spam detection, reducing false positives, and enhancing the user experience. By leveraging techniques such as supervised learning, NLP, Bayesian filtering, and unsupervised learning, AI can provide more accurate and adaptive filtering solutions. Additionally, with the integration of user behavior analysis and predictive modeling, AI can support responsible emailing practices, encouraging a safer and more ethical email environment. As these technologies continue to advance, the collaboration between AI and organizations like Spamhaus will play a crucial role in keeping email communication secure, efficient, and courteous. By staying vigilant, continuously refining models, and promoting best practices, the future of email security and etiquette looks promising with the support of machine learning and AI.
We’re all familiar with the principles of DevOps: building small, well-tested increments, deploying frequently, and automating pipelines to eliminate the need for manual steps. We monitor our applications closely, set up alerts, roll back problematic changes, and receive notifications when issues arise. However, when it comes to databases, we often lack the same level of control and visibility. Debugging performance issues can be challenging, and we might struggle to understand why databases slow down. Schema migrations and modifications can spiral out of control, leading to significant challenges. Overcoming these obstacles requires strategies that streamline schema migration and adaptation, enabling efficient database structure changes with minimal downtime or performance impact. It’s essential to test all changes cohesively throughout the pipeline. Let’s explore how this can be achieved.

Automate Your Tests

Databases are prone to many types of failures, yet they often don’t receive the same rigorous testing as applications. While developers typically test whether applications can read and write the correct data, they often overlook how this is achieved. Key aspects like ensuring the proper use of indexes, avoiding unnecessary lazy loading, or verifying query efficiency often go unchecked. For example, we focus on how many rows the database returns but neglect to analyze how many rows it had to read. Similarly, rollback procedures are rarely tested, leaving us vulnerable to potential data loss with every change. To address these gaps, we need comprehensive automated tests that detect issues proactively, minimizing the need for manual intervention.

We often rely on load tests to identify performance issues, and while they can reveal whether our queries are fast enough for production, they come with significant drawbacks. First, load tests are expensive to build and maintain, requiring careful handling of GDPR compliance, data anonymization, and stateful applications. Moreover, they occur too late in the development pipeline. When load tests uncover issues, the changes are already implemented, reviewed, and merged, forcing us to go back to the drawing board and potentially start over. Finally, load tests are time-consuming, often requiring hours to fill caches and validate application reliability, making them less practical for catching issues early.

Schema migrations often fall outside the scope of our tests. Typically, we only run test suites after migrations are completed, meaning we don’t evaluate how long they took, whether they triggered table rewrites, or whether they caused performance bottlenecks. These issues often go unnoticed during testing and only become apparent when deployed to production. Another challenge is that we test with databases that are too small to uncover performance problems early. This reliance on inadequate testing can lead to wasted time on load tests and leaves critical aspects, like schema migrations, entirely untested. This lack of coverage reduces our development velocity, introduces application-breaking issues, and hinders agility.

The solution to these challenges lies in implementing database guardrails. Database guardrails evaluate queries, schema migrations, configurations, and database designs as we write code. Instead of relying on pipeline runs or lengthy load tests, these checks can be performed directly in the IDE or developer environment. By leveraging observability and projections of the production database, guardrails assess execution plans, statistics, and configurations, ensuring everything will function smoothly post-deployment.
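As one illustration of the kind of automated test described above, the sketch below asserts that a query's execution plan uses an index and reads a bounded number of rows. It assumes a PostgreSQL test database, the psycopg2 driver, and a hypothetical users table with an index on email; none of these specifics come from the article, and real guardrail tooling does considerably more.

Python

# Hypothetical pytest-style check: fail the build if a hot query stops
# using its index or starts reading far more rows than it returns.
import json
import psycopg2

def explain(conn, sql, params):
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql, params)
        doc = cur.fetchone()[0]
        if isinstance(doc, str):        # depending on driver settings, parse manually
            doc = json.loads(doc)
        return doc[0]["Plan"]           # top-level plan node

def test_user_lookup_uses_index():
    conn = psycopg2.connect("dbname=app_test")  # assumed test database
    plan = explain(conn,
                   "SELECT id, name FROM users WHERE email = %s",
                   ("alice@example.com",))
    # The lookup should be an index-based scan, not a sequential scan ...
    assert plan["Node Type"] in ("Index Scan", "Index Only Scan", "Bitmap Heap Scan")
    # ... and it should not read orders of magnitude more rows than it returns.
    assert plan["Actual Rows"] <= 10
    conn.close()

Run in CI against a reasonably sized test dataset, a check like this catches a missing or bypassed index long before a load test would.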
Build Observability Around Databases

When we deploy to production, system dynamics can change over time. CPU load may spike, memory usage might grow, data volumes could expand, and data distribution patterns may shift. Identifying these issues quickly is essential, but it's not enough. Current monitoring tools overwhelm us with raw signals, leaving us to piece together the reasoning. For example, they might indicate an increase in CPU load but fail to explain why it happened. The burden of investigating and identifying root causes falls entirely on us. This approach is outdated and inefficient.

To truly move fast, we need to shift from traditional monitoring to full observability. Instead of being inundated with raw data, we need actionable insights that help us understand the root cause of issues. Database guardrails offer this transformation. They connect the dots, showing how various factors interrelate, pinpointing the problem, and suggesting solutions. Instead of simply observing a spike in CPU usage, guardrails help us understand that a recent deployment altered a query, causing an index to be bypassed, which led to the increased CPU load. With this clarity, we can act decisively, fixing the query or index to resolve the issue. This shift from "seeing" to "understanding" is key to maintaining speed and reliability.

The next evolution in database management is transitioning from automated issue investigation to automated resolution. Many problems can be fixed automatically with well-integrated systems. Observability tools can analyze performance and reliability issues and generate the necessary code or configuration changes to resolve them. These fixes can either be applied automatically or require explicit approval, ensuring that issues are addressed immediately with minimal effort on your part. Beyond fixing problems quickly, the ultimate goal is to prevent issues from occurring in the first place. Frequent rollbacks or failures hinder progress and agility. True agility is achieved not by rapidly resolving issues but by designing systems where issues rarely arise. While this vision may require incremental steps to reach, it represents the ultimate direction for innovation.

Metis empowers you to overcome these challenges. It evaluates your changes before they’re even committed to the repository, analyzing queries, schema migrations, execution plans, performance, and correctness throughout your pipelines. Metis integrates seamlessly with CI/CD workflows, preventing flawed changes from reaching production. But it goes further — offering deep observability into your production database by analyzing metrics and tracking deployments, extensions, and configurations. It automatically fixes issues when possible and alerts you when manual intervention is required. With Metis, you can move faster and automate every aspect of your CI/CD pipeline, ensuring smoother and more reliable database management.

Everyone Needs to Participate

Database observability is about proactively preventing issues, advancing toward automated understanding and resolution, and incorporating database-specific checks throughout the development process. Relying on outdated tools and workflows is no longer sufficient; we need modern solutions that adapt to today’s complexities. Database guardrails provide this support.
They help developers avoid creating inefficient code, analyze schemas and configurations, and validate every step of the software development lifecycle within our pipelines. Guardrails also transform raw monitoring data into actionable insights, explaining not just what went wrong but how to fix it. This capability is essential across all industries, as the complexity of systems will only continue to grow. To stay ahead, we must embrace innovative tools and processes that enable us to move faster and more efficiently.
Organizations face the growing challenge of managing, protecting, and governing data across diverse environments. As data flows through hybrid cloud systems, multi-cloud environments, and on-premises infrastructures, maintaining a cohesive, secure data ecosystem has become a complicated and daunting affair. A promising solution to this challenge is the concept of a data fabric — a unified, integrated layer that provides seamless access, management, and governance across disparate data sources. However, ensuring the security and integrity of data within this unconventional framework requires an equally unconventional approach to security. In this article, I’d like to discuss how Zero Trust Architecture (ZTA) can provide a solid foundation for achieving security and trust in a data fabric. First, let’s take a deeper dive into the concept of the data fabric.

Understanding Data Fabric

A data fabric, as an architecture, is designed to streamline data management, integration, and governance across various platforms, both on-premises and in the cloud. It essentially provides a layer of abstraction that connects and automates data across siloed environments, offering real-time access, data sharing, and analytics. However, given the wide distribution of data, its constant movement across different systems, and the wide range of human and non-human actors interacting with it, a data fabric introduces significant security challenges. The old paradigm of setting up a secure perimeter that controls what gets in just doesn’t cut it. In fact, it would be like applying medieval security to a modern metropolis. No one would, for example, suggest protecting Los Angeles by building an alligator-filled moat around it.

The Challenge of Trust in a Data Fabric

Trust — defined in cybersecurity as the belief that an entity has the integrity and authority to possess data — is fundamental to any secure system. However, the nature of a data fabric architecture complicates traditional models of trust that rely solely on network perimeter control and user identity. In legacy security models, once a user or system is authenticated and granted access to a network, trust is implicitly extended for the session’s duration. However, this approach is no longer adequate in a world where data is spread across environments, accessed by diverse devices, and subjected to increasingly sophisticated cyber threats. A data fabric, by its very nature, increases the attack surface by connecting a wide array of systems, applications, and services. As data moves across these various endpoints, the risk of unauthorized access, data leakage, or other malicious activity grows. This environment requires a security model that continuously validates the trustworthiness of users, devices, and services, rather than assuming trust based on initial authentication.

Zero Trust Architecture: The Right Security Foundation for the Data Fabric

The National Institute of Standards and Technology’s (NIST) Zero Trust Architecture (ZTA) is a cybersecurity framework based on the principle of “never trust, always verify.” It assumes that no one, whether inside or outside the network perimeter, should be trusted by default. In a Zero Trust model, access to resources is granted only after continuous authentication, authorization, and trust validation — using policies, user behavior analytics, and real-time risk assessments.
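To ground the "never trust, always verify" idea, here is a minimal, hypothetical sketch of a per-request policy decision that combines identity, device posture, least-privilege policy, and a risk score. The names (AccessRequest, evaluate_request), the sample policy, and the thresholds are invented for illustration; they are not part of NIST's specification or any particular product.

Python

# Illustrative Zero Trust policy check: every request is evaluated on its
# own merits; nothing is trusted just because it comes from "inside."
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user_id: str
    authenticated: bool        # result of (re-)authentication for this session
    device_compliant: bool     # e.g., patched OS, disk encryption enabled
    resource: str              # data set or service being accessed
    action: str                # "read", "write", ...
    risk_score: float          # 0.0 (low) to 1.0 (high), from behavior analytics

# Hypothetical least-privilege policy: role -> allowed resources and actions
POLICY = {
    "analyst": {"sales_data": {"read"}},
    "engineer": {"sales_data": {"read"}, "pipeline_config": {"read", "write"}},
}

def evaluate_request(req: AccessRequest, role: str, risk_threshold: float = 0.7) -> bool:
    if not req.authenticated or not req.device_compliant:
        return False                       # verify identity and device every time
    allowed_actions = POLICY.get(role, {}).get(req.resource, set())
    if req.action not in allowed_actions:
        return False                       # least privilege: only what the role needs
    if req.risk_score >= risk_threshold:
        return False                       # real-time risk signal overrides everything
    return True

request = AccessRequest("u-42", True, True, "sales_data", "read", risk_score=0.2)
print(evaluate_request(request, role="analyst"))  # True under these assumptions

In a data fabric, a check like this would sit at every access point, such as query gateways and API layers, rather than only at the network edge.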
In a chaotic and threatening cyber landscape that increasingly encompasses not just complex systems but “systems of systems,” where devices, networks, and people must collaborate, ZTA addresses skyrocketing complexity and interconnectivity. It does this by emphasizing several core principles that help establish trust in complex systems:

Trust no one, verify everything: Similar to the Cold War-era motto “trust but verify,” ZTA posits that no entity, whether internal or external, should be trusted by default. Every access request is scrutinized and fully authenticated before being granted.
Least privilege access: Just as a visitor to a restricted facility like the White House is only granted access to specific areas, users and devices are granted the minimum level of access necessary to complete their tasks. This principle minimizes the potential impact of any unauthorized access.
Micro-segmentation: Dividing the network into smaller segments limits the lateral movement of cybercriminals. Micro-segmentation, combined with least privilege access, creates additional roadblocks for attackers, making it harder to spread through a network.
Continuous monitoring and vigilance: Like strategically placed security cameras in a physical facility, ZTA emphasizes real-time monitoring of network traffic, access requests, and user behavior. This constant vigilance enables organizations to detect and respond to potential threats before they can cause significant harm.

Benefits of Zero Trust Architecture

By examining each access request and limiting privileges to the minimum necessary for a task, the likelihood of unauthorized access or data breaches is dramatically reduced. But the benefits of adopting ZTA don’t end there. They extend to:

Enhanced adaptability: Thanks to principles like micro-segmentation, the flexible framework of ZTA enables organizations to quickly adapt to new threats, technologies, and business requirements.
Simplified compliance: ZTA’s focus on continuous monitoring and evaluation offers deep, easily accessible insights into technical architecture that make it easier to stay on top of regulatory compliance.
Reduced cybersecurity complexity: ZTA eliminates the need for disjointed security solutions, allowing for a more streamlined and efficient security infrastructure.

Zero Trust as the Backbone of a Secure Data Fabric

While a data fabric offers a powerful solution to unify and streamline data management, without a security framework like ZTA, it remains vulnerable to cyber threats. Zero Trust provides the continuous validation, granular access control, and real-time monitoring necessary to protect sensitive data within a data fabric. By embracing principles like continuous authentication, micro-segmentation, and least privilege access, organizations can build a robust and secure environment that ensures both the safety and trustworthiness of their data. Achieving a secure Zero Trust network in a data fabric requires a combination of technology, automation, and strong governance, which I’ll discuss in more detail in future articles. In summary, ZTA offers a paradigm shift: a security posture that can adapt to evolving threats and technologies while greatly simplifying many key operations. In my next article, I’ll discuss how Zero Trust principles can evolve with new developments such as AI and 6G.
In a world where companies rely heavily on data for insights about their performance, potential issues, and areas for improvement, logging comprehensively is crucial, but it comes at a cost. If not stored properly, logs can become cumbersome to maintain and query, and expensive overall. Logging detailed user activities such as time spent in various apps, which interface users are active on, navigation paths, app start-up times, crash reports, and country of login can be vital for understanding user behavior, but we can easily end up with billions of rows of data, which quickly becomes an issue if scalable solutions are not implemented at the time of logging. In this article, we will discuss how we can efficiently store data in an HDFS system and use some of Presto’s functionality to query massive datasets with ease, reducing compute costs drastically in data pipelines.

Partitioning

Partitioning is a technique where logically similar data is grouped together and stored in a single file, making retrieval quicker. For example, let's consider an app like YouTube. It would be useful to group data belonging to the same date and country into one file, which results in multiple smaller files and makes scanning easier. Just by looking at the metadata, Presto can figure out which specific files need to be scanned based on the query the user provides. Internally, a folder called youtube_user_data would be created, within which multiple subfolders would be created for each partition by date and country (e.g., login_date=2023-10-01/country=US). If the app was launched in 2 countries and has been active for 2 days, then the number of files generated would be 2*2 = 4 (the cartesian product of the unique values in the partition columns). Hence, choosing columns with low cardinality is essential. For example, if we add interface as another partition column, with three possible values (ios, android, desktop), it would increase the number of files to 2×2×3 = 12. Based on the partitioning strategy described, the data would be stored in a directory structure such as youtube_user_data/login_date=2023-10-01/country=US/.

Below is an example query showing how to create a table with login_date and country as partition columns:

SQL

CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    ...
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60
);

Ad Hoc Querying

When querying a partitioned table, specifying only the needed partitions can greatly reduce query wall time.

SQL

SELECT SUM(1) AS total_users_above_30
FROM youtube_user_data
WHERE login_date = '2023-10-01'
  AND country = 'US'
  AND age > 30

By specifying the partition columns as filters in the query, Presto will jump directly to the 2023-10-01 and US folder and retrieve only the file within it, skipping the scanning of other files completely.

Scheduling Jobs

If the source table is partitioned by country, then setting up daily ETL jobs also becomes easier, as we can now run them in parallel.
For example:

Python

# Sample Dataswarm job scheduling that does parallel processing,
# taking advantage of partitions in the source table
insert_task = {}
wait_for = {}

for country in ["US", "CA"]:
    # wait-for job
    wait_for[country] = WaitforOperator(
        table="youtube_user_data",
        partitions=f"login_date=<DATEID>/country={country}"
    )

    # insert job
    insert_task[country] = PrestoOperator(
        dep_list=[wait_for[country]],
        input_data={
            "in": input.table("youtube_user_data")
                       .col("login_date").eq("<DATEID>")
                       .col("country").eq(country)
        },
        output_data={
            "out": output.table("output_table_name")
                         .col("login_date").eq("<DATEID>")
                         .col("country").eq(country)
        },
        select="""
            SELECT
                user_id,
                SUM(1) AS total_count
            FROM <in:youtube_user_data>
        """
    )

Note: The above uses Dataswarm as an example for processing/inserting data. Here, there will be two parallel running tasks — insert_task[US] and insert_task[CA] — which will query only the data pertaining to those partitions and load it into a target table that is also partitioned on country and date. Another benefit is that the WaitforOperator can be set up to check whether the particular partition of interest has landed rather than waiting for the whole table. If, say, CA data is delayed but US data has landed, we can start the US insert task first and kick off the CA insert job later, once the CA upstream data lands. The resulting DAG is simple: for each country, the wait-for task runs first, followed by the corresponding insert task.

Bucketing

If frequent GROUP BY and JOIN operations are to be performed on a table, we can further optimize the storage using bucketing. Bucketing organizes data into smaller chunks within a file based on a key column (e.g., user_id), so when querying, Presto knows in which bucket a specific ID would be present.

How to Implement Bucketing

Choose a bucketing column: Pick a key column that is commonly used for joins and group bys.
Define buckets: Specify the number of buckets to divide the data into.

SQL

CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    ...
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60,
    bucket_count = 1024,
    bucketed_by = ARRAY['user_id']
);

Note: The bucket count should be a power of 2. In the above example, we chose 1024 (2^10).

Before Bucketing

Data for a partition is stored in a single file, requiring a full scan to locate a specific user_id.

After Bucketing

User IDs are placed into smaller buckets based on the range they fall under. A user ID is assigned to a specific bucket based on its value; for example, a new user ID of 1567 would be placed in Bucket 1:

Bucket 1: 1000 to 1999
Bucket 2: 2000 to 2999
Bucket 3: 3000 to 3999
Etc.

When performing a join with another table — say, to retrieve user attributes like gender and birthdate for a particular user (e.g., 4592) — the lookup is much quicker, as Presto knows which bucket (bucket 4) that user falls in, so it can jump directly to that specific bucket and skip scanning the others. It would still need to search for the user within that bucket. We can speed up that step as well by sorting the data on the key ID while storing it within each bucket, which we will explore in the next section.
SQL

SELECT
    a.user_id,
    b.gender,
    b.birthdate
FROM youtube_user_data a
JOIN dim_user_info b
    ON a.user_id = b.user_id
WHERE a.login_date = '<DATEID>'
  AND a.country = 'US'
  AND b.date = '<DATEID>'

Hidden $bucket Column

For bucketed tables, there is a hidden column that lets you specify the buckets you want to read data from. For example, the following query counts over bucket #17 (bucket IDs start from 0):

SQL

SELECT SUM(1) AS total_count
FROM youtube_user_data
WHERE login_date = '2023-05-01'
  AND "$bucket" = 17

The following query counts over roughly 10% of the data for a table with 1024 buckets:

SQL

SELECT SUM(1) AS total_count
FROM youtube_user_data
WHERE login_date = '2023-05-01'
  AND "$bucket" BETWEEN 0 AND 100

Sorting

To further optimize the buckets, we can sort the data while inserting it so query speeds improve even more, as Presto can jump directly to the specific index within a specific bucket within a specific partition to fetch the data needed.

How to Enable Sorting

Choose a sorting column: Typically, this is the same column used for bucketing, such as user_id.
Sort data during insertion: Ensure that data is sorted as it is inserted into each bucket.

SQL

CREATE TABLE youtube_user_data (
    user_id BIGINT,
    age INT,
    video_id BIGINT,
    login_unixtime BIGINT,
    interface VARCHAR,
    ip_address VARCHAR,
    login_date VARCHAR,
    country VARCHAR
    ...
)
WITH (
    partitioned_by = ARRAY['login_date', 'country'],
    format = 'DWRF',
    oncall = 'your_oncall_name',
    retention_days = 60,
    bucket_count = 1024,
    bucketed_by = ARRAY['user_id'],
    sorted_by = ARRAY['user_id']
);

In a sorted bucket, the user IDs are inserted in order, which makes retrieval efficient. This becomes very handy when we have to join large tables or perform aggregations across billions of rows of data.

Conclusion

Partitioning: For large datasets, partition the table on low-cardinality columns like date, country, and interface, which results in smaller HDFS files. Presto can then query only the needed files by looking up the metadata/file names.
Bucketing and sorting: If a table is used frequently in joins or group bys, it is beneficial to bucket and sort the data within each partition, further reducing key lookup time.
Caveat: There is an initial compute cost for bucketing and sorting, as Presto has to maintain the order of the key while inserting. However, this one-time cost can be justified by savings in repeated downstream queries.