Programming languages allow us to communicate with computers, and they operate like sets of instructions. There are numerous types of languages, including procedural, functional, object-oriented, and more. Whether you’re looking to learn a new language or trying to find some tips or tricks, the resources in the Languages Zone will give you all the information you need and more.
In today's data-driven world, real-time data processing and analytics have become crucial for businesses to stay competitive. Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that provides efficient data ingestion and real-time analytics on large-scale datasets stored in data lakes. In this blog, we'll explore Apache Hudi with a technical deep dive and Python code examples, using a business example for better clarity.

Table of Contents:

1. Introduction to Apache Hudi (Key Features of Apache Hudi)
2. Business Use Case
3. Setting Up Apache Hudi
4. Ingesting Data With Apache Hudi
5. Querying Data With Apache Hudi
6. Security and Other Aspects (Security, Performance Optimization, Monitoring and Management)
7. Conclusion

1. Introduction to Apache Hudi

Apache Hudi is designed to address the challenges associated with managing large-scale data lakes, such as data ingestion, updating, and querying. Hudi enables efficient data ingestion and provides support for both batch and real-time data processing.

Key Features of Apache Hudi

- Upserts (insert/update): Efficiently handle data updates and inserts with minimal overhead. Traditional data lakes struggle with updates, but Hudi's upsert capability ensures that the latest data is always available without requiring full rewrites of entire datasets.
- Incremental pulls: Retrieve only the data that has changed since the last pull, which significantly optimizes data processing pipelines by reducing the amount of data that needs to be processed.
- Data versioning: Manage different versions of data, allowing for easy rollback and temporal queries. This versioning is critical for ensuring data consistency and supporting use cases such as time travel queries.
- ACID transactions: Ensure data consistency and reliability by providing atomic, consistent, isolated, and durable transactions on data lakes. This makes Hudi a robust choice for enterprise-grade applications.
- Compaction: Hudi offers a compaction mechanism that optimizes storage and query performance. This process merges smaller data files into larger ones, reducing the overhead associated with managing numerous small files.
- Schema evolution: Handle changes in the data schema gracefully without disrupting existing pipelines. This feature is particularly useful in dynamic environments where data models evolve over time.
- Integration with the big data ecosystem: Hudi integrates seamlessly with Apache Spark, Apache Hive, Apache Flink, and other big data tools, making it a versatile choice for diverse data engineering needs.

2. Business Use Case

Let's consider a business use case of an e-commerce platform that needs to manage and analyze user order data in real time. The platform receives a high volume of orders every day, and it is essential to keep the data up to date and perform real-time analytics to track sales trends, inventory levels, and customer behavior.

3. Setting Up Apache Hudi

Before we dive into the code, let's set up the environment. We'll use PySpark and the Hudi library for this purpose.

```shell
# Install necessary libraries
pip install pyspark==3.1.2
pip install hudi-spark-bundle_2.12
```

4. Ingesting Data With Apache Hudi

Let's start by ingesting some order data into Apache Hudi. We'll create a DataFrame with sample order data and write it to a Hudi table.
```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("HudiExample") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .getOrCreate()

# Sample order data
order_data = [
    (1, "2023-10-01", "user_1", 100.0),
    (2, "2023-10-01", "user_2", 150.0),
    (3, "2023-10-02", "user_1", 200.0)
]

# Create DataFrame
columns = ["order_id", "order_date", "user_id", "amount"]
df = spark.createDataFrame(order_data, columns)

# Define Hudi options
hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.partitionpath.field': 'order_date',
    'hoodie.datasource.write.precombine.field': 'order_date',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'orders',
    'hoodie.datasource.hive_sync.partition_fields': 'order_date'
}

# Write DataFrame to Hudi table
df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/path/to/hudi/orders")

print("Data ingested successfully.")
```

5. Querying Data With Apache Hudi

Now that we have ingested the order data, let's query the data to perform some analytics. We'll use the Hudi DataSource API to read the data.

```python
# Read data from Hudi table
orders_df = spark.read.format("hudi").load("/path/to/hudi/orders/*")

# Show the ingested data
orders_df.show()

# Perform some analytics
# Calculate total sales per day
total_sales = orders_df.groupBy("order_date").sum("amount") \
    .withColumnRenamed("sum(amount)", "total_sales")
total_sales.show()

# Calculate sales by user
sales_by_user = orders_df.groupBy("user_id").sum("amount") \
    .withColumnRenamed("sum(amount)", "total_sales")
sales_by_user.show()
```
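The upsert and incremental-pull features highlighted earlier can be exercised with the same DataSource API. The snippet below is a minimal sketch, not part of the original article: it reuses the spark session, hudi_options, and table path from the examples above, and the begin instant time is a placeholder that would normally come from the table's commit timeline.

```python
# Upsert: order 2's amount changed, and one new order arrives
updates = spark.createDataFrame(
    [(2, "2023-10-01", "user_2", 175.0),   # updated amount for an existing order
     (4, "2023-10-03", "user_3", 90.0)],   # brand-new order
    ["order_id", "order_date", "user_id", "amount"]
)

# Hudi's write operation defaults to upsert, so appending rows that share a
# record key updates the existing records instead of duplicating them
updates.write.format("hudi").options(**hudi_options).mode("append").save("/path/to/hudi/orders")

# Incremental pull: fetch only records committed after a given instant time.
# "20231001000000" is a placeholder; in a real pipeline you would read the
# last processed instant from the table's timeline.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20231001000000")
    .load("/path/to/hudi/orders")
)
incremental_df.show()
```

Downstream jobs can then process only incremental_df instead of rescanning the whole table, which is what makes incremental pulls attractive for pipelines.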
6. Security and Other Aspects

When working with large-scale data lakes, security and data governance are paramount. Apache Hudi provides several features to ensure your data is secure and compliant with regulatory requirements.

Security

- Data encryption: Hudi supports data encryption at rest to protect sensitive information from unauthorized access. By leveraging Hadoop's native encryption support, you can ensure that your data is encrypted before it is written to disk.
- Access control: Integrate Hudi with Apache Ranger or Apache Sentry to manage fine-grained access control policies. This ensures that only authorized users and applications can access or modify the data.
- Audit logging: Hudi can be integrated with log aggregation tools like Apache Kafka or Elasticsearch to maintain an audit trail of all data operations. This is crucial for compliance and forensic investigations.
- Data masking: Implement data masking techniques to obfuscate sensitive information in datasets, ensuring that only authorized users can see the actual data.

Performance Optimization

- Compaction: As mentioned earlier, Hudi's compaction feature merges smaller data files into larger ones, optimizing storage and query performance. You can schedule compaction jobs based on your workload patterns.
- Indexing: Hudi supports various indexing techniques to speed up query performance. Bloom filters and columnar indexing are commonly used to reduce the amount of data scanned during queries.
- Caching: Leverage Spark's in-memory caching to speed up repeated queries on Hudi datasets. This can significantly reduce query latency for interactive analytics.

Monitoring and Management

- Metrics: Hudi provides a rich set of metrics that can be integrated with monitoring tools like Prometheus or Grafana. These metrics help you monitor the health and performance of your Hudi tables.
- Data quality: Implement data quality checks using Apache Griffin or Deequ to ensure that the ingested data meets your quality standards. This helps in maintaining the reliability of your analytics.
- Schema evolution: Hudi's support for schema evolution allows you to handle changes in the data schema without disrupting existing pipelines. This is particularly useful in dynamic environments where data models evolve over time.

7. Conclusion

In this blog, we have explored Apache Hudi and its capabilities for managing large-scale data lakes efficiently. We set up a Spark environment, ingested sample order data into a Hudi table, and performed some basic analytics. We also discussed the security aspects and performance optimizations that Apache Hudi offers.

Apache Hudi's ability to handle upserts, provide incremental pulls, and ensure data security makes it a powerful tool for real-time data processing and analytics. By leveraging Apache Hudi, businesses can keep their data lakes up to date, secure, and ready for real-time analytics, enabling them to make data-driven decisions quickly and effectively.

Feel free to dive deeper into Apache Hudi's documentation and explore more advanced features to further enhance your data engineering workflows. If you have any questions or need further clarification, please let me know in the comments below!
In the world of programming, understanding the efficiency of your code is crucial. This is where concepts like time and space complexity come into play. In this blog post, we will explore these concepts in detail, focusing on how to calculate and interpret time complexity using Big O notation. We will also look at practical examples in Python.

What Is Time Complexity?

Time complexity measures the efficiency of your code as the length of the input increases. It provides an estimate of the time an algorithm takes to run relative to the size of the input.

What Is Space Complexity?

Space complexity refers to the additional space taken by your code as the length of the input increases. It helps to understand the memory requirements of an algorithm.

Example: Time Complexity in Python

Let's say we have a list of 1000 numbers, and we need to print each number with an extra prefix or sentence before the elements:

```python
numbers = [i for i in range(1000)]

for number in numbers:
    print(f"Number: {number}")
```

In this example, suppose printing each element takes 1 second; printing all 1000 elements would then take 1000 seconds, while printing a single element would take 1 second. The time taken is directly proportional to the size of the input.

Big O, Theta, and Omega Notations

- Big O notation: Describes an upper bound on growth and is most commonly used for the worst-case scenario.
- Theta notation: Describes a tight bound, often associated with the average-case scenario.
- Omega notation: Describes a lower bound, often associated with the best-case scenario.

Big O notation is the most widely used, as it gives a clear understanding of the worst-case time and space complexity.

Practical Examples With Python Code

Let's dive into examples with code to understand these concepts better.

Example 1: Constant Time Complexity — O(1)

In the following function demo, the list is of size 3, and we want to express the time complexity in terms of the list size. Here, we are printing only the first element of the list, so whether the list size is 3 or 3000, we print just the 0th element.

```python
def demo(lst):
    print(lst[0])

demo([1, 2, 3])
```

The time complexity of this operation is O(1), which is constant time. As the input size increases, the time remains constant.

Example 2: Linear Time Complexity — O(n)

In this code, the loop runs n times, making the time complexity O(n). This is known as linear complexity: as the input grows, the running time grows linearly.

```python
def print_elements(lst):
    for element in lst:
        print(element)

print_elements([1, 2, 3])
```

Example 3: Quadratic Time Complexity — O(n^2)

When there are two nested loops over the same list, the time complexity becomes quadratic, O(n^2). The outer loop runs n times and, for each outer iteration, the inner loop also runs n times.

```python
def print_pairs(lst):
    for i in range(len(lst)):
        for j in range(len(lst)):
            print(lst[i], lst[j])

print_pairs([1, 2, 3])
```

Example 4: Cubic Time Complexity — O(n^3)

When there are three nested loops, the time complexity is cubic, O(n^3).

```python
def print_triplets(lst):
    for i in range(len(lst)):
        for j in range(len(lst)):
            for k in range(len(lst)):
                print(lst[i], lst[j], lst[k])

print_triplets([1, 2, 3])
```

Example 5: Dominating Term

In a function with multiple parts of different complexity, we consider the term with the highest growth rate (the dominating term).

```python
def complex_function(lst):
    for i in range(len(lst)):            # O(n)
        print(lst[i])

    for i in range(len(lst)):            # O(n^2) in total
        for j in range(len(lst)):
            print(lst[i], lst[j])

    for i in range(len(lst)):            # O(n^3) in total
        for j in range(len(lst)):
            for k in range(len(lst)):
                print(lst[i], lst[j], lst[k])

complex_function([1, 2, 3])
```

The dominating term here is O(n^3).
Space Complexity in Python

Let's also understand space complexity with a practical example.

Example: Space Complexity

Consider the following function that creates a list of n elements.

```python
def create_list(n):
    new_list = []
    for i in range(n):
        new_list.append(i)
    return new_list

create_list(1000)
```

In this example, the space complexity is O(n), because the space required to store new_list grows linearly with the size of the input n: every element added to the list needs additional space.

Complexity Graph

Understanding time and space complexity helps in optimizing code. Plotting time or space against input size makes the differences between the classes easy to see: constant complexity is the best of the cases above, and cubic complexity is the worst. While optimizing code, the goal is to minimize complexity.
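Since the complexity graph itself is not reproduced here, a quick way to see these growth rates is to time the loops for increasing input sizes. This is a minimal sketch; the functions are simplified stand-ins for the earlier examples (they accumulate a sum instead of printing, so the timings reflect loop growth rather than I/O), and exact numbers will vary by machine.

```python
import time

def linear(n):
    total = 0
    for i in range(n):          # O(n)
        total += i
    return total

def quadratic(n):
    total = 0
    for i in range(n):          # O(n^2) in total
        for j in range(n):
            total += i * j
    return total

def cubic(n):
    total = 0
    for i in range(n):          # O(n^3) in total
        for j in range(n):
            for k in range(n):
                total += i * j * k
    return total

for n in (40, 80, 160):
    for fn in (linear, quadratic, cubic):
        start = time.perf_counter()
        fn(n)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__}(n={n}): {elapsed:.6f} s")
```

Doubling n should roughly double the linear time, quadruple the quadratic time, and multiply the cubic time by about eight, which is exactly the behavior the complexity graph illustrates.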
The development of the .NET platform and the C# language moves forward with the launch of .NET 9 and C# 13, which introduce a range of enhancements aimed at boosting developer efficiency, performance, and safety. This article delves into the upgrades and new features in these releases, giving developers a detailed look.

.NET 9

.NET 9 introduces a range of improvements to the .NET ecosystem, with a strong focus on AI and on building cloud-native distributed applications through the release of .NET Aspire, along with performance gains and enhancements to the .NET libraries and frameworks. Here are some notable highlights.

.NET Aspire

.NET Aspire is an opinionated stack that helps in developing .NET cloud-native applications and services. I recently wrote and published an article related to this on DZone.

Performance Improvements

.NET 9 is focused on optimizing cloud-native apps, and performance is a key aspect of this optimization. Several performance-related improvements have been made in .NET 9, including:

1. Faster exceptions: Exceptions are now 2-4x faster in .NET 9, thanks to a more modern implementation. This improvement means that your app will spend less time handling exceptions, allowing it to focus on its core functionality.
2. Faster loops: Loop performance has been improved in .NET 9 through loop hoisting and induction variable widening. These optimizations allow loops to run faster and more efficiently, making your app more responsive.
3. Dynamic PGO improvements: Dynamic PGO (profile-guided optimization) has been improved in .NET 9, reducing the cost of type checks. This means that your app will run faster and more efficiently, with less overhead from type checks.
4. RyuJIT improvements: RyuJIT, the .NET just-in-time compiler, has been improved in .NET 9 to inline more generic methods. This means that your app will run faster, with less overhead from method calls.
5. Arm64 code optimizations: Arm64 code can now be made much faster using SVE/SVE2 SIMD instructions. This means that your app can take advantage of the latest Arm64 hardware, running faster and more efficiently.
6. Server GC mode: The new server GC mode in .NET 9 has been shown to reduce memory usage by up to two-thirds in some benchmarks. This means that your app will use less memory, reducing costs and improving performance.

These performance-related improvements in .NET 9 mean that your app will run faster, leaner, and more efficiently. Whether you're building a cloud-native app or a desktop app, .NET 9 has the performance optimizations you need to succeed.

AI-Related Improvements

These AI-related improvements in .NET enable developers to build powerful applications with AI capabilities, integrate with the AI ecosystem, and monitor and observe AI app performance. Multiple partnerships, including Qdrant, Milvus, Weaviate, and more, expand the .NET AI ecosystem, and it is easy to integrate with Semantic Kernel, Azure SQL, and Azure AI Search.
| Feature | Improvement | Benefit |
|---|---|---|
| Tensor<T> | New type for tensors | Effective data handling and information flow for learning and prediction |
| Smart Components | Prebuilt controls with end-to-end AI features | Infuse apps with AI capabilities in minutes |
| OpenAI SDK | Official .NET library for OpenAI | Delightful experience and parity with other programming languages |
| Monitoring and observing | Features for monitoring and tracing AI apps | Reliable, performant, and high-quality outcomes |

Note: There is some integration work under way between the .NET Aspire team, Semantic Kernel, and Azure to use the .NET Aspire dashboard to collect and track metrics.

Web-Related Improvements

- Improved performance, security, and reliability
- Upgrades to existing ASP.NET Core features for modern cloud-native apps
- Built-in support for OpenAPI document generation
- Ability to generate OpenAPI documents at build time or runtime
- Customizable OpenAPI documents using document and operation transformers

These improvements aim to enhance the web development experience with .NET and ASP.NET Core, making it easier to build modern web apps with improved quality and fundamentals.

Caching Improvements With HybridCache

As one of my favorites, I will explain HybridCache more in-depth, along with code samples, in a separate article. In short, the HybridCache API in ASP.NET Core provides a more efficient and scalable caching solution. It introduces a multi-tier storage approach, combining in-process (L1) and out-of-process (L2) caches, with features like "stampede" protection and configurable serialization. This results in significantly faster performance, with up to 1,000x improvement in high cache-hit-rate scenarios.

C# 13: Introducing New Language Features

C# 13 brings a range of language elements aimed at enhancing code clarity, maintainability, and developer efficiency. Here are some key additions:

params collections: The params keyword is no longer restricted to array types. It can now be used with any recognized collection type, including System.Span<T>, System.ReadOnlySpan<T>, and types that implement System.Collections.Generic.IEnumerable<T> and have an Add method. This provides greater flexibility when working with methods that need to accept a variable number of arguments. In the code snippet below, the PrintNumbers method accepts a params parameter of type List<int>[], which means you can pass any number of List<int> arguments to the method.

```csharp
public void PrintNumbers(params List<int>[] numbersLists)
{
    foreach (var numbers in numbersLists)
    {
        foreach (var number in numbers)
        {
            Console.WriteLine(number);
        }
    }
}

PrintNumbers(new List<int> {1, 2, 3}, new List<int> {4, 5, 6}, new List<int> {7, 8, 9});
```

New lock object: System.Threading.Lock has been introduced to provide better thread synchronization through its API.

New escape sequence: You can use \e as a character literal escape sequence for the ESCAPE character, Unicode U+001B.

Method group natural type improvements: This feature makes small optimizations to overload resolution involving method groups.

Implicit indexer access in object initializers: The ^ operator allows us to use an indexer directly within an object initializer.

Conclusion

C# 13 and .NET 9 mark a crucial step in the advancement of C# programming and the .NET environment. The latest release brings a host of new features and improvements that enhance developer productivity, application performance, and security.
By staying up-to-date with these changes, developers can leverage these advancements to build more robust, efficient, and secure applications. Happy coding!
Organizations heavily rely on data analysis and automation to drive operational efficiency. In this piece, we will look into the basics of data analysis and automation, with examples written in Python, a high-level, general-purpose programming language.

What Is Data Analysis?

Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data in order to identify useful information, draw conclusions, and support decision-making. It is an essential activity that helps transform raw data into actionable insights. The following are the key steps involved in data analysis:

- Collecting: Gathering data from different sources.
- Cleaning: Removing or correcting inaccuracies and inconsistencies in the collected dataset.
- Transformation: Converting the collected dataset into a format that is suitable for further analysis.
- Modeling: Applying statistical or machine learning models to the transformed dataset.
- Visualization: Representing the findings visually with charts, graphs, and similar outputs, using tools such as MS Excel or Python's matplotlib library.

The Significance of Data Automation

Data automation involves the use of technology to execute repetitive tasks associated with handling large datasets, with minimal human intervention. Automating these processes can greatly improve efficiency, saving time for analysts who can then focus on more complex duties. Some common areas where it is employed include:

- Data ingestion: Automatically collecting and storing data from various sources.
- Data cleaning and transformation: Using scripts or tools (e.g., Python's Pandas library) to preprocess the collected dataset before performing other operations on it, such as modeling or visualization.
- Report generation: Creating automated reports or dashboards that refresh whenever new records arrive in the system.
- Data integration: Combining information obtained from multiple sources to get a holistic view for analysis further down the decision-making process.

Introduction to Python for Data Analysis

Python is a widely used programming language for data analysis due to its simplicity, readability, and the vast libraries available for statistical computing. Here are some simple examples that demonstrate how to read large datasets and perform basic analysis using Python.

Reading Large Datasets

Reading datasets into your environment is one of the initial stages in any data analysis project. For this, we will use the Pandas library, which provides powerful data manipulation and analysis tools.

```python
import pandas as pd

# Define the file path to the large dataset
file_path = 'path/to/large_dataset.csv'

# Specify the chunk size (number of rows per chunk)
chunk_size = 100000

# Initialize an empty list to store the results
results = []

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Perform basic analysis on each chunk
    # Example: Calculate the mean of a specific column
    chunk_mean = chunk['column_name'].mean()
    results.append(chunk_mean)

# Calculate the overall mean from the results of each chunk
# Note: averaging chunk means is exact only when all chunks have the same
# number of rows; the weighted approach shown later handles unequal chunks.
overall_mean = sum(results) / len(results)
print(f'Overall mean of column_name: {overall_mean}')
```

Basic Data Analysis

Once you have loaded the data, it is important to conduct a preliminary examination of it to familiarize yourself with its contents, as sketched below.
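The article does not include a snippet for this preliminary step, so here is a minimal sketch of what such an examination might look like, assuming a single manageable chunk of the dataset is loaded (the same calls can be run per chunk for very large files).

```python
import pandas as pd

# Load the first chunk of the dataset for a quick look
df = pd.read_csv('path/to/large_dataset.csv', nrows=100000)

# Shape, column names, and data types
print(df.shape)
print(df.dtypes)

# Summary statistics for numeric columns
print(df.describe())

# Count of missing values per column
print(df.isna().sum())

# Peek at the first few rows
print(df.head())
```

These few calls usually reveal the column types, value ranges, and missing-data patterns that guide the chunked processing shown in the following sections.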
Performing Aggregated Analysis

There are times when you might wish to perform a more advanced aggregated analysis over the entire dataset. For instance, let's say we want to find the sum of a particular column across the whole dataset by processing it in chunks.

```python
# Initialize a variable to store the cumulative sum
cumulative_sum = 0

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Calculate the sum of the specific column for the current chunk
    chunk_sum = chunk['column_name'].sum()
    cumulative_sum += chunk_sum

print(f'Cumulative sum of column_name: {cumulative_sum}')
```

Missing Values Treatment in Chunks

It is common to encounter missing values during data preprocessing. In the following example, missing values are filled using the mean of each chunk.

```python
# Initialize an empty list to store processed chunks
processed_chunks = []

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Fill missing values with the mean of the chunk
    chunk.fillna(chunk.mean(), inplace=True)
    processed_chunks.append(chunk)

# Concatenate all processed chunks into a single DataFrame
processed_data = pd.concat(processed_chunks, axis=0)
print(processed_data.head())
```

Final Statistics From Chunks

At times, there is a need to compute overall statistics from all chunks. This example illustrates how to compute the mean and standard deviation of an entire column by aggregating outcomes from each chunk.

```python
import numpy as np

# Initialize variables to store the cumulative sum, count, and sum of squares
cumulative_sum = 0
cumulative_count = 0
squared_sum = 0

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Calculate the sum, count, and sum of squares for the current chunk
    chunk_sum = chunk['column_name'].sum()
    chunk_count = chunk['column_name'].count()
    chunk_squared_sum = (chunk['column_name'] ** 2).sum()

    cumulative_sum += chunk_sum
    cumulative_count += chunk_count
    squared_sum += chunk_squared_sum

# Calculate the overall mean and standard deviation
overall_mean = cumulative_sum / cumulative_count
overall_std = np.sqrt((squared_sum / cumulative_count) - (overall_mean ** 2))

print(f'Overall mean of column_name: {overall_mean}')
print(f'Overall standard deviation of column_name: {overall_std}')
```

Conclusion

Reading large datasets in chunks with Python enables efficient data processing and analysis without overwhelming system memory. By taking advantage of Pandas' chunking functionality, various data analytics tasks can be performed on large datasets while ensuring scalability and efficiency. The examples above illustrate how to read large datasets in portions, handle missing values, and perform aggregated analysis, providing a strong foundation for working with huge amounts of data in Python.
In today's fast-paced digital world, maintaining a competitive edge requires integrating advanced technologies into organizational processes. Cloud computing has revolutionized how businesses manage resources, providing scalable and efficient solutions. However, the transition to cloud environments introduces significant security challenges. This article explores how leveraging high-level programming languages like Python and SQL can enhance cloud security and automate critical control processes.

The Challenge of Cloud Security

Cloud computing offers numerous benefits, including resource scalability, cost efficiency, and flexibility. However, these advantages come with increased risks such as data breaches, unauthorized access, and service disruptions. Addressing these security challenges is paramount for organizations relying on cloud services.

Strengthening Cloud Security With Python

Python's versatility makes it an ideal tool for enhancing cloud security. Its robust ecosystem of libraries and tools can be used for the following:

Intrusion Detection and Anomaly Detection

Python can analyze network traffic and logs to identify potential security breaches. For example, using libraries like Scapy and Pandas, security analysts can create scripts to monitor network anomalies.

```python
import scapy.all as scapy
import pandas as pd

def detect_anomalies(packets):
    # Analyze packets for anomalies (placeholder for your detection logic)
    pass

packets = scapy.sniff(count=100)
detect_anomalies(packets)
```

Real-Time Monitoring

Python's real-time monitoring capabilities help detect and respond to security incidents promptly. Using frameworks like Flask and Dash, organizations can build dashboards to visualize security metrics.

```python
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def dashboard():
    # Fetch and display real-time data
    return render_template('dashboard.html')

if __name__ == '__main__':
    app.run(debug=True)
```

Automating Security Tasks

Python can automate routine security tasks such as patching, policy enforcement, and vulnerability assessments. This automation reduces human error and ensures consistent execution of security protocols.

```python
import os

def apply_security_patches():
    os.system('sudo apt-get update && sudo apt-get upgrade -y')

apply_security_patches()
```

Automating Control Processes With SQL

SQL plays a critical role in managing and automating control processes within cloud environments. Key applications include:

Resource Provisioning and Scaling

SQL scripts can automate the provisioning and scaling of cloud resources, ensuring optimal utilization.

```sql
INSERT INTO ResourceManagement (ResourceType, Action, Timestamp)
VALUES ('VM', 'Provision', CURRENT_TIMESTAMP);
```

Backup and Recovery

SQL can automate backup and recovery processes, ensuring data protection and minimizing downtime.

```sql
CREATE EVENT BackupEvent
ON SCHEDULE EVERY 1 DAY
DO
  BACKUP DATABASE myDatabase TO 'backup_path';
```

Access Control

Automating access control using SQL ensures that only authorized users can access sensitive data.

```sql
GRANT SELECT, INSERT, UPDATE ON myDatabase TO 'user'@'host';
```

Integrating Python and SQL for Comprehensive Security

The synergy of Python and SQL provides a holistic approach to cloud security. By combining their strengths, organizations can achieve:

- Enhanced efficiency: Automation reduces manual intervention, speeding up task execution and improving resource utilization.
- Consistency and reliability: Automated processes ensure consistent execution of security protocols, reducing the risk of human error.
- Improved monitoring and reporting: Integrating Python with SQL allows for comprehensive monitoring and reporting, providing insights into system performance and security.

```python
import mysql.connector

def fetch_security_logs():
    db = mysql.connector.connect(
        host="your-database-host",
        user="your-username",
        password="your-password",
        database="your-database-name"
    )
    cursor = db.cursor()
    cursor.execute("SELECT * FROM SecurityLogs")
    logs = cursor.fetchall()
    for log in logs:
        print(log)

fetch_security_logs()
```
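To complement the read-side example above, here is a minimal sketch of the write side of the integration: a Python function that records an automated control action in the ResourceManagement table used in the SQL examples. The connection placeholders, table, and column names are the illustrative ones from this article, so adapt them to your environment.

```python
import mysql.connector

def log_control_action(resource_type: str, action: str) -> None:
    """Record an automated control action so it appears in audit reports."""
    db = mysql.connector.connect(
        host="your-database-host",
        user="your-username",
        password="your-password",
        database="your-database-name",
    )
    cursor = db.cursor()
    # A parameterized query avoids SQL injection when values come from automation scripts
    cursor.execute(
        "INSERT INTO ResourceManagement (ResourceType, Action, Timestamp) "
        "VALUES (%s, %s, CURRENT_TIMESTAMP)",
        (resource_type, action),
    )
    db.commit()
    cursor.close()
    db.close()

# Example: log that a VM was provisioned by an automation job
log_control_action("VM", "Provision")
```

Pairing this kind of write path with the reporting function above gives a simple closed loop: automation scripts record what they did, and monitoring scripts read it back for review.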
Conclusion

As organizations increasingly adopt cloud technologies, the importance of robust security measures cannot be overstated. Leveraging Python and SQL for cloud security and automation offers a powerful approach to addressing modern security challenges. By integrating these languages, organizations can build resilient, efficient, and secure cloud environments, ensuring they stay ahead in the competitive digital landscape.
Writing concise and effective Pandas code can be challenging, especially for beginners. That's where dovpanda comes in. dovpanda is an overlay for working with Pandas in an analysis environment. It tries to understand what you are trying to do with your data, helps you find easier ways to write your code, helps in identifying potential issues, and points you to new Pandas tricks, so that you ultimately write better code, faster. This guide will walk you through the basics of dovpanda with practical examples.

Introduction to dovpanda

dovpanda is your coding companion for Pandas, providing insightful hints and tips to help you write more concise and efficient Pandas code. It integrates seamlessly with your Pandas workflow, offering real-time suggestions for improving your code.

Benefits of Using dovpanda in Data Projects

1. Advanced Data Profiling

A lot of time can be saved using dovpanda, which performs comprehensive automated data profiling and provides detailed statistics and insights about your dataset. This includes:

- Summary statistics
- Anomaly identification
- Distribution analysis

2. Intelligent Data Validation

dovpanda offers intelligent data validation and suggests checks based on data characteristics. This includes:

- Uniqueness constraints: Unique constraint violations and duplicate records are identified.
- Range validation: Outliers (values out of range) are identified.
- Type validation: Ensures all columns have consistent and expected data types.

3. Automated Data Cleaning Recommendations

dovpanda gives automated cleaning tips, including:

- Data type conversions: Recommends appropriate conversions (e.g., converting strings to datetime or numeric types).
- Missing value imputation: Suggests methods such as mean, median, mode, or even more sophisticated imputation techniques.
- Outlier handling: Identifies outliers and suggests methods for handling them.
- Customizable suggestions: Suggestions are tailored to the specific code problems, and they can be customized and extended to fit your needs. This flexibility allows you to integrate domain-specific rules and constraints into your data validation and cleaning process.

4. Scalable Data Handling

It's crucial to employ strategies that ensure efficient handling and processing when working with large datasets. dovpanda offers several suggestions for this purpose:

- Vectorized operations: dovpanda advises using vectorized operations in Pandas, which are faster and more memory-efficient than loops.
- Memory usage: It provides tips for reducing memory usage, such as downcasting numeric types.
- Dask: dovpanda suggests converting Pandas DataFrames to Dask DataFrames for parallel processing.

5. Promotes Reproducibility

dovpanda provides standardized suggestions for data preprocessing tasks, ensuring consistency across different projects.

Getting Started With dovpanda

To get started with dovpanda, import it alongside Pandas. (Note: all the code in this article is written in Python.)

```python
import pandas as pd
import dovpanda
```

The Task: Bear Sightings

Let's say we want to spot bears and record the timestamps and types of bears we saw. In this example, we will analyze this data using Pandas and dovpanda. We are using the dataset bear_sightings_dean.csv, which contains a bear name along with the timestamp at which the bear was seen.
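The bear-sightings CSV files themselves are not included with the article, so if you want to follow along you can generate a similar file first. This is a minimal sketch and entirely hypothetical data: the 'bear' and 'timestamp' column names match the ones used below, while the bear species and time range are made up.

```python
import os
import random
import pandas as pd

# The article expects the CSV files to live in a 'data' folder
os.makedirs("data", exist_ok=True)

bears = ["black bear", "brown bear", "polar bear"]
timestamps = pd.date_range("2023-06-01 14:00", periods=50, freq="15min")

sample = pd.DataFrame({
    "bear": [random.choice(bears) for _ in timestamps],
    "timestamp": timestamps.astype(str),  # stored as text, as in the article's CSV
})

# Write it where the article's code looks for the data
sample.to_csv("data/bear_sightings_dean.csv")
print(sample.head())
```

With a file like this in place, the snippets in the next sections run as written and dovpanda has something to comment on.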
Reading a DataFrame

First, we'll read one of the data files containing bear sightings:

```python
sightings = pd.read_csv('data/bear_sightings_dean.csv')
print(sightings)
```

As soon as we load the dataset, dovpanda gives us suggestions like the following. Aren't these really helpful?

Output

The 'timestamp' column looks like a datetime but is of type 'object'. Convert it to a datetime type.

Let's implement these suggestions:

```python
sightings = pd.read_csv('data/bear_sightings_dean.csv', index_col=0)
sightings['bear'] = sightings['bear'].astype('category')
sightings['timestamp'] = pd.to_datetime(sightings['timestamp'])
print(sightings)
```

The 'bear' column holds categorical values, so astype('category') converts it into a categorical data type. For easy manipulation and analysis of date and time data, we use pd.to_datetime() to convert the 'timestamp' column to a datetime data type. After implementing the above suggestions, dovpanda gave more suggestions.

Combining DataFrames

Next, we want to combine the bear sightings from all our friends. The CSV files are stored in the 'data' folder:

```python
import os

all_sightings = pd.DataFrame()
for person_file in os.listdir('data'):
    with dovpanda.mute():
        sightings = pd.read_csv(f'data/{person_file}', index_col=0)
        sightings['bear'] = sightings['bear'].astype('category')
        sightings['timestamp'] = pd.to_datetime(sightings['timestamp'])
        all_sightings = all_sightings.append(sightings)
```

Here, all_sightings is the new DataFrame being built up. os.listdir('data') lists all the files in the 'data' directory, and person_file is the loop variable that iterates over each item in that directory. dovpanda.mute() silences dovpanda while each file is read, and all_sightings.append(sightings) appends the current sightings DataFrame to all_sightings. This results in a single DataFrame containing all the data from the individual CSV files. (Note that DataFrame.append has since been deprecated and removed in newer Pandas releases, which is one more reason to prefer the pd.concat approach below.)

Here's the improved approach:

```python
sightings_list = []
with dovpanda.mute():
    for person_file in os.listdir('data'):
        sightings = pd.read_csv(f'data/{person_file}', index_col=0)
        sightings['bear'] = sightings['bear'].astype('category')
        sightings['timestamp'] = pd.to_datetime(sightings['timestamp'])
        sightings_list.append(sightings)

sightings = pd.concat(sightings_list, axis=0)
print(sightings)
```

sightings_list = [] is the empty list used to store each DataFrame created from reading the CSV files. Following dovpanda's suggestion, this cleaner version keeps the entire loop inside a single with dovpanda.mute() block, reducing overhead and making the code slightly more efficient, and concatenates everything once with pd.concat.

```python
sightings = pd.concat(sightings_list, axis=1)
sightings
```

If we concatenate along the wrong axis, as with axis=1 above, dovpanda is again on the job of giving suggestions.

Analysis

Now, let's analyze the data. We'll count the number of bears observed each hour:

```python
sightings['hour'] = sightings['timestamp'].dt.hour
print(sightings.groupby('hour')['bear'].count())
```

Output

```
hour
14    108
15     50
17     55
18     58
Name: bear, dtype: int64
```

Grouping by time is better done with Pandas' time-specific methods, and dovpanda tells us how to do so.
dovpanda gave this suggestion on the code. Using the suggestion:

```python
sightings.set_index('timestamp', inplace=True)
print(sightings.resample('H')['bear'].count())
```

Advanced Usage of dovpanda

dovpanda offers advanced features like muting and unmuting hints:

- To mute dovpanda: dovpanda.set_output('off')
- To unmute and display hints: dovpanda.set_output('display')

You can also shut dovpanda down completely or restart it as needed:

- Shutdown: dovpanda.shutdown()
- Start: dovpanda.start()

Conclusion

dovpanda can be considered a friendly guide for writing better Pandas code. You get real-time hints and tips while coding; it helps you optimize your code, spot issues, and learn new Pandas tricks along the way. Whether you're a beginner or an experienced data analyst, dovpanda can make your coding journey smoother and more efficient.
Machine learning continues to be one of the most rapidly advancing and in-demand fields of technology. Machine learning, a branch of artificial intelligence, enables computer systems to learn and adopt human-like qualities, ultimately leading to the development of artificially intelligent machines. Eight key human-like qualities that can be imparted to a computer using machine learning, as part of the field of artificial intelligence, are presented in the table below.

| Human Quality | AI Discipline (using an ML approach) |
|---|---|
| Sight | Computer Vision |
| Speech | Natural Language Processing (NLP) |
| Locomotion | Robotics |
| Understanding | Knowledge Representation and Reasoning |
| Touch | Haptics |
| Emotional Intelligence | Affective Computing (aka Emotion AI) |
| Creativity | Generative Adversarial Networks (GANs) |
| Decision-Making | Reinforcement Learning |

However, the process of creating artificial intelligence requires large volumes of data. In machine learning, the more data we have and train the model on, the better the model (AI agent) becomes at processing the given prompts or inputs and, ultimately, at doing the task(s) for which it was trained.

This data is not fed into the machine learning algorithms in its raw form. The data must first undergo various inspections and phases of cleansing and preparation before it is fed into the learning algorithms. We call this phase of the machine learning life cycle the data preprocessing phase. As implied by the name, this phase consists of all the operations and procedures that will be applied to our dataset (rows/columns of values) to bring it into a cleaned state, so that it will be accepted by the machine learning algorithm to start the training/learning process.

This article will discuss the most popular data preprocessing techniques used for machine learning. We will explore various methods to clean, transform, and scale our data. All exploration and practical examples are done using Python code snippets to give you hands-on experience of how these techniques can be implemented effectively in your machine learning project.

Why Preprocess Data?

The literal, holistic reason for preprocessing data is so that the data is accepted by the machine learning algorithm and the training process can begin. However, if we look at the intrinsic inner workings of the machine learning framework itself, more reasons can be provided. The table below discusses five key reasons (advantages) for preprocessing your data for the subsequent machine learning task.

| Reason | Explanation |
|---|---|
| Improved data quality | Data preprocessing ensures that your data is consistent, accurate, and reliable. |
| Improved model performance | Data preprocessing allows your AI model to capture trends and patterns at deeper and more accurate levels. |
| Increased accuracy | Data preprocessing allows the model evaluation metrics to be better and to reflect a more accurate overview of the ML model. |
| Decreased training time | By feeding the algorithm data that has been cleaned, you allow the algorithm to run at its optimum level, reducing computation time and removing unnecessary strain on computing resources. |
| Feature engineering | By preprocessing your data, the machine learning practitioner can gauge the impact that certain features have on the model and select the features that are most relevant for model construction. |

In its raw state, data can contain a magnitude of errors and noise. Data preprocessing seeks to clean and free the data from these errors.
Common challenges experienced with raw data include, but are not limited to, the following:

- Missing values: Null values or NaN (Not-a-Number) entries
- Noisy data: Outliers or incorrectly captured data points
- Inconsistent data: Different data formatting inside the same file
- Imbalanced data: Unequal class distributions (experienced in classification tasks)

In the following sections of this article, we will work through hands-on examples of data preprocessing. (The full, runnable code for every example below is collected in the Code section at the end of this article.)

Data Preprocessing Techniques in Python

The frameworks that we will utilize to work with practical examples of data preprocessing:

- NumPy
- Pandas
- scikit-learn

Handling Missing Values

The most popular techniques to handle missing values are removal and imputation. It is interesting to note that, irrespective of what operation you are trying to perform, if there is at least one null (NaN) value inside your calculation or process, then the entire operation will fail and evaluate to a NaN (null/missing/error) value.

Removal

This is when we remove the rows or columns that contain the missing value(s). This is typically done when the proportion of missing data is relatively small compared to the entire dataset.

Imputation

This is when we replace the missing values in our data with substituted values; the substituted value is commonly the mean, median, or mode of that column. The term given to this process is imputation.

Handling Noisy Data

Our data is said to be noisy when outliers or irrelevant data points are present. This noise can distort our model and, therefore, our analysis. Common preprocessing techniques for handling noisy data include smoothing and binning.

Smoothing

This data preprocessing technique involves employing operations such as moving averages to reduce noise and identify trends, allowing the essence of the data to be encapsulated.

Binning

This is a common process in statistics that follows the same underlying logic in machine learning data preprocessing. It involves grouping our data into bins to reduce the effect of minor observation errors.

Data Transformation

This data preprocessing technique plays a crucial role in shaping data for algorithms that require numerical features as input. Data transformation deals with converting our raw data into a suitable format or range for our machine learning algorithm to work with, and it is a crucial step for distance-based machine learning algorithms. The key data transformation techniques are normalization and standardization. As implied by their names, these operations rescale the data within our features to a standard range or distribution.

Normalization

This data preprocessing technique scales our data to a range of [0, 1] (inclusive of both numbers) or [-1, 1] (inclusive of both numbers). It is useful when our features have different ranges and we want to bring them to a common scale.

Standardization

Standardization scales our data to have a mean of 0 and a standard deviation of 1. It is useful when the features have different units of measurement or distributions.

Encoding Categorical Data

Machine learning algorithms most often require the feature matrix (input data) to be numerical. However, our dataset may contain textual (categorical) data.
Thus, all categorical (textual) data must be converted into a numerical format before feeding the data into the machine learning algorithm. The most commonly implemented techniques for handling categorical data are one-hot encoding (OHE) and label encoding.

One-Hot Encoding

This data preprocessing technique converts categorical values into binary vectors: each unique category becomes its own column in the data frame, and whether an observation (row) contains that value is represented by a 1 or 0 in the new column.

Label Encoding

This is when our categorical values are converted into integer labels; each unique category is assigned a unique integer to represent it. In the label-encoding example in the Code section, the encoding is done as follows:

- 'Blue' -> 0
- 'Green' -> 1
- 'Red' -> 2

(P.S.: the numerical assignment is zero-indexed, as with all collection types in Python.)

Feature Extraction and Selection

As implied by the name, this pair of techniques covers two related ideas: feature selection involves the machine learning practitioner selecting the most important features from the data, while feature extraction transforms the data into a reduced set of features.

Feature Selection

This data preprocessing technique helps us identify and select the features from our dataset that have the most significant impact on the model. Selecting the best features improves the performance of our model and reduces overfitting.

Correlation Matrix

This is a matrix that helps us identify features that are highly correlated, allowing us to remove redundant features. Correlation coefficients range from -1 to 1, where values closer to -1 or 1 indicate stronger correlation, while values closer to 0 indicate weaker or no correlation.

Chi-Square Statistic

The chi-square statistic is a test that measures the independence of two categorical variables. It is very useful when we are performing feature selection on categorical data. It calculates a p-value for each feature, which tells us how useful that feature is for the task at hand.

The output of the chi-square scores consists of two arrays:

- The first array contains the chi-square statistic values for each feature.
- The second array contains the p-values corresponding to each feature.

In our example:

- For the first feature, the chi-square statistic value is 0.0 and the p-value is 1.0.
- For the second feature, the chi-square statistic value is 3.0 and the p-value is approximately 0.083.

The chi-square statistic measures the association between the feature and the target variable; a higher chi-square value indicates a stronger association, meaning the feature is more useful in guiding the model toward the desired target output. The p-value measures the probability of observing the chi-square statistic under the null hypothesis that the feature and the target are independent. A low p-value (typically < 0.05) indicates that the association between the feature and the target is statistically significant.

For our first feature, the chi-square value is 0.0 and the p-value is 1.0, indicating no association with the target variable. For the second feature, the chi-square value is 3.0 and the corresponding p-value is approximately 0.083, which suggests there might be some association between our second feature and the target variable.
Keep in mind that we are working with dummy data; in the real world, the data will give you a lot more variation and points of analysis.

Feature Extraction

This is a data preprocessing technique that allows us to reduce the dimensionality of the data by transforming it into a new set of features. Model performance can often be improved substantially by employing feature selection and extraction techniques.

Principal Component Analysis (PCA)

PCA is a dimensionality-reduction technique that transforms our data into a set of orthogonal (right-angled) components, capturing the most variance present in our features.

With this, we have successfully explored a variety of the most commonly used data preprocessing techniques in Python machine learning tasks.

Conclusion

In this article, we explored popular data preprocessing techniques for machine learning with Python. We began by understanding the importance of data preprocessing and then looked at the common challenges associated with raw data. We then dove into various preprocessing techniques with hands-on examples in Python.

Ultimately, data preprocessing is a step that cannot be skipped in your machine learning project lifecycle. Even if there are no changes or transformations to be made to your data, it is always worth the effort to apply these techniques where applicable, because in doing so you ensure that your data is cleaned and transformed for your machine learning algorithm; as a result, your subsequent machine learning model development will see improvements in factors such as model accuracy, computational complexity, and interpretability.

In conclusion, data preprocessing lays the foundation for successful machine learning projects. By paying attention to data quality and employing appropriate preprocessing techniques, we can unlock the full potential of our data and build models that deliver meaningful insights and actionable results.
Code

```python
# -*- coding: utf-8 -*-
"""
@author: Karthik Rajashekaran
"""

# we import the necessary frameworks
import pandas as pd
import numpy as np

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# TECHNIQUE: ROW REMOVAL -> we remove rows with any missing values
df_cleaned = df.dropna()
print("Row(s) With Null Value(s) Deleted:\n" + str(df_cleaned), "\n")

# TECHNIQUE: COLUMN REMOVAL -> we remove columns with any missing values
df_cleaned_columns = df.dropna(axis=1)
print("Column(s) With Null Value(s) Deleted:\n" + str(df_cleaned_columns), "\n")

#%%
# IMPUTATION

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we impute the missing values in 'A' with the mean and in 'B' with the median
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())
print("DataFrame After Imputation:\n" + str(df), "\n")

#%%
# SMOOTHING

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we calculate the moving average for smoothing
df['A_smoothed'] = df['A'].rolling(window=2).mean()
print("Smoothed Column A DataFrame:\n" + str(df), "\n")

#%%
# BINNING

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we bin the data into discrete intervals
bins = [0, 5, 10, 15]
labels = ['Low', 'Medium', 'High']

# we apply the binning on column 'C'
df['Binned'] = pd.cut(df['C'], bins=bins, labels=labels)
print("DataFrame Binned Column C:\n" + str(df), "\n")

#%%
# NORMALIZATION

# we import the necessary frameworks
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply min-max normalization to our data using sklearn
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Normalized DataFrame:\n" + str(df_normalized), "\n")

#%%
# STANDARDIZATION

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we import 'StandardScaler' from sklearn
from sklearn.preprocessing import StandardScaler

# we apply standardization to our data
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Standardized DataFrame:\n" + str(df_standardized), "\n")

#%%
# ONE-HOT ENCODING

# we import the necessary framework
from sklearn.preprocessing import OneHotEncoder

# we create dummy data to work with
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply one-hot encoding to our categorical features
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Color']))
print("OHE DataFrame:\n" + str(encoded_df), "\n")
```
```python
#%%
# LABEL ENCODING

# we import the necessary framework
from sklearn.preprocessing import LabelEncoder

# we create dummy data to work with
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply label encoding to our dataframe
label_encoder = LabelEncoder()
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])
print("Label Encoded DataFrame:\n" + str(df), "\n")

#%%
# CORRELATION MATRIX

# we import the necessary frameworks
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [5, 4, 3, 2, 1]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we compute the correlation matrix of our features
correlation_matrix = df.corr()

# we visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

#%%
# CHI-SQUARE STATISTIC

# we import the necessary frameworks
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# we create dummy data to work with
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': ['A', 'B', 'A', 'B', 'A'],
        'Label': [0, 1, 0, 1, 0]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we encode the categorical features in our dataframe
label_encoder = LabelEncoder()
df['Feature2_encoded'] = label_encoder.fit_transform(df['Feature2'])
print("Encoded DataFrame:\n" + str(df), "\n")

# we apply the chi-square statistic to our features
X = df[['Feature1', 'Feature2_encoded']]
y = df['Label']
chi_scores = chi2(X, y)
print("Chi-Square Scores:", chi_scores)

#%%
# PRINCIPAL COMPONENT ANALYSIS

# we import the necessary framework
from sklearn.decomposition import PCA

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [5, 4, 3, 2, 1]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply PCA to our features
pca = PCA(n_components=2)
df_pca = pd.DataFrame(pca.fit_transform(df), columns=['PC1', 'PC2'])

# we print the dimensionality-reduced features
print("PCA Features:\n" + str(df_pca), "\n")
```
Since the recent release of the GQL (Graph Query Language) standard by ISO, there have been many discussions among graph database vendors and research institutions about how it will influence the industry. Its momentum is backed by the wide application of graph databases across diverse sectors, from recommendation engines to supply chains, which creates the need for a standard, unified language for querying and managing graph databases. The significance of GQL lies in its ability to replace multiple database-specific query languages with a single, standardized one. This facilitates interoperability between graph databases and reduces dependence on particular graph database vendors. Moreover, beyond defining a query language, GQL standardizes what a graph database should be and what key characteristics it should have, laying a far-reaching foundation for the development of the graph database industry. In this article, I will walk you through some important terms of GQL and explore its transformative potential for the industry.

Key Terms and Definitions of GQL

GQL aims to establish a unified, declarative graph database query language that is compatible with modern data types and can intuitively express the complex logic of a graph. It defines a comprehensive and robust framework for interacting with property graph databases, including DQL, DML, and DDL, providing a modern and flexible approach to graph data management and analysis. Below are some key definitions of GQL that developers and users of graph databases should be aware of.

Property Graph Data Model

GQL operates on a data model consisting of nodes (vertices) and edges (relationships), which allows for pattern-based analysis and flexible data addition. The data model is specifically tailored to property graph databases: GQL builds on relatively mature graph query languages with wide adoption, absorbing their advantages into the new standard. The Resource Description Framework (RDF), another type of graph data model, is not included in GQL as a standard graph data model. With the GQL definition, it is apparent that the property graph data model is the de facto standard.

Graph Pattern Matching (GPM)

The GPM language defined by GQL enables users to write simple queries for complex data analysis. While traditional graph database query languages support single pattern matching, GQL further facilitates complex matching across multiple patterns. For example, GQL supports path aggregation, grouping variables, and nested pattern matching with optional filtering, offering the expressive power to handle more sophisticated business logic.

GQL Schema

GQL allows for both schema-free graphs, which accept any data, and mandatory-schema graphs, which are constrained by a predefined graph type specified in a "GQL schema." This dual approach caters to a wide range of data management needs, from the flexibility of schema-free graphs to the precision of schema-constrained ones. Schema-free graphs allow adding new attributes to nodes or relationships at any time without modifying the data model. This adaptability is beneficial when dealing with complex and changing data, but it also shifts the burden of data management, such as maintaining data consistency and data quality, onto developers. By contrast, the mandatory-schema graph offers a rigid framework that guarantees data consistency and integrity.
The deterministic data structure within a mandatory schema makes any data change clear and manageable. Furthermore, the predefined structure enhances the comprehensibility and usability of data, which enables optimized query processing for both users and systems. While mandatory-schema graphs may sacrifice some flexibility, the trade-off is often justified in production environments where data structures are well-defined and the output data follows regular patterns.

Graph Types

Graph types are templates that restrict the contents of a graph to specific node and edge types, offering a certain level of data control and structure. Under the GQL definition, a graph type can be applied to multiple graphs, meaning the same graph structure can be shared across different applications, which makes it more flexible. For example, a business's data might differ across departments and regions, and data permissions may need to be isolated from one another. In this situation, using the same graph type facilitates business management, since multiple graphs sharing one graph type support permission management and compliance with data privacy regulations.

Notable Advancements of GQL

Separation of GQL-Catalog and GQL-Data

GQL defines a persistent and extensible catalog for the runtime environment, modeled on SQL's catalog: the GQL-catalog. It lists the stored data objects, including metadata such as graphs, graph types, procedures, and functions. The GQL-catalog can be maintained and upgraded independently of the data itself, which allows for flexible permission management and a unified, standardized approach to catalog management.

Multi-Graph Joint Query

GQL enables multi-graph joint queries. By using different graph expressions in a query, users can perform operations such as unions, conditional rules, and joins across different graphs. This capability benefits scenarios such as anti-fraud investigations and the integration of public and private knowledge graphs, where cross-referencing public and private datasets is crucial. These scenarios require both data isolation and integrated analysis because of data compliance, maintenance, and other constraints: the data needs to be split into multiple graphs, yet combined to fulfill a single business requirement.

Supporting Undirected Graphs

Unlike earlier definitions of graph databases, in which relationships always have a direction, GQL allows undirected graphs. In some scenarios, relationships between vertices naturally have no direction, such as friendships. While these relationships could be modeled as directed, doing so would require two separate edges, complicating both modeling and querying.

Conclusion

In summary, the standardization of GQL is a significant step forward for the graph database industry. Not only does it simplify the user experience, but it also defines, with reference to real-world use cases, what property graph databases are and what features they should offer. It boosts the transformative potential of graph databases across all the industries where they are used.
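To make the pattern-matching and undirected-edge capabilities described above more concrete, here is a small illustrative query. It is only a sketch in generic GQL/Cypher-style pattern syntax rather than any specific vendor's dialect; the Person label, FRIENDS_WITH edge type, and property names are hypothetical.

GQL
// friendships are naturally undirected, so the edge pattern omits a direction
MATCH (p:Person {name: 'Alice'})-[:FRIENDS_WITH]-(f:Person)-[:FRIENDS_WITH]-(fof:Person)
WHERE fof.name <> 'Alice'
RETURN DISTINCT fof.name AS friend_of_friend

A directed model would need two FRIENDS_WITH edges per friendship, which is exactly the modeling and querying overhead the undirected option avoids.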
Regression analysis is a technique for estimating the value of a dependent variable based on the values of independent variables. These models are widely used to study the impact of independent variables on a dependent variable. In this article, we will focus on estimating revenue (the dependent variable) based on historical demand (the independent variables) coming from demand channels such as call, chat, and web inquiries. I will use Python libraries like statsmodels and sklearn to develop an algorithm to forecast revenue. Being able to predict revenue empowers businesses to strategize their investments and prioritize the customers and demand channels that will grow revenue.

Data Ingestion and Exploration

In this section, I will describe the test data, data ingestion using pandas, and the data exploration used to get familiar with the data. I am using a summarized demand gen dataset containing the following data attributes: call, chat, and Web_Inquiry demand volumes along with the resulting Revenue. The following code block ingests the input file “DemandGenRevenue.csv” and lists sample records.

Python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv("DemandGenRevenue.csv")
df.head()

Python
df.columns
df.info()
df.describe().T

The following code can be used to draw scatter plots and explore the linearity assumption between the independent variables (call, chat, and Web_Inquiry) and the dependent variable (Revenue).

Python
sns.pairplot(df, x_vars=["call", "chat", "Web_Inquiry"], y_vars="Revenue", kind="reg")

Let's explore the normality assumption of the dependent variable, Revenue, using histograms.

Python
df.hist(bins=20)

Before we start working on the model, let's explore the relationship between each independent variable and the dependent variable using linear regression plots.

Python
sns.lmplot(x='call', y='Revenue', data=df)
sns.lmplot(x='chat', y='Revenue', data=df)
sns.lmplot(x='Web_Inquiry', y='Revenue', data=df)

Forecasting Model

In this section, I will cover model preparation using the statsmodels and sklearn libraries. We will build a linear regression model based on the demand coming from calls, chats, and web inquiries.

Python
X = df.drop('Revenue', axis=1)
y = df[["Revenue"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46)

The following code builds a linear regression model to forecast the revenue.

Python
lin_model = sm.ols(formula="Revenue ~ call + chat + Web_Inquiry", data=df).fit()
print(lin_model.params, "\n")

Use the code below to explore the coefficients of the linear model.

Python
print(lin_model.summary())

The code below can be used to define several models and loop through them to forecast; for simplicity's sake, we will only focus on the linear regression model.
Python
results = []
names = []
models = [('LinearRegression', LinearRegression())]

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    result = np.sqrt(mean_squared_error(y_test, y_pred))
    results.append(result)
    names.append(name)
    msg = "%s: %f" % (name, result)
    print(msg)

Now that we have the model ready, let's try to forecast revenue based on new inputs. If we were to get 1,000 calls, 650 chats, and 725 web inquiries, then based on the historical data we can expect $64.5M of revenue.

Python
new_data = pd.DataFrame({'call': [1000], 'chat': [650], 'Web_Inquiry': [725]})
Forecasted_Revenue = lin_model.predict(new_data)
print("Forecasted Revenue:", int(Forecasted_Revenue))

The code below provides another set of inputs to test the model: if the demand center receives 2,000 calls, 1,200 chats, and 250 web inquiries, the model forecasts revenue of $111.5M.

Python
new_data = pd.DataFrame({'call': [2000], 'chat': [1200], 'Web_Inquiry': [250]})
Forecasted_Revenue = lin_model.predict(new_data)
print("Forecasted Revenue:", int(Forecasted_Revenue))

Conclusion

Python offers multiple libraries for implementing forecasting; statsmodels and sklearn lay a solid foundation for building a linear regression model that predicts outcomes from historical data. I would suggest continuing to explore Python for working with enterprise-wide sales and marketing data to analyze historical trends and run models that predict future sales and revenue. Darts is another Python library I would recommend for time series-based anomaly detection and user-friendly forecasting, with models ranging from ARIMA to deep neural networks.
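As a follow-on to the evaluation above: cross_val_score is already imported in the article but never used, so a natural extension is to cross-validate the linear model instead of relying on a single train/test split. The sketch below assumes the X and y defined earlier; the fold count and scoring key are illustrative choices, not part of the original workflow.

Python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# X and y are the feature matrix and Revenue target defined earlier in the article
# we evaluate the linear model with 5-fold cross-validation, using negative RMSE
# as the scoring metric (sklearn maximizes scores, hence the negative sign)
cv_model = LinearRegression()
cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='neg_root_mean_squared_error')

# we flip the sign so the values read as RMSE and report the mean across folds
print("Cross-validated RMSE per fold:", -cv_scores)
print("Mean RMSE:", np.mean(-cv_scores))

A cross-validated error gives a more stable picture of forecast quality on a small demand dataset than the single hold-out split used above.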
I’m a senior solution architect and polyglot programmer interested in the evolution of programming languages and their impact on application development. Around three years ago, I encountered WebAssembly (Wasm) through the .NET Blazor project. The technology caught my attention because it can execute applications at near-native speed across different programming languages. This was especially exciting to me as a polyglot programmer, since my expertise ranges across multiple languages including .NET, PHP, Node.js, Rust, and Go. Most of the work I do is building cloud-native enterprise applications, so I have been particularly interested in advancements that broaden Wasm’s applicability in cloud-native development. WebAssembly 2.0 was a significant leap forward, improving performance and flexibility while streamlining integration with web and cloud infrastructure, making Wasm an even more powerful tool for building versatile and dynamic cloud-native applications. I aim to share the knowledge and understanding I've gained, providing an overview of Wasm’s capabilities and its potential impact on the cloud-native development landscape.

Polyglot Programming and the Component Model

My initial attraction to WebAssembly stemmed from its ability to enhance browser functionality for graphics-intensive and gaming applications, breaking free from the limitations of traditional web development. It also allows developers to employ languages like C++ or Rust for high-efficiency computation and animation, offering near-native performance within the browser environment. Polyglot programming and the component model are two of Wasm’s flagship capabilities. The idea of leveraging the unique strengths of various programming languages within a single application environment seemed like the next leap in software development, promoting a more efficient and versatile development process. For instance, developers could use Rust's speed for performance-critical components and .NET's comprehensive library support for business logic, optimizing both development efficiency and application performance. This led me to Spin, an open-source tool for creating and deploying Wasm applications in cloud environments. To test Wasm’s polyglot programming capabilities, I experimented with the plugin and middleware models. I put the application business logic into one component and used another, Spin-supported component for the host capabilities (I/O, random, sockets, etc.) needed to work with the host. Finally, I composed these with http-auth-middleware, an existing Spin component for OAuth 2.0, and wrote further components for logging, rate limiting, and so on. All of them were composed into one app and run in the host world (the component model).

Cloud-Native Coffeeshop App

The first app I wrote using WebAssembly was an event-driven microservices coffeeshop app written in Golang and deployed using Nomad, Consul Connect, Vault, and Terraform (you can see it on my GitHub). I was curious about how it would work with Kubernetes, and then Dapr. I expanded it and wrote several use cases with Dapr, such as entire apps built with Spin, polyglot apps (Spin alongside other container apps with Docker), Spin apps with Dapr, and others.
What I like about it is the start-up speed (it’s very quick to get up and running) and the size of the app: it is tiny but powerful. The WebAssembly ecosystem has matured a lot in the past year with respect to enterprise projects. For the types of cloud-native projects I’d like to pursue, it would benefit from a more developed support system for stateful applications, as well as an integrated messaging system between components. I would love to see more of the capabilities my enterprise customers need, such as gRPC or other communication protocols (Spin currently only supports HTTP), data processing and transformation such as data pipelines, a multi-threading mechanism, CQRS, polyglot language aggregation (internal modular-monolith style or external microservices style), and content negotiation (XML, JSON, plain text). We also need real-world examples demonstrating Wasm’s ability to tackle enterprise-level challenges, fostering better understanding and wider technology adoption. We can see how well ZEISS is doing from their presentation at KubeCon in Paris last month. I would like to see more companies like them get involved; from the developer perspective, we would benefit a lot. Not only could we develop WebAssembly apps more easily, but many more enterprise scenarios would be addressed, and we would work together to make WebAssembly more handy and effective.

The WebAssembly Community

Sharing my journey with the WebAssembly community has been a rewarding part of my exploration, especially with the Spin community, who have been so helpful in sharing best practices and new ideas. Through tutorials and presentations at community events, I've aimed to contribute to the collective understanding of WebAssembly and cloud-native development, and I hope to see more people sharing their experiences. I will continue creating tutorials and educational content, as well as diving into new projects using WebAssembly, to inspire and educate others about its potential. I would encourage anyone getting started to get involved in the Wasm community of your choice to accelerate your journey.

WebAssembly’s Cloud-Native Future

I feel positive about the potential for WebAssembly to change how we do application development, particularly in the cloud-native space. I’d like to explore how Wasm could underpin the development of hybrid cloud platforms and domain-specific applications. One particularly exciting prospect is the potential for building an e-commerce platform on WebAssembly, leveraging its cross-platform capabilities and performance benefits to offer a superior user experience. The plugin model has existed for a long time in the e-commerce world (see what Shopify did), and with WebAssembly’s component model we can build applications with polyglot programming languages such as Rust, Go, TypeScript, .NET, Java, PHP, etc. WebAssembly 2.0 supports the development of more complex and interactive web applications, opening the door to new use cases such as serverless stateless functions, data transformation, and full-fledged web API functionality, and moving onto edge devices (including some embedded components). New advancements like WASI 3.0 with asynchronous components are bridging the remaining gaps. I eagerly anticipate the further impact of WebAssembly on how we build and deploy applications. We’re just getting started.