Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that comes a growing need to increase operational efficiency as customer demands rise. AI platforms have become increasingly sophisticated, creating a corresponding need to establish guidelines and ownership. In DZone’s 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot, and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience on practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
Distributed SQL Essentials
There are many ways to handle ID generation in PostgreSQL, but I’ve chosen to investigate these four approaches: Auto-incrementing (SERIAL data type) Sequence-caching Sequence-incrementing with client-side ID management UUID-generation Depending on your application and your underlying database tables, you might choose to employ one or more of these options. Below, I’ll explain how each can be achieved in Node.js using the Sequelize ORM. 1. Auto-Incrementing Most developers choose the most straightforward option before exploring potential optimizations, and I’m no different. Here’s how you can create an auto-incrementing ID field in your Sequelize model definitions: JavaScript // Sequelize const { DataTypes } = require('sequelize'); const Product = sequelize.define( "product", { id: { type: DataTypes.INTEGER, autoIncrement: true, primaryKey: true, }, title: { type: DataTypes.STRING, } } ); If you’re familiar with Sequelize, you’ll be no stranger to this syntax, but others might wonder what’s actually happening under the hood. The autoIncrement flag tells PostgreSQL to create an id column with a SERIAL data type. This data type implicitly creates a SEQUENCE, which is owned by the products table’s id column. SQL // PostgreSQL equivalent CREATE SEQUENCE products_id_seq; CREATE TABLE products ( id INT NOT NULL DEFAULT NEXTVAL('products_id_seq'), title VARCHAR(255) ); When inserting a product into our table, we don’t need to supply a value for id, as it’s automatically generated from the underlying sequence. We can simply run the following to insert a product: JavaScript // Sequelize await Product.create({title: "iPad Pro"}); SQL // PostgreSQL equivalent INSERT INTO products (title) VALUES ('iPad Pro'); Dropping our table will also drop the automatically created sequence, products_id_seq: JavaScript // Sequelize await Product.drop(); SQL // PostgreSQL equivalent DROP TABLE products CASCADE; Although this approach is extremely easy to implement, our PostgreSQL server needs to access the sequence to get its next value on every write, which comes at a latency cost. This is particularly bad in distributed deployments. Now that we have the basics out of the way, let’s try to speed things up. As we all know, “cache is king.” 2. Sequence-Caching Although the autoIncrement flag in the Sequelize model definition totally eliminates the need to interact with sequences directly, there are scenarios where you might consider doing so. For instance, what if you wanted to speed up writes by caching sequence values? Fear not, with a little extra effort, we can make this happen. Sequelize doesn’t have API support to make this happen, as noted on GitHub, but there’s a simple workaround. By utilizing the built-in literal function, we are able to access a predefined sequence in our model: JavaScript const { literal, DataTypes } = require('sequelize'); const Product = sequelize.define("product", { id: { type: DataTypes.INTEGER, primaryKey: true, defaultValue: literal("nextval('custom_sequence')"), }, }); sequelize.beforeSync(async () => { await sequelize.query('CREATE SEQUENCE IF NOT EXISTS custom_sequence CACHE 50'); }); await sequelize.sync(); That’s not too bad. So, this is what changed: We’ve created our own sequence, named custom_sequence, which is used to set the default value for our product ID. This sequence is created in the beforeSync hook, so it exists before the products table, and its CACHE value has been set to 50. The defaultValue is set to the next value in our custom sequence.
Well, what about the cache? Sequences in PostgreSQL can optionally be supplied a CACHE value upon creation, which allots a certain number of values to be stored in memory per session. With our cache set at 50, here’s how that works: SQL //Database Session A > SELECT nextval('custom_sequence'); 1 > SELECT nextval('custom_sequence'); 2 //Database Session B > SELECT nextval('custom_sequence'); 51 > SELECT nextval('custom_sequence'); 52 For an application with multiple database connections, such as one running microservices or multiple servers behind a load balancer, each connection will receive a set of cached values. No session will contain duplicate values in its cache, ensuring there are no collisions when inserting records. In fact, depending on how your database is configured, you might find gaps in your sequenced id column if a database connection fails and is restarted without using all the values allotted in its cache. However, this generally isn’t a problem, as we’re only concerned with uniqueness. So, what’s the point? Speed. Speed is the point! By caching values on our PostgreSQL backend and storing them in memory, we’re able to retrieve the next value very quickly. This allows the database to scale, without needing to repeatedly obtain the next sequence value from the master node on writes. Of course, caching comes with the drawback of an increased memory constraint on the PostgreSQL server. Depending on your infrastructure, this could be a worthy optimization. 3. Client-Side Sequencing Sequence-caching improves performance by caching values on our PostgreSQL backend. How could we use a sequence to cache values on our client instead? Sequences in PostgreSQL have an additional parameter called INCREMENT BY that can be used to achieve this: JavaScript // DB Initialization const { literal, DataTypes } = require('sequelize'); const Product = sequelize.define("product", { id: { type: DataTypes.INTEGER, primaryKey: true }, }); sequelize.beforeSync(async () => { await sequelize.query('CREATE SEQUENCE IF NOT EXISTS custom_sequence INCREMENT BY 50'); }); await sequelize.sync(); // Caller let startVal = await sequelize.query("SELECT nextval('custom_sequence')"); let limit = startVal + 50; if (startVal >= limit) { startVal = await sequelize.query("SELECT nextval('custom_sequence')"); limit = startVal + 50; } await Product.create({id: startVal, title: "iPad Pro"}); startVal += 1; Here, we’re utilizing our custom sequence in a slightly different way. No default value is supplied to our model definition. Instead, we’re using this sequence to set unique values client-side, by looping through the values in the increment range. When we’ve exhausted all of the values in this range, we make another call to our database to get the next value in our sequence to “refresh” our range. Here’s an example: SQL // Database Session A > SELECT nextval('custom_sequence'); 1 * inserts 50 records // id 1 // id 2 ... // id 50 * > SELECT nextval('custom_sequence'); 151 // Database Session B > SELECT nextval('custom_sequence'); 51 * inserts 50 records before Session A has used all numbers in its range * > SELECT nextval('custom_sequence'); 101 Database Session A connects and receives the first value in the sequence. Database Session B connects and receives the value of 51 because we’ve set our INCREMENT BY value to 50. Like our auto-incrementing solutions, we can ensure there are no ID collisions by referencing our PostgreSQL sequence to determine the start value for our range. What Problems Might Arise From This Solution?
Well, it’s possible that a database administrator could choose to increase or decrease the INCREMENT BY value for a particular sequence, without application developers being notified of this change. This would break application logic. How Can We Benefit From Client-Side Sequencing? If you have a lot of available memory on your application server nodes, this could be a potential performance benefit over sequence-caching on database nodes. In fact, you might be wondering if it’s possible to utilize a cache on the client and server in the same implementation. The short answer is yes. By creating a sequence with CACHE and INCREMENT BY values, we benefit from a server-side cache of our sequence values and a client-side cache for the next value in our range. This performance optimization provides the best of both worlds if memory constraints are not of primary concern. Enough with the sequences already. Let’s move on to unique identifiers. 4. UUID-Generation So far, we’ve covered three ways to generate sequential, integer-based IDs. Another data type, the Universally Unique Identifier (UUID), removes the need for sequences entirely. A UUID is a 128-bit identifier, which comes with a practical guarantee of uniqueness due to the incredibly small probability that the same ID would be generated twice. PostgreSQL comes with an extension called pgcrypto, which can be installed to generate UUIDs with the gen_random_uuid function. This function generates a UUID value for a database column, much the same way that nextval is used with sequences. Additionally, Node.js has several packages which generate UUIDs, such as, you guessed it, uuid. JavaScript // Sequelize const { literal, DataTypes } = require('sequelize'); const Product = sequelize.define( "product", { id: { type: DataTypes.UUID, defaultValue: literal('gen_random_uuid()'), primaryKey: true, }, title: { type: DataTypes.STRING, } } ); sequelize.beforeSync(async () => { await sequelize.query('CREATE EXTENSION IF NOT EXISTS "pgcrypto"'); }); SQL // PostgreSQL Equivalent CREATE TABLE products ( id UUID NOT NULL DEFAULT gen_random_uuid(), title VARCHAR(255) ); This allows us to generate a UUID client-side, with a server-side default, if required. A UUID-based approach brings unique benefits: the random nature of the data type can be helpful with certain data migrations. This is also helpful for API security, as the unique identifier is in no way tied to the information being stored. Additionally, the ability to generate an ID client-side without managing state is helpful in a distributed deployment, where network latencies play a big role in application performance. For example, in a geo-partitioned YugabyteDB cluster, connections are made to the nearest database node to serve low-latency reads. However, on writes, this node must forward the request to the primary node in the cluster (which could reside in another region of the world) to determine the next sequence value. The use of UUIDs eliminates this traffic, providing a performance boost. So, what’s the downside? Well, the topic of UUIDs is somewhat polarizing. One obvious downside would be the storage size of a UUID relative to an integer: 16 bytes, as opposed to 4 bytes for an INTEGER and 8 bytes for a BIGINT. UUIDs also take some time to generate, which is a performance consideration. Get Building Ultimately, there are many factors to consider when choosing how to generate your database IDs.
You might choose to use auto-incrementing IDs for a table with infrequent writes, or one that doesn’t require low-latency writes. Another table, spread across multiple geographies in a multi-node deployment, might benefit from using UUIDs. There’s only one way to find out. Get out there and write some code.
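As a closing, hedged sketch: if you settle on UUIDs, the ID can also be generated in the application itself with the uuid package mentioned above, skipping the server-side default entirely. This assumes the UUID-based Product model from section 4.
JavaScript
// Client-side UUID generation (illustrative sketch, assuming the Product model from section 4).
const { v4: uuidv4 } = require('uuid');

// The ID is produced in the application, so the insert needs no round trip
// to a sequence (or to the gen_random_uuid() default) to determine it.
await Product.create({ id: uuidv4(), title: "iPad Pro" });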
When I mention memory debugging, the first thing that comes to the minds of many developers is the profiler. That isn’t wrong, but it’s still a partial picture. Profilers are amazing at mapping that “big picture,” but when you want to understand the domain, they fall short. Modern debuggers let us gain a level of insight into the application that’s unrivaled. We can inspect and locate a specific object instance with surgical precision. Transcript Welcome back to the eighth part of debugging at scale, where we know exactly which object was allocated by whom and why. Profiler vs. Debugger Profilers expose a lot of information about memory, but they don’t give us the fine-grained view a debugger offers. The debugger can solve that last mile problem; it can connect the information you see in the debugger to actual actionable changes you need to make in the code. The debugger perfectly complements the insights of the profiler. In the debugger, we can pinpoint specific objects and memory locations. A profiler is a blunt instrument, and the debugger is a fine one. By combining both, we can zoom in on a specific issue and understand the root cause. Searchable Memory View We’ll start by launching the IDE memory view. We can enable the memory view by clicking the widget on the right side of the IDE here. Once we click it, we can see the memory view in the same area. Notice that the memory view is empty by default even after we launch it. This keeps the IDE responsive. In order to see the actual objects in memory, we need to click the load link in the center. Once loaded, we can see the instance count for every object. This helps us get a sense of what exactly is taking up memory. But that’s only part of the story. When we step over, there are allocations happening. We can see the difference between the current point and the one before when we look at the diff column. Notice when I say “point,” I mean either the line before with a step over, but it can also apply for pressing continue between two breakpoints. In this case, I can see the line I stepped over triggered the allocation of 70-byte arrays. That might seem like a lot, but the IDE can’t distinguish threads and a deep call graph, so we need to take the number with a grain of salt. We can double-click an individual entry and see all the instances of the given object, which is a remarkably powerful feature. I’ll dig a bit deeper into this feature soon enough. As a reminder, we can filter the objects we see here using the field on the top of the dialog and locate any object in memory. This is a very powerful tool. Update Loaded Classes Clicking load classes every time is tedious. I have a fast machine with a lot of RAM. I can enable “Update Loaded Classes on Debugger Stop,” and I will no longer need to press load explicitly. Only do that if your machine is a monster, as this will slow down your debugging sessions noticeably. I’m enabling this here because I think it will make the video clearer. Track New Instances You might have noticed that area on the right side of the instance view. We can enable it with the track new instances option. This option lets us explicitly track the individual allocations that are going on between two points. We can enable that by right-clicking any non-array object and enabling this option like we do here. Once enabled, we see a small watch sign next to the tracked object, but there’s a lot more involved as we continue the execution. I can now see only the objects allocated in this diff. 
We can understand exactly what happened in terms of RAM at great detail. Notice that here I can see the exact number of elements that were allocated here. There were a lot because I took a long pause waiting before stepping over. By clicking show new instances, I get a special version of the instances dialog. In this version of the dialog, I only see the new instances created. The IDE knows exactly which objects were created between the last stop on a breakpoint and now. It only shows me these objects. For each of the individual objects, I can see the stack trace that triggered it all the way up to the literal call to new. I can understand who created every object and follow the logic to why an object was created. I can double-click an entry in the stack and go to the applicable source code. This is a fantastic level of insight. Step-Over and Breakpoints I discussed this before, but these updates don’t just work for step-over. Everything I showed works exactly the same when jumping between two breakpoints. Even if they’re in separate methods, the diff will be between those two points. This is very powerful. You can slowly narrow the gap between two points as you discover which area of the code is taking up memory. Notice that memory allocation directly correlates to performance, as garbage collection is a major source of performance overhead. This lets us narrow down the root cause. Final Word In the next video, we’ll discuss remote debugging and its risks. I know what you might be thinking; I already know how to use remote debugging. This is a different video, we’ll discuss tunneling, port-forwarding, and the risks involved in doing all of that. If you have any questions, please use the comments section. Thank you!
Correctly evaluating model performance is a crucial task while working with machine learning. There are quite a few metrics that we may use to do so. That can be problematic for someone who just started the journey in this field — at least, it was for me. I will start with describing concepts like true/false negatives/positives as they are the base for more complex metrics. Then I will mention and explain metrics like accuracy, precision, recall, or calibration error. I will also explain the basics behind the confusion matrix and share a short code snippet on how to build one (see the sketch at the end of this article). Why? Finding resources online and reading them is simple. Everyone can do it, and I did it as well, but I missed an all-in-one glossary for all the stuff. This is my main motivation behind writing this text. First, I will describe all the metrics I came into contact with while working on my previous project. I think that such a metrics glossary will be useful for all the people new to working with machine learning models. Metrics Let's start with true positives and the other positive/negative combinations. True/False Positive/Negative A true positive (TP) is a positive sample correctly predicted as positive, a false positive (FP) is a negative sample incorrectly predicted as positive, a true negative (TN) is a negative sample correctly predicted as negative, and a false negative (FN) is a positive sample incorrectly predicted as negative. Confusion Matrix Less commonly known as the Error Matrix, it is a basic visual representation of our model performance. The concept takes its name from the fact that it makes it easy to see whether the system is confusing two or more classes. Moreover, in the case of multiclass, we can easily notice which pair of classes is the hardest for the model to differentiate. In most cases, it represents the instances of an actual class in rows while representing the instances of a predicted class in columns. However, there can also be a reversed representation where columns are labels and rows are predictions, but it is less frequent. Accuracy It is the basic metric when it comes to model performance. It describes how often our model makes correct predictions — usually, the measurement is expressed as a percentage. The problem with accuracy is that it is a very poor metric that is easy to play with. The most notable issue is that we can fairly easily achieve high accuracy in quite complex tasks. For example, in the case of anti-money laundering, you can always just return zero — meaning that this person is not laundering money — and for sure, you will achieve accuracy higher than 95%, as most people are not actually involved in money laundering. The question is: does such high accuracy mean that your model is good, or do you need some other metric to verify your model's performance? The answer I leave up to you. Furthermore, it is easy to overfit the model when one relies only on accuracy. We may make too many assumptions in our code that apply only to our test set and may not generalize at all. Another problem arises when we incorrectly prepare a test set: it will be too similar to the train set, or part of the test set will be included in the train set. We can once again end up with quite high accuracy but a poorly generalizing model. As for the equation for accuracy — we can express it in terms of true positives and true negatives, so it can be viewed as the ratio of correct predictions to the whole population: Accuracy = (TP + TN) / (P + N). TP + TN — Correct predictions P + N — The whole population Precision Checks how many positives were, in fact, identified correctly. It represents the ratio of correctly predicted positive classes to all items predicted as positives. This can be viewed as a ratio of TP to the sum of TP and FP: Precision = TP / (TP + FP).
High precision means that we can easily identify positives. Furthermore, precision helps us to visualize the reliability of the machine learning model in classifying positive classes. TP + FP — Total number of classified positives Recall Less commonly known as sensitivity. It tries to answer the question of what number of actual positives was identified correctly. It represents the ratio of correctly predicted positive classes to all items that are actually positive. Thus it can be expressed as a ratio of TP compared to the sum of TP and FN: Recall = TP / (TP + FN). High recall means that we are able to correctly identify most of the positives, while low recall means that the model is misidentifying positive values. TP + FN — All positive samples Precision and Recall Problem To fully evaluate the model performance, we need to know both metrics. However, the relationship between them is quite complex. Usually, actions that increase precision result in a decrease in recall, and vice versa; actions that increase recall result in a decrease in precision. Therefore, you have to carefully balance and pick which metric is the most important for your model use case. Confidence Score A number from 0 to 1 (0 to 100 if one is using percentage notation) used to express how sure our model is of its predictions. In general, the higher the confidence score, the better. A confidence score below 0.5 (50) may indicate random or semi-random predictions. While evaluating accuracy results for the model, you should also take the confidence score into consideration. There is little value in a model with high accuracy but low confidence: effectively, a model totally uncertain of its predictions. We should aim to express the accuracy of our model within a certain confidence score. ROC and AUC Score ROC is an abbreviation for Receiver Operating Characteristic Curve. It is the graphical representation of binary classification prediction ability. It describes the relation between Recall (or true positive rate) and the false positive rate (FPR) at various threshold settings. AUC is an abbreviation for Area Under Curve, while AUROC is the abbreviation for Area Under Receiver Operating Characteristic Curve. It is a number from zero to one, which describes the part of the plot located below the ROC curve. It can be used to describe how well our model can distinguish positive samples from negative samples. Depending on the value of AUC, your model will behave differently. For an AUC value equal to: 1 — the model will correctly predict all the labels. From 0.5 to 1, the higher the AUC, the higher the chance our model will predict results correctly. 0.5 — the model is not able to distinguish positives from negatives. 0 — the model will incorrectly predict all the labels (it will classify all positives as negatives and vice versa). IoU Intersection over Union in longer form, or the Jaccard Index. It is a metric for describing the similarity between two data sets, with a range from 0 to 1 (or 0 to 100 with percentage notation). The higher the value, the more similar the two populations are. For IoU equal to: 1 — the sets share all members. 0.5 — the sets share half of their members. 0 — the sets share no members. This metric is heavily used in object detection and segmentation to calculate the degree of overlap between segments. Although it's easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.
It can be expressed via the following equation: Jaccard Index = (the number in both sets) / (the number in either set) * 100 In more mathematical notation: J(A, B) = |A ∩ B| / |A ∪ B|. Here you can also see why it is called the intersection over the union, as the first operation (the numerator) is an intersection, while the second (the denominator) is a union. Calibration Error It describes how well the predicted output probabilities of the model match the actual probabilities of the ground truth distribution. The calibration error can be used to visualize how far a given model's results are from real-life results. F1 score Mixes precision and recall into one metric in the form of their harmonic mean, and it is designed to work better for an imbalanced dataset. It is also the default metric used whenever one needs only one metric to show some results. Both precision and recall are given equal weight, so neither has a bigger impact than the other. We can expect that if both are high, then F1 will also be high, and similarly, if both are low, F1 will be low. However, what is important is that if one is high and the other is low, the F1 value will be somewhere in between. As usual, the higher the metric value, the better our model performance. Which Metric to Choose? Sooner or later, the question arises of which metric to present to stakeholders or on which metric we should focus to make our model better. The answer here is simple — it depends. For sure, you should not base model performance evaluation on accuracy alone; take more metrics into consideration. But, on the other hand, if you have to use only one metric to present some results, the F1 score or AUC are very good picks. As for other metrics, their importance greatly depends on your model's purpose and shortcomings: If you assess that errors caused by FNs are more undesirable, then you should focus on Recall. If you assess that both types of errors are undesirable, then focus on F1. If you want your model to be more certain of its predictions, then you should focus on increasing the Confidence Score and reducing Calibration Error. Additionally, if you want to show or see the shortcomings of your model, then you can use a confusion matrix to easily visualize which classes may be problematic. Conclusion There are many metrics that can be used to verify machine learning model performance, and their usage greatly depends on your model use case. However, remember that you should never base anything on accuracy alone and should use other metrics to verify that the model is performing as expected. If you need a single metric to show to your stakeholders, the F1 score can be a good pick. Thank you for your time.
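As promised in the introduction, here is a short, dependency-free sketch of building a binary confusion matrix and deriving accuracy, precision, recall, F1, and the Jaccard Index from it. Node.js is used only because the article does not prescribe a particular stack, and the sample labels are made up for illustration.
JavaScript
// Binary confusion matrix and derived metrics — illustrative sketch only.
function confusionMatrix(actual, predicted) {
  const m = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (let i = 0; i < actual.length; i++) {
    if (predicted[i] === 1) {
      if (actual[i] === 1) m.tp++; else m.fp++;
    } else {
      if (actual[i] === 0) m.tn++; else m.fn++;
    }
  }
  return m;
}

function metrics({ tp, fp, tn, fn }) {
  const accuracy = (tp + tn) / (tp + tn + fp + fn);           // (TP + TN) / (P + N)
  const precision = tp / (tp + fp);                            // TP / (TP + FP)
  const recall = tp / (tp + fn);                               // TP / (TP + FN)
  const f1 = (2 * precision * recall) / (precision + recall);  // harmonic mean of precision and recall
  return { accuracy, precision, recall, f1 };
}

// Jaccard Index (IoU): |A ∩ B| / |A ∪ B| for two sets of items.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = [...setA].filter((x) => setB.has(x)).length;
  const union = new Set([...setA, ...setB]).size;
  return intersection / union;
}

// Made-up labels: 1 = positive, 0 = negative.
const actual = [1, 0, 1, 1, 0, 0, 1, 0];
const predicted = [1, 0, 0, 1, 0, 1, 1, 0];
console.log(metrics(confusionMatrix(actual, predicted))); // accuracy, precision, recall, and F1 are all 0.75 here
console.log(jaccard([1, 2, 3, 4], [3, 4, 5, 6]));          // 2 shared / 6 total ≈ 0.33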
This review is about API Design Patterns by JJ Geewax from Manning. I already mentioned how I'm trying to get up to speed in the API world:reading books, viewing relevant YouTube videos, and reading relevant IETF RFCs. Facts 30 chapters, $35.00 The author is a Principal Software Engineer at Google Chapters Introduction Design principles Naming Resource scope and hierarchy Data types and defaults Fundamentals Resource identification: How to identify resources in an API Standard methods: The set of standard methods for use in resource-oriented APIs Partial updates and retrievals: How to interact with portions of resources Custom methods: Using custom (non-standard) methods in resource-oriented APIs Long-running operations: How to handle methods that are not instantaneous Rerunnable jobs: Running repeated custom functionality in an API Resource relationships Singleton sub-resources: Isolating portions of resource data Cross references: How to reference other resources in an API Association resources: How to manage many-to-many relationships with metadata Add and remove custom methods: How to manage many-to-many relationships without metadata Polymorphism: Designing resources with dynamically-typed attributes Collective operations Copy and move: Duplicating and relocating resources in an API Batch operations: Extending methods to apply to groups of resources atomically Criteria-based deletion: Deleting multiple resources based on a set of filter criteria Anonymous writes: Ingesting unaddressable data into an API Pagination: Consuming large amounts of data in bite-sized chunks Filtering: Limiting result sets according to a user-specified filter Importing and exporting: Moving data into or out of an API by interacting directly with a storage system Safety and Security Versioning and compatibility: Defining compatibility and strategies for versioning APIs Soft deletion: Moving resources to the "API recycle bin" Request deduplication: Preventing duplicate work due to network interruptions in APIs Request validation: Allowing API methods to be called in "safe mode" Resource revisions: Tracking resource change history Request retrial: Algorithms for safely retrying API requests Request authentication: Verifying that requests are authentic and untampered with Each design pattern chapter follows the same structure: Motivation: what problem solves the pattern Overview: a short description of the pattern Implementation: an in-depth explanation of the pattern. It's structured into different subsections. Trade-offs: patterns have strong and weak points; this section describes the latter Exercises: a list of questions to verify that one has understood the pattern Pros and Cons Let's start with the good sides: As I mentioned above, the structure of each chapter dedicated to a design pattern is repetitive. It makes the chapter easy to consume, as you know exactly what to expect. In general, I read my technical books just before going to sleep because I'm pretty busy during the day. Most books have long chapters, requiring me to stop mid-chapter when I start to fall asleep. When you start again, you need to get back a few pages to get the context back. The length of a chapter on API Design Patterns is ideal: neither too long nor too short. The Design Principles section starts from the basics. You don't need to be an expert on API to benefit from the book. I was not; I hope that I'm more seasoned by now. 
I was a bit dubious at first about the Exercises section of each chapter, for it didn't provide any solution. However, I came to realize it activates the active recall mechanism: instead of passively reading, actively recall what you learned in answering questions. It improves the memorization of what was learned. As an additional benefit, you can learn in a group, compare your answers and eventually debate them. Now, I've some critiques as well: Some patterns are directly taken from Google's API Improvement Proposals. It's not a problem per se, but when it's the case, there's no discussion at all about possible alternatives. For example, the chapter on custom methods describes how to handle actions that don't map precisely to an HTTP verb: a bank transfer is such an action because it changes two resources, the "from" and the "to" accounts.The proposed Google AIP is for the HTTP URI to use a : character followed by the custom verb, e.g., /accounts/123:transfer. That's an exciting proposal that solves the lack of mapping issue. But there are no proposed alternatives nor any motivation for why it should be this way. As an engineer, I can hardly accept implementing a solution with such far-reaching consequences without being provided with other alternatives with their pros and cons. Last but not least, the book doesn't mention any relevant RFC or IETF draft. Chapter 26 describes how to manage request deduplication, the fact that one may need to send the same non-idempotent request repeatedly without being afraid of ill side effects. The proposed solution is good: the client should use a unique key, and if the server gets the same key again, it should discard the request.It's precisely what the IETF draft describes: The Idempotency-Key HTTP Header Field. Still, there's no mention of this draft, giving the feeling that the book is disconnected from its ecosystem. Author's Replies For once, I was already in touch with the author. I offered him an opportunity to review the post. Since his answers are open, I decided to publish them with his permission: Why isn't there more discussion about alternatives? I think you're right — and I actually had quite a bit of discussion of the alternatives in the original manuscript, and I ended up chopping them out. One reason was that my editor wanted to keep chapters reasonably sized, and my internal debates and explanations of why one option was better or worse than another was adding less value than "here's the way to do it." The other reason was that I "should be opinionated." If this were a textbook for a class on exploring API design I could weigh all the sides and put together a proper debate on the pros and cons of the different options, but in most cases there turned out to be a very good option that we've tried out and seen work really well over the course of 5+ years (e.g., : for custom methods). In other cases we actively didn't have that and the chapter is a debate showing the different alternatives (e.g., Versioning). If I could do a 600-700 pages, I think it would have this for you. Why aren't there more references to IETF standards? This is a glaring oversight on my part. There are a lot of RFCs and I do mention some (e.g., RFC-6902 in Partial Updates, etc), but I haven't pulled in enough from that standards body and it's a mistake. If we do a 2nd edition, this will be at the top of the list. Conclusion Because of a couple of cons I mentioned, API Design Patterns falls short of being a reference book. 
Nonetheless, it's a great book that I recommend to any developer entering the world of APIs or even one with more experience to round up their knowledge.
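To make two of the patterns discussed above a bit more tangible (the ':' custom-method URI and the Idempotency-Key header), here is a hypothetical client-side sketch. The endpoint, payload, and field names are invented for illustration, and the global fetch call assumes Node 18+ or a browser.
JavaScript
// Hypothetical client sketch: a custom ":transfer" method plus an Idempotency-Key header.
const { randomUUID } = require('crypto');

async function transfer(fromAccount, toAccount, amount, idempotencyKey = randomUUID()) {
  // Reuse the same idempotencyKey when retrying the same logical transfer,
  // so the server can recognize and discard duplicate attempts.
  const response = await fetch(`https://api.example.com/accounts/${fromAccount}:transfer`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': idempotencyKey,
    },
    body: JSON.stringify({ destination: toAccount, amount }),
  });
  return response.json();
}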
There may be a scenario when you want to test an application when the network is slow (we also call it high network latency), or you are reproducing a customer scenario (having high network latency) where some anomalous behavior is observed. In the Chrome browser, we can easily simulate a slower network connection. This is very helpful when we are working with web applications, but we can also have non-web applications, like web-service applications, messaging brokers, etc. Also, the slowness simulated using Chrome dev tools is more from the client side. In this article, we will understand how the tc command in Linux can be used to simulate network slowness and how we can simulate packet corruption. I have followed a couple of articles on the web, and after testing the commands mentioned in the articles, I am sharing my experiences. 1. Run a basic HTTP application using the python utility. Then we can check the response time using the curl or ab utility. $ python3 -m http.server Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ... # we can use curl with following syntax to get statistics for total response time. $ curl -o /dev/null -s -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' http://localhost:8000/ Establish Connection: 0.000331s TTFB: 0.002978s Total: 0.003120s # we can also use ab utility to understand statistics of a request. Here -n switch is used for number of request to be sent. $ ab -n 1 http://localhost:8000/ This is ApacheBench, Version 2.3 <$Revision: 1879490 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking localhost (be patient).....done Server Software: SimpleHTTP/0.6 Server Hostname: localhost Server Port: 8000 Document Path: / Document Length: 2571 bytes Concurrency Level: 1 Time taken for tests: 0.003 seconds Complete requests: 1 Failed requests: 0 Total transferred: 2727 bytes HTML transferred: 2571 bytes Requests per second: 374.11 [#/sec] (mean) Time per request: 2.673 [ms] (mean) Time per request: 2.673 [ms] (mean, across all concurrent requests) Transfer rate: 996.29 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 0 0 0.0 0 0 Processing: 3 3 0.0 3 3 Waiting: 3 3 0.0 3 3 Total: 3 3 0.0 3 3 2. Now set a delay of two seconds and then again check the response time. We will see that the total response time is increased due to the delay we set. #set a delay of 2 second for loopback interface. $ sudo tc qdisc replace dev lo root netem delay 2000ms # show the rule set $ tc qdisc show dev lo qdisc netem 8001: root refcnt 2 limit 1000 delay 2s # we can use curl (or ab) with following syntax to get statistics for total response time.
$ curl -o /dev/null -s -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' http://localhost:8000/ Establish Connection: 4.000354s TTFB: 8.002722s Total: 8.002833s cpandey@cpandey:~$ ab -n 1 http://localhost:8000/ This is ApacheBench, Version 2.3 <$Revision: 1879490 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking localhost (be patient).....done Server Software: SimpleHTTP/0.6 Server Hostname: localhost Server Port: 8000 Document Path: / Document Length: 2571 bytes Concurrency Level: 1 Time taken for tests: 8.003 seconds Complete requests: 1 Failed requests: 0 Total transferred: 2727 bytes HTML transferred: 2571 bytes Requests per second: 0.12 [#/sec] (mean) Time per request: 8002.822 [ms] (mean) Time per request: 8002.822 [ms] (mean, across all concurrent requests) Transfer rate: 0.33 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 4000 4000 0.0 4000 4000 Processing: 4003 4003 0.0 4003 4003 Waiting: 4002 4002 0.0 4002 4002 Total: 8003 8003 0.0 8003 8003 3. Finally, delete the rule we had set earlier; this is important as we were simulating a slow network. This would slow down the complete interface and associated network traffic. Thus once we finish our testing, it is important to delete the rule which we have set. $ sudo tc qdisc delete dev lo root 4. We can also simulate the corruption of packets as per the percentage set. As per my testing, I found that this also helps in simulating network latency. I observed that packets are being re-transmitted because TCP protocol ensures data is received correctly, no data is missing, and in order. $ sudo tc qdisc replace dev lo root netem corrupt 50% That's it; I hope you will find this article interesting and helpful. Thank you.
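As a small addition, if you prefer to measure the effect programmatically (for example, from a non-web client), a Node.js sketch like the one below can time a request end to end. It assumes the python3 http.server from step 1 is still running on port 8000.
JavaScript
// measure-latency.js — rough end-to-end timing of a single HTTP request.
const http = require('http');

const started = process.hrtime.bigint();
http.get('http://localhost:8000/', (res) => {
  res.resume(); // drain the response body
  res.on('end', () => {
    const elapsedMs = Number(process.hrtime.bigint() - started) / 1e6;
    // Roughly matches what curl reported above: ~3 ms normally, ~8000 ms with the 2s netem delay in place.
    console.log(`Total: ${elapsedMs.toFixed(1)} ms`);
  });
});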
Having a distributed and scalable graph database system is highly sought after in many enterprise scenarios. On the one hand, this is heavily influenced by the sustained rise in popularity of big-data processing frameworks, including but not limited to Hadoop, Spark, and NoSQL databases; on the other hand, as more and more data are to be analyzed in a correlated and multi-dimensional fashion, it's getting difficult to pack all data into one graph on one instance, so having a truly distributed and horizontally scalable graph database is a must-have. Do Not Be Misled Designing and implementing a scalable graph database system has never been a trivial task. A countless number of enterprises, particularly Internet giants, have explored ways to make graph data processing scalable. Nevertheless, most solutions are either limited to their private and narrow use cases or offer scalability in a vertical fashion with hardware acceleration, which only proves, again, that the reason mainframe-architecture computers were deterministically replaced by PC-architecture computers in the 90s was mainly that vertical scalability is generally considered inferior and less capable and scalable than horizontal scalability, period. It has been the norm to perceive that distributed databases use the method of adding cheap PC(s) to achieve scalability (storage and computing) and attempt to store data once and for all on demand. However, doing the same cannot achieve equivalent scalability without massively sacrificing query performance on graph systems. Why is scalability in a graph (database) system so difficult to achieve? The primary reason is that a graph system is high-dimensional; this is in deep contrast to traditional SQL or NoSQL systems, which are predominantly table-centric, essentially columnar and row stores (and KV stores in a more simplistic way), and have proven to be relatively easy to implement with a horizontally scalable design. A seemingly simple and intuitive graph query may lead to deep traversal and penetration of a large amount of graph data, which tends to cause a typical BSP (Bulk Synchronous Parallel) system to exchange heavily amongst its many distributed instances, therefore causing significant (and unbearable) latencies. On the other hand, most existing graph systems prefer to sacrifice performance (computation) while offering scalability (storage). This renders such systems impractical and useless in handling many real-world business scenarios. A more accurate way to describe such systems is that they probably can store a large amount of data (across many instances) but cannot offer adequate graph-computing power — to put it another way, these systems fail to return results when queried beyond meta-data (nodes and edges). This article aims to demystify the scalability challenge(s) of graph databases while putting a lot of focus on performance issues. Simply put, you will have a better and unobstructed understanding of scalability and performance in any graph database system and gain more confidence in choosing your future graph system. There is quite a bit of noise in the market about graph database scalability; some vendors claim they have unlimited scalability, while others claim to be the first enterprise-grade scalable graph database. Who should you believe or follow?
The only way out is to equip yourself with adequate knowledge about scalability in graph database systems so that you can validate it by yourself and don't have to be misled by all those marketing hypes. Admittedly, there are many terms for graph database scalability; some can be dearly confusing, to name a few: HA, RAFT or Distributed Consensus, HTAP, Federation, Fabric, Sharding, Partitioning, etc. Can you really tell the difference, sometimes minute and often with overlapping features, of all these terms? We'll unravel them all. 3 Schools of Distributed Graph System Architecture Designs First, make sure you understand the evolution pathway from a standalone (graph database) instance to a fully distributed and horizontally scalable cluster of graph database instances. Graph 1: Evolution of Distributed (Graph) Systems. A distributed system may take many forms, and this rich diversification may lead to confusion. Some vendors misleadingly (and ironically) claim their database systems to be distributed evenly on a single piece of underpinning hardware instance, while other vendors claim their sharded graph database cluster can handle zillion-scale graph datasets while, in reality, the cluster can't even handle a typical multi-hop graph query or graph algorithm that reiteratively traverse the entire dataset. Simply put, there are ONLY three schools of scalable graph database architecture designs, as captured in the table: Table 1: Comparison of three schools of Distributed Graph Systems. HTAP Architecture The first school is considered a natural extension to the master-slave model, and we are calling it distributed consensus cluster where typically three instances form a graph database cluster. The only reason to have three or an odd number of instances in the same cluster is that it's easier to vote for a leader of the cluster. As you can see, this model of cluster design may have many variations; for instance, Neo4j's Enterprise Edition v4.x supports the original RAFT protocol, and only one instance handles workload, while the other two instances passively synchronize data from the primary instance — this, of course, is a naïve way of putting RAFT protocol to work. A more practical way to handle workload is to augment the RAFT protocol to allow all instances to work in a load-balanced way. For instance, having the leader instance handle read-and-write operations, while the other instances can at least handle read type of queries to ensure data consistencies across the entire cluster. A more sophisticated way in this type of distributed graph system design is to allow for HTAP (Hybrid Transactional and Analytical Processing), meaning there will be varied roles assigned amongst the cluster instances; the leader will handle TP operations, while the followers will handle AP operations, which can be further broken down into roles for graph algorithms, etc. The pros and cons of graph system leveraging distributed consensus include: Small hardware footprint (cheaper). Great data consistency (easier to implement). Best performance on sophisticated and deep queries. Limited scalability (relying on vertical scalability). Difficult to handle a single graph that's over ten billion-plus nodes and edges. What's illustrated below is a novel HTAP architecture from Ultipa with key features like: High-Density Parallel Graph Computing. Multi-Layer Storage Acceleration (Storage is in close proximity to compute). Dynamic Pruning (Expedited graph traversal via dynamic trimming mechanism). 
Super-Linear Performance (i.e., when a computing resource such as the number of CPU cores is doubled, the performance gain can be more than doubled). Graph 2: HTAP Architecture Diagram by Ultipa Graph. Note that such an HTAP architecture works wonderfully on graph data sizes below 10B nodes + edges. That's because lots of the computing acceleration is done via in-memory computing, and since every billion nodes and edges consume about 100GB of DRAM, it may take 1TB of DRAM on a single instance to handle a graph of ten billion nodes and edges. The upside of such a design is that the architecture is satisfactory for most real-world scenarios. Even for G-SIBs (Globally Systemically Important Banks), a typical fraud detection, asset-liability management, or liquidity risk management use case would consume around one billion data records; a reasonably sized virtual machine or PC server can decently accommodate such a data scale and be very productive with an HTAP setup. The downside of such a design is the lack of horizontal (and unlimited) scalability. This challenge is addressed in the second and third schools of distributed graph system designs (see Table 1). The two graphs below show the performance advantages of HTAP architecture. There are two points to watch out for: Linear Performance Gain: A 3-instance Ultipa HTAP cluster's throughput can reach ~300% of a standalone instance. The gain is reflected primarily in AP operations such as meta-data queries, path/k-hop queries, and graph algorithms, but not in TP operations such as insertions or deletions of meta-data, because these operations are done primarily on the main instance before being synchronized with secondary instances. Better performance = Lower Latency and Higher Throughput (TPS or QPS). Graph 3: Performance Advantages of HTAP Architecture. Graph 4: TPS comparison of Ultipa and Neo4j. Grid Architecture In the second school, there are also quite a few naming variations for such types of distributed and scalable graph system designs (some are misleading). To name a few: Proxy, Name server, MapReduce, Grid, or Federation. Ignore the naming differences; the key difference between the second school and the first school lies with the name server(s) functioning as a proxy between the client side and server side. When functioning as a proxy server, the name server is only for routing queries and forwarding data. On top of this, except when running graph algorithms, the name server has the capacity to aggregate data from the underpinning instances. Furthermore, in federation mode, queries can be run against multiple underpinning instances (query-federation); for graph algorithms, however, the federation's performance is poor (due to data migration, just like how map-reduce works). Note that the second school is different from the third school in one area: data is functionally partitioned but not sharded in this school of design. For graph datasets, functional partitioning is the logical division of graph data, such as per time series (horizontal partitioning) or per business logic (vertical partitioning). Sharding, on the other hand, aims to be automated, business logic or time series ignorant. Sharding normally considers the location of network storage-based partitioning of data; it uses various redundant data and special data distribution to improve performance, such as making cuttings against nodes and edges on the one hand and replicating some of the cut data for better access performance on the other hand.
In fact, sharding is very complex and difficult to understand. Automated sharding, by definition, is designed to treat unpredictable data distribution with minimal-to-zero human intervention and in a business-logic-ignorant way, but this ignorance can be very problematic when facing business challenges entangled with specific data distribution. Let's use concrete examples to illustrate this. Assume you have 12 months' worth of credit card transaction data. In artificial partition mode, you naturally divide the network of data into 12 graph sets, one graph set with one month's transactions on each cluster of three instances, and this logic is predefined by the database admin. It emphasizes dividing the data via the metadata of the database and ignoring the connectivity between the different graph sets. It's business-friendly, it won't slow down data migration, and it has good query performance. On the other hand, in auto-sharding mode, it's up to the graph system to determine how to divide (cut) the dataset, and the sharding logic is transparent to the database admin. But it's hard for developers to immediately figure out where the data is stored, therefore leading to potential slow data migration problems. It would be imprudent to claim that auto-sharding is more intelligent than functional partitioning simply because auto-sharding involves less human intervention. Do you feel something is wrong here? It's exactly what we are experiencing with the ongoing rise of artificial intelligence: we are allowing machines to make decisions on our behalf, and it's not always intelligent! (In a separate essay, we will cover the topic of the global transition from artificial intelligence to augmented intelligence and why graph technology is strategically positioned to empower this transition.) In Graph-5, a grid architecture pertaining to the second school of design is illustrated; the two extra components added on top of Graph-2's HTAP architecture are name server(s) and meta server(s). Essentially, all queries are proxied through the name server, and the name server works jointly with the meta server to ensure the elasticity of the grid; the server cluster instances are largely the same as the original HTAP instance (as illustrated in Graph 2). Graph 5: Grid Architecture w/ Name Server and Meta Server. Referring to Table 1, the pros and cons of the grid architecture design can be summarized as follows: All the pros/benefits of a typical HTAP architecture are retained. Scalability is achieved with performance intact (compared to HTAP architecture). Restricted scalability — server clusters are partitioned with DBA/admin intervention. Introduction of name-server/meta-server, making cluster management sophisticated. The name server is critical and complex in ensuring business logic is performed distributively on the server clusters, with simple merge and aggregation functionalities on it before returning to the clients. Business logic may be required to cooperate with partitioning and querying. Shard Architecture Now, we can usher in the third school of distributed graph system design with unlimited scalability — the shard (see Table 1). On the surface, the horizontal scalability of a sharding system also leverages name servers and meta servers as in the second school of design, but the main differences lie with the following: Shard servers are genuinely sharded. Name servers do NOT have knowledge about business logic (as in the second school) directly.
Indirectly, it can roughly judge the category of business logic via automatic statistics collection. This decoupling is important, and it couldn't be achieved elegantly in the second school. The sharded architecture has some variations; some vendors call it fabric (it's actually more like the grid architecture in the second school), and others call it map-reduce, but we should dive deep into the core data processing logic to unravel the mystery. There are only two types of data processing logic in a shard architecture: Type 1: Data is processed mainly on name servers (or proxy servers) Type 2: Data is processed on sharded or partitioned servers as well as name servers. Type 1 is typical, as you see in most map-reduce systems such as Hadoop; data is scattered across the highly distributed instances. However, it needs to be lifted and shifted over to the name servers before it is processed there. Type 2 is different in that the shard servers have the capacity to locally process the data (this is called compute near or collocated with storage, or data-centric computing) before the results are aggregated and secondarily processed on the name servers. As you would imagine, type 1 is easier to implement as it's a mature design scheme used by many big-data frameworks; however, type 2 offers better performance with a more sophisticated cluster design and query optimization. Shard servers in type 2 offer computing power, while type 1 has no such capability. The graph below shows a type-2 shard design: Graph 6: Shard Architecture w/ Name Server and Meta Server. Sharding is nothing new from a traditional SQL or NoSQL big-data framework design perspective. However, sharding on graph data can be a Pandora's box, and here is why: Multiple shards will increase I/O performance, particularly data ingestion speed. But multiple shards will significantly increase the turnaround time of any graph query that spans multiple shards, such as path queries, k-hop queries, and most graph algorithms (the latency increase can be exponential!). Graph query planning and optimization can be extremely sophisticated; most vendors today have gone only very shallow on this front, and there are tons of opportunities in deepening query optimization on the fly: Cascades (Heuristic vs. Cost) Partition-pruning (shard-pruning, actually) Index-choosing Statistics (Smart Estimation) Pushdown (making computing as close to the storage as possible) and more. In Graph-7, we captured some preliminary findings on the Ultipa HTAP cluster and the Ultipa Shard cluster; as you can see, data ingestion speed improves by four times (super-linear), but everything else tends to be slower by five times or more (PageRank slower by 10x, LPA by 16x, etc.). Graph 7: Preliminary findings on the performance difference between HTAP and Shard Architecture. Stay Tuned There are tons of opportunities to continuously improve the performance of the sharding architecture. The team at Ultipa has realized that having a truly advanced cluster management mechanism and deeper query optimization on a horizontally scalable system are the keys to achieving endless scalability and satisfactory performance. Lastly, the three schools of distributed graph system architectures illustrate the diversity and complexity involved in designing a sophisticated and competent graph system.
Of course, it's hard to say that one architecture is absolutely superior to another, given cost, subjective preference, design philosophy, business logic, complexity tolerance, serviceability, and many other factors. Still, it would be prudent to conclude that the long-term direction of architecture evolution is clearly from the first school to the second school and eventually to the third school. However, most customer scenarios can be satisfied with the first two schools, and human intelligence (DBA intervention) still plays a pivotal role in achieving an equilibrium of performance and scalability, particularly in the second and third schools of design. Long live the formula: Graph Augmented Intelligence = Human Intelligence + Machine Graph-Computing Power
From a bird's-eye view, Apache ZooKeeper provides coordination services for managing distributed applications. It is responsible for providing configuration information, naming, synchronization, and group services over large clusters in distributed systems. As an example, Apache Kafka uses ZooKeeper to elect the leader node for topic partitions.
zNodes
The key concept in ZooKeeper is the zNode, which can act as either a file or a directory. zNodes can be replicated between servers because ZooKeeper works as a distributed file system. A zNode is described by a data structure called stat, which consolidates information about the zNode, such as its creation time, number of changes (the version), number of children, length of stored data, and the zxid (ZooKeeper transaction ID) of its creation and last change. Every modification of a zNode increments its version. The zNodes are classified into three categories:
Persistent
Ephemeral
Sequential
Persistent zNode
A persistent zNode stays alive even after the client that created it disconnects. It also survives ZooKeeper restarts.
Ephemeral zNode
Ephemeral zNodes are active only as long as the client that created them is alive. As soon as the client gets disconnected from the ZooKeeper ensemble, its ephemeral zNodes are deleted automatically.
Sequential zNode
Sequential zNodes can be either persistent or ephemeral. When a new zNode is created as a sequential zNode, ZooKeeper sets the path of the zNode by attaching a 10-digit sequence number to the original name. A sequential zNode can easily be differentiated from a normal zNode by this suffix. zNodes can have public or more restricted access; the access rights are managed by ACL permissions.
Sessions
Apache ZooKeeper's operation relies heavily on sessions. A session is established, and the client is given a session ID (a 64-bit number), when the client connects to the ZooKeeper server. A session has a timeout period, specified in milliseconds, and it can expire when the connection remains idle for longer than that period. Sessions are kept alive by the client sending ping requests (heartbeats) to the ZooKeeper service, and the client maintains its session with the ZooKeeper server over a TCP connection. When a session ends, for any reason, the ephemeral zNodes created during that session are deleted as well. The right session timeout depends on several factors, including the size of the ZooKeeper ensemble, the complexity of the application logic, and network congestion.
Watches
Watches let a client receive notifications about changes in the ZooKeeper ensemble. A client can set a watch while reading a specific zNode; any time that zNode changes, the registered client is notified. A "zNode change" means a change to the data associated with the zNode or to the zNode's children. Watches are triggered only once; a client must perform another read operation (re-registering the watch) to receive further notifications. When a connection session expires, the client is disconnected from the server and its watches are removed. The watches registered on a zNode can also be removed explicitly with a call to removeWatches, and a ZooKeeper client can remove watches locally, even without a server connection, by setting the local flag to true.
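To make these concepts concrete, here is a minimal sketch using kazoo, a widely used third-party Python client for ZooKeeper (it is not part of ZooKeeper itself). The connection string, paths, and data values below are illustrative assumptions rather than anything prescribed by ZooKeeper:
Python
import time
from kazoo.client import KazooClient

# Assumed local ZooKeeper server; adjust the host list for a real ensemble
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()  # establishes a session with the ensemble

# Persistent zNode: survives client disconnects and server restarts
zk.ensure_path("/app")
if not zk.exists("/app/config"):
    zk.create("/app/config", b"v1")

# Ephemeral + sequential zNode: deleted when this session ends,
# and suffixed with a 10-digit sequence number (e.g., worker-0000000003)
member = zk.create("/app/workers/worker-", b"", ephemeral=True,
                   sequence=True, makepath=True)
print("Registered as", member)

# Watch: kazoo's DataWatch re-registers the one-shot ZooKeeper watch
# for you and calls back whenever /app/config changes
@zk.DataWatch("/app/config")
def on_config_change(data, stat):
    # Called once with the current value, then again on every change
    if stat is not None:
        print("config is now", data, "version", stat.version)

zk.set("/app/config", b"v2")  # triggers the watch above
time.sleep(1)                 # give the watch callback a moment to fire

zk.stop()
zk.close()
This sketch shows all three zNode categories in one session: the persistent /app/config node would still exist after the script exits, while the ephemeral, sequential worker registration disappears as soon as the session ends.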
ZooKeeper Quorum
A quorum refers to the bare minimum number of server nodes that must be up and available to serve client requests. For a transaction to succeed, any client-generated update to the ZooKeeper tree must be persistently stored on this quorum of nodes. The rule for forming a healthy ensemble is given by the formula Q = 2N + 1, where Q is the number of nodes in the ensemble and N is the maximum number of node failures it can tolerate. This formula can be used to decide the safest and most practical ensemble size. An ensemble is simply a group of ZooKeeper servers, and the minimum number of nodes required to form one is three. A five-node ZooKeeper ensemble can handle two node failures because, per the formula Q = 2N + 1, a quorum can still be established from the remaining three nodes.
The following entries define the quorum of ZooKeeper servers and must be present in the zoo.cfg file located under the conf directory:
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
They follow the pattern server.X=server_name:port1:port2, where:
server.X: X is the server number, written in ASCII. Before starting the servers, we have to create a file named myid under the ZooKeeper data directory on each ZooKeeper server; this file should contain the server number X as its only entry.
server_name: the hostname of the node where the ZooKeeper service is started.
port1: the port the ZooKeeper server uses to connect followers to the leader.
port2: the port used for leader election.
Transactions
Transactions in Apache ZooKeeper are atomic and idempotent, and processing them involves two steps, namely leader election and atomic broadcast. ZooKeeper uses ZooKeeper Atomic Broadcast (ZAB), a purpose-built atomic messaging protocol. Because it is atomic, the ZAB protocol ensures that an update either succeeds or fails as a whole.
Local Storage and Snapshots
Transactions are stored in local storage on the ZooKeeper servers. The ZooKeeper data directory contains snapshots and transactional log files, which are a persistent copy of the zNodes stored by the ensemble. Any change to a zNode is appended to the transaction log, and when the log file grows large enough, a snapshot of the current state of all zNodes is written to the file system. ZooKeeper keeps a fuzzy snapshot of its own data tree in these snapshot files. Because ZooKeeper transaction logs are written at a rapid rate, it is critical that they be configured on a disk separate from the server's boot device. In the event of a catastrophic failure or user error, the transactional logs and snapshot files make it possible to recover data. Inside the zoo.cfg file under the conf directory of the ZooKeeper server, the data directory is specified by the dataDir parameter and the data log directory by the dataLogDir parameter.
Conclusion
In this article, you have learned about the internal components of Apache ZooKeeper, including the three types of zNodes, sessions, watches, the ZooKeeper quorum, and transactions. At this point, you should have a clearer understanding of Apache ZooKeeper's internals and their uses. Hope you have enjoyed this read.
Image Classification With Deep Convolutional Neural Networks
Image classification with deep convolutional neural networks (DCNN) is a mouthful of a phrase, but it is worth studying and understanding due to the number of projects and tasks that can be completed by using this method. Image classification is a powerful tool for taking static images and extrapolating important data that might otherwise be missed. In this article, we will break down the purpose behind image classification, give a definition of a CNN, discuss how the two can be used together, and briefly explain how to create a DCNN architecture.
What Is the Purpose of Image Classification?
As already mentioned, the predominant purpose of image classification is to generate more data from images. This can be as simple as identifying color patterns and as complicated as generating new images based on data from other images. This is exactly why image classification, especially with convolutional neural networks, is so powerful. With machine learning, there are two ways these image classification models can be trained: supervised or unsupervised learning. For a more in-depth discussion of the benefits of these options, make sure to read our article on Supervised vs. Unsupervised Learning. Depending on the type of image classification model you are creating, you may find you want to supervise the learning of the model and control what data is being fed into it. However, you may also want to import as much raw data as possible and allow your model to draw conclusions on its own. Both are acceptable paths, depending on your goal. Image classification, then, is simply the process of generating data from images and organizing that data for some other use. When connected to a DCNN, the true power of an image classification model can be unlocked.
What Is a Convolutional Neural Network?
A convolutional neural network is a specific type of neural network architecture designed to learn from large amounts of data, specifically image, audio, time-series, and signal data. What makes a convolutional neural network any different from other deep learning algorithms, though? The keyword "convolutional" is what makes all the difference in the world when it comes to analyzing and interpreting datasets. A convolutional neural network is built with tens, if not hundreds, of layers all working towards the same goal. This means that a convolutional neural network can have layers stacked on top of each other, each pulling different data from every image. This allows one image input to be "studied" quickly by each convolutional layer for a specific set of features before the network moves on to the next image. Essentially, this makes for deep learning models that are incredibly efficient at parsing out tons of data without slowing down, because each layer only looks at a tightly focused piece of data for each image. The same process works for audio processing, signal data, and time-series datasets.
Can a CNN Be Used for Image Classification?
It has already been established that a convolutional neural network (CNN or ConvNet) can be used for image classification. However, image classification with deep convolutional neural networks hasn't been addressed in detail yet. As an example, focusing on image classification with convolutional neural networks allows users to feed in example images containing vehicles without overwhelming the neural network.
In a standard neural network, the larger the image, the more time it will take to process what we're hoping to extrapolate from the data. Image classification using convolutional neural networks is faster because each layer only looks for a specific set of data and then passes the data along to the next layer. A standard neural network can still process the image dataset, but it will take longer and may not produce the desired results in a timely manner. Image classification with deep convolutional neural networks, on the other hand, can handle more images in a shorter timeframe, quickly identifying the types of vehicles shown in an image and classifying them appropriately. The applications for image classification with deep convolutional neural networks are endless once a model is properly trained, so let's touch on what the best learning process is for image classification.
Which Learning Is Better for Image Classification?
Earlier in this article, we touched on supervised learning vs. unsupervised learning. Here, we will take the discussion of learning methods for convolutional neural network models a step further. Something that hasn't been clearly stated up to this point is whether a convolutional neural network should be trained as a classical machine learning model or as a deep learning model. The short answer is that it will almost certainly make more sense to go for image classification with deep convolutional neural networks. Since many classical machine learning models are built around single-input testing, the process is much slower and far less accurate than using a CNN image classifier built on deep neural networks.
How to Create Your Own Convolutional Neural Network Architecture
Now, one last step to better understand image classification with deep convolutional neural networks is to dive into the architecture behind them. Convolutional neural network architecture isn't as complicated as one might think. A CNN is essentially made up of three kinds of layers: input, convolutional, and output. Different terms are used for different convolutional neural network architectures, but these are the most basic ways to understand how these models are created.
The input layer is the first step in the process, where all the images are first introduced into the model.
The convolutional layer consists of the multiple layers of the neural network that work on the various classifications of the image, building one upon the other.
The output layer is the final step of the neural network, where the images are actually classified based on set parameters.
How exactly a user sets up each layer depends largely on what is being created, but all of them are fairly simple to understand in practice. The convolutional layers are going to be much of the same code, just tweaked for each piece of data to be extracted within each layer, all the way through to the output layer. While the total convolutional neural network architecture may seem intimidating with the tens or hundreds of layers that make up the convolutional part, the structure of the model is actually fairly simple.
Convolutional Neural Network Code Example
Many hands-on code examples for building and training convolutional neural networks (focused on applications of image classification) can be found in this Deep Learning With Python GitHub repo.
One simple flow is as follows:
Python
import tensorflow as tf
import matplotlib.pyplot as plt

# Loading the built-in Fashion-MNIST dataset
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

# Showing example images
fig, axes = plt.subplots(nrows=2, ncols=6, figsize=(15, 5))
ax = axes.ravel()
for i in range(12):
    ax[i].imshow(training_images[i].reshape(28, 28))
plt.show()
The example images may look like the following:
Continuing with:
Python
# Scaling the pixel values to the 0-1 range before feeding them into the neural network
training_images = training_images / 255.0
test_images = test_images / 255.0

# Keras model definition
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)])

# Compiling the model; the optimizer chosen is 'adam'
# and the loss function is 'sparse_categorical_crossentropy'
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Training/fitting the model for 10 epochs
model.fit(training_images, training_labels, epochs=10)
Training produces epoch-by-epoch results (loss, accuracy, and so on). Compute the accuracy of the trained model on the test set:
Python
test_loss = model.evaluate(test_images, test_labels)
print("\nTest accuracy: ", test_loss[1])
Output:
10000/10000 [==============================] - 1s 67us/sample - loss: 0.3636 - acc: 0.8763
We encourage you to take a look at other examples from the same GitHub repo for more variety and optimization of code.
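As a side note, the simple model above uses only fully connected (dense) layers. For readers who want to see convolution in action on the same Fashion-MNIST data, below is a minimal sketch of a convolutional variant; it is an illustrative example in the spirit of the repo, not code taken from it, and the layer sizes and epoch count are arbitrary choices:
Python
import tensorflow as tf

# Load and scale the same Fashion-MNIST data, adding a channel dimension for Conv2D
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
train_images = train_images.reshape(-1, 28, 28, 1) / 255.0
test_images = test_images.reshape(-1, 28, 28, 1) / 255.0

# A small stack of convolution + pooling layers followed by a dense classifier head
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train briefly and report test accuracy
model.fit(train_images, train_labels, epochs=5)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print("\nTest accuracy: ", test_acc)
Each Conv2D/MaxPooling pair plays the role of one of the "convolutional layers" described earlier: it extracts a progressively more abstract set of features before the dense head performs the final classification.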
I'm not an anti-GUI person. In fact, I wrote three books about web GUI development with Java. However, I also like the command-line interface (CLI), especially text-based UIs. After a year of exploring MariaDB and the DevOps world, I got to discover and play with many text-based CLI tools that I didn't even know existed. These tools are especially useful when connecting to remote servers that don't have a GUI. One special CLI tool that I frequently use is the mariadb SQL client (or mysql in the MySQL world), a CLI program used to connect to MariaDB-compatible databases. With it, you can send SQL queries and other commands to the database server.
The MariaDB CLI-based SQL client
The mariadb SQL client has multiple configuration options, one of which is the possibility to set a terminal pager. If you are familiar with Linux, you have probably heard of or used the more and less pagers. You can set a pager through the PAGER environment variable, and mariadb will automatically use it. Alternatively, you can set a pager only for the current session using the mariadb prompt. For example, to use the less pager, run the following command once you are connected to the database:
MariaDB SQL
pager less
The next time you run a SQL query, you'll be able to navigate through the result set using the arrow keys on your keyboard.
Setting a pager using the mariadb SQL client
The less pager is useful but not the best for SQL result sets that are shown as tables. There's an open-source tool called pspg (see the documentation and source code on GitHub), initially developed for PostgreSQL, which later added support for several other databases, including MariaDB. Since the mariadb SQL client is able to connect to MariaDB Xpand databases, I gave it a try, and it worked perfectly. Keep reading to find out how to try it out.
The easiest way to get an Xpand database up and running is by creating a service on SkySQL (it's free). However, you can also run a local instance using Docker. Here's the snippet you need:
Shell
docker run --name xpand \
    -d \
    -p 3306:3306 \
    --ulimit memlock=-1 \
    mariadb/xpand-single
Databases are more fun when there's data in them. A simple yet interesting demo database is available on this website. On Linux-like operating systems, run the following commands (change the IP address if your Xpand database is running somewhere else):
Shell
sudo apt install curl -y
curl https://www.mariadbtutorial.com/wp-content/uploads/2019/10/nation.zip --output nation.zip
unzip nation.zip
mariadb -h 127.0.0.1 -u xpand < nation.sql
rm nation.zip nation.sql
Remember to install pspg:
Shell
apt install pspg -y
Connect to the database using the mariadb SQL client with a custom and cooler prompt that shows "Xpand":
Shell
mariadb -h 127.0.0.1 -u xpand --prompt="Xpand [\d]> " nation
I learned this tip from my colleague Patrick Bossman (Product Manager at MariaDB) during a webinar on MariaDB Xpand + Docker. I recommend watching it if you want to learn more.
Connecting to MariaDB Xpand using a custom prompt
Set the pspg pager for the current session:
MariaDB SQL
pager pspg -s 14 -X --force-uniborder --quit-if-one-screen
A nice feature in pspg is that it shows the fancy text-based UI only when it makes sense (--quit-if-one-screen). So if your query returns only a few rows that fit on the screen, it will just show them right there as usual. For example, try running the following query:
MariaDB SQL
select * from continents;
Nothing new to see here.
The pspg pager won't activate if only a few rows are shown.
However, try the following:
MariaDB SQL
select * from countries;
A navigable text-based interface allows you to explore the data more efficiently.
The pspg pager rendering data from MariaDB Xpand
You can search for a row, sort the results, export to CSV, freeze columns, mark rows, and even use the mouse to interact with the tool, among other things.
Some of the menu options in pspg
I hope this tool helps you the next time you have to interact with a database via SSH and the command line. You can find more information about how to install pspg on your operating system, its configuration options, and its documentation in the GitHub repository for the project. If you want to learn more about distributed SQL and the MariaDB Xpand database, watch this short video, take a look at this datasheet, and explore some of the blog posts and documentation.