Key Points

The special value NULL means "no data": a statement that the value is unknown. By default, columns and variables of any type can take this value unless they have a NOT NULL constraint. The DBMS also automatically adds a NOT NULL constraint to columns included in the table's primary key.

The main feature of NULL is that it is not equal to anything, not even another NULL. You cannot compare any value with it using operators such as =, <, >, or LIKE. Even the expression NULL != NULL will not be true, because one unknown cannot be meaningfully compared with another. By the way, this expression will not be false either, because when evaluating conditions, Oracle is not limited to the TRUE and FALSE states. Because of the element of uncertainty introduced by NULL, there is one more state: UNKNOWN. Thus, Oracle operates not with two-valued but with three-valued logic. This behavior goes back to Codd's relational theory, and Oracle, being a relational DBMS, follows it faithfully.

To avoid puzzling over the "weird" results of queries, a developer needs to know the truth table of three-valued logic. For convenience, we will write a procedure that prints the state of a boolean parameter:

PLSQL
procedure testBool( p_bool in boolean ) is
begin
  if p_bool = true then
    dbms_output.put_line('TRUE');
  elsif p_bool = false then
    dbms_output.put_line('FALSE');
  else
    dbms_output.put_line('UNKNOWN');
  end if;
end;

The familiar comparison operators are defeated by NULL:

PLSQL
exec testBool( null = null );   -- UNKNOWN
exec testBool( null != null );  -- UNKNOWN
exec testBool( null = 'a' );    -- UNKNOWN
exec testBool( null != 'a' );   -- UNKNOWN

Comparison With NULL

There are special operators, IS NULL and IS NOT NULL, that allow comparisons with NULLs. IS NULL returns true if the operand is NULL and false if it is not.

PLSQL
select case when null is null then 'YES' else 'NO' end from dual; -- YES
select case when 'a' is null then 'YES' else 'NO' end from dual;  -- NO

Accordingly, IS NOT NULL does the opposite: it returns true if the value of the operand is non-NULL and false if it is NULL:

PLSQL
select case when 'a' is NOT null then 'YES' else 'NO' end from dual;  -- YES
select case when null is NOT null then 'YES' else 'NO' end from dual; -- NO

In addition, there are a couple of exceptions to the rules regarding comparisons with missing values. The first is the DECODE function, which considers two NULLs to be equivalent to each other. The second is composite indexes: if two keys contain empty fields, but all their non-empty fields are equal, then Oracle considers these two keys to be equivalent.

DECODE goes against the system:

PLSQL
select decode( null
             , 1,    'ONE'
             , null, 'EMPTY'   -- this condition will be true
             , 'DEFAULT'
             ) from dual;

Boolean Operations and NULL

Normally, the UNKNOWN state is handled in the same way as FALSE. For example, if you select rows from a table and the x = NULL condition in the WHERE clause evaluates to UNKNOWN, you won't get any rows. However, there is a difference: while NOT(FALSE) returns TRUE, NOT(UNKNOWN) returns UNKNOWN. The logical operators AND and OR also have their own peculiarities when handling an unknown state, as the examples below show.
In most cases, an unknown result is treated as FALSE:

PLSQL
select 1 from dual where dummy = null; -- the query returns no rows

The negation of the unknown gives the unknown:

PLSQL
exec testBool( not(null = null) );  -- UNKNOWN
exec testBool( not(null != null) ); -- UNKNOWN
exec testBool( not(null = 'a') );   -- UNKNOWN
exec testBool( not(null != 'a') );  -- UNKNOWN

OR operator:

PLSQL
exec testBool( null or true );  -- TRUE <- !!!!!
exec testBool( null or false ); -- UNKNOWN
exec testBool( null or null );  -- UNKNOWN

AND operator:

PLSQL
exec testBool( null and true );  -- UNKNOWN
exec testBool( null and false ); -- FALSE <- !!!!!
exec testBool( null and null );  -- UNKNOWN

IN and NOT IN Operators

Let's start with a few preliminary steps. For the tests, we create a table T with one numeric column A and four rows: 1, 2, 3, and NULL.

PLSQL
create table t as
select column_value a from table(sys.odcinumberlist(1,2,3,null));

Enable query tracing (you must have the PLUSTRACE role for this). In the listings from the trace, only the filter part is kept, to show what the conditions specified in the query unfold into.

PLSQL
set autotrace on

The preliminaries are over. Now let's work with the operators. Let's try to select all records that are included in the set (1, 2, NULL):

PLSQL
select * from t where a in ( 1, 2, null ); -- will return [1, 2]
-- Predicate Information:
-- filter("A"=1 OR "A"=2 OR "A"=TO_NUMBER(NULL))

As you can see, the row with NULL was not selected. This happened because the evaluation of the predicate "A"=TO_NUMBER(NULL) returned the status UNKNOWN. To include NULLs in the query result, you have to ask for them explicitly:

PLSQL
select * from t where a in ( 1, 2 ) or a is null; -- will return [1, 2, NULL]
-- Predicate Information:
-- filter("A" IS NULL OR "A"=1 OR "A"=2)

Let's try now with NOT IN:

PLSQL
select * from t where a not in ( 1, 2, null ); -- no rows selected
-- Predicate Information:
-- filter("A"<>1 AND "A"<>2 AND "A"<>TO_NUMBER(NULL))

Not a single row at all! Let's see why the row with 3 was not included in the results by manually evaluating the filter applied by the DBMS for the case A=3: the predicate 3<>1 is TRUE and 3<>2 is TRUE, but 3<>NULL is UNKNOWN, and TRUE AND TRUE AND UNKNOWN evaluates to UNKNOWN, so the row is filtered out. Due to the peculiarities of three-valued logic, NOT IN does not get along with NULLs at all: as soon as a NULL gets into the selection conditions, do not expect any data back.
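A common way around this, not shown in the listings above, is to keep NULLs out of the NOT IN set, or to rewrite the condition with NOT EXISTS. A minimal sketch, assuming a hypothetical lookup table excluded(val) holding the values 1, 2, and NULL:

PLSQL
-- Variant 1: strip NULLs from the subquery feeding NOT IN
select * from t
 where a not in ( select val from excluded where val is not null ); -- returns [3]

-- Variant 2: NOT EXISTS is not tripped up by a NULL in the subquery
select * from t
 where not exists ( select 1 from excluded e where e.val = t.a );   -- returns [3, NULL]

Note the difference between the two: variant 1 still drops the row where A itself is NULL (A <> 1 is UNKNOWN for it), while variant 2 returns it, because no row in the lookup table ever matches an unknown A.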
NULL and Empty String

Here Oracle deviates from the ANSI SQL standard and declares NULL and the empty string to be equivalent. This is perhaps one of its most controversial features, and it regularly gives rise to multi-page discussions complete with personal attacks and the other indispensable attributes of a heated argument. Judging by the documentation, Oracle itself would not mind changing this situation (it says that although an empty string is currently treated as NULL, this may change in future releases), but by now such a colossal amount of code has been written for this DBMS that actually changing the behavior is hardly realistic. Moreover, this has been discussed at least since the seventh version of the DBMS (1992-1996), and the twelfth is now on its way.

NULL and the empty string are equivalent:

PLSQL
exec testBool( '' is null ); -- TRUE

If you look for the root cause, the reason for the equivalence of an empty string and NULL can be found in the storage format of varchars and NULLs inside data blocks. Oracle stores table rows in a structure consisting of a header followed by the data columns. Each column is represented by two fields: the length of the data in the column (1 or 3 bytes) and the data itself. If a varchar2 has zero length, then there is nothing to write in the data field, it does not take a single byte, and the special value 0xFF is written in the length field, indicating the absence of data. NULL is represented in exactly the same way: there is no data field, and 0xFF is written in the length field. The developers of Oracle could, of course, have separated these two states, but that is how it has been since ancient times.

Personally, the equivalence of an empty string and NULL seems quite natural and logical to me. The very name "empty string" implies the absence of a value: emptiness, a donut hole. NULL basically means the same thing. But there is an unpleasant consequence: while you can say with certainty that the length of an empty string is zero, the length of NULL is not defined at all. Therefore, the expression length('') will return NULL, not zero as you might expect. Another problem: you cannot compare against an empty string. The expression val = '' will return the status UNKNOWN, since it is essentially equivalent to val = NULL.

The length of the empty string is undefined:

PLSQL
select length('') from dual; -- NULL

Comparison with an empty string is not possible:

PLSQL
exec testBool( 'a' != '' ); -- UNKNOWN

Critics of Oracle's approach argue that an empty string does not necessarily mean "unknown." For example, a sales manager fills out a customer card. He may enter a contact number (555-123456), indicate that it is unknown (NULL), or indicate that there is no contact number at all (empty string). With Oracle's way of storing empty strings, the last option is problematic to implement. Semantically the argument is correct, but I always have a question about it to which I have never received a complete answer: how exactly will the manager enter an empty string in the "phone" field, and how will he later distinguish it from NULL? Of course, there are options, but still...

Actually, if we talk about PL/SQL, then somewhere deep inside its engine an empty string and NULL are different. One way to see this is that associative collections allow you to store an element at index '' (an empty string) but do not allow you to store an element at index NULL:

PLSQL
declare
  procedure empty_or_null( p_val varchar2 ) is
    type tt is table of varchar2(1) index by varchar2(10);
    t tt;
  begin
    if p_val is not null then
      dbms_output.put_line('not null');
    else
      -- trying to create an element with index p_val
      t(p_val) := 'x'; -- it worked!
      dbms_output.put_line('empty string');
    end if;
  exception
    -- it was not possible to create an element with index p_val
    when others then dbms_output.put_line('NULL');
  end;
begin
  empty_or_null( 'qwe' ); -- not null
  empty_or_null( '' );    -- empty string
  empty_or_null( NULL );  -- NULL
end;

To avoid problems, it is better to learn the rule from the docs: an empty string and NULL are indistinguishable in Oracle.

NULL Math

Arithmetic with NULL yields NULL:

PLSQL
select decode( null + 10,  null, 'UNKNOWN', 'KNOWN') a from dual; -- UNKNOWN
select decode( null * 10,  null, 'UNKNOWN', 'KNOWN') a from dual; -- UNKNOWN
select decode( abs(null),  null, 'UNKNOWN', 'KNOWN') a from dual; -- UNKNOWN
select decode( sign(null), null, 'UNKNOWN', 'KNOWN') a from dual; -- UNKNOWN

Things are different with concatenation: you can concatenate NULL into a string, and it won't change it. Such is the policy of double standards.
PLSQL
select null ||'AA'|| null ||'BB'|| null from dual; -- AABB

NULL and Aggregate Functions

Almost all aggregate functions, with the exception of COUNT (and even then, not always), ignore null values during calculations. If they didn't, the first NULL that came along would drag the function result down into the unknown. Take, for example, the SUM function summing the series (1, 3, null, 2). If it took empty values into account, we would get the following sequence of actions: 1 + 3 = 4; 4 + null = null; null + 2 = null. That is hardly the aggregate you wanted.

A table with data, used many times below:

PLSQL
create table agg( id int, n int );
insert into agg values( 1, 1 );
insert into agg values( 2, 3 );
insert into agg values( 3, null );
insert into agg values( 4, 2 );
commit;

Empty values are ignored by aggregates:

PLSQL
select sum(n) from agg; -- 6

The COUNT function, if used as COUNT(*) or COUNT(constant), will count rows with null values. However, if it is used as COUNT(expression), null values are ignored.

With a constant:

PLSQL
select count(*) from agg;    -- 4
select count(1+1) from agg;  -- 4
select count(user) from agg; -- 4

With an expression:

PLSQL
select count(n) from agg;      -- 3
select count(id) from agg;     -- 4
select count(abs(n)) from agg; -- 3

You should also be careful with functions like AVG. Because it ignores null values, the result for column N will be (1+3+2)/3, not (1+3+2)/4, and that may not be the average you need. The standard solution is the NVL function:

PLSQL
select avg(n) from agg;        -- (1 + 3 + 2) / 3 = 2
select avg(nvl(n,0)) from agg; -- (1 + 3 + 0 + 2) / 4 = 1.5

Aggregate functions return NULL if they are applied to an empty dataset or to one that consists only of NULLs. The exceptions are REGR_COUNT and COUNT(expression), which are designed to count rows; they return zero in the cases listed above.

A dataset of only NULLs:

PLSQL
select sum(n) from agg where n is null;          -- NULL
select avg(n) from agg where n is null;          -- NULL
select regr_count(n,n) from agg where n is null; -- 0
select count(n) from agg where n is null;        -- 0

An empty dataset:

PLSQL
select sum(n) from agg where 1 = 0;          -- NULL
select avg(n) from agg where 1 = 0;          -- NULL
select regr_count(n,n) from agg where 1 = 0; -- 0
select count(n) from agg where 1 = 0;        -- 0

NULL in Indexes

The key thing to remember is that Oracle does not create B-tree index entries for rows in which all indexed columns are NULL. For a regular single-column index this means that rows with NULL in that column are simply absent from the index, so a condition like WHERE col IS NULL cannot be answered from that index and usually falls back to a full table scan.

Composite indexes behave a little differently: they store one concatenated key per row (not a separate structure per column), and an entry is created as long as at least one of the indexed columns is non-NULL. Only rows where every indexed column is NULL are left out. Bitmap indexes, by contrast, do include NULLs, and there are well-known tricks for making NULLs "indexable" with B-trees, such as adding a constant as a trailing index column or building a function-based index over an NVL expression.
Functional indexes and NULLs: function-based indexes are built on expressions over table columns. If the expression maps NULL inputs to a non-NULL result (for example, NVL(col, 'N/A')), then rows with NULL in the underlying column do get index entries, which makes such indexes useful for optimizing NULL-related queries.

Bad Practices

- Indexing mostly-NULL columns just to speed up IS NULL lookups: a plain B-tree index does not contain the NULL rows at all, so IS NULL predicates cannot use it. Such an index adds storage and maintenance overhead without helping the queries it was meant for.
- Indexing non-selective columns: columns with few distinct values or many duplicates rarely benefit from an index. The optimizer may ignore it, queries can end up slower than a full table scan, and the index still has to be maintained.
- Relying on an index for IS NOT NULL-only queries on low-selectivity data: if almost every row is non-NULL, an IS NOT NULL condition is not selective, and the optimizer will usually prefer a full scan anyway, so the index is wasted effort.
- Indexing large text columns that may contain NULLs: the index must store the large key values, which can significantly increase its size and slow down both maintenance and queries.
- Overusing function-based indexes: they are handy for expressions that normalize NULLs, but every additional index inflates storage and slows DML, so creating them indiscriminately leads to bloat and degraded performance.
- Keeping irrelevant and unused indexes: stale indexes consume space and still have to be updated on every data change. They should be identified and removed regularly to reduce system load.

It is important to keep in mind that designing indexes around NULLs can be useful, but not always. Pay attention to how many NULLs a column actually contains and how the column is really used in queries; this helps avoid unnecessary indexes and improves database performance.

Good Practices

- Making NULL searches indexable when they matter: if queries frequently filter on IS NULL, build the index so that NULL rows are actually represented in it (for example, a composite index with a constant trailing column, a bitmap index, or a function-based index over an NVL expression).
- Indexing columns commonly used in queries: indexes on columns that appear frequently in predicates can greatly speed up data retrieval and reduce query execution time.
- Using function-based indexes for NULL-normalizing expressions: when queries consistently apply the same expression (such as NVL) to a nullable column, an index on that expression can serve them directly.
- Analyzing performance with and without the index: when creating indexes aimed at NULL handling, compare query performance before and after to see whether the index earns its keep.
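One lightweight way to run that comparison over time is Oracle's built-in index usage monitoring. A sketch (the index name my_index is hypothetical; depending on your release the results are exposed in USER_OBJECT_USAGE or the older V$OBJECT_USAGE view):

PLSQL
-- start tracking whether the optimizer ever touches the index
alter index my_index monitoring usage;

-- ... run the normal workload for a while ...

-- USED = 'YES' means at least one query plan used the index
select index_name, used, monitoring
  from user_object_usage        -- on older releases: v$object_usage
 where index_name = 'MY_INDEX';

-- stop tracking
alter index my_index nomonitoring usage;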
Such analysis will help you determine which of these indexes actually improve query performance and are justified in your particular situation.

- Periodic index maintenance: indexes built for NULL handling, like normal indexes, require periodic maintenance. Regularly gathering statistics helps the query optimizer evaluate execution plans correctly and avoid unnecessary operations.
- Removing unused indexes: indexes that turn out to be unused should be identified and dropped regularly to reduce system load and keep the database lean.
- Watching updates and inserts: every additional index slows down DML, so when adding indexes for NULL handling, keep an eye on the performance of update and insert operations as well.

Following these practices lets you handle NULLs in Oracle indexes effectively, improve query performance, and reduce the load on the database.
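To make the IS NULL trick above concrete, here is a minimal sketch using the agg table from the aggregate examples (the index names are illustrative). Adding a constant as a second index column guarantees that every row, including those where N is NULL, gets an index entry; a function-based index over NVL is an alternative when queries already use that expression:

PLSQL
-- composite index with a constant: NULL values of N are now present in the index
create index agg_n_idx on agg( n, 0 );

select * from agg where n is null;  -- can now be resolved with an index scan

-- function-based alternative, for queries written against NVL(n, -1)
create index agg_nvl_idx on agg( nvl(n, -1) );

select * from agg where nvl(n, -1) = -1;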
Let's start with the question, "How do you use Redis?" I'm sure most of you use it as a cache for a service. I hope you know that it can do more than just that. I recently spoke at a conference about how we moved part of our data into Redis so that queries hit it first. Now I want to talk not about how we applied it, but about the fact that when you work through Spring and its abstractions, you may not even notice the substitution.

Let's write a small Spring application that uses two databases: PostgreSQL and Redis. Note that we will store in them not some flat object, but a full-fledged object from a relational database with nested fields (inner join). To do this, we need two Redis modules, RedisJSON and RediSearch. The first lets us store our object in JSON format, and the second lets us search by any field of our object, even nested ones.

To work with the relational database, we will use Spring Data JPA. To work with Redis, we will use the excellent Redis OM Spring library, which lets you work with the database at the same level of abstraction; it is an analog of Data JPA. Under the hood, Redis OM Spring pulls in all the dependencies it needs for Spring and Jedis to talk to the database. We will not dwell on the details, since the article is not about that.

Let's Write Code

So let's write code. Let's say we need to write a certain entity called "downtime" to the database. In this entity, I added other objects such as "place", "reason", and others.

Entity for a relational database:

Java
@Entity
@Table(schema = "test", name = "downtime")
public class Downtime {

    @Id
    private String id;

    private LocalDateTime beginDate;
    private LocalDateTime endDate;

    @ManyToOne(fetch = FetchType.EAGER)
    @JoinColumn(name = "area")
    private Place area;

    @ManyToOne(fetch = FetchType.EAGER)
    @JoinColumn(name = "cause")
    private Cause cause;
    ...

This piece of code needs no comments. We need to do the same for Redis.

Object for Redis:

Java
@Document
public class DowntimeDoc {

    @Id
    @Indexed
    private String id;

    @Indexed
    private LocalDateTime beginDate;

    private LocalDateTime endDate;

    @Indexed
    private PlaceDoc area;

    @Indexed
    private CauseDoc cause;
    ....

In this case, instead of @Entity, we use @Document. This annotation indicates that our object is an entity. It will be stored in the database under the key "package path + class name + Idx." The @Indexed annotation means the field will be indexed for search. If you do not add this annotation, the field will still be saved in the database, but searching by it will return an empty result. You can add the annotation as the need arises: data that is already in the database will be indexed asynchronously, while new data will be indexed synchronously.

Next, we create a repository, which does the basic work of getting data from the database.

An example for a relational database:

Java
public interface DowntimeRepository extends JpaRepository<Downtime, String> {
}

An example for Redis:

Java
public interface DowntimeRedisRepository extends RedisDocumentRepository<DowntimeDoc, String> {
}

The difference is that we extend RedisDocumentRepository, which extends the standard CRUD interface for Spring.
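Before moving on to derived queries, here is a small usage sketch showing that both repositories are driven through the same CRUD-style API. The service class, the injection style, and the getId() getter are assumptions made for illustration; only the repository calls matter:

Java
@Service
public class DowntimeWriter {

    private final DowntimeRedisRepository redisRepository;

    public DowntimeWriter(DowntimeRedisRepository redisRepository) {
        this.redisRepository = redisRepository;
    }

    public void saveAndRead(DowntimeDoc doc) {
        // save() comes from the standard Spring Data CRUD contract,
        // so the call looks exactly like it would with a JPA repository
        redisRepository.save(doc);

        // findById() reads the JSON document back and maps it onto DowntimeDoc,
        // including the nested PlaceDoc and CauseDoc fields
        redisRepository.findById(doc.getId())
                       .ifPresent(d -> System.out.println("loaded downtime " + d.getId()));
    }
}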
Let's add a method to find the first downtime for a given cause:

Java
public interface DowntimeRepository extends JpaRepository<Downtime, String> {
    Downtime findFirstByCauseIdOrderByBeginDate(String causeId);
}

And the same for Redis:

Java
public interface DowntimeRedisRepository extends RedisDocumentRepository<DowntimeDoc, String> {
    DowntimeDoc findTopByCause_IdOrderByBeginDateAsc(String causeId);
}

As you can see, if you work with the database through abstractions, the difference is barely noticeable. In addition, Redis OM Spring allows you to write queries yourself using the @Query annotation, just as in Spring Data JPA.

Here is an example of an HQL query:

Java
@Query("SELECT d FROM Downtime d" +
       " JOIN FETCH d.area " +
       " JOIN FETCH d.cause" +
       " JOIN FETCH d.fixer" +
       " JOIN FETCH d.area.level " +
       " WHERE d.area IN ?1 AND (d.beginDate BETWEEN ?2 AND ?3 OR d.cause IN ?4) ")
List<Downtime> findAllByParams(List<Place> workPlace, LocalDateTime start, LocalDateTime end, List<Cause> causes);

The same for Redis:

Java
@Query("(@area_id:{$areas} ) & (@beginDate:[$start $end] | @cause_id:{$causes})")
Page<DowntimeDoc> findByParams(@Param("areas") List<String> areas,
                               @Param("start") long start,
                               @Param("end") long end,
                               @Param("causes") List<String> causes,
                               Pageable pageable);

In the Redis case, we only specify the conditions of the "WHERE" part. There is no need to say which related objects to join, since they are always pulled from the database together with the document. However, we do not have to fetch every field: the additional "returnFields" parameter lets us specify exactly what we need. We can also specify sorting, limit, and offset; the latter, by the way, is impossible in HQL. In this example, I passed a Pageable to the method, and it is applied at the database level rather than by pulling all the data into the service and trimming it there (as would happen with Hibernate).

Redis OM Spring also lets you build queries with EntityStream, which is analogous to the Stream API. Here is an example of the queries above written with EntityStream:

Java
…
entityStream
    .of(DowntimeDoc.class)
    .filter(DowntimeDoc$.AREA_ID.in(filter.getWorkPlace().toArray(String[]::new)))
    .filter(between + " | " + causes)
    .map(mapper::toEntity)
    .collect(Collectors.toList());

In this example, I use the metamodel in the first filter and pass the parameters as a string to the second filter to show that both options are valid. As you probably guessed, EntityStream accepts a set of intermediate operations and executes the whole chain when a terminal operation is called.

Nuances of Redis OM Spring

Let me mention some of the nuances of using Redis OM Spring.

You will not be able to use a UUID as a primary key. You can store it as a string, and it will be indexed, but when searching you will need to escape its special characters (the hyphens): @id: {2e5af82m\-02af\-553b\-7961\-168878aa521е}. And one more thing: if you search through the RedisDocumentRepository methods, this will not work at all, because the library contains an expression that strips all the escaping:

Java
String regex = "(\\$" + key + ")(\\W+|\\*|\\+)(.*)";

Therefore, to search by such fields, you will have to write the query directly in RediSearch. There is an example of how to do this in the demo project.

When searching through the RedisDocumentRepository methods, if you expect a collection, you must pass a Pageable indicating the expected page size, or specify the size in @Query; otherwise, you will receive at most 10 records.
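A small sketch of that last nuance; the page size of 1000 is arbitrary, and the injected repository field is assumed:

Java
// without an explicit Pageable the repository would silently cap the result at 10 documents
Pageable firstThousand = PageRequest.of(0, 1000);

Page<DowntimeDoc> page = downtimeRedisRepository.findByParams(
        areas, start, end, causes, firstThousand);

List<DowntimeDoc> docs = page.getContent();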
The FT.SEARCH (@Query) mechanism supports only one sort parameter; this is solved by writing the query through FT.AGGREGATE (@Aggregation).

The list above is not exhaustive. While working with these libraries, I ran into many other details, but they are all just specifics of the database implementation. Finally, I did not include information about the Redis modules in this article and did not cover all the features of Redis OM Spring; otherwise, the article would be huge and unreadable.

Conclusion

I have shown that Redis can now store an object with deep nesting and search through the fields of that object. If you work with data through repository abstractions, some of you may see no difference from Spring Data JPA, especially if you use simple operations such as save, delete, and findAllBy, or derive queries from method names. Examples can be found on GitHub. Good luck!
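One setup detail worth noting if you want to try this yourself: Redis OM Spring document repositories are typically switched on with a dedicated annotation on the Spring Boot application class. A sketch, with the class name being purely illustrative:

Java
@SpringBootApplication
@EnableRedisDocumentRepositories
public class DowntimeApplication {
    public static void main(String[] args) {
        SpringApplication.run(DowntimeApplication.class, args);
    }
}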
Hello, fellow developers! This year, approximately 90,000 of us participated in the Stack Overflow survey. Impressively, we crowned Postgres as the #1 database. Moreover, DB Engines also spotlights PostgreSQL as one of the fastest-growing databases worldwide. What does this mean for us? It's clear that we should strive to become PostgreSQL experts. An essential step in this direction is setting up our own database for hands-on experiments. So, whether you prefer reading or watching, let's walk through three practical, user-friendly, and absolutely free ways to kickstart your PostgreSQL journey.

Option #1: Dive Into Postgres With Docker

The simplest and most pocket-friendly way to start your journey with PostgreSQL is Docker. That's right: with a single Docker command, you have your database container humming merrily on your laptop:

Shell
docker run --name postgresql \
  -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  -d postgres:latest

The advantages are immense! Setting up your database is incredibly fast, and guess what? It all happens right on your hardware. Next, use your preferred SQL editor, like DataGrip, to open a database connection. Ensure you connect to localhost and use the username and password from the Docker command mentioned earlier. Once connected, execute a few simple SQL statements to ensure the Postgres instance is ready for more advanced experiments:

SQL
create table back_to_the_future(id int, name text);
insert into back_to_the_future values(1, 'Doc');
select * from back_to_the_future;

Option #2: Jump Into Cloud-Native Postgres With Neon

The second cost-free and straightforward way to learn PostgreSQL caters to those eager to delve into public cloud environments right from the start. Neon is a PostgreSQL-compatible database born and bred in the cloud. The icing on the cake? It's serverless. Dive in without spending a penny, and it autonomously scales with your workload as needed.

Eager to get started with Neon? For us developers, the command line is home, isn't it? Kick things off by installing the Neon command line tool:

Shell
npm i -g neonctl

Then create an account and authenticate:

Shell
neonctl auth

Create a new project and database instance:

Shell
neonctl projects create --name mynewproject --region-id aws-us-east-1
neonctl databases create --name newdb

Lastly, fetch your database connection string and you're set:

Shell
neonctl connection-string --database-name newdb

Use that connection string to link up with the database instance via DataGrip. For quick verification, execute a couple of straightforward SQL commands:

SQL
create table matrix(id int, name text);
insert into matrix values(1, 'Neo');
select * from matrix;

Option #3: Build on Scalable Postgres With YugabyteDB

So, do you think you're done? Not at all, my friend! We conclude with YugabyteDB, the PostgreSQL "beast." Not only does it scale up and out across zones and regions, but it also withstands the most challenging cloud armageddons. Plus, it has a special knack: pinning user data to specific geographic locations.

Want a taste of YugabyteDB straight in the cloud? YugabyteDB Managed (DBaaS) offers a free tier, gifting you a dedicated single-node instance to begin with, and you can simply transition to the dedicated plan when you're ready. And now, for the grand tradition! Time to fire up a YugabyteDB instance straight from the command line.
First, install the ybm tool:

Shell
brew install yugabyte/tap/ybm

Next, create an account and sign in using your authentication token:

Shell
ybm signup
ybm auth

The final steps involve setting up your first database instance:

Shell
ybm cluster create \
  --cluster-name yugabyte \
  --credentials username=admin,password=password-123 \
  --cluster-tier Sandbox \
  --cloud-provider AWS \
  --wait

And add your laptop to the database's IP allow list (yep, the YugabyteDB folks take security seriously):

Shell
ybm network-allow-list create \
  --ip-addr $(curl ifconfig.me) \
  --name my-address

ybm cluster network allow-list assign \
  --network-allow-list my-address \
  --cluster-name yugabyte

Alright, once the database is started, grab its connection string:

Shell
ybm cluster describe --cluster-name yugabyte

And use that string to establish a connection via DataGrip. With the connection open, send a few SQL commands to YugabyteDB to make sure the "beast" is ready to serve your requests:

SQL
create table avengers(id int, name text);
insert into avengers values(1, 'Hulk');
select * from avengers;

That's All… For Now

With PostgreSQL's popularity on the rise, it's inevitable you'll cross paths with it in upcoming projects. So why wait? Dive in now. And there's no better way to learn than by doing. Roll up your sleeves, launch a Postgres instance using one of the three methods outlined in this article, and savor the journey ahead.
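If you would rather stay in the terminal than open DataGrip, psql works for any of the three options. A quick sketch against the Docker instance from Option #1, assuming the same credentials used in that docker run command:

Shell
# connect to the local container started earlier
psql "postgresql://postgres:password@localhost:5432/postgres"

# or run a one-off check without entering the interactive shell
psql "postgresql://postgres:password@localhost:5432/postgres" -c "select version();"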
In this article, we are going to learn how indexing table columns can help in the quick response time of SQL queries. We'll cover the steps for installing MySQL, creating stored procedures, analyzing queries, and understanding the impact of indexing. I have used MySQL version 8 on Ubuntu, with DBeaver as the MySQL client for connecting to the MySQL server. So let's learn together. I have used MySQL for demonstration purposes; however, the concept remains the same in all other databases as well.

1. Here is how we can install MySQL and access it using the root user. This MySQL instance is just for testing; thus, I used a simple password.

MySQL
$ sudo apt install mysql-server
$ sudo systemctl start mysql.service
$ sudo mysql
mysql> SET GLOBAL validate_password.policy = 0;
mysql> ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';
mysql> exit
$ mysql -uroot -ppassword

2. Create a database and use it.

MySQL
mysql> create database testdb;
mysql> show databases;
mysql> use testdb;

3. Create two tables, employee1 and employee2. Here employee1 has no primary key, while employee2 has one.

MySQL
mysql> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE employee2 (id int primary key,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.02 sec)

mysql> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| employee1        |
| employee2        |
+------------------+
2 rows in set (0.00 sec)

4. Now, if we check the indexes for each table, we find that the employee2 table already has one index on the id column because it is a primary key.

MySQL
mysql> SHOW INDEXES FROM employee1 \G;
Empty set (0.00 sec)

ERROR:
No query specified

mysql> SHOW INDEXES FROM employee2 \G;
*************************** 1. row ***************************
        Table: employee2
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: id
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null:
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL
1 row in set (0.00 sec)

ERROR:
No query specified

5. Now, create a stored procedure to insert bulk data into both tables; we insert 20000 records into each. (In the mysql command-line client, the DELIMITER statements are needed so that the semicolons inside the procedure body do not end the statement early.) We can then invoke the stored procedure with the CALL procedure-name command.

MySQL
mysql> DELIMITER $$
mysql> CREATE PROCEDURE testdb.BulkInsert()
BEGIN
    DECLARE i INT DEFAULT 1;
    truncate table employee1;
    truncate table employee2;
    WHILE (i <= 20000) DO
        INSERT INTO testdb.employee1 (id, FirstName, Address)
        VALUES(i, CONCAT("user","-",i), CONCAT("address","-",i));
        INSERT INTO testdb.employee2 (id, FirstName, Address)
        VALUES(i, CONCAT("user","-",i), CONCAT("address","-",i));
        SET i = i+1;
    END WHILE;
END$$
mysql> DELIMITER ;

mysql> CALL testdb.BulkInsert();

mysql> SELECT COUNT(*) from employee1 e ;
COUNT(*)|
--------+
   20000|

mysql> SELECT COUNT(*) from employee2 e ;
COUNT(*)|
--------+
   20000|

6. Now, if we select the record for any random id, we find the response from the employee1 table is slower, as it doesn't have any index.
MySQL mysql> select * from employee2 where id = 15433; +-------+----------+------------+---------------+---------+ | id | LastName | FirstName | Address | profile | +-------+----------+------------+---------------+---------+ | 15433 | NULL | user-15433 | address-15433 | NULL | +-------+----------+------------+---------------+---------+ 1 row in set (0.00 sec) mysql> select * from employee1 where id = 15433; +-------+----------+------------+---------------+---------+ | id | LastName | FirstName | Address | profile | +-------+----------+------------+---------------+---------+ | 15433 | NULL | user-15433 | address-15433 | NULL | +-------+----------+------------+---------------+---------+ 1 row in set (0.03 sec) mysql> select * from employee1 where id = 19728; +-------+----------+------------+---------------+---------+ | id | LastName | FirstName | Address | profile | +-------+----------+------------+---------------+---------+ | 19728 | NULL | user-19728 | address-19728 | NULL | +-------+----------+------------+---------------+---------+ 1 row in set (0.03 sec) mysql> select * from employee2 where id = 19728; +-------+----------+------------+---------------+---------+ | id | LastName | FirstName | Address | profile | +-------+----------+------------+---------------+---------+ | 19728 | NULL | user-19728 | address-19728 | NULL | +-------+----------+------------+---------------+---------+ 1 row in set (0.00 sec) mysql> select * from employee1 where id = 3456; +------+----------+-----------+--------------+---------+ | id | LastName | FirstName | Address | profile | +------+----------+-----------+--------------+---------+ | 3456 | NULL | user-3456 | address-3456 | NULL | +------+----------+-----------+--------------+---------+ 1 row in set (0.04 sec) mysql> select * from employee2 where id = 3456; +------+----------+-----------+--------------+---------+ | id | LastName | FirstName | Address | profile | +------+----------+-----------+--------------+---------+ | 3456 | NULL | user-3456 | address-3456 | NULL | +------+----------+-----------+--------------+---------+ 1 row in set (0.00 sec) 7. Now check the output of the command EXPLAIN ANALYZE. This command actually executes the query and plans the query, instruments it, and executes it while counting rows and measuring time spent at various points in the execution plan. Here we find for employee1, there is a table scan performed, which means a full table is scanned or searched to fetch output. We also call it a full scan of the table. MySQL mysql> explain analyze select * from employee1 where id = 3456; +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (employee1.id = 3456) (cost=1989 rows=1965) (actual time=5.24..29.3 rows=1 loops=1) -> Table scan on employee1 (cost=1989 rows=19651) (actual time=0.0504..27.3 rows=20000 loops=1) | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.03 sec) # Here is detailed explanation from ChatGPT. 
Filter: (employee1.id = 3456): This indicates that there is a filter operation being performed on the "employee1" table, and only rows where the "id" column has a value of 3456 will be selected. (cost=1989 rows=1965) (actual time=5.3..31.9 rows=1 loops=1): This part provides some performance-related information about the query execution: cost=1989: It represents the cost estimate for the entire query execution. Cost is a relative measure of how much computational effort is required to execute the query. rows=1965: It indicates the estimated number of rows that will be processed in this part of the query. actual time=5.3..31.9: This shows the actual time taken for this part of the query to execute, which is measured in milliseconds. rows=1 loops=1: The number of times this part of the query is executed in a loop. -> Table scan on employee1 (cost=1989 rows=19651) (actual time=0.034..29.7 rows=20000 loops=1): This part shows that a table scan is being performed on the "employee1" table: Table scan: This means that the database is scanning the entire "employee1" table to find the rows that match the filter condition. cost=1989: The cost estimate for this table scan operation. rows=19651: The estimated number of rows in the "employee1" table. actual time=0.034..29.7: The actual time taken for the table scan operation, measured in milliseconds. rows=20000 loops=1: The number of times this table scan operation is executed in a loop. Overall, this query plan suggests that the database is executing a query that filters the "employee1" table to only return rows where the "id" column is equal to 3456. The table scan operation reads a total of 20,000 rows to find the matching row(s) and has an estimated cost of 1989 units. The actual execution time is 5.3 to 31.9 milliseconds, depending on the number of rows that match the filter condition. 8. For table employee2, we find that only one row is searched, and results are fetched. Thus if there are a lot of records in tables, we will observe considerable improvement in the response time of SQL queries. MySQL mysql> explain analyze select * from employee2 where id = 3456; +---------------------------------------------------------------------------------------------------+ | EXPLAIN | +---------------------------------------------------------------------------------------------------+ | -> Rows fetched before execution (cost=0..0 rows=1) (actual time=110e-6..190e-6 rows=1 loops=1) | +---------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) # As per ChatGPT explanation of this query plan is : Rows fetched before execution: This part indicates that the database is fetching some data before the main query is executed. (cost=0..0 rows=1): The cost estimate for this operation is 0 units, and it expects to fetch only one row. (actual time=110e-6..190e-6 rows=1 loops=1): This provides the actual time taken for the data fetching operation: actual time=110e-6..190e-6: The actual time range for the fetching operation, measured in microseconds (µs). rows=1: The number of rows fetched. loops=1: The number of times this data fetching operation is executed in a loop. Overall, this part of the query plan indicates that the database is fetching a single row before executing the main query. The actual time taken for this data fetching operation is in the range of 110 to 190 microseconds. 
This preliminary data fetch might be related to obtaining some essential information or parameters needed for the subsequent execution of the main query. 9. Now, let us make it more interesting. Let us analyze the query plan when we search a record for the non-index column FirstName on both tables. From the output, we find that Table Scan is done to search records which takes considerable time to fetch data. MySQL mysql> explain analyze select * from employee2 where FirstName = 'user-13456'; +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (employee2.FirstName = 'user-13456') (cost=2036 rows=2012) (actual time=15.7..24 rows=1 loops=1) -> Table scan on employee2 (cost=2036 rows=20115) (actual time=0.0733..17.8 rows=20000 loops=1) | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.02 sec) mysql> explain analyze select * from employee1 where FirstName = 'user-13456'; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (employee1.FirstName = 'user-13456') (cost=1989 rows=1965) (actual time=23.7..35.2 rows=1 loops=1) -> Table scan on employee1 (cost=1989 rows=19651) (actual time=0.0439..28.9 rows=20000 loops=1) | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.03 sec) 10. Now, let us create an index on the employee1 table for column FirstName. MySQL mysql> CREATE INDEX index1 ON employee1 (FirstName); Query OK, 0 rows affected (0.13 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> show indexes from employee1 \G; *************************** 1. row *************************** Table: employee1 Non_unique: 1 Key_name: index1 Seq_in_index: 1 Column_name: FirstName Collation: A Cardinality: 19651 Sub_part: NULL Packed: NULL Null: YES Index_type: BTREE Comment: Index_comment: Visible: YES Expression: NULL 1 row in set (0.01 sec) ERROR: No query specified 11. Let us now again check the query plan for both tables when we search a single record for column FirstName. We find that employee1 quickly provides a response, there is only 1 row to search, and index lookup is done on the employee1 table when using the index on column FirstName. But for employee2, the response time is large, and all 20000 rows are searched to get a response. 
MySQL mysql> explain analyze select * from employee1 where FirstName = 'user-13456'; +-------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-------------------------------------------------------------------------------------------------------------------------------------+ | -> Index lookup on employee1 using index1 (FirstName='user-13456') (cost=0.35 rows=1) (actual time=0.0594..0.0669 rows=1 loops=1) | +-------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> explain analyze select * from employee2 where FirstName = 'user-13456'; +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: (employee2.FirstName = 'user-13456') (cost=2036 rows=2012) (actual time=15.7..23.5 rows=1 loops=1) -> Table scan on employee2 (cost=2036 rows=20115) (actual time=0.075..17.5 rows=20000 loops=1) | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.02 sec) That's it, guys. This article will help us to understand the impact of indexes on tables. How to analyze queries with explain analyze command. Also, there is learning about how to set up MySQL and how to write stored procedures for bulk insert.
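As a follow-up experiment in the same spirit (the queries below are a sketch; no output is shown because the timings and exact plans depend on your machine and optimizer statistics), you would typically see the indexed table use an index range scan for a range predicate on FirstName, while the non-indexed table still falls back to a full table scan. And when an index is no longer useful, it is worth dropping so it stops slowing down writes:

MySQL
-- range predicate on the indexed column of employee1
mysql> explain analyze select FirstName from employee1 where FirstName between 'user-100' and 'user-199';

-- the same predicate on employee2, which has no index on FirstName
mysql> explain analyze select FirstName from employee2 where FirstName between 'user-100' and 'user-199';

-- clean up the demo index once it has served its purpose
mysql> ALTER TABLE employee1 DROP INDEX index1;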
In this article, we're going to look at how we can use Flyway to manage our SQL database schema in a Spring Boot application. Flyway is a database migration tool that offers capabilities for migration history and rollback and allows us to separate the database schema-related layer from the database entity layer of an application. Application Setup The Spring Boot application we will be using can be generated using this Spring Initializr link. It contains all the necessary dependencies. After downloading the application and resolving the dependencies, we will create a new Postgres database called spring-boot-flyway and configure the application to connect to it. Listing 2.1 application.properties: Properties files spring.datasource.url=jdbc:postgresql://localhost:5432/spring-boot-flyway spring.datasource.username=demo spring.datasource.password=demo By default, Flyway will search the db/migration/ directory in the classpath for migration files containing SQL statements to manage database tables and records. For older versions of the library, we may need to create an empty text file called .keep in resources/db/migration/ to ensure the directory is compiled and available during application start-up to avoid errors. Having done that, we can now start the application and it should run successfully. Basic Usage The way Flyway works is that we create a migration file in the resources/db/migration directory and Spring Boot will automatically execute the migration script because we've added the Flyway dependencies to the classpath in section 2. Listing 3.1 V1__Users.sql: SQL CREATE TABLE IF NOT EXISTS users ( id SERIAL, email VARCHAR(200) NOT NULL, name VARCHAR(200) NOT NULL, PRIMARY KEY (id) ); Let's take a minute to examine the snippet in Listing 3.1. The file name, V1__Users.sql, follows a certain convention: The "V" indicates that this is a versioned migration. The "1" that follows the V is the actual version number. It can also be "V1_1", which will translate to version 1.1. This is followed by the separator "__" (two underscores). This separates the version information from the name of the migration file, that is, Users, in this case. The last part ".sql" is the extension; thus, the file contains a plain SQL statement. At this point, restarting the application will create the users table in the database. Furthermore, we can see there's another table that we did not explicitly create - flyway_schema_history. The flyway_schema_history is used by Flyway itself to track the migration that has been applied. If the table is missing, Flyway will assume we're initializing the database for the first time and run all the migrations in order of their version number. When the flyway_schema_history table is present, Flyway will only apply newer migration files that have not been applied before. This means that in order for us to add new tables, we only need to create newer migration files with an updated version number and restart the application. In addition to using SQL, we can also write our migration script in Java. In the Java migration style, our migration files are Java classes and must extend the abstract BaseJavaMigration class and implement the migrate method. IDEs typically don't expect Java classes to be in the resources directory and so we're going to create a new package called db/migration in src/main/java. It is very important to know that this new package db/migration should be in the src/main/java directory. 
Let's create a new Java migration to add a new table:

Listing 3.2 V2__Posts.java:

Java
public class V2__Posts extends BaseJavaMigration {
    @Override
    public void migrate(Context context) throws Exception {
        var sql = """
                CREATE TABLE posts (
                    id SERIAL,
                    author_id INT NOT NULL,
                    post TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    PRIMARY KEY (id)
                );
                """;
        try (var statement = context.getConnection().createStatement()) {
            statement.execute(sql);
        }
    }
}

The advantage of using Java migrations over SQL files is that we can add custom logic, conditions, and validations that would not be possible with plain SQL. For example, we can check whether another table exists, read certain values from the environment, and so on. As you're probably guessing by now: yes, it is possible to mix SQL and Java-style migrations in the same codebase, as long as the Flyway location is the same for both.

Flyway Configuration and Customization

So far, we've been working with the default Flyway behavior. We can further tweak Flyway to suit our needs. For example, we can change the default location for migration files, configure the database schemas Flyway manages, change the SQL migration prefix from "V" to whatever we want, and much more. In the configuration below, we configure the path where the migration files are located and disable cleaning the database (i.e., dropping all tables) to prevent accidental use in a production environment.

Listing 4.1 application.properties:

Properties files
spring.flyway.locations=classpath:migrations
spring.flyway.clean-disabled=true

There are other configurable properties under the spring.flyway key that we can use to fine-tune the behavior of the library. We can also consult the Flyway documentation page for reference.

Flyway Callbacks

Flyway provides us with the ability to configure callbacks that are invoked at different stages of the migration process. The callback mechanism is a handy way of performing certain actions at different phases of the migration life cycle. Let's say we have some default data that we want to seed when our application is starting up. We can simply create a callback that supports the AFTER_MIGRATE event.

Listing 5.1 FlywayDatabaseSeeder.java:

Java
public class FlywayDatabaseSeeder implements Callback {

    @Override
    public boolean supports(Event event, Context context) {
        return event.name().equals(Event.AFTER_MIGRATE.name());
    }

    @Override
    public void handle(Event event, Context context) {
        try (var statement = context.getConnection().createStatement()) {
            var ADMIN_EMAIL = "superadmin@example.com";
            var checkQuery = "SELECT id FROM users WHERE email = '%s'"
                    .formatted(ADMIN_EMAIL);
            statement.execute(checkQuery);
            ResultSet resultSet = statement.getResultSet();

            // return if the seeder has already been executed
            if (resultSet.next()) return;

            var sql = """
                    INSERT INTO users (email, name)
                    VALUES ('%s', 'Super Admin')
                    """.formatted(ADMIN_EMAIL);
            statement.execute(sql);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public boolean canHandleInTransaction(Event event, Context context) {
        return true;
    }

    @Override
    public String getCallbackName() {
        return FlywayDatabaseSeeder.class.getName();
    }
}

In the listing above, the supports method declares that this callback should only be executed for the AFTER_MIGRATE event, and the handle method contains the logic for inserting the default super admin user if it does not exist already.
Before this can work, we need to register the callback class with Flyway in SpringBoot. We do this by creating a FlywayMigrationStrategy bean. Listing 5.2 SpringBootFlywayApplication.java: Java @Bean public FlywayMigrationStrategy flywayMigrationStrategy() { return (flywayOld) -> { /* Update the existing autoconfigured Flyway bean to include our callback class */ Flyway flyway = Flyway.configure() .configuration(flywayOld.getConfiguration()) .callbacks(new FlywayDatabaseSeeder()) .load(); flyway.migrate(); }; } There are other events in the org.flywaydb.core.api.callback.Event enum that we can configure our Callback classes to support. For example, you can have a callback that'll support the AFTER_MIGRATE_ERROR event and send a Slack notification to alert the engineers. Tips and Tricks While developing in your local environment, you can delete the entry for a migration from the flyway_schema_history table. Next time you start your application, the migration for which you deleted its history will be executed again. This way, you can correct errors or update your schema while still developing on your local machine without dropping the entire database. Additionally, in SpringBoot, you can control when Flyway should execute the migration script on application start-up. For example, let's say we don't want the migration to be automatically executed on our local environment. We can do the following: Listing 6.1 SpringBootFlywayApplication.java: Java @Bean public FlywayMigrationStrategy flywayMigrationStrategy(@Value("${spring.profiles.active}") String activeProfile) { return (flywayOld) -> { /* Update the existing autoconfigured Flyway bean to include our callback class */ Flyway flyway = Flyway.configure() .configuration(flywayOld.getConfiguration()) .callbacks(new FlywayDatabaseSeeder()) .load(); if(!"local".equalsIgnoreCase(activeProfile)) { flyway.migrate(); } }; } Conclusion One of the advantages of using a database migration tool is that it makes the database schema a part of the application codebase. This makes it easier to track changes to the database over time as there's a central reference point in the application. The complete source code is available on GitHub, Happy coding.
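As a footnote to the local-development tip above, the history cleanup itself is a one-line statement. A sketch, assuming the default history table name and the V2 migration created earlier; since that migration creates the posts table, you would typically also drop the table before restarting so the re-run does not fail:

SQL
-- remove the record of the V2 migration so Flyway re-runs it on the next start-up
DELETE FROM flyway_schema_history WHERE version = '2';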
SQL Server Error 18456 Rectified With Ease Database admins might encounter SQL Server error 18456 every once in a while. Users receive this error message when their attempt to connect to a live server is denied due to invalid credentials. It is a frustrating issue that reduces productivity. Therefore, it becomes all the more essential to find a solution to resume business workflow. Thus, we wrote this article to provide users with the causes, solutions, and best practices to avoid this error. Let us begin our discussion by discovering this problem's underlying causes. Why Does Error 18456 SQL Server Authentication Occur? The SQL Server error 18456 is further classified into separate states numbering from 1 to 133. These cover several different but related issues. Here we will go through the ones that are frequently encountered. Login Disabled: (Error: 18456, Severity: 14, State: 1) There could be a case where a particular user has been locked out of the SQL Server. The DBA does it to prevent the user from accessing sensitive data. Incorrect Credentials: (Error: 18456, Severity: 14, State: 5) Most users get this error for the first time when they try to log in with invalid credentials. It can either be an incorrect user id, a wrong password, or both. Authentication Mismatch: (Error: 18456, Severity: 14, State: 6) Users may be attempting to log in via Windows authentication, whereas, by default, the server is set to SQL Server authentication. Users can verify it and change it accordingly on the SQL Server management console. Database Not Found:(Error: 18456, Severity: 14, State: 38) In this state, the server refuses to open the database requested by the user. There are two different reasons for this. The first one is some issue in the connection string itself, and the other involves login via an expired password. Other minor reasons exist, such as remote login being disabled, permission issues, security policy updates, etc. In the end, most of these issues that cause error 18456 SQL Server authentication involve incorrect login credentials. So, figuring it out will fix the problem. It's now time to go over the various steps to solve this error. SQL Server Error 18456 State 1 and the Steps To Fix It This is the most basic error, so the user needs to follow only a simple procedure to resolve it. Step-1. Ensure proper login credentials. It may seem very basic, but sometimes the simplest things are the ones that we forget to check. Step-2. In state 1, the user is usually locked out of the server. To check whether a user is locked out, DBA can run the following query: ALTER LOGIN [UserLoginName] WITH CHECK_POLICY = OFF, CHECK_EXPIRATION = OFF; This command will temporarily disable passwords for that particular user. It is now the task of the DBA to reset the permissions by executing the following query: ALTER LOGIN [UserLoginName] WITH PASSWORD = 'NewPassword'; Procedure To Resolve State 5 of Error 18456 in SQL Server This state indicates that the login attempt failed due to an invalid user account. Here are detailed steps to fix this error: Step-1. First of all, validate if the user account is still present on the server or not. It can be done by executing the following query on the SSMS: USE master; SELECT name FROM sys.sql_logins WHERE name = 'UserLoginName'; Step-2. Now if the user account is absent, simply create it using the following T-SQL command to clear up SQL Server error 18456: USE master; CREATE LOGIN [UserLoginName] WITH PASSWORD = 'UserPassword'; Step-3. 
Once the account is created, grant the required server-level permissions. The following command will do the trick:

USE master;
ALTER SERVER ROLE [sysadmin] ADD MEMBER [UserLoginName];

The above command adds the login as a member of the sysadmin fixed server role, which grants it full administrative privileges. If the user should have a different server role or only needs specific permissions, adjust the command accordingly.

Step-4. Next, check the default database assigned to the login. Execute the following query:

USE master;
SELECT name, default_database_name FROM sys.sql_logins WHERE name = 'UserLoginName';

Step-5. Make sure that the default database specified is valid and accessible. If the default database is incorrect or unavailable, change it using the following query:

USE master;
ALTER LOGIN [UserLoginName] WITH DEFAULT_DATABASE = [UserDatabaseName];

Step-6. Ask the user to retry logging in with the corrected credentials. Ensure that they provide the right username, password, and default database (if necessary).

Apart from these steps, additional queries can be used to address the remaining states. For users facing this error, invalid login credentials may not be the only issue troubling their SQL Server, which is why a dedicated recovery tool can be worth considering.

Complete Automated Solution for Error 18456 of SQL Server Authentication

The manual procedure for diagnosing SQL Server error 18456 can be slow and error-prone. A professional utility such as SysTools SQL Recovery Manager aims to handle these issues for users who lack the time or technical skills to work through them by hand. Its steps are straightforward:

Step-1. Open the tool on your machine and choose the option that matches your requirement.
Step-2. Browse and attach the master MDF file. This starts the scanning process.
Step-3. A list of all users on the SQL Server appears, along with the password column.
Step-4. Select the user you want a new password for, and then click OK.
Step-5. Start your SQL Server and verify the changed passwords for the users.

Best Practices To Avoid SQL Server Error 18456

These guidelines help users avoid most of the situations that cause error 18456 and keep logins secure and reliable:

Use strong and unique passwords for all users. Passwords should not be shared and should include upper- and lowercase letters, special characters, and numbers.
Update and review login credentials regularly, so that permissions are granted on a need-to-know basis and outdated credentials are revoked.
Monitor and limit login attempts for all users, so that unauthorized access is reported and important data is protected.
Maintain server connections, so that the server does not drop connections during working hours, which can be costly.
Configure firewalls and implement network security to restrict access to the SQL Server from untrusted sources.

Conclusion

In this article, we covered the causes of SQL Server error 18456 together with the available solutions, and we looked at what a professional SQL management tool can add.
By using such a tool and following the best practices above, users can keep their SQL Server in good condition.
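As a complement to the manual steps above, a DBA can confirm a login's status before deciding which state applies. A minimal sketch using the built-in sys.sql_logins catalog view and the LOGINPROPERTY function; [UserLoginName] is a placeholder:

SQL
-- Check whether the login exists and whether it is disabled, locked out, or expired.
SELECT name,
       is_disabled,
       LOGINPROPERTY(name, 'IsLocked')    AS is_locked,
       LOGINPROPERTY(name, 'IsExpired')   AS is_expired,
       LOGINPROPERTY(name, 'LockoutTime') AS lockout_time
FROM sys.sql_logins
WHERE name = 'UserLoginName';

-- Re-enable a disabled login (state 1) once the cause has been reviewed.
ALTER LOGIN [UserLoginName] ENABLE;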
Today, the demand for real-time data processing and analytics is higher than ever before. The modern data ecosystem requires tools and technologies that can not only capture, store, and process vast amounts of data but also it should deliver insights in real time. This article will cover the powerful combination of MongoDB, Kafka, Debezium, and RisingWave to analyze real-time data, how they work together, and the benefits of using these open-source technologies. Understanding Debezium and RisingWave Before we dive into the implementation details, it’s important to understand what these two tools are and what they do. Debezium: is an open-source distributed platform for change data capture(CDC). CDC is a technique to track data changes written to a source database and automatically sync target databases. For example, Debezium’s MongoDB Connector can monitor for document changes in databases and collections as they occur in real-time, recording those changes as events in Kafka topics. RisingWave: is a distributed open-source SQL database for stream processing. Its main goal is to make it easier and cheaper to construct applications that operate in real-time. As it takes in streaming data, RisingWave performs on-the-fly computations with each new piece of data and promptly updates the outcomes. For example, RisingWave accepts data from sources like Kafka, constructs materialized views for complex data, and you can query them using SQL. Analyzing Real-Time Data: The Pipeline Once we have knowledge about each tool, let’s discuss how MongoDB, Kafka, Debezium, and RisingWave can work together to create an efficient real-time data analysis pipeline. These technologies are free to use and easy to integrate with each other. Data Generation and Storage in MongoDB: Our data pipeline starts with the generation and storage of data in MongoDB. Given MongoDB’s flexible data model, it is possible to store data in multiple formats, making it suitable for diverse data sources. Data Capture with Debezium: The next step in the pipeline is the capture of changes (all of the inserts, updates, and deletes) in MongoDB using Debezium. Debezium provides a CDC connector for MongoDB that can capture row-level changes in the database. Once the changes are captured, they are sent to Kafka for processing. Data Streaming with Kafka: Kafka receives the data from Debezium and then takes care of streaming it to the consumers. In our case, we consume data with RisingWave. Data Processing with RisingWave: Finally, the streamed data is received and processed by RisingWave. RisingWave provides a high-level SQL interface for complex event processing and streaming analytics. The processed data can be passed to BI and Data analytics platforms or used for real-time decision-making, anomaly detection, predictive analytics, and much more. This pipeline’s key strengths are its ability to handle large volumes of data, process events in real time, and produce insights with minimal latency. For example, this solution can be used for building a global hotel search platform to get real-time updates on hotel rates and availability. When rates or availability change in one of the platform’s primary databases, Debezium captures this change and streams it to Kafka, and RisingWave can do trend analysis. This ensures that users always see the most current information when they search for hotels. 
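To make the hotel-rate example concrete, the continuous query in RisingWave could look roughly like the sketch below. The relation and column names (hotel_rates, hotel_id, rate) are hypothetical, and the sketch assumes a source or table named hotel_rates has already been created over the CDC topic; the point is that the materialized view stays up to date as Debezium streams each change through Kafka:

SQL
-- Hypothetical rolling aggregate over CDC events from the hotel platform.
CREATE MATERIALIZED VIEW hotel_rate_trends AS
SELECT hotel_id,
       AVG(rate) AS avg_rate,
       MIN(rate) AS best_rate,
       COUNT(*)  AS updates_seen
FROM hotel_rates
GROUP BY hotel_id;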
How To Integrate Quickstart This guide shows you how to configure technically the MongoDB Debezium Connector to send data from MongoDB to Kafka topics and ingest data into RisingWave. After completing this guide, you should understand how to use these tools to create a real-time data processing pipeline and create a data source and materialized view in RisingWave to analyze data with SQL queries. To complete the steps in this guide, you must download/clone and work on an existing sample project on GitHub. The project uses Docker for convenience and consistency. It provides a containerized development environment that includes the services you need to build the sample data pipeline. Before You Begin To run the project in your local environment, you need the following. Git Ensure you have Docker installed in your environment. Ensure that the PostgreSQL interactive terminal, psql, is installed in your environment. For detailed instructions, see Download PostgreSQL. Start the Project The docker-compose file starts the following services in Docker containers: RisingWave Database. MongoDB, configured as a replica set. Debezium. Python app to generate random data for MongoDB. Redpanda with the MongoDB Debezium Connector installed. We use Redpanda as a Kafka broker. Kafka Connect UI to manage and monitor our connectors. To start the project, simply run the following command from the tutorial directory: docker compose up When you start the project, Docker downloads any images it needs to run. You can see the full list of services in docker-compose.yaml file. Data Flow App.py generates random user data (name, address, and email) and inserts them into MongoDB users collection. Because we configured the Debezium MongoDB connector to point to the MongoDB database and the collection we want to monitor, it captures data in real time and sinks them to Redpanda into a Kafka topic called dbserver1.random_data.users. Next steps, we consume Kafka events and create a materialized view using RisingWave. Create a Data Source To consume the Kafka topic with RisingWave, we first need to set up a data source. In the demo project, Kafka should be defined as the data source. Open a new terminal window and run to connect to RisingWave: psql -h localhost -p 4566 -d dev -U root As RisingWave is a database, you can directly create a table for the Kafka topic: CREATE TABLE users (_id JSONB PRIMARY KEY, payload JSONB) WITH ( connector = 'kafka', kafka.topic = 'dbserver1.random_data.users', kafka.brokers = 'message_queue:29092', kafka.scan.startup.mode = 'earliest' ) ROW FORMAT DEBEZIUM_MONGO_JSON; Normalize Data With Materialized Views To normalize user’s data, we create a materialized view in RisingWave: CREATE MATERIALIZED VIEW normalized_users AS SELECT payload ->> 'name' as name, payload ->> 'email' as email, payload ->> 'address' as address FROM users; The main benefit of materialized views is that they save the computation needed to perform complex joins, aggregations, or calculations. Instead of running these operations each time data is queried, the results are calculated in advance and stored. Query Data Use the SELECT command to query data in the materialized view. Let's see the latest results of the normalized_users materialized view: SELECT * FROM normalized_users LIMIT 10; In response to your query, a possible result set (with random data) might look like: Summary This is a basic setup for using MongoDB, Kafka, Debezium, and RisingWave for a real-time data processing pipeline. 
The setup can be adjusted based on your specific needs, such as adding more Kafka topics, tracking changes in multiple MongoDB collections, implementing more complex data processing logic, or combining multiple streams in RisingWave.

Related Resources:
Debezium connector for MongoDB
MongoDB Kafka Connector

Recommended Content:
Query Real-Time Data in Kafka Using SQL
How a Streaming Database Differs From a Traditional Database
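As a small example of the "more complex data processing logic" mentioned above, you could layer another materialized view on top of normalized_users. This is only a sketch and assumes the PostgreSQL-style split_part function is available in your RisingWave version; it counts users per email domain and is kept up to date as new Debezium events arrive:

SQL
-- Derived view over the normalized_users materialized view created earlier.
CREATE MATERIALIZED VIEW users_per_domain AS
SELECT split_part(email, '@', 2) AS email_domain,
       COUNT(*)                  AS user_count
FROM normalized_users
GROUP BY split_part(email, '@', 2);

-- Query it like any other table:
SELECT * FROM users_per_domain ORDER BY user_count DESC LIMIT 10;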
Earlier this year, we experienced intermittent timeouts in our application while interacting with our database over a period of two weeks. Despite our best efforts, we couldn't immediately identify a clear cause; there were no code changes that significantly altered our database usage, no sudden changes in traffic, and nothing alarming in our logs, traces, or dashboards. During those two weeks, we deployed 24 different performance and observability-focused changes to address the problem. In this post, I will share a bit about what those were and the value we've gained from them.

The Initial Investigation

We first looked into these slowdowns after noticing them, receiving a report from a customer, and seeing a flurry of context canceled errors in our error reporting tool, Sentry. Our on-call engineer, Aaron, initiated an incident and began investigating. He opened our API dashboard in Grafana, which provides a high-level overview of the health of our API. He confirmed that we did indeed time out on some API requests, but within a minute, we had returned to normal service. After updating the incident to let everyone know that things seemed okay, he began investigating what had caused the problem.

Opening up the trace of a single failure, Aaron noticed that this HTTP request had waited almost 20 seconds to get an available connection from the connection pool.

What are connection pools? When our app communicates with our database, it uses a client-side connection pool that is implemented in the database/sql Go package. The package uses these pools to limit the number of processes in our app that can communicate with the database at any one time. When an operation uses the database, it sends that query to the database/sql package, which tries to get a connection from its connection pool. If all the available connections are in use, the operation is effectively blocked until it can get a connection. This blocking is the 20-second delay Aaron could see in the trace.

Fortunately, we already had observability in place to identify this issue. We'd implemented it using a go.patch file to monkey patch the database/sql package and add a trace to the ConnectionPoolWait method. It's not the most robust approach, but for the sake of adding a single span to a trace, it does the job:

Diff
--- /tmp/sql.go	2022-08-01 23:45:55.000000000 +0100
+++ /opt/homebrew/Cellar/go/1.19/libexec/src/database/sql/sql.go	2022-09-16 13:06:58.000000000 +0100
@@ -28,6 +28,8 @@
 	"sync"
 	"sync/atomic"
 	"time"
+
+	"go.opencensus.io/trace"
 )
 
 var (
@@ -1324,6 +1326,9 @@
 		return conn, nil
 	}
 
+	ctx, span := trace.StartSpan(ctx, "database.sql.ConnectionPoolWait")
+	defer span.End()
+
 	// Out of free connections or we were asked not to use one. If we're not
 	// allowed to open any more connections, make a request and wait.
 	if db.maxOpen > 0 && db.numOpen >= db.maxOpen {

Aaron had spotted from our traces that we had a variety of requests stuck waiting on a connection pool. At this point, we took things to Kibana to get a better idea of the distribution of the types of those requests. This would allow us to confirm whether this was a small number of endpoints fighting for contention over a single database connection pool or perhaps many endpoints that might all be using different pools. What we found was that this issue was fairly widespread—there was no single connection pool affected. We'd been hoping for it to be a single pool, as that would have made it a little simpler to home in on the work in that pool and optimize it.
In lieu of that, we started looking at general database health. Historical HTTP traffic and PubSub metrics didn't suggest that anything we received at that time was out of the ordinary. Heroku's Postgres stats also suggested a fairly normal database load, though it did highlight some neglected queries that had been getting slower as our database grew. Having no obvious place to start, we decided to fix anything that looked slow and was a quick win. We'd shipped a number of optimizations, including: Moving policy violations over to using a materialized view rather than having to pull policies and all their related incidents just to do that calculation for each request. Adding some new database indices to speed up slow queries. Re-writing some queries that joined and then filtered on columns that weren't indexed when they could filter on that same column in an indexed table that was already present in the join. At this point, we felt we'd sunk a decent amount of time into investigating this outage, and we'd shipped a good amount of low-hanging fruit; we called it a day and closed the incident. Repeated Slowdown Within a couple of days of shipping the initial incident, lightning struck twice—we were timing out again. I was on call this time and was pulled into the generated incident. I pulled up our dashboards and saw that, again, we'd been timing out due to waiting on connection pools. Looking at Kibana and Google Cloud Trace, there was no discernible pattern in our slow requests. One of our engineers, Lawrence, joined the incident and suggested, rather than playing whack-a-mole and patching query after query, adding a one-second lock timeout to all of our transactions. As we weren't in a great state, this would at least let us quickly identify which requests were holding onto transactions longer than we'd like. We deployed this change, and fortunately, nothing broke, but unfortunately, this meant we were still no closer to identifying the cause of our slowdowns. One notable change we made at this point was to start processing Slack events asynchronously. Whenever an event occurs in a Slack channel, our bot has access to it; we get notified via a webhook. Additionally, whenever Slack syncs a workspace's users, we receive webhooks for each change. These can add up to a lot of events, and Slack can often send us a big batch of these at once. Originally, we would receive this event from Slack, and within the lifetime of that HTTP request, we'd perform any responses we needed to, e.g., offering to attach a GitHub link that a user has just posted in the channel. Regardless of the operation, we'd always do some database queries, like looking up the organization with that Slack team ID. To help alleviate periods of high traffic, we began processing these events asynchronously instead. So when a Slack webhook comes in, we now just forward that straight to PubSub and avoid making any database queries. By using PubSub, we can throttle access to the database a bit more, and if we want, we get some retry logic for free. The downside is that we've introduced some additional latency in the time it takes us to process events, but we feel that's a reasonable tradeoff. This felt like quite a significant change, so we again called it a day and hoped that was the last of our timeouts. Improving Our Observability Despite our best efforts, the next day, we saw another period of timeouts. At this point, we'd made a number of changes that felt like they'd help but weren't making significant progress. 
We realized we had to double down on improving our observability so we could narrow in on the root of the problem. The golden egg we wanted was to be able to group operations over a time period and sum up the total time each operation spent holding a connection pool. This would give us the ability to query things like "Which API endpoints are hanging onto database connections for the longest?"

We already log a metric whenever we process a PubSub message or handle an HTTP request. With this, we know "How long did this operation take?", "What operation is it?", and "What group of services does it belong to?" So, we looked at updating that log with additional info about connection pool usage.

In theory, calculating the time we spend in connection pools sounds easy enough, but it, unfortunately, wasn't as simple as "start timer, run query, end timer." For starters, we couldn't put a timer around every database query we run, so we needed a middleware to apply this globally. Additionally, whenever we open a database transaction, a connection is held for the lifetime of the transaction, so we needed to devise a way to detect if we're in a transaction and alter our counting logic as required.

For the middleware, we initially looked at plugging something into Gorm—our ORM. But we soon realized that the hooks provided by Gorm included the time we spent waiting on a connection pool, so we'd be counting the stuff we already knew. Instead, we implemented a middleware using the ngrok/sqlmw package, which allowed us to hook in right before and right after a query or transaction occurs. Here, we call out to a new method we added, trackUsage, which uses Go's context to maintain our new counter:

Go
func trackUsage(ctx context.Context) (end func()) {
	inTransaction, _ := ctx.Value(InTransactionContextKey).(bool)
	startTime := time.Now()
	return func() {
		duration := time.Since(startTime)
		log.AddCounter(ctx, log.DatabaseDurationCounterKey, duration)
		// If we're inside a transaction, the connection is held for the entire
		// transaction duration, so we avoid double-counting each individual
		// statement within the transaction
		if !inTransaction {
			log.AddCounter(ctx, log.DatabaseConnectionDurationCounterKey, duration)
		}
	}
}

Now, we can filter both our logs and traces by the time each operation spends holding a database connection pool. Using tools like Grafana, we can group this by the type of operation and sum up the value over a time period. But we've yet to really make use of this: while we were waiting for some useful data, we shipped another change that solved the root cause of our slowdowns.

The Eventual Fix

While sifting through our logs to check "How are we looking after yesterday's fixes?" and "Is our connection pool counter working?", we noticed an unnecessary transaction we were opening whenever we were handling the submission of a Slack modal. This is an HTTP endpoint that Slack calls whenever a user presses the confirmation button in the view you see when you use a slash command like /inc. We removed the transaction for most modal submissions, and in cases where we needed transactional guarantees, we added it for those code paths explicitly.

It wasn't until a few days later that we realized this was the root of our problems. As the timeouts were intermittent and didn't occur every day, it took a little while to confirm the issue. But it's now been four months, and we've been database timeout free.
This did confirm our suspicions from earlier on—there wasn't one noticeably slow thing that caused this issue. Instead, it was many short transactions that, when added together, caused us some real problems.

Wrap Up

It was a little disappointing to fight with this issue for days and for there not to be an aha! moment where we solved everything in one fell swoop. But, on the upside, we're now much better equipped to diagnose problems like this in the future. And we've made some legitimate performance improvements that should make our application faster for our users.

Had we been using different databases for different parts of our application, we'd likely have been able to track down this issue sooner. However, that approach isn't free either: we'd have to start thinking about distributed transactions, and our development environments would be less slick. We're happy with the tradeoffs we've made for now.
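As a footnote to the one-second lock timeout described in the repeated-slowdown section: in PostgreSQL, that kind of guard can also be scoped to a single transaction rather than applied globally. A minimal sketch, with an illustrative table name only:

SQL
BEGIN;
-- Fail fast instead of queueing behind a long-held lock.
SET LOCAL lock_timeout = '1s';
UPDATE incidents SET status = 'resolved' WHERE id = 42;
COMMIT;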
Let's say you have an application developed in Node.js (or any other platform). This application connects to a MongoDB database (NoSQL) to store ratings on books (number of stars given and a comment). Let's also say that you have another application developed in Java (or Python, C#, TypeScript… anything). This application connects to a MariaDB database (SQL, relational) to manage a catalog of books (title, year of publishing, number of pages). You are asked to create a report that shows the title and rating information for each book. Notice that the MongoDB database doesn't contain the title of the books, and the relational database doesn't contain the ratings. We need to mix data created by a NoSQL application with data created by a SQL application. A common approach to this is to query both databases independently (using different data sources) and process the data to match by, for example, ISBN (the id of a book) and put the combined information in a new object. This needs to be done in a programming language like Java, TypeScript, C#, Python, or any other imperative programming language that is able to connect to both databases. A polyglot application This approach works. However, joining data is a job for a database. They are built for this kind of data operation. Also, with this approach, the SQL application is no longer an SQL-only application; it becomes a database polyglot, and this increases complexity, making it harder to maintain. With a database proxy like MaxScale, you can join this data at the database level using the best language for data — SQL. Your SQL application doesn't need to become a polyglot. Although this requires an additional element in the infrastructure, you also gain all the functionality that a database proxy has to offer. Things such as automatic failover, transparent data masking, topology isolation, caches, security filters, and more. MaxScale is a powerful, intelligent database proxy that understands both SQL and NoSQL. It also understands Kafka (for CDC or data ingestion), but that's a topic for another occasion. In short, with MaxScale, you can connect your NoSQL application to a fully ACID-compliant relational database and store the data right there next to tables that other SQL applications use. MaxScale allows an SQL application to consume NoSQL data. Let's try this last approach in a quick and easy-to-follow experiment with MaxScale. You will need the following installed on your computer: Docker The mariadb-shell tool The mongosh tool Setting up the MariaDB Database Using a plain text editor, create a new file and save it with the name docker-compose.yml. The file should contain the following: YAML version: "3.9" services: mariadb: image: alejandrodu/mariadb environment: - MARIADB_CREATE_DATABASE=demo - MARIADB_CREATE_USER=user:Password123! - MARIADB_CREATE_MAXSCALE_USER=maxscale_user:MaxScalePassword123! maxscale: image: alejandrodu/mariadb-maxscale command: --admin_host 0.0.0.0 --admin_secure_gui false ports: - "3306:4000" - "27017:27017" - "8989:8989" environment: - MAXSCALE_USER=maxscale_user:MaxScalePassword123! - MARIADB_HOST_1=mariadb 3306 - MAXSCALE_CREATE_NOSQL_LISTENER=user:Password123! This is a Docker Compose file. It describes a set of services to be created by Docker. We are creating two services (or containers) — a MariaDB database server and a MaxScale database proxy. They will be running locally on your machine, but in production environments, it's common to deploy them in separate physical machines. 
Keep in mind that these Docker images are not suitable for production! They are intended for quick demos and tests. You can find the source code for these images on GitHub. For the official Docker images from MariaDB, head to the MariaDB page on Docker Hub.

The previous Docker Compose file configures a MariaDB database server with a database (or schema; they are synonyms in MariaDB) called demo. It also creates a user named user with the password Password123!. This user has suitable privileges on the demo database. There's an additional user with the name maxscale_user and the password MaxScalePassword123!. This is the user that the MaxScale database proxy will use to connect to the MariaDB database.

The Docker Compose file also configures the database proxy by disabling HTTPS (don't do this in production!), exposing a set of ports (more on this in a moment), and setting the database user and the location of the MariaDB database server (usually an IP address, but here we can use the name of the container previously defined in the Docker Compose file). The last line creates a NoSQL listener that we'll use to connect as a MongoDB client on the default port (27017).

To start the services (containers) using the command line, move to the directory in which you saved the Docker Compose file and run the following:

Shell
docker compose up -d

After downloading all the software and starting the containers, you'll have a MariaDB database and a MaxScale proxy, both preconfigured for this experiment.

Creating a SQL Table in MariaDB

Let's connect to the relational database. In the command line, execute the following:

Shell
mariadb-shell --dsn mariadb://user:'Password123!'@127.0.0.1

Check that you can see the demo database:

MariaDB SQL
show databases;

Switch to the demo database:

MariaDB SQL
use demo;

Connecting to a database with MariaDB Shell.

Create the books table:

MariaDB SQL
CREATE TABLE books(
    isbn VARCHAR(20) PRIMARY KEY,
    title VARCHAR(256),
    year INT
);

Insert some data. I'm going to use the cliché of inserting my own books:

MariaDB SQL
INSERT INTO books(title, isbn, year)
VALUES
    ("Vaadin 7 UI Design By Example", "978-1-78216-226-1", 2013),
    ("Data-Centric Applications with Vaadin 8", "978-1-78328-884-7", 2018),
    ("Practical Vaadin", "978-1-4842-7178-0", 2021);

Check that the books are stored in the database by running:

MariaDB SQL
SELECT * FROM books;

Inserting data with MariaDB Shell.

Creating a JSON Collection in MariaDB

We haven't installed MongoDB, yet we can use a MongoDB client (or application) to create collections and documents as if we were using MongoDB, except that the data is stored in a powerful, fully ACID-compliant, and scalable relational database. Let's try that out! In the command line, use the MongoDB shell tool to connect to the MongoDB… wait… it's actually the MariaDB database! Run the following:

Shell
mongosh

By default, this tool tries to connect to a MongoDB server (which, again, happens to be MariaDB this time) running on your local machine (127.0.0.1) using the default port (27017). If everything goes well, you should be able to see the demo database listed when you run the following command:

Plain Text
show databases

Switch to the demo database:

Plain Text
use demo

Connecting to MariaDB using Mongo Shell.

We are connected to a relational database from a non-relational client!
Let's create the ratings collection and insert some data into it:

Plain Text
db.ratings.insertMany([
    {
        "isbn": "978-1-78216-226-1",
        "starts": 5,
        "comment": "A good resource for beginners who want to learn Vaadin"
    },
    {
        "isbn": "978-1-78328-884-7",
        "starts": 4,
        "comment": "Explains Vaadin in the context of other Java technologies"
    },
    {
        "isbn": "978-1-4842-7178-0",
        "starts": 5,
        "comment": "The best resource to learn web development with Java and Vaadin"
    }
])

Check that the ratings are persisted in the database:

Plain Text
db.ratings.find()

Querying a MariaDB database using Mongo Shell.

Using JSON Functions in MariaDB

At this point, we have a single database that, from the outside, looks like both a NoSQL (MongoDB) database and a relational (MariaDB) database. We are able to connect to the same database and write and read data from MongoDB clients and SQL clients. All the data is stored in MariaDB, so we can use SQL to join data from MongoDB clients or applications with data from MariaDB clients or applications.

Let's explore how MaxScale is using MariaDB to store MongoDB data (collections and documents). Connect to the database using an SQL client like mariadb-shell, and show the tables in the demo schema:

MariaDB SQL
show tables in demo;

You should see both the books and ratings tables listed. ratings was created as a MongoDB collection: MaxScale translated the commands sent from the MongoDB client and created a table to store the data. Let's see the structure of this table:

MariaDB SQL
describe demo.ratings;

A NoSQL collection is stored as a MariaDB relational table.

The ratings table contains two columns:

id: the object ID.
doc: the document in JSON format.

If we inspect the contents of the table, we'll see that all the data about ratings is stored in the doc column in JSON format:

MariaDB SQL
SELECT doc FROM demo.ratings \G

NoSQL documents are stored in a MariaDB database.

Let's get back to our original goal—show the book titles with their rating information. It is not actually the case, but let's suppose for a moment that the ratings table were a regular table with stars and comment columns. If that were so, joining this table with the books table would be easy, and our job would be done:

MariaDB SQL
/* this doesn't work */
SELECT b.title, r.stars, r.comment
FROM ratings r
JOIN books b USING(isbn)

Back to reality. We need to convert the doc column of the actual ratings table into a relational expression that can be used as a new table in the query. Something like this:

MariaDB SQL
/* this still doesn't work */
SELECT b.title, r.stars, r.comment
FROM ratings rt
JOIN ...something to convert rt.doc to a table... AS r
JOIN books b USING(isbn)

That something is the JSON_TABLE function. MariaDB includes a comprehensive set of JSON functions for manipulating JSON strings. We'll use the JSON_TABLE function to convert the doc column into a relational form that we can use to perform SQL joins. The general syntax of the JSON_TABLE function is as follows:

MariaDB SQL
JSON_TABLE(json_document, context_path COLUMNS (
    column_definition_1,
    column_definition_2,
    ...
  )
) [AS] the_new_relational_table

Where:

json_document: a string or expression that returns the JSON documents to be used.
context_path: a JSON Path expression that defines the nodes to be used as the source of the rows.

And the column definitions (column_definition_1, column_definition_2, etc.)
have the following syntax:

MariaDB SQL
new_column_name sql_type PATH path_in_the_json_doc [on_empty] [on_error]

Combining this knowledge, our SQL query would look like the following:

MariaDB SQL
SELECT b.title, r.stars, r.comment
FROM ratings rt
JOIN JSON_TABLE(rt.doc, '$' COLUMNS(
    isbn VARCHAR(20) PATH '$.isbn',
    stars INT PATH '$.starts',
    comment TEXT PATH '$.comment'
  )
) AS r
JOIN books b USING(isbn);

Joining NoSQL and SQL data in a single SQL query.

We could have used the ISBN value as the MongoDB ObjectID and, consequently, as the id column in the ratings table, but I'll leave that to you as an exercise (hint: use _id instead of isbn when inserting data using the MongoDB client or app).

A Word on Scalability

There's a misconception that relational databases don't scale horizontally (adding more nodes) while NoSQL databases do. But relational databases scale without sacrificing ACID properties. MariaDB has multiple storage engines tailored to different workloads. For example, you can scale a MariaDB database by implementing data sharding with the help of Spider. You can also use a variety of storage engines to handle different workloads on a per-table basis. Cross-engine joins are possible in a single SQL query.

Combining multiple storage engines in a single logical MariaDB database.

Another more modern alternative is distributed SQL with MariaDB Xpand. A distributed SQL database appears as a single logical relational database to applications through transparent sharding. It employs a shared-nothing architecture that scales both reading and writing.

A distributed SQL database deployment.

Conclusion

Our job here is done! Now, your systems can have an ACID-compliant, scalable, 360-degree view of your data independent of whether it was created by SQL or NoSQL applications. There's less need to migrate your apps from NoSQL to SQL or to make SQL apps database polyglots. If you want to learn more about other features in MaxScale, watch this video or visit the docs.
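If the join above becomes a hot path, one option not covered in this article is to expose the ISBN as an indexed virtual column, so MariaDB can use a regular index instead of expanding every document with JSON_TABLE. This is a sketch under the assumption that the doc column always holds valid JSON with an isbn field, as in the demo data:

MariaDB SQL
-- Extract the ISBN once, as a virtual generated column, and index it.
ALTER TABLE demo.ratings
  ADD COLUMN isbn VARCHAR(20) AS (JSON_VALUE(doc, '$.isbn')) VIRTUAL;
CREATE INDEX idx_ratings_isbn ON demo.ratings(isbn);

-- The join can then use the indexed column directly.
SELECT b.title,
       JSON_VALUE(r.doc, '$.starts')  AS stars,
       JSON_VALUE(r.doc, '$.comment') AS rating_comment
FROM demo.ratings r
JOIN demo.books b USING (isbn);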
Serverless architecture, a paradigm that has been around for several years, has been transforming the way we build and deploy applications. In this approach, developers can focus on writing code without worrying about the underlying infrastructure. Instead of managing servers or virtual machines, they can rely on cloud providers to handle the scaling, maintenance, and availability of their applications.

While Serverless architecture has been around for a long time, it is only in recent years that the concept of Serverless databases has gained significant attention. Traditionally, databases, with their stateful and performance-sensitive nature, posed challenges in adopting a Serverless approach. However, the essence of databases is to simplify data management, allowing developers to concentrate on their core business goals. In this context, exploring and understanding Serverless databases is valuable for anyone looking to enable developers. So, what should you consider when choosing a Serverless database? Let's delve into the topic by comparing and analyzing popular Serverless databases such as TiDB, MongoDB, Amazon Aurora, CockroachDB, and DynamoDB.

Suitability

When choosing a database, regardless of whether you're using a Serverless architecture or not, it's important to consider its suitability. Should you go for a general-purpose database or a specialized one? While specialized databases can be more convenient and advantageous in specific scenarios, that convenience is not the only factor technical teams have to weigh. Sometimes, you need to balance the complexities introduced by multiple technology stacks, difficulties in talent recruitment, and the costs of operations and development.

Compared to traditional relational databases, NoSQL databases like DynamoDB or MongoDB are often considered to have better scalability and performance due to fewer limitations. However, users may end up paying a high price to compensate for the capabilities they give up with NoSQL. For example, they may need to ensure data consistency on their own and develop custom code for query and computation logic. From this perspective, relational databases like TiDB, Amazon Aurora, or CockroachDB offer more versatility but slightly less flexibility. Moreover, TiDB offers Hybrid Transactional/Analytical Processing (HTAP) capabilities, enabling users to analyze operational data directly without separate ETL processes. This simplifies infrastructure for small to medium-sized users who don't currently require complex data warehouses.

In conclusion, choosing the right database involves a careful evaluation of the trade-offs between specialization and generalization, considering factors such as scalability, performance, complexity, and the specific needs of your project.

Database | Data Model | ACID Transactions | Indexing | HTAP
DynamoDB / MongoDB | Schemaless | Weak | Weak | No
Amazon Aurora / CockroachDB | Mainly relational model | Strong | Strong | No
TiDB | Relational model | Strong | Strong | Yes

Pay-As-You-Go Pricing Model

One of the core features of Serverless is pay-as-you-go pricing. Almost all Serverless databases attempt to address a common challenge: how to provision resources economically and efficiently under uncertain workloads. Prioritizing lower costs may mean consuming fewer resources; however, in the event of unexpected spikes in business demand, you may have to compromise user experience and system stability. On the other hand, more generous and secure resource provisioning leads to resource waste and higher costs.
Striking a balance between these two styles requires complex and meticulous engineering management, which would divert your focus from the core business. Furthermore, the pay-as-you-go billing model has varying implementations in different Serverless products. Most Serverless products offer granular billing based on storage capacity and read/write operations per unit. This is largely possible due to the distributed architecture that allows finer resource scaling. Amazon Aurora Serverless is an exception in this case: it employs a coarser billing model based on Aurora Capacity Units (ACUs), where one ACU roughly corresponds to 2GB of memory along with the respective CPU and network bandwidth. Additionally, it cannot scale down to zero ACUs. This can result in significant waste for infrequently accessed data, especially when you need to build numerous small clusters of this kind. Moreover, each provider has its own pricing structure. In this regard, TiDB offers more affordable prices compared to the others.

Database | Pricing Granularity | Scale to Zero
DynamoDB | Per query + storage | Yes
MongoDB | Per query + storage | Yes
TiDB | Per query + storage | Yes
CockroachDB | Per query + storage | Yes
Amazon Aurora | ACU + storage | No

Scalability

In reality, the concept of Serverless databases inherently implies scalability. From 0 transactions per second (TPS) to millions of TPS, Serverless databases should be able to automatically scale based on the workload. While database providers could use a single large database to handle different workloads, this approach would result in significant waste and higher prices. Therefore, the architecture of a Serverless database needs to adjust based on the actual workload.

In this regard, distributed databases have a clear advantage. Apart from vertical scaling, they rely primarily on horizontal scaling, which provides better scalability since vertical scaling quickly reaches its limits. Compared with relational distributed databases, a NoSQL database offers better scalability due to the absence of strict transaction constraints. However, the reality is always more complex. For example, MongoDB's Serverless offering is designed for small-scale users with less than one terabyte of data, and its dedicated offering requires manual sharding to achieve further scalability. On the other hand, TiDB and CockroachDB support horizontal scaling with auto-sharding, which enables better scalability. As distributed relational databases, both TiDB and CockroachDB benefit from a storage-compute decoupled design, which enables near real-time scaling of the computing layer. Unlike CockroachDB, TiDB can leverage shared storage at the regional level for rapid and stable storage scaling within a few minutes.

TiDB Serverless Architecture

Database | Scalability | Horizontal Scalability
DynamoDB | Very High | Horizontal scalability
MongoDB | Low | Only supports data volumes < 1TB
Amazon Aurora | Low | Vertical scalability for writes / horizontal scalability with limited read and write capacity
CockroachDB | High | Horizontal scalability + compute-storage decoupled
TiDB | High | Horizontal scalability + compute-storage decoupled + fast storage scalability via object storage

Rolling Upgrades

Database upgrades are unavoidable. Traditionally, they often involve downtime or disruptions to business continuity. However, in the Serverless database realm, minimizing or eliminating downtime during upgrades is a critical consideration. This is where distributed architectures excel. Their inherent fault-tolerant design allows for non-disruptive rolling upgrades of individual components without causing downtime.
Non-distributed architectures face challenges in achieving non-disruptive upgrades. Among the major players, Amazon Aurora Serverless does not support non-disruptive updates, which underlines the advantage of distributed architectures in this regard.

Database | Rolling Upgrade
DynamoDB | Yes
MongoDB | Yes
TiDB | Yes
CockroachDB | Yes
Amazon Aurora | No

Connection Model

In traditional application models, developers typically prefer long-lived connections to databases instead of frequent short-lived connections. However, maintaining long-lived connections in Serverless applications is often challenging due to architectural limitations. On the other hand, using the HTTP protocol for data access has become more convenient, and consequently many databases now offer HTTP-based access interfaces to facilitate data retrieval and manipulation.

Database | HTTP Data API
DynamoDB | Yes
MongoDB | Yes
TiDB | Yes
Amazon Aurora | Yes for v1, no for v2
CockroachDB | No

Professional Support and Services

Professional support and services are essential to consider when selecting a database, Serverless or not. It is crucial to assess the support provided by the vendors. While I cannot objectively evaluate specific products' support services, it is valuable to consider the vendors' reputation and responsiveness to user needs. For example, many TiDB users have expressed satisfaction with the vendor's support, citing timely assistance and a willingness to incorporate user feedback for rapid product improvements.

Wrapping Up

In conclusion, Serverless databases have emerged as a powerful solution for data management in modern applications. By offloading infrastructure management and enabling scalability, they allow developers to focus on their core business objectives. When choosing a Serverless database, it's crucial to consider various factors such as the applicable range, pricing model, scalability, rolling upgrade capabilities, and the availability of HTTP Data APIs. Here's a comparison summary of the different Serverless databases covered in this article:

Database | Applicable Range | Pay-as-You-Go Pricing | Scalability | Rolling Upgrade | HTTP Data API
DynamoDB | Low | Fine granularity | Very High | Yes | Yes
MongoDB | Medium | Fine granularity | Low | Yes | Yes
TiDB | Very High | Fine granularity | High | Yes | Yes
Amazon Aurora | High | Coarse granularity | Low | No | Yes for v1, no for v2
CockroachDB | High | Fine granularity | High | Yes | No

It's important to note that this summary table provides a general overview, and the suitability of each database depends on your specific requirements and use case. By carefully evaluating the characteristics and comparing them to your project's needs, you can make an informed decision and select the most suitable Serverless database to drive the success of your application.
Oren Eini, Wizard, Hibernating Rhinos (@ayende)
Abhishek Gupta, Principal Developer Advocate, AWS
Artem Ervits, Solutions Engineer, Cockroach Labs
Sahiti Kappagantula, Product Associate, Oorwin