A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
The Database Evolution
Organizations are constantly working to build up their big data capabilities in hopes of competing in the modern economy. SQL and NoSQL database types were supposed to help organizations identify and make sense of hidden patterns in the data that businesses collected. In recent years, however, the momentum of the big data movement has seemed to slow as business leaders around the world have failed to realize the results that were promised several years ago. As the momentum has decelerated, how is the use of SQL and NoSQL databases evolving to support these efforts as businesses attempt to course correct in the big data era? In DZone's 2020 Trend Report "Database Evolution: SQL or NoSQL in the Age of Big Data," we explore the role of two popular database types, SQL and NoSQL, in big data initiatives over the next 6-12 months. Readers will find original research, interviews with industry experts, and additional resources with helpful tips, best practices, and more.
NoSQL is a database solution growing in popularity and maturity in several areas, including enterprise domains such as finance and e-commerce. In this tutorial, we will explain how to use MongoDB with Java and Jakarta EE through the newest version of Jakarta NoSQL. If you are unfamiliar with Jakarta NoSQL, it is a Jakarta specification that aims to make working with Java and NoSQL easier. The main goal is to use common annotations to map the multiple database types currently supported: key-value, wide-column, document, and graph. The newest Jakarta NoSQL version, 1.0.0-b5, has three hot features: support for Java records, simplified database configuration, and documentation treated as a first-class citizen. To explore those features, let's create a simple application with Jakarta NoSQL and Java SE against a MongoDB instance. The first step is to install and start the database. To make it smoother, let's run a Docker image. Shell docker run -d --name mongodb-instance -p 27017:27017 mongo Once you have MongoDB running, let's return to the Java application side and include the Maven dependencies: CDI, JSON-B, and JSON-P implementations, plus the Jakarta NoSQL dependency. We'll create a Book entity with ISBN, title, author, year, and edition. The book entity is an immutable case: you don't change it once you release it. Wait, how about a new edition? That is a new book: thus, we often pay for it again. Java @Entity public record Book(@Id String isbn, @Column("title") String title, @Column("author") String author, @Convert(YearConverter.class) @Column("year") Year year, @Column("edition") int edition) { public Book nextEdition(String isbn, Year year) { Objects.requireNonNull(isbn, "isbn is required"); Objects.requireNonNull(year, "year is required"); return new Book(isbn, this.title, this.author, year, this.edition + 1); } } Modeling the Book entity as a record decreases the boilerplate of creating an immutable class. It is essential to highlight that a record is still a Java class, so you can add methods such as a builder to make creating instances easier, or a next-edition method as we did. The next step is the database properties. This new version embraces the philosophy of convention over configuration: you no longer need to write any configuration class; a properties file is enough. These changes break compatibility with the previous version, so please check the driver properties in the JNoSQL driver repository. Properties files jnosql.document.database=library jnosql.mongodb.host=localhost:27017 That's it! If you wish to configure it programmatically instead, you can implement a Supplier of the manager database and define it as an alternative with priority, which is natural if you use CDI. Java @Alternative @Priority(Interceptor.Priority.APPLICATION) @ApplicationScoped public class ManagerSupplier implements Supplier<DocumentManager> { @Produces public DocumentManager get() { ... } } The next step is to execute it and create the first interaction with MongoDB. We'll start with the Template interface, which captures the behavior common to the supported databases.
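One detail the Book record above glosses over is the YearConverter referenced by @Convert, which the tutorial does not show. Below is a minimal sketch of what it might look like, assuming Jakarta NoSQL's AttributeConverter contract (which mirrors JPA's); the exact package can vary between versions, so treat this as illustrative rather than definitive:
Java
// Hypothetical sketch of the converter used by the Book record above.
// Verify the AttributeConverter package for your Jakarta NoSQL version.
import jakarta.nosql.AttributeConverter; // may live in a different package in older releases

import java.time.Year;

public class YearConverter implements AttributeConverter<Year, Integer> {

    @Override
    public Integer convertToDatabaseColumn(Year attribute) {
        // Store the year as a plain integer in the document
        return attribute == null ? null : attribute.getValue();
    }

    @Override
    public Year convertToEntityAttribute(Integer dbData) {
        // Rebuild java.time.Year when reading the document back
        return dbData == null ? null : Year.of(dbData);
    }
}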
Java try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { String id = UUID.randomUUID().toString(); long start = System.currentTimeMillis(); Template template = container.select(Template.class).get(); Book book = new Book(id, "Java Concurrency in Practice", "Brian Goetz", Year.of(2006), 1); template.insert(book); Optional<Book> optional = template.find(Book.class, id); System.out.println("The result " + optional); long end = System.currentTimeMillis() - start; System.out.println("The total operation is: " + end); template.delete(Book.class, id); } As with any code design decision, the Template interface is limited in features because it covers only the behavior common to the NoSQL database types that Jakarta NoSQL supports. There are specializations for each NoSQL type; for document databases, there is DocumentTemplate, where we can explore the capabilities of document types. Java try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { DocumentTemplate template = container.select(DocumentTemplate.class).get(); Book first = new Book(randomUUID().toString(), "Effective Java", "Joshua Bloch", Year.of(2001), 1); Book second = first.nextEdition(randomUUID().toString(), Year.of(2008)); Book third = second.nextEdition(randomUUID().toString(), Year.of(2018)); template.insert(List.of(first, second, third)); DocumentQuery query = DocumentQuery.select().from("Book") .where("title").eq("Effective Java") .orderBy("year").desc().build(); System.out.println("The Effective java editions: "); template.select(query).forEach(System.out::println); template.delete(Book.class, first.isbn()); template.delete(Book.class, second.isbn()); template.delete(Book.class, third.isbn()); } We inserted the three editions and then listed all books with that title, ordered by year in descending order. The DocumentTemplate exposes document-specific characteristics, and there are further DocumentTemplate specializations that let you explore the features of a particular NoSQL database even more. You can check several samples in the mapping specializations. That's it! In this tutorial, you learned how to integrate Java with MongoDB while exploring the new features of Jakarta NoSQL. Jakarta NoSQL is working on the next version, where the Jakarta EE big bang (the javax-to-jakarta namespace move) and support for Jakarta Data are expected. The new features bring capabilities that make it more comfortable for an enterprise Java developer to explore NoSQL databases. References Source code Jakarta NoSQL Jakarta Data proposal Eclipse Microprofile Eclipse Microprofile Config Jakarta Config
Database sharding is the process of dividing data into smaller pieces called "shards." Sharding is typically introduced when there’s a need to scale writes. During the lifespan of a successful application, the database server will hit the maximum number of writes it can perform either at the processing or capacity level. Slicing the data into multiple shards—each one on its own database server—reduces the stress on each individual node, effectively increasing the write capacity of the overall database. This is what database sharding is. Distributed SQL is the new way to scale relational databases with a sharding-like strategy that's fully automated and transparent to applications. Distributed SQL databases are designed from the ground up to scale almost linearly. In this article, you'll learn the basics of distributed SQL and how to get started. Disadvantages of Database Sharding Sharding introduces a number of challenges: Data partitioning: Deciding how to partition data across multiple shards can be a challenge, as it requires finding a balance between data proximity and even distribution of data to avoid hotspots. Failure handling: If a key node fails and there are not enough shards to carry the load, how do you get the data on a new node without downtime? Query complexity: Application code is coupled to the data-sharding logic and queries that require data from multiple nodes need to be re-joined. Data consistency: Ensuring data consistency across multiple shards can be a challenge, as it requires coordinating updates to the data across shards. This can be particularly difficult when updates are made concurrently, as it may be necessary to resolve conflicts between different writes. Elastic scalability: As the volume of data or the number of queries increases, it may be necessary to add additional shards to the database. This can be a complex process with unavoidable downtime, requiring manual processes to relocate data evenly across all shards. Some of these disadvantages can be alleviated by adopting polyglot persistence (using different databases for different workloads), database storage engines with native sharding capabilities, or database proxies. However, while helping with some of the challenges in database sharding, these tools have limitations and introduce complexity that requires constant management. What Is Distributed SQL? Distributed SQL refers to a new generation of relational databases. In simple terms, a distributed SQL database is a relational database with transparent sharding that looks like a single logical database to applications. Distributed SQL databases are implemented as a shared-nothing architecture and a storage engine that scales both reads and writes while maintaining true ACID compliance and high availability. Distributed SQL databases have the scalability features of NoSQL databases—which gained popularity in the 2000s—but don’t sacrifice consistency. They keep the benefits of relational databases and add cloud compatibility with multi-region resilience. A different but related term is NewSQL (coined by Matthew Aslett in 2011). This term also describes scalable and performant relational databases. However, NewSQL databases don’t necessarily include horizontal scalability. How Does Distributed SQL Work? To understand how Distributed SQL works, let’s take the case of MariaDB Xpand—a distributed SQL database compatible with the open-source MariaDB database. 
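Before looking at how a distributed SQL database automates all of this, it helps to make the coupling problem concrete. The sketch below is hypothetical and purely illustrative: it shows the kind of routing logic an application ends up carrying when shards are managed by hand (the shard count, DataSource wiring, and customer_id shard key are assumptions for the example, not part of any product discussed here).
Java
// Hypothetical manual sharding: the application itself decides which shard holds a row.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class ManualShardRouter {

    private final List<DataSource> shards; // one DataSource per shard; order matters

    public ManualShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    // The sharding logic lives in application code: change the shard count and every key moves.
    private DataSource shardFor(long customerId) {
        return shards.get((int) Math.floorMod(customerId, shards.size()));
    }

    public void insertOrder(long customerId, String product) throws SQLException {
        try (Connection conn = shardFor(customerId).getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO orders (customer_id, product) VALUES (?, ?)")) {
            ps.setLong(1, customerId);
            ps.setString(2, product);
            ps.executeUpdate();
        }
        // Queries spanning many customers must fan out to every shard and re-join results in code.
    }
}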
Xpand works by slicing the data and indexes among nodes and automatically performing tasks such as data rebalancing and distributed query execution. Queries are executed in parallel to minimize lag. Data is automatically replicated to make sure that there's no single point of failure. When a node fails, Xpand rebalances the data among the surviving nodes. The same happens when a new node is added. A component called the rebalancer ensures that there are no hotspots—a challenge with manual database sharding—which occur when one node has to handle a disproportionate number of transactions while other nodes remain idle. Let's study an example. Suppose we have a database instance with some_table and a number of rows: We can divide the data into three chunks (shards): And then move each chunk of data into a separate database instance: This is what manual database sharding looks like. Distributed SQL does this automatically for you. In the case of Xpand, each shard is called a slice. Rows are sliced using a hash of a subset of the table's columns. Not only is the data sliced; indexes are also sliced and distributed among the nodes (database instances). Moreover, to maintain high availability, slices are replicated on other nodes (the number of replicas is configurable). This also happens automatically: When a new node is added to the cluster or when one node fails, Xpand automatically rebalances the data without the need for manual intervention. Here's what happens when a node is added to the previous cluster: Some rows are moved to the new node to increase the overall system capacity. Keep in mind that, although not shown in the diagram, indexes as well as replicas are also relocated and updated accordingly. A slightly more complete view (with a slightly different relocation of data) of the previous cluster is shown in this diagram: This architecture allows for nearly linear scalability. There's no need for manual intervention at the application level. To the application, the cluster looks like a single logical database. The application simply connects to the database through a load balancer (MariaDB MaxScale): When the application sends a write operation (for example, INSERT or UPDATE), the hash is calculated and the operation is sent to the correct slice. Multiple writes are sent in parallel to multiple nodes. When Not To Use Distributed SQL Sharding a database improves performance but also introduces additional overhead at the communication level between nodes. This can lead to slower performance if the database is not configured correctly or if the query router is not optimized. Distributed SQL might not be the best alternative in applications with fewer than 10K queries per second or 5K transactions per second. Also, if your database consists mostly of many small tables, then a monolithic database might perform better. Getting Started With Distributed SQL Since a distributed SQL database looks to an application as if it were one logical database, getting started is straightforward. All you need is the following: An SQL client like DBeaver, DbGate, DataGrip, or any SQL client extension for your IDE A distributed SQL database Docker makes the second part easy. For example, MariaDB publishes the mariadb/xpand-single Docker image that allows you to spin up a single-node Xpand database for evaluation, testing, and development.
To start an Xpand container, run the following command: Shell docker run --name xpand \ -d \ -p 3306:3306 \ --ulimit memlock=-1 \ mariadb/xpand-single \ --user "user" \ --passwd "password" See the Docker image documentation for details. Note: At the time of writing this article, the mariadb/xpand-single Docker image is not available on ARM architectures. On these architectures (for example Apple machines with M1 processors), use UTM to create a virtual machine (VM) and install, for example, Debian. Assign a hostname and use SSH to connect to the VM to install Docker and create the MariaDB Xpand container. Connecting to the Database Connecting to an Xpand database is the same as connecting to a MariaDB Community or Enterprise server. If you have the mariadb CLI tool installed, simply execute the following: Shell mariadb -h 127.0.0.1 -u user -p You can connect to the database using a GUI for SQL databases like DBeaver, DataGrip, or an SQL extension for your IDE (like this one for VS Code). We are going to use a free and open-source SQL client called DbGate. You can download DbGate and run it as a desktop application or since you are using Docker, you can deploy it as a web application that you can access from anywhere via a web browser (similar to the popular phpMyAdmin). Simply run the following command: Shell docker run -d --name dbgate -p 3000:3000 dbgate/dbgate Once the container starts, point your browser to http://localhost:3000/. Fill in the connection details: Click on Test and confirm that the connection is successful: Click on Save and create a new database by right-clicking on the connection in the left panel and selecting Create database. Try creating tables or importing an SQL script. If you just want to try out something, the Nation or Sakila are good example databases. Connecting From Java, JavaScript, Python, and C++ To connect to Xpand from applications you can use the MariaDB Connectors. There are many programming languages and persistence framework combinations possible. Covering this is outside of the scope of this article, but if you just want to get started and see something in action, take a look at this quick start page with code examples for Java, JavaScript, Python, and C++. The True Power of Distributed SQL In this article, we learned how to spin up a single-node Xpand for development and testing purposes as opposed to production workloads. However, the true power of a distributed SQL database is in its capability to scale not only reads (like in classic database sharding) but also writes by simply adding more nodes and letting the rebalancer optimally relocate the data. Although it is possible to deploy Xpand in a multi-node topology, the easiest way to use it in production is through SkySQL. If you want to learn more about distributed SQL and MariaDB Xpand, here's a list of useful resources: MariaDB Xpand for distributed SQL (video animation) MariaDB Xpand documentation Taking Distributed SQL to the Next Level with Columnar Indexing (talk) Getting Started With Distributed SQL (refcard)
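As a complement to the quick start page mentioned above, here is a minimal sketch of connecting to the single-node Xpand container from Java over JDBC. It assumes MariaDB Connector/J is on the classpath and reuses the host, port, user, and password from the docker run command earlier; the demo database and table are hypothetical.
Java
// Minimal JDBC sketch against the local single-node Xpand (MariaDB Connector/J assumed on the classpath).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class XpandQuickStart {

    public static void main(String[] args) throws Exception {
        // Same credentials as the docker run command above
        try (Connection conn = DriverManager.getConnection("jdbc:mariadb://127.0.0.1:3306/", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE DATABASE IF NOT EXISTS demo");
            stmt.execute("USE demo");
            stmt.execute("DROP TABLE IF EXISTS greetings");
            stmt.execute("CREATE TABLE greetings (id INT PRIMARY KEY, message VARCHAR(100))");
            stmt.execute("INSERT INTO greetings VALUES (1, 'Hello, distributed SQL')");
            try (ResultSet rs = stmt.executeQuery("SELECT message FROM greetings WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("message"));
                }
            }
        }
    }
}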
An enumerated type (enum) is a handy data type that allows us to specify a list of constants to which an object field or database column can be set. The beauty of the enums is that we can enforce data integrity by providing the enum constants in a human-readable format. As a result, it’s unsurprising that this data type is natively supported in Java and PostgreSQL. However, the conversion between Java and PostgreSQL enums doesn’t work out of the box. The JDBC API doesn’t recognize enums as a distinct data type, leaving it up to the JDBC drivers to decide how to deal with the conversion. And, usually, the drivers do nothing about it — the chicken-and-egg problem. Many solutions help you map between Java and PostgreSQL enums, but most are ORM or JDBC-specific. This means that what is suggested for Spring Data will not work for Quarkus and vice versa. In this article, I will review a generic way of handling the Java and PostgreSQL enums conversion. This approach works for plain JDBC APIs and popular ORM frameworks such as Spring Data, Hibernate, Quarkus, and Micronaut. Moreover, it’s supported by databases built on PostgreSQL, including Amazon Aurora, Google AlloyDB, and YugabyteDB. Creating Java Entity Object and Enum Assume that we have a Java entity object for a pizza order: Java public class PizzaOrder { private Integer id; private OrderStatus status; private Timestamp orderTime; // getters and setters are omitted } The status field of the object is of an enumerated type defined as follows: Java public enum OrderStatus { Ordered, Baking, Delivering, YummyInMyTummy } The application sets the status to Ordered once we order a pizza online. The status changes to Baking as soon as the chef gets to our order. Once the pizza is freshly baked, it is picked up by someone and delivered to our door - the status is then updated to Delivering. In the end, the status is set to YummyInMyTummy meaning that we enjoyed the pizza (hopefully!) Creating Database Table and Enum To persist the pizza orders in PostgreSQL, let’s create the following table that is mapped to our PizzaOrder entity class: SQL CREATE TABLE pizza_order ( id int PRIMARY KEY, status order_status NOT NULL, order_time timestamp NOT NULL DEFAULT now() ); The table comes with a custom type named order_status. The type is an enum that is defined as follows: SQL CREATE TYPE order_status AS ENUM( 'Ordered', 'Baking', 'Delivering', 'YummyInMyTummy'); The type defines constants (statuses) similar to the Java counterpart. Hitting the Conversion Issue If we connect to PostgreSQL using psql (or another SQL tool) and execute the following INSERT statement, it will complete successfully: SQL insert into pizza_order (id, status, order_time) values (1, 'Ordered', now()); The statement nicely accepts the order status (the enum data type) in a text representation - Ordered. After seeing that, we may be tempted to send a Java enum value to PostgreSQL in the String format. 
If we use the JDBC API directly, the PreparedStatement can look as follows: Java PreparedStatement statement = conn .prepareStatement("INSERT INTO pizza_order (id, status, order_time) VALUES(?,?,?)"); statement.setInt(1, 1); statement.setString(2, OrderStatus.Ordered.toString()); statement.setTimestamp(3, Timestamp.from(Instant.now())); statement.executeUpdate(); However, the statement will fail with the following exception: Java org.postgresql.util.PSQLException: ERROR: column "status" is of type order_status but expression is of type character varying Hint: You will need to rewrite or cast the expression. Position: 60 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490) at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408) Even though PostgreSQL accepts the enum text representation when an INSERT/UPDATE statement is executed directly via a psql session, it doesn’t support the conversion between the varchar (passed by Java) and our enum type. One way to fix this for the plain JDBC API is by persisting the Java enum as an object of the java.sql.Types.OTHER type: Java PreparedStatement statement = conn .prepareStatement("INSERT INTO pizza_order (id, status, order_time) VALUES(?,?,?)"); statement.setInt(1, 1); statement.setObject(2, OrderStatus.Ordered, java.sql.Types.OTHER); statement.setTimestamp(3, Timestamp.from(Instant.now())); statement.executeUpdate(); But, as I said earlier, this approach is not generic. While it works for the plain JDBC API, you need to look for another solution if you are on Spring Data, Quarkus, or another ORM. Casting Types at the Database Level The database provides a generic solution. PostgreSQL supports the cast operator that can perform a conversion between two data types automatically. So, in our case, all we need to do is to create the following operator: SQL CREATE CAST (varchar AS order_status) WITH INOUT AS IMPLICIT; The created operator will map between the varchar type (passed by the JDBC driver) and our database-level order_status enum type. The WITH INOUT AS IMPLICIT clause ensures that the cast will happen transparently and automatically for all the statements using the order_status type. Testing With Plain JDBC API After we create that cast operator in PostgreSQL, the earlier JDBC code snippet inserts an order with no issues: Java PreparedStatement statement = conn .prepareStatement("INSERT INTO pizza_order (id, status, order_time) VALUES(?,?,?)"); statement.setInt(1, 1); statement.setString(2, OrderStatus.Ordered.toString()); statement.setTimestamp(3, Timestamp.from(Instant.now())); statement.executeUpdate(); All we need is to pass the Java enum value as a String, and the driver will send it to PostgreSQL in the varchar representation that will automatically convert the varchar value to the order_status type. 
If you read the order back from the database, then you can easily reconstruct the Java-level enum from a String value: Java PreparedStatement statement = conn.prepareStatement("SELECT id, status, order_time " + "FROM pizza_order WHERE id = ?"); statement.setInt(1, 1); ResultSet resultSet = statement.executeQuery(); resultSet.next(); PizzaOrder order = new PizzaOrder(); order.setId(resultSet.getInt(1)); order.setStatus(OrderStatus.valueOf(resultSet.getString(2))); order.setOrderTime(resultSet.getTimestamp(3)); Testing With Spring Data Next, let's validate the cast operator-based approach with Spring Data. Nowadays, you're likely to use an ORM rather than the JDBC API directly. First, we need to label our PizzaOrder entity class with a few JPA and Hibernate annotations: Java @Entity public class PizzaOrder { @Id private Integer id; @Enumerated(EnumType.STRING) private OrderStatus status; @CreationTimestamp private Timestamp orderTime; // getters and setters are omitted } The @Enumerated(EnumType.STRING) annotation instructs the JPA implementation (usually Hibernate) to pass the enum value as a String to the driver. Second, we create PizzaOrderRepository and save an entity object using the Spring Data API: Java // The repository interface public interface PizzaOrderRepository extends JpaRepository<PizzaOrder, Integer> { } // The service class @Service public class PizzaOrderService { @Autowired PizzaOrderRepository repo; @Transactional public void addNewOrder(Integer id) { PizzaOrder order = new PizzaOrder(); order.setId(id); order.setStatus(OrderStatus.Ordered); repo.save(order); } ... // Somewhere in the source code pizzaService.addNewOrder(1); } When the pizzaService.addNewOrder(1) method is called somewhere in our source code, the order will be created and persisted successfully to the database. The conversion between the Java and PostgreSQL enums will occur with no issues. Lastly, if we need to read the order back from the database, we can use the JpaRepository.findById(ID id) method, which recreates the Java enum from its String representation: Java PizzaOrder order = repo.findById(orderId).get(); System.out.println("Order status: " + order.getStatus()); Testing With Quarkus How about Quarkus, which might be your framework of choice? There is no significant difference from Spring Data, since Quarkus also favours Hibernate as its JPA implementation. First, we annotate our PizzaOrder entity class with JPA and Hibernate annotations: Java @Entity(name = "pizza_order") public class PizzaOrder { @Id private Integer id; @Enumerated(EnumType.STRING) private OrderStatus status; @CreationTimestamp @Column(name = "order_time") private Timestamp orderTime; // getters and setters are omitted } Second, we introduce a PizzaOrderService that uses the EntityManager instance for database requests: Java @ApplicationScoped public class PizzaOrderService { @Inject EntityManager entityManager; @Transactional public void addNewOrder(Integer id) { PizzaOrder order = new PizzaOrder(); order.setId(id); order.setStatus(OrderStatus.Ordered); entityManager.persist(order); } ... // Somewhere in the source code pizzaService.addNewOrder(1); } When we call pizzaService.addNewOrder(1) somewhere in our application logic, Quarkus will persist the order successfully, and PostgreSQL will take care of the Java and PostgreSQL enums conversion.
Finally, to read the order back from the database, we can use the following method of the EntityManager that maps the data from the result set to the PizzaOrder entity class (including the enum field): Java PizzaOrder order = entityManager.find(PizzaOrder.class, 1); System.out.println("Order status: " + order.getStatus()); Testing With Micronaut Alright, alright, how about Micronaut? I love this framework, and you might favour it as well. The database-side cast operator is a perfect solution for Micronaut as well. To make things a little different, we won’t use Hibernate for Micronaut. Instead, we’ll rely on Micronaut’s own capabilities by using the micronaut-data-jdbc module: XML <dependency> <groupId>io.micronaut.data</groupId> <artifactId>micronaut-data-jdbc</artifactId> </dependency> // other dependencies First, let’s annotate the PizzaOrder entity: Java @MappedEntity public class PizzaOrder { @Id private Integer id; @Enumerated(EnumType.STRING) private OrderStatus status; private Timestamp orderTime; // getters and setters are omitted } Next, define PizzaRepository: Java @JdbcRepository(dialect = Dialect.POSTGRES) public interface PizzaRepository extends CrudRepository<PizzaOrder, Integer> { } And, then store a pizza order in the database by invoking the following code snippet somewhere in the application logic: Java PizzaOrder order = new PizzaOrder(); order.setId(1); order.setStatus(OrderStatus.Ordered); order.setOrderTime(Timestamp.from(Instant.now())); repository.save(order); As with Spring Data and Quarkus, Micronaut persists the object to PostgreSQL with no issues letting the database handle the conversion between the Java and PostgreSQL enum types. Finally, whenever we need to read the order back from the database, we can use the following JPA API: Java PizzaOrder order = repository.findById(id).get(); System.out.println("Order status: " + order.getStatus()); The findById(ID id) method retrieves the record from the database and recreates the PizzaOrder entity, including the PizzaOrder.status field of the enum type. Wrapping Up Nowadays, it’s highly likely that you will use Java enums in your application logic and as a result will need to persist them to a PostgreSQL database. You can use an ORM-specific solution for the conversion between Java and PostgreSQL enums, or you can take advantage of the generic approach based on the cast operator of PostgreSQL. The cast operator-based approach works for all ORMs, including Spring Data, Hibernate, Quarkus, and Micronaut, as well as popular PostgreSQL-compliant databases like Amazon Aurora, Google AlloyDB, and YugabyteDB.
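One practical note that applies to every framework shown above: OrderStatus.valueOf throws an IllegalArgumentException if the column holds a constant the Java enum does not know about, which can happen if the order_status type is extended in the database before the application is redeployed. The small helper below is a hedged sketch of making that failure mode explicit; the fallback behavior (returning an empty Optional) is an assumption, so choose whatever fits your application:
Java
// Hypothetical helper: map a database value to the Java enum with an explicit fallback.
import java.util.Optional;

public final class OrderStatusMapper {

    private OrderStatusMapper() {
    }

    // Returns an empty Optional instead of throwing when the database knows a status the code does not.
    public static Optional<OrderStatus> fromDatabase(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(OrderStatus.valueOf(value));
        } catch (IllegalArgumentException unknownConstant) {
            // Log here and decide: fail fast, map to a sentinel status, or skip the record.
            return Optional.empty();
        }
    }
}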
What Is a Distributed System? Generally, a distributed system is a set of computer programs that work together across multiple independent servers to achieve a common goal. These servers are typically commodity servers rather than mainframes, and the hardware for cross-server collaboration is mostly based on Ethernet devices or higher-end RDMA devices. Why Do We Need a Distributed System? The main reason to build a distributed system is to replace the cost of expensive hardware devices with software technology and inexpensive hardware devices. Especially in private server rooms (as opposed to public cloud or supercomputing environments), procurement costs are an important basis for business decisions. In addition to reducing costs, another benefit of distributed technology is its scalability. By adding servers to the original cluster and relying on the scheduling and distribution capabilities of the distributed system, the new servers can be used to provide additional capacity. Compared with purchasing a large batch of servers up front or buying higher-specification servers, distributed technology allows you to purchase servers on demand, which reduces the risk of over-provisioning and improves the utilization of hardware resources. Basic Problems With Distributed Systems In distributed technologies, since data storage and computation need to be implemented across multiple independent servers, a series of underlying technologies must be involved. In this article, we discuss only two problems: one is the data copy (replica) problem, and the other is how to distribute the storage and computation of large datasets across independent servers. The Problem of Data Replicas The hardware reliability and maintainability of commodity servers are much lower than those of mainframes: loose network cables, damaged hard drives, and power failures occur almost every hour in large machine rooms. Tolerating or avoiding these hardware problems is a basic problem for a distributed software system. One common solution is to replicate data on multiple servers. Once some data replicas are lost, the system can still provide services by using the remaining data replicas. What's more, when the access load of the system is too large, the system can also provide more capacity by adding more replicas. In addition, some technologies are needed to ensure that the data replicas are consistent with each other; that is, the data of each replica on different servers is the same. For graph databases, the data replica problem also exists, and the way to solve it is similar to the way it is solved in relational databases or big data systems. The Problem of Data Partitioning The disk, memory, and CPU of a single server are limited. If the data is too large, it is impossible to store all of it on a single server. Therefore, TB-level or even PB-level data must be distributed to multiple servers, and we call this process data partitioning. When a request accesses multiple data partitions, the distributed system needs to dispatch the request to each relevant partition and then combine the results. The Problem of Data Partitioning in Graph Databases: Graph Partitioning In graph databases, the distribution process is imaginatively called graph partitioning.
A big graph is partitioned into multiple small graphs, and each small graph is stored and computed on a different server. Compared with the partitioning problem in relational databases and big data systems, the graph partitioning problem deserves special attention. Let's take a look at a static graph structure, such as the CiteSeer dataset, which is a citation network of scientific papers consisting of 3,312 papers and the citations between them. It is a small-scale dataset that can be stored on a single server. The Twitter 2010 dataset is a social network of Twitter users consisting of 12.71 million vertices and 0.23 billion edges. It is relatively easy to store this dataset on a single mainstream server produced in 2022; ten years ago, however, doing so would have required a very expensive high-end server. However, the WDC (Web Data Commons) dataset consists of 1.7 billion vertices and 64 billion edges. It is difficult or impossible to store such a large-scale dataset on a current mainstream server. On the other hand, since human data is growing faster than Moore's Law and the number of connections or relationships between data points is growing exponentially faster than the data itself, data partitioning seems to be an inevitable problem for graph database systems. But that sounds no different from the way data is partitioned or hashed in mainstream distributed technologies. After all, partitioning is routine in big data systems. Wait, is it that easy to partition a graph? No, it is not. In the field of graph databases, the graph partitioning problem is a trade-off among technologies, products, and engineering. Three Problems Faced by Graph Partitioning The first problem: what should be partitioned? In big data or relational database systems, row-based or column-based partitioning is performed based on records or fields, or partitioning is performed based on data IDs, which is intuitive in terms of both semantics and technology. However, the strong connectivity of the graph data structure makes it difficult to partition graph data. One vertex may be connected to many other vertices through multiple edges, and those vertices may in turn be connected to many more vertices through their neighboring edges. It is just like web pages, which are almost all linked to each other. So, for a graph database, what should be partitioned so that the semantics remain intuitive and natural? (In an RDBMS, this is equivalent to deciding how to partition data when a table has a large number of foreign keys.) Of course, some natural semantic partitioning methods exist. For example, during the COVID-19 epidemic, the transmission chains of various strains in China and in other countries form two different network structures. This introduces a second problem. The second problem: how to ensure that the data of each partition is roughly balanced after partitioning. Naturally formed graphs follow a power law: a minority of roughly 20% of the vertices are connected to the other 80%, and these minority vertices are called super nodes or dense nodes. This means that a few vertices are associated with most of the other vertices. Therefore, the load and hotspots of the partitions containing super nodes can be expected to be much higher than those of the partitions containing the other vertices.
The above image shows the network formed by hyperlinks between websites on the Internet, where the super nodes (major websites) are clearly visible. The third problem: how to evaluate and perform repartitioning when the original partitioning method gradually becomes outdated as the graph grows and its distribution and connection patterns change. The following image shows the connections among the roughly 86 billion neurons in the human brain. With learning, exercise, sleep, and aging, neuronal connections change constantly, on a weekly timescale. The original partitioning method may not be able to keep up with such changes at all. Of course, many other details need to be considered. In this article, we try to avoid using too many technical terms. Unfortunately, there is no silver bullet for the graph partitioning problem from a technical point of view, and each product has to make its own trade-offs. Here are some examples of the trade-offs made by different products. Partitioning Methods in Different Graph Database Products 1. Distributed but Unpartitioned Neo4j 3.5 adopts the unpartitioned distributed architecture. The reason to use a distributed system here is to ensure the consistency and availability of written data across multiple replicas. That means the full graph is stored on each server, and the data size cannot exceed the capacity of a single server's memory and hard disk. Adding multiple write replicas lets the system tolerate the failure of a single server during writes, and adding multiple read replicas improves read performance (write performance is not improved). This solution avoids the three graph-partitioning problems mentioned above, and in theory, there is nothing wrong with calling it a distributed graph database. In addition, since the data is complete on each server, ACID transactions are relatively easy to implement. 2. Distributed and Partitioned by Users The distributed and partitioned-by-users architecture is typically represented by Neo4j 4.x Fabric. Depending on the business case, users can specify that a subgraph be placed on a server (or group of servers). For example, in one cluster, the subgraph of product E is placed on server E, and the subgraph of product N is placed on server N. (Of course, for the availability of the service itself, these servers can also be placed in a Causal Cluster, as mentioned above.) In this process, for both write and read operations, the user needs to specify the server or group of servers to operate on. This solution leaves the three problems mentioned above to the user to decide at the product level. Such a solution is also called a distributed graph database. Moreover, this solution can guarantee ACID transactions within server E. However, a certain number of edges connect vertices in server E with vertices on other servers, and ACID transactions on these edges cannot be guaranteed technically. For details, refer to the Neo4j 4.x Fabric architecture documentation. 3. Non-Equivalent Distributed, Partitioned, and Coarse-Grained Replicated This solution allows multiple replicas and graph data partitioning, and both processes require some user involvement. For details about TigerGraph's partitioning solution, see this YouTube video. In TigerGraph's solution, vertices and edges are scattered across multiple partitions after being encoded.
The first two of the problems mentioned above can be partially solved by encoding vertices and edges. Users can decide whether reads and computation run across the distributed system or on a single server. However, such a set of partitions must be replicated in full, identical copies (so the granularity of the scale-out is the entire graph instead of a partition), which requires a larger amount of storage space. 4. Full-Equivalent Distributed, Partitioned, and Fine-Grained Replicated There are also solutions whose architecture ranks graph scalability and resiliency as the highest priority of the overall system design. The assumption is that data is being generated faster than Moore's Law, and the interactions and relationships between data points are growing exponentially faster than the data itself; therefore, the system must be able to handle such explosive data growth and still provide services quickly. The defining characteristic of this solution is the separation of the storage layer and the computing layer, each of which can scale in a fine-grained way. Data is partitioned in the storage layer with a hash or consistent-hash scheme, where hashing is performed on vertex IDs or primary keys. This addresses the first problem. To deal with the super node and load-balancing problem (the second problem), an additional B-tree layer is introduced: it splits super nodes into multiple processing units, balances data among threads, and scales out the computing layer. For the third problem, the solution is a fine-grained partitioning method so that individual partitions can be scaled out. Naturally, this solution is also called a distributed graph database. The four solutions above make different trade-offs at the product and technical levels, each focusing on the business scenarios it suits best. Hence, all of them can be called distributed graph databases.
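To make the ID-based hashing idea concrete, here is a small, hypothetical sketch of how vertices could be assigned to partitions by hashing their IDs. It only illustrates the general scheme described above and is not the implementation of any product mentioned; it also shows why a super node's edges still pile up on a single partition unless an extra mechanism (such as the B-tree splitting described earlier) is added.
Java
// Hypothetical sketch: assign vertices to partitions by hashing their IDs.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class VertexPartitioner {

    private final int partitionCount;
    // partition index -> vertex IDs placed there (edges are usually co-located with their source vertex)
    private final Map<Integer, List<String>> placement = new ConcurrentHashMap<>();

    public VertexPartitioner(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    public int partitionFor(String vertexId) {
        // A stable hash of the vertex ID decides the partition; a consistent-hash ring
        // would reduce data movement when partitions are added or removed.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    public void place(String vertexId) {
        placement.computeIfAbsent(partitionFor(vertexId), p -> new CopyOnWriteArrayList<>()).add(vertexId);
        // All outgoing edges of a super node land on the same partition here,
        // which is exactly the hotspot problem described above.
    }
}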
This two-part series explores, from a persistence perspective, topics such as application architecture, code design, framework solutions, traps, and news of the Jakarta EE world that can help you design and architect better solutions in the upcoming years. In part one, we covered potential behaviors that can add extra processing and resource consumption to your database integration and got up to date with what's out there and what's coming for Java applications that rely on relational data storage. In part two (this article), we take into consideration the great number of storage options currently available for persisting and managing data in Java applications. They vary from relational (SQL) to NoSQL, in-memory, and graph databases, and they can be either traditional on-premises, self-managed deployments or cloud services (DBaaS, Database as a Service). With that in mind, get to know enterprise ways of handling polyglot persistence architectures for Java solutions and become aware of several roadblocks that may be slowing down both your service and your team today. Polyglot Persistence As microservices architecture adoption consolidates, so do polyglot solutions — polyglot not only from a programming language perspective but also from a storage perspective. Solutions composed of smaller independent services allow each service to adopt the type of storage that best addresses its needs. It is, therefore, vital to understand the multiple types of persistence storage solutions out there. Adding to the previous insights on relational databases, next are the topics you should know today about other Java persistence integrations. Java and NoSQL Persistence Integrating Java solutions with NoSQL databases can be painless — if you don't plan to change the DB engine. Amongst NoSQL storage options, there are multiple types: key-value, column family, document, graph, and more. For each type, the market is filled with vendors. If we pick one type as an example, we could choose between Cassandra, Redis, Amazon DynamoDB, ArangoDB, Couchbase, and many others — but wait, let's stop listing the options because, to be honest, looking at the whole NoSQL landscape, we can find more than 220 NoSQL database options! As architects, how do we handle such a plurality of APIs and drivers, and how do we ensure our persistence code layer implementation can be performant today and survive future migrations and modernization without causing too much pain? Whenever we integrate a service and a database: We must "teach" the application how to "speak the database language" — we do that by using a driver. We also must map our data model to the final database model structure — generally with annotations. We configure and fine-tune our database connections and integration through data sources and other configurations. Today, there is no finalized Jakarta specification that abstracts and creates a single way to handle this many different options. Using the vendor's driver API is the common way, and as a consequence, multiple code changes are needed if there's ever a need to migrate to another vendor's solution. We won't even drill into the pain of migrating from one database type to another. Java architects can ease this pain by using the open-source project JNoSQL.
It comes in handy by abstracting the different APIs provided by the NoSQL vendors across multiple NoSQL database types, allowing us to learn a single API and easily swap the backing storage solution if/when we need to. JNoSQL currently supports 30 different database options, including popular ones such as MongoDB, Elastic, Cassandra, Couchbase, Redis, Infinispan, Neo4j, and Hazelcast. Using it is very straightforward: once you add the JNoSQL dependency and the desired database's driver dependency, you configure the connection details, and you're ready to start coding. The entities should look very familiar to you, as you'll be using annotations such as @Entity, @Id, and @Column. The good news: Jakarta EE already has a specification approved and under development that should soon become part of the platform: Jakarta NoSQL. The specification should cover not only common APIs but also APIs for pagination, bean validation backed by the Bean Validation specification, configuration via MicroProfile Config, and much more. SQL and NoSQL Integration Disparities: Is There a Way Out? We understand the excitement that can come with the possibilities offered by specifications such as Jakarta NoSQL. Still, architects might think: "I still have the problem of having to learn multiple APIs for SQL and NoSQL, and I still have different ways of handling these two strategies, with all the previously mentioned problems that come with such disparities." Well, here's more good news: the working groups of the Jakarta EE platform, with the help of the broader Java community, are planning to address this problem with the Jakarta Data specification. It aims to offer an abstraction for data integration, be it SQL, NoSQL, or even cloud-based data services. The Jakarta Data project is approved by the Eclipse Foundation and is currently in progress, planned to be an Apache-licensed specification for future Jakarta EE versions. If you're a Java developer, architect, and/or enthusiast and love to contribute to open source, this is a great chance: the whole community is welcome to be part of the Jakarta EE story by contributing to several specifications, including young and promising ones such as Jakarta Data. More Roadblocks to Be Cautious With After getting informed on relational and NoSQL data storage with Java applications, there's yet another persistence storage approach to be covered: in-memory storage. To be more precise, let's discuss persistence issues and challenges that every SQL/NoSQL-based Java service has to live with, which are addressed by Java-native in-memory storage solutions. Model Mapping: Object Domain Model and Database Model Data modeling is the process of identifying entities and their relationships. In relational databases, data is placed in normalized tables with foreign keys used to reference related data in other tables. Whenever we integrate with a database, we have to map our domain classes to database tables (or other models). The example below shows a domain mapping of the classes Person and Address to their respective tables in a relational database. Since it's mapped as a many-to-many relation, we end up with one extra table, in this case PERSON_ADDRESS (a minimal JPA sketch of this mapping appears just below): On the application side, we rely on the persistence configuration of the classes to map them as best as possible to the relational database model. Unfortunately, capabilities such as inheritance, polymorphism, and encapsulation, to mention a few, come from the object-oriented programming paradigm; they are not supported and have no equivalent representation in the relational paradigm. The same is valid the other way around: Java cannot perfectly map every behavior available within database models and systems.
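Here is the promised sketch of the Person and Address mapping using standard JPA annotations. It is a minimal illustration of the many-to-many case described above; the fields, column names, and join-table layout are assumptions for the example rather than a prescribed model:
Java
// Minimal JPA sketch of the Person/Address many-to-many mapping described above.
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.JoinTable;
import jakarta.persistence.ManyToMany;

import java.util.HashSet;
import java.util.Set;

@Entity
public class Person {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    // The extra PERSON_ADDRESS table exists only to hold the relationship
    @ManyToMany
    @JoinTable(name = "PERSON_ADDRESS",
            joinColumns = @JoinColumn(name = "person_id"),
            inverseJoinColumns = @JoinColumn(name = "address_id"))
    private Set<Address> addresses = new HashSet<>();

    // getters and setters are omitted
}

@Entity
class Address {

    @Id
    @GeneratedValue
    private Long id;

    private String street;
    private String city;

    // getters and setters are omitted
}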
Designing these data models requires good knowledge of both worlds and is a time-consuming task - especially when heavy documentation practices are involved. Evolving these models can also be tricky, since a new application version deployment must be accompanied by the corresponding database model and/or data updates. Database Operations and the Hidden Extra Processing The object-relational impedance mismatch is a well-known problem in the area of OOP. It arises because the object model and the relational model of data are not directly compatible. As a result, when data is mapped from one model to the other, some information is lost in the process. This loss of information can lead to performance degradation, as the data in the object model is not consistent with the data in the relational model. Now, notice that every object needs proper conversion to its equivalent model in the database every time an operation happens. We can then see that data conversion happens on every read and write; in other words, we have extra processing and added latency due to the model conversions: from the Java domain model to the database model and vice versa. Retrieving Data Nightmare: Complex Multi-Table Queries If you have ever come across a long and complex query, one where, to get the information you need, you must (right, left, inner, outer...) join multiple tables, group by, filter by, and so on, you know exactly how much time, pain, and trouble comes with it. What if we could use the Java Streams API instead? What if retrieving data was something as straightforward as the code below: Java public static void booksByAuthor() { final Map<Author, List<Book>> booksByAuthor = AcmeLibrary.data().books().stream() .collect(groupingBy(book -> book.author())); booksByAuthor.entrySet().forEach(e -> { System.out.println(e.getKey().name()); e.getValue().forEach(book -> { System.out.print('\t'); System.out.println(book.title()); }); }); } Our jobs would be way easier, right? How to Get Rid of the Roadblocks To summarize, the performance villains hiding within our applications are: Domain mapping: mapping our Java domain model to database models (either SQL or NoSQL) is time-consuming and error-prone. It is also susceptible to adding complexity to maintenance as the model grows and evolves. Object-relational impedance mismatch: by using different paradigms in the programming language and the database, we go down the path of having extra processing for conversions on every database operation. Complex queries: even with the support of ORMs, we end up having to use complex queries with multiple joins and tables to retrieve the data we need. Such queries are also harder to maintain and track. To enhance a database layer's performance, improve the development experience, and reduce costs, we can rely on in-memory persistence storage solutions. To better explain this approach, we'll reference the technology called MicroStream. MicroStream is an open-source project, and its latest version is 7.1. Its integration with CDI allows it to work smoothly with multiple runtime solutions such as Helidon (which officially supports MicroStream), Payara Micro, and WildFly.
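To give a feel for the programming model before going through the details, here is a minimal, hedged sketch using MicroStream's plain embedded storage API outside of a CDI runtime. The class and method names follow the project's documented API as far as we recall, but package names can differ between versions, so treat this as illustrative only:
Java
// Illustrative sketch of MicroStream's embedded storage API (verify package names for your version).
import one.microstream.storage.embedded.types.EmbeddedStorage;
import one.microstream.storage.embedded.types.EmbeddedStorageManager;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LibraryStore {

    // The root of the object graph we want to persist
    public static class LibraryRoot {
        final List<String> bookTitles = new ArrayList<>();
    }

    public static void main(String[] args) {
        LibraryRoot root = new LibraryRoot();

        // Start the storage manager, pointing it at a local directory and the graph root
        EmbeddedStorageManager storage = EmbeddedStorage.start(root, Paths.get("library-data"));
        try {
            root.bookTitles.add("Java Concurrency in Practice");

            // Store the changed collection; no table mapping and no SQL involved
            storage.store(root.bookTitles);

            // Reading is plain Java: the graph is simply reachable from the stored root
            LibraryRoot loaded = (LibraryRoot) storage.root();
            loaded.bookTitles.forEach(System.out::println);
        } finally {
            storage.shutdown();
        }
    }
}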
Some interesting facts about MicroStream: It's a Java-native storage solution, and it runs on a JVM. You can run this storage solution wherever you can run a JVM. Since it runs on a JVM, it also brings a garbage collector for storage. Save time by skipping data mapping. There's no need to map your class attributes to database tables and columns. The data is stored as an object graph. Any Java type can be persisted without pain. Java public class Book { private final String id; private final String title; private final String author; private final Year release; ... } Data manipulation: Instead of having the traditional four CRUD operations, you can rely on just two APIs that allow you to read and save the object graph. Java @Storage public class Library { private final Set<Book> books; public Library() { books = new HashSet<>(); } public void add(Book book){ this.books.add(book); } public List<Book> getBooks() { return this.books.stream().collect(Collectors.toUnmodifiableList()); } } Retrieving data couldn't be easier: Instead of complex SQL queries, since we are integrating with Java-based storage, we can rely on our good old get methods. Even better: we can use the powerful Java Streams API to filter and group data as needed. Choose where to store your data: "Run away from vendor lock-in" is a recommendation you've probably heard several times. It is also valid for your chosen persistence storage. When using an in-memory persistence strategy, you will need to gauge how much memory you'll have available to load a data set. It's totally possible that you have such a large data set that you'll need an external persistence solution working along with your in-memory storage. If using MicroStream, you have a large set of options to use as an external storage target. By using internal connectors, it abstracts this integration for the developer and allows the application to use in-memory persistence backed by multiple types of storage solutions such as RDBMS, NoSQL, object stores, distributed caches, and more. Other than the technical benefits we've pointed out so far, a Java in-memory data storage solution such as MicroStream can be quite simple to get started with, and over time an architectural solution that follows this approach can be highly cost-saving compared to traditional relational database clusters or even DBaaS offerings. To learn more about this technology and see it in action, check out the following video: Conclusion Java is a powerful programming language that can be used to create high-performance applications. When working with databases, there are a number of factors that can affect both a solution's performance and its architectural design quality. In this two-part series, we learned the following: What's new in the recent Jakarta EE 10 release with respect to persistence with JPA. How data-related design patterns and persistence framework implementation choices can impact our code design positively or negatively. How we can plan polyglot Java solutions that handle not only traditional relational databases but also NoSQL databases, weighing the challenges of the current vendor-specific path against the Jakarta EE specification under development, Jakarta NoSQL. The advantages of in-memory database solutions compared to storage paradigms that can't match OOP, and how such solutions can help you optimize the performance of your application.
With the proper information and guidance on how to analyze, architects can then make strategic and solid implementation decisions for their solution's persistence layer and avoid future pain for the whole organization.
This is the first of a three-part app development workshop series designed to help developers understand technologies like Node.js, GraphQL, React, Netlify, and JavaScript to kickstart their app development portfolio. In this post, we'll cover the fundamental concepts of website applications and introduce DataStax Astra DB as your free, fast, always-on database based on Apache Cassandra®. In the U.S. we spend almost 88% of our mobile internet time buried in apps like Facebook, Instagram, TikTok, and games. With nearly a million new apps released each year and 218 billion app downloads in 2020, learning how to build them is an essential step in any front-end developer's career. So how can you start building apps? As a modern app developer, you're spoiled for choice with multi-cloud platforms and open-source frameworks that do all the heavy lifting for you. If you learn the basics and leverage the right tools, you can go from developing simple to-do lists to a full-stack Netflix clone — which is precisely what we'll be covering in our application development workshop series. This first workshop is about building a website application to put you on the right track. You'll learn the fundamentals of web apps, become familiar with popular languages (like JavaScript), and then get a pre-coded to-do list app that you'll learn how to modify and deploy. It'll take you less than three hours to complete, and all you need is a web browser and an internet connection. Let's get started. The Basics of Web App Development A website application (or web app) is software that runs on a remote server and can be accessed using a web browser on any device. Think of apps like Slack, Gmail, Google Docs, and Twitter, for example. Unlike native apps, which are installed directly on a device and designed for specific operating systems, web apps can reuse the same code and adapt their interface to any mobile device, laptop, or tablet. While native apps perform much faster, they're also expensive to develop because you need to learn different languages and frameworks for each platform or device. Web apps, on the other hand, can be "built once and reused forever," and are vastly simpler to launch, manage, and iterate on. When developing a web app, you'll typically use the following architecture: Figure 1: Architecture of a typical website application. The client is your device's web browser, which you use to access the app. The front end is what the user interacts with, and is basically a web server that relies on three main technologies: HTML, CSS, and JavaScript. The backend is what interacts with the web server and holds the database drivers, APIs, and microservices that bridge the front end with the data layer. The data layer is where you store the states of your app. This can be in a filesystem, external storage, document-oriented software, or a database. A popular database you might already be familiar with is Apache Cassandra®, which powers high-performing applications for thousands of companies including Hulu, Netflix, Spotify, and Apple. While this free, open-source database is known for its high availability, scalability, and resilience, the downside is that it's also notoriously complex to set up and manage. This complexity is an ongoing theme when working with the data layer. As a developer, you need just four operations to interact with your database: create, read, update, and delete. But to execute them, you first have to learn the language for the database you're using.
For example, you’d need Cassandra Query Language (CQL) to interact with a Cassandra database. What’s more, you’d also need to learn how to design the data model (schema) so you can properly format and organize the data stored in your database. It’s a steep learning curve for a new app developer, which is why DataStax built Astra DB, the serverless Cassandra-as-a-Service, as part of our mission to connect every developer to the power of Cassandra.

Simplify Cassandra App Development With Astra DB

Astra DB is the simplest way for developers to back their applications with cloud-native Cassandra, allowing them to:

- Deploy modern apps faster without the complexities of installing, managing, or scaling Cassandra.
- Interact with your database using multiple APIs to avoid data modelling.
- Bake in security from the ground up with encrypted data and JSON web token (JWT) based authentication for secure communication with your database.
- Choose whether to deploy your Astra DB on Azure, GCP, or AWS.
- Autoscale database resources based on traffic and app requirements so you only pay for what you use.

As a developer, you can create an account on Astra DB for free. You get 3.5 million queries and 40 GB each month, and when you’re ready to scale, you can simply pay as you go. Astra DB doesn’t just give you a database; it also gives you a collection of APIs that you can use to interact with your database out of the box. This API layer that sits between your app and your Astra DB is called Stargate. Stargate helps you streamline app development with Cassandra using popular APIs:

- REST provides a common gateway so you can access your data using a familiar interface through RESTful APIs for any Cassandra database.
- Document (JSON) is a documents API that lets you save and search schemaless JSON documents in Cassandra (with no data modelling needed).
- GraphQL is a query language for APIs that lets you insert, query, and change data within the database (without writing any CQL).

All of this means fewer languages and frameworks for you to learn and a shorter time to launch. Think of Astra DB as your own personal Samwise Gamgee, with all kinds of useful tools in its bag to make your app development journey easier.

Build Your First Web App With JS, Node.js, and Astra DB

There’s no better exercise to get you started as an app developer than building a to-do list. The concepts and techniques you’ll learn in our first workshop are fundamental to building any web app and will put you on track to creating more advanced apps yourself. So, in this workshop you’ll learn how to:

- Set up a free Astra DB instance
- Run the web-based IDE Gitpod
- Interact with the database using APIs
- Build a to-do list app using React

You don’t have to install anything, and the technologies involved are all open source and free to use. Aside from Astra DB, here’s a quick intro to the other technologies we’ll cover in this first workshop:

- Gitpod is a cloud-based integrated development environment (IDE) powered by VS Code that lets you create ready-made dev environments and edit your code directly in your browser.
- Node.js is a runtime environment that executes JavaScript outside of a browser to let you generate dynamic page content and interact with the data in your database.
- React is a JavaScript library for creating interactive user interfaces that easily adapt across most browsers and devices.
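To make the Document (JSON) API described above a bit more tangible, here is a minimal TypeScript sketch of creating and reading a to-do item over HTTP. This is not the workshop's code: the keyspace and collection names are made up, the endpoint shape and header names are illustrative and should be checked against your own Astra DB dashboard and the Stargate documentation, and the snippet assumes Node 18+ so that the global fetch API is available.

TypeScript
// Minimal sketch: create and read a schemaless to-do document through the
// Document API. All values below (database ID, region, keyspace, collection,
// token) are placeholders; take the real ones from your Astra DB dashboard.
const ASTRA_DB_ID = process.env.ASTRA_DB_ID!;
const ASTRA_DB_REGION = process.env.ASTRA_DB_REGION!;
const ASTRA_DB_TOKEN = process.env.ASTRA_DB_APPLICATION_TOKEN!;
const NAMESPACE = "todos_keyspace";   // hypothetical keyspace name
const COLLECTION = "todo_items";      // hypothetical collection name

const baseUrl =
  `https://${ASTRA_DB_ID}-${ASTRA_DB_REGION}.apps.astra.datastax.com` +
  `/api/rest/v2/namespaces/${NAMESPACE}/collections/${COLLECTION}`;

// Create a JSON document: no CQL and no upfront schema design.
async function createTodo(title: string): Promise<string> {
  const response = await fetch(baseUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Cassandra-Token": ASTRA_DB_TOKEN,
    },
    body: JSON.stringify({ title, completed: false }),
  });
  const { documentId } = await response.json();
  return documentId;
}

// Read the document back by its generated ID.
async function getTodo(documentId: string): Promise<unknown> {
  const response = await fetch(`${baseUrl}/${documentId}`, {
    headers: { "X-Cassandra-Token": ASTRA_DB_TOKEN },
  });
  return response.json();
}

createTodo("Finish the Astra DB workshop").then(getTodo).then(console.log);

Notice that the request and response bodies are plain JSON: there is no CQL and no table definition anywhere in the snippet, which is exactly the point of the Document API.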
To give you the bigger picture, here’s the architecture we’ll be following to build the to-do list app.

To get started, simply sign up for your free Astra DB account, then head over to the workshop video on YouTube. For the source code, slides, exercises, and step-by-step guide, go to our DataStax Developers repo on GitHub. When you’ve completed your first app and are ready for the next step in your app development journey, dig into our second workshop on building a TikTok clone!
In this fast-paced world, applications are expected to disseminate information in real time. A few examples are tracking cargo movements, keeping a tab on stock prices, analyzing data from an ongoing football match, etc. Moreover, people use all kinds of devices, such as desktops, mobile phones, and tablets, to access these applications. Also, some applications require this information and these alerts to be actionable; that is, a user can update one or more attributes. For example, a cargo handling facility can update the status of cargo by performing an arrival scan. So, in essence, we are looking at an interactive real-time dashboard that can be accessed from multiple devices. In this article, I’ll discuss how this kind of dashboard can be built using AWS AppSync. I’ll take a common use case and create a reference solution architecture for it.

A quick introduction to AWS AppSync for those who are new to this service: AppSync is a serverless GraphQL interface that is robust and scalable. GraphQL is a data language that was developed to enable apps to fetch data from servers. It has a declarative, self-documenting style. In a GraphQL operation, the client specifies how to structure the data when it is returned by the server. This makes it possible for the client to query only for the data it needs, in the format that it needs it in.

The Use Case

Let’s take the case of an internet service provider. They operate in a large geographical area. They need a real-time dashboard that displays all active service requests. An administrator of a region would have a view of the entire region and should be able to assign open tickets to service engineers based on their location. Service engineers would access the dashboard from their mobile devices and should be able to update the status of the tickets to "in-progress," "complete," etc. The administrator should be able to view the status updates in real time. Another scenario that needs to be considered is that field engineers often have poor or no internet connectivity. They should be able to update the status of their tickets even while offline, and once they are connected, the data should sync back to the server automatically.

The Solution

The major component of this solution is an AppSync API. AppSync can connect to multiple sources and can have custom resolvers as well. However, in this example, the API uses DynamoDB as its data store. An AppSync API supports three kinds of operations: mutations, subscriptions, and queries. Mutations are used to insert or update records. Subscriptions complement mutations: all users of this application who are subscribed to the API will receive updates whenever there is a mutation. It’s worth mentioning here that a simple insert or update in the DynamoDB database will not send any notifications back to the subscribers; only mutations that go through the AppSync API will flow through to the subscribers. A query, on the other hand, is a simple GraphQL query on the API.

Now, let’s assume that there is a system to raise service requests that puts the requests onto an EventBridge bus. Once there is a message on the bus, a Lambda function is triggered that uses the mutation operation to push the message onto the AppSync API, as sketched below. AppSync receives the message and persists it in the backing DynamoDB data store. At this point, the status of the ticket is "open." The UI of this application can be built using any standard JavaScript framework.
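As a rough illustration of that EventBridge-to-AppSync step, here is a minimal TypeScript sketch of such a Lambda handler. The GraphQL schema (a createTicket mutation and its fields), the environment variable names, and the API-key authorization mode are assumptions made for this sketch; the scenario in this article would more likely use Cognito- or IAM-based authorization, as discussed later.

TypeScript
// Hypothetical Lambda: triggered by an EventBridge event, it pushes the new
// service request into AppSync as a GraphQL mutation so that subscribers
// are notified. Writing straight to DynamoDB would not notify anyone.
import type { EventBridgeEvent } from "aws-lambda";

const APPSYNC_URL = process.env.APPSYNC_GRAPHQL_URL!;   // AppSync GraphQL endpoint
const APPSYNC_API_KEY = process.env.APPSYNC_API_KEY!;

// Hypothetical mutation matching a Ticket type in the GraphQL schema.
const CREATE_TICKET = `
  mutation CreateTicket($input: CreateTicketInput!) {
    createTicket(input: $input) {
      id
      county
      status
      assignedTo
    }
  }
`;

interface ServiceRequestDetail {
  ticketId: string;
  county: string;
  description: string;
}

export const handler = async (
  event: EventBridgeEvent<"ServiceRequestRaised", ServiceRequestDetail>
): Promise<void> => {
  const { ticketId, county, description } = event.detail;

  // Plain GraphQL-over-HTTP call to the AppSync endpoint (Node 18+ fetch).
  const response = await fetch(APPSYNC_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": APPSYNC_API_KEY,
    },
    body: JSON.stringify({
      query: CREATE_TICKET,
      variables: { input: { id: ticketId, county, description, status: "Open" } },
    }),
  });

  const { errors } = await response.json();
  if (errors) {
    throw new Error(`AppSync mutation failed: ${JSON.stringify(errors)}`);
  }
};

The detail that matters is that the write goes through an AppSync mutation rather than directly into DynamoDB; that is what causes the corresponding subscriptions to fire.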
The UI would be hosted on AWS S3 and distributed using a CloudFront distribution. A logged-in user of this application would subscribe to the AppSync API based on certain parameters. For example, an administrator of Cook County would probably subscribe using the criteria county: "Cook", status: "Open" to list all open tickets in the county. The administrator can then assign an open ticket to a field engineer. This would result in another mutation operation on the AppSync API, and the status of the ticket would be changed to “assigned.” A field engineer logged in to the application and subscribed with the criteria assignedTo: "<login-id>", status: "Assigned" would be able to see the list of tickets that he needs to work on. Once he completes working on a ticket, he would change the status of the ticket to “complete.” This would result in another mutation that updates the record. Everyone interested and subscribed to receive updates on this ticket would immediately be notified that work on the ticket is complete. It’s worth mentioning here that AppSync works in offline mode as well. So, if the field engineer goes offline for some reason, his updates will be queued on the client side. Whenever he is back online, all pending mutations will go through to the backend. This is how the interactive dashboard works, based on mutations and subscriptions on the AppSync API.

Next, I’ll discuss a little bit about authentication and authorization. In this scenario, the best practice would be to use the corporate Active Directory, assuming that one exists, to handle authentication. Cognito can be set up to federate with the corporate AD, and AppSync connects with Cognito natively. In this case, whenever a user logs in to the application, the application will subscribe to the AppSync API with the corresponding user ID and will automatically filter results based on location, role, status, etc. Other authorization methods like a custom Lambda authorizer, OpenID Connect, API keys, etc. can also be used. Another way to secure an AppSync API is to configure a web application firewall (AWS WAF) on top of the API to protect it from common web exploits like cross-site scripting and SQL injection. WAF natively integrates with the AppSync API. It is highly recommended to have WAF configured for all public-facing AppSync APIs.

Conclusion

AWS AppSync is a versatile service that can be utilized to design different kinds of applications. I’ve discussed how easily we can set up an interactive real-time dashboard. AppSync can also be used to create WebSocket handshakes between a client and a server to design real-time chat applications. It can also be used simply as a backend service API or to aggregate data from multiple sources like SQL, NoSQL, microservices, etc. AWS AppSync could soon become an integral part of a number of digital products and applications.
When dealing with Spring Data JPA or Spring Data R2DBC, we all know that everything revolves around entities. You usually create a repository, and in most cases, it either extends JpaRepository<T, ID> for Spring Data JPA or R2dbcRepository<T, ID> for Spring Data R2DBC, and then you’re ready to talk to the database. With the repository in place, things are pretty straightforward: you either use the standard already-provided interface methods, write derived-query methods, or use the @Query annotation to write custom queries. Up until here, everything’s nice. As I said, everything revolves around entities (with minor exceptions: you can get a primitive return type, like an Int or a String, with a custom query), so as a response, you’ll always get an Entity/List<Entity> for JPA and a Mono<Entity>/Flux<Entity> for R2DBC. However, there are certain scenarios when you either don’t want to fetch the entire entity because some columns are redundant or not needed, or, even more, you need to construct a POJO from something completely different from your entity, yet still somehow related. Think about a group by, some counts, or some mapping at the database level; by default, you can’t do that – your repos are bound to entities. That is when projections and DTOs come into play. For Spring Data JPA, there is already a fair amount of documentation and info around (the official documentation, Vlad Mihalcea, Thorben Janssen, and so on). This is not the case when it comes to Spring Data R2DBC (we still have amazing documentation here). So I decided to write a little something about projections and DTOs in Spring Data R2DBC.

Before getting to the code, let’s examine the app. I have a review service that is centered around a single entity, Review; basically, it exposes a CRUD API for it. The app is written in Kotlin with Spring Boot, Spring Data R2DBC, and Spring for GraphQL. Here’s the entity:

Kotlin @Table("reviews") data class Review( @Id var id: Int? = null, var text: String, var author: String, @Column("created_at") @CreatedDate var createdAt: LocalDateTime? = null, @LastModifiedDate @Column("last_modified_at") var lastModifiedAt: LocalDateTime? = null, @Column("course_id") var courseId: Int? = null )
And here is its repository:

Kotlin @Repository interface ReviewRepository : R2dbcRepository<Review, Int> { @Query("select * from reviews r where date(r.created_at) = :date") fun findAllByCreatedAt(date: LocalDate): Flux<Review> fun findAllByAuthor(author: String): Flux<Review> fun findAllByCreatedAtBetween(startDateTime: LocalDateTime, endDateTime: LocalDateTime): Flux<Review> }

Also, I have written some tests for us, so we can see the result of the new things that we are going to write right away:

Kotlin @RepositoryIntegrationTest class ReviewRepositoryIntegrationTest : AbstractTestcontainersIntegrationTest() { @Autowired lateinit var reviewRepository: ReviewRepository @Test @RunSql(["/data/reviews.sql"]) fun findAllByAuthor() { StepVerifier.create(reviewRepository.findAllByAuthor("Anonymous")) .expectNextCount(3) .verifyComplete() } @Test @RunSql(["/data/reviews.sql"]) fun findAllByCreatedAt() { StepVerifier.create(reviewRepository.findAllByCreatedAt(LocalDate.parse("2022-11-14"))) .expectNextCount(1) .verifyComplete() } @Test @RunSql(["/data/reviews.sql"]) fun findAllByCreatedAtBetween() { StepVerifier.create( reviewRepository.findAllByCreatedAtBetween( LocalDateTime.parse("2022-11-14T00:08:54.266024"), LocalDateTime.parse("2022-11-17T00:08:56.902252") ) ) .expectNextCount(4) .verifyComplete() } }

For the integration tests, I am using the Testcontainers library. If you are curious to find out more about these kinds of tests, make sure to check my article "Testcontainers With Kotlin and Spring Data R2DBC."

Projections in Action

Now let’s say that, for a specific use case, I want to fetch only the text of the reviews by some author. This is a perfect example for a projection. As previously mentioned, Spring Data query methods return instances of the aggregate root (entity) managed by the repository. Whenever we want to filter out some of the columns from the fetched entities, we can model a projection based on certain required attributes of that entity. Let’s create a class-based projection for this use case:

Kotlin data class TextProjection(val text: String)

And here is how we can use it:

Kotlin @Query("select text from reviews where author = :author") fun findAllTextProjectionsByAuthor(author: String): Flux<TextProjection>

Alright, let’s test it out:

Kotlin @Test @RunSql(["/data/reviews.sql"]) fun findAllTextProjectionsByAuthor() { StepVerifier.create(reviewRepository.findAllTextProjectionsByAuthor("Sponge Bob")) .expectNextMatches { it.text == "Good, loved it!" } .verifyComplete() }

Nice! Let’s take a look at another example: what if I want to fetch only the courseId and the text concatenated with the author (“text - author”)? For this use case, I am going to use an interface-based projection.

Kotlin interface AuthoredTextProjection { fun getCourseId(): Int? @Value("#{target.text + ' - ' + target.author}") fun getAuthoredText(): String? }

Note the usage of the @Value annotation: accessor methods in interface-based projections can be used to compute new values based on the target entity. In our case, that is the concatenation of the review’s text and author via a dash. The root entity that is backing the interface projection is available in @Value’s SpEL expression via target. Using @Value on custom interface methods to compute new values creates an open interface-based projection, while having only method names that exactly match the getters of the root entity’s attributes creates a closed interface-based projection.
Let’s use our new open interface-based projection:

Kotlin @Query("select course_id, \"text\", author from reviews") fun findAllAuthoredTextProjections(): Flux<AuthoredTextProjection>

And here’s the test:

Kotlin @Test @RunSql(["/data/reviews.sql"]) fun findAllAuthoredTextProjections() { StepVerifier.create(reviewRepository.findAllAuthoredTextProjections()) .expectNextMatches { it.getAuthoredText() == "Amazing, loved it! - Anonymous" && it.getCourseId() == 3 } .expectNextMatches { it.getAuthoredText() == "Great, loved it! - Anonymous" && it.getCourseId() == 3 } .expectNextMatches { it.getAuthoredText() == "Good, loved it! - Sponge Bob" && it.getCourseId() == 3 } .expectNextMatches { it.getAuthoredText() == "Nice, loved it! - Anonymous" && it.getCourseId() == 3 } .verifyComplete() }

Great - it's green. With projections, everything’s clear, but let’s consider another use case. What if I want to fetch a grouping of the number of reviews per author across all my records? The result of such a query will have a set of attributes that is somewhat different from the target entity’s fields. Now, that’s obvious: projections can’t solve such a problem, since they’re mostly based on the target’s attributes.

DTO Mapping in Action

We can try to map our result set using a DTO that’ll look something like this:

Kotlin data class ReviewsCountPerAuthorView(val author: String?, val numberOfReviews: Int?)

And here’s the new repository method: nothing complicated – a count of review IDs and a group by author.

Kotlin @Query("select author, count(id) as \"numberOfReviews\" from reviews group by author") fun countReviewsPerAuthor(): Flux<ReviewsCountPerAuthorView>

Here’s the test:

Kotlin @Test @RunSql(["/data/reviews.sql"]) fun countReviewsPerAuthor() { StepVerifier.create(reviewRepository.countReviewsPerAuthor()) .expectNextMatches { it == ReviewsCountPerAuthorView("Anonymous", 3) } .expectNextMatches { it == ReviewsCountPerAuthorView("Sponge Bob", 1) } .verifyComplete() }

If we try to run it, the test will fail with:

expectation "expectNextMatches" failed (predicate failed on value: ReviewsCountPerAuthorView(author=Anonymous, numberOfReviews=null))

Why is that? The strange thing is that only the author field was correctly mapped, and numberOfReviews is null. Actually, it is not that strange: as previously mentioned, by default, every Spring Data query method will return instance(s) of the repository’s defined root aggregate, and it’ll try to map the result set to the root entity. As mentioned before, projections are backed by a target, which is the root entity. So in this case, our custom DTO was treated as a potential class-based projection with a matching field named author, and that is why only the author field got its value. Okay, we understand the problem, but how do we teach Spring Data R2DBC that this is not a projection, but a custom DTO with custom values that need to be wired in? We define a custom converter using @ReadingConverter, which will map the row’s returned values into our custom DTO:

Kotlin @ReadingConverter class ReviewsCountPerAuthorViewReadConverter : Converter<Row, ReviewsCountPerAuthorView> { override fun convert(source: Row): ReviewsCountPerAuthorView { return ReviewsCountPerAuthorView(source.get("author", String::class.java), source.get("numberOfReviews", String::class.java)?.toInt()) } }

Now the little thing that remains to be done is to register it by declaring the following @Bean:

Kotlin @Bean fun r2dbcCustomConversions(connectionFactory: ConnectionFactory): R2dbcCustomConversions?
{ val dialect = DialectResolver.getDialect(connectionFactory) return R2dbcCustomConversions.of(dialect, listOf(ReviewRepository.ReviewsCountPerAuthorViewReadConverter())) }

Now if we run our test, it is green, and we have successfully mapped our result set to a custom DTO. As you might’ve imagined, if there is a @ReadingConverter, then there’s a @WritingConverter, too, right? You’re correct. SELECT data can be extracted from R2DBC’s Row into a custom POJO/entity via a @ReadingConverter, while INSERT/UPDATE data can be mapped to an OutboundRow via a @WritingConverter. Usually, you want to use these converters for writing/reading embedded objects of an entity to/from the actual table’s rows.

Conclusion

We’ve seen how projections can make our life a little easier when it comes to filtering out data from an entity. We also analyzed how converters (especially the @ReadingConverter) can aid in mapping custom query data to a custom DTO. The code is here. Happy coding!
"Is the application scalable?" asked the manager. It was almost halfway through the weekly system roadmap meeting. I thought we were done with all the tech parts and would soon move to business-related questions. The architect of my team cleared his throat and answered, "It can easily handle double the usual load." The manager gave a long look as if he didn't believe the answer. But, in the absence of any better points, he continued with further updates. I implicitly believed that our system is indeed scalable. After all, the architect had announced it in clear terms. A year down the line, I learned the truth. One fine day, the system crashed due to excessive load. The bank had launched a new loan product for the holiday season. The business strategy paid off and there was a surge in demand. Since our application provided the core data to make the credit decision, it was swamped with 5 times more incoming requests from several other applications. We added more CPUs to handle the load on our service. The same strategy had worked the last couple of times. However, this time it didn't. Our system was scalable until it no longer was. Upon analysis, we found that the issue wasn't entirely in our application code. Our application was made up of multiple services that connected with the bank's central database. The connection to the database was made using a connection pool. Our fault was that we were careless about the number of open connections. Turns out, the database administrations were not. The DBAs had a hard block on the number of open connections for our group of applications. When the incoming requests to our application surged, we ended up acquiring even more connections until we hit the wall. No matter how many additional CPUs we added to increase the application capacity, it didn't make any difference. To add insult to injury, the DBAs refused to increase the permissible number of connections. The only option left was a costly change to our application to handle connections properly. For all practical purposes, our system was no longer scalable. Of course, it was our mistake. We should have accounted for the limited number of connections and used them wisely. Nevertheless, the problem was not unfixable. But the change wasn't cost-effective. Though the situation was ultimately handled, it piqued my interest in the subject. We thought the system was scalable. But it wasn't. At least not until we hit the connection limit. It was clear we were wrong about the true meaning of scalability. I decided to dig deeper into the whole situation. My initial motivation was to avoid getting into the same situation as my team's architect The Usual Definition of System Scalability Scalability is one of the fundamental challenges in building distributed systems. While studying scalability for distributed systems, the most common definition I have come across is as follows: Scalability is the ability of the system to handle an increased workload. How does this definition apply in real life? Developers of an actively used system monitor its workload levels. They create processes for predicting when the performance of the system will become unsatisfactory. The goal is to increase the system's capacity before the danger level is reached. In this approach to scalability, the primary focus is to determine an interval of demand within which the system will perform at an acceptable level. If this interval is sufficiently large, the system is considered scalable. In my view, this definition is pretty limited. 
The focus of this definition is always on the workload. The main concern of this definition is that a system is scalable if it continues to perform adequately as the workload grows. No attention is given to how a system will have to be modified to keep the performance at the same level. As we found in our case, our system could have continued performing at a good level. But the changes required on the database side were not acceptable. Neither was changing the code cost-effective. The common scalability definition pays no attention to the approach of capacity increase or its overall impact on the system. We don't ask important questions such as: Will doubling the number of processors allow the system to handle double the workload? Is the overhead of coordinating work between processors going to be prohibitively high? In defense of the definition, it is the most common definition used by system designers and developers. We hear it all the time when someone says that a system is scalable and can handle double the workload with no problems.

Scalability by Extension

In my quest to understand scalability, I came across another definition of scalability that often gets neglected. However, I have come to believe that it may be far more important in practical situations: Scalability is the ability to handle an increased workload by repeatedly applying a cost-effective strategy for extending a system's capacity. Straight away, the focus shifts towards the strategy for adding capacity. We are no longer interested in a one-time capacity increase to improve scalability. We are interested in a strategy for adding capacity and the number of times the strategy can be applied cost-effectively.

This line of thought always makes me think about world-class tennis players like Rafael Nadal or Roger Federer. Over the years, how many times have players like Nadal or Federer adapted their playing style to the changing demands of the game? Many times, I reckon. Each change they made prolonged their careers and made them more successful. However, making a change is not easy. Many players rise to the top and fizzle out because they could not adapt. Even the best players are bound to face difficulties in trying to adapt. The easier it is for a player to make changes, the higher the probability of that player maintaining his dominance in the game. The same goes for scaling a system.

When we consider the second definition of scalability, we start considering other types of arguments about our system. If we add more processors to increase capacity, what is the method for coordinating work between the added processors? Will the coordination method take more processing cycles? If yes, the full benefit of adding capacity won't be realized. Hence, adding processors beyond a certain point may not be a cost-effective approach to improving scalability. Focusing on the repeated application of a scaling strategy makes us more cognizant of our choices. For example, replacing an O(n^2) algorithm with an O(n log n) one makes it possible to process a larger workload in the same amount of time. In other words, replacing an algorithm with a more efficient one improves the system's scalability. But can we use this approach repeatedly? I don't think so. Once we have the most efficient algorithm in place, the algorithmic replacement strategy stops being viable. You cannot keep applying the strategy again and again to improve the system's scalability.

Is a System Scalable?
Armed with the two definitions of scalability, I was finally able to make sense of the fundamental question. This was the same question that popped up in our meeting and was answered inadequately. To answer the question, we end up labeling a system as scalable or not. In my view, this is an oversimplification.

What Does It Mean for a System To Be Scalable?

Most systems scale in some sense. However, no system is infinitely scalable. So how should we approach scalability?

Theoretical Approach

Rather than labeling systems, it is more fruitful to compare the scalability of two different systems. Consider the below graph that shows the response vs. demand curves for two hypothetical systems, A and B. For any given demand level, the response is worse for System A as compared to System B. If there is a maximum tolerable value for the response, System A will reach it earlier than System B. System B is more scalable than System A. Of course, if both lines continue to rise at the same monotonic rate, they will eventually reach a point where the demand for resources exceeds their availability. At that point, the response times will become unsatisfactory for both systems. Though the points may be different for A and B, they signify the limit of a system's scalability. Remember: no system is infinitely scalable.

Real systems do not follow a monotonic rate of increase in the response vs. demand metric. The curve looks more like the below example. The hypothetical system tolerates demand increases quite well until it hits a significant design limitation. This limitation forms a knee-like shape in the response-demand curve. After crossing a particular demand level, the response metric simply goes out of hand. The goal of designers is to keep the knee of the curve as far to the right as possible. Once a system gets close to the knee, it is no longer scalable. Any increase in demand will push it down the hill.

Practical Approach

The theoretical aspects of scalability were genuine eye-openers for me. But I still thought something was missing. I couldn't completely relate the above graphs to our system. We were able to scale our system successfully during the initial days whenever there was a general increase in load. Things went south only when we hit the database limit. This behavior does not fit into either of the two theoretical models of scalability.

How To Classify This Behavior

Most real systems exhibit a more hybrid behavior, as the below graph shows. When a system is running in the bottom green zone, it is responding well. As it begins to linger in the middle zone (shown in yellow), the response starts to get unacceptable. When it finally crosses into the top zone (shown in red), the system becomes unusable. In my opinion, this is a more accurate representation of the second definition of scalability. On the first couple of occasions, the system designers were able to take quick action to keep the response within acceptable limits. It was probably a matter of adding additional memory. On the third occasion, the designers needed more time and effort to bring the response metric down to acceptable levels. The changes needed to improve the response were probably more complicated. In the last case, the system reached a point where no reasonable solution was possible to keep the response times from becoming unusable. The system was no longer scalable using the cost-effective techniques that had worked the first few times. The above turn of events also depicts the situation of our system.
Even though we were able to keep our system scalable for some time, it ultimately reached a point beyond which no cost-effective solution was possible. So, is the system scalable? The answer is: it depends! And not in a cop-out sort of way. Scalability is more of a moving target that evolves rather than a fixed state. If the system owner can afford to continue pouring money in to meet higher demand levels, the system is scalable up to a certain point. Beyond this point, no cost-effective action can alleviate the response metric issues. If the system owner has no money to purchase additional resources at the very first hiccup, the system is not scalable. Whatever the case, no system is infinitely scalable.

Concluding Thoughts

My journey to understand the true meaning of scalability led to some important conclusions. In my view, you cannot simply slap a label of scalability on a system and call it a day. All systems are scalable to some extent. But that does not mean you can continue scaling them indefinitely in a cost-effective manner. Achieving good scalability requires a nuanced approach to system design. Focusing only on adding resources to improve scalability is a trap. You also need to consider the cost-effectiveness of the additional resources. With this post, my goal was to bring this perspective to the discussion. Do share your thoughts in the comments section below.
WordPress is the world’s most popular content management system, used by over 40% of all websites. This wide adoption makes it a top target for threat actors and for security researchers who get paid for reporting security issues through its public bug bounty program. Vulnerability brokers are also very interested in acquiring unpatched vulnerabilities enabling them to take over WordPress instances, sometimes offering up to $300,000 for critical ones. As such, WordPress has a heavily reviewed code base in which researchers are not expected to find low-hanging fruit anymore. My previous research on this target required extensive expertise and effort to uncover security issues. This blog post describes a surprisingly simple vulnerability in WordPress’s implementation of pingbacks. While the impact of this vulnerability is low for most WordPress users, the related vulnerable code pattern is fairly interesting to document, as it is also probably present in most web applications. The goal of this blog post is to educate about this pattern and to raise awareness.

Disclosure

This vulnerability was reported to WordPress on January 21, 2022; no fix is available yet. Please refer to the section Patch to obtain guidance on potential remediations to apply to your WordPress instances. It is the first time my team and I have released details about an unpatched vulnerability, and this decision was not taken lightly. This issue was first reported back in January 2017 by another researcher, and numerous others have reported it over the years. After my report and further investigation, I could also identify multiple public blog posts documenting the same behavior as the one I'll be covering today. Because of its low impact as-is, its prior publication, and the need to chain it with additional vulnerabilities in third-party software, I believe this release won't endanger WordPress users and can only help them harden their instances.

Impact

I couldn't generically identify ways to leverage this behavior to take over vulnerable instances without relying on other vulnerable services. It could, however, ease the exploitation of other vulnerabilities in the affected organization's internal network, for instance, using one of the recent Confluence OGNL injections, the epic remote code execution in Jenkins found by @orange_8361, or one of the other chains documented by AssetNote.

Technical Details

Use of the Vulnerable Construct in the Pingback Feature

Pingbacks are a way for blog authors to be notified when other “friend” blogs reference a given article: these references are displayed alongside comments and can be freely accepted or rejected. Under the hood, blogs have to perform HTTP requests to each other to identify the presence of links, and visitors can also trigger this mechanism. This feature has been widely criticized, as it enables attackers to perform distributed denial-of-service attacks by maliciously asking thousands of blogs to check for pingbacks on a single victim server. Pingbacks are still enabled by default on WordPress instances because of the importance of social and community features when it comes to personal blogging. However, it is not expected that these requests can be sent to other internal services hosted on the same server or local network segment. The pingback functionality is exposed on the XML-RPC API of WordPress. As a reminder, this is an API endpoint expecting XML documents in which the client can choose a function to invoke along with its arguments.
One of the implemented methods is pingback.ping, which expects the arguments pagelinkedfrom and pagelinkedto: the first one is the address of the article referencing the second one. pagelinkedto has to point to an existing article of the local instance, here http://blog.tld/?p=1, and pagelinkedfrom to the external URL that should contain a link to pagelinkedto. Below is what a request to this endpoint would look like:

HTTP POST /xmlrpc.php HTTP/1.1 Host: blog.tld [...] <methodCall> <methodName>pingback.ping</methodName> <params> <param> <value><string>http://evil.tld</string></value> </param> <param> <value><string>http://blog.tld/?p=1</string></value> </param> </params> </methodCall>

Implementation of the URL Validation

The WordPress Core method wp_http_validate_url() runs a couple of checks on user-provided URLs to reduce the risks of abuse. For instance:

- The destination can't contain a username and password;
- The hostname must not contain the following characters: #:?[]
- The domain name should not point to a local or private IP address like 127.0.0.1, 192.168.*, etc.;
- The destination port of the URL must be either 80, 443, or 8080.

The third step may involve resolving domain names if present in the URL (e.g., http://foo.bar.tld). In that case, the IP address of the remote server is obtained by parsing the URL [1] and later resolving it [2] before validating it to exclude non-public IP ranges:

src/wp-includes/http.php PHP <? $parsed_url = parse_url( $url ); // [1] // [...] $ip = gethostbyname( $host ); // [2] // [...] if ( $ip === $host ) { // Error condition for gethostbyname() return false; } // IP validation happens here }

The validation code looks correctly implemented, and the URL is now considered trusted. What happens next?

Implementation of the HTTP Client(s)

Two HTTP clients can handle pingback requests after the URL has been validated, chosen based on the available PHP features: Requests_Transport_cURL and Requests_Transport_fsockopen. They are both part of the Requests library, developed independently under the WordPress umbrella. Let's have a look at the implementation of the latter. From its name, I know that it uses the PHP streams API. It operates at the transport level, and the client has to craft the HTTP request manually. The URL is parsed again using parse_url(), and then its host part is used to create a destination compatible with the PHP streams API (e.g., tcp://host:port):

wp-includes/Requests/Transport/fsockopen.php PHP <? public function request($url, $headers = array(), $data = array(), $options = array()) { // [...] $url_parts = parse_url($url); // [...] $host = $url_parts['host']; // [...] else { $remote_socket = 'tcp://' . $host; } // [...] $remote_socket .= ':' . $url_parts['port'];

Further on, this destination is used to create a new stream with stream_socket_client(), and the HTTP request is crafted and written to it:

wp-includes/Requests/Transport/fsockopen.php PHP <? $socket = stream_socket_client($remote_socket, $errno, $errstr, ceil($options['connect_timeout']), STREAM_CLIENT_CONNECT, $context); // [...] $out = sprintf("%s %s HTTP/%.1F\r\n", $options['type'], $path, $options['protocol_version']); // [...] if (!isset($case_insensitive_headers['Host'])) { $out .= sprintf('Host: %s', $url_parts['host']); // [...] } // [...] fwrite($socket, $out);

As we can see, this process implies another DNS resolution, so that stream_socket_client() can identify the host's IP address to send the packets to. The behavior of the other HTTP client, cURL, is very similar and won't be covered here.
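To condense these moving parts before going further, here is what the same validate-then-request construct often looks like in other stacks, shown as a minimal TypeScript/Node.js sketch. This is hypothetical illustration code, not WordPress's implementation: the helper names are made up and the private-range check is deliberately simplified.

TypeScript
// Hypothetical sketch of the construct described above: validate a
// user-supplied URL first, then hand it to an HTTP client that performs
// its own, separate DNS resolution.
import { lookup } from "node:dns/promises";
import { isIP } from "node:net";

// Simplified check; real code needs full RFC 1918/RFC 4193 coverage,
// IPv6-mapped IPv4 addresses, link-local ranges, and so on.
function isPrivateAddress(ip: string): boolean {
  return (
    ip.startsWith("127.") ||
    ip.startsWith("10.") ||
    ip.startsWith("192.168.") ||
    ip === "::1"
  );
}

// Rough stand-in for wp_http_validate_url(): resolve the hostname and
// reject URLs that currently point to a private or loopback address.
async function validateUrl(rawUrl: string): Promise<boolean> {
  const url = new URL(rawUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname;
  const ip = isIP(host) ? host : (await lookup(host)).address; // first DNS resolution
  return !isPrivateAddress(ip);
}

async function sendPingback(pagelinkedfrom: string): Promise<void> {
  if (!(await validateUrl(pagelinkedfrom))) {
    throw new Error("URL rejected");
  }
  // The HTTP client re-parses the URL and performs its own DNS
  // resolution when the request is actually sent.
  await fetch(pagelinkedfrom); // second DNS resolution (Node 18+ global fetch)
}

Keep this shape in mind; it is exactly the pattern the next section dissects.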
The Vulnerability

This construct has a problem: the HTTP client has to re-parse the URL and re-resolve the hostname to send its request. Meanwhile, an attacker could have changed the domain to point to a different address than the one validated before! This bug class is also called Time-of-Check-Time-of-Use (TOCTOU): a resource is validated but can be changed later, before its effective use. It is common to find such vulnerabilities in mitigations against Server-Side Request Forgery (SSRF). I summarized what these successive steps look like with the diagram below:

Exploitation Scenarios

I've audited the code in the hope of finding parser differential bugs that would allow reaching unintended ports or performing POST requests, without success: the initial URL validation steps are restrictive enough to prevent their exploitation. As mentioned earlier, attackers would have to chain this behavior with another vulnerability to significantly impact the targeted organization's security.

Patch

I am not aware of any public patch available at the time of writing this publication; the details above are based on an intermediate patch shared with us during the disclosure process. Addressing such vulnerabilities requires persisting the validated data until it is used to perform the HTTP request; it should not be discarded or transformed after the validation step. The WordPress maintainers followed this path by introducing a second, optional argument to wp_http_validate_url(). This parameter is passed by reference and contains the IP addresses on which WordPress performed the validation. The final code is slightly more verbose to accommodate older versions of PHP, but the main idea is here. As a temporary workaround, I recommend that system administrators remove the pingback.ping handler from the XML-RPC endpoint. One way to do this is to update the functions.php of the theme in use to introduce the following call:

PHP <? add_filter('xmlrpc_methods', function($methods) { unset($methods['pingback.ping']); return $methods; });

It is also possible to block access to xmlrpc.php at the web server level.

Timeline

- 2022-01-21: My team and I submit the vulnerability to the maintainers with a 90-day disclosure policy.
- 2022-01-21: My submission is triaged as a duplicate of a report originally sent exactly five years earlier (2017-01-21).
- 2022-04-11: WordPress requests an extension of 30 days to our 90-day disclosure policy, as they need more time to work on backports. I agree.
- 2022-05-23: Maintainers share a patch for WordPress 5.9.3.
- 2022-06-01: My team and I provide positive feedback on the patch.
- 2022-07-16: My team and I communicate our intent to release this publication on September 6.
- 2022-09-01: Final heads-up about the upcoming publication.
- 2022-09-06: This article is released, 228 days after our report and 2054 days after the initial report by another researcher.

Summary

In this article, I described a blind SSRF vulnerability affecting WordPress Core. While the impact is deemed low in this case, this is a widespread vulnerable code pattern that my team and I continue to encounter even in big projects. I encourage developers to check their own code bases for this type of vulnerability which, as I have demonstrated, can hide even in highly popular and well-reviewed code. I want to thank the WordPress maintainers for their help in addressing this issue, even if I couldn't reach the best outcome possible.
Oren Eini
Wizard,
Hibernating Rhinos @ayende
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Solutions Engineer,
Cockroach Labs
Sahiti Kappagantula
Product Associate,
Oorwin