In-Memory Database Architecture: Ten Years of Experience Summarized (Part 1)
In-memory does not mean unreliable, and it does not mean just a cache. Learn how Tarantool is built to see the benefits of living in RAM.
An in-memory database is not a new concept. However, it is associated too closely with terms like "cache" and "non-persistent". In this article, I want to challenge these ideas. In-memory solutions have a much wider range of use cases and offer higher reliability than it might seem at first glance.
I want to talk about the architectural principles of in-memory databases, and about how to take the best of the "in-memory world", its incredible performance, without losing the benefits of disk-based relational systems. First and foremost, that means ensuring data safety.
This article sums up ten years of experience with in-memory solutions in one place. The reader doesn't need that much experience to benefit from it: a basic familiarity with IT is enough.
- Development history
- Tarantool today
- How does the core work
- Fibers and cooperative multi-tasking
My name is Vladimir Perepelitsa, also known as Mons Anderson. I am a Tarantool architect and product manager. I've been using it in production for many years, for example, to build S3-compatible object storage. So I know it inside out.
To understand a technology, it is useful to take a look at its history. I will summarize what Tarantool was like at the beginning, what it has gone through, and what it is now. Then I will compare it to other databases, review its functionality, explain how it can work over a network, and look at the ecosystem around it.
This example will show you what benefits you can get from in-memory solutions, and how to use them without losing reliability, scalability, or usability.
Tarantool was created by Mail.ru Group's in-house development team in 2008, initially without any plan to open-source it. However, after two years of in-house operation, we realized that the product was mature enough to share publicly. That is how Tarantool's open-source history started.
commit 9b8dd7032d05e53ffcbde78d68ed3bd47f1d8081
Author: Yuriy Vostrikov <email@example.com>
Date: Thu Aug 12 11:39:14 2010 +0400
What was it created for?
Initially, Tarantool was developed for the my.mail.ru social network. The company was already fairly large by then. A MySQL cluster stored profiles, sessions, and users, and it was rather expensive. In fact, it was so expensive that we had to think about money, not just performance. Hence the story, "How to save a million dollars on the database…"
That is to say, Tarantool was made to save money on huge MySQL clusters. It gradually evolved from just a cache to a persistent cache, and then to a full-blown database.
Having earned an in-house reputation in one project, it began to spread to others: the email service, banner ads, and cloud solutions. As it became widely used within the company, new projects were often launched on Tarantool by default.
Tracing the development of Tarantool, you can see the following picture. Initially, Tarantool was an in-memory cache. At the time of its inception, it was no different from Memcached.
To solve cold cache issues, Tarantool was made persistent. Then we added replication. With a persistent cache and replication, we had ourselves a key-value database. We added indexes to that key-value database, and we were now able to use Tarantool almost as a relational database.
Next, we added Lua functions. Initially, those were stored procedures for working with data. Then, Lua functions developed into a cooperative runtime and an application server.
All of that was gradually extended with additional features, capabilities, and storage functions. Now we have a multi-paradigm database. The following section covers this in a bit more detail.
Today, Tarantool is an in-memory computing platform with a flexible data schema.
Tarantool can and should be used to create high-performance applications. This means implementing complex data storage and processing solutions, not just making caches. That said, it is not merely a database — it is a platform to create solutions.
Tarantool is offered in two versions. One is a widely available, plain and clear open-source version. Tarantool is developed under the Simplified BSD license and hosted entirely on GitHub in the Tarantool organization.
There we have Tarantool itself, its core, and connectors to external systems; topologies such as sharding or queues; and modules and libraries, both from the development team and from the community.
In addition to the open-source version, there is an enterprise branch in Tarantool development. First of all, it includes support, enterprise products, training, custom development and consulting.
The following sections discuss core functionality available in all product versions. Everything you read about here is free to use.
Tarantool today is a basic component for database-centric applications.
How Does the Core Work?
The central concept of Tarantool is that data is always stored in RAM. This data is always accessed from a single thread. The changes we make are written sequentially to the write-ahead log (WAL).
Indexes are built over the data in memory. This means that access to data is indexed and predictable. A snapshot of this data is saved from time to time. What is written to disk can be replicated.
Tarantool has one main transactional thread. We call it the TX thread. This thread owns the Arena, a memory area allocated by Tarantool for data storage. Data in Tarantool is stored in spaces.
A space is a collection of storage units called tuples. A tuple is like a row in a table. Indexes are built over this data. The Arena, together with the specialized allocators that operate within it, is responsible for storing and arranging everything.
- Tuple = row
- Space = table
The TX thread runs an event loop, and fibers operate within that event loop. Fibers are cooperative primitives through which we work with spaces: from a fiber, we can read data or create data. Fibers can also interact with the event loop and with each other, either directly or through special primitives called channels.
There is a separate thread, iproto, that serves clients from the outside. It receives a request from the network, processes the Tarantool protocol, and forwards the request to TX, where each user request runs in a separate fiber.
Whenever data changes, a separate thread called WAL (Write Ahead Log) records the change in files called xlogs.
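The idea of an append-only change log can be sketched in a few lines of Python. This is an illustration only: the JSON-lines record format, the `WriteAheadLog` class, the sequence-number field, and the file name are assumptions made for the example and do not reflect Tarantool's actual xlog format.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy append-only log: every change becomes one durable record."""

    def __init__(self, path):
        self.path = path
        self.lsn = 0                          # sequence number of the last record
        self._file = open(path, "a", encoding="utf-8")

    def append(self, op, key, value=None):
        self.lsn += 1
        record = {"lsn": self.lsn, "op": op, "key": key, "value": value}
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())         # force the record onto disk
        return self.lsn

    def close(self):
        self._file.close()


def read_log(path):
    """Read all records back in the order they were written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


log_path = os.path.join(tempfile.mkdtemp(), "demo.xlog")  # arbitrary file name
wal = WriteAheadLog(log_path)
wal.append("replace", "user:1", {"name": "alice"})
wal.append("delete", "user:1")
wal.close()
records = read_log(log_path)
```

Because records are only ever appended, the disk sees sequential writes, and replaying the file from the beginning reproduces the same sequence of changes.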
When Tarantool accumulates many xlogs, startup can become slow. Therefore, periodic saves called snapshots are used to speed it up. A fiber called the snapshot daemon saves the snapshots: it reads a consistent view of the whole Arena and writes it to a snapshot file on disk.
Tarantool cannot write to disk directly from the TX thread because of cooperative multitasking. Blocking is prohibited there, and a disk write is a blocking operation. Therefore, disk operations go through a separate pool of threads from the fio library.
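The same pattern, keeping the event loop free by handing disk writes to worker threads, can be sketched with Python's `asyncio` and a thread pool. This is an analogy rather than Tarantool's fio library; `blocking_write`, `ticker`, and the file name are hypothetical names for the example.

```python
import asyncio
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def blocking_write(path, data):
    """Ordinary blocking file write; it runs in a worker thread."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return len(data)


async def ticker(events, n):
    """Other cooperative work that keeps running while the write is in flight."""
    for i in range(n):
        events.append(f"tick {i}")
        await asyncio.sleep(0)                # yield control to the event loop


async def main(path):
    events = []
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Hand the blocking call to the pool; the event loop is not stalled.
        write = loop.run_in_executor(pool, blocking_write, path, b"x" * 1024)
        await ticker(events, 3)
        written = await write
    events.append(f"wrote {written} bytes")
    return events


path = os.path.join(tempfile.mkdtemp(), "data.bin")
events = asyncio.run(main(path))
```

The event loop thread never calls `write` or `fsync` itself. It only waits for the result, so other tasks keep making progress in the meantime.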
Tarantool's replication is quite elegant. For each replica, the master starts one more thread, called relay, whose job is to read xlogs and send them to that replica. On the replica, a fiber called applier is started. It receives changes from the remote host and applies them to the Arena.
The changes are written to the replica's xlog via WAL, in the same way as if they had been made locally. Knowing how it all works, you can understand and predict the behavior of Tarantool at any time.
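The relay/applier pair can be sketched as a producer and a consumer of log records. This Python model is a simplification made for illustration: the `Node` class, the record layout, and the sequence numbers are hypothetical, both nodes live in one process, and there is no network or threading involved.

```python
class Node:
    """Toy instance: in-memory data plus an ordered log of changes."""

    def __init__(self):
        self.data = {}
        self.xlog = []                        # list of (lsn, key, value)

    def write(self, key, value):
        lsn = len(self.xlog) + 1
        self.xlog.append((lsn, key, value))   # log the change
        self.data[key] = value                # apply it in memory
        return lsn

    def last_lsn(self):
        return self.xlog[-1][0] if self.xlog else 0


def relay(master, from_lsn):
    """Master side: stream the log records the replica has not seen yet."""
    for record in master.xlog:
        if record[0] > from_lsn:
            yield record


def applier(replica, stream):
    """Replica side: log each received change locally, then apply it."""
    applied = 0
    for lsn, key, value in stream:
        replica.xlog.append((lsn, key, value))
        replica.data[key] = value
        applied += 1
    return applied


master, replica = Node(), Node()
master.write("a", 1)
master.write("b", 2)
first_batch = applier(replica, relay(master, replica.last_lsn()))
master.write("a", 3)
second_batch = applier(replica, relay(master, replica.last_lsn()))
```

Since the replica logs every change it receives, its own log ends up identical to the master's, and it could in turn feed further replicas or recover after a restart.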
What happens at restart? Imagine that Tarantool has been running for some time, there is a snapshot, and there is an xlog.
If it is restarted, then:
- Tarantool finds the last snapshot and starts reading it.
- After reading the snapshot, it looks for the xlogs written after that snapshot and reads them.
- Once the snapshot and the xlogs have been read, the data in memory matches its state at the moment before the restart.
- Tarantool builds indexes along the way. Only primary indexes are built while the snapshot is being read.
- Once all the data has been loaded into memory, the secondary indexes are built.
- Tarantool starts the application.
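The recovery steps above can be sketched as a single function. This is a minimal Python model under stated assumptions: the snapshot is a dictionary holding a sequence number and rows, xlog records are `(lsn, op, row)` triples, and the `recover` function and its data layout are hypothetical rather than Tarantool's file formats.

```python
def recover(snapshot, xlogs):
    """Rebuild in-memory state from the last snapshot plus newer log records."""
    # Step 1: read the snapshot, building the primary index as rows arrive.
    primary = {}
    for row in snapshot["rows"]:
        primary[row[0]] = row

    # Step 2: replay only the log records written after the snapshot.
    for lsn, op, row in sorted(xlogs, key=lambda record: record[0]):
        if lsn <= snapshot["lsn"]:
            continue                          # already covered by the snapshot
        if op == "replace":
            primary[row[0]] = row
        elif op == "delete":
            primary.pop(row[0], None)

    # Step 3: all data is in memory, so secondary indexes can be built.
    secondary = {}
    for key, row in primary.items():
        secondary.setdefault(row[1], []).append(key)
    return primary, secondary


# Hypothetical rows of the form (id, city); the snapshot was taken at lsn 2.
snapshot = {"lsn": 2, "rows": [(1, "Moscow"), (2, "Berlin")]}
xlogs = [
    (1, "replace", (1, "Moscow")),
    (2, "replace", (2, "Berlin")),
    (3, "replace", (3, "Moscow")),
    (4, "delete", (2, "Berlin")),
    (5, "replace", (1, "Paris")),
]
primary, secondary = recover(snapshot, xlogs)
```

The snapshot bounds how much log has to be replayed: records 1 and 2 are skipped here, and only the three changes made after the snapshot are applied.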
Tarantool Core in Six Lines:
- Data is stored in memory.
- Data is accessed from one thread.
- Changes are written to Write Ahead Log.
- Indexes are built over the data.
- Snapshots are saved from time to time.
- WAL is replicated.
Applications within Tarantool are written in Lua and run on LuaJIT. But why?
First, Lua is a straightforward scripting language originally created for engineers rather than programmers, that is, for people with an engineering background who don't have deep knowledge of programming.
Lua was intentionally made simple. Therefore, it was possible to create a JIT compiler that raises the performance of this scripting language almost to the level of C. There are cases where a small Lua program compiled by LuaJIT performs at a level similar to a C program.
Lua allows writing efficient apps with ease. The central idea of Tarantool is to be close to data. By starting a program in the same namespace and process where the data lives, we avoid spending time on network round trips.
Since we address memory directly, read latency is near zero and predictable. All of that could be achieved with Lua functions alone, but Tarantool also has an event loop with fibers inside it, and Lua is integrated with them.
- Lua: straightforward script language for engineers
- Highly efficient JIT compilation
- Being close to data
- Cooperative runtime, not procedures
Fibers and Cooperative Multitasking
A fiber is a thread of execution. It is similar to a regular thread, but it is lighter, and it implements cooperative multitasking. This gives it the following properties:
- No more than one task is performed at a time.
- There is no preemptive scheduler in the system. Every fiber must yield control voluntarily.
Having no scheduler and no tasks executing in parallel reduces overhead and improves performance. Together, these properties make it possible to build an application server: from within Tarantool, you can reach out to the outside world.
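Cooperative multitasking can be demonstrated with Python generators, where `yield` marks the point at which a task gives up control. This is a sketch of the concept, not of Tarantool's fiber implementation; the `fiber` and `run` functions are hypothetical names for the example.

```python
from collections import deque


def fiber(name, steps, trace):
    """A cooperative task: it runs until it explicitly yields control."""
    for i in range(steps):
        trace.append(f"{name}:{i}")
        yield                                 # voluntary yield point


def run(fibers):
    """Resume ready tasks in turn; nothing is ever interrupted mid-step."""
    ready = deque(fibers)
    while ready:
        current = ready.popleft()
        try:
            next(current)                     # run until the next yield
        except StopIteration:
            continue                          # the task has finished
        ready.append(current)                 # otherwise it waits for its next turn


trace = []
run([fiber("a", 2, trace), fiber("b", 3, trace)])
# trace == ["a:0", "b:0", "a:1", "b:1", "b:2"]
```

Only one task runs at any moment, and control changes hands only at a `yield`. That is why code between two yield points needs no locks, and also why a task that never yields would stall every other task.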
Within Tarantool, inside the application server itself, there are functions for working with a database. During Tarantool evolution, a platform has developed on top of this application server.
The platform is basically an in-memory database with an integrated application server. Or vice versa, an application server plus a database. Apart from that, Tarantool offers tools for replication and sharding, tools for clustering and managing the cluster, and connectors to external systems.
- Fiber is a lightweight thread of execution that implements cooperative multitasking.
- Every task is performed once the current task yields control.
- Application server
  - Event loop with fibers
  - Non-blocking work with sockets
  - A collection of libraries for working with network and data
  - Functions for working with databases
- Tarantool platform
  - In-memory database
  - Built-in application server
  - Tools for clustering
  - Connectors to external systems
Tarantool is persistent, and it can work with many other systems. Therefore, it is used as a cache proxy in front of legacy systems, which can be heavy and complex. It can act both as a write-through proxy and as a write-behind proxy.
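The difference between the two proxy modes is when the legacy system sees a write. The following Python sketch illustrates it with hypothetical classes (`LegacyStore`, `WriteThroughCache`, `WriteBehindCache`); it says nothing about how such a proxy is actually built on Tarantool.

```python
class LegacyStore:
    """Stand-in for a slow backend; it counts how many writes reach it."""

    def __init__(self):
        self.rows = {}
        self.write_calls = 0

    def write(self, key, value):
        self.write_calls += 1
        self.rows[key] = value


class WriteThroughCache:
    """Every write goes to the backend first, then to the cache."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def put(self, key, value):
        self.backend.write(key, value)
        self.cache[key] = value


class WriteBehindCache:
    """Writes land in the cache at once and reach the backend on flush."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.dirty = {}                       # key -> latest unflushed value

    def put(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value               # repeated writes to a key coalesce

    def flush(self):
        flushed = len(self.dirty)
        for key, value in self.dirty.items():
            self.backend.write(key, value)
        self.dirty.clear()
        return flushed


through_backend = LegacyStore()
through = WriteThroughCache(through_backend)
through.put("k", 1)
through.put("k", 2)

behind_backend = LegacyStore()
behind = WriteBehindCache(behind_backend)
behind.put("k", 1)
behind.put("k", 2)
calls_before_flush = behind_backend.write_calls
flushed = behind.flush()
```

Write-through keeps the backend current at the cost of one backend write per update. Write-behind absorbs bursts and merges repeated updates, which only makes sense when the cache itself is persistent, since unflushed changes exist nowhere else.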
Furthermore, Tarantool's architecture, its fibers, and its features for writing complex applications make it a good tool for building queues. I personally know of six queue implementations: some are available on GitHub, others live in closed repositories or inside specific projects.
The main reason for that is guaranteed low-latency access. When you are inside Tarantool and go for some data, it is served from memory. You get fast concurrent access to data. This lets you build hybrid applications that run really close to the data.
In the second part of the article we will talk about:
- Database primitives and functionality
- Comparison to other systems
  - In-memory platforms
  - Relational databases
  - Key-value databases
  - Document-oriented databases
  - Column-oriented databases
- Use cases
Until the next time, goodbye!