DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
View Events Video Library
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Mobile Database Essentials: Assess data needs, storage requirements, and more when leveraging databases for cloud and edge applications.

Full-Stack Observability Essentials: Explore the fundamentals of system-wide observability and key components of the OpenTelemetry standard.

Monitoring and Observability for LLMs: Datadog and Google Cloud discuss how to achieve optimal AI model performance.

Automated Testing: The latest on architecture, TDD, and the benefits of AI and low-code tools.

Trending

  • LTS JDK 21 Features
  • How to Submit a Post to DZone
  • Spring Authentication With MetaMask
  • Docker and Kubernetes Transforming Modern Deployment

Reviewing LevelDB, Part II

Oren Eini user avatar by
Oren Eini
·
Mar. 22, 13 · Interview
Like (0)
Save
Tweet
Share
3.55K Views

Join the DZone community and get the full member experience.

Join For Free

I think that the very first thing that we want to do is to actually discover how exactly is leveldb saving the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

    leveldb::DB* db;

    leveldb::DB::Open(options, "play/testdb", &db);

    status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

    // Default implementations of convenience methods that subclasses of DB

    // can call if they wish

    Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {

      WriteBatch batch;

      batch.Put(key, value);

      return Write(opt, &batch);

    }

     

    Status DB::Delete(const WriteOptions& opt, const Slice& key) {

     WriteBatch batch;

     batch.Delete(key);

     return Write(opt, &batch);

   }

I included the Delete method as well, because this code teaches us something important, all the modifications calls are always going through the same WriteBatch call. Let us look at that now.

    Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {

      Writer w(&mutex_);

      w.batch = my_batch;

      w.sync = options.sync;

      w.done = false;

     

      MutexLock l(&mutex_);

      writers_.push_back(&w);

      while (!w.done && &w != writers_.front()) {

       w.cv.Wait();

     }

     if (w.done) {

       return w.status;

     }

    

     // May temporarily unlock and wait.

     Status status = MakeRoomForWrite(my_batch == NULL);

     uint64_t last_sequence = versions_->LastSequence();

     Writer* last_writer = &w;

     if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions

       WriteBatch* updates = BuildBatchGroup(&last_writer);

       WriteBatchInternal::SetSequence(updates, last_sequence + 1);

       last_sequence += WriteBatchInternal::Count(updates);

    

       // Add to log and apply to memtable.  We can release the lock

       // during this phase since &w is currently responsible for logging

       // and protects against concurrent loggers and concurrent writes

       // into mem_.

       {

         mutex_.Unlock();

         status = log_->AddRecord(WriteBatchInternal::Contents(updates));

         if (status.ok() && options.sync) {

           status = logfile_->Sync();

         }

         if (status.ok()) {

           status = WriteBatchInternal::InsertInto(updates, mem_);

         }

         mutex_.Lock();

       }

       if (updates == tmp_batch_) tmp_batch_->Clear();

    

       versions_->SetLastSequence(last_sequence);

     }

    

     while (true) {

       Writer* ready = writers_.front();

       writers_.pop_front();

       if (ready != &w) {

         ready->status = status;

         ready->done = true;

         ready->cv.Signal();

       }

       if (ready == last_writer) break;

     }

    

     // Notify new head of write queue

     if (!writers_.empty()) {

       writers_.front()->cv.Signal();

     }

     return status;

   }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seems to create a new Writer, not sure what that is yet, and register that in a class variable. Maybe it is actually being written on a separate thread?

I am going to switch over and look at that line of thinking .First thing to do is to look at the Writer implementation. This writer looks like this:

        struct DBImpl::Writer {

          Status status;

          WriteBatch* batch;

          bool sync;

          bool done;

          port::CondVar cv;

         

          explicit Writer(port::Mutex* mu) : cv(mu) { }

        };

So this is just a data structure with no behavior. Note that we have CondVar, whatever that is. Which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deep into this, but it appears like it is similar to the .NET lock variable. Except that there seems to be the ability to associate multiple variables with a single mutex. Which could be a useful way to signal on specific conditions. The basic idea is that you can wait for a specific operation, not just a single variable.

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, (a data structure merging properties of list & queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look where else this value is being used now. Back now, and it appears that the only place where writers_ are used in in this method or methods that it calls.

What this means in turn is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access. As I understand it. Calls to the Write() method would block on the mutex access, then it waits until its write is the current one (that is what the &w != writers_.front() means. Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.

Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room to the next write. I don’t really follow the code there yet, I’ll ignore that for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why that is, I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup, this method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.

The rest of the code is dealing with the actual write to the log  / mem table 20 – 45, then updating the status of the other writers we might have modified, as well as starting the writes for existing writers that may have not got into the current batch.

And I think that this is enough for now. We haven’t got to disk yet, I admit, but we did get a lot of stuff done. On my next post, I’ll dig even deeper, and try to see how the data is actually structured, I think that this would be interesting…



LevelDB

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: