An Exploration Into Lucene Disk Format

by Oren Eini · Apr. 01, 14 · Interview

i realized lately that i wanted to know a lot more about exactly how lucene is storing data on disk. oh, i know the general stuff about segments and files, etc. but i wanted to know the actual bits & bytes. so i started tracing into lucene and trying to figure out what it is doing.

and, by the way, the only thing that the lucene.net codebase is missing is this sign:

image

at any rate, this is how lucene writes the segment file. note that this is done in a crc32 signed file:
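as a rough illustration of the idea (in python rather than lucene.net's c#, and not lucene's actual code), a checksummed writer keeps a running crc32 of every byte it emits and appends that value as a trailer. the names `ChecksumWriter`, `write_segments_header` and `is_intact`, the `FORMAT_MARKER` value and the exact header fields are assumptions made for this sketch.

```python
import io
import struct
import zlib

FORMAT_MARKER = -1  # hypothetical format marker, not lucene's real constant


class ChecksumWriter:
    """Wraps a binary stream and keeps a running CRC32 of everything written."""

    def __init__(self, out):
        self._out = out
        self._crc = 0

    def write(self, data: bytes) -> None:
        # Fold the new bytes into the running checksum, then pass them through.
        self._crc = zlib.crc32(data, self._crc)
        self._out.write(data)

    def write_int(self, value: int) -> None:
        self.write(struct.pack(">i", value))  # 4-byte big-endian signed int

    def write_long(self, value: int) -> None:
        self.write(struct.pack(">q", value))  # 8-byte big-endian signed int

    def finish(self) -> None:
        # The trailer goes straight to the stream: it is not checksummed itself.
        self._out.write(struct.pack(">Q", self._crc))


def write_segments_header(version: int, counter: int, segment_count: int) -> bytes:
    """Write a simplified, assumed header layout followed by the CRC32 trailer."""
    buf = io.BytesIO()
    writer = ChecksumWriter(buf)
    writer.write_int(FORMAT_MARKER)
    writer.write_long(version)
    writer.write_int(counter)
    writer.write_int(segment_count)
    writer.finish()
    return buf.getvalue()


def is_intact(blob: bytes) -> bool:
    """Recompute the CRC32 over the body and compare it with the 8-byte trailer."""
    if len(blob) < 8:
        return False
    body, trailer = blob[:-8], blob[-8:]
    return zlib.crc32(body) == struct.unpack(">Q", trailer)[0]
```

a reader can call `is_intact` before trusting anything in the file; a flipped byte anywhere in the body makes the check fail.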

and the info write method is:
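a simplified sketch of what writing one segment's info can look like, in python. it assumes a record of just a segment name and a document count, with the name stored as a length-prefixed utf-8 string whose length uses a variable-length integer (7 bits per byte, low-order bits first, high bit set when more bytes follow). the real method writes more fields than this, and `write_segment_info` / `read_segment_info` are hypothetical names.

```python
import io


def write_vint(out, value: int) -> None:
    """Variable-length int: 7 payload bits per byte, high bit means 'more follows'."""
    if value < 0:
        raise ValueError("vint must be non-negative")
    while value >= 0x80:
        out.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    out.write(bytes([value]))


def read_vint(inp) -> int:
    result = 0
    shift = 0
    while True:
        byte = inp.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_string(out, text: str) -> None:
    data = text.encode("utf-8")
    write_vint(out, len(data))  # length in bytes, then the bytes themselves
    out.write(data)


def read_string(inp) -> str:
    length = read_vint(inp)
    return inp.read(length).decode("utf-8")


def write_segment_info(out, name: str, doc_count: int) -> None:
    # Assumed minimal record: segment name followed by a 4-byte document count.
    write_string(out, name)
    out.write(doc_count.to_bytes(4, "big", signed=True))


def read_segment_info(inp):
    name = read_string(inp)
    doc_count = int.from_bytes(inp.read(4), "big", signed=True)
    return name, doc_count
```

the point of the sketch is how positional the format is: nothing in the bytes says "this is the name", so a reader has to consume the fields in exactly the order the writer produced them.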

today, i would probably use a json file for something like that (bonus point, you know if it is corrupted and it is human readable), but this code was written in 2001, so that explains it.

this is the format of a segment file, and the segments.gen file is generated using:
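as far as i can tell, segments.gen is tiny: a format marker followed by the current generation written twice, so a reader can spot a half-written file when the two copies disagree. a python sketch under that assumption; the `FORMAT_MARKER` value and the function names are assumptions, not taken from the lucene.net source.

```python
import struct

FORMAT_MARKER = -2  # assumed marker value for this sketch
RECORD = struct.Struct(">iqq")  # int marker + generation twice, big-endian


def write_segments_gen(generation: int) -> bytes:
    """Serialize the marker and the generation, duplicated on purpose."""
    return RECORD.pack(FORMAT_MARKER, generation, generation)


def read_segments_gen(data: bytes):
    """Return the generation, or None when the file cannot be trusted."""
    if len(data) != RECORD.size:
        return None
    marker, first, second = RECORD.unpack(data)
    if marker != FORMAT_MARKER or first != second:
        return None  # unknown format or torn write
    return first
```

when `read_segments_gen` returns `None`, the caller would have to fall back to some other way of finding the latest segments file, such as listing the directory.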

moving on to actually writing data, i created ten lucene documents and wrote them, then just debugged through the code to see what would happen. it started by creating the _0.fdx and _0.fdt files. the .fdt file holds the field data, the .fdx file is the field index.

both of those files are used when writing the stored fields. this is the empty operation, writing an unstored field.

this is how fields are actually stored:

and then it ends up in:

note that this particular data goes in the fdt file, while the fdx appears to be a quick way to go from a known document id to the relevant position in the fdt file.
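a rough python sketch of that lookup: an index file with one fixed-width 8-byte pointer per document, each pointing at the start of that document's record in the data file. the record layout used here (a field count, then length-prefixed utf-8 values) is a simplification invented for the sketch, not lucene's real stored-field encoding, and the function names are hypothetical.

```python
import io
import struct

POINTER_SIZE = 8  # one 8-byte offset per document in the index file


def append_document(fdt, fdx, fields) -> None:
    """Append one document's stored values and record where they start."""
    fdt.seek(0, io.SEEK_END)
    fdx.seek(0, io.SEEK_END)
    fdx.write(struct.pack(">q", fdt.tell()))  # pointer into the data file
    fdt.write(struct.pack(">i", len(fields)))
    for value in fields:
        data = value.encode("utf-8")
        fdt.write(struct.pack(">i", len(data)))
        fdt.write(data)


def read_document(fdt, fdx, doc_id: int):
    """Jump to doc_id's pointer in the index, then read its record from the data file."""
    fdx.seek(doc_id * POINTER_SIZE)  # fixed width, so no scanning is needed
    (offset,) = struct.unpack(">q", fdx.read(POINTER_SIZE))
    fdt.seek(offset)
    (count,) = struct.unpack(">i", fdt.read(4))
    fields = []
    for _ in range(count):
        (length,) = struct.unpack(">i", fdt.read(4))
        fields.append(fdt.read(length).decode("utf-8"))
    return fields


def build_sample(doc_count: int = 10):
    """Write doc_count small documents into in-memory stand-ins for the two files."""
    fdt, fdx = io.BytesIO(), io.BytesIO()
    for i in range(doc_count):
        append_document(fdt, fdx, [f"doc-{i}", "x" * i])
    return fdt, fdx
```

because every pointer has the same size, finding document 7 is a single seek to byte 56 of the index, no matter how large the earlier documents were.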

as i was going through the code, i did some searches, and found a very detailed explanation of the actual file format in the docs. that is really nice and quite informative, however, just seeing how the “let us take the documents and make them searchable” part works is quite interesting. lucene has a lot of chains of responsibility going through it, and it is also quite interesting to see the design choices that were made.

unfortunately, lucene is very much wedded to its file format, and making changes to it isn’t going to be possible, which is a shame, since it impacts quite a lot of the way lucene works in general.

Lucene

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

