How To Embed Documents for Semantic Search
Explore embedding documents to be used for a semantic search. Follow examples to learn how embedding influences search results and how to improve the results.
In this post, you will take a closer look at embedding documents to be used for a semantic search. By means of examples, you will learn how embedding influences the search result and how you can improve the results. Enjoy!
Introduction
In a previous post, a chat with documents using LangChain4j and LocalAI was discussed. One of the conclusions was that the document format has a large influence on the results. In this post, you will take a closer look at the influence of source data and the way it is embedded in order to get a better search result.
The source documents are two Wikipedia documents. You will use the discography and list of songs recorded by Bruce Springsteen. The interesting part of these documents is that they contain facts and are mainly in a table format. The same documents were used in the previous post, so it will be interesting to see how the findings from that post compare to the approach used in this post.
This blog can be read without reading the previous blogs if you are familiar with the concepts used. If not, it is recommended to read the previous blogs as mentioned in the prerequisites paragraph.
The sources used in this blog can be found on GitHub.
Prerequisites
The prerequisites for this blog are:
- Basic knowledge of embedding and vector stores
- Basic Java knowledge: Java 21 is used
- Basic knowledge of LangChain4j - see the previous blogs:
- You need LocalAI if you want to run the examples at the end of this blog. See a previous blog on how you can make use of LocalAI. Version 2.2.0 is used for this blog.
Embed Whole Document
The easiest way to embed a document is to read the document, split it into chunks, and embed the chunks. Embedding means transforming the text into vectors (numbers). The question you will ask also needs to be embedded.
The vectors are stored in a vector store which is able to find the results that are the closest to your question and will respond with these results. The source code consists of the following parts:
- The text needs to be embedded. An embedding model is needed for that; for simplicity, use the AllMiniLmL6V2EmbeddingModel. This model uses the BERT model, which is a popular embedding model.
- The embeddings need to be stored in an embedding store. Often, a vector database is used for this purpose; but in this case, you can use an in-memory embedding store.
- Read the two documents and add them to a list. Create a DocumentSplitter that splits the documents into chunks of 500 characters with no overlap.
- By means of the DocumentSplitter, the documents are split into TextSegments.
- The embedding model is used to embed the TextSegments. The TextSegments and their embedded counterparts are stored in the embedding store.
- The question is also embedded with the same model.
- Ask the embedding store to find relevant embedded segments to the embedded question. You can define how many results the store should retrieve. In this case, only one result is asked for.
- If a match is found, the following information is printed to the console:
  - The score: A number indicating how well the result corresponds to the question
  - The original text: The text of the segment
  - The metadata: Will show you the document the segment comes from
private static void askQuestion(String question) {
    EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
    EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

    // Read the documents and split them into segments of at most 500 characters, without overlap
    Document springsteenDiscography = loadDocument(toPath("example-files/Bruce_Springsteen_discography.pdf"));
    Document springsteenSongList = loadDocument(toPath("example-files/List_of_songs_recorded_by_Bruce_Springsteen.pdf"));
    ArrayList<Document> documents = new ArrayList<>();
    documents.add(springsteenDiscography);
    documents.add(springsteenSongList);
    DocumentSplitter documentSplitter = DocumentSplitters.recursive(500, 0);
    List<TextSegment> documentSegments = documentSplitter.splitAll(documents);

    // Embed the segments and store them together with their embeddings
    Response<List<Embedding>> embeddings = embeddingModel.embedAll(documentSegments);
    embeddingStore.addAll(embeddings.content(), documentSegments);

    // Embed the question and find the single most relevant segment
    Embedding queryEmbedding = embeddingModel.embed(question).content();
    List<EmbeddingMatch<TextSegment>> embeddingMatch = embeddingStore.findRelevant(queryEmbedding, 1);

    // Print the score, the segment text and the segment metadata
    System.out.println(embeddingMatch.get(0).score());
    System.out.println(embeddingMatch.get(0).embedded().text());
    System.out.println(embeddingMatch.get(0).embedded().metadata());
}
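To get a feeling for what the embedding store does when it searches for the closest result, the following is a minimal sketch in plain Java. It is not the LangChain4j implementation: the class ToyEmbeddingStore, the record Match, and the hand-made three-dimensional vectors are assumptions made for illustration only, and cosine similarity is assumed as the similarity measure.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an embedding store (hypothetical, not a LangChain4j class):
// keeps (vector, text) pairs in memory and ranks them by cosine similarity.
final class ToyEmbeddingStore {

    // Hypothetical result type for this sketch.
    record Match(double score, String text) {}

    private final List<double[]> vectors = new ArrayList<>();
    private final List<String> texts = new ArrayList<>();

    void add(double[] vector, String text) {
        vectors.add(vector);
        texts.add(text);
    }

    // Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns at most maxResults matches, best score first.
    List<Match> findRelevant(double[] query, int maxResults) {
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            matches.add(new Match(cosine(query, vectors.get(i)), texts.get(i)));
        }
        matches.sort((x, y) -> Double.compare(y.score(), x.score()));
        return new ArrayList<>(matches.subList(0, Math.min(maxResults, matches.size())));
    }

    public static void main(String[] args) {
        ToyEmbeddingStore store = new ToyEmbeddingStore();
        // Hand-made vectors; a real embedding model produces hundreds of dimensions.
        store.add(new double[] {0.9, 0.1, 0.0}, "songs table segment");
        store.add(new double[] {0.1, 0.9, 0.1}, "references segment");
        store.add(new double[] {0.0, 0.2, 0.9}, "chart positions segment");

        double[] question = {0.8, 0.2, 0.1};
        List<Match> result = store.findRelevant(question, 1);
        if (result.isEmpty()) {
            System.out.println("No match found");
        } else {
            System.out.println(result.get(0).score() + " " + result.get(0).text());
        }
    }
}
```

The sketch checks whether a match was found before printing, which is a small safeguard you may also want in the real code when the store can be empty.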
The questions are the following, and are some facts that can be found in the documents:
public static void main(String[] args) {
    askQuestion("on which album was \"adam raised a cain\" originally released?");
    askQuestion("what is the highest chart position of \"Greetings from Asbury Park, N.J.\" in the US?");
    askQuestion("what is the highest chart position of the album \"tracks\" in canada?");
    askQuestion("in which year was \"Highway Patrolman\" released?");
    askQuestion("who produced \"all or nothin' at all?\"");
}
Question 1
The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?"
0.6794537224516205
Jim Cretecos 1973 [14]
"57 Channels (And Nothin'
On)" Bruce Springsteen Human Touch
Jon Landau
Chuck Plotkin
Bruce
Springsteen
Roy Bittan
1992 [15]
"7 Rooms of Gloom"
(Four Tops cover)
Holland–Dozier–
Holland †
Only the Strong
Survive
Ron Aniello
Bruce
Springsteen
2022 [16]
"Across the Border" Bruce Springsteen The Ghost of Tom
Joad
Chuck Plotkin
Bruce
Springsteen
1995 [17]
"Adam Raised a Cain" Bruce Springsteen Darkness on the Edge
of Town
Jon Landau
Bruce
Springsteen
Steven Van
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=4, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }
What do you see here?
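- The score is 0.679…: This is a relevance score based on the similarity between the embedded question and the embedded segment. The higher the score, the closer the match; it does not mean that the segment matches 67.9% of the question.
- The segment itself contains the specified information at line 27 of the output (the row starting with "Adam Raised a Cain"). The correct segment is chosen - this is great.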
- The metadata shows the document where the segment comes from.
You also see how the table is transformed into a text segment: it isn’t a table anymore. In the source document, the information is formatted as follows:
Another thing to notice is where the text segment is split. So, if you had asked who produced this song, it would be an incomplete answer, because this row is split in column 4.
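Why a row ends up in two segments can be illustrated with a deliberately naive splitter. The sketch below is an assumption for illustration: it cuts the text every fixed number of characters, which is simpler than what DocumentSplitters.recursive does, and the chunk size of 160 is chosen only so that the cut lands inside the second row. The two rows are taken from the output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of LangChain4j.
final class FixedSizeChunkingSketch {

    // Naive splitter: cuts the text every maxChars characters and ignores row boundaries.
    static List<String> split(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            chunks.add(text.substring(start, Math.min(start + maxChars, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String rows =
                "\"Across the Border\" Bruce Springsteen The Ghost of Tom Joad Chuck Plotkin Bruce Springsteen 1995\n"
              + "\"Adam Raised a Cain\" Bruce Springsteen Darkness on the Edge of Town Jon Landau Bruce Springsteen Steven Van Zandt (assistant) 1978\n";

        // The song title ends up in the first chunk, the year in the second one.
        List<String> chunks = split(rows, 160);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("--- chunk " + i + " ---");
            System.out.println(chunks.get(i));
        }
    }
}
```

A question about the producers or the year of "Adam Raised a Cain" would then retrieve the chunk with the title, which no longer contains the complete row.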
Question 2
The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?"
0.6892728817378977
29. Greetings from Asbury Park, N.J. (LP liner notes). Bruce Springsteen. US: Columbia
Records. 1973. KC 31903.
30. Nebraska (LP liner notes). Bruce Springsteen. US: Columbia Records. 1982. TC 38358.
31. Chapter and Verse (CD booklet). Bruce Springsteen. US: Columbia Records. 2016. 88985
35820 2.
32. Born to Run (LP liner notes). Bruce Springsteen. US: Columbia Records. 1975. PC 33795.
33. Tracks (CD box set liner notes). Bruce Springsteen. Europe: Columbia Records. 1998. COL
492605 2 2.
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=100, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }
The information is found in the correct document, but the wrong text segment is found. This segment comes from the References section and you needed the information from the Songs table, just like for question 1.
Question 3
The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?"
0.807258199400863
56. @billboardcharts (November 29, 2021). "Debuts on this week's #Billboard200 (1/2)..." (https://twitter.com/bil
lboardcharts/status/1465346016702566400) (Tweet). Retrieved November 30, 2021 – via Twitter.
57. "ARIA Top 50 Albums Chart" (https://www.aria.com.au/charts/albums-chart/2021-11-29). Australian
Recording Industry Association. November 29, 2021. Retrieved November 26, 2021.
58. "Billboard Canadian Albums" (https://www.fyimusicnews.ca/fyi-charts/billboard-canadian-albums).
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=142, file_name=Bruce_Springsteen_discography.pdf, document_type=PDF} }
The information is found in the correct document, but also here, the segment comes from the References section, while the answer to the question can be found in the Compilation albums table. This can explain some of the wrong answers that were given in the previous post.
Question 4
The following is the result for question 4: "In which year was 'Highway Patrolman' released?"
0.6867325432140559
"Highway 29" Bruce Springsteen The Ghost of Tom
Joad
Chuck Plotkin
Bruce
Springsteen
1995 [17]
"Highway Patrolman" Bruce Springsteen Nebraska Bruce
Springsteen 1982 [30]
"Hitch Hikin' " Bruce Springsteen Western Stars
Ron Aniello
Bruce
Springsteen
2019 [53]
"The Hitter" Bruce Springsteen Devils & Dust
Brendan O'Brien
Chuck Plotkin
Bruce
Springsteen
2005 [24]
"The Honeymooners" Bruce Springsteen Tracks
Jon Landau
Chuck Plotkin
Bruce
Springsteen
Steven Van
Zandt
1998
[33]
[76]
"House of a Thousand
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=31, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }
The information is found in the correct document and the correct segment is found. However, it is difficult to retrieve the correct answer because of the formatting of the text segment, and you do not have any context about what the information represents. The column headers are gone, so how should you know that 1982 is the answer to the question?
Question 5
The following is the result for question 5: "Who produced 'All or Nothin’ at All'?"
0.7036564758755796
Zandt (assistant)
1978 [18]
"Addicted to Romance" Bruce Springsteen She Came to Me
(soundtrack)
Bryce Dessner 2023
[19]
[20]
"Ain't Good Enough for
You" Bruce Springsteen The Promise
Jon Landau
Bruce
Springsteen
2010
[21]
[22]
"Ain't Got You" Bruce Springsteen Tunnel of Love
Jon Landau
Chuck Plotkin
Bruce
Springsteen
1987 [23]
"All I'm Thinkin' About" Bruce Springsteen Devils & Dust
Brendan O'Brien
Chuck Plotkin
Bruce
Springsteen
2005 [24]
"All or Nothin' at All" Bruce Springsteen Human Touch
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/example-files, index=5, file_name=List_of_songs_recorded_by_Bruce_Springsteen.pdf, document_type=PDF} }
The information is found in the correct document, but again, the segment is split in the row where the answer can be found. This can explain the incomplete answers that were given in the previous post.
Conclusion
Two answers are correct, one is partially correct, and two are wrong.
Embed Markdown Document
What would change when you convert the PDF documents into Markdown files? Tables are probably easier to recognize in Markdown files than in PDF documents, and Markdown allows you to segment the document at the row level instead of at some arbitrary chunk size. Only the parts of the documents that contain the answers to the questions are converted; this means the Studio albums and Compilation albums from the discography and the List of songs recorded.
The segmenting is done as follows:
- Split the document line per line.
- Retrieve the data of the table in the variable dataOnly.
- Save the header of the table in the variable header.
- Create a TextSegment for every row in dataOnly and add the header to the segment.
The source code is as follows:
List<Document> documents = loadDocuments(toPath("markdown-files"));
List<TextSegment> segments = new ArrayList<>();
for (Document document : documents) {
    // The first two lines are the table header, the remaining lines are the data rows
    String[] splittedDocument = document.text().split("\n");
    String[] dataOnly = Arrays.copyOfRange(splittedDocument, 2, splittedDocument.length);
    String header = splittedDocument[0] + "\n" + splittedDocument[1] + "\n";
    // Create one segment per row, each prefixed with the header
    for (String splittedLine : dataOnly) {
        segments.add(TextSegment.from(header + splittedLine, document.metadata()));
    }
}
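If you want to try the row-level segmenting without LangChain4j, the same idea can be sketched with plain strings. The class MarkdownRowSegmenterSketch is a hypothetical name, and the guard for short input and the skipping of blank lines are additions that are not part of the code above. The sketch assumes that the table starts on the first line of the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; works on plain strings instead of Documents.
final class MarkdownRowSegmenterSketch {

    // Builds one segment per table row, each prefixed with the two header lines.
    static List<String> segment(String markdownTable) {
        String[] lines = markdownTable.split("\n");
        List<String> segments = new ArrayList<>();
        if (lines.length < 3) {
            return segments; // header only, no data rows
        }
        String header = lines[0] + "\n" + lines[1] + "\n";
        for (int i = 2; i < lines.length; i++) {
            if (!lines[i].isBlank()) { // skip empty lines between or after rows
                segments.add(header + lines[i]);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String table =
                "| song | writer(s) | original release | Producer(s) | year |\n"
              + "|---|---|---|---|---|\n"
              + "| \"Highway Patrolman\" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982 |\n"
              + "| \"All or Nothin' at All\" | Bruce Springsteen | Human Touch | Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan | 1992 |\n";

        for (String segment : segment(table)) {
            System.out.println(segment);
            System.out.println();
        }
    }
}
```

Every segment is a small, self-describing table: the column names travel with the row, so the year and the producers can be identified without the rest of the document.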
Question 1
The following is the result for question 1: "On which album was 'Adam Raised a Cain' originally released?"
0.6196628642947255
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }
The answer is incorrect.
Question 2
The following is the result for question 2: "What is the highest chart position of 'Greetings from Asbury Park, NJ' in the US?"
0.8229951885990189
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
| Greetings from Asbury Park,N.J. |60|71|—|—|—|—|—|—|35|41|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_studio_albums.md, document_type=UNKNOWN} }
The answer is correct, and the answer can easily be retrieved, as you have the header information for every column.
Question 3
The following is the result for question 3: "What is the highest chart position of the album 'Tracks' in Canada?"
0.7646818618182345
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|Tracks|27|97|—|63|—|36|—|4|11|50|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }
The answer is correct.
Question 4
The following is the result for question 4: "In which year was 'Highway Patrolman' released?"
0.6108392657222184
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
The answer is incorrect. The correct document is found, but the wrong segment is chosen.
Question 5
The following is the result for question 5: "Who produced 'All or Nothin’ at All'?"
0.6724577751120745
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
| "All or Nothin' at All" | Bruce Springsteen | Human Touch | Jon Landau Chuck Plotkin Bruce Springsteen Roy Bittan |1992 |
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
The answer is correct and complete this time.
Conclusion
Three answers are correct and complete. Two answers are incorrect. Note that the incorrect answers are for different questions than before. Overall, the result is slightly better than with the PDF files.
Alternative Questions
Let’s build upon this a bit further. You are not using a Large Language Model (LLM) here, which would help you with textual differences between the questions you ask and with the interpretation of the results. Maybe it helps when you change the question in order to use terminology that is closer to the data in the documents. The source code can be found here.
Question 1
Let’s change question 1 from "On which album was 'Adam Raised a Cain' originally released?" to "What is the original release of 'Adam Raised a Cain'?". The column in the table is named original release, so that might make a difference.
The result is the following:
0.6370094541277747
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
| "Adam Raised a Cain" | Bruce Springsteen | Darkness on the Edge of Town | Jon Landau Bruce Springsteen Steven Van Zandt (assistant) | 1978|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
The answer is correct this time and the score is slightly higher.
Question 4: Attempt #1
Question 4 is, "In which year was 'Highway Patrolman' released?" Remember that you only asked for the first relevant result. However, more relevant results can be displayed. Set the maximum number of results to 5.
List<EmbeddingMatch<TextSegment>> relevantMatches = embeddingStore.findRelevant(queryEmbedding,5);
The result is:
0.6108392657222184
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6076896858171996
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Turn! Turn! Turn!" (with Roger McGuinn) | Pete Seeger † | Magic Tour Highlights (EP) | John Cooper | 2008|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6029946650419344
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6001672430441461
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Downbound Train" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.5982557901838741
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
As you can see, Highway Patrolman is a result, but only the fifth result. That is a bit strange, though.
Question 4: Attempt #2
Let’s change question 4 from, "In which year was 'Highway Patrolman' released?" to, "In which year was the song 'Highway Patrolman' released?" So, you add "the song" to the question.
The result is:
0.6506125707025556
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.641000538311824
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 |
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6402738046796352
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6362427185719677
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.635837703599965
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Wreck on the Highway"| Bruce Springsteen |The River | Jon Landau Bruce Springsteen Steven Van Zandt |1980 |
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
Now Highway Patrolman is the fourth result. It is getting better.
Question 4: Attempt #3
Let’s add the words "of the album Nebraska" to question 4. The question becomes, "In which year was the song 'Highway Patrolman' of the album Nebraska released?"
The result is:
0.6468954949440158
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Working on the Highway" |Bruce Springsteen| Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt |1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6444919056791143
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Darlington County" | Bruce Springsteen | Born in the U.S.A. | Jon Landau Chuck Plotkin Bruce Springsteen Steven Van Zandt | 1984|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6376680100362238
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Highway Patrolman" | Bruce Springsteen | Nebraska | Bruce Springsteen | 1982|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
0.6367565537138745
| Title |Album details| US | AUS | GER | IRE | NLD |NZ |NOR|SWE|UK
|-----------------------------------------------|-------------|---|---|---|---|---|---|---|---|---|
|The Essential Bruce Springsteen|14|41|—|—|5|22|—|4|2|15|
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_discography_compilation_albums.md, document_type=UNKNOWN} }
0.6364950606665447
| song | writer(s) | original release | Producer(s) |year|
|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------|-|
|"Raise Your Hand" (live) (Eddie Floyd cover) | Steve Cropper Eddie Floyd Alvertis Isbell † | Live 1975–85 | Jon Landau Chuck Plotkin Bruce Springsteen |1986 |
Metadata { metadata = {absolute_directory_path=/<project directory>/mylangchain4jplanet/target/classes/markdown-files, file_name=bruce_springsteen_list_of_songs_recorded.md, document_type=UNKNOWN} }
Again, an improvement: Highway Patrolman is now listed as the third result. It is still strange that it is not listed as the first result. However, by adding more information, it ranks higher in the result list. This is as expected.
Conclusion
Changing the question with terminology that is closer to the source data helps in order to get a better result. Adding more context to the question also helps. Displaying more results gives you more insight and lets you determine the correct answer from the result list.
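When you compare different phrasings of a question, it helps to track at which position the expected row shows up in the result list. The following is a minimal sketch of such a check. The class ResultRankSketch and the record Match are hypothetical names for this illustration; the scores and song titles are copied from attempts #1 and #3 above.

```java
import java.util.List;

// Hypothetical helper for illustration; Match stands in for a scored search result.
final class ResultRankSketch {

    record Match(double score, String text) {}

    // Returns the 1-based position of the first match whose text contains the
    // expected fragment, or -1 when none of the matches contains it.
    static int rankOf(List<Match> matches, String expectedFragment) {
        for (int i = 0; i < matches.size(); i++) {
            if (matches.get(i).text().contains(expectedFragment)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Match> attempt1 = List.of(
                new Match(0.6108392657222184, "\"Working on the Highway\""),
                new Match(0.6076896858171996, "\"Turn! Turn! Turn!\" (with Roger McGuinn)"),
                new Match(0.6029946650419344, "\"Darlington County\""),
                new Match(0.6001672430441461, "\"Downbound Train\""),
                new Match(0.5982557901838741, "\"Highway Patrolman\""));
        List<Match> attempt3 = List.of(
                new Match(0.6468954949440158, "\"Working on the Highway\""),
                new Match(0.6444919056791143, "\"Darlington County\""),
                new Match(0.6376680100362238, "\"Highway Patrolman\""),
                new Match(0.6367565537138745, "The Essential Bruce Springsteen"),
                new Match(0.6364950606665447, "\"Raise Your Hand\" (live) (Eddie Floyd cover)"));

        System.out.println("Attempt #1 rank: " + rankOf(attempt1, "Highway Patrolman")); // 5
        System.out.println("Attempt #3 rank: " + rankOf(attempt3, "Highway Patrolman")); // 3
    }
}
```

With the values above, the rank improves from 5 to 3, which matches what you saw in the result lists.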
Markdown Embedding Combined With LLM
The conclusions up until now are:
- The format of the documents and the way the documents are segmented and embedded have a significant influence on the results.
- If the question uses terminology that is close to the data in the documents, better results are achieved.
In the previous blog, the PDF files were used as source data. The first conclusion is now addressed by using Markdown and better segmenting and embedding. The second conclusion can probably be addressed by using an LLM.
Let’s see what happens when you combine the embedding in this post with an LLM from the previous post. The full source code can be found on GitHub.
The questions and results are:
- Q: On which album was “Adam Raised a Cain” originally released?
  A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen album in 1995.
  - This answer is wrong but corresponds to the first embedded result earlier.
- Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?
  A: The highest chart position of “Greetings from Asbury Park, NJ” in the US is 60.
  - This answer is correct.
- Q: What is the highest chart position of the album “Tracks” in Canada?
  A: The album “Tracks” by Metallica has not reached the highest chart position in Canada. According to the provided information, the highest chart position it reached in Canada was #27.
  - This answer is wrong. The first embedded result earlier was correct, so it is strange that the wrong answer is given here.
- Q: In which year was “Highway Patrolman” released?
  A: The song “Highway Patrolman” was released in 1984.
  - The answer is wrong, but it can be explained. The first embedded result earlier showed the result for Working on the Highway, and this song was released in 1984.
- Q: In which year was “Higway Patrolman” released?
  A: The song “Highway Patrolman” was released in 1984.
  - The question here included a typo. The result is similar to the previous one, so this is consistent.
- Q: Who produced “All or Nothin’ at All”?
  A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on Springsteen’s 1992 album “Human Touch.”
  - This answer is correct and complete. With the PDF as source data, the answer was always incomplete.
The results are worse than using the PDF files as source documents. However, when a correct answer is given, it is complete. The answers are also consistent; a minor typo does not change the answer like it did before.
Markdown Embedding Combined With System Messages
In the previous blog, the final solution also included adding system messages to the chat memory. Let’s see what happens when you combine this with Markdown source documents. The full source code can be found on GitHub. Below is a small extract of the system messages.
ArrayList<ChatMessage> chatMessages = new ArrayList<>();
chatMessages.add(new SystemMessage("Use the provided documents to answer the questions."));
chatMessages.add(new SystemMessage("If the answer cannot be found in the documents, write \"I could not find an answer\"."));
The questions and results are:
- Q: On which album was “Adam Raised a Cain” originally released?
  A: The album “Adam Raised a Cain” was originally released on The Essential Bruce Springsteen in 1995.
  - This answer is wrong and similar to the previous paragraph.
- Q: What is the highest chart position of “Greetings from Asbury Park, NJ” in the US?
  A: Based on the provided information, the highest chart position of “Greetings from Asbury Park, N.J.” in the US is 60.
  - This answer is correct and similar to the previous paragraph.
- Q: What is the highest chart position of the album “Tracks” in Canada?
  A: Based on the information provided, the highest chart position of the album “Tracks” in Canada is 64.
  - This answer is still wrong. However, Metallica is not mentioned anymore, so it is a bit less wrong than in the previous paragraph.
- Q: In which year was “Highway Patrolman” released?
  A: The song “Highway Patrolman” was released in 1984.
  - The answer is wrong and identical to the previous paragraph.
- Q: In which year was “Higway Patrolman” released?
  A: The song “Highway Patrolman” was released in 1984.
  - The answer is wrong and identical to the previous paragraph.
- Q: Who produced “All or Nothin’ at All”?
  A: The song “All or Nothin’ at All” was produced by Bruce Springsteen, Jon Landau, Chuck Plotkin, and Roy Bittan. It was originally released on the album “Human Touch” in 1992.
  - This answer is correct, complete, and similar to the previous paragraph.
Adding system messages had hardly any influence on the results: only the answer to question 3 changed, and it is still wrong.
Overall Conclusion
What did you learn from all of this?
- The way documents are read and embedded seems to have the largest influence on the result.
- An advantage of this approach is that you are able to display a number of results. This allows you to determine which result is the correct one.
- Changing your question in order to use the terminology used in the text segments helps to get a better result.
- Querying a vector store is very fast. Embedding costs some time, but you only need to do this once. Using an LLM takes a lot more time to retrieve a result when you do not use a GPU.
An interesting resource to read is Deconstructing RAG, a blog from LangChain. Improvements in this area should lead to better results.
Published at DZone with permission of Gunter Rotsaert, DZone MVB. See the original article here.