Indexing Chinese in Solr

By Jason Hull · Jan. 30, 12

Recently, we had a project where we helped a client index a corpus of Chinese-language documents in Solr. We asked Dan Funk, a committer to Project Blacklight, to write a guest blog post for us on how to approach indexing Chinese, particularly when you are a non-speaker.

Take it away, Dan!

Indexing Chinese in Solr

Prologue (including thanks and some vital orientation)

Before I start, I’d like to thank a few people who helped me muddle through indexing a language I can’t speak and still come off looking like a pro. Wiley Kestner (@prairie_dogg) sat for hours giving me tips and pointers about Chinese. Christopher Ball helped me quickly put an excellent and professional face on my work by using the Blacklight project. And Eric Pugh (@dep4b) provided some much-needed mentoring, helping me see a way forward in what I initially believed was an intractable problem.

If you don’t read Chinese or have not worked with it before, here are a few things you should know:


  1. Chinese words are frequently made up of more than one character, and words are not separated by spaces. (Read this as “tokenization is a big problem.”)
  2. Spoken Chinese is completely different from written Chinese, so don’t stress about the multitude of dialects when you are indexing.
  3. There are two common types of written Chinese: Simplified and Traditional. Since Traditional can be converted to Simplified fairly easily, the focus of this document is on Simplified. Traditional text has many more characters and thus the potential for deeper, subtler meanings.
  4. Though traditionally written from top to bottom, right to left, it is far more common to see Chinese written from left to right, particularly on the web.
  5. Don’t depend on your documents being in UTF-8; you are far more likely to encounter GB2312 encoding (see the conversion sketch after this list).
  6. A great method for testing relevancy in a language you don’t know is to use a judgment list; see Eric Pugh’s presentation for more information.
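
If you do run into GB2312-encoded documents (point 5 above), converting them to UTF-8 before indexing is usually the simplest route. Here is a minimal sketch using iconv; the file name is just a placeholder:

dan@maus:~$ iconv -f GB2312 -t UTF-8 fish_doc.txt > fish_doc.utf8.txt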

My Best Advice:

OK, here are the two most important pieces of advice I can give you:

#1: Separate your Chinese text into its own field(s).
That is to say, don’t try to index multiple languages in the same field. If your Lucene/Solr field structure is complicated, add a second core with duplicate field names. Why? (A schema sketch follows this list.)
A. You set yourself up for handling additional languages fluidly and effectively.
B. You can use the best indexer available for each language (see advice #2).
C. You improve overall performance because the indexes are smaller and tighter.
D. You remove confusing, and likely false, results in a language the end user does not understand.
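
To make this concrete, here is a minimal schema.xml sketch with one field per language. The field and type names (title_en, title_zh, text_zh) are illustrative only; text_zh stands in for whichever Chinese analyzer you settle on in advice #2:

<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_zh" type="text_zh" indexed="true" stored="true"/>

A query against title_zh then only ever touches Chinese text, which is what makes points A through D above possible.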

#2: Use the CJK or Paoding analyzers for your Chinese text.
There is some great documentation out there for CJK, but if you would like to give Paoding a shot, here are some directions to help get you up and running:

1. Don’t use the binary distribution.  It won’t work with the latest versions of Solr.  Instead, grab the source:

dan@maus:~$ cd code

dan@maus:~/code $ svn co  http://paoding.googlecode.com/svn/trunk/ paoding-analysis

2. Compile it with Ant.

dan@maus:~/code$ cd paoding-analysis

dan@maus:~/code/paoding-analysis$ ant

…

Building jar: paoding-analysis.jar

3. Build a modified Solr war file.

The Paoding analyzer, while brilliant at analyzing Chinese text, was not originally built to work well in a web-deployed environment, and it depends heavily on file paths to get to its built-in dictionaries. To correct for this, you will need to inject the analyzer and its configuration files into your Solr war file. I tested this approach with apache-solr-3.4.0 by doing the following:

dan@maus:~$ mkdir tmp

dan@maus:~$ cd tmp

dan@maus:~/tmp$ unzip /usr/local/apache-solr-3.4.0/dist/apache-solr-3.4.0.war

dan@maus:~/tmp$ cp ~/code/paoding-analysis/paoding-analysis.jar WEB-INF/lib/

dan@maus:~/tmp$ cp ~/code/paoding-analysis/classes/*.properties WEB-INF/

dan@maus:~/tmp$ zip -r apache-solr-3.4.0-paoding.war *
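
As an optional sanity check, list the contents of the new war file and confirm that paoding-analysis.jar landed in WEB-INF/lib/:

dan@maus:~/tmp$ unzip -l apache-solr-3.4.0-paoding.war | grep paoding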

4. Update your Solr configuration and add a paoding field type.

<fieldType name="paoding" class="solr.TextField">
  <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
</fieldType>
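
Then point a field at the new type. The field name content_zh below is only an illustration; use whatever your schema calls its Chinese-text field:

<field name="content_zh" type="paoding" indexed="true" stored="true"/>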

5. Copy Paoding’s dictionary files into your Solr home directory.

dan@maus:~/solr_home$ cp -r ~/code/paoding-analysis/dic .

6. Set the PAODING_DIC_HOME property when starting Solr, to let the Paoding analyzer know where to find the dictionary files:

dan@maus:~/solr_home$  java -DPAODING_DIC_HOME=./dic -jar start.jar
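
Once Solr is running, a quick query against your Chinese field is an easy way to confirm the analyzer is wired in. This is only a sketch: content_zh is the hypothetical field from step 4, and it assumes you have already indexed some documents:

dan@maus:~$ curl "http://localhost:8983/solr/select?q=content_zh:爬蟲"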


Choosing the right Analyzer

Now that I’ve recommended Paoding and CJK, let me back that up with some details.  Below I delve just a little more into the structure of Chinese text, and then run through a comparison of the available tokenizers to help give you an idea of their differences.

The Structure of Chinese Text

Most languages use spaces to separate their words. A common misconception is that Chinese characters are Chinese words, but this is the case only a fraction of the time.

Take 的 (de), for example. It is by far the single most common character in Standard Chinese. It has little use on its own, but when placed with other characters it can mean:

我的: my
高的: high, tall
是的: that’s it, that’s right
是…的: one who…
目的: goal, true, real
的确: certainly


In short, you can’t search for the characters individually as if they all carry the same weight, or the relevance of the search results will be embarrassingly reduced.

What Analyzers are available?

Let me introduce you to the options, then follow up with some comparisons that show how the tokenizing actually differs.
To my knowledge, what follows is a complete list of the open-source options available for parsing, indexing, and searching Chinese characters in Solr/Lucene. While commercial options definitely exist, they were not part of this comparison.

Default Solr setup
  Pros: No new configuration required, and it roughly supports multiple languages.
  Cons: Tokenizes on spaces, but shifts to per-character tokenization for Chinese text. See the previous section for why this is problematic.

CJK
  Pros: Thoughtfully parses Chinese characters and understands that character groups alter meaning. Ships with, and is part of, Solr’s default configuration.
  Cons: Does not use a dictionary; depends largely on an n-gram algorithm that creates all possible pairings of adjacent characters in the text.

Smart Chinese
  Pros: Uses a dictionary to pull out words. Ships with Solr as an add-on package.
  Cons: The dictionary is minimal; it handles general cases well, but many nuances of the language are lost. It requires a custom Solr configuration.

Paoding
  Pros: Uses a large set of dictionaries and provides exceptionally good search results across a multitude of contexts.
  Cons: Can be very difficult to configure and set up; almost all documentation is written in Chinese. Does not ship with Solr, and must be built from source to work correctly with the latest stable Solr versions.
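
For reference, here are minimal fieldType sketches for the CJK and Smart Chinese options. These are untested sketches aimed at Solr 3.x; the Smart Chinese factories ship in the analysis-extras contrib, so verify the class names and jar locations against your own Solr version:

<!-- CJK: ships with Solr -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Smart Chinese: requires the analysis-extras contrib jars on the classpath -->
<fieldType name="text_smartcn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>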


About the Sample Document Set:

A set of 12 documents was loaded into Lucene. The first 10 are about “types of fish” and are based on a quick Google search of the same. The 11th document is a Wikipedia article on Hồ Chí Minh, and the 12th document is about a person whose name begins with Hồ Chí.

Example 1: 爬蟲

爬蟲 means “reptile”.
爬 : [pá] crawl, climb
蟲 : [chóng] the traditional form of 虫, meaning worm; paired with 书 to mean insect


So here is a case where we have a traditional character*, and a paired set of characters that have an alternate meaning from what they mean separately.

The results below list the tokens each analyzer produces and the documents that match. In this example, the string “爬蟲” is split into two tokens by the default Solr setup, but remains a single token in CJK.

Default Solr setup
  Tokens: 爬, 蟲
  Hits: 2 (doc 8 and doc 3)
    爬蟲類: “reptile” (a good hit)
    爬岩鳅: “Beaufortia loach” (a bad hit)
  Even more problematic, highlighting will mark these as two separate hits, even when it gets the document right.

CJK
  Tokens: 爬蟲
  Hits: 1 (doc 8)
  It gets the right document, but only because CJK always groups characters in pairs; we will see it fall short in the next example.

Smart Chinese
  Tokens: 爬, 蟲
  Hits: 2 (doc 8 and doc 3)

Paoding
  Tokens: 爬蟲
  Hits: 1 (doc 8)

* Note: A second run, replacing the traditional character 蟲 with the simplified 虫 character, does not match any documents in the test set, though it would have been correct to do so. The ICU Project provides an API that can perform this conversion.
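
As an illustration, ICU’s uconv command-line tool can apply the Traditional-to-Simplified transform. This is a sketch assuming uconv is installed and the terminal is using UTF-8:

dan@maus:~$ echo "爬蟲" | uconv -x "Traditional-Simplified"
爬虫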

Example 2: 胡志明

Hồ Chí Minh was a profoundly important leader in Vietnam. However, divide these characters up and you might get “a recklessly clear magazine.”

胡志明 means “Hồ Chí Minh”.

胡 : [hú] “recklessly”; 胡说 nonsense; (F鬍) (=胡子 húzi) beard; (F衚) 胡同 hútòng lane
志 : [zhì] (=意志 yìzhì) will; (=标志 biāozhì) mark; 同志 tóngzhì comrade; (F誌) 杂志 zázhì magazine
明 : [míng] bright, clear, distinct, next (day or year), <family name>; 明白 míngbai clear, understand

And if you pair characters blindly (as CJK does), you just get sounds: common pairings used in names.

Default Solr setup
  Tokens: 胡, 志, 明
  Hits: 6
    胡志明 (Hồ Chí Minh)
    胡 志 (Hồ Chí)
    明目 (eyesight)
    眼目 (eyes)
    杂志中的 (magazines)
    起明显 (the apparent)

CJK
  Tokens: 胡志, 志明
  Hits: 2
    胡志明 (Hồ Chí Minh)
    胡志 (Hồ Chí)

Smart Chinese
  Tokens: 胡, 志, 明
  Hits: 4
    胡志明 (Hồ Chí Minh)
    胡 志 (Hồ Chí)
    明目 (eyesight)
    眼目 (eyes)

Paoding
  Tokens: 胡志明
  Hits: 1
    胡志明 (Hồ Chí Minh)


In Conclusion

It is possible for you to index Chinese, even if you don’t speak it.  The largest problem you will face is in correctly parsing the text, but there are several effective tools that help solve the problem.  I would strongly discourage you from indexing Chinese content with Solr’s default settings.  You will not get good results.  If you need to quickly add support for Chinese to an existing project, I highly recommend using the CJK analyzer.  However, if you have a discerning audience, a specialized area, or the need to enhance the quality of your results over time (by expanding on the included dictionaries) then Paoding is an excellent choice.

Resources and References

http://www.zein.se/patrick/3000char.html – The most common Chinese characters in order of frequency
http://translate.google.com/ – A fantastic way to quickly translate a few characters or a whole page of text.
http://site.icu-project.org/ – Provides an API for converting Traditional Chinese characters to Simplified.



Source:  http://www.opensourceconnections.com/2011/12/23/indexing-chinese-in-solr/




