Alternative Ways of Fredhopper Reindexing, a.k.a. Paths Less Traveled

Data is constantly coming into your application, especially if you're working with a CMS. Read on to learn how to reindex this data.

I would like to start this blog by giving credit to Fredhopper support. They provided the methods described below over the course of a long and extensive back-and-forth. With that in mind, let's dive into the heart of the problem.

If you work with Fredhopper, and you probably do if you are reading this blog, you should know that the Fredhopper Indexer should be reindexed periodically (how often depends on the frequency of incremental updates). This is an official instruction from Fredhopper customer support. However, performing a reindex is a bit particular if you work with SDL's SmartTarget/Experience Optimization module, because content reaches Fredhopper whenever Tridion components are published or unpublished. How exactly? For every publish/unpublish action, the deployer forwards an XML file containing the item information to the Fredhopper Indexer instance. A Kettle job then verifies the file and enriches it with additional data, and once processed, the file is placed into the processed/batch folder.

The key phrase in the previous paragraph is "for every publish/unpublish action". If your site has extensive publishing activity, these files can number in the millions, which causes problems when performing a reindex the official way, since all of the instructions need to be copied into the catalog01 folder, as explained in the official documentation.

So, let's take a look at all the different approaches to performing a reindex, starting with the official documentation.

First Path: Official Procedure (With Minor Tweaks)

The official procedure states that the catalog01 folder should be deleted only after a schema change. According to Fredhopper support, however, the catalog01 folder should always be cleared when performing a reindex, to avoid processing the same instructions twice. If you think about it, every reindex copies the processed items into this folder. If the folder is not cleaned, each subsequent reindex will contain the previously copied instructions plus the freshly copied ones, which causes unnecessary processing of the same files.
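The clean-then-copy step described above could be scripted roughly as follows. This is a minimal sketch, not part of the official procedure: the function name `refresh_catalog` is hypothetical, both directories are passed in as arguments because their exact locations depend on your installation, and it assumes the processed instruction files sit directly inside the folder you pass in.

```shell
#!/bin/sh
# Sketch: empty catalog01, then copy the processed instructions into it.
# refresh_catalog is a hypothetical helper; both paths are supplied by the caller.

# refresh_catalog <processed_dir> <catalog_dir>
refresh_catalog() {
    processed_dir=$1
    catalog_dir=$2

    [ -d "$processed_dir" ] || { echo "missing: $processed_dir" >&2; return 1; }
    mkdir -p "$catalog_dir" || return 1

    # Remove leftovers from earlier reindexes so nothing is processed twice.
    find "$catalog_dir" -mindepth 1 -delete || return 1

    # Copy in batches through find, so that millions of files
    # do not overflow the shell's argument list.
    find "$processed_dir" -maxdepth 1 -type f \
        -exec sh -c 'dest=$1; shift; cp "$@" "$dest"/' sh "$catalog_dir" {} +
}
```

A call would look like `refresh_catalog /path/to/processed/batch /path/to/fas-xml/catalog01`, with both paths replaced by the real ones for your instance.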

Pros: Requires no republish time, works well for small numbers of instructions.

Cons: Requires copying all the instructions (potentially millions), so the copy alone can take considerable time, on top of a considerable reindexing time.

Second Path: Dummy Reindex With Republish

The second option is based on the premise that you can always recreate the indexes via (re)publishing. Of course, this is viable only if you can quickly and easily republish the content that needs to be indexed. The main advantage of this approach is that you don't have to keep the processed instructions indefinitely in order to recreate the current index. Do note, however, that this approach clears the indexes and propagates the empty state to the query instances, meaning that it cannot be performed without "downtime" unless you have another cluster. Once cleared, the Indexer indexes are repopulated over time as publishing progresses, and these changes are automatically pushed to the query instances via the sync processes. So how do you perform this procedure?

  • Back up and remove all contents of the catalog01 folder.
  • Place the metadata.xml file in /home/fredhopper/fredhopper/<INSTANCE>/data/fas-xml/catalog01. The file can be found under /SmartTarget 2014/Fredhopper extensions/data in the SmartTarget installation folder.
  • Execute the reindex from command line:
./bin/reindex <INSTANCE>
  • Run the fresh-index-to-live command:
./bin/fresh-index-to-live <INSTANCE>
  • Republish all components.
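The file handling in the first two steps above could be wrapped in a small helper like the one below. It is only a sketch under stated assumptions: `prepare_dummy_catalog` and the backup directory are hypothetical names, and all three paths are passed in by the caller rather than hard-coded.

```shell
#!/bin/sh
# Sketch: back up and empty catalog01, then drop in metadata.xml.
# prepare_dummy_catalog is a hypothetical helper; all paths come from the caller.

# prepare_dummy_catalog <catalog_dir> <metadata_xml> <backup_dir>
prepare_dummy_catalog() {
    catalog_dir=$1
    metadata_xml=$2
    backup_dir=$3

    [ -d "$catalog_dir" ] || { echo "missing: $catalog_dir" >&2; return 1; }
    [ -f "$metadata_xml" ] || { echo "missing: $metadata_xml" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1

    # Moving (rather than copying) both backs up the old contents
    # and leaves catalog01 empty.
    find "$catalog_dir" -mindepth 1 -maxdepth 1 \
        -exec mv {} "$backup_dir"/ \; || return 1

    # After this, metadata.xml is the only file in catalog01.
    cp "$metadata_xml" "$catalog_dir"/metadata.xml
}

# Afterwards, run the commands from the steps above:
#   ./bin/reindex <INSTANCE>
#   ./bin/fresh-index-to-live <INSTANCE>
```

Keeping the backup outside catalog01 means the old instructions stay available if you later need to fall back to the official procedure.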

Pros: Doesn't require keeping and copying the processed instructions; reseeds Fredhopper with the latest data, which is always a good thing; the dummy reindex itself is extremely fast.

Cons: Because it pushes an empty index to the query instances, there will be "downtime" until the new indexes are received from the Indexer; can result in considerable publishing time, depending on the amount of content.

Third Path: Recreating Seed Data From Current Indexes

The third and arguably the best procedure is somewhat similar to the first one, but instead of copying all of the processed files, a "snapshot" of the current indexes is created and used as the basis for the reindexing. Because the current indexes are the basis of this procedure, it is imperative that they are in a valid state. To check whether the indexes are indeed valid, run the following command on the Indexer instance:

./bin/chk-valid-indexes <INSTANCE>

As a side note, this command checks the index of the given <INSTANCE> and all its sub-indexes against a minimum number of items, configured with the MINITEMS setting in config/fasrc. If the number of items in the catalog is less than MINITEMS (22 by default), the command reports the indexes as invalid, so beware! Why is this so? We will have to ask the creative developer.
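To see in advance whether a small catalog would trip this threshold, you could read the setting and compare it with your item count. The sketch below assumes that config/fasrc stores the value on a line of the form `MINITEMS=<number>`, which is a guess about the file format; the helper names `read_minitems` and `enough_items` are hypothetical, and the item count has to come from wherever you normally look it up.

```shell
#!/bin/sh
# Sketch only: assumes fasrc contains a line like "MINITEMS=22".
# read_minitems and enough_items are hypothetical helpers.

# read_minitems <fasrc_file>: print the configured MINITEMS,
# or the default of 22 when no such line is found.
read_minitems() {
    value=$(sed -n 's/^[[:space:]]*MINITEMS=\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${value:-22}"
}

# enough_items <item_count> <minitems>: succeed when the catalog
# holds at least the minimum number of items.
enough_items() {
    [ "$1" -ge "$2" ]
}
```

For example, `enough_items 15 "$(read_minitems config/fasrc)"` would fail with the default setting, matching the "invalid" verdict described above for a catalog of 15 items.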

With the prerequisite checked, the rest of the procedure is as follows:

  • Create a folder and name it source-xml.
  • Download the correct fas-assembly zipped file from here and move this file to the source-xml folder.
  • Make a capture of your Indexer instance with the options ( -i -c ) and move this capture to source-xml:
./bin/capture-export <INSTANCE> /tmp/fredhopper-export/indexer-export.zip -i -c
  • Extract both the fas-assembly zipped file and the capture inside the source-xml folder.
  • Edit the bin/isview script and change the memory setting to 4000m:
vi ./bin/isview
  • Change the memory setting in the following line to:
$FRED_BIN/jexec -silent -jdk -XX:-UsePopCountInstruction -jdk -Xmx4000m $DEBUG \
  • Create a new directory named generated-xml inside the source-xml folder.
  • Run the following command to create XML files from the indexes. The benefit of this step is that it generates only a few XML files containing thousands of items each, instead of loading hundreds of thousands of small XML files (as the standard reindex procedure does):
./bin/isview -c generated-xml/file-
  • After the command has finished, check the generated-xml folder; it should contain a few XML files.
  • Ensure that the fas-xml/catalog01 folder in your live Indexer environment is empty, then move all XML files from source-xml/generated-xml to the fas-xml/catalog01 folder.
  • Run the reindex command:
./bin/reindex <INSTANCE>
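The "ensure catalog01 is empty, then move the generated files" step is easy to get wrong by hand, so a guard like the following may help. It is a sketch, not something supplied by Fredhopper: `install_generated_xml` is a hypothetical helper, both directories are passed in by the caller, and it moves every regular file found directly in the generated folder on the assumption that the folder was freshly created for this run.

```shell
#!/bin/sh
# Sketch: move the isview output into an empty catalog01.
# install_generated_xml is a hypothetical helper; paths come from the caller.

# install_generated_xml <generated_dir> <catalog_dir>
install_generated_xml() {
    generated_dir=$1
    catalog_dir=$2

    [ -d "$generated_dir" ] && [ -d "$catalog_dir" ] || {
        echo "both directories must exist" >&2
        return 1
    }

    # Refuse to continue if catalog01 still holds anything.
    if [ -n "$(ls -A "$catalog_dir")" ]; then
        echo "$catalog_dir is not empty" >&2
        return 1
    fi

    # The export should have produced at least one file.
    count=$(find "$generated_dir" -maxdepth 1 -type f | wc -l)
    [ "$count" -gt 0 ] || { echo "nothing to move in $generated_dir" >&2; return 1; }

    find "$generated_dir" -maxdepth 1 -type f -exec mv {} "$catalog_dir"/ \;
}
```

Run it before the reindex command above, for instance as `install_generated_xml source-xml/generated-xml /path/to/fas-xml/catalog01`, adjusting the second path to your instance.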

And voila! Your indexes should be generated.

Pros: Doesn't require keeping and copying the processed instructions; requires no republish time; preserves the consistency of the data in Fredhopper.

Cons: This option is possible only if the indexes are valid; hard to automate due to its complexity.

So, there you go. All three of these procedures produce the same result, and you can pick one based on your customer's needs and preferences (high indexer response time, high/low publishing activity, etc.). With all that said, what is your preferred way of reindexing data?

If you have any questions, feel free to contact us.

Topics:
sdl, cms, big data, reindexing

Published at DZone with permission of
