FastVectorHighlighter Issues Revisited

by Itamar Syn-hershko · Jun. 28, 12 · Java Zone · Interview

In a previous post I described how to use FastVectorHighlighter (FVH) to highlight content that went through filters or readers such as HTMLStripCharFilter during analysis. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew that any CharFilter or Tokenizer implementation is supposed to store term positions and offsets that account for whatever was skipped in the content, but since that didn't work for me I didn't bother to look any deeper; I just wrote that workaround and rushed off to blog about it.

So, don't use that. Instead, rely on your analyzer to store the correct positions and offsets, and on FVH to use them correctly when highlighting. As it turned out, the custom analyzers I used suffered from a nasty bug that kept them from accounting for skipped content. Now that I have fixed it, it all works like a charm.

However, two issues remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases a fragment that ends up on your web page will ruin the page layout because of a stubborn, misplaced </div> tag that found its way into it. Escaping every < and > is not much of a solution either: you don't want your fragments to show ugly-looking HTML tags.

The second issue was duplicate content. I wanted to process the content more than once, indexing it with two or more analyzers, but didn't want to store it more than once since it was exactly the same content. To still be able to highlight on those other fields, I needed FVH to let me specify the name of the field to pull the stored content from.

Solving the first problem was quite easy and required nothing more than a simple extension method, which is called on the fragment string after receiving it from FVH. To be on the safe side, I made sure to ask FVH for a larger fragment than I actually needed, so that even if a lot of HTML noise is present, some context will remain in the fragment:

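using System.Text;

// Extension methods must live in a static class; the class name here is arbitrary.
public static class FragmentExtensions
{
    public static string HtmlStripFragment(this string fragment)
    {
        if (string.IsNullOrEmpty(fragment)) return string.Empty;

        var sb = new StringBuilder(fragment.Length);
        // withinHtml: we are currently between a '<' and its '>'
        // first: no '<' or '>' has been seen yet
        bool withinHtml = false, first = true;
        foreach (var c in fragment)
        {
            if (c == '>')
            {
                // A '>' before any '<' means the fragment started inside a tag,
                // so everything collected so far is tag debris
                if (first) sb.Length = 0;
                withinHtml = false;
                first = false;
                continue;
            }
            if (withinHtml)
                continue; // skip tag contents
            if (c == '<')
            {
                first = false;
                withinHtml = true;
                continue;
            }
            sb.Append(c);
        }

        // FVH was instantiated with "[b]" and "[/b]" as pre- and post-tags for highlighting,
        // so they won't get lost in translation
        return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
    }
}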
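For comparison, the three cases the loop above handles (a fragment that starts mid-tag, complete tags, and a fragment that ends mid-tag) can also be written as three regular expressions. This is only a sketch and not the code I actually use: the class name, the method name and the sample fragment are made up for illustration, and unlike the loop it leaves a stray '>' in the middle of the text alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FragmentDemo
{
    // Hypothetical regex-based variant of the stripping loop.
    public static string StripWithRegex(string fragment)
    {
        if (string.IsNullOrEmpty(fragment)) return string.Empty;

        // 1. The fragment started inside a tag: drop everything up to its closing '>'
        var s = Regex.Replace(fragment, @"^[^<>]*>", string.Empty);
        // 2. Drop complete tags
        s = Regex.Replace(s, @"<[^>]*>", string.Empty);
        // 3. The fragment ended inside a tag: drop the unterminated remainder
        s = Regex.Replace(s, @"<[^>]*$", string.Empty);

        return s + "...";
    }

    public static void Main()
    {
        // Made-up fragment, cut mid-tag on both ends, with "[b]"/"[/b]" highlight markers
        var raw = "ss=\"note\">Lucene stores [b]offsets[/b]</div><p> for each te<sp";
        var clean = StripWithRegex(raw).Replace("[b]", "<b>").Replace("[/b]", "</b>");
        Console.WriteLine(clean);
        // prints: Lucene stores <b>offsets</b> for each te...
    }
}
```

Either way, the "[b]" and "[/b]" markers survive the stripping because they contain no angle brackets, and are only turned into real tags at the very end.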

The second issue was solved by subclassing BaseFragmentsBuilder, the base FragmentsBuilder implementation, only this time it was a bit less intrusive:

public class CustomFragmentsBuilder : BaseFragmentsBuilder
{
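    /// <summary>
    /// Name of the stored field to read the text from; when null, the highlighted field itself is used.
    /// </summary>
    public string ContentFieldName { get; protected set; }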
 
    /// <summary>
    /// Creates a builder with the default pre/post tags that reads the stored text from the highlighted field.
    /// </summary>
    public CustomFragmentsBuilder()
    {
    }
 
    public CustomFragmentsBuilder(string contentFieldName)
    {
        ContentFieldName = contentFieldName;
    }
 
    /// <summary>
    /// Creates a builder with custom pre/post tags that reads the stored text from the highlighted field.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public CustomFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }
 
    public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
        ContentFieldName = contentFieldName;
    }
 
    /// <summary>
    /// Does nothing; returns the source list unchanged.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }
 
    protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
    {
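        // Read from the dedicated content field if one was given, otherwise from the highlighted field
        var field = ContentFieldName ?? fieldName;
        // Load only that one stored field rather than the whole document
        var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
        return doc.GetFields(field); // according to Document class javadoc, this never returns null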
    }
}
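Wiring this into FVH could look roughly like the sketch below. Treat it as an illustration under assumptions rather than production code: the field names "Content" (stored) and "ContentStemmed" (indexed with term vectors including positions and offsets, not stored) are hypothetical, the fragment size of 400 is arbitrary, and the namespaces are the ones I would expect from the Lucene.Net contrib FastVectorHighlighter package of that era, so check them against the version you reference.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Vectorhighlight;

public static class HighlightingSketch
{
    // Assumes the CustomFragmentsBuilder class above is in scope.
    public static string BestFragment(IndexReader reader, Query query, int docId)
    {
        // Pull the text from the stored "Content" field (hypothetical name),
        // and mark hits with "[b]"/"[/b]" so they survive HTML stripping later.
        var fragmentsBuilder = new CustomFragmentsBuilder(
            "Content", new[] { "[b]" }, new[] { "[/b]" });

        var fvh = new FastVectorHighlighter(
            true,  // phraseHighlight
            true,  // fieldMatch
            new SimpleFragListBuilder(),
            fragmentsBuilder);

        FieldQuery fieldQuery = fvh.GetFieldQuery(query);

        // Offsets come from the term vectors of "ContentStemmed" (hypothetical name);
        // the text they are applied to comes from the stored "Content" field.
        return fvh.GetBestFragment(fieldQuery, reader, docId, "ContentStemmed", 400);
    }
}
```

This only produces sensible fragments when both fields were fed exactly the same text at indexing time, since the offsets recorded for one field are applied to the stored value of the other.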

And as always, the usual disclaimer applies: this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve it, if any exist.

Published at DZone with permission of Itamar Syn-hershko, DZone MVB. See the original article here.