Over a million developers have joined DZone.

PHP Hidden Gem: similar_text()

· Web Dev Zone

Start coding today to experience the powerful engine that drives data application’s development, brought to you in partnership with Qlik.

Every once in a while you need to solve this unique problem. This problem that you usually only have to solve once or twice. So you start developing, but quickly wonder what the best solution is. You can write your own solution, but it pays off to at least search through PHP.net a bit to see if there is not already a solution for it in PHP.

Today, I was having such a situation. I am working on a hobby project where I aggregate feeds from several different sources. With the blogs I work it right now, it often happens that an author posts the same post to a few different sites. However, because of site formats and sometimes also quick edits an author makes on one site but not on the author, the article contents are usually not identical strings. So I needed something that would help me figure out whether or not two strings are nearly identical.

First I was Googling around a bit, trying to figure out what was a sane approach for this. I was thinking of simply piping it into diff and parsing the response, but that seemed highly inefficient. I also came by xdiff, a PECL extension which allows for some very cool functionality, but in the end I didn't really want to diff, I just wanted to figure out if strings were similar.

The fact that Google brought me to PHP.net though made me consider PHP might actually have a solution to my problem already. Then I remembered using soundex() before, which is sort of what I need, however I need it for more than just a single word. I decided to check the soundex() page anyway, to see if there were similar functions listed that would perhaps help me out. And indeed there were: levensthein and similar_text both seemed to do something approaching what I needed.

I first checked levensthein, however it allows checking strings of maximum 255 characters, too short when comparing full blog post bodies. So I went with similar_text, and indeed, that worked fine. My current isDuplicate() method is now:

static public function isDuplicate($item1, $item2)
  similar_text($item1->get('text' ), $item2->get('text' ), $perc);
  if ($perc > 75)
    return true;

I am still trying to figure out which percentage will catch the duplicates but not catch too many posts which are only similar but not actually duplicates, but with the above 75% I seem to catch quite a few duplicates so far.

So, next time you are in a similar situation, remember, there is an awesome function out there called similar_text(). And perhaps more importantly: Next time you need to implement some very specific functionality that you usually don't need, first try PHP.net (and Google) before writing your own solution.

Create data driven applications in Qlik’s free and easy to use coding environment, brought to you in partnership with Qlik.


The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}