
Detect Stolen and Duplicate Tweets with Solr

A new “duplicate detection” feature has been implemented for the open source web app jetwick, and it seems to work pretty well thanks to the great performance of Solr.

To try it, go to the tweet about this blog post and click the ‘Find Similar’ button below the tweet to investigate existing duplicates. With this feature it is possible to skip spam, identify different accounts of the same user, and skip tweets with a wrong retweet or attribution, but also to spot stolen tweets, i.e. when users tweet without attribution or without knowing the original tweet. (Or when all tweeters had a common but different source, e.g. a newspaper. Thanks to pannous for pointing this out.)

Examples of ‘stolen’ or duplicated tweets:

Janell albert (@janellalbert74)

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/eeKJWw

about 7 hours ago via twitterfeed

ervin myers (@jaemas)

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/gGfT8R

about 7 hours ago via twitterfeed

So this is an example of one user operating two Twitter accounts: both tweets were sent with the same Twitter client and were posted at the same time.

The following German example looks more like ‘stolen’ tweets:

Klaus Redegeld (@muc4u)

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

Newsteam Berlin (@Newsteam_Berlin)

Lufthansa sagt, fahrt Bahn, Bahn sagt fahrt Auto, ADAC sagt,, fahrt morgen und morgen sagen alle, wären wir mal gestern gefahren

And there are a lot more: ste_pos, Kleines79, …

The oldest tweet, and therefore the original, is:

schlenzalot (@schlenzalot)

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

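(In English: Lufthansa says "Take the train". The railway says "Take the car". ADAC says "Drive tomorrow". Tomorrow they will all say: "You should have travelled yesterday".)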

As you can see, the tweets do not need to be exactly the same string to be detected as duplicates.
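To illustrate why near-duplicates like these are still detectable, here is a minimal sketch that compares two tweets by the overlap of their word sets (Jaccard similarity). This is only an assumption chosen for illustration, not a description of how jetwick or Solr actually computes similarity; the function names `tokens` and `jaccard` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens (punctuation is ignored)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Return the Jaccard similarity (0.0 to 1.0) of the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


# The original tweet and a reworded copy, both taken from the examples above.
original = ('Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt '
            '"Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"')
variant = ('Lufthansa sagt, fahrt Bahn, Bahn sagt fahrt Auto, ADAC sagt,, fahrt '
           'morgen und morgen sagen alle, wären wir mal gestern gefahren')
unrelated = "World Cup hero Donovan files for divorce from actress wife"

print(jaccard(original, variant))    # high: the wording differs only slightly
print(jaccard(original, unrelated))  # no shared words at all
```

Because punctuation and case are dropped before comparing, the reworded copy still shares almost all of its words with the original, while an unrelated tweet shares none. A search engine would use an index instead of pairwise comparison, but the idea of scoring by shared terms is similar.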

Detecting duplicated tweets could be interesting for anyone who wants to give attribution to the ‘correct’ person, because often it is not the original tweet but the ‘stolen’ one that is more popular (has more retweets), especially when it comes from an account with many followers.

But it is also useful for “tweet readers” like jetwick, to avoid Twitter noise and reading the same content twice.

Update: This seems to be the first tweet about Santa and WikiLeaks:

Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks

suliz has only 67 followers. Now take the same tweet from ihackinjosh, who has over 6000 followers: it has over 600 retweets, although suliz tweeted it nearly one day earlier.
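Once a group of duplicates has been found, picking the likely original amounts to taking the oldest tweet rather than the most retweeted one. The sketch below shows that step; the record layout, the account names, the timestamps and the helper `likely_original` are all invented for illustration and are not jetwick's data model.

```python
from datetime import datetime


def likely_original(duplicates):
    """Return the earliest tweet in a group of duplicates."""
    return min(duplicates, key=lambda tweet: tweet["created_at"])


# Hypothetical duplicate group: all values here are made up.
group = [
    {"user": "popular_account", "retweets": 600,
     "created_at": datetime(2010, 12, 21, 9, 0)},
    {"user": "small_account", "retweets": 3,
     "created_at": datetime(2010, 12, 20, 10, 0)},
]

original = likely_original(group)
most_retweeted = max(group, key=lambda tweet: tweet["retweets"])

print(original["user"])        # the account that tweeted first
print(most_retweeted["user"])  # the account that got the most retweets
```

The oldest tweet is only a candidate for the original: as noted earlier, all accounts may have copied the text from a common outside source.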

From http://karussell.wordpress.com/2010/12/23/detect-stolen-and-duplicate-tweets-with-solr/
