Detect Stolen and Duplicate Tweets with Solr
A new “duplicate detection” feature has been implemented for the open source web app jetwick, and it seems to work pretty well thanks to the great performance of Solr.
To try it, go to the tweet about this blog post and click the ‘Find Similar’ button below the tweet to investigate existing duplicates. With this feature it is possible to skip spam, identify different accounts of the same user, and skip tweets with a wrong retweet or attribution, but also to spot stolen tweets, i.e. cases where users tweet without attribution or without knowing the original tweet. (Or all the tweeters may have had a common, different source, e.g. a newspaper. Thanks to pannous for pointing this out.)
Examples of ‘stolen’ or duplicated tweets:
So this is an example of one user using two Twitter accounts: both tweets were sent from the same Twitter client and were posted at identical times.
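The reasoning above (same client, identical posting time, different accounts) can be written down as a small check. This is only a sketch of the heuristic, not jetwick's code: the `Tweet` class, the `likely_same_person` function and the one-minute tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Tweet:
    # Hypothetical minimal tweet record, not jetwick's data model.
    user: str
    client: str           # the Twitter client the tweet was sent from
    created_at: datetime
    text: str


def likely_same_person(a: Tweet, b: Tweet,
                       max_gap: timedelta = timedelta(minutes=1)) -> bool:
    """Heuristic: different accounts, same client, posted almost simultaneously."""
    return (
        a.user != b.user
        and a.client == b.client
        and abs(a.created_at - b.created_at) <= max_gap
    )
```

A match is only a hint, of course: two unrelated users can share a popular client and happen to tweet within the same minute, so the check is best applied to tweets that were already found to be duplicates.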
The following German example looks more like ‘stolen’ tweets:
The oldest tweet, and therefore the original, is:
As you can see, successful detection does not require the tweets to have exactly the same string.
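To illustrate how tweets can match without being identical strings, here is a minimal sketch that compares the sets of words in two tweets (Jaccard similarity) and sorts the matches so that the oldest one comes first. This post does not describe how jetwick and Solr actually compute similarity, so treat this as a stand-in technique; the function names, the normalization rules and the 0.7 threshold are all assumptions.

```python
import re


def terms(text):
    """Lower-case the text, drop URLs, @mentions and 'rt', return the word set."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links are often re-shortened
    text = re.sub(r"@\w+", " ", text)          # mentions differ between copies
    words = re.findall(r"\w+", text)
    return {w for w in words if w != "rt"}


def jaccard(a, b):
    """Share of words two tweets have in common, from 0.0 to 1.0."""
    ta, tb = terms(a), terms(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_tweets(text, tweets, threshold=0.7):
    """Return the (created_at, user, text) tuples resembling `text`, oldest first.

    The threshold is an arbitrary value chosen for this sketch.
    """
    hits = [t for t in tweets if jaccard(text, t[2]) >= threshold]
    return sorted(hits, key=lambda t: t[0])
```

The first element of the returned list is the oldest matching tweet and therefore the presumed original. Comparing every tweet against every other one does not scale to a large index, which is where a search engine like Solr comes in.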
Detecting duplicated tweets could be interesting for anyone who wants to give the ‘correct’ person the attribution, because it is often not the original tweet but the ‘stolen’ one that is more popular (has more retweets), especially when the copy comes from an account with many followers.
But it is also useful for “tweet readers” like jetwick to cut down on Twitter noise and avoid reading the same content twice.
Update: This seems to be the first tweet about Santa and Wikileaks:
Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks
suliz has only 67 followers. Now take the same tweet from ihackinjosh, who has over 6000 followers: it has over 600 retweets, although suliz tweeted it nearly one day earlier.