Detect Stolen and Duplicate Tweets with Solr

A new feature, “duplicate detection”, has been implemented in the open source web app jetwick, and it seems to work pretty well thanks to the great performance of Solr.

To try it, go to the tweet about this blog post and click on the ‘Find Similar’ button below the tweet to investigate existing duplicates. With that feature it is possible to skip spam, to identify different accounts of the same user, and to skip tweets with a wrong retweet or attribution, but also to see stolen tweets, i.e. tweets posted without attribution or without the user knowing the original tweet. (Or all tweeters may have had a different common source, e.g. a newspaper. Thanks to pannous for pointing this out.)

Examples of ‘stolen’ or duplicated tweets:

Janell albert @janellalbert74

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/eeKJWw

about 7 hours ago via twitterfeed

ervin myers @jaemas

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/gGfT8R

about 7 hours ago via twitterfeed

So this looks like one user operating two Twitter accounts: both tweets were sent via the same Twitter client (twitterfeed) and were posted at the same time.
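The reasoning above (same client, same posting time) can be written down as a small check. This is only a sketch of that heuristic, not jetwick's implementation: the class `SameAuthorHeuristic`, the method `likelySameAuthor`, and the `tolerance` parameter are hypothetical names introduced here, and the check assumes the two tweets have already been identified as duplicates of each other.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper illustrating the heuristic; not part of jetwick or Solr. */
final class SameAuthorHeuristic {

    /**
     * Returns true if two duplicate tweets were sent via the same client
     * and their posting times differ by at most the given tolerance.
     * This is a hint that one person may run both accounts, not proof.
     */
    static boolean likelySameAuthor(String clientA, Instant postedA,
                                    String clientB, Instant postedB,
                                    Duration tolerance) {
        boolean sameClient = clientA.equalsIgnoreCase(clientB);
        // Absolute time gap, independent of which tweet came first.
        Duration gap = Duration.between(postedA, postedB).abs();
        return sameClient && gap.compareTo(tolerance) <= 0;
    }
}
```

A small tolerance (for example one minute) keeps the check strict. Automated feed clients post on fixed schedules, so a match is weaker evidence when the client is a popular feed service.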

The following German example looks more like ‘stolen’ tweets:

Klaus Redegeld @muc4u

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

Newsteam Berlin @Newsteam_Berlin

Lufthansa sagt, fahrt Bahn, Bahn sagt fahrt Auto, ADAC sagt,, fahrt morgen und morgen sagen alle, wären wir mal gestern gefahren

And many more: ste_pos, Kleines79, …

The oldest tweet, and therefore the original, is:

schlenzalot @schlenzalot

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

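(In English, roughly: Lufthansa says "Take the train". The railway says "Take the car". ADAC says "Drive tomorrow". Tomorrow everyone will say: "You should have left yesterday".)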

As you can see, successful detection does not require the tweets to be exactly the same string.
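To illustrate why an exact string match is not needed, here is a minimal sketch of one common way to compare texts: normalize both tweets into sets of terms and measure the overlap (Jaccard similarity). This is an assumption for illustration only. The post does not say how jetwick or Solr compute similarity, and the class `TweetSimilarity` with its methods is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not the method used by jetwick or Solr. */
final class TweetSimilarity {

    /** Lowercases the tweet, replaces punctuation with spaces, and returns the set of terms. */
    static Set<String> terms(String tweet) {
        String cleaned = tweet.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]", " ") // keep letters, digits, whitespace
                .trim();
        Set<String> result = new HashSet<>();
        if (!cleaned.isEmpty()) {
            result.addAll(Arrays.asList(cleaned.split("\\s+")));
        }
        return result;
    }

    /** Jaccard similarity: shared terms divided by all distinct terms, in the range 0..1. */
    static double jaccard(String a, String b) {
        Set<String> termsA = terms(a);
        Set<String> termsB = terms(b);
        if (termsA.isEmpty() && termsB.isEmpty()) {
            return 1.0; // two empty texts are treated as identical
        }
        Set<String> intersection = new HashSet<>(termsA);
        intersection.retainAll(termsB);
        Set<String> union = new HashSet<>(termsA);
        union.addAll(termsB);
        return (double) intersection.size() / union.size();
    }

    /** Treats two tweets as near-duplicates when their similarity reaches the threshold. */
    static boolean isNearDuplicate(String a, String b, double threshold) {
        return jaccard(a, b) >= threshold;
    }
}
```

Applied to the two German tweets above, the normalized term sets share 13 of 16 distinct terms, so `jaccard` returns 13/16 (about 0.81) even though quoting, punctuation, and two words differ. A threshold is a tuning choice. Comparing every pair of tweets this way does not scale, which is where a search index such as Solr helps.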

Detecting duplicated tweets could be interesting for everyone who wants to give the attribution to the ‘correct’ person, because it is often not the original tweet but the ‘stolen’ one that is more popular (has more retweets), especially when it comes from an account with many followers.
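Once a group of duplicates is known, finding the presumed original is just a matter of picking the oldest tweet, and it can be contrasted with the most retweeted one. The sketch below assumes the duplicates have already been grouped. The `Tweet` class and its fields are hypothetical and do not reflect jetwick's data model.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of jetwick. */
final class OriginalFinder {

    /** Minimal tweet model with only the fields this sketch needs. */
    static final class Tweet {
        final String user;
        final Instant createdAt;
        final int retweets;

        Tweet(String user, Instant createdAt, int retweets) {
            this.user = user;
            this.createdAt = createdAt;
            this.retweets = retweets;
        }
    }

    /** The oldest tweet in a group of duplicates is presumed to be the original. */
    static Optional<Tweet> presumedOriginal(List<Tweet> duplicates) {
        return duplicates.stream()
                .min(Comparator.comparing((Tweet t) -> t.createdAt));
    }

    /** The most popular copy, which is often not the original. */
    static Optional<Tweet> mostRetweeted(List<Tweet> duplicates) {
        return duplicates.stream()
                .max(Comparator.comparingInt((Tweet t) -> t.retweets));
    }
}
```

When the two results name different users, the popular copy may deserve a closer look. The oldest tweet in the index is only the oldest one that was indexed, so an earlier source outside the index remains possible.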

But it is also useful for “tweet readers” like jetwick, to cut down on Twitter noise and to avoid reading the same content twice.

Update: This seems to be the first tweet about Santa and WikiLeaks:

Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks

suliz has only 67 followers. Now compare that with the same tweet from ihackinjosh, who has over 6000 followers: that tweet has over 600 retweets, although suliz tweeted nearly one day earlier.

From http://karussell.wordpress.com/2010/12/23/detect-stolen-and-duplicate-tweets-with-solr/
