
Detect Stolen and Duplicate Tweets with Solr

by Peter Karussell · Dec. 24, 10 · Java Zone · Interview

A new “duplicate detection” feature has been implemented for the open source web app jetwick, and it seems to work pretty well thanks to the great performance of Solr.

To try it, go to the tweet about this blog post and click the ‘Find Similar’ button below the tweet to investigate existing duplicates. With this feature it is possible to skip spam, identify different accounts of the same user, and skip tweets with a wrong retweet or attribution, but also to spot stolen tweets, i.e. tweets posted without attribution or without knowledge of the original tweet. (Or tweets that all go back to a common external source, e.g. a newspaper. Thanks to pannous for pointing this out.)

Examples of ‘stolen’ or duplicated tweets:

Janell albert (@janellalbert74)

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/eeKJWw

about 7 hours ago via twitterfeed

ervin myers (@jaemas)

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/gGfT8R

about 7 hours ago via twitterfeed

So this is an example of one user running two Twitter accounts: both tweets were sent with the same Twitter client and posted at the same time.
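The reasoning above (same client, same posting time, different accounts) can be written down as a small check. This is only a sketch under assumptions: the `Tweet` class, the `likelySameOperator` method and the timestamps are hypothetical and not taken from jetwick, and the check is meant for tweets that were already found to be duplicates of each other.

```java
import java.time.Duration;
import java.time.Instant;

final class SameOperatorSketch {

    // Hypothetical minimal tweet model for this sketch, not jetwick's class.
    static final class Tweet {
        final String account;
        final String client;     // e.g. "twitterfeed"
        final Instant createdAt;

        Tweet(String account, String client, Instant createdAt) {
            this.account = account;
            this.client = client;
            this.createdAt = createdAt;
        }
    }

    // True when two different accounts posted via the same client
    // and their posting times are at most 'tolerance' apart.
    static boolean likelySameOperator(Tweet a, Tweet b, Duration tolerance) {
        boolean differentAccounts = !a.account.equals(b.account);
        boolean sameClient = a.client.equalsIgnoreCase(b.client);
        Duration gap = Duration.between(a.createdAt, b.createdAt).abs();
        return differentAccounts && sameClient && gap.compareTo(tolerance) <= 0;
    }

    public static void main(String[] args) {
        // Timestamps are made up for illustration; the article only says "about 7 hours ago".
        Tweet first = new Tweet("janellalbert74", "twitterfeed", Instant.parse("2010-12-23T10:00:00Z"));
        Tweet second = new Tweet("jaemas", "twitterfeed", Instant.parse("2010-12-23T10:00:20Z"));
        System.out.println(likelySameOperator(first, second, Duration.ofMinutes(1)));
    }
}
```

A match is a hint rather than proof: two unrelated accounts can republish the same news feed through the same service at about the same time.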

The following German example looks more like ‘stolen’ tweets:

Klaus Redegeld (@muc4u)

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

Newsteam Berlin (@Newsteam_Berlin)

Lufthansa sagt, fahrt Bahn, Bahn sagt fahrt Auto, ADAC sagt,, fahrt morgen und morgen sagen alle, wären wir mal gestern gefahren

and a lot more: ste_pos, Kleines79, …

The oldest tweet, and therefore the original, is:

schlenzalot (@schlenzalot)

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

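(Translation: Lufthansa says "Take the train". The railway says "Take the car". ADAC says "Drive tomorrow". Tomorrow everyone will say: "You should have left yesterday".)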

As you can see, the tweets do not need to be exactly the same string to be detected as duplicates.
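To illustrate how tweets with different punctuation or wording can still be matched, here is a minimal sketch that compares the sets of words in two tweets (Jaccard similarity). This is an assumption for illustration only: the article does not describe how jetwick queries Solr, and the class and method names below are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class NearDuplicateSketch {

    // Lowercase the text, drop URLs, and split on anything that is not a letter or digit.
    static Set<String> terms(String tweet) {
        String cleaned = tweet.toLowerCase(Locale.ROOT).replaceAll("https?://\\S+", " ");
        Set<String> result = new HashSet<>();
        for (String token : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity of the two term sets: shared terms divided by all distinct terms.
    static double similarity(String a, String b) {
        Set<String> termsA = terms(a);
        Set<String> termsB = terms(b);
        if (termsA.isEmpty() && termsB.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(termsA);
        shared.retainAll(termsB);
        Set<String> all = new HashSet<>(termsA);
        all.addAll(termsB);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        // The two German tweets from the article ("\u00e4" is the letter ä).
        String original = "Lufthansa sagt \"Fahrt Bahn\". Bahn sagt \"Fahrt Auto\". ADAC sagt "
                + "\"Fahrt morgen\". Morgen sagen alle:\"W\u00e4ren Sie mal gestern gefahren\"";
        String copy = "Lufthansa sagt, fahrt Bahn, Bahn sagt fahrt Auto, ADAC sagt,, fahrt morgen "
                + "und morgen sagen alle, w\u00e4ren wir mal gestern gefahren";
        String unrelated = "Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks";

        System.out.println("original vs copy:      " + similarity(original, copy));
        System.out.println("original vs unrelated: " + similarity(original, unrelated));
    }
}
```

A pair would be flagged as duplicates when the score exceeds some threshold; the right value is a tuning choice and is not given in the article. Because URLs are dropped, the two World Cup tweets above, which differ only in their bit.ly link, come out as identical.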

Detecting duplicated tweets could be interesting for anyone who wants to give attribution to the ‘correct’ person, because it is often the ‘stolen’ tweet rather than the original that becomes more popular (gets more retweets), especially when it comes from an account with many followers.

But it is also useful for “tweet readers” like jetwick, to reduce Twitter noise and avoid reading the same content twice.

Update: This seems to be the first tweet about Santa and WikiLeaks:

Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks

suliz has only 67 followers. Now take the same tweet from ihackinjosh, who has over 6000 followers: it has over 600 retweets, although suliz tweeted it nearly a day earlier.
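The gap between "first" and "most visible" can be made explicit once a group of duplicates is known. The following sketch assumes a hypothetical `Tweet` class and uses made-up accounts, timestamps and retweet counts; it only shows that choosing by creation time and choosing by retweets can pick different tweets.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class OriginalTweetSketch {

    // Hypothetical minimal tweet model for this sketch, not jetwick's class.
    static final class Tweet {
        final String account;
        final Instant createdAt;
        final int retweets;

        Tweet(String account, Instant createdAt, int retweets) {
            this.account = account;
            this.createdAt = createdAt;
            this.retweets = retweets;
        }
    }

    // Earliest tweet in a group of duplicates: the presumed original.
    static Optional<Tweet> presumedOriginal(List<Tweet> duplicates) {
        return duplicates.stream().min(Comparator.comparing((Tweet t) -> t.createdAt));
    }

    // Most retweeted tweet in the group: the one readers are most likely to see.
    static Optional<Tweet> mostPopular(List<Tweet> duplicates) {
        return duplicates.stream().max(Comparator.comparingInt((Tweet t) -> t.retweets));
    }

    public static void main(String[] args) {
        // Made-up accounts, timestamps and retweet counts, for illustration only.
        List<Tweet> group = Arrays.asList(
                new Tweet("early_small_account", Instant.parse("2010-12-22T09:00:00Z"), 3),
                new Tweet("late_big_account", Instant.parse("2010-12-23T08:00:00Z"), 600));

        System.out.println("presumed original: " + presumedOriginal(group).get().account);
        System.out.println("most popular:      " + mostPopular(group).get().account);
    }
}
```

The earliest tweet found is only the oldest one in the index, so a tweet presumed original may itself copy something that was never indexed or that came from a source outside Twitter.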

 

From http://karussell.wordpress.com/2010/12/23/detect-stolen-and-duplicate-tweets-with-solr/
