Google and Microdata / Stealing Your Content

by Niels Matthijs · Jan. 04, 12 · Interview

Not too long ago I wrote about the real-life use of HTML5 microdata and how it takes us one step closer to the ideal of a semantic web. While I'm still pretty excited to see the web expand in this direction, there is at least one serious bump in the road worth mentioning. Bottom line: the easier it is for crawlers and other pieces of software to read our data, the easier it becomes for them to steal our data for their own gain. And currently we have no way to protect ourselves.

Like thieves in the night

This is not a new problem, of course. There is plenty of software out there today that crawls specific sites and pages in order to harvest data. As long as websites do not provide an API to access their data, crawling is the only feasible way to accomplish certain tasks. For example, a site like icheckmovies.com provides a service where users can import their IMDb votes, but IMDb does not offer other sites a way to access this particular data. So iCheckMovies asks you for the URL of the page containing your votes and crawls that page looking for the data it needs. As long as the HTML source does not change, this is a pretty reliable way to extract data online. When IMDb does change the source HTML (like they did a couple of weeks ago), the service breaks and has to be adapted to match the new HTML structure.
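To make that fragility concrete, here is a minimal sketch of a structure-dependent scraper. The markup, the class names, and the `scrape_votes` helper are all invented for illustration; this is not how IMDb marks up its pages or how iCheckMovies actually works. It only shows why a scraper tied to one HTML layout stops finding data as soon as the layout changes.

```python
from html.parser import HTMLParser


class VoteScraper(HTMLParser):
    """Collects the number inside every <td class="vote"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.votes = []
        self._in_vote_cell = False

    def handle_starttag(self, tag, attrs):
        # The scraper is hard-wired to one specific tag and class name.
        if tag == "td" and dict(attrs).get("class") == "vote":
            self._in_vote_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_vote_cell = False

    def handle_data(self, data):
        if self._in_vote_cell and data.strip():
            self.votes.append(int(data.strip()))


def scrape_votes(html):
    """Return all votes found in the given HTML string."""
    scraper = VoteScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.votes


# Two invented layouts that carry exactly the same information.
OLD_LAYOUT = '<table><tr><td class="title">Example Film</td><td class="vote">9</td></tr></table>'
NEW_LAYOUT = '<ul><li><span class="title">Example Film</span> <span class="rating">9</span></li></ul>'

old_votes = scrape_votes(OLD_LAYOUT)  # the vote is found
new_votes = scrape_votes(NEW_LAYOUT)  # same data, new markup: nothing is found
```

The data never changed between the two layouts, only the tags around it, yet the scraper has to be rewritten to keep working.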

I'm not sure about the specifics, but legally speaking this is somewhat of a gray area. When data is public, it can be used by others. On the other hand, you can't just copy a whole database of information from another site. That's why big sites like IMDb (or any other database-fueled data site) introduce known errors into their data (Google Maps has a couple of nonexistent towns, for example). If these errors turn up on other sites, the owners know they've been robbed of their hard work.
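The check itself is simple. A rough sketch, assuming the site owner keeps a private list of the entries it planted; the function name and all sample titles below are made up for illustration:

```python
def find_planted_entries(suspect_records, planted_entries):
    """Return the planted (deliberately fake) entries that show up in another site's data."""
    return sorted(set(suspect_records) & set(planted_entries))


# Hypothetical fake entries that exist nowhere except in the owner's own database.
PLANTED = {"The Film That Never Was (1987)", "Nowhere Town Story (1993)"}

# Hypothetical data harvested from another site.
other_site = [
    "Some Real Film (1979)",
    "Nowhere Town Story (1993)",
    "Another Real Film (1982)",
]

copied = find_planted_entries(other_site, PLANTED)
```

Any non-empty result is a strong hint that the other site copied the database, since the planted entries have no independent source.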



The new Google

Search engines like Google Search also crawl your site for data. This is not really a problem because, if all goes well, they will direct people to your site based on the search criteria those people entered. They use your data simply to produce a search result snippet so users can make some kind of initial decision before they click through to your site. Google generates traffic for our websites, so nobody minds.

But what if Google were to use the data on your site for things besides generating links to your site? According to an article published on HBR, Google is aiming to produce immediate answers to direct questions, effectively bypassing the sites where it got its information. It's nothing more than an extension of what they are already doing with exchange rate calculations and simple math problems, but because Google has access to an almost unlimited amount of data, it can actually start aggregating and analyzing that data to predict the answer to more complex questions. In the end, it's not even stealing your data, but simply using it to predict the correct answer.

Google and microdata

Semantics (more specifically microdata) are crucial in this process. They allow machines to understand data that would otherwise be captured in language-dependent full sentences. Google isn't guessing anymore, it knows. And because it knows, it will answer you directly rather than point to a source that might hold the answer to your question. For users of Google this is superb, as it saves a few clicks and they still get the information they were looking for. Other services too will have a much easier time figuring out your data. A site author can change the HTML all he wants: as long as the microdata implementation remains the same (which in theory it should), services that crawl your pages don't need to be rewritten every time you change something in the source.
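A small sketch shows why microdata makes harvesting so much more robust. The reader below ignores tag names and layout entirely and only looks at `itemprop` attributes. It is deliberately simplified and is not a full implementation of the microdata extraction rules: it assumes a single flat item, takes the value from a `content` attribute when one is present, and otherwise assumes the property element contains only text. Both page layouts and the film data are invented for illustration.

```python
from html.parser import HTMLParser


class MicrodataReader(HTMLParser):
    """Simplified reader: maps each itemprop name to its value for one flat item."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # name of the property whose text is being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if attrs.get("content") is not None:
            # e.g. <meta itemprop="..." content="...">
            self.properties[name] = attrs["content"]
        else:
            self._current = name
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplifying assumption: property elements hold text only, so the
        # first end tag after a property starts is that property's own.
        if self._current is not None:
            self.properties[self._current] = "".join(self._buffer).strip()
            self._current = None


def read_microdata(html):
    """Return a dict of itemprop names to values found in the HTML string."""
    reader = MicrodataReader()
    reader.feed(html)
    reader.close()
    return reader.properties


# Two completely different (invented) layouts with the same microdata.
LAYOUT_A = (
    '<div itemscope itemtype="https://schema.org/Movie">'
    '<h2 itemprop="name">Example Film</h2>'
    '<p>Genre: <span itemprop="genre">Drama</span></p>'
    '<meta itemprop="datePublished" content="2011-01-01">'
    "</div>"
)
LAYOUT_B = (
    '<table itemscope itemtype="https://schema.org/Movie">'
    '<tr><td itemprop="genre">Drama</td><td itemprop="name">Example Film</td></tr>'
    '<tr><td><meta itemprop="datePublished" content="2011-01-01"></td></tr>'
    "</table>"
)

item_a = read_microdata(LAYOUT_A)
item_b = read_microdata(LAYOUT_B)
```

Both layouts yield the same name, genre, and publication date, so a redesign of the page no longer breaks the crawler. That is exactly the property that makes microdata attractive to search engines and to anyone else who wants the data.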

As content authors, though, we could feel a bit cheated by this. External services are using our carefully marked-up data for their own benefit. Google does provide extra links to its sources, but only in a collapsed view which is likely to be ignored by people just looking for the answer. What this means is that we are doing all the hard work while Google is taking all the credit.

Blogs like mine might (at least for some time) escape the first few blows because we offer opinions and contextual articles rather than single answers to direct questions. Then again, I believe it's probably just a matter of time before we feel the consequences of this. Google could just as well roll out a list of film reviews (with some source links in the footer that nobody is going to click anyway), reliably harvesting its information from sites that use the movie and review microdata formats. That way it shows our reviews without giving us proper credit for writing them.

Conclusion

What bothers me the most is that we content authors gain very little by going the extra step to mark up our data with microdata. We may even lose a part of our audience that way. Sure, the people we lose are probably just looking for a simple answer and may not be particularly interested in the rest of our site, but branding works in mysterious ways. Currently there is no way to protect ourselves from this, and we are at the mercy of Google and other search engines to provide visible source links and quotes so we are at least given proper credit for our work.

If search engine developers play this right, both engines and content authors could benefit from the semantic web. But if they're going to claim all the credit for the data we are providing, many people are going to be discouraged from writing for the web. Not only that, it could hurt the success of the semantic web itself, setting us back several steps in the process of making more sense of this enormous cluster of information we call the internet.

Source: http://www.onderhond.com/blog/work/google-microdata-stealing-content
