Yahoo Opens Largest Database to the Public

Yahoo have made 13.5 TB of data on around 20 million users available for everyone to play with.

By Adi Gaskell · Mar. 14, 16 · Big Data Zone · News

Machine learning is a topic I’ve touched upon a lot recently, and whilst there are efforts underway to develop machines that are more efficient learners, at the moment it still requires huge databases and a lot of computational grunt to train the algorithms successfully.

Alas, access to that kind of data is something that is often lacking, with large-scale datasets typically the preserve of machine learning academics or scientists at huge companies.

A project from Yahoo Labs aims to make such data more widely available. The Webscope database has been available to Yahoo researchers for some time, but they have recently opened it up to the public.

Data for the Masses

The database currently provides in the region of 13.5 TB of anonymized user-news item interaction data from around 20 million users over a three-month period.

The data consists of anonymized user interactions with news content on a range of Yahoo properties, and it is being released with the aim of promoting independent research in machine learning and recommendation systems.

They also hope to open up this world to a wider range of participants and level the playing field between industrial and academic research.

The data includes both demographic information for users and their interactions with content. Each interaction is timestamped and contains information on the device used to access the content.
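To make that description a little more concrete, here is a rough sketch of how a single interaction record might be modeled and summarized in Python. The field names (`user_id`, `item_id`, `device`, `age_band`) and the sample rows are purely hypothetical, invented for illustration; they are not the actual Webscope schema, which is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Interaction:
    """One hypothetical user-news interaction; all field names are assumptions."""
    user_id: str         # anonymized user identifier
    item_id: str         # anonymized news item identifier
    timestamp: datetime  # when the interaction happened
    device: str          # device used to access the content, e.g. "mobile"
    age_band: str        # a demographic attribute, assumed here to be bucketed


def interactions_per_device(rows):
    """Count how many interactions came from each device type."""
    return Counter(row.device for row in rows)


# Made-up sample rows, only to show the shape of the data.
sample = [
    Interaction("u1", "n1", datetime(2015, 3, 1, 9, 0, tzinfo=timezone.utc), "mobile", "25-34"),
    Interaction("u1", "n2", datetime(2015, 3, 1, 9, 5, tzinfo=timezone.utc), "mobile", "25-34"),
    Interaction("u2", "n1", datetime(2015, 3, 2, 18, 30, tzinfo=timezone.utc), "desktop", "35-44"),
]

device_counts = interactions_per_device(sample)
print(device_counts)
```

A per-device count like this is about the simplest summary you could run; the same record shape would feed equally well into behavior modeling or a recommender.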

The data has already been given a thorough working over by the Personalization Science team at Yahoo, and they’re confident that the public will have similar fun in areas such as behavior modeling, machine learning, recommendation services, and content modeling.

“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset,” they say.

Suffice to say, the field of research possible with this data is relatively limited due to the subject matter included, but it’s nonetheless a tasty amount of data for scientists to play around with, and hopefully it will prove to be valuable to the community.

Published at DZone with permission of Adi Gaskell, DZone MVB. See the original article here.