Getting Data From the Web

Learn a nice and clever way to get/copy a table from a Wikipedia web page using a simple JavaScript syntax — even if you are not familiar with JavaScript.

In this article, I will show you a nice and clever way to get/copy a table from a Wikipedia web page using a simple JavaScript syntax. If you are not familiar with JavaScript, don’t worry — you can still follow along.

By the way, this process is often described as scraping data with a browser. Let’s start!

Go to the following Wikipedia web page and scroll down to the economy section. Then open the browser's developer tools: in Internet Explorer, press F12; in Google Chrome, click the menu, then More Tools > Developer Tools (see picture below).

Now it is time to select the table whose data you want to copy. Click the element-picker arrow in the Developer Tools, click the first element of the first table on the page, and then click the <tbody> tag to select the table (see below):

Notice the $0 shown after the <tbody> tag. It is a console shortcut for the element that is currently selected, so $0 now refers to the table we just picked.

Click on the Console tab.

Then type $0 and press Enter. The table selected earlier now appears in the console.

Cool, right?

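To save the table rows in a variable, type this code and press Enter: var wholeTable = $$("tr", $0). Here $$ is a console helper that returns an array of all the elements inside $0 that match the selector "tr".

Warning: Be sure to use straight quotation marks (") rather than curly ones (“ ”) around the tr.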
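
Keep in mind that $$ and $0 only exist in the developer tools console. If you ever want the same rows from a regular script, the standard querySelectorAll method gives a similar result. Here is a minimal sketch; getRows is a hypothetical helper name of my own, and root stands for whatever table element you selected:

```javascript
// Hypothetical helper (not part of any browser API): collect every <tr>
// found under `root` into a real array, similar to $$("tr", $0) in the console.
function getRows(root) {
    // querySelectorAll returns a NodeList; Array.from converts it to an array
    return Array.from(root.querySelectorAll("tr"));
}

// In the console you could then write: var wholeTable = getRows($0);
```

The result is an array, so indexing and the length property work the same way as with the console version.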

To access any cell in the table, for example the first cell, type this code and press Enter: wholeTable[0].cells[0].innerText.

The table is a set of rows and columns, and this is how each cell is accessed: wholeTable[Line].cells[Column], with both indexes starting at 0. The innerText property simply returns the text displayed in the cell.
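
Some tables have merged cells or short rows, so a given index may not exist in every row. The sketch below wraps the same access in a small guard; cellText is a hypothetical helper name, and returning null for a missing cell is my own choice:

```javascript
// Hypothetical helper: read the text of one cell, or return null when the
// requested row or cell does not exist.
function cellText(rows, line, column) {
    var row = rows[line];
    if (!row || !row.cells || !row.cells[column]) {
        return null;
    }
    // trim() removes the leading and trailing whitespace innerText may carry
    return row.cells[column].innerText.trim();
}

// In the console: cellText(wholeTable, 0, 0)
```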

Now, let’s get the data by making the following loop:

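// Array that will hold one object per table row
var tempObj = [];
for (var i = 0; i < wholeTable.length; i++) {
    tempObj[i] = {
        Country: "",
        GDP: ""
    };

    // Copy the country name (cell index 1; indexes start at 0)
    tempObj[i].Country = wholeTable[i].cells[1].innerText;

    // Copy the GDP (cell index 2): strip everything except digits, dots
    // and minus signs, then convert the remaining text to a number
    tempObj[i].GDP = parseFloat(wholeTable[i].cells[2].innerText.replace(/[^\d\.\-]/g, ""));
}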

I create an empty array, tempObj, and fill it with one object per row; each object stores the cell data in its Country and GDP properties.

If you are using another table, feel free to rename the properties so that they match the columns of the table you would like to copy.

You can copy any column by adding this line of code with the right column number: wholeTable[i].cells[NumberOfTheColumn].innerText;.
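
If you copy many columns, repeating that line gets tedious. One possible generalization is sketched below. It assumes you describe the columns as an object that maps a property name to a column index; rowsToObjects and columnMap are hypothetical names, and skipping rows that lack a requested cell is my own choice:

```javascript
// Hypothetical helper: build one object per row from a map such as
// { Country: 1, GDP: 2 } (property name -> column index).
function rowsToObjects(rows, columnMap) {
    var names = Object.keys(columnMap);
    var result = [];
    for (var i = 0; i < rows.length; i++) {
        var record = {};
        var complete = true;
        for (var j = 0; j < names.length; j++) {
            var cell = rows[i].cells[columnMap[names[j]]];
            if (!cell) {
                // This row has no cell at that index: skip the whole row
                complete = false;
                break;
            }
            record[names[j]] = cell.innerText.trim();
        }
        if (complete) {
            result.push(record);
        }
    }
    return result;
}

// In the console: rowsToObjects(wholeTable, { Country: 1, GDP: 2 });
```

Every value stays a string here; convert the numeric columns afterwards if you need numbers.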

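The following expression, parseFloat(wholeTable[i].cells[2].innerText.replace(/[^\d\.\-]/g, "")), is just a trick to convert the text to a number: the regular expression removes every character that is not a digit, a dot, or a minus sign, and parseFloat parses what is left. Otherwise, I would get a string instead of a number.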
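
To see the trick in isolation, here is a small sketch that applies the same regular expression to plain strings. toNumber is a hypothetical helper name, and returning null when nothing numeric is left (for example on a header cell) is my own addition:

```javascript
// Hypothetical helper: strip everything except digits, dots and minus signs,
// then parse the remainder as a floating-point number.
function toNumber(text) {
    var cleaned = text.replace(/[^\d.\-]/g, "");
    var value = parseFloat(cleaned);
    // parseFloat returns NaN when there is nothing numeric to parse
    return isNaN(value) ? null : value;
}

toNumber("$1,234.56");   // 1234.56
toNumber("19,390,604");  // 19390604
toNumber("GDP (US$MM)"); // null
```

One caveat: digits inside footnote markers survive the cleanup, so a cell such as "1,234[1]" would come out as 12341. Check a few values by eye before trusting the result.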

Well, guess what? The table is ready! The last step is to copy tempObj to the clipboard so that you can paste it into any environment you want. Type copy(tempObj) and press Enter.
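
If the destination is a spreadsheet, CSV text may be more convenient than a pasted object. The sketch below is one way to build it, assuming every object in the array has the same properties as the first one; toCsv is a hypothetical helper name:

```javascript
// Hypothetical helper: turn an array of flat objects into CSV text.
// Assumes every record has the same properties as the first one.
function toCsv(records) {
    if (records.length === 0) {
        return "";
    }
    var headers = Object.keys(records[0]);

    // Quote fields that contain a comma, a quote or a newline,
    // doubling any quotes inside them
    function escapeField(value) {
        var text = String(value);
        return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
    }

    var lines = [headers.join(",")];
    records.forEach(function (record) {
        var fields = headers.map(function (name) {
            return escapeField(record[name]);
        });
        lines.push(fields.join(","));
    });
    return lines.join("\n");
}

// In the console: copy(toCsv(tempObj));
```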

Feel free to share your experience using this method or another method. You are also welcome to ask any questions about this topic.
