5 Ways to Request and Parse Web Data
This article details five ways to request and parse data from the internet. It includes code snippets that make requesting and parsing data easy to do.
1. HttpURLConnection – Send and Receive Data
HttpURLConnection has been part of the Java JDK since version 1.1. It provides methods to send GET/POST requests and receive responses over HTTP. Combined with BufferedReader and InputStreamReader, you can read the response data. You don't need any external libraries.
This is the code snippet for the HTTP GET request.
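A minimal sketch of such a GET request, with `https://example.com` as a placeholder URL:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpGetExample {
    // Sends an HTTP GET request and returns the response body as a String.
    public static String get(String urlString) throws Exception {
        URL url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        // Read the response with InputStreamReader wrapped in BufferedReader.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        conn.disconnect();
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(get("https://example.com"));
    }
}
```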
This is the code snippet for the HTTP POST request.
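A minimal sketch of the POST variant; the target URL and form data are placeholders (here the public echo service `https://httpbin.org/post`):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpPostExample {
    // Sends URL-encoded form data via POST and returns the response body.
    public static String post(String urlString, String formData) throws Exception {
        URL url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);  // required to write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formData.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        conn.disconnect();
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(post("https://httpbin.org/post", "name=value"));
    }
}
```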
Over the years, there have been enhanced HTTP libraries available, but the idea remains the same.
2. Gson – Map JSON to Objects
The data you receive can come in different formats, such as plain text, HTML, XML, JSON, PDF, or JPG. How do you extract the exact data you want? If it is plain text, you can use String methods such as indexOf() to find data and substring() to extract it.
XML used to be a popular data-transfer format. Today, JSON is preferred because it is easy to read and parse. You can use libraries such as Jackson or Google Gson to map JSON data items to Java objects for processing.
First, define a class that matches the JSON structure. If you have a JSON file, you can use an online tool to generate the class from it. Once you have the Java class, you can use Google Gson to map the JSON data to objects.
This is the code snippet to extract data from JSON.
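A minimal sketch with Gson, using a hypothetical `Person` class and inline JSON as the example data:

```java
import com.google.gson.Gson;

public class GsonExample {
    // Hypothetical class whose fields match the JSON keys.
    static class Person {
        String name;
        int age;
    }

    // Maps a JSON string to a Person object.
    public static Person parse(String json) {
        return new Gson().fromJson(json, Person.class);
    }

    public static void main(String[] args) {
        Person p = parse("{\"name\":\"Alice\",\"age\":30}");
        System.out.println(p.name + " is " + p.age);
    }
}
```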
3. Jsoup – Request and Parse Data in HTML
If the data format is HTML, Jsoup is a good tool because it provides both retrieval and parsing, which makes it ideal for web scraping or web crawling. To set up, you can download Jsoup from its website. If you use Maven, you can add the following to pom.xml.
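The Maven coordinates are org.jsoup:jsoup; the version below is illustrative, so check for the latest release:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version> <!-- use the latest release -->
</dependency>
```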
To use it, you map the page to a Jsoup Document. You can then retrieve the whole page with html(), or pick out specific elements on the page with select().
Here is the code snippet for the Jsoup GET request.
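A minimal sketch, with the URL and the "h1" selector as placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupGetExample {
    public static void main(String[] args) throws Exception {
        // Map the page to a Jsoup Document via a GET request.
        Document doc = Jsoup.connect("https://example.com").get();
        // Retrieve the whole page as HTML.
        System.out.println(doc.html());
        // Or select particular elements with a CSS selector.
        for (Element heading : doc.select("h1")) {
            System.out.println(heading.text());
        }
    }
}
```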
Here is the code snippet for the Jsoup POST request.
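A minimal sketch of a POST with Jsoup; the URL and form field names are hypothetical and must match the target site's actual form:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupPostExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical login form; replace URL and field names as needed.
        Document doc = Jsoup.connect("https://example.com/login")
                .data("username", "user")
                .data("password", "secret")
                .post();
        // The Document returned is the page the server responds with.
        System.out.println(doc.title());
    }
}
```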
4. Jsoup and HttpURLConnection – Download Images
If you want to download the images from an HTML page, you can combine Jsoup and HttpURLConnection: first use Jsoup to get the image links, then use HttpURLConnection to download each image to a local directory.
Here is the code snippet to download images.
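A minimal sketch of those two steps; the page URL and the "images" output directory are placeholders:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageDownloader {
    // Extracts the file name from an image URL.
    static String fileName(String src) {
        return src.substring(src.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws Exception {
        // Step 1: use Jsoup to collect the image links on the page.
        Document doc = Jsoup.connect("https://example.com").get();
        Path dir = Paths.get("images");
        Files.createDirectories(dir);
        for (Element img : doc.select("img[src]")) {
            String src = img.absUrl("src"); // resolve relative links
            // Step 2: use HttpURLConnection to download each image.
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(src).openConnection();
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, dir.resolve(fileName(src)),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            conn.disconnect();
        }
    }
}
```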
5. Selenium With Chrome Headless – Request and Parse Dynamic Data
Jsoup cannot handle some cases. For example, some websites require you to log in before showing data, and sometimes the data are generated dynamically by JavaScript. There is a workaround for this: Selenium with Chrome headless.
Selenium is a tool for web-application testing. It can simulate human actions such as clicking or entering data, and it can drive the browser in headless mode, which means the browser window is invisible. Google has used headless Chrome for its own crawling, and it is available for the public to use as well. Here we use Selenium with headless Chrome to extract dynamically generated data.
First, download Selenium from its website. Alternatively, you can add the dependency in pom.xml like this.
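The Maven coordinates are org.seleniumhq.selenium:selenium-java; the version below is illustrative, so check for the latest release:

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.8.0</version> <!-- use the latest release -->
</dependency>
```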
Next, download the Chrome driver. After downloading, unzip "chromedriver.exe" to a directory. In the code, set the Chrome web driver by specifying the absolute path of the exe file. Then define ChromeOptions and add the "--headless" argument. After you initialize ChromeDriver, you can retrieve the whole page or particular elements.
This is the code snippet of using selenium with chrome headless.
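A minimal sketch of those steps; the driver path and target URL are placeholders, and a real Chrome installation is required to run it:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessExample {
    public static void main(String[] args) {
        // Point Selenium at the unzipped driver binary (path is a placeholder).
        System.setProperty("webdriver.chrome.driver", "C:/tools/chromedriver.exe");
        // Define ChromeOptions and add the --headless argument.
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com");
            // The whole rendered page, after JavaScript has run.
            System.out.println(driver.getPageSource());
            // Or a particular element.
            System.out.println(driver.findElement(By.tagName("h1")).getText());
        } finally {
            driver.quit();
        }
    }
}
```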
Published at DZone with permission of Vivien H.. See the original article here.