Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Converting WordPress Posts in CSV Format

DZone's Guide to

Converting WordPress Posts in CSV Format

Check out your WordPress data by converting it to a CSV format for R analysis.

· Big Data Zone ·
Free Resource

How to Simplify Apache Kafka. Get eBook.

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and wanted to get the data in CSV format so I could do some analysis in R.2015 07 07 06 59 02

I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work and ended up working with the raw XML data that WordPress produces when you ‘export’ a blog.

I had the problem of the export being incomplete which I ‘solved’ by importing the posts in two parts of a few years each.

I then spent quite a few hours struggling to get the data into shape using R’s rvest library but eventually decided to do the scraping using Python’s beautifulsoup and save it to a CSV file for analysis in R.

The structure of the XML that we want to extract is as follows:

<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...
    <channel>
<item>
<title>First thoughts on Ruby...</title>
<link>http://www.markhneedham.com/blog/2006/08/29/first-thoughts-on-ruby/</link>
<pubDate>Tue, 29 Aug 2006 13:31:05 +0000</pubDate>
...

I wrote the following script to parse the files:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser

import csv

def read_page(page):
    return BeautifulSoup(open(page, 'r').read())

with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])

    for row in select(read_page("part2.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

    for row in select(read_page("part1.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

We end up with a CSV file that looks like this:

$ head -n 10 posts.csv
title,date
Functional C#: Writing a 'partition' function,2010-02-01 23:34:02+00:00
Coding: Wrapping/not wrapping 3rd party libraries and DSLs,2010-02-02 23:54:21+00:00
Functional C#: LINQ vs Method chaining,2010-02-05 18:06:28+00:00
F#: function keyword,2010-02-07 02:54:13+00:00
Willed vs Forced designs,2010-02-08 22:48:05+00:00
Functional C#: Extracting a higher order function with generics,2010-02-08 23:17:47+00:00
Javascript: File encoding when using string.replace,2010-02-10 00:02:02+00:00
F#: Inline functions and statically resolved type parameters,2010-02-10 23:06:14+00:00
Javascript: Passing functions around with call and apply,2010-02-12 20:18:02+00:00

Let’s quickly look over the data in R and check it’s being correctly exported:

require(dplyr)
require(lubridate)

df = read.csv("posts.csv")

> df %>% count()
Source: local data frame [1 x 1]

     n
1 1501

So we’ve exported 1501 posts. Let’s cross check with the WordPress dashboard:

2015 07 07 07 06 02

We’ve gained two extra posts! A bit more exploration of the WordPress dashboard reveals that there are actually 2 draft posts lying around.

We probably want to remove those from the export and luckily there’s a ‘status’ tag for each post that we can check. We want to make sure it doesn’t have the value ‘draft':

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser

import csv

def read_page(page):
    return BeautifulSoup(open(page, 'r').read())

with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])

    for row in select(read_page("part2.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

    for row in select(read_page("part1.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

I also had to check if that tag actually existed since there were a couple of posts which didn’t have it but had been published. If we check the resulting CSV file in R we can see that we’ve now got all the posts:

> df = read.csv("posts.csv")
> df %>% count()
Source: local data frame [1 x 1]

     n
1 1499

Now we’re ready to test a couple of hypotheses that I have but that’s for another post!

12 Best Practices for Modern Data Ingestion. Download White Paper.

Topics:
big data ,big data applications ,wordpress ,csv ,csv conversion ,r language ,r analysis

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}