
Easy Google Sitemap Generation With SitemapGen


Over at my company we recently launched SATechEvents. One item still on our to-do list, and one we knew was extremely important, was an XML site map, a.k.a. the Google Sitemap. We built SATechEvents on the Ruby on Rails framework and were looking for an easy way to generate a Google site map.

Looking on the Google Webmaster Tools website, I found a link to a site map generator on SourceForge, and this is the tool we now use to generate our site maps. It is dead easy to set up and works perfectly. Please note that you will need Python version 2.2 or later on your server to run the script. To get started, first download the site map generator from SourceForge (1.4 was the latest version when this was written).

After you have downloaded the zip file, go ahead and unzip it to a folder. You will find a bunch of files, but the two you are really interested in are sitemap_gen.py and sample-config.xml. In fact, unless you want to check out the Python code, the only file you need to be concerned with is sample-config.xml.

Crack open the sample-config.xml file and save it as config.xml. This is the file in which you define all the directives the Python script will use to generate your Google Sitemap. The first directive we need to edit is the site node, which is also the first item listed in the XML document.

<site base_url="http://www.example.com/"

The first order of business is to set base_url, which of course should be the base URL of your website. The next parameter, store_into, tells the script where to store the generated site map. The best place for your XML site map is the root directory of your site, so that is the path you want to feed the store_into parameter. The last item, verbose, takes an integer from 0 to 3 that controls how much diagnostic output the script produces.

You have a couple of options for how to generate your site map. You can specify individual URLs like so:

<url href="http://www.example.com/stats?q=age"

You can also pass it a list of URLs specified in a text document:

<urllist path="example_urllist.txt"  encoding="UTF-8"  />

You can also pass the script a directory, which it will walk to create your site map file:

<directory path="/var/www/docroot"

If, for example, you already have some XML site map files, you can specify these and have the generator combine them all into one XML site map:

<sitemap path="/var/www/docroot/subpath/sitemap.xml" />

The option I went with was to use the access log as the base from which to generate my site map file, as a lot of the URLs are dynamically generated. It also opens up the possibility of automating the script with Cron in future.

<accesslog path="/etc/httpd/logs/access.log" encoding="UTF-8" />
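To get a feel for what an access log can contribute, here is a rough Python sketch of the idea. It is not the generator's own parser: the function name paths_from_log, the regular expression, and the sample log lines are all made up for illustration, and it assumes the log is in the common/combined Apache format.

```python
import re

# Assumption: lines follow the Apache common/combined log format, where the
# request appears in quotes and is followed by the HTTP status code.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def paths_from_log(lines):
    """Return the unique paths of successful GET requests, in first-seen order."""
    seen = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not look like a request
        if match.group("method") != "GET" or match.group("status") != "200":
            continue  # only successful page views are worth listing
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


# Made-up sample lines, purely for illustration.
sample = [
    '127.0.0.1 - - [10/Oct/2008:13:55:36 +0200] "GET /events/42 HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2008:13:55:37 +0200] "GET /missing HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2008:13:55:38 +0200] "POST /account/login HTTP/1.1" 200 512',
    '127.0.0.1 - - [10/Oct/2008:13:55:39 +0200] "GET /events/42 HTTP/1.1" 200 2326',
]
print(paths_from_log(sample))
```

Running something like this against your own log before generating the site map gives a quick preview of which dynamic URLs are likely to turn up.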

Now, when going this route, you will need to use the filters, which are specified last in the config.xml document, to ensure that no URLs you do not want indexed are included in the file. I left the first two filters that are there by default intact, as they look useful to have in there:

<!-- Exclude URLs that end with a '~'   (IE: emacs backup files)      -->
<filter action="drop" type="wildcard" pattern="*~" />

<!-- Exclude URLs within UNIX-style hidden files or directories -->
<filter action="drop" type="regexp" pattern="/\.[^/]*" />
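If the regexp pattern looks cryptic, the following small Python sketch shows which URLs it would catch. It assumes the generator searches for the pattern anywhere in the URL (as Python's re.search does), which I have not verified against the script's source. The example URLs are hypothetical.

```python
import re

# The pattern from the default "hidden files" filter above.
HIDDEN = re.compile(r"/\.[^/]*")

# Hypothetical URLs, for illustration only.
urls = [
    "http://www.example.com/.htaccess",
    "http://www.example.com/.svn/entries",
    "http://www.example.com/events/index.html",
]

for url in urls:
    # A match means some path segment starts with a dot.
    print(url, bool(HIDDEN.search(url)))
```

Under that assumption, only URLs with a path segment that starts with a dot are dropped. A dot in the host name or in a file extension does not trigger the filter.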

Then I added the following filter rules to get just the files I wanted:

<!-- Exclude all image files -->
<filter action="drop" type="wildcard" pattern="*.jpg*" />
<filter action="drop" type="wildcard" pattern="*.gif*" />
<filter action="drop" type="wildcard" pattern="*.png*" />

<!-- Exclude all js and css files -->
<filter action="drop" type="wildcard" pattern="*.css*" />
<filter action="drop" type="wildcard" pattern="*.js*" />

<!-- Exclude .ico and .txt files -->
<filter action="drop" type="wildcard" pattern="*.ico" />
<filter action="drop" type="wildcard" pattern="*.txt" />

<!-- Exclude all account related activity -->
<filter action="drop" type="wildcard" pattern="http://www.satechevents.co.za/account/*" />

As you can see from the above, I wanted to ensure that no image files, such as GIFs and JPGs, are indexed. I also excluded all scripts and CSS, as well as .ico and text files. One thing to note here: when you use Rails helpers such as javascript_include_tag to include assets such as scripts or stylesheets, Rails appends an ID of sorts to the end of the included file’s URL, so simply defining the following exclusion filter won’t work:

<filter action="drop" type="wildcard" pattern="*.css" />

You need to add an additional wildcard after the file extension as well:

<filter action="drop" type="wildcard" pattern="*.css*" />
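The difference between the two patterns is easy to check in Python. This is a small sketch that assumes the generator's wildcard filters behave like shell-style matching in Python's fnmatch module. The stylesheet URLs and the numeric suffix are hypothetical stand-ins for what Rails produces.

```python
from fnmatch import fnmatchcase

# Hypothetical URLs: the second mimics the suffix Rails appends to assets.
plain = "http://www.example.com/stylesheets/main.css"
stamped = "http://www.example.com/stylesheets/main.css?1221059235"

for pattern in ("*.css", "*.css*"):
    for url in (plain, stamped):
        # fnmatchcase must match the whole string, so "*.css" requires the
        # URL to end in ".css" exactly.
        print(pattern, url, fnmatchcase(url, pattern))
```

With shell-style matching, "*.css" only catches the plain URL, while "*.css*" catches both, which is why the trailing wildcard is needed.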

This goes for all assets such as images, scripts and stylesheets. Once you have your config.xml prepared, upload it along with sitemap_gen.py to your website’s root directory. Next, log into your server using, for example, SSH, and move to the directory to which you uploaded the files.

All that is left is to run the Python script as follows:

python sitemap_gen.py --config=config.xml --testing

In the beginning it is best to add the --testing switch to prevent the script from pinging Google and informing it about your site map. When the script runs, you should see output that looks something like this:

Reading configuration file: /path/config.xml
Opened URLLIST "/path/urllist.txt"
Walking DIRECTORY "/var/www/html/dir"
Walking DIRECTORY "/var/www/html/dir2"
Opened ACCESSLOG "/etc/httpd/logs/access-0.log"
Sorting and normalizing collected URLs.
Writing Sitemap file "/path/sitemap.xml.gz" with 1092 URLs
Count of file extensions on URLs:
208 .html
574 .jpg
Number of errors: 0
Number of warnings: 0

The next step is to go to the directory in which the site map file should have been generated. If you left it as a compressed file, decompress the gzip file and open the XML file contained inside. If you find any items listed in the site map that you do not want in there, just create a new exclusion filter for them and re-run the script.
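If you would rather not decompress the file by hand each time, a few lines of Python can list the URLs directly. This is a minimal sketch. The function name sitemap_urls is my own, and it ignores the XML namespace so that it does not depend on which sitemap schema version the generator writes.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemap_urls(path):
    """Return every <loc> value found in a gzipped sitemap file."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Example usage (the path is whatever you set in store_into):
# for url in sitemap_urls("sitemap.xml.gz"):
#     print(url)
```

Scanning the printed list is a quick way to spot URLs that still need an exclusion filter.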

Once you are satisfied with the output of the script, run it one last time, but this time leave off the --testing switch. You will see output similar to before, with one difference: the following two lines now appear as part of the script’s output.

Notifying search engines.
Notifying www.google.com

That is it! All that is left is to head over to Google Webmaster Tools and add the site map to your site information listing. As mentioned earlier, because the site map is built from the access logs, you can write a simple Cron job to run the Python script on a schedule and keep your site map file fresh. Looking forward to your comments.
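As a starting point for that automation, here is a hedged sketch of a small wrapper that Cron could call. The helper names build_command and regenerate are hypothetical. The only parts taken from this article are the script name and the --config and --testing switches. It assumes that python on the server's PATH is a version the generator supports.

```python
import subprocess


def build_command(script="sitemap_gen.py", config="config.xml", testing=False):
    """Build the command line used earlier in this article."""
    command = ["python", script, "--config=" + config]
    if testing:
        # Keeps the generator from notifying search engines.
        command.append("--testing")
    return command


def regenerate(workdir, testing=False):
    """Run the generator from the directory holding the script and config.

    Returns the process exit code so a scheduler can detect failures.
    """
    return subprocess.call(build_command(testing=testing), cwd=workdir)


# Example usage (the directory is a placeholder for your site root):
# regenerate("/path/to/site/root")
```

Point a Cron entry at a file containing this code plus a call to regenerate, and choose a schedule that matches how often your content changes.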



Published at DZone with permission of Schalk Neethling. See the original article here.
