
Easy Google Sitemap Generation With SitemapGen

by Schalk Neethling · Oct. 18, 08 · News

At my company we recently launched SATechEvents. One item still on our to-do list, and one we knew was extremely important, was an XML site map, a.k.a. the Google Sitemap. We built SATechEvents on the Ruby on Rails framework and were looking for an easy way to generate a Google site map.

Looking on the Google Webmaster Tools website, I found a link to a site map generator on SourceForge, and this is the tool we now use to generate our site maps. It is dead easy to set up and works perfectly. Please note that you will need Python version 2.2 or later on your server to run the Python script. To get started, first download the site map generator from SourceForge (1.4 was the latest version when this was written).

After you have downloaded the zip file, unzip it to a folder. You will find a bunch of files, but the two you are really interested in are sitemap_gen.py and sample-config.xml. In fact, unless you want to check out the Python code, the only file you need to be concerned with is sample-config.xml.

Crack open the sample-config.xml file and save it as config.xml. This is the file in which you define all the directives that the Python script will use to generate your Google Sitemap. The first directive we need to edit is the site node, which is also the first item listed in the XML document.

<site base_url="http://www.example.com/"
store_into="/var/www/docroot/sitemap.xml.gz"
verbose="1">

The first order of business is to set base_url, which should of course be the base URL of your website. The next parameter, store_into, tells the script where to store the generated site map. The best place to store your XML site map is the root directory of your site, so this is the path you want to feed the store_into parameter. The last item, verbose, takes an integer from 0 to 3 that controls how much diagnostic output the script gives.
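Before uploading the file, it can help to sanity-check these three attributes. The following is only a sketch: read_site_settings is a hypothetical helper, not part of sitemap_gen.py, and it assumes that site is the root element of config.xml, as in the sample file. The 0 to 3 range for verbose is the one described above.

```python
import xml.etree.ElementTree as ET


def read_site_settings(config_xml):
    """Return the <site> attributes from a config.xml string.

    Hypothetical helper for checking a config before upload; it is
    not part of sitemap_gen.py itself.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "site":
        raise ValueError("expected <site> as the root element")

    settings = {}
    for name in ("base_url", "store_into"):
        value = root.get(name)
        if not value:
            raise ValueError("missing attribute: " + name)
        settings[name] = value

    # verbose is described as an integer from 0 to 3.
    verbose = int(root.get("verbose", "0"))
    if not 0 <= verbose <= 3:
        raise ValueError("verbose must be between 0 and 3")
    settings["verbose"] = verbose
    return settings
```

Calling read_site_settings(open("config.xml").read()) should return a small dictionary, or raise a ValueError that points at the attribute needing attention.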

You have a couple of options for how to generate your site map. You can specify individual URLs like this:

<url href="http://www.example.com/stats?q=age"
lastmod="2004-11-14T01:00:00-07:00"
changefreq="yearly"
priority="0.3"
/>

You can also pass it a list of URLs specified in a text document:

<urllist path="example_urllist.txt"  encoding="UTF-8"  />
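For a site with dynamically generated pages, such a text file could be produced by a small script. The sketch below assumes the simplest layout, one absolute URL per line; check the example_urllist.txt that ships with the generator for the exact format it expects. The function name build_urllist and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin


def build_urllist(base_url, paths):
    """Return the text of a URL list file: one absolute URL per line.

    Assumes a plain one-URL-per-line layout; base_url should end
    with a slash so relative paths are appended to it.
    """
    lines = [urljoin(base_url, path) for path in paths]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    text = build_urllist("http://www.example.com/", ["events/1", "events/2"])
    with open("example_urllist.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```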

You can also pass the script a directory, which it will walk to create your site map file:

<directory path="/var/www/docroot"
url="http://www.example.com/"
default_file="index.html"
/>

If, for example, you already have some XML site map files, you can specify these and have the generator combine them all into one XML site map:

<sitemap path="/var/www/docroot/subpath/sitemap.xml" />

The option I went with was to use the access log as the base from which to generate my site map file, since a lot of the URLs are dynamically generated. It also makes it possible to automate the script using Cron in future.

<accesslog path="/etc/httpd/logs/access.log" encoding="UTF-8" />

Now, when going this route, you will need to use the filters, which are specified last in the config.xml document, to ensure that no URLs you do not want indexed are included in the file. I left the first two default filters intact, as they look useful:

<!-- Exclude URLs that end with a '~'   (IE: emacs backup files)      -->
<filter action="drop" type="wildcard" pattern="*~" />

<!-- Exclude URLs within UNIX-style hidden files or directories -->
<filter action="drop" type="regexp" pattern="/\.[^/]*" />

Then I added the following filter rules to get just the files I wanted:

<!-- Exclude all image files -->
<filter action="drop" type="wildcard" pattern="*.jpg*" />
<filter action="drop" type="wildcard" pattern="*.gif*" />
<filter action="drop" type="wildcard" pattern="*.png*" />

<!-- Exclude all js and css files -->
<filter action="drop" type="wildcard" pattern="*.css*" />
<filter action="drop" type="wildcard" pattern="*.js*" />

<!-- Exclude .ico and .txt files -->
<filter action="drop" type="wildcard" pattern="*.ico" />
<filter action="drop" type="wildcard" pattern="*.txt" />

<!-- Exclude all account related activity -->
<filter action="drop" type="wildcard" pattern="http://www.satechevents.co.za/account/*" />

As you can see from the above, I wanted to ensure that no image files, such as GIFs and JPGs, are indexed. I also excluded all scripts and CSS, as well as .ico and text files. One thing to note here is that when you use Rails helpers such as javascript_include_tag to include assets such as scripts or stylesheets, Rails appends an ID of sorts to the end of the included file's URL, so simply defining the following exclusion filter won't work:

<filter action="drop" type="wildcard" pattern="*.css" />

You need to add an additional wildcard after the file extension as well:

<filter action="drop" type="wildcard" pattern="*.css*" />
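The difference between the two patterns can be checked locally with Python's fnmatch module. This is a sketch that assumes the generator's wildcard filters behave like shell-style fnmatch patterns matched against the whole URL; the asset URL and its numeric suffix are made-up examples of what Rails might append.

```python
from fnmatch import fnmatchcase


def is_dropped(url, patterns):
    """Return True if any shell-style wildcard pattern matches the full URL."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


# Hypothetical asset URL with a Rails-style suffix after the extension.
asset = "http://www.example.com/stylesheets/main.css?1224324000"

print(is_dropped(asset, ["*.css"]))   # False: the URL does not end in .css
print(is_dropped(asset, ["*.css*"]))  # True: the trailing * absorbs the suffix
```

Keep in mind that a pattern such as *.js* matches any URL that contains ".js" anywhere, so it is worth checking it against a few page URLs you do want indexed.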

This goes for all assets such as images, scripts and stylesheets. Once you have your config.xml prepared, upload it along with sitemap_gen.py to your website's root directory. Next, log into your server using, for example, SSH, and move to the directory where you uploaded the files.

All that is left is to run the Python script as follows:

python sitemap_gen.py --config=config.xml --testing

In the beginning it is best to add the --testing switch to prevent the script from pinging Google and informing it about your site map. When the script runs, you should see output that looks something like this:

Reading configuration file: /path/config.xml
Opened URLLIST "/path/urllist.txt"
Walking DIRECTORY "/var/www/html/dir"
Walking DIRECTORY "/var/www/html/dir2"
Opened ACCESSLOG "/etc/httpd/logs/access-0.log"
Sorting and normalizing collected URLs.
Writing Sitemap file "/path/sitemap.xml.gz" with 1092 URLs
Count of file extensions on URLs:
208 .html
574 .jpg
...
Number of errors: 0
Number of warnings: 0

The next step is to go to the directory into which the site map file should have been generated. If you left it as a compressed file, unzip the gzip file and open the XML file inside. If you find any items listed in the site map that you do not want in there, just create a new exclusion filter for them and re-run the script.
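Instead of unzipping by hand, the generated file can also be inspected with a few lines of Python. This sketch assumes the output follows the usual Sitemap layout, in which each URL sits in a loc element; it ignores the XML namespace so that it does not depend on the schema version the generator writes. The function name sitemap_urls is illustrative.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemap_urls(path):
    """Return every <loc> value from a sitemap file, gzipped or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        tree = ET.parse(handle)

    urls = []
    for element in tree.iter():
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc":
            urls.append((element.text or "").strip())
    return urls


if __name__ == "__main__":
    for url in sitemap_urls("sitemap.xml.gz"):
        print(url)
```

Scanning the printed list for stray asset or account URLs is a quick way to decide which exclusion filters are still missing.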

Once you are satisfied with the output of the script, run it one last time, but this time leave off the --testing switch. You will see similar output as before, with one difference: the following two lines now appear as part of the script's output.

Notifying search engines.
Notifying www.google.com

That is it! All that is left is to head over to Google Webmaster Tools and add the site map to your site information listing. As mentioned earlier, when using access logs you can write a simple Cron job to run the Python script on a schedule and keep your site map file fresh. Looking forward to your comments.
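One possible shape for such a scheduled job is a small wrapper that runs the generator and checks the summary lines shown in the sample output earlier. This is a sketch under assumptions: the wrapper and its function names are hypothetical, it expects a python executable on the PATH that can run sitemap_gen.py, and it relies on the "Number of errors" and "Number of warnings" lines appearing in the output as they do above.

```python
import re
import subprocess


def build_command(config_path, testing=False):
    """Build the same command line used when running the script by hand."""
    command = ["python", "sitemap_gen.py", "--config=" + config_path]
    if testing:
        command.append("--testing")
    return command


def parse_summary(output):
    """Extract the error and warning counts from the script's output."""
    summary = {}
    for key in ("errors", "warnings"):
        match = re.search(r"Number of %s:\s*(\d+)" % key, output)
        summary[key] = int(match.group(1)) if match else None
    return summary


def run_generator(config_path, testing=False):
    """Run sitemap_gen.py and return the parsed summary counts."""
    result = subprocess.run(
        build_command(config_path, testing),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams together
        universal_newlines=True,
    )
    return parse_summary(result.stdout)
```

A Cron entry could then call a script that invokes run_generator("config.xml") and sends a notification whenever the error count is not zero.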


Published at DZone with permission of Schalk Neethling. See the original article here.
