PyGoogleGrabber -- Using Python to Find Out How Many Pages Your Domain Has Listed in Google

David Walsh posted a PHP snippet that checks Google to see how many pages are listed for a given domain. I decided to do the same thing in Python. My version is a bit more verbose because it explicitly closes the connection to Google and catches errors.

# Python 2 (uses httplib and the print statement)
import urllib
import httplib
import re

def get_google_results(domain = 'codecraig.com'):
    # headers required by Google
    headers = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT'}

    # URL-encode the domain so unusual characters cannot break the query string
    url = '/search?q=site:' + urllib.quote(domain)
    conn = None

    try:
        # connect to google and search
        conn = httplib.HTTPConnection('www.google.com')
        conn.request('GET', url, None, headers)

        # get the response
        resp = conn.getresponse()

        # make sure it was successful
        if resp.status == 200:
            # get the response HTML
            html = resp.read()
            html = html.decode('ascii', 'ignore')

            # search for the results count
            m = re.search('Results .* about <b>(.*)</b> from', html)
            if m and len(m.groups()) > 0:
                return m.groups()[0]
            # no results line found: treat as zero pages listed
            return 0
        return 'ERROR: ' + str(resp.status) + " " + str(resp.reason)
    except Exception, e:
        return "ERROR: " + str(e)
    finally:
        # close the connection to Google
        if conn:
            conn.close()

# Simple driver
if __name__ == '__main__':
    domains = ['codecraig.com', 'davidwalsh.name', 'python.org', 'digg.com', 'dzone.com', 'some-domain-that-doesnt-exist']
    for domain in domains:
        print domain, get_google_results(domain)
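
The parsing step can be tried on its own, without contacting Google. The sketch below is only an illustration: `SAMPLE_HTML` is a made-up fragment shaped like the line the regular expression expects, not a captured response, and `extract_count` is a hypothetical helper name. It uses non-greedy groups, which is a small change from the pattern above, so that the match stops at the first closing tag if the phrase ever appears more than once.

```python
import re

# Hypothetical fragment in the shape the pattern expects; not a real Google response.
SAMPLE_HTML = ('Results <b>1</b> - <b>10</b> of about '
               '<b>1,490,000</b> from <b>python.org</b>')

# Non-greedy groups stop at the first "</b> from" instead of the last one.
COUNT_PATTERN = re.compile(r'Results .*? about <b>(.*?)</b> from')

def extract_count(html):
    """Return the count text from the results line, or None if it is absent."""
    m = COUNT_PATTERN.search(html)
    return m.group(1) if m else None

if __name__ == '__main__':
    print(extract_count(SAMPLE_HTML))
    print(extract_count('<html>nothing listed</html>'))
```

Returning `None` here, rather than `0`, is a design choice for the sketch: it lets a caller tell "no results line on the page" apart from a genuine count.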

Output of the driver (as of 1/24/2008, around 10:35pm EST):

codecraig.com 93
davidwalsh.name 171
python.org 1,490,000
digg.com 3,520,000
dzone.com 499,000
some-domain-that-doesnt-exist 0
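
Note that the function returns the count as Google prints it, so larger values come back as strings with thousands separators, a missing results line comes back as the integer 0, and failures come back as strings starting with "ERROR: ". If the numbers need to be compared or summed, something like the following could normalize them. This is a minimal sketch, and `parse_count` is a hypothetical helper that is not part of the original script.

```python
def parse_count(value):
    """Convert a count such as '1,490,000' (or the integer 0) to an int.

    Returns None for anything that is not a plain number, such as the
    'ERROR: ...' strings the script returns on failure.
    """
    if isinstance(value, int):
        return value
    # Drop thousands separators and surrounding whitespace before converting.
    digits = value.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

if __name__ == '__main__':
    for raw in ['93', '1,490,000', 0, 'ERROR: 503 Service Unavailable']:
        print(repr(raw), parse_count(raw))
```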


