
Groovy Google Grabber -- Using Groovy to Find Out How Many Pages Your Domain Has Listed in Google


David Walsh posted a PHP snippet that checks Google to see how many pages are listed for a given domain. I thought it was interesting, so I converted it to Python, and now I have converted it to Groovy. I am very new to Groovy, so I am sure there is a Groovier way to do this. If so, I'd love to hear your feedback.

To start, I am using GroovyHTTP, a class written by Tony Landis. I modified it slightly to work with Google: I removed the "content-type" and "content-length" headers and added a "user-agent" header (similar to what I did in the Python version).
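
To make the header change concrete, here is a rough sketch of the request text a GET should put on the wire. The helper name buildGetRequest and the shortened user-agent value are made up for illustration; they are not part of GroovyHTTP.

```groovy
// Hypothetical helper (not part of GroovyHTTP): builds the raw HTTP/1.0 GET request text.
def buildGetRequest(String host, String path, String query, String userAgent) {
    // Only append '?' when there is a query string to send
    def target = query ? "${path}?${query}" : path
    // Each header line ends in CRLF; one empty line closes the header block
    return "GET ${target} HTTP/1.0\r\n" +
           "Host: ${host}\r\n" +
           "User-Agent: ${userAgent}\r\n" +
           "Connection: close\r\n" +
           "\r\n"
}

def request = buildGetRequest('www.google.com', '/search', 'q=site%3Acodecraig.com', 'Mozilla/5.0')
print request
```

A GET has no body, so there is nothing for "content-type" or "content-length" to describe, which is consistent with dropping them here.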

class GroovyHTTP {
public method='POST'
public uri
public host
public path
public port
public params=null
public socket=null
public writer=null
public reader=null
public writedata
public headers = []
public content

// set the url and create new URI object
def GroovyHTTP(url) {
uri = new URI(url)
host = uri.getHost()
path = uri.getRawPath()
port = uri.getPort()
def tpar = uri.getQuery()
if(tpar != null && tpar != '') {
tpar.tokenize('&').each{
def pp = it.tokenize('=')
// a key without a value (e.g. "q=") gets an empty string instead of null
this.setParam(pp[0], pp[1] ?: '')
}
}
if(port == null || port < 0) port = 80
if(path == null || path.length() == 0) path = "/"
}

// sets the method (GET or POST)
def setMethod(setmethod) {
method = setmethod
}

// push params into this request
def setParam(var,value) {
if(params != null)
params += '&'
else
params=''
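// the one-argument URLEncoder.encode(String) is deprecated; name the charset explicitly
params += var +'='+URLEncoder.encode(value.toString(), 'UTF-8')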
}

// clear params
def clearParams() {
params = null
}

// open a new socket
def open() {
socket = new Socket(host, port)
}

// write data to the socket
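def write() {
def contentLen = 0
if(params!=null) contentLen = params.length()
def writedata = ''

// a GET carries its parameters in the query string; skip the '?' when there are none
if(this.method == 'GET')
writedata += "GET " + path + (params != null ? '?' + params : '') + " HTTP/1.0\r\n"
else
writedata += "POST " + path + " HTTP/1.0\r\n"

// every header line ends with \r\n; a single empty line closes the header block
writedata +=
"Host: " + host + "\r\n" +
"User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11\r\n" +
//"Content-Type: application/x-www-form-urlencoded\r\n" +
//"Content-Length: " + contentLen + "\r\n" +
"Connection: close\r\n\r\n"

// only a non-GET request sends the parameters as a body
// (a strict server would also expect the two commented-out headers in that case)
if(this.method != 'GET' && params != null) writedata += params

writer = new PrintWriter(socket.getOutputStream(), true)
writer.write(writedata)
writer.flush()
}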

// read response from the server
def read() {
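// DataInputStream.readLine() is deprecated; BufferedReader offers the same readLine() call
reader = new BufferedReader(new InputStreamReader(socket.getInputStream()))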
def c = null
while (null != ((c = reader.readLine()))) {
if(c=='') break
headers.add(c)
}
}

// get header value by name
def getHeader(name) {
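def prefix = name + ': '
// find() stops at the first header line that starts with "Name: "
// (a return inside each{} only leaves the closure, it does not stop the loop)
def line = headers.find { it.startsWith(prefix) }
return line ? line.substring(prefix.length()).trim() : null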
}

// get the response content
def getContent() {
def row
content = ''
while (null != ((row = reader.readLine()))) content += row + "\r\n"
return content = content.trim();
}

// close the socket
def close() {
reader.close()
writer.close()
socket.close()
}
}
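
On the "Groovier way" question, one possible alternative is to skip the raw socket and use java.net.URLConnection together with Groovy's GDK helpers. This is only a sketch of a different technique, not the GroovyHTTP approach used in this post; the names searchUrl and fetchSearchPage and the shortened user-agent value are made up for illustration.

```groovy
// Assumption: java.net.URLConnection stands in for the GroovyHTTP class.
// searchUrl and fetchSearchPage are hypothetical names, not part of the original code.
def searchUrl(String domain) {
    // URLEncoder turns 'site:codecraig.com' into 'site%3Acodecraig.com'
    return 'http://www.google.com/search?q=' + URLEncoder.encode('site:' + domain, 'UTF-8')
}

def fetchSearchPage(String domain, String userAgent = 'Mozilla/5.0') {
    def conn = new URL(searchUrl(domain)).openConnection()
    // Same idea as the header change in GroovyHTTP: send a browser-like user-agent
    conn.setRequestProperty('User-Agent', userAgent)
    // getText() is a Groovy GDK method on InputStream; it reads everything and closes the stream
    return conn.inputStream.getText('UTF-8')
}

// Building the URL needs no network access
println searchUrl('codecraig.com')   // prints http://www.google.com/search?q=site%3Acodecraig.com
```

Calling fetchSearchPage('codecraig.com') would return the page HTML as a string, which could then go through the same regular expression as the socket-based version.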


With that class done, we just create the 'get_google_results' function...

def get_google_results(domain='codecraig.com') {
// declare locals with def so they do not leak into the script binding
def h = new GroovyHTTP('http://www.google.com/search')
h.setMethod('GET')
h.setParam('q', 'site:' + domain)
h.open()
h.write()
h.read()
println h.getHeader('Server')
def tmp = h.getContent()
h.close()

// the page count is the bold number in "Results ... about <b>N</b> from"
def r = /Results .* about <b>(.*)<\/b> from/
def matcher = (tmp =~ r)
if (matcher.find()) {
return matcher.group(1)
}
return 0
}
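
The regular expression is the fragile part, so it can help to try it without touching the network. The sketch below applies the same pattern to a made-up sample line; the helper name extractCount and the sample markup are assumptions shaped to fit the pattern, not captured Google output.

```groovy
// Hypothetical helper: the same pattern as get_google_results, applied to a plain string.
def extractCount(String html) {
    def matcher = (html =~ /Results .* about <b>(.*)<\/b> from/)
    // find() moves to the first match; group(1) is the text inside <b>...</b>
    return matcher.find() ? matcher.group(1) : '0'
}

// Made-up sample line, shaped like what the pattern expects
def sample = 'Results <b>1</b> - <b>10</b> of about <b>93</b> from <b>codecraig.com</b>'
println extractCount(sample)                     // prints 93
println extractCount('<html>no match</html>')    // prints 0
```

Because both .* groups are greedy, a page with several "about <b>...</b> from" fragments could capture more than intended, so treat the pattern as tied to one specific page layout.
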

Now, the code to test it all...

domains = ['codecraig.com', 'davidwalsh.name', 'python.org', 'digg.com', 'dzone.com', 'some-domain-that-doesnt-exist']
domains.each() {
count = get_google_results(it)
println "${it} ${count}"
}

Results (as of 1/24/2008 around 11:25pm EST):

codecraig.com 93
davidwalsh.name 171
python.org 1,490,000
digg.com 3,530,000
dzone.com 499,000
some-domain-that-doesnt-exist 0
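
One detail in these results: a matched count comes back as a string with thousands separators, while the no-match case returns the integer 0. If you want to sort or compare the counts, a small helper along these lines could normalise both; the name toCount is made up for illustration.

```groovy
// Hypothetical helper, not in the original script: normalises a result to a long.
def toCount(value) {
    // toString() covers the integer 0 fallback; commas are thousands separators
    return value.toString().replace(',', '').toLong()
}

println toCount('1,490,000')   // prints 1490000
println toCount(0)             // prints 0
```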

Enjoy!
