
Ruby-Based Image Crawler


I don't write much code these days and felt it was time to sharpen the saw.

I need to download a ton of images from a site (I got permission first...), but doing it by hand would take forever. Even though there are plenty of tools out there for image crawling, I figured this would be a great exercise to brush up on some skills and dig further into a language I am still fairly new to: Ruby. It lets me work with basic language constructs, network I/O, and file I/O, all while grabbing the images I need quickly.
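
For anyone who has not touched net/http before, the core of the whole thing is tiny. Here is a minimal sketch of fetching a single image and writing it to disk, which is essentially what the full listing further down does in a loop. The URL and file name below are placeholders, not part of the real crawler:

require 'net/http'
require 'uri'

# Placeholder values -- substitute a real image URL and a writable file name.
image_url = "http://www.example.com/images/logo.png"
save_as   = "logo.png"

response = Net::HTTP.get_response(URI.parse(image_url))

# Only write the file if the server actually returned the image.
if response.is_a?(Net::HTTPSuccess)
  File.open(save_as, "wb") { |f| f << response.body }
end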

As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.

You can download the file here.

Here is the source:

require 'net/http'
require 'uri'

class Crawler

# This is the domain or domain and path we are going
# to crawl. This will be the starting point for our
# efforts but will also be used in conjunction with
# the allow_leave_site flag to determine whether the
# page can be crawled or not.
attr_accessor :domain

# This flag determines whether the crawler will be
# allowed to leave the root domain or not.
attr_accessor :allow_leave_site

# This is the path where all images will be saved.
attr_accessor :save_path

# This is a list of extensions to skip over while
# crawling through links on the site.
attr_accessor :omit_extensions

# This keeps track of all the pages we have visited
# so we don't visit them more than once.
attr_accessor :visited_pages

# This keeps track of all the images we have downloaded
# so we don't download them more than once.
attr_accessor :downloaded_images

def begin_crawl
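# URI.parse and Net::HTTP need a full URL, so prepend a scheme
# if the domain was given without one.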
if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
@domain = "http://#{domain}"
end

crawl(domain)
end

private

def initialize
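# Sensible defaults; everything here can be overridden through
# the accessors above.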
@domain = ""
@allow_leave_site = false
@save_path = ""
@omit_extensions = []
@visited_pages = []
@downloaded_images = []
end

def crawl(url = nil)

# If the URL is empty or nil we can move on.
return if url.nil? || url.empty?

# If the allow_leave_site flag is set to false we
# want to make sure that the URL we are about to
# crawl is within the domain.
return if !allow_leave_site && (url.length < domain.length || url[0, domain.length] != domain)

# Check to see if we have crawled this page already.
# If so, move on.
return if visited_pages.include? url

puts "Fetching page: #{url}"

# Go get the page and note it so we don't visit it again.
res = fetch_page(url)
visited_pages << url

# If the response is nil then we cannot continue. Move on.
return if res.nil?

# Some links will be relative so we need to grab the
# document root.
root = parse_page_root(url)

# Parse the image and anchor tags out of the result.
images, links = parse_page(res.body)

# Process the images and links accordingly.
handle_images(root, images)
handle_links(root, links)
end

def parse_page_root(url)
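# If the last slash is one of the scheme's slashes ("http://" or
# "https://"), the URL has no path, so the root is the URL itself
# plus a trailing slash. Otherwise, trim everything after the
# last slash.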
end_slash = url.rindex("/")
if end_slash > 8
url[0, url.rindex("/")] + "/"
else
url + "/"
end
end

def handle_images(root, images)
if !images.nil?
images.each {|i|

# Make sure all single quotes are replaced with double quotes.
# Since we aren't rendering javascript we don't really care
# if this breaks something.
i.gsub!("'", "\"")

# Grab everything between src=" and ".
src = i.scan(/src=["']([^"']+)/).flatten.first

# If the src is empty move on.
next if src.nil? || src.empty?

# If we don't have an absolute path already, let's make one.
if !root.nil? && src[0,4] != "http"
src = root + src
end

save_image(src)
}
end
end

def save_image(url)

# Check to see if we have saved this image already.
# If so, move on.
return if downloaded_images.include? url

puts "Saving image: #{url}"

# Save this file name down so that we don't download
# it again in the future.
downloaded_images << url

# Parse the image name out of the url. We'll use that
# name to save it down.
file_name = parse_file_name(url)

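# If a file with this name has already been saved, prefix
# underscores until the name is unique.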
while File.exist?(save_path + "\\" + file_name)
file_name = "_" + file_name
end

# Get the response and data from the web for this image.
response = Net::HTTP.get_response(URI.parse(url))

File.open(save_path + "\\" + file_name, "wb+") do |f|
f << response.body
end
end


def parse_file_name(url)

# Find the position of the last slash. Everything after
# it is our file name.
spos = url.rindex("/")
url[spos + 1, url.length - 1]
end


def handle_links(root, links)
if !links.nil?
links.each {|l|

# Make sure all single quotes are replaced with double quotes.
# Since we aren't rendering javascript we don't really care
# if this breaks something.
l.gsub!("'", "\"")

# Grab everything between href=" and ".
href = l.scan(/href=["']([^"']+)/).flatten.first

# We don't want to follow mailto or empty links
next if href.nil? || href.empty? || (href.length > 6 && href[0,6] == "mailto")

# If we don't have an absolute path already, let's make one.
if !root.nil? && href[0,4] != "http"
href = root + href
end

# Down the rabbit hole we go...
crawl(href)
}
end
end


def parse_page(html)
# Start with all lowercase to ensure we don't have any
# case sensitivity issues.
html.downcase!

images = html.scan(/<img[^>]*>/)
links = html.scan(/<a\s[^>]*>/)

return [ images, links ]
end

def fetch_page(url, limit = 10)
# Make sure we are supposed to fetch this type of resource.
return if should_omit_extension(url)

# A more specific exception class would be a better choice here.
raise ArgumentError, 'HTTP redirect too deep' if limit == 0

response = Net::HTTP.get_response(URI.parse(url))
case response
when Net::HTTPSuccess then response
when Net::HTTPNotFound then nil
when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
else
response.error!
end
end


def should_omit_extension(url)
# Get the index of the last slash.
spos = url.rindex("/")

# Get the index of the last dot.
dpos = url.rindex(".")

# If there is no dot, or the last dot comes before the last
# slash, we don't have an extension and can return.
return false if dpos.nil? || spos > dpos

# Grab the extension.
ext = url[dpos + 1, url.length - 1]

# The return value is whether or not the extension we
# have for this URL is in the omit list or not.
omit_extensions.include? ext

end

end

crawler = Crawler.new
crawler.save_path = "C:\\Users\\jmcdonald\\Desktop\\CrawlerOutput"
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
"pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "www.yoursite.com"
crawler.allow_leave_site = false
crawler.begin_crawl
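
Assuming the listing is saved as something like crawler.rb (the name is arbitrary), running it is just a matter of ruby crawler.rb from the command line. One thing to watch: the save_path directory has to exist before the run, since File.open will not create missing directories for you.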

 


Published at DZone with permission of Jason McDonald, DZone MVB. See the original article here.
