Ruby based image crawler


I don't write much code these days and felt it was time to sharpen the saw.

I need to download a ton of images from a site (I got permission first...), but doing it by hand would take forever. Even though there are tons of tools out there for image crawling, I figured this would be a great exercise to brush up on some skills and delve further into a language I am still fairly new to: Ruby. It lets me use basic language constructs, network IO, and file IO, all while getting the images I need quickly.

As I have mentioned a few times on this blog, I am still new to Ruby, so any advice on how to make this code cleaner is appreciated.

You can download the file here.

Here is the source:

require 'net/http'
require 'uri'

class Crawler

  # This is the domain, or domain and path, we are going
  # to crawl. This will be the starting point for our
  # efforts but will also be used in conjunction with
  # the allow_leave_site flag to determine whether a
  # page can be crawled or not.
  attr_accessor :domain

  # This flag determines whether the crawler will be
  # allowed to leave the root domain or not.
  attr_accessor :allow_leave_site

  # This is the path where all images will be saved.
  attr_accessor :save_path

  # This is a list of extensions to skip over while
  # crawling through links on the site.
  attr_accessor :omit_extensions

  # This keeps track of all the pages we have visited
  # so we don't visit them more than once.
  attr_accessor :visited_pages

  # This keeps track of all the images we have downloaded
  # so we don't download them more than once.
  attr_accessor :downloaded_images

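  def begin_crawl
    # Prepend a scheme if the domain does not already have one.
    if domain.nil? || domain.length < 4 || domain[0, 4] != "http"
      @domain = "http://#{domain}"
    end

    # Start crawling from the root of the domain.
    crawl(domain)
  end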

  def initialize
    @domain = ""
    @allow_leave_site = false
    @save_path = ""
    @omit_extensions = []
    @visited_pages = []
    @downloaded_images = []
  end

  def crawl(url = nil)

    # If the URL is empty or nil we can move on.
    return if url.nil? || url.empty?

    # If the allow_leave_site flag is set to false we
    # want to make sure that the URL we are about to
    # crawl is within the domain.
    return if !allow_leave_site && !url.start_with?(domain)

    # Check to see if we have crawled this page already.
    # If so, move on.
    return if visited_pages.include?(url)

    puts "Fetching page: #{url}"

    # Go get the page and note it so we don't visit it again.
    res = fetch_page(url)
    visited_pages << url

    # If the response is nil then we cannot continue. Move on.
    return if res.nil?

    # Some links will be relative so we need to grab the
    # document root.
    root = parse_page_root(url)

    # Parse the image and anchor tags out of the result.
    images, links = parse_page(res.body)

    # Process the images and links accordingly.
    handle_images(root, images)
    handle_links(root, links)
  end

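  def parse_page_root(url)
    # A last slash beyond the "http://" or "https://" prefix means the
    # URL has a path, so keep everything up to and including that slash.
    # Otherwise the URL is a bare host and only needs a trailing slash.
    end_slash = url.rindex("/")
    if end_slash > 8
      url[0, end_slash] + "/"
    else
      url + "/"
    end
  end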

  def handle_images(root, images)
    return if images.nil?

    images.each do |i|

      # Make sure all single quotes are replaced with double quotes.
      # Since we aren't rendering javascript we don't really care
      # if this breaks something.
      i.gsub!("'", "\"")

      # Grab everything between src=" and ". This yields nil
      # (rather than raising) when the tag has no src attribute.
      src = i[/src="([^"]+)/i, 1]

      # If the src is empty move on.
      next if src.nil? || src.empty?

      # If we don't have an absolute path already, let's make one.
      if !root.nil? && src[0, 4] != "http"
        src = root + src
      end

      save_image(src)
    end
  end

  def save_image(url)

    # Check to see if we have saved this image already.
    # If so, move on.
    return if downloaded_images.include?(url)

    puts "Saving image: #{url}"

    # Save this file name down so that we don't download
    # it again in the future.
    downloaded_images << url

    # Parse the image name out of the url. We'll use that
    # name to save it down.
    file_name = parse_file_name(url)

    # If a file with this name already exists, prefix the
    # name with underscores until it is unique.
    while File.exist?(File.join(save_path, file_name))
      file_name = "_" + file_name
    end

    # Get the response and data from the web for this image.
    response = Net::HTTP.get_response(URI.parse(url))

    # Write the image data out in binary mode.
    File.open(File.join(save_path, file_name), "wb") do |f|
      f << response.body
    end
  end

  def parse_file_name(url)

    # Find the position of the last slash. Everything after
    # it is our file name.
    spos = url.rindex("/")
    url[(spos + 1)..-1]
  end

  def handle_links(root, links)
    return if links.nil?

    links.each do |l|

      # Make sure all single quotes are replaced with double quotes.
      # Since we aren't rendering javascript we don't really care
      # if this breaks something.
      l.gsub!("'", "\"")

      # Grab everything between href=" and ". This yields nil
      # (rather than raising) when the tag has no href attribute.
      href = l[/href="([^"]*)"/i, 1]

      # We don't want to follow mailto or empty links.
      next if href.nil? || href.empty? || href.start_with?("mailto")

      # If we don't have an absolute path already, let's make one.
      if !root.nil? && href[0, 4] != "http"
        href = root + href
      end

      # Down the rabbit hole we go...
      crawl(href)
    end
  end

  def parse_page(html)
    # Match the tags case-insensitively so we don't have any
    # case sensitivity issues, while leaving the case of the
    # URLs themselves untouched.
    images = html.scan(/<img[^>]*>/i)
    links = html.scan(/<a\s[^>]*>/i)

    [images, links]
  end

  def fetch_page(url, limit = 10)
    # Make sure we are supposed to fetch this type of resource.
    return if should_omit_extension(url)

    # You should choose a better exception.
    raise ArgumentError, 'HTTP redirect too deep' if limit == 0

    # Follow redirects up to the limit. Any response that is not a
    # success or a redirect falls through the case and yields nil.
    response = Net::HTTP.get_response(URI.parse(url))
    case response
    when Net::HTTPSuccess then response
    when Net::HTTPNotFound then nil
    when Net::HTTPRedirection then fetch_page(response['location'], limit - 1)
    end
  end

  def should_omit_extension(url)
    # Get the index of the last slash.
    spos = url.rindex("/")

    # Get the index of the last dot.
    dpos = url.rindex(".")

    # If there is no dot, or the last dot is before the last
    # slash, we don't have an extension and can return.
    return false if dpos.nil? || spos.nil? || spos > dpos

    # Grab the extension.
    ext = url[(dpos + 1)..-1]

    # The return value is whether or not the extension we
    # have for this URL is in the omit list.
    omit_extensions.include?(ext)
  end
end


crawler = Crawler.new
crawler.save_path = "C:\\Users\\jmcdonald\\Desktop\\CrawlerOutput"
crawler.omit_extensions = [ "doc", "pdf", "xls", "rtf", "docx", "xlsx", "ppt",
                            "pptx", "avi", "wmv", "wma", "mp3", "mp4", "pps", "swf" ]
crawler.domain = "www.yoursite.com"
crawler.allow_leave_site = false
crawler.begin_crawl
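
One thing worth noting about the listing above: it builds absolute URLs by gluing the page root onto the link, which works for simple relative links but can go wrong for links that start with a slash or with "../". A minimal sketch of an alternative, assuming the standard library's URI.join is acceptable here (the absolute_url helper is a hypothetical name, not part of the crawler):

```ruby
require 'uri'

# Hypothetical helper, not part of the original Crawler class.
# Resolves a link found on a page against the URL of that page.
# Returns nil when the link cannot be parsed as a URI.
def absolute_url(page_url, link)
  URI.join(page_url, link).to_s
rescue URI::Error
  nil
end

page = "http://www.yoursite.com/gallery/index.html"

puts absolute_url(page, "photo.jpg")      # relative to the current directory
puts absolute_url(page, "/img/logo.png")  # relative to the site root
puts absolute_url(page, "../up.png")      # one directory up
puts absolute_url(page, "http://cdn.example.com/a.gif") # already absolute
```

If something like this were used, it would take the place of the "root + src" and "root + href" concatenation, and parse_page_root would no longer be needed.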

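
The extension check in the crawler looks at the whole URL, so a query string or a bare host name (the "com" in "www.yoursite.com") can end up being treated as the extension. Here is a rough sketch of a stricter version that only looks at the path portion of the URL. The url_extension name is made up for illustration, and lowercasing the result is my own assumption about what the omit list should match:

```ruby
require 'uri'

# Hypothetical helper, not part of the original Crawler class.
# Returns the lowercase extension of the URL's path without the
# leading dot, or an empty string when there is none.
def url_extension(url)
  path = URI.parse(url).path.to_s
  File.extname(path).sub(/\A\./, "").downcase
rescue URI::InvalidURIError
  ""
end

omit = ["doc", "pdf", "xls"]

[
  "http://www.yoursite.com/docs/report.PDF?download=1",
  "http://www.yoursite.com/gallery/",
  "http://www.yoursite.com"
].each do |url|
  ext = url_extension(url)
  puts "#{url} -> #{ext.inspect} (omit: #{omit.include?(ext)})"
end
```

The trade-off is that URI.parse is stricter than plain string slicing, so malformed links simply come back with no extension instead of being inspected.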

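
The crawler keeps visited pages and downloaded images in arrays, so every include? check scans the whole list. For a big site, a Set from the standard library might be a better fit. This is only a sketch of the idea; first_visit? is a hypothetical helper name:

```ruby
require 'set'

# Hypothetical helper, not part of the original Crawler class.
# Set#add? returns nil when the element is already in the set,
# so this records the URL and reports whether it was new.
def first_visit?(visited, url)
  !visited.add?(url).nil?
end

visited = Set.new

["http://www.yoursite.com/",
 "http://www.yoursite.com/gallery/",
 "http://www.yoursite.com/"].each do |url|
  if first_visit?(visited, url)
    puts "Fetching page: #{url}"
  else
    puts "Skipping page: #{url}"
  end
end
```

Since Set also responds to include? and <<, swapping the two arrays in initialize for Set.new should work without touching the rest of the class, though I have not tried that against a real site.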

Published at DZone with permission of Jason McDonald, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
