Block Web Crawlers With Rails

Sometimes you just need to block web crawlers from accessing your website or web app. In this post, we take a look at how to do just that using Rails.

By Dan Croak · Feb. 11, 17 · Tutorial

Search engines “crawl” and “index” web content through programs called robots (a.k.a. crawlers or spiders). This may be problematic for our projects in situations such as:

  • a staging environment
  • migrating data from a legacy system to new locations
  • rolling out alpha or beta features

Approaches to blocking crawlers in these scenarios include:

  • authentication (best)
  • robots.txt (crawling)
  • X-Robots-Tag (indexing)

Problem: Duplicate Content

With multiple environments or during a data migration period, duplicate content may be accessible to crawlers. Search engines will have to guess which version to index, assign authority, and rank in query results.

For example, we periodically back up our production data to the staging environment using Parity:

production backup
staging restore production

Things Search Engines Do

In order to provide results, a search engine may prepare by doing these things:

  1. check a domain’s robots settings (e.g. http://example.com/robots.txt)
  2. request a web page on the domain (e.g. http://example.com/)
  3. check the webpage’s X-Robots-Tag HTTP response header
  4. cache the web page (saving its response body)
  5. index the web page (extract keywords from the response body for fast lookup)
  6. follow links on the web page to other web pages and repeat

Steps 1, 2, 3, and 6 are generally “crawling” steps and steps 4 and 5 are generally “indexing” steps.
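
To make steps 1 through 3 concrete, here is a minimal Ruby sketch of the checks a polite crawler might perform before indexing a page; the URL is a placeholder and the robots.txt check is deliberately naive:

require "net/http"
require "uri"

uri = URI("http://example.com/")

# Step 1: fetch the domain's robots settings.
robots = Net::HTTP.get_response(uri.host, "/robots.txt")

# Step 2: request a web page on the domain.
page = Net::HTTP.get_response(uri)

# Step 3: check the page's X-Robots-Tag response header.
robots_tag = page["X-Robots-Tag"]

# Naive checks: a real crawler parses robots.txt rules per user agent.
crawling_allowed = robots.code != "200" || !robots.body.include?("Disallow: /")
indexing_allowed = robots_tag.nil? || !robots_tag.match?(/none|noindex/)

puts "crawl? #{crawling_allowed} index? #{indexing_allowed}"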

Solution: Authentication (Best)

The most reliable way to hide content from a crawler is with authentication such as HTTP Basic authentication:

class ApplicationController < ActionController::Base
  if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
    http_basic_authenticate_with(
      name: ENV.fetch("BASIC_AUTH_USERNAME"),
      password: ENV.fetch("BASIC_AUTH_PASSWORD"),
    )
  end
end

This is often all we need for situations such as a staging environment. The following approaches are more limited but may be more suitable for other situations.

Notice we can control whether crawlers are allowed to access content via config in the environment. We can use Parity again to add configuration to Heroku staging:

staging config:set DISALLOW_ALL_WEB_CRAWLERS=true
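
The basic auth credentials referenced in ApplicationController need to be set the same way; the values below are placeholders:

staging config:set BASIC_AUTH_USERNAME=someuser BASIC_AUTH_PASSWORD=somepassword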

Solution: robots.txt (Crawling)

The robots exclusion standard helps robots decide what action to take. A robot first looks at the /robots.txt file on the domain before crawling it.

It is a de facto standard (not owned by a standards body), and compliance is opt-in for robots. Mainstream robots such as Googlebot respect the standard, but bad actors may not.

An example /robots.txt file looks like this:

User-agent: *
Disallow: /

This disallows all content (/) for all crawlers (every User-agent). See this list of Google crawlers for examples of user agent tokens.

Globbing and regular expressions are not supported in this file; only a small set of plain-text directives can go in it.
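
Groups of directives can also target specific crawlers by their user agent token. For example (the /beta/ path is purely illustrative), this blocks only Googlebot from a beta area while leaving everything else open to everyone:

User-agent: Googlebot
Disallow: /beta/

User-agent: *
Disallow: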

Add Climate Control to the Gemfile to control environment variables in tests:

gem "climate_control"

In spec/requests/robots_txt_spec.rb:

require "rails_helper"

describe "robots.txt" do
  context "when not blocking all web crawlers" do
    it "allows all crawlers" do
      get "/robots.txt"

      expect(response.code).to eq "404"
      expect(response.headers["X-Robots-Tag"]).to be_nil
    end
  end

  context "when blocking all web crawlers" do
    it "blocks all crawlers" do
      ClimateControl.modify "DISALLOW_ALL_WEB_CRAWLERS" => "true" do
        get "/robots.txt"
      end

      expect(response).to render_template "disallow_all"
      expect(response.headers["X-Robots-Tag"]).to eq "none"
    end
  end
end

Google recommends no robots.txt if we want all our content to be crawled.

In config/routes.rb:

get "/robots.txt" => "robots_txts#show"

In app/controllers/robots_txts_controller.rb:

class RobotsTxtsController < ApplicationController
  def show
    if disallow_all_crawlers?
      render "disallow_all", layout: false, content_type: "text/plain"
    else
      head :not_found
    end
  end

  private

  def disallow_all_crawlers?
    ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
  end
end

If we’re using an authentication library such as Clearance site-wide, we’ll want to skip its filter in our controller:

class ApplicationController < ActionController::Base
  before_action :require_login
end

class RobotsTxtsController < ApplicationController
  skip_before_action :require_login
end

Remove the default Rails robots.txt and prepare the custom directory:

rm public/robots.txt
mkdir app/views/robots_txts

In app/views/robots_txts/disallow_all.erb:

User-agent: *
Disallow: /

Solution: X-Robots-Tag (Indexing)

It is possible for search engines to index content without crawling it, because websites might link to it. So, our robots.txt technique blocked crawling, but not indexing.

Adding an X-Robots-Tag header to our responses covers the indexing side: a well-behaved crawler still has to request a page to see the header, but once it does, it won't index or cache that page, or follow its links to the rest of the domain.

You may have seen meta tags like this in projects you’ve worked on:

<meta name="robots" content="noindex,nofollow">

The X-Robots-Tag header has the same effect as the robots meta tag, but applies it to all content types in our app (e.g. images, scripts, styles), not only HTML files.

To block robots in our environment, we want a header like this:

X-Robots-Tag: none

The none directive is equivalent to noindex, nofollow. It tells robots not to index, follow links, or cache.
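
If only a handful of responses need this treatment, the header can also be set straight from a controller; here is a minimal sketch (the BetaFeaturesController name is hypothetical) before we wire it up app-wide below:

class BetaFeaturesController < ApplicationController
  before_action :block_robots

  private

  # Ask well-behaved robots not to index, cache, or follow links from these pages.
  def block_robots
    response.headers["X-Robots-Tag"] = "none"
  end
end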

In lib/rack_x_robots_tag.rb:

module Rack
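  # Adds "X-Robots-Tag: none" to every response when the
  # DISALLOW_ALL_WEB_CRAWLERS environment variable is set.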
  class XRobotsTag
    def initialize(app)
      @app = app
    end

    def call(env)
      status, headers, response = @app.call(env)

      if ENV["DISALLOW_ALL_WEB_CRAWLERS"].present?
        headers["X-Robots-Tag"] = "none"
      end

      [status, headers, response]
    end
  end
end

In config/application.rb:

require_relative "../lib/rack_x_robots_tag"

module YourAppName
  class Application < Rails::Application

    config.middleware.use Rack::XRobotsTag
  end
end

Our specs will now pass.
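
To spot-check the header against a running environment, a quick curl request works; the staging URL is a placeholder:

curl -sI https://staging.example.com | grep -i x-robots-tag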

Conclusion

Our environment's content can be blocked from crawling and indexing in three different ways, the latter two relying on web robots that respect the robots exclusion standard (most importantly, Googlebot).

Use authentication to entirely hide it, or robots.txt plus the X-Robots-Tag for more granular control.


Published at DZone with permission of Dan Croak, DZone MVB. See the original article here.

