Close Site Search Indexing via Kubernetes HAProxy Ingress

Block search engine indexing on your site in 5 simple steps using HAProxy Kubernetes Ingress Controller annotations and a robots.txt file.

By Alexander Sharov · Oct. 22, 24 · Tutorial

In Kubernetes, Ingress resources are frequently used as traffic controllers, providing external access to services within the cluster. Ingress is essential for routing incoming traffic to your service; however, there are scenarios in which you want to prevent search engines from indexing your service's content, for example when the service is a development environment.

This blog post walks you through blocking indexing of your site at the Kubernetes Ingress level by serving a robots.txt file that tells search engine bots not to crawl and index your content.

Prerequisites

To follow this tutorial, you should have a basic grasp of core Kubernetes objects, Ingress resources, and the official HAProxy ingress controller. You will also need access to a Kubernetes cluster and the permissions necessary to make configuration changes.

Keep in mind that for the purposes of this article, I assume that the HAProxy ingress controller is set as the default controller. If it is not the default controller in your cluster, you must add the ingressClassName field to the spec of every Ingress example.
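
As a minimal sketch of that case, the field sits directly under spec. The class name haproxy is an assumption here, not something taken from this setup; the name registered in your cluster may differ.

```yaml
# Fragment of an Ingress manifest, not a complete resource.
spec:
  # Assumed IngressClass name; list the real ones with: kubectl get ingressclass
  ingressClassName: haproxy
```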

Step 1: Create an Ingress Kubernetes Resource

In the first part of our journey, we'll set up a small Ingress resource to expose our service outside of the Kubernetes cluster. Pay attention: for the time being, all web crawlers will have access to the service. To apply the code below, use the command kubectl apply -f ingress.yaml.

YAML
 
# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dzone-close-site-indexing-haproxy-example
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: dzone-close-site-indexing-haproxy-example.referrs.me
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dzone-close-site-indexing-haproxy-example-service
                port:
                  number: 80
  tls:
    - hosts:
        - dzone-close-site-indexing-haproxy-example.referrs.me
      secretName: dzone-close-site-indexing-haproxy-example

Step 2: Modify the Ingress Configuration

The robots.txt file is used to control how search engines crawl and index a site. It specifies which URLs search engine crawlers may access on your website. The most basic file, which restricts access to the entire web service, looks like this:

Plain Text
 
User-agent: *
Disallow: /


With HAProxy, you do not need to add this file to your web server or website. Instead, HAProxy can answer requests for it directly with the following configuration, which should be added to the backend section for the specific group of servers:

Plain Text
 
acl robots_path path_beg /robots.txt
http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path


With the HAProxy ingress controller, Kubernetes annotations control all changes to the HAProxy frontend and backend configuration for a single Ingress resource. The full list of HAProxy annotations can be found in the official documentation on GitHub.

In our case, we need the haproxy.org/backend-config-snippet annotation, which carries the HAProxy snippet that blocks indexing. To do this, open your Ingress resource YAML file and add the following annotation to the metadata section:

YAML
 
# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  # Same name as in Step 1, so that kubectl apply updates the existing Ingress
  name: dzone-close-site-indexing-haproxy-example
  annotations:
    # …
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
  # ...
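
For reference, merging the Step 1 manifest with the snippet gives a complete file along the following lines. This is a sketch that assumes the Ingress keeps its Step 1 name and its cert-manager annotation; adjust both if your resource differs.

```yaml
# file: ingress.yaml
# Sketch: the Step 1 manifest plus the robots.txt snippet.
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dzone-close-site-indexing-haproxy-example
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # HAProxy answers /robots.txt itself; the request never reaches the service
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
  rules:
    - host: dzone-close-site-indexing-haproxy-example.referrs.me
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dzone-close-site-indexing-haproxy-example-service
                port:
                  number: 80
  tls:
    - hosts:
        - dzone-close-site-indexing-haproxy-example.referrs.me
      secretName: dzone-close-site-indexing-haproxy-example
```

Note that path_beg is a prefix match, so any path that starts with /robots.txt receives the same response.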


Step 3: Apply the Configuration Changes

After changing the Ingress YAML file, save it and apply it to the Kubernetes cluster with the kubectl command: kubectl apply -f ingress.yaml.

The Ingress controller will detect the changes and update the configuration accordingly.

Step 4: Verify the Configuration

Request the robots.txt file to confirm that indexing prevention is working properly. No file is stored anywhere: HAProxy builds the response from the snippet you supplied in the annotation.

Retrieve the external IP or domain associated with your Ingress resource and append /robots.txt to the URL. Example:

Plain Text
 
$ curl dzone-close-site-indexing-haproxy-example.referrs.me/robots.txt
User-agent: *
Disallow: /


As we can see, the response contains robots.txt rules that disallow crawling of the entire site.

Step 5: Test Indexing Prevention

To verify that search engines are not indexing your site, you can search for your website on popular search engines. Keep in mind that search results can take some time to reflect changes, so the indexing status may not be updated right away.

Conclusion

Annotations make it easy to avoid search engine indexing while using HAProxy Kubernetes Ingress. By adding the appropriate annotation to your Ingress resource, you can tell search engine bots not to crawl and index your website's content. A similar approach can be used with other ingress controllers, such as NGINX, Traefik, and others. A similar annotation can also be used for Kubernetes Gateway API resources, which are actively replacing Ingresses.

As a final note, robots.txt is a time-honored way for website creators to specify whether or not their sites should be crawled by various bots. However, it turns out that AI crawlers from large language model (LLM) companies frequently ignore the contents of robots.txt and crawl your site regardless. To guard against this, use password protection, a noindex directive, or enterprise load balancer features like HAProxy AI-crawler, which may also be configured as a Kubernetes annotation.
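
As a rough sketch of the noindex option, the same backend snippet could also set an X-Robots-Tag response header. This variant is an assumption and was not part of the setup tested above; the header value shown is one common choice, not a requirement.

```yaml
# Fragment of an Ingress manifest, not a complete resource.
metadata:
  annotations:
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
      # Assumed addition: mark responses coming from the backend servers as noindex
      http-response set-header X-Robots-Tag "noindex, nofollow"
```

Crawlers that obey robots.txt never fetch the pages, so they never see this header; it only matters for clients that request pages anyway and choose to honor it.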
