Close Site Search Indexing via Kubernetes HAProxy Ingress

Block search engine indexing on your site in 5 simple steps using HAProxy Kubernetes Ingress Controller annotations and a robots.txt file.

By Alexander Sharov · Oct. 22, 24 · Tutorial

In Kubernetes, Ingress resources act as traffic controllers that provide external access to services within the cluster. Ingress is essential for routing incoming traffic to your service; however, there are scenarios in which you want to prevent search engines from indexing your service's content, for example, when the service is a development environment.

This blog post walks you through blocking search engine indexing of a site served through Kubernetes Ingress by using a robots.txt file, which tells search engine bots not to crawl and index your content.

Prerequisites

To follow this tutorial, you should have a basic grasp of core Kubernetes objects, Ingress resources, and the official HAProxy ingress controller. You will also need access to a Kubernetes cluster and the permissions required to make configuration changes.

For the purposes of this article, I assume that the HAProxy ingress controller is set as the default controller. If it is not, you must add the ingressClassName field to the spec of every Ingress example.
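
For a cluster where HAProxy is not the default controller, the extra field might look like the minimal sketch below. The class name haproxy is an assumption; run kubectl get ingressclass to see which names are registered in your cluster. The resource name, host, and service name are hypothetical placeholders.

```yaml
# Sketch only: ingressClassName sits directly under spec.
# Assumption: the HAProxy IngressClass is named "haproxy".
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress                 # hypothetical name
spec:
  ingressClassName: haproxy             # selects the controller that serves this Ingress
  rules:
    - host: example.com                 # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service   # hypothetical Service
                port:
                  number: 80
```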

Step 1: Create an Ingress Kubernetes Resource

In the first part of our journey, we'll set up a small Ingress resource to expose our service outside of the Kubernetes cluster. Note that at this point all web crawlers still have access to the service. To apply the manifest below, run kubectl apply -f ingress.yaml.

YAML
 
# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 name: dzone-close-site-indexing-haproxy-example
 annotations:
   cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
 rules:
   - host: dzone-close-site-indexing-haproxy-example.referrs.me
     http:
       paths:
         - path: /
           pathType: Prefix
           backend:
             service:
               name: dzone-close-site-indexing-haproxy-example-service
               port:
                 number: 80
 tls:
   - hosts:
       - dzone-close-site-indexing-haproxy-example.referrs.me
     secretName: dzone-close-site-indexing-haproxy-example

Step 2: Modify the Ingress Configuration

The robots.txt file controls how search engine crawlers access your site: it specifies which URLs they may request. The most basic file that blocks crawling of the entire web service looks like this:

Plain Text
 
User-agent: *
Disallow: /


With HAProxy, you do not need to add this file to your web server or website, because HAProxy can answer the request itself. This is done with the following configuration, which should be added to the backend section for the specific group of servers:

Plain Text
 
acl robots_path path_beg /robots.txt
http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path


Kubernetes annotations control how the HAProxy frontend and backend configuration is customized for a single Ingress resource. The full list of HAProxy annotations can be found in the official documentation on GitHub.

In our case, we need the haproxy.org/backend-config-snippet annotation, with the HAProxy snippet shown above as its value. To do this, open your Ingress resource YAML file and add the following annotation to the metadata section:

YAML
 
# file: ingress.yaml
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
 name: dzone-close-site-indexing-haproxy-example
 annotations:
   # …
   haproxy.org/backend-config-snippet: |
     acl robots_path path_beg /robots.txt
     http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
 # ...
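
For reference, merging this annotation into the Step 1 manifest gives a complete file along the following lines. This is a sketch assembled from the two snippets in this article, using the resource name, host, and service from Step 1 and a conventional two-space indentation; adjust it to your own resource.

```yaml
# file: ingress.yaml (Step 1 manifest plus the Step 2 annotation)
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dzone-close-site-indexing-haproxy-example
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # Answer /robots.txt from HAProxy itself instead of the backend service
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
spec:
  rules:
    - host: dzone-close-site-indexing-haproxy-example.referrs.me
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dzone-close-site-indexing-haproxy-example-service
                port:
                  number: 80
  tls:
    - hosts:
        - dzone-close-site-indexing-haproxy-example.referrs.me
      secretName: dzone-close-site-indexing-haproxy-example
```

Keeping the metadata name identical to Step 1 matters: kubectl apply then updates the existing Ingress instead of creating a second one.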


Step 3: Apply the Configuration Changes

After changing the Ingress YAML file, save it and apply it to the Kubernetes cluster with the kubectl command: kubectl apply -f ingress.yaml.

The Ingress controller will detect the changes and update the configuration accordingly.

Step 4: Verify the Configuration

Request the robots.txt file to confirm that indexing prevention is working properly. The HAProxy instance inside the Ingress controller generates this response on the fly from the annotation you supplied, so no file is stored on your web server.

Retrieve the external IP or domain associated with your Ingress resource and append /robots.txt to the URL. Example:

Plain Text
 
$ curl dzone-close-site-indexing-haproxy-example.referrs.me/robots.txt
User-agent: *
Disallow: /


As we can see, the response contains robots.txt rules that disallow crawling of the entire site.

Step 5: Test Indexing Prevention

To verify that search engines are not indexing your site, search for your website on popular search engines. Keep in mind that search results can take a while to reflect changes, so the indexing status may not be updated right away.

Conclusion

Annotations make it easy to avoid search engine indexing while using HAProxy Kubernetes Ingress. By adding the appropriate annotation to your Ingress resource, you can instruct search engine bots not to crawl and index your website's content. A similar approach can be used with other ingress controllers, such as NGINX, Traefik, and others. A similar annotation can also be used for Kubernetes Gateway API resources, which are actively replacing Ingresses.
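
As an illustration of the similar approach for the community ingress-nginx controller, the fragment below is a sketch rather than a tested recipe. It assumes that the nginx.ingress.kubernetes.io/server-snippet annotation is usable in your installation; snippet annotations are often disabled by the cluster administrator, so check before relying on it.

```yaml
# Sketch for ingress-nginx (assumption: snippet annotations are enabled).
# Fragment only: merge it into the metadata of an existing Ingress.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location = /robots.txt {
        default_type text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
      }
```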

As a final note, robots.txt is a time-honored way for website creators to specify whether or not their sites should be crawled by various bots. However, it turns out that AI crawlers from large language model (LLM) companies frequently ignore the contents of robots.txt and crawl your site regardless. To guard against this, use password protection, noindex, or enterprise load balancer features such as HAProxy AI-crawler, which may also be configured as a Kubernetes annotation.
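
One possible way to apply the noindex option at the ingress level is the X-Robots-Tag response header. The fragment below is a sketch under that assumption: it extends the Step 2 snippet with an http-response set-header rule, which is my own illustration and not part of the steps above.

```yaml
# Sketch: annotation fragment only; merge it into the metadata of your Ingress.
# Lines 1-2 of the snippet are the robots.txt rule from Step 2.
# Line 3 is an assumption: it adds a noindex header to responses from this backend.
metadata:
  annotations:
    haproxy.org/backend-config-snippet: |
      acl robots_path path_beg /robots.txt
      http-request return status 200 content-type "text/plain" lf-string "User-agent: *\nDisallow: /\n" if robots_path
      http-response set-header X-Robots-Tag "noindex, nofollow"
```

Keep in mind that a crawler only sees this header on pages it actually fetches. Crawlers that obey the Disallow rule never request those pages, so the header mainly covers pages that are fetched anyway.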
