Noindex vs. Nofollow vs. Disallow

March 29, 2019

 

Portions of the following are adapted from my book, Tech SEO Guide, now available on Amazon.

The differences between the noindex, nofollow, and disallow commands are a common source of confusion. All three are powerful tools for improving a website’s organic search performance, but each is appropriate only in specific situations. Too often they are applied incorrectly, which can significantly harm a website’s search performance.

Two Search Robot Operations

To understand what noindex, nofollow, and disallow commands do, let’s take a step back to consider what search engine robots do. Search engines send around robots to crawl through and understand a website. These robots are complex but have two basic operations.

  • Crawling: Once a robot discovers a website, it crawls through all the pages and files on the website it can find. Limits can be placed on which files and pages a robot can see, and other changes can be made to ensure a robot finds everything it should.
  • Indexing: After a crawl, robots take all the information gathered during that crawl to decide what information contained on a particular page can be and should be shown within search results. As part of this, search engine robots will also decide what search results a website’s pages should be included in (if any) and where the page should rank within those results.

Disallow vs. Noindex vs. Nofollow

Disallow: Controlling Crawling

The first method of controlling a search robot is with a disallow command. This is specified in a robots.txt file, a plain text file placed in the root directory of your website. It provides directives to robots telling them which pages, files, or directories you would prefer they not crawl.

When specified, a search robot that respects this command will not crawl the page, file or directory that has been disallowed. For example, you could specify this on the robots.txt file to discourage the search robot from crawling anything located in /a-secret-directory.

User-agent: *
Disallow: /a-secret-directory

You can also specify a disallow for only a certain robot. For example, this robots.txt file entry instructs Google’s bots to avoid the “my-content-admin-area” directory. However, Bing’s bots could still crawl this directory.

user-agent: googlebot
Disallow: /my-content-admin-area/
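
One way to sanity-check rules like these before deploying them is Python’s standard-library urllib.robotparser. The sketch below is only an illustration: the can_crawl helper and the example.com URLs are made up for this example, and Python’s parser is not Google’s or Bing’s parser, so real crawlers may read edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entry from the example above.
ROBOTS_TXT = """\
user-agent: googlebot
Disallow: /my-content-admin-area/
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
print(can_crawl("googlebot", "https://example.com/my-content-admin-area/login"))  # False
print(can_crawl("bingbot", "https://example.com/my-content-admin-area/login"))    # True
print(can_crawl("googlebot", "https://example.com/blog/"))                        # True
```

Only the robot named in the entry is blocked, and only for URLs under the disallowed directory.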

Disallowed files may still be indexed and appear in search results. For example, Google and Bing may find a link to the disallowed page on your website or elsewhere on the web. They couldn’t crawl the page to see its contents, but they would know the page exists and could still show it in their search results.

Generally, it is best not to disallow anything. In particular, make sure you never disallow JavaScript, CSS, or image files. These files control how the page looks, and Google relies on these design factors to evaluate a page, especially when determining mobile friendliness.

Meta Robots Nofollow: Controlling Crawling

Next, we have the nofollow command. There are actually two different nofollow statements. The nofollow command that controls crawling is the meta robots nofollow. This nofollow is applied at a page-level by specifying the nofollow in a meta robots tag in the page’s <head>.

<html>
<head>
...
<meta name="robots" content="nofollow" />
</head>
<body>
...
</body>
</html>

When placed in the <head> of a web page, the meta nofollow instructs a search engine robot to not crawl any links on the page. This is part of a larger set of directives you can specify within the meta robots tag.

Robots that respect this directive will be able to crawl this page but will not crawl pages linked to from this page. If you do not want robots to crawl to the page at all, let alone links contained on this page, then the robots.txt disallow is the better method of controlling crawling.
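
To confirm which directives a page actually exposes, the meta robots tag can be read with Python’s standard-library html.parser. This is a rough sketch, not how any search engine parses pages; the MetaRobotsParser class and meta_robots_directives function are names invented for this example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives listed in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def meta_robots_directives(html: str) -> set:
    """Return the set of meta robots directives found in html (hypothetical helper)."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="nofollow" /></head><body></body></html>'
print(meta_robots_directives(page))  # {'nofollow'}
```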

Rel Nofollow: Explaining the Nature of the Link

The other nofollow is the rel="nofollow" command. This might influence crawling, but its main purpose is to explain why the link is included. Traditionally, rel="nofollow" was used to mark any links that were sponsored or involved a monetary relationship. Google has since introduced other qualifiers: rel="sponsored" and rel="ugc". The rel="sponsored" qualifier is for any paid link, rel="ugc" is for any link contained within user-generated content, and rel="nofollow" is for any other link you’d rather Google’s bots not associate with your website.

These rel qualifiers are specified at the link level with a “rel” attribute added to a specific <a> tag. For example, the link below is nofollowed, so the /no-robots-here page would not be associated with your website.

<a href="/no-robots-here" rel="nofollow">Link</a>
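
When auditing a page, it can help to list every link alongside its rel qualifiers. The following is a minimal sketch using Python’s standard-library html.parser; LinkRelParser and link_rels are hypothetical names for this example, and the /about link is made up for comparison.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Records each <a href="..."> together with its rel tokens."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, set of rel tokens)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        if "href" in attr:
            # rel is a space-separated list, e.g. rel="nofollow ugc".
            rel_tokens = set(attr.get("rel", "").lower().split())
            self.links.append((attr["href"], rel_tokens))


def link_rels(html: str) -> list:
    """Return (href, rel tokens) for every link in html (hypothetical helper)."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links


snippet = '<a href="/no-robots-here" rel="nofollow">Link</a> <a href="/about">About</a>'
for href, rels in link_rels(snippet):
    print(href, sorted(rels))
```

A link with no rel attribute comes back with an empty set, meaning no qualifier was applied.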

Noindex: Controlling Indexing

The “noindex” command can be specified on a page within the meta robots tag. When the meta noindex tag is included on a page, search robots are allowed to crawl the page but are discouraged from indexing the page (meaning the page won’t be included within search results if this command is respected).

Example:

<meta name="robots" content="noindex" />

A couple of notes:

  • You previously could specify a noindex in the robots.txt file. However, Google no longer honors this (and never officially supported it). Without that support, noindex has to be specified for each page or file individually.
  • If you can’t add a meta tag to the page’s <head>, you can also send an X-Robots-Tag HTTP response header. This can be helpful for noindexing non-HTML content, such as PDFs or some images.
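
As a rough illustration of the HTTP header approach, here is a tiny WSGI application in Python that adds X-Robots-Tag: noindex to PDF responses. It is a sketch under stated assumptions: in practice this header is usually set in the web server or CMS configuration, and the noindex_pdfs_app and response_headers names, the placeholder bodies, and the file paths are all invented for this example.

```python
def noindex_pdfs_app(environ, start_response):
    """Tiny WSGI app that marks PDF responses as noindex via an HTTP header."""
    path = environ.get("PATH_INFO", "/")
    is_pdf = path.lower().endswith(".pdf")
    content_type = "application/pdf" if is_pdf else "text/html; charset=utf-8"
    headers = [("Content-Type", content_type)]
    if is_pdf:
        # Header equivalent of <meta name="robots" content="noindex" />.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    # Placeholder bodies; a real app would serve actual files.
    return [b"%PDF-placeholder" if is_pdf else b"<html><body>Hello</body></html>"]


def response_headers(path: str) -> dict:
    """Call the app for path and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    noindex_pdfs_app({"PATH_INFO": path}, start_response)
    return captured


print(response_headers("/files/price-list.pdf").get("X-Robots-Tag"))  # noindex
print(response_headers("/index.html").get("X-Robots-Tag"))            # None
```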

Using Noindex and Disallow

It is important to be clear on how the Disallow and Noindex commands work together. There are three ways these commands can be combined to affect indexing and crawling.

             Disallow   Noindex
Scenario 1              X
Scenario 2   X
Scenario 3   X          X

In Scenario 1, the page with a noindex setting will not be included in a search result. However, a robot may still crawl the page, meaning the robots can access content on the page and follow links on the page.

In Scenario 2, the page will not be crawled but may be indexed and appear in search results. Because the robot did not crawl the page, the robot knows nothing about its contents. Anything shown about this page in a search result will be gathered from other sources, like links to the page.

Scenario 3 will operate exactly like Scenario 2 if the noindex was specified within the meta robots tag. This is because when a Disallow is specified, a robot will not crawl the page. If the robot doesn’t crawl the page, it will not see the meta tag telling it not to index the page. If a page needs to be both noindexed and disallowed, set the noindex first. Then, after the page has been removed from the search index, add the disallow.
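
The reasoning behind the three scenarios can be written out as a small Python sketch. It is a simplified model that assumes the robot respects both directives and that the noindex lives in a meta robots tag; Outcome and expected_outcome are names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    crawled: bool         # can the robot fetch the page?
    noindex_seen: bool    # does the robot ever read the meta noindex?
    may_be_indexed: bool  # could the URL still show up in search results?


def expected_outcome(disallow: bool, meta_noindex: bool) -> Outcome:
    """Model how a directive-respecting robot treats a page (simplified)."""
    crawled = not disallow
    # The meta tag is only visible if the page itself is crawled.
    noindex_seen = crawled and meta_noindex
    # Without a noindex that was actually seen, the URL may still be indexed.
    may_be_indexed = not noindex_seen
    return Outcome(crawled, noindex_seen, may_be_indexed)


scenario_1 = expected_outcome(disallow=False, meta_noindex=True)
scenario_2 = expected_outcome(disallow=True, meta_noindex=False)
scenario_3 = expected_outcome(disallow=True, meta_noindex=True)

print(scenario_1)                # crawled, noindex seen, kept out of the index
print(scenario_3 == scenario_2)  # True: the disallow hides the noindex
```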

Nofollow Guidelines

When to Use Nofollow to Control Crawls?

Generally, robots should be told they can follow all links on a page. Being too aggressive in specifying which links to follow or nofollow can begin to look as if the website is attempting to manipulate a robot’s perception of the website. This practice is known as PageRank sculpting, where nofollow commands are used to sculpt how signals from one page are passed to another. At best, these attempts to manipulate a robot no longer work. At worst, attempts to manipulate robots with rel nofollow can lead to a penalty.

When To Use Rel Qualifiers On Links

Rel="nofollow", rel="sponsored", or rel="ugc" should be used in specific instances where you need to clearly signal the nature of the link. The prime example is a link that was paid for. For example, if a blog post includes links to ads, those links should have a rel="sponsored" or rel="nofollow" attribute. With the additional qualifiers, Google has also made it clear that links within user-generated content should carry the rel="ugc" qualifier.

Disallow, Noindex, or Nofollow Are Optional

Disallow, Noindex, and Nofollow are optional: robots don’t have to follow any of these commands. Really, the word command is a bit of an overstatement. These directives are recommendations, and Google’s bots can ignore any one of them. Often, a robot ignoring these commands is a sign of a bigger problem with how robots understand and crawl your website. In these situations, you want to research what that bigger problem is and address it instead of only re-tooling your noindex, disallow, or nofollow commands.

Also, because these commands are optional, you should not rely on them for any critical aspects of your website. If an area of a website should not be publicly accessible, or if you want to ensure a part of your website doesn’t end up in a Google search result, you should consider alternatives. A common example is a staging website, which you clearly don’t want Google’s bots crawling and definitely don’t want indexed. On a staging website, a disallow or noindex isn’t enough of a guarantee that bots will leave the site alone. Instead, you’d want to require a login to access that staging site. A login isn’t optional and can’t be ignored, which means that bots won’t be able to crawl or index the site.
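
As one hedged illustration of requiring a login, the sketch below wraps a Python WSGI application in HTTP Basic authentication. It is only one possible approach: many sites would configure this in the web server or hosting platform instead, Basic authentication should only be used over HTTPS, and the require_basic_auth name, the "staging" realm, and the credentials shown are all placeholders invented for this example.

```python
import base64
import hmac


def require_basic_auth(app, username: str, password: str, realm: str = "staging"):
    """Wrap a WSGI app so every request must carry valid Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    expected = f"Basic {token}".encode()

    def protected(environ, start_response):
        supplied = environ.get("HTTP_AUTHORIZATION", "").encode()
        # compare_digest avoids leaking information through timing differences.
        if hmac.compare_digest(supplied, expected):
            return app(environ, start_response)
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", f'Basic realm="{realm}"'),
            ("Content-Type", "text/plain; charset=utf-8"),
        ])
        return [b"Login required\n"]

    return protected


def staging_site(environ, start_response):
    """Stand-in for the real staging application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Staging content\n"]


# Placeholder credentials; real ones should come from configuration, not source code.
protected_site = require_basic_auth(staging_site, "editor", "change-me")
```

A bot that requests any URL without credentials receives a 401 response rather than the page, so there is nothing for it to crawl or index.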

Summarizing Robot Directives

The biggest thing to remember is there are two operations: crawling and indexing. We can control or influence both of these using different directives.

To sum up, those directives are:

  • Disallow tells a robot not to crawl a page, file, or directory.
  • Noindex tells a robot not to index the page.
  • Meta nofollow tells a robot not to follow any links on a page.
  • Rel="nofollow" (or rel="sponsored" or rel="ugc") further qualifies the nature of a specific link.

Use Disallow, Noindex, Meta Nofollow and rel qualifiers sparingly and only after carefully considering all implications about how their use will affect your website’s SEO performance. As you use these, make sure you aren’t blocking robots from seeing important parts of your website—such as JavaScript, CSS, or image files. When in doubt, don’t add any directive.

Testing Robot Commands

If you decide to use robot commands, you want to test them to make sure robots are understanding the commands correctly. While you can use crawl tools to help with this, a simpler method for testing is within Google Search Console.

Testing Robots.txt

In Google Search Console, you can check your current robots.txt file to see which pages, if any, are currently blocked from Google. This tester isn’t available in the main Google Search Console navigation, but it is still available as a legacy tool.

On this page, you will see your website’s current robots.txt file. Below the robots.txt file, you can enter in URLs from your website and test to see if Google would be prevented from crawling this page due to the robots.txt file. In this example, the wp-admin directory is blocked from crawling but all other URLs should be allowed for crawling.

[Screenshot: robots.txt tester]

Testing Crawlability and Indexability

The other way to test whether robots can crawl or index a page is the URL Inspection tool in Google Search Console. In the new Google Search Console, enter the URL you wish to test.

[Screenshot: URL tester in GSC]

After the results load, the coverage report shows whether crawling and indexing are allowed. In this example, both are allowed, which is the intended response. If, however, I had specified a noindex or disallow for this page, the corresponding crawl allowed or indexing allowed answer should be “no”.

Get Help

If you need help, let’s talk before you implement any changes. Or, for more information on noindex, nofollow, disallow, and other technical SEO subjects, please refer to the Tech SEO Guide in paperback or Kindle on Amazon. Now available for only $9.99!

 

 
