Noindex vs. Nofollow vs. Disallow
Portions of the following are adapted from my book, Tech SEO Guide, now available on Amazon.
The noindex, nofollow, and disallow commands are frequently confused with one another. All three are powerful tools for improving a website’s organic search performance, but each is appropriate in different situations. Sadly, they are often applied incorrectly, which can significantly harm a website’s search performance.
Two Search Robot Operations
To understand what the noindex, nofollow, and disallow commands do, let’s take a step back and consider what search engine robots do. Search engines send out robots to crawl through and understand a website. These robots are complex but have two basic operations.
- Crawling: Once a robot discovers a website, it crawls through all the pages and files on the website it can find. Limits can be placed on which files and pages a robot can see, and other changes can be made to ensure a robot finds everything it should.
- Indexing: After a crawl, robots take all the information gathered during that crawl to decide what information contained on a particular page can be and should be shown within search results. As part of this, search engine robots will also decide what search results a website’s pages should be included in (if any) and where the page should rank within those results.
Disallow vs. Noindex vs. Nofollow
Disallow: Controlling Crawling
The first method of controlling a search robot is the disallow command, which is specified in the robots.txt file. A search robot that honors this command will not crawl the page, file, or directory that has been disallowed. However, disallowed files may still be indexed and appear in search results.
For example, adding the line Disallow: /a-secret-directory under a User-agent: * group in the robots.txt file discourages search robots from crawling anything located in /a-secret-directory.
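To see how a compliant robot would read such a rule, here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the ExampleBot user agent, and the can_crawl helper are assumptions made for illustration; real search engines may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; only the directory name comes from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /a-secret-directory
"""

def can_crawl(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a robot that honors robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# "ExampleBot" is a made-up user agent name.
print(can_crawl(ROBOTS_TXT, "ExampleBot", "/a-secret-directory/page.html"))  # False
print(can_crawl(ROBOTS_TXT, "ExampleBot", "/public-page.html"))              # True
```

Note that this only answers whether crawling is discouraged. It says nothing about whether the page may still be indexed.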
Meta Robots Nofollow: Controlling Crawling
Next, we have the nofollow command. There are actually two different nofollow statements. The one that controls crawling is the meta robots nofollow, which is applied at the page level by specifying nofollow in a meta robots tag. When placed in the <head> of a web page, this tag instructs a search engine robot not to follow any links on the page. Nofollow is one of a larger set of directives you can specify within the meta robots tag. Example:
<meta name="robots" content="nofollow" />
Rel Nofollow: Explaining the Nature of the Link
The other nofollow is the rel="nofollow" command. This might influence crawling, but its bigger purpose is to explain why the link is included. Traditionally, rel="nofollow" was used to mark any links that were sponsored or involved a monetary relationship. Google has since introduced other qualifiers: rel="sponsored" and rel="ugc". The rel="sponsored" qualifier is for any paid link, rel="ugc" is for any link contained within user-generated content, and rel="nofollow" is for any other link you’d rather Google’s bots not associate with your website.
These rel commands are specified at the link level with a rel attribute added to a specific <a> tag. For example, the following link is nofollowed, so the /no-robots-here page would not be associated with your website.
<a href="/no-robots-here" rel="nofollow">Link</a>
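When auditing a page, it can help to list which links carry which rel qualifiers. The following is a rough sketch using Python’s standard html.parser module. The LinkRelParser class and the link_qualifiers function are hypothetical names introduced here, not part of any SEO tool.

```python
from html.parser import HTMLParser

class LinkRelParser(HTMLParser):
    """Collect (href, rel tokens) for every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "nofollow ugc".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((attr_map.get("href"), rel_tokens))

def link_qualifiers(html: str):
    """Return a list of (href, rel tokens) pairs found in the HTML."""
    parser = LinkRelParser()
    parser.feed(html)
    parser.close()
    return parser.links

print(link_qualifiers('<a href="/no-robots-here" rel="nofollow">Link</a>'))
```

A link with no rel attribute comes back with an empty token list, which makes unqualified links easy to spot.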
Noindex: Controlling Indexing
The “noindex” command can be specified on a page within the meta robots tag. When the meta noindex tag is included on a page, search robots are allowed to crawl the page but are discouraged from indexing the page (meaning the page won’t be included within search results). Example:
<meta name="robots" content="noindex" />
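To check which meta robots directives a page actually declares, a small parser along these lines could be used. This is only a sketch built on Python’s standard html.parser module. The MetaRobotsParser class and the meta_robots_directives function are hypothetical names, and the sketch looks only at meta tags named "robots", not at bot-specific tags.

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collect the directives listed in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        # Directives are comma-separated, e.g. "noindex, nofollow".
        content = attr_map.get("content") or ""
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())

def meta_robots_directives(html: str) -> set:
    """Return the set of meta robots directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    parser.close()
    return parser.directives

page = '<head><meta name="robots" content="noindex" /></head>'
print("noindex" in meta_robots_directives(page))  # True
```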
Note that you previously could specify a noindex on the robots.txt file. However, this is no longer supported by Google.
Using Noindex and Disallow
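It is important to be clear on how the disallow and noindex commands work together. There are three ways these commands can be combined to affect indexing and crawling: Scenario 1 sets a noindex without a disallow, Scenario 2 sets a disallow without a noindex, and Scenario 3 sets both.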
In Scenario 1, the page with a noindex setting will not be included in a search result. However, a robot may still crawl the page, meaning the robots can access content on the page and follow links on the page.
In Scenario 2, the page will not be crawled but may still be indexed and appear in search results. Because the robot did not crawl the page, the robot knows nothing about its content. Any information shown about this page in a search result will be gathered from other sources, like links to the page.
Scenario 3 will operate exactly like Scenario 2 if the noindex was specified within the meta robots tag. This is because when a disallow is specified, a robot will not crawl the page, and a robot that doesn’t crawl the page will never see the meta tag telling it not to index the page. If a page needs to be both noindexed and disallowed, set the noindex first. Then, once the page has been removed from the search index, set the disallow.
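The three scenarios (noindex only, disallow only, and both) can be summarized as a tiny decision function. This is a simplified model of a robot that honors both commands, not a description of how any particular search engine is implemented. The Outcome class and the expected_outcome function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    crawled: bool         # does the robot fetch the page?
    may_be_indexed: bool  # can the page still appear in search results?

def expected_outcome(disallowed: bool, meta_noindex: bool) -> Outcome:
    """Model how a compliant robot treats a page under each combination."""
    if disallowed:
        # The robot never fetches the page, so it cannot see a meta noindex.
        # The page may still be indexed from other sources, such as links.
        return Outcome(crawled=False, may_be_indexed=True)
    # The page is crawled, so the meta noindex (if present) is seen and honored.
    return Outcome(crawled=True, may_be_indexed=not meta_noindex)

print(expected_outcome(disallowed=False, meta_noindex=True))  # Scenario 1
print(expected_outcome(disallowed=True, meta_noindex=False))  # Scenario 2
print(expected_outcome(disallowed=True, meta_noindex=True))   # Scenario 3
```

Scenarios 2 and 3 return the same outcome, which is why the noindex needs to take effect before the disallow is added.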
When to Use Nofollow to Control Crawls?
Generally, robots should be told they can follow all links on a page. Being too aggressive in specifying which links to follow or nofollow can begin to look as if the website is attempting to manipulate a robot’s perception of the website. This practice is known as PageRank sculpting, where nofollow commands are used to sculpt how signals pass from one page to another. At best, these attempts to manipulate a robot no longer work. At worst, attempts to manipulate robots with rel nofollow can lead to a penalty.
When To Use Rel Qualifiers On Links
Rel="nofollow", rel="sponsored", or rel="ugc" should be used in specific instances where you need to clearly signal the nature of the link. The prime example is a link on a page where a payment was made in exchange for the link. For example, if a blog post includes links to ads, those links should have a rel="nofollow" attribute or, under Google’s newer qualifiers, rel="sponsored". Likewise, with the additional qualifiers, Google is making it clear that links within user-generated content should carry the rel="ugc" qualifier.
Disallow, Noindex, or Nofollow Are Optional
Disallow, noindex, and nofollow are optional: robots don’t have to follow any of these commands. Really, the word command is a bit of an overstatement. These directives are recommendations, and Google’s bots can ignore any one of them. When these commands are ignored, it is often a sign of a bigger problem with how robots understand and crawl your website. In these situations, you want to research what that bigger problem is and address it instead of only re-tooling your noindex, disallow, or nofollow commands.
Because these commands are optional, you should not rely on them for any critical aspects of your website. If an area of a website should not be publicly accessible, or if you want to ensure a part of your website doesn’t end up in a Google search result, you should consider alternatives. Staging websites are a common problem area: you clearly don’t want Google’s bots crawling them and definitely don’t want them indexed. On a staging website, a disallow or noindex isn’t enough of a guarantee that bots will leave the site alone. Instead, you’d want to require a login to access the staging site. A login isn’t optional and can’t be ignored, which means bots won’t be able to crawl or index the site.
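As one illustration of what requiring a login can look like, the sketch below checks an HTTP Basic Authorization header before any staging content would be served. It assumes Basic authentication is the chosen login method, which is only one of several options. The is_authorized function and the sample credentials are hypothetical, and a real deployment would normally rely on the web server’s or framework’s built-in authentication over HTTPS.

```python
import base64
import hmac

def is_authorized(authorization_header, expected_user, expected_password):
    """Return True only if the header carries the expected Basic credentials."""
    prefix = "Basic "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    try:
        raw = base64.b64decode(authorization_header[len(prefix):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok

# Made-up credentials for illustration only.
header = "Basic " + base64.b64encode(b"editor:s3cret").decode("ascii")
print(is_authorized(header, "editor", "s3cret"))  # True
print(is_authorized(None, "editor", "s3cret"))    # False
```

A robot that arrives without credentials never receives the page content, regardless of whether it honors robots.txt or meta robots tags.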
To sum up:
- Disallow tells a robot not to crawl a page, file, or directory.
- Noindex tells a robot not to index the page.
- Meta robots nofollow tells a robot not to follow any links on a page.
- Rel="nofollow" (or rel="sponsored" or rel="ugc") further qualifies the nature of a specific link.
Use Disallow, Noindex, and Nofollow sparingly and only after carefully considering all implications about how their use will affect your website’s SEO performance. If you need help, let’s talk before you implement any changes.