Noindex vs. Nofollow vs. Disallow
March 29, 2019
There are three main tools for controlling what search engine robots do on a website: the noindex, nofollow, and disallow commands. All three can improve a website’s organic search performance, but each suits different situations. Applied incorrectly, any of them can significantly harm a website’s search performance.
Table of Contents
- Robot Operations
- Disallow vs. Noindex vs. Nofollow
- Using Noindex & Disallow
- When To Noindex
- When To Nofollow
- When To Use Rel Qualifiers
- Disallow, Noindex, or Nofollow Are Optional
- Summarizing Robot Directives
- Testing Robot Commands
Two Search Robot Operations
To understand what noindex, nofollow, and disallow commands do, let’s take a step back to consider what search engine robots do. Search engines send around robots to crawl through and understand a website. These robots are complex but have two basic operations.
- Crawling: Once a robot discovers a website, it crawls through all the pages and files on the website it can find. Limits can be placed on which files and pages a robot can see, and other changes can be made to ensure a robot finds everything it should.
- Indexing: After a crawl, robots take all the information gathered during that crawl to decide what information contained on a particular page can be and should be shown within search results. As part of this, search engine robots will also decide what search results a website’s pages should be included in (if any) and where the page should rank within those results.
Disallow vs. Noindex vs. Nofollow
Disallow: Controlling Crawling
The first method of controlling a search robot is the disallow command, which is specified in a robots.txt file. The robots.txt file is a plain text file placed in the root directory of your website. It provides directives to robots telling them which pages, files, or directories you would prefer they not crawl.
When specified, a search robot that respects this command will not crawl the page, file, or directory that has been disallowed. For example, adding the line “Disallow: /a-secret-directory” under “User-agent: *” in the robots.txt file discourages every search robot from crawling anything located in /a-secret-directory.
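As a rough illustration of that rule, here is a minimal Python sketch that feeds a robots.txt body to the standard library's robots.txt parser and asks whether a URL may be crawled. The example.com URLs and the is_crawl_allowed helper are placeholders made up for this sketch, and Python's parser is only an approximation of how a real search robot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: every robot is asked to stay out of
# /a-secret-directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /a-secret-directory
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
secret_allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/a-secret-directory/page.html"
)  # False: the path falls under the disallowed directory
blog_allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/blog/"
)  # True: no rule covers this path
```

Keep in mind this only checks what the file asks for. Whether a given robot honors the request is up to that robot.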
You can also specify a disallow for only a certain robot. For example, a robots.txt entry that pairs “User-agent: Googlebot” with “Disallow: /my-content-admin-area” instructs Google’s bots to avoid the “my-content-admin-area” directory. Bing’s bots, which the entry does not name, could still crawl this directory.
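A robot-specific rule can be sketched the same way. In this Python example, the robots.txt body names only Googlebot, so a different user agent is not affected. The URL and the "bingbot" user-agent token are assumptions for illustration, and the standard library parser is a stand-in for a real robot's behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: only Googlebot is asked to avoid the
# admin directory. No other robot is mentioned.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /my-content-admin-area
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder domain.
ADMIN_URL = "https://example.com/my-content-admin-area/settings"

google_allowed = parser.can_fetch("Googlebot", ADMIN_URL)  # False
bing_allowed = parser.can_fetch("bingbot", ADMIN_URL)      # True
```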
Disallowed files may still be indexed and appear in search results. For example, Google and Bing may find a link to the disallowed page on your website or elsewhere on the web. They couldn’t crawl the page to see its contents, but they would know the page exists and could still show it in their search results.
Meta Robots Nofollow: Controlling Crawling
Next, we have the nofollow command. There are actually two different nofollow statements. The nofollow command that controls crawling is the meta robots nofollow. This nofollow is applied at the page level by specifying nofollow in a meta robots tag in the page’s <head>.
<meta name="robots" content="nofollow" />
When placed in the <head> of a web page, the meta nofollow instructs a search engine robot to not crawl any links on the page. This is part of a larger set of directives you can specify within the meta robots tag.
Robots that respect this directive will be able to crawl this page but will not crawl pages linked to from this page. If you do not want robots to crawl the page at all, let alone the links it contains, then the robots.txt disallow is the better method of controlling crawling.
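To check which meta robots directives a page actually carries, something like the following Python sketch could be used. It relies only on the standard library's HTML parser. The MetaRobotsParser class, the meta_robots_directives function, and the sample page are all hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def meta_robots_directives(html: str) -> set:
    """Return the set of meta robots directives present in html."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only to exercise the function.
SAMPLE_PAGE = """
<html><head>
<meta name="robots" content="nofollow" />
</head><body><a href="/some-page">A link</a></body></html>
"""

directives = meta_robots_directives(SAMPLE_PAGE)
links_should_be_followed = "nofollow" not in directives  # False for this page
```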
Rel Nofollow & Rel Qualifiers: Explaining the Nature of the Link
The other nofollow is the rel="nofollow" command, which is one of several rel qualifiers. Rel qualifiers do not control how robots crawl. Instead, they explain the nature of a link to the robots that encounter it. Traditionally, rel="nofollow" was used to mark any links that were sponsored or involved a monetary relationship. Google has since introduced two other qualifiers: rel="sponsored" and rel="ugc". The rel="sponsored" qualifier is for any paid link, rel="ugc" is for any link contained within user-generated content, and rel="nofollow" is for any other link you’d rather Google’s bots not associate with your website.
These rel commands are specified at the link level with a “rel” attribute added to a specific <a> tag. For example, the link below is nofollowed, so the /no-robots-here page it points to wouldn’t be associated with your website.
<a href="/no-robots-here" rel="nofollow">Link</a>
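When auditing a page, it can help to list each link alongside its rel qualifiers. The Python sketch below does this with the standard library's HTML parser. The LinkRelParser class, the link_rel_tokens function, and the /about link are hypothetical and exist only for this illustration. It assumes the rel values are separated by spaces.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Records the rel tokens attached to each <a href="..."> tag."""

    def __init__(self):
        super().__init__()
        self.links = {}  # maps href -> set of rel tokens

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        if "href" in attr:
            # Assumes space-separated rel values, e.g. rel="nofollow ugc".
            self.links[attr["href"]] = set(attr.get("rel", "").lower().split())


def link_rel_tokens(html: str) -> dict:
    """Return a mapping of each link's href to its rel qualifiers."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links


# Hypothetical markup: one qualified link and one ordinary link.
SAMPLE_HTML = """
<a href="/no-robots-here" rel="nofollow">Link</a>
<a href="/about">About</a>
"""

links = link_rel_tokens(SAMPLE_HTML)
# links["/no-robots-here"] is {"nofollow"}; links["/about"] is an empty set
```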
Noindex: Controlling Indexing
The “noindex” command can be specified on a page within the meta robots tag. When the meta noindex tag is included on a page, search robots are allowed to crawl the page but are discouraged from indexing the page (meaning the page won’t be included within search results if this command is respected).
<meta name="robots" content="noindex" />
A couple of notes:
- You previously could specify a noindex in the robots.txt file. However, Google no longer honors this (and never officially supported it). Given that lack of support, noindex has to be specified on the page itself or, as noted next, in the HTTP response for that page or file.
- If you can’t add a meta tag to the page’s <head>, you can also send the X-Robots-Tag HTTP response header. This can be helpful for noindexing non-HTML content, such as PDFs or some images.
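How the header gets added depends entirely on your web server or framework, so the following Python sketch shows only the decision itself: given a request path, which extra response headers should be sent. The robots_headers function and the list of file extensions are assumptions made up for this example. Only the X-Robots-Tag header name and its noindex value are the real mechanism.

```python
from pathlib import PurePosixPath

# Hypothetical choice of file types to keep out of search results.
NOINDEX_EXTENSIONS = {".pdf", ".png", ".jpg"}


def robots_headers(request_path: str) -> dict:
    """Return the extra response headers to send for request_path."""
    extension = PurePosixPath(request_path).suffix.lower()
    if extension in NOINDEX_EXTENSIONS:
        # Same effect as a meta robots noindex, but usable on non-HTML files.
        return {"X-Robots-Tag": "noindex"}
    return {}


pdf_headers = robots_headers("/downloads/price-list.pdf")
page_headers = robots_headers("/blog/")
```

Whatever serves your files would then merge the returned headers into the response it sends for that path.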
Using Noindex and Disallow
It is important to be clear on how the Disallow and Noindex commands work together. There are three ways these commands can be combined to affect indexing and crawling: noindex alone (Scenario 1), disallow alone (Scenario 2), or both noindex and disallow on the same page (Scenario 3).
In Scenario 1, the page with a noindex setting will not be included in a search result. However, a robot may still crawl the page, meaning the robot can access content on the page and follow links on the page.
In Scenario 2, the page will not be crawled but may be indexed and appear in search results. Because the robot did not crawl the page, the robot knows nothing about its content. Any information shown about this page in a search result will be gathered from other sources, like links to the page.
Scenario 3 will operate exactly like Scenario 2 if the noindex was specified within the meta robots tag. This is because when a Disallow is specified, a robot will not crawl the page. If the robot doesn’t crawl the page, it will not see the meta tag telling it not to index the page. If a page needs to be both noindexed and disallowed, set the noindex first. Then, after the page has been removed from the search index, set the disallow.
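The interaction between the three scenarios can be modeled in a few lines of Python. This is a deliberately simplified sketch that assumes a robot which respects both commands, and it reads Scenario 1 as noindex alone, Scenario 2 as disallow alone, and Scenario 3 as both together. The Outcome type and robot_outcome function are hypothetical names for this illustration.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    crawled: bool                 # did the robot fetch the page?
    noindex_seen: bool            # did the robot read the meta noindex?
    may_appear_in_results: bool   # could the URL still show up in search?


def robot_outcome(disallowed: bool, meta_noindex: bool) -> Outcome:
    """Model a compliant robot's handling of one page."""
    crawled = not disallowed
    # A meta tag can only be read if the page itself is crawled.
    noindex_seen = crawled and meta_noindex
    # Without a noindex the robot has seen, the URL may still be indexed,
    # for example from links pointing to it.
    may_appear_in_results = not noindex_seen
    return Outcome(crawled, noindex_seen, may_appear_in_results)


scenario_1 = robot_outcome(disallowed=False, meta_noindex=True)
scenario_2 = robot_outcome(disallowed=True, meta_noindex=False)
scenario_3 = robot_outcome(disallowed=True, meta_noindex=True)
```

In this model, scenario_2 and scenario_3 come out identical, which is the reason to set the noindex first and add the disallow only after the page has dropped out of the index.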
When To Noindex
Generally, it should be left up to robots to decide what should or should not be indexed. However, there are times you’d want to decide on behalf of the search robot and take more control over what pages appear in search results. There are two main questions to consider regarding noindex commands. If you currently have pages noindexed, it is good to regularly use these questions to reconsider the reasons for those pages being noindexed.
Question #1: Search Result Page Clicks
Noindex is a tool that prevents pages from appearing in search results. Accordingly, the question to consider before using this tool is: which pages aren’t a good representation of your website in search results? After people conduct a search and begin reviewing the results, they look to click on the best pages on the best websites, where “best” means the pages and websites that seem to most closely match the searcher’s interests and intentions. Go through your website and compare your pages to the search terms those pages rank for. Which pages aren’t a good representation of your website for searchers, and which pages are unlikely to entice a click given the search results they do (or could) rank in?
Question #2: Quality Entry Point
It isn’t just about how the pages appear on search results and if the page can entice a searcher to click to your website. It is also about ensuring the page is a good entry point from a search result. After somebody clicks a search result to come to your website, you want that searcher to be fully satisfied by the page. Forget any possible benefits people staying on your website could have for SEO—instead, think about all the benefits people staying on your website can have for your business. What pages are the best entry point from search results and have the best opportunity to satisfy visitors?
Standard Pages to Noindex
There are a few standard types of pages that should almost always be noindexed before evaluating performance in search results, assuming this type of content does appear on your website.
- Category or tag pages on a blog—it might be better for people to find a relevant blog post or the main blog page from the search result.
- Landing pages—a landing page for an email or advertisement campaign shouldn’t be indexed in a search result since you don’t want organic search traffic coming to this page too.
- Duplicate content—if two pages share the same content, one of the pages can be set to noindex to prevent the pages from competing against each other to appear in the same search results.
When To Nofollow
Generally, robots should be told they can follow all links on a page. Being too aggressive in specifying which links to follow or nofollow can begin to look like an attempt to manipulate a robot’s perception of the website. This practice is known as PageRank sculpting, where nofollow commands are used to sculpt how signals from one page are passed to another. At best, these attempts to manipulate a robot no longer work. At worst, attempts to manipulate robots with rel nofollow can lead to a penalty.
When To Use Rel Qualifiers On Links
Rel="nofollow", rel="sponsored", or rel="ugc" should be used in specific instances where you need to clearly signal the nature of the link. The prime example is a link that was placed on a page in exchange for payment. For example, if a blog post includes links to ads, those links should carry rel="sponsored" or rel="nofollow". Similarly, with the additional qualifiers, Google is making it clear that any user-generated links should carry rel="ugc".
Disallow, Noindex, or Nofollow Are Optional
Disallow, Noindex and Nofollow are optional—robots don’t have to follow any of these commands. Really, the word command is a bit of an overstatement. These directives are recommendations. Google’s bots can ignore any one of these recommendations. Often, these commands being ignored is a sign of a bigger problem about robots incorrectly understanding how to crawl your website. In these situations, you want to research what that bigger problem is and address that instead of only re-tooling your noindex, disallow, or nofollow commands.
Because these commands are optional, you should not rely on them for any critical aspect of your website. If an area of a website should not be publicly accessible, or if you want to ensure a part of your website doesn’t end up in a Google search result, you should consider alternatives. A common area where this becomes a problem is staging websites, which you clearly don’t want Google’s bots crawling and definitely don’t want indexed. On a staging website, a disallow or noindex isn’t enough of a guarantee that bots will leave the site alone. Instead, you’d want to require a login to access the staging site. A login isn’t optional and can’t be ignored, so bots won’t be able to crawl or index the site.
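One simple form of login is HTTP Basic authentication, which most web servers can enforce without any custom code. Purely to illustrate the idea, here is a minimal Python sketch of the check such a login performs. The function names and the "staging" / "change-me" credentials are placeholders invented for this example, and a real staging site would normally rely on its server's or framework's built-in authentication instead.

```python
import base64
import hmac


def is_authorized(authorization_header, username: str, password: str) -> bool:
    """Return True if the header carries matching HTTP Basic credentials."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    expected = f"{username}:{password}"
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def staging_status(authorization_header, username="staging", password="change-me") -> int:
    """Return the HTTP status a staging page would answer with."""
    return 200 if is_authorized(authorization_header, username, password) else 401


# A robot sends no credentials, so it only ever receives a 401.
robot_status = staging_status(None)

# A team member's browser sends the credentials after the login prompt.
valid_header = "Basic " + base64.b64encode(b"staging:change-me").decode("ascii")
member_status = staging_status(valid_header)
```

Unlike a disallow or noindex, the 401 response gives a robot nothing to crawl, whether or not it chooses to cooperate.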
Summarizing Robot Directives
The biggest thing to remember is there are two operations: crawling and indexing. We can control or influence both of these using different directives.
To sum up, those directives are:
- Disallow tells a robot not to crawl a page, file, or directory.
- Noindex tells a robot not to index the page.
- Meta nofollow tells a robot not to follow any of the links on a page.
- Rel="nofollow" (or rel="sponsored" or rel="ugc") further qualifies the nature of a specific link.
Testing Robot Commands
If you decide to use robot commands, you want to test them to make sure robots are understanding the commands correctly. While you can use crawl tools to help with this, a simpler method for testing is within Google Search Console.
In Google Search Console, you can check your current robots.txt file to see which pages, if any, are currently listed as pages you do not want Google to access. This isn’t currently available within the navigation in Google Search Console, but is available as a legacy tool.
On this page, you will see your website’s current robots.txt file. Below the robots.txt file, you can enter in URLs from your website and test to see if Google would be prevented from crawling this page due to the robots.txt file. In this example, the wp-admin directory is blocked from crawling but all other URLs should be allowed for crawling.
Testing Crawlability and Indexability
The other method of testing whether robots can crawl or index a page within Google Search Console is the URL Inspection tool. In the new Google Search Console, enter a URL you wish to test.
After the results load, you can see within the coverage report whether crawling and indexing are allowed. In this example, both are allowed, which is the intended response. If, however, I had specified a noindex or disallow for this page, the “Crawl allowed?” or “Indexing allowed?” answer should be “No”.