Using Log File Analysis To Improve Your SEO Performance

May 08, 2020

A big part of SEO, and certainly the technical side of SEO, is understanding how Google’s bots are crawling through and interpreting your website. We need to know if the bots are finding every page, if bots are crawling often enough to see changes, if bots are running into sections that confuse or trap them, and if bots can “see” the website correctly. One of the best ways to do this is by reviewing your website’s log files. Let’s talk about what log files are and how you can start analyzing them to improve your SEO performance.

What Are Log Files?

Log files are a record of every file accessed on your website. Within the log file, you can see what files were accessed, when those files were accessed, some information about who accessed those files, and where people found the website.

What files were accessed isn’t just the pages that were loaded. Let’s say somebody visits your website and loads your About Us page at about-us.html. That visitor needs to load the HTML file of about-us.html, but also needs to load every image located on that page. That visitor also needs to load the JavaScript and CSS files required to see the proper design of that page. All these requests will be recorded in the access logs. For each file requested, you’ll be able to see the response status code of the file, which helps you easily identify any errors visitors may be encountering.

The “who” information doesn’t include personally identifiable information; the log file won’t give you visitors’ email addresses or names. (It does include an IP address, which can help identify people in some cases, so you still want to keep privacy in mind.) What the log file does record is each visitor’s IP address and user agent. The country that can be inferred from the IP address isn’t all that helpful (for more useful geographic information, look at the location reports in Google Analytics), but the user agent is. The user agent tells you what browser and operating system the visitor was using. Importantly for SEO, the user agent also tells you whether the visitor was a search engine robot. For example, here is an entry in a log file for Googlebot crawling a page:

12.345.67.890 - - [01/Feb/2020:19:57:26 -0700] "GET / HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Side note: when a bot visits a website, that bot’s visit is usually called a crawl instead of a visit.

Finally, you can also see some information about where the visitor found the page being visited via the referrer information. However, Googlebot doesn’t send a referrer, so you can’t tell where its bots found your pages (even though that information would be massively helpful). Some browsers and plugins also strip this information. For example, here is a record of a person visiting the “/conversion-optimization” page on a website who was referred from Google organic search:

12.345.67.890 - - [27/Feb/2020:13:41:58 -0700] "GET /conversion-optimization/ HTTP/1.1" 200 24594 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
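
If you ever want to pull these fields out of a raw log file yourself, a short script will do it. Here is a minimal Python sketch, assuming the common Apache/Nginx “combined” log format shown in the examples above (the field names are just labels I picked for this sketch):

import re

# Regex for the "combined" log format shown above: IP, timestamp, request,
# status code, response size, referrer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('12.345.67.890 - - [27/Feb/2020:13:41:58 -0700] '
        '"GET /conversion-optimization/ HTTP/1.1" 200 24594 '
        '"https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; WOW64; '
        'Trident/7.0; rv:11.0) like Gecko"')

match = LOG_PATTERN.match(line)
if match:
    entry = match.groupdict()
    print(entry['status'])      # 200
    print(entry['referrer'])    # https://www.google.com/
    print(entry['user_agent'])  # the browser (or bot) that made the request

If your host writes logs in a different format, you would need to adjust the pattern accordingly.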

How Do You Access Log Files?

For most hosting providers, you can access log files through your website’s control panel, where they will often be referred to as access logs. If you can’t locate the log files yourself, email your hosting company’s support team to ask whether you can access them and how.

As an example, in cPanel, you can search for “access” and locate “Raw Access” in the “Metrics” area. The logs download as GZ files, so you will need a tool like 7-Zip File Manager to extract them.

cPanel Access Log – Accessing log files from cPanel
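
Side note: if you’d rather not extract the archive first, most scripting languages can read the GZ file directly. For example, here is a minimal Python sketch (the filename is just a placeholder for whatever your host calls the download):

import gzip

# Preview the first few entries of a gzipped access log without extracting it.
with gzip.open('example.com-Feb-2020.gz', 'rt', errors='replace') as log:
    for i, line in enumerate(log):
        print(line.rstrip())
        if i >= 4:  # stop after five lines
            break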

Many of the more popular hosting companies also publish instructions for accessing log files in their help documentation.

What questions can log files answer?

The information in the log file can help you answer several key questions to better understand how search engines (and other visitors) are using and understanding your website. This includes questions like: Are search robots crawling often enough? Are robots finding every file? Are robots running into errors? Can robots accurately load all pages on the website?

I’m making this very robot-centric and am focusing here on how you can use log files to improve your SEO performance. And, really, I’m making this very Google-centric too since in almost every case stronger SEO performance is synonymous with performance in Google’s search results. However, you can review many of these same questions for other search engines. I’d also encourage you to use log files to understand non-robotic visitors too. For example, you can review your log file to identify any 404 errors your visitors are encountering and then fix those 404s to help visitors avoid those errors in the future.

Although there are different ways of parsing the log files, I’m going to explain how we can answer questions with our log files by using a tool called Apache Logs Viewer. This tool is relatively inexpensive (€20 or about $22 US as of this writing) and provides a rich set of reports to help us extract information from our log files. If a fair amount of your traffic or business relies on organic search, I recommend you use a tool like this to more easily explore your access logs. If you are using Apache Logs Viewer, once you’ve downloaded your access logs from your hosting company, you can go to File -> Add Access Log to open the access log within this tool.

All right, let’s go through three of the key questions you can answer with a log file in more detail and talk about how we get this information.

  1. How often does Google crawl my website?
  2. What pages is Googlebot crawling?
  3. Can bots crawl images, JavaScript, or CSS files?

Log files can help answer far more questions than these; however, my hope isn’t to show you every report but to help you get started using and exploring your website’s log files.

1. How often does Google crawl my website?

How To Check This

Step #1: Filter to Googlebot.

Filtering to Googlebot User Agent

Filter the log file view to only show log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to that field, and then click “Apply Filter”.

Step #2: Report on Hits Each Day.

Apache Logs Viewer – Hits Each Day Report

On the “Reports” menu, select “Hits Each Day”.

Note: in this case, hits are essentially synonymous with crawls or visits.

Step #3: Review activity.

Graph of hits from Googlebot by day

This will load a graph showing how many hits Googlebot made to the website each day. Over time, you may start to notice patterns of when Googlebot crawls more or less often. You also want to make sure Googlebot is crawling more often after major site changes.
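
If you prefer to script this check instead of (or alongside) Apache Logs Viewer, here is a minimal Python sketch of the same report. The access.log filename is a placeholder, and matching on “Googlebot” in the user agent is a rough filter; spoofed bots will slip through unless you verify them (for example, via a reverse DNS lookup).

import re
from collections import Counter
from datetime import datetime

hits_per_day = Counter()

# 'access.log' is a placeholder; point this at your downloaded log file.
with open('access.log', errors='replace') as log:
    for line in log:
        if 'Googlebot' not in line:
            continue
        # Pull the date (e.g. 01/Feb/2020) out of the [timestamp] field.
        match = re.search(r'\[(\d{2}/\w{3}/\d{4})', line)
        if match:
            hits_per_day[match.group(1)] += 1

# Print a simple day-by-day count of Googlebot hits.
for day, count in sorted(hits_per_day.items(),
                         key=lambda kv: datetime.strptime(kv[0], '%d/%b/%Y')):
    print(day, count)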

Why Check This

Search engine robots don’t crawl consistently. Some websites naturally get crawled more often than others if Google’s bots begin to understand the website is speaking to topics that are frequently updated (think news websites). However, even within a website, Google’s bots can understand that certain pages deserve to be crawled more frequently than others based on the nature of the content.

Google’s bots generally do a good job of deciding which pages to crawl more frequently, but they don’t always get it right. As a result, you need to know how often Google’s bots are crawling your website and make sure it is often enough relative to how often your content is updated. If your website isn’t crawled often enough, you can try different things to get Google to crawl more often, such as updating your XML sitemap, adding dates to your content to make the update frequency more visible, or making more substantial changes to your text.

2. What pages is Googlebot crawling?

How To Check This

Step #1: Filter to Googlebot.

Filtering to Googlebot User Agent

Filter the log file view to only show log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to that field, and then click “Apply Filter”.

Step #2: Load page report.

Statistics Menu – Click on “Pages”

Next, go to the “Statistics” menu and select “Pages”. You can then choose a specific date range or use all of the data available in the log file.

Step #3: Review requests.

Report on What Pages Were Hit (Crawled) by Googlebot

Finally, you can review the number of times each page (each file, really, as this will include JavaScript, CSS, and images too) was crawled by Googlebot. Is there anything getting crawled more than it seems like it should? What is missing from the most crawled pages? To make this easier to review, you can export to CSV and open it in Excel.
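
If you want the same view from a script, here is a minimal Python sketch that counts how often Googlebot requested each file; as before, the filename is a placeholder and the user agent match is only a rough filter.

import re
from collections import Counter

crawled_paths = Counter()

with open('access.log', errors='replace') as log:  # placeholder filename
    for line in log:
        if 'Googlebot' not in line:
            continue
        # The requested path sits inside the quoted request,
        # e.g. "GET /conversion-optimization/ HTTP/1.1".
        match = re.search(r'"[A-Z]+ (\S+) HTTP', line)
        if match:
            crawled_paths[match.group(1)] += 1

# Show the most frequently crawled files first.
for path, hits in crawled_paths.most_common(50):
    print(hits, path)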

Why Check This

After determining the overall crawl volume, you need to know what pages Googlebot is looking at. For starters, you want to know if Google is looking at every page on your website or if they are only crawling a few key pages. If Googlebot is only crawling a few pages, chances are they are missing out on the entirety of your website. If you see this happening, you are also probably struggling to get all of the pages on your website to rank higher (or at all) in search results. Especially when you are looking at a month’s worth of data in your log file, you should see most of your pages getting crawled at least once or twice. If Googlebot isn’t crawling all pages on the website, that could be a sign Googlebot doesn’t think those pages are very important and not worth its time to recrawl. In that case, adding more internal links to the pages that aren’t getting crawled (or external links if those external links are of decent quality) will help make those pages look more important and increase the chances that those pages will be crawled.

You want to pay close attention to pages that have recently been updated—is Googlebot crawling those pages? If not, this would explain why you aren’t seeing those changes appear in search results. Using the “Request Indexing” feature in Google Search Console may help and so could resubmitting the XML sitemap with these updated URLs included. But, requesting indexing and XML sitemap resubmissions are hacks to prompt Google to crawl something—so more than trying to find ways to hack Google to get them to crawl, you need to figure out why recent updates are being ignored. Are the bots ignoring the updates because your content is lower quality and Google doesn’t see the need to pay attention to it? Or are bots ignoring the content because of poor site architecture and they simply can’t find those updates?

In a similar way, you want to see if Googlebot is crawling older pages you’ve removed from your website. This can help you understand why older pages are still in Google’s index and appearing in search results even if they currently return a 404 not-found error message indicating that page has been removed. Likely, if an old, removed page is still in search results, that is because Google hasn’t crawled that old page and hasn’t seen that the page was removed. One trick that can help here is submitting an XML sitemap full of those removed pages in Google Search Console—this prompts Google to recrawl those old pages and can sometimes speed up the removal of those pages from the index and from search results.
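
If you want to try that trick, the sitemap itself is just a standard XML file listing the removed URLs you want Google to recrawl; the URL below is a hypothetical example.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hypothetical removed URL you want Google to recrawl and drop -->
  <url>
    <loc>https://www.example.com/old-removed-page/</loc>
  </url>
</urlset>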

3. Can bots crawl images, JavaScript, or CSS files?

How To Check This

Step #1: Filter to Googlebot.

Filtering to Googlebot User Agent

Filter the log file view to only show log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to that field, and then click “Apply Filter”.

Step #2: Filter to JavaScript, CSS, and images.

Filtering to JavaScript, CSS, and Image Files

Next, we want to filter the view to show only JavaScript, CSS, and image files. In the dropdown next to the “Request” filter, select “Include (Regex)”. Then, in the filter box, add a regular expression that matches these file types, such as: js|css|jpeg|jpg|gif|png. Once this is entered, click “Apply Filter”.

Step #3: View status codes.

View Status Codes for JavaScript, CSS, and Image Files

Once filtered, you can scan through all the requests to make sure all of the requested JavaScript, CSS, and image files have a 200 response status, indicating they loaded correctly for Googlebot. You can also go to the “Reports” menu, select “Status Codes”, and then select “Status Code (full)” to view a report showing all the status codes encountered for these file types.
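
Here is a minimal Python sketch of the same check for anyone scripting it. Note that anchoring the pattern to actual file extensions (rather than the bare js|css substrings) avoids accidentally matching unrelated URLs; the filename is again a placeholder.

import re
from collections import Counter

# Match requests for JavaScript, CSS, and common image file extensions.
ASSET_PATTERN = re.compile(r'\.(js|css|jpe?g|gif|png)(\?|$)', re.IGNORECASE)
status_counts = Counter()

with open('access.log', errors='replace') as log:  # placeholder filename
    for line in log:
        if 'Googlebot' not in line:
            continue
        match = re.search(r'"[A-Z]+ (\S+) HTTP[^"]*" (\d{3})', line)
        if not match:
            continue
        path, status = match.group(1), match.group(2)
        if ASSET_PATTERN.search(path):
            status_counts[status] += 1

# Mostly 200s (and the occasional 304) is what you want to see here.
for status, count in status_counts.most_common():
    print(status, count)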

Why Check This

Googlebot doesn’t just read the HTML code and the text of your website. Instead, Google’s bots want to see your website the same way a human sees the website: with images, with design, and with functionality. That means Googlebot needs to be able to load your website’s JavaScript, CSS, and image files. If you are blocking them from doing so—for example with a disallow on the robots.txt file—then the bots won’t be able to fully understand the page. That can cause bots to ignore certain pages and keep those pages from appearing in search results.
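
For example, robots.txt rules like these (the directory names are hypothetical) would keep Googlebot, and every other well-behaved bot, from loading the scripts and stylesheets it needs to render your pages:

# Hypothetical example of what not to do: these rules hide JS and CSS from all bots
User-agent: *
Disallow: /assets/js/
Disallow: /assets/css/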

Image search can also be a key means of driving traffic to your website, especially if your content lends itself to people searching for visuals (such as informational sites sharing diagrams or some ecommerce websites). If you block Googlebot from seeing your images when it crawls your website, you remove your chances of appearing in image search and cut off one potential source of traffic. Plus, images don’t only appear in image-specific searches; they can also be used in featured snippets alongside text, so blocking images removes the possibility of appearing in those features and capturing a would-be visitor’s attention.

How Often Should I Check My Log Files?

How often you need to review your log files depends on how actively you update your website. If you are adding and removing dozens of pages from your website every day, then reviewing log files weekly will help you make sure Googlebot is seeing these changes. If your website is less active, only making a few changes each week or a few changes each month, then a monthly or quarterly review of the log file will probably be enough. However, if you have recently redesigned or restructured your website, you do want to review log files more frequently (daily or weekly) to make sure the search robots are starting to crawl through the updated website to detect all the changes you’ve made.

Of course, how often you should check is also dependent upon how important organic traffic is to your website and to your business. If most of your traffic or revenue comes from organic search, then monitoring log files more often is critical—after all, any problem with a robot crawling your website could lead to significant negative impacts on your company’s bottom line. On the other hand, if your website’s traffic predominantly comes from email, ads, or word of mouth instead of organic traffic, then reviewing the log file to understand how robots are using your website isn’t as important (however, log file analysis could still help you understand your non-bot users).

If you have any questions about log files or need help analyzing log files to improve your SEO, please contact me.
