How You Can Use Screaming Frog’s Extraction Feature To Become An SEO Hero
Working as an SEO is cool. Sometimes it’s frustrating, but it’s an all-around great role to work in. However, you may sometimes feel separated from the rest of the company because your workflow is completely different to theirs. I understand: I have been there before and I’m still in that situation now.
I personally think that this is a good thing because it differentiates you from the rest. Some employees will think that you’re some sort of wizard for knowing this stuff and, in some cases, you may be looked down upon by those who don’t necessarily believe in SEO.
I’ve previously had this tension whilst working with web developers, especially when recommending changes and going the extra mile to improve their code base. They don’t like it. Some web developers may even claim that they know SEO, and I’ve even heard some developers claim that meta keywords are still a thing.
I have found that trying to help as many departments as possible is always worthwhile in the long run, even if that means going the extra mile and intentionally taking on tasks that aren’t within your remit. This will help spread the word about what you do and will help build goodwill. It really depends on your role, but the tasks can include the following:
- Providing guidance to a web developer on how to improve their code base from an SEO standpoint;
- Providing a sophisticated content audit to not only help SEO but also UX;
- Providing an ad-hoc service to other departments that may involve some Google Analytics work;
- Using SEO tools to help other departments collate data, e.g. the PR team may want to be updated every time the company is referenced on the web, so you can use Moz fresh alerts to do this;
- Helping to automate boring, manual tasks that may take up a lot of their time.
In this case, if you’re looking to help speed up the process of manual, boring work for a colleague, you can use Screaming Frog to do this. It’s a beast for automating the process of identifying and logging certain things on a web page.
I remember when I used to work agency-side, a poor soul was asked to navigate through a massive eCommerce site to find all the pages with mixed content errors. This can be done easily via Screaming Frog, so I decided to lend a helping hand to free up this employee’s time. As you can imagine, she greatly appreciated it.
Therefore, in this post, I thought I would discuss how you can use the extraction feature within Screaming Frog to go the extra mile and help other departments fulfil certain tasks, whether that’s identifying mixed content errors across an entire site after an HTTPS deployment or creating a spreadsheet outlining all of the heading tags on each page.
Screaming Frog is very flexible and it can be used for almost anything if you know how to use the extraction feature.
What Is Screaming Frog’s Extraction Feature?
First off, let’s start from the bottom so everyone’s on the same page. The extraction feature is a tool within Screaming Frog that allows you, the webmaster, to customise how Screaming Frog collates its data across your site. There are some pre-defined settings within Screaming Frog, but you can customise the extraction request depending on your requirements. I have previously used Screaming Frog to do all of the following:
- Finding incorrect references to ccTLDs across an entire site (e.g. a .com domain referencing .co.uk, when it should be referencing .com);
- Finding pages that reference a certain phrase in their paragraph tags (finding internal linking opportunities);
- Extracting all of the email addresses that are being used on a website;
- Finding pages on a site that have mixed content errors after an HTTPS deployment;
- Lots more uses of Screaming Frog’s extraction feature that can save time.
How Do I Access The Extraction Feature, Brett?
Good question. This can be done easily if you have a Screaming Frog licence; if not, you’ll need to purchase one first. The extraction tool within Screaming Frog can be accessed by going to: Configuration > Custom > Extraction.
Once clicked you will then be able to see the following filters:
Cool, Cheers Brett. So, How Do We Use This Extraction Tool?
Using the extraction feature may come across as overly complicated at first. I remember the first time I ran across it, I felt like crawling back into my hole. It takes some time to learn, but once it’s mastered it can be used for almost anything and it will be the go-to piece of kit in your toolset. I love it.
To start with, there are lots of filters that can be used to tell Screaming Frog what information you’re looking to extract from its crawl. In essence, Screaming Frog will crawl every page on a site and, whenever it detects a certain piece of information (specified via your filter query), it will store that piece of information in a separate area for you. This area sits in the ‘Custom’ section of the ‘Overview’ page, under a tab called ‘Extraction’.
Therefore, everything that’s collected via your filters will go into this section for your viewing.
However, in order to use the filters, you will need to get to grips with the expressions that are used within Screaming Frog to filter your data. At the moment, three are available: XPath, CSSPath and Regex.
XPath is a query language for finding and processing items that are contained within XML documents. XPath is short for XML Path Language.
You may find that XPath is also used with HTML documents, because HTML has a similar hierarchical structure and XPath can be used to quickly filter and find elements on a web page.
CSSPath is very similar to XPath, but it uses CSS selectors to pull the correct information from the elements.
XPath and CSSPath have similar syntaxes, but many think that CSSPath is easier to use than XPath, plus it’s faster. CSSPath also gives you the opportunity to pull attributes.
Regex is a more complex syntax (I think) that was developed to spot patterns and sequences in text, and it can be used in multiple programming languages. In Screaming Frog, regex can also be used to filter the data even further via the include/exclude options that are accessible under the Configuration tab.
Let’s say we only wanted to crawl the URLs that have ‘community’ in them. We can do so by using the following Regex syntax: .*community.*
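As a rough way to sanity-check an include pattern before starting a crawl, the Python sketch below runs the same expression over a handful of made-up URLs. It assumes the pattern has to cover the whole URL (which would explain the .* on both sides). The example.com URLs are hypothetical, and Python's regex engine may differ in small ways from the one Screaming Frog uses, so treat this as an approximation.

```python
import re

# The include pattern from the text: any URL containing "community".
INCLUDE = re.compile(r".*community.*")

# Hypothetical URLs, for illustration only.
urls = [
    "https://example.com/community/",
    "https://example.com/community/threads/42",
    "https://example.com/blog/seo-tips",
]

# fullmatch() requires the pattern to cover the entire URL,
# which is the assumption this sketch makes about include filters.
included = [u for u in urls if INCLUDE.fullmatch(u)]
print(included)
```

Only the two community URLs survive the filter; the blog URL is dropped.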
The tricky part is understanding how they work. Let’s not forget that they work differently for every site because each website’s code base is different. In order to use the expressions to your advantage and to ensure that Screaming Frog collates the right data, you’ll need to understand your site’s architecture and how it’s been coded.
This includes everything from knowing which divs are being used and the CSS classes that style your HTML elements to the structure of your site and even your heading/paragraph tags.
It really depends on your requirements and what you’re looking to extract from the site. You will need to understand how the elements that you’re looking to extract are built on your website.
If you don’t know how your website operates then how are you meant to tell Screaming Frog what to look for?
As an example, if you’re looking to use Screaming Frog’s extraction feature to list all of the pages that reference a certain term, you will need to understand how your content is coded. In this case, the content will most likely be wrapped in a paragraph tag, so you could use an XPath query like this:
//p[contains(text() ,'your search query here')]
This XPath query will tell Screaming Frog to grab all of the page URLs that reference ‘your search query here’ in a paragraph tag. I find this useful when I have just created a new piece of content and I am looking for internal linking opportunities.
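One subtlety is worth checking on your own markup. In XPath 1.0, contains(text(), ...) only looks at the first text node of the paragraph, so a phrase that appears after an inline tag such as a bold or a link can be missed, whereas contains(., ...) looks at all of the paragraph's text. The Python sketch below imitates both behaviours with the standard library on a made-up, well-formed page. It is an approximation for illustration, not how Screaming Frog evaluates the query.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; must be well-formed for ElementTree to parse it.
page = """<html><body>
<p>Read our guide to technical SEO audits.</p>
<p>Contact us for a quote.</p>
<p>We also cover <b>link building</b> and technical SEO.</p>
</body></html>"""

phrase = "technical SEO"
root = ET.fromstring(page)

# p.text is the text before the first child tag, roughly what
# contains(text(), ...) tests in XPath 1.0.
first_text_hits = [p for p in root.iter("p") if phrase in (p.text or "")]

# Joining itertext() gives all text inside the paragraph,
# which is closer to contains(., ...).
full_text_hits = [p for p in root.iter("p") if phrase in "".join(p.itertext())]

print(len(first_text_hits), len(full_text_hits))
```

The third paragraph only shows up in the second list, because the phrase sits after the bold tag.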
An easy way of obtaining the XPath query associated with a certain element on a page is to right-click on the element whilst in the Inspect Element view and copy its XPath (in Chrome, Copy > Copy XPath). You can get to the Inspect Element view by right-clicking on a web page and clicking Inspect.
Once you have decided on the method of extraction, you will need to determine what you would like Screaming Frog to extract from the element. If you have selected either XPath or CSSPath, a drop-down menu will appear; no drop-down menu appears for Regex. The options for both XPath and CSSPath are:
- Extract Inner HTML: The inner content of the selected element is extracted. If the element has other tags inside, these are extracted too, e.g. if a heading tag sits inside the element, it would be included;
- Extract HTML Element: Screaming Frog extracts the entire element, including its own tags and the content within it;
- Extract Text: Only the text within the element and its sub-elements is extracted.
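To make the difference between the three options concrete, here is a small Python sketch that produces an equivalent of each from one made-up element. The helper names (outer_html, inner_html, text_only) are my own, and Screaming Frog's exact output formatting may differ; the point is only which parts of the element each option keeps.

```python
import xml.etree.ElementTree as ET

# Hypothetical element, as if it had been matched by an XPath or CSSPath query.
snippet = '<div class="intro"><h2>Title</h2><p>Hello <b>world</b></p></div>'
el = ET.fromstring(snippet)

def outer_html(el):
    """Like 'Extract HTML Element': the element's own tags plus everything inside."""
    return ET.tostring(el, encoding="unicode")

def inner_html(el):
    """Like 'Extract Inner HTML': everything inside, without the wrapping tag."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in el)
    return (el.text or "") + children

def text_only(el):
    """Like 'Extract Text': text of the element and its sub-elements, no tags."""
    return "".join(el.itertext())

print(outer_html(el))
print(inner_html(el))
print(text_only(el))
```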
Once your chosen method of extraction has been selected and you have selected a drop-down option, you will then need to include your query before beginning the crawl request.
Understanding The Methods Of Extraction
In order to better your understanding of XPath, CSSPath and Regex syntax, I suggest taking a look at the official documentation that is provided by Screaming Frog. This post is also quite useful.
There is no easy approach to this: you will simply have to get off your butt and learn the syntaxes yourself. If you have a clear understanding of how your site is coded, you’ll find this task easy peasy.
Once you have your queries in place, you will need to add them into the filtering sections of Screaming Frog like so:
Once you have completed a crawl using your desired extraction filters, the Extraction section will look something like this. You can then export this data into a spreadsheet by clicking the Export button.
Examples Of The Syntaxes I Personally Use
Finding Incorrect References Of ccTLDs
This XPath query can be used to find links that are pointing to a specific URL; however, you may need to change the div class name depending on what yours is called. In this case, I only wanted Screaming Frog to check a specific section of my site, as all of my client’s international sites are referenced in the header section (US, UK, AU etc).
I am only interested in finding the internal links within a section div class that point to international domains. In essence, all links within this div class should reference the same site, not the international websites. Ideally, I am looking for typos made when creating internal links.
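The query itself is not shown here, so the following is only a sketch of the idea. In XPath, something along the lines of //div[@class='section']//a[contains(@href, '.co.uk')]/@href would fit the description, where the class name 'section' and the .co.uk suffix are assumptions you would swap for your own. The Python below applies the same logic to a made-up page: it ignores the header links and flags only links inside the section div that point at another ccTLD.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical international variants that should NOT be linked from this site.
OTHER_SUFFIXES = (".co.uk", ".com.au")

# Hypothetical page: the header legitimately links to the UK site,
# but the section div contains one link with the wrong ccTLD.
page = """<html><body>
<div class="header"><a href="https://www.example.co.uk/">UK</a></div>
<div class="section">
  <a href="https://www.example.com/pricing">Pricing</a>
  <a href="https://www.example.co.uk/contact">Contact</a>
</div>
</body></html>"""

root = ET.fromstring(page)
wrong_links = []
# Only look inside the div whose class is exactly "section".
for div in root.iterfind(".//div[@class='section']"):
    for a in div.iter("a"):
        host = urlparse(a.get("href", "")).hostname or ""
        if host.endswith(OTHER_SUFFIXES):
            wrong_links.append(a.get("href"))

print(wrong_links)
```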
Find Pages That Reference A Certain Phrase
This XPath query can be used to list all of the pages on your site that reference a specific phrase. I find this useful when conducting an internal link building campaign.
//p[contains(text() ,'your search query here')]
Extracting Email Addresses From A Site
This XPath query can be used to extract email addresses that are being used on the site.
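The query is not shown here. One XPath expression that would fit the description, offered as an assumption rather than the original, is //a[starts-with(@href, 'mailto:')]/@href, which picks up addresses that are linked with mailto. The Python sketch below does the equivalent with the standard library on a made-up snippet; the class name MailtoCollector is my own.

```python
from html.parser import HTMLParser

class MailtoCollector(HTMLParser):
    """Collects addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... suffix.
            self.emails.append(href[len("mailto:"):].split("?")[0])

# Hypothetical page fragment.
page = """
<p>Email <a href="mailto:hello@example.com">the team</a> or
<a href="mailto:press@example.com?subject=Hi">press</a>.
See <a href="/contact">contact</a>.</p>
"""

collector = MailtoCollector()
collector.feed(page)
print(collector.emails)
```

Addresses that appear as plain text without a mailto link would not be caught by this approach; a Regex extraction would be the likelier fit for those.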
Extracting All Heading Tag Contents
A simple way of using Screaming Frog’s extraction tool to get all of the content listed within heading tags. This can be changed depending on your requirements, e.g. if you’re interested in collating the H1s, change //h3 to //h1.
//h3
Extracting Specific Heading Tag Contents, Not All
Very similar to the above heading extraction; however, in some cases there may be multiple headings per page. If you’re only interested in grabbing the first, the following XPath can be used. The number can be changed depending on your preference.
To grab the first 10 H3s that are contained on a page, the following query can be used:
/descendant::h3[position() >= 1 and position() <= 10]
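To see what these position filters return, the Python sketch below builds a made-up page with twelve H3s and takes the first one and the first ten. Keep in mind that XPath positions start at 1 while Python slices start at 0. The /descendant::h3[1] form mentioned in the comments is my assumption for the "first heading only" query, based on the shape of the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with twelve H3 headings.
headings = "".join(f"<h3>Heading {i}</h3>" for i in range(1, 13))
root = ET.fromstring(f"<html><body>{headings}</body></html>")

# Every H3 in document order, like //h3.
all_h3 = [h.text for h in root.iter("h3")]

# First H3 only, like /descendant::h3[1] (assumed form).
first_h3 = all_h3[:1]

# First ten H3s, like /descendant::h3[position() <= 10].
first_ten = all_h3[:10]

print(first_h3)
print(len(first_ten), first_ten[-1])
```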
Extracting Hreflang Code
If you’re looking to extract the exact hreflang code being used on each page, you can do so by using the following:
On the other hand, if you’re looking for Screaming Frog to only output the hreflang values, instead of the entire line of code, you can do so by using the following:
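Neither query is shown here. As an assumption, something like //link[@hreflang] would return the whole line of code and //link[@hreflang]/@hreflang only the values. The Python sketch below mirrors both on a made-up head section using the standard library; HreflangCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser

class HreflangCollector(HTMLParser):
    """Collects <link hreflang="..."> tags and their hreflang values."""

    def __init__(self):
        super().__init__()
        self.tags = []    # the whole tag, as written in the page
        self.values = []  # the hreflang attribute only

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "hreflang" in attrs:
            self.tags.append(self.get_starttag_text())
            self.values.append(attrs["hreflang"])

# Hypothetical head section.
page = """<head>
<link rel="canonical" href="https://example.com/">
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/">
<link rel="alternate" hreflang="en-us" href="https://example.com/us/">
</head>"""

collector = HreflangCollector()
collector.feed(page)
print(collector.tags)
print(collector.values)
```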
Extracting Schema Markup
Schema markup and its item type can also be extracted from each page to give you an overview of what’s being used on each.
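The query is not shown here. For microdata, an expression such as //*[@itemtype]/@itemtype would return the item type of every marked-up element; that is my assumption, and it would not cover schema added via JSON-LD. The Python sketch below runs the closest equivalent that the standard library supports on a made-up, well-formed snippet.

```python
import xml.etree.ElementTree as ET

# Hypothetical microdata markup (well-formed so ElementTree can parse it).
page = """<html><body>
<div itemscope="" itemtype="https://schema.org/Product">
  <span itemprop="name">Widget</span>
  <div itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
    <span itemprop="price">9.99</span>
  </div>
</div>
</body></html>"""

root = ET.fromstring(page)

# ".//*[@itemtype]" selects every descendant that has an itemtype attribute.
item_types = [el.get("itemtype") for el in root.iterfind(".//*[@itemtype]")]
print(item_types)
```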
Extracting Social Media Tags (Open Graph Tags & Twitter Cards)
If you’re interested in the social media tagging side of things, you can use the following queries.
//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content
Extracting Mobile Annotations
Mobile annotations can also be extracted if that’s something you’re interested in.
//link[contains(@media, '640') and @href]/@href
If you’re keen to extract iframes, you can do so by using the following:
Extracting Links Using AMP
If you’re interested in obtaining an overview of which pages have AMP, the following query may be of interest:
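The query is not shown here. Pages with an AMP version normally declare it with a link tag whose rel is amphtml, so a query like //link[@rel='amphtml']/@href would be one way to list them; take that as an assumption rather than the original. The Python sketch below runs the same lookup on a made-up, well-formed head section.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; tags are self-closed so ElementTree can parse it.
page = """<html><head>
<link rel="canonical" href="https://example.com/post/"/>
<link rel="amphtml" href="https://example.com/post/amp/"/>
</head><body><p>Post</p></body></html>"""

root = ET.fromstring(page)

# find() returns the first matching element, or None if the page has no AMP link.
amp_link = root.find(".//link[@rel='amphtml']")
amp_url = amp_link.get("href") if amp_link is not None else None
print(amp_url)
```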