Friday, March 24, 2023

How To Crawl A Large Site And Extract Data Using Screaming Frog’s SEO Spider


We’re assisting a number of clients right now with Marketo migrations. As large companies adopt enterprise solutions like this, the platform becomes a spider web that weaves itself into processes and platforms over the years… to the point that companies aren’t even aware of every touchpoint.

With an enterprise marketing automation platform like Marketo, forms are the entry point for data across sites and landing pages. Companies often have thousands of pages and hundreds of forms throughout their sites that need to be identified for updating.

A great tool for this is Screaming Frog’s SEO Spider… perhaps the most popular platform in the SEO market for crawling, auditing, and extracting data from a site. The platform is feature-rich and offers hundreds of options for virtually every task you require. The features extend far beyond optimization for search, though, with one incredibly handy feature for extracting data from your site as it’s being crawled.

Screaming Frog SEO Spider: Crawl And Extract

A key feature of Screaming Frog SEO Spider is that you can perform custom extractions based on Regex, XPath, or CSSPath specifications. This comes in extremely handy, as we want to crawl the client’s sites and audit and capture the MunchkinID and FormId values from pages.

With the tool, open Configuration > Custom > Extraction to identify the elements you wish to extract.


The extraction screen allows for virtually unlimited data collection.

Regex, XPath, and CSSPath Extraction

For the MunchkinID, the identifier is located within the form script that’s within the page:

<script type='text/javascript' id='marketo-fat-js-extra'>
    /* <![CDATA[ */
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https://yoursite.com/wp-admin/admin-ajax.php",
        "popout": {
            "enabled": false
        }
    };
    /* ]]> */
</script>

We then apply a Regex rule to capture the id from within the script tag that’s inserted in the page:

Regex: ["']id["']: *["'](.*?)["']
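To see what this rule captures outside of the crawler, here is a minimal Python sketch that runs the same pattern against a trimmed copy of the script block. The function name extract_munchkin_id and the sample markup are illustrative assumptions, not part of Screaming Frog.

```python
import re

# Same pattern as the Regex rule above; group 1 holds the captured value.
MUNCHKIN_RE = re.compile(r"""["']id["']: *["'](.*?)["']""")

# Sample page source modeled on the script block shown earlier (trimmed).
sample_html = """
<script type='text/javascript' id='marketo-fat-js-extra'>
var marketoFat = {
    "id": "123-ABC-456",
    "prepopulate": "",
    "ajaxurl": "https://yoursite.com/wp-admin/admin-ajax.php"
};
</script>
"""


def extract_munchkin_id(html):
    """Return the first captured id value, or None when nothing matches."""
    match = MUNCHKIN_RE.search(html)
    return match.group(1) if match else None


print(extract_munchkin_id(sample_html))
```

The script tag's own id attribute is not picked up because the pattern requires a quote immediately before id and a colon after it, which only the JSON-style key satisfies.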

For the Form ID, the data is in an input tag within the Marketo form:

<input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234">

We apply an XPath rule to capture the id from within the form that’s inserted in the page. The XPath query looks for a form containing an input with a name of formid, then the extraction saves the value:

XPath: //form/input[@name="formid"]/@value
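The same lookup can be approximated in Python for a quick sanity check. This is a rough sketch using only the standard library, which supports a limited XPath subset: it cannot select attributes with /@value, so the code selects the input element and reads its value attribute instead. It also assumes well-formed markup (note the self-closed input), and the extract_form_ids helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup modeled on the Marketo hidden field above.
sample_form = """
<div>
  <form>
    <input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234" />
  </form>
</div>
"""


def extract_form_ids(markup):
    """Return the value of every formid input that is a direct child of a form."""
    root = ET.fromstring(markup)
    # ElementTree has no /@value step, so match the element and read the attribute.
    inputs = root.findall(".//form/input[@name='formid']")
    return [element.get("value") for element in inputs]


print(extract_form_ids(sample_form))
```

As with the XPath rule, an input nested deeper inside the form (for example, wrapped in a div) would not match, because the path requires the input to be a direct child of the form.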

Extract Inline Style Tags

We’re helping a client right now clean up a site where they used inline styles in the Elementor plugin to customize virtually every element within a page. To identify where inline styles were used, we scraped the site with various RegEx rules for custom extraction:

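  • Span Tag Inline Style:
<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Anchor Tag Inline Style:
<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Div Tag Inline Style:
<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Heading Tag Inline Style:
<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"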
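These rules can be tried against a small HTML sample before running a full crawl. The sketch below is an assumption-laden illustration: the patterns are written with their whitespace escapes (\s) spelled out, the heading rule is narrowed to h1 through h6, and the find_inline_styles helper and the sample markup are hypothetical.

```python
import re

# One rule per tag; group 1 captures the contents of the style attribute.
INLINE_STYLE_RULES = {
    "span": r'<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "a": r'<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "div": r'<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "heading": r'<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
}

# Illustrative page fragment with one inline style per tag type.
sample = '''
<div class="wrap" style="margin: 0 auto">
  <h2 style="color: #333">Title</h2>
  <span style="font-weight: bold">Bold</span>
  <a href="/contact" style="text-decoration: none">Contact</a>
  <p>No inline style here.</p>
</div>
'''


def find_inline_styles(html):
    """Map each rule name to the list of style values it captures."""
    return {
        name: re.findall(pattern, html)
        for name, pattern in INLINE_STYLE_RULES.items()
    }


for tag, styles in find_inline_styles(sample).items():
    print(tag, styles)
```

These patterns only match style attributes wrapped in double quotes, so markup that uses single quotes would need a separate rule.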

Exclude Subdomains In Your Crawl

At Martech Zone, we serve the site in multiple languages on different subdomains. Crawling these translations isn’t necessary since all the assets and information are based on the core site. Because of this, we enabled the Exclude list configuration and added the following rule:

.*\.martech\.zone

You can also use this to skip crawling unnecessary paths like tags by adding:

martech.zone/tag/.*

The platform even has a nice way to test some URLs against the rules to ensure they work properly before you crawl your site.
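A similar check can be scripted when you want to review a long list of URLs at once. The following is a minimal sketch, not a reproduction of how Screaming Frog evaluates its exclude list: it assumes a rule applies when it matches anywhere in the URL (re.search), it writes the dots in the domain as escaped literals, and the fr subdomain and the sample paths are made up for illustration.

```python
import re

# Exclude rules with literal dots escaped (an assumption about the intended patterns).
EXCLUDE_RULES = [
    r".*\.martech\.zone",    # any subdomain of martech.zone
    r"martech\.zone/tag/.*", # tag archive paths on the core site
]


def is_excluded(url):
    """Return True when any exclude rule matches somewhere in the URL."""
    return any(re.search(rule, url) for rule in EXCLUDE_RULES)


# Hypothetical URLs: a translated subdomain, a tag page, and a regular post.
for url in (
    "https://fr.martech.zone/some-post/",
    "https://martech.zone/tag/seo/",
    "https://martech.zone/some-post/",
):
    print(url, is_excluded(url))
```

With the dot before martech escaped, the first rule requires a subdomain, so pages on the core domain are still crawled unless the tag rule catches them.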

Screaming Frog SEO Spider JavaScript Rendering

Another great option in Screaming Frog is that you aren’t limited to the HTML in the page; you can render any JavaScript that’s going to insert forms within your site. Within Configuration > Spider, you can go to the Rendering tab and enable this.


This does take a little longer to crawl the site, of course, but you’ll get forms that are rendered client-side by JavaScript as well as forms that are inserted server-side.

While this is a very specific application, it’s an incredibly useful one when you’re working with large sites. You’ll absolutely want to audit where your forms are embedded throughout the site.

Download Screaming Frog SEO Spider

Disclosure: Martech Zone is using its affiliate links in this article.
