Site Crawler

Note that Site Crawler is in an opt-in Free Trial period. Please reach out to mairin@page-vault.com if you would like access to this feature. 

What is the Site Crawler?

The Page Vault Site Crawler is a feature designed to automatically crawl websites and simultaneously make captures of those websites. It is ideal for users who need to capture multiple entire pages or sections of a website.

The Site Crawler will collect URLs from the original domain only; all external links will be excluded.
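The same-domain rule can be pictured with a short sketch. This is an illustration only, not Page Vault's actual implementation: links are resolved to absolute URLs, anything outside the original domain (or not an http/https link) is dropped, and duplicates are removed.

```python
from urllib.parse import urljoin, urlparse

def filter_same_domain(start_url, links):
    """Keep only unique http(s) links on the same domain as start_url.
    Illustrative sketch of the crawler's collection rule, not its code."""
    domain = urlparse(start_url).netloc
    seen = set()
    kept = []
    for link in links:
        absolute = urljoin(start_url, link)   # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                          # skip mailto:, javascript:, etc.
        if parsed.netloc != domain:
            continue                          # exclude external links
        if absolute not in seen:              # remove duplicates
            seen.add(absolute)
            kept.append(absolute)
    return kept
```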

🕷️ What are common use cases for the Site Crawler?

  • Quickly collect all URLs within a page, or up to 5 layers (subsequent pages) of a website. You can use this list to determine which pages or sections to capture using our Batch tool.
  • Save time and effort by automatically crawling and capturing an entire website or a website section, up to 5,000 URLs. 

✅ Captures made with the Site Crawler during the Free Trial period will not count towards your monthly PDF usage.

Watch a 3-minute demo here to get started.

How do I Get Started with the Site Crawler?

There are two ways you can use the Site Crawler: from the Page Vault Portal, or through Capture Mode.

Crawling from the Page Vault Portal:

  1. Log into your Page Vault account and select “Site Crawler” in the top navigation bar.
  2. Select New Crawl, and enter the URL that you wish to crawl.
  3. If needed, enter cookies. Learn how and when to manage cookies here.
  4. Choose which type of crawl you’d like to make:
    • Crawl and Capture will crawl up to 5,000 URLs and automatically capture each one. The output is a complete list of URLs and a PDF capture of each.
    • Crawl Only will provide a list of up to 5,000 URLs crawled from the website.
  5. Select the number of layers you wish to crawl – if you are trying to capture an entire website, set this to “5.”
  6. Set the URL maximum. Please note that if you increase this to 5,000, the capture process can take up to a couple of hours.
  7. If you are capturing your crawled URLs, select a folder to which you will save your captures.
  8. Click Start Crawl!

You can also use the Site Crawler directly from the Page Vault Capture Mode.

  1. Launch Capture Mode
  2. Navigate to the site you wish to crawl
  3. Select “Site Crawler” and fill out the appropriate details for your crawl. Note that cookies are not required in this section, because the browser will store the cookies automatically. Before you start the crawl, be sure to click out of any marketing pop-ups.
  4. Select Start! You can close out of Capture Mode; your crawl will run in the background.
  5. Navigate back to the Site Crawler section in your portal to view the crawl progress.

Canceling, Deleting, Downloading Crawls

Once a crawl is started, you can cancel it at any time while it’s running. This is helpful if you started a crawl by mistake, or if you notice that the capture results have a pop-up obstructing the captured content. When you cancel a crawl, all the crawled and captured material will be saved to the crawl details page.

Once a crawl has completed, you are able to take a few actions on the crawl results.

  • Delete – this will delete the crawl record. Note that if you captured content during the crawl, the content will still be stored in your Page Vault Portal.
  • Download Crawl CSV – this will download the URL crawl results in a CSV file. Captures will not be downloaded.
  • Retry Failed URLs – occasionally, a URL’s capture will fail. This option allows you to retry any failed URLs to automatically attempt to capture them again.

FAQs

Who has access to the Site Crawler Free Trial? Anyone who signed up for the Free Trial will see this feature on their profile. If you are interested in testing the first version of this feature, please email mairin@page-vault.com.

Do I have to capture all the URLs that I crawl? No – our crawler lets you choose whether to just crawl a website or to crawl and capture it simultaneously.

What are “layers”? Layers refer to the hierarchical structure of a website. The URL that you enter represents Layer 1. Any URL collected from that page becomes Layer 2, and all URLs on those pages will be crawled and become Layer 3. The Site Crawler beta can capture up to 5 layers.
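The layer numbering described above works like a breadth-first crawl with a depth limit. The sketch below is a hypothetical illustration (not Page Vault's code); `get_links` is an assumed callback that returns the URLs found on a page.

```python
from collections import deque

def crawl_layers(start_url, get_links, max_layers=5):
    """Assign each discovered URL a layer number, breadth-first.
    The start URL is Layer 1; links found on it are Layer 2; and so on.
    Illustrative sketch only -- get_links is a hypothetical callback."""
    layers = {start_url: 1}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        layer = layers[url]
        if layer >= max_layers:
            continue                     # do not crawl deeper than the limit
        for link in get_links(url):
            if link not in layers:       # a URL's layer is set when first seen
                layers[link] = layer + 1
                queue.append(link)
    return layers
```

Note that pages sitting at the maximum layer are still collected, but the links on them are not followed any further.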

What types of URLs will be captured? The crawler will collect URLs included on a site’s pages. External links, videos, and email links will not be included. Any duplicate links will automatically be removed.

What if I have two captures that are identical but the URLs are different? Occasionally, a website will have two URLs that point to the same page – resulting in two identical captures. In its current state, our crawler removes all duplicate URLs, but we cannot detect if two different URLs point to an identical page. If you are seeing the same page being captured multiple times, please reach out to support@page-vault.com.

Updated on April 16, 2024