Working with Wraith, Pt. 2

This post is an addendum for Working with Wraith.

If you don’t want to spend the time documenting out each URL on your site, Wraith can perform a spider of your site’s links upon execution. This functionality is courtesy of the Anemone library that’s required for Wraith.

How to use Wraith’s spidering

It’s pretty simple: comment out (or remove) the path: section of your config.yaml. For example, if this is your current path: section and you want to use spidering, change this:

paths:
  home: /
  articles: /articles/

…to this:

#paths:
#  home: /
#  uk_index: /uk

What about exclusions?

In the case of the Jekyll theme I’m using, when spidering my site a tags.html page was being created, with additional parameters being tacked on for each of the tags that I’ve made. To resolve the issue, you can specify Anemone specific parameters to change the behavior of the spidering. Open /lib/wraith_manager.rb and find the below section of code.

def spider_base_domain

    spider_list = []
    #set the crawl domain to the base domain in the confing 
    crawl_url = wraith.base_domain
    #ignore urls to file extension such as images etc
    ext = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)
    Anemone.crawl(crawl_url) do |anemone|
      # Don't spider the tags for a Jekyll blog
      anemone.skip_links_like %r{^/tags.*}
      anemone.on_every_page do |page|
          #puts page.url
          #add the urls to the array
          spider_list << page.url.path
      end
    end

You can use regular expressions with the aptly named anemone.skip_links_like method. If you want to prevent /tags* from appearing in your spider results (like I did), you can use the code below.

anemone.skip_links_like %r{^/tags.*}


A big thanks to the BBC News developers who created this excellent application and the community who maintains it! You can find the github page here.