Ruby web scraping tutorial on morph.io – Part 3, continue writing your scraper

This post is part of a series of posts that provide step-by-step instructions on how to write a simple web scraper using Ruby on morph.io. If you find any problems, let us know in the comments so we can improve these tutorials.


In the last post we started writing our scraper and gathering some data. In this post we’ll expand our scraper to get more of the data we’re after.

So now that you’ve got the title for the first member get the electorate (the place the member is ‘member for’) and party.

Looking at the page source again, you can see this information is in the first and second <dd> elements in the member’s <li>.

<li>
  <p class='title'>
    <a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
      The Hon Ian Macfarlane MP
    </a>
  </p>
  <p class='thumbnail'>
    <a href="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">
      <img alt="Photo of The Hon Ian Macfarlane MP" src="http://parlinfo.aph.gov.au/parlInfo/download/handbook/allmps/WN6/upload_ref_binary/WN6.JPG" width="80" />
    </a>
  </p>
  <dl>
    <dt>Member for</dt>
    <dd>Groom, Queensland</dd>
    <dt>Party</dt>
    <dd>Liberal Party of Australia</dd>
    <dt>Connect</dt>
    <dd>
      <a class="social mail" href="mailto:Ian.Macfarlane.MP@aph.gov.au"
      target="_blank">Email</a>
    </dd>
  </dl>
</li>

Get the electorate and party by first getting an array of the <dd> elements and then selecting the one you want by its index in the array. Remember that [0] is the first item in an Array.

Try getting the data in your irb session:

>> page.at('.search-filter-results').at('li').search('dd')[0].inner_text
=> "Groom, Queensland"
>> page.at('.search-filter-results').at('li').search('dd')[1].inner_text
=> "Liberal Party of Australia"

Then add the code to expand your member object in your scraper.rb:

member = {
  title: page.at('.search-filter-results').at('li').at('.title').inner_text.strip,
  electorate: page.at('.search-filter-results').at('li').search('dd')[0].inner_text,
  party: page.at('.search-filter-results').at('li').search('dd')[1].inner_text
}

Save and run your scraper using bundle exec ruby scraper.rb and check that your object includes the attributes with values you expect.

OK, now you just need the url for the member’s individual page. Look at that source code again and you’ll find it in the href of the <a> inside the <p> with the class title.

In your irb session, first get the <a> element:

>> page.at('.search-filter-results').at('li').at('.title a')
=> #<Nokogiri::XML::Element:0x3fca485cfba0 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fca48432a18 name="href" value="http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6">] children=[#<Nokogiri::XML::Text:0x3fca4843b5c8 "The Hon Ian Macfarlane MP">]>

You get a Nokogiri XML Element with one attribute. The attribute has the name “href” and the value is the url you want. You can use the attr() method here to return this value:

>> page.at('.search-filter-results').at('li').at('.title a').attr('href')
=> "http://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=WN6"

You can now add this final attribute to your member object in scraper.rb:

member = {
  title: page.at('.search-filter-results').at('li').at('.title').inner_text.strip,
  electorate: page.at('.search-filter-results').at('li').search('dd')[0].inner_text,
  party: page.at('.search-filter-results').at('li').search('dd')[1].inner_text,
  url: page.at('.search-filter-results').at('li').at('.title a').attr('href')
}

Save and run your scraper file to make sure all is well. This is a good time to do another git commit to save your progress.

Now you’ve written a scraper to get information about one member of Australian Parliament. It’s time to get information about all the members on the first page.

Currently you’re using page.at('.search-filter-results').at('li') to target the first list item in the members list. You can adapt this to get every list item using the search() method:

page.at('.search-filter-results').search('li')

Use a ruby each loop to run your code to collect and print your member object once for each list item.

page.at('.search-filter-results').search('li').each do |li|
  member = {
    title: li.at('.title').inner_text.strip,
    electorate: li.search('dd')[0].inner_text,
    party: li.search('dd')[1].inner_text,
    url: li.at('.title a').attr('href')
  }

  p member
end

Save and run the file and see if it collects all the members on the page as expected. Now you’re really scraping!

You still don’t have all the members though, they are split over 3 pages and you only have the first. In our next post we’ll work out how to deal with this pagination.

This entry was posted in Morph and tagged , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*
*

Subscribe without commenting