diff --git a/README.md b/README.md
index 082e4e1..18a006b 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,19 @@
 # UK Planning Scraper
 
-**PRE-ALPHA: Only works with Idox and Northgate sites and spews a lot of stuff to STDOUT. Not for production use.**
+**PRE-ALPHA: Only works with Idox and Northgate sites and spews a lot of stuff
+to STDOUT. Not for production use.**
 
-This gem scrapes planning applications data from UK local planning authority websites, eg Westminster City Council. Data is returned as an array of hashes, one hash for each planning application.
+This gem scrapes planning applications data from UK local planning authority
+websites, eg Westminster City Council. Data is returned as an array of hashes,
+one hash for each planning application.
 
-This scraper gem doesn't use a database. Storing the output is up to you. It's just a convenient way to get the data.
+This scraper gem doesn't use a database. Storing the output is up to you. It's
+just a convenient way to get the data.
 
-Currently this only works for Idox and Northgate sites. The ultimate aim is to provide a consistent interface in a single gem for all variants of all planning systems: Idox Public Access, Northgate Planning Explorer, OcellaWeb, Agile Planning and all the one-off systems.
+Currently this only works for Idox and Northgate sites. The ultimate aim is to
+provide a consistent interface in a single gem for all variants of all planning
+systems: Idox Public Access, Northgate Planning Explorer, OcellaWeb, Agile
+Planning and all the one-off systems.
 
 This project is not affiliated with any organisation.
 
@@ -15,7 +22,8 @@ This project is not affiliated with any organisation.
 Add this line to your application's Gemfile:
 
 ```ruby
-gem 'uk_planning_scraper', :git => 'https://github.com/adrianshort/uk_planning_scraper/'
+gem 'uk_planning_scraper', \
+  git: 'https://github.com/adrianshort/uk_planning_scraper/'
 ```
 
 And then execute:
@@ -38,66 +46,87 @@ require 'pp'
 
 ### Scrape from a council
 
+Applications in Westminster decided in the last seven days:
+
 ```ruby
-apps = UKPlanningScraper::Authority.named('Westminster').scrape({ decided_days: 7 })
-pp apps
+pp UKPlanningScraper::Authority.named('Westminster').decided_days(7).scrape
 ```
 
 ### Scrape from a bunch of councils
 
+Scrape the last week's planning decisions across the whole of
+London (actually 23 of the 35 authorities right now):
+
 ```ruby
-auths = UKPlanningScraper::Authority.tagged('london')
+authorities = UKPlanningScraper::Authority.tagged('london')
 
-auths.each do |auth|
-  apps = auth.scrape({ decided_days: 7 })
-  pp apps # You'll probably want to save `apps` to your database here
+authorities.each do |authority|
+  applications = authority.decided_days(7).scrape
+  pp applications
+  # You'll probably want to save `applications` to your database here
 end
 ```
 
-Yes, we just scraped the last week's planning decisions across the whole of London (actually 23 of the 35 authorities right now) with five lines of code.
-
 ### Satisfy your niche interests
 
+Launderette applications validated in the last seven days in Scotland:
+
 ```ruby
-auths = UKPlanningScraper::Authority.tagged('scotland')
+authorities = UKPlanningScraper::Authority.tagged('scotland')
 
-auths.each do |auth|
-  apps = auth.scrape({ validated_days: 7, keywords: 'launderette' })
-  pp apps # You'll probably want to save `apps` to your database here
+authorities.each do |authority|
+  applications = authority.validated_days(7).keywords('launderette').scrape
+  pp applications # You'll probably want to save `applications` to your database here
 end
 ```
 
-### More search parameters
+### More scrape parameter methods
+
+Chain as many scrape parameter methods on a `UKPlanningScraper::Authority`
+object as you like, making sure that `scrape` comes last.
 
 ```ruby
+received_from(Date.parse("1 Jan 2016"))
+received_to(Date.parse("31 Dec 2016"))
+
+# Received in the last n days (including today)
+# Use instead of received_to, received_from
+received_days(7)
+
+validated_to(Date.today)
+validated_from(Date.today - 30)
+validated_days(7) # instead of validated_to, validated_from
+
+decided_to(Date.today)
+decided_from(Date.today - 30)
+decided_days(7) # instead of decided_to, decided_from
+
+# Check that the systems you're scraping return the
+# results you expect for multiple keywords (AND or OR?)
+keywords("hip gable")
+
+applicant_name("Mr and Mrs Smith") # Currently Idox only
+application_type("Householder") # Currently Idox only
+development_type("") # Currently Idox only
 
-# Don't try these all at once
-params = {
-  received_to: Date.today,
-  received_from: Date.today - 30,
-  received_days: 7, # instead of received_to, received_from
-  validated_to: Date.today,
-  validated_from: Date.today - 30,
-  validated_days: 7, # instead of validated_to, validated_from
-  decided_to: Date.today,
-  decided_from: Date.today - 30,
-  decided_days: 7 # instead of decided_to, decided_from
-  keywords: "hip gable", # Check that the systems you're scraping return the results you expect for multiple keywords (AND or OR?)
-}
-
-apps = UKPlanningScraper::Authority.named('Camden').scrape(params)
+scrape # runs the scraper
 ```
 
 ### Save to a SQLite database
 
-This gem has no interest whatsoever in persistence. What you do with the data it outputs is up to you: relational databases, document stores, VHS and clay tablets are all blissfully none of its business. But using the [ScraperWiki](https://github.com/openaustralia/scraperwiki-ruby) gem is a really easy way to store your data:
+This gem has no interest whatsoever in persistence. What you do with the data it
+outputs is up to you: relational databases, document stores, VHS and clay
+tablets are all blissfully none of its business. But using the
+[ScraperWiki](https://github.com/openaustralia/scraperwiki-ruby) gem is a really
+easy way to store your data:
 
 ```ruby
 require 'scraperwiki' # Must be installed, of course
-ScraperWiki.save_sqlite([:authority_name, :council_reference], apps)
+ScraperWiki.save_sqlite([:authority_name, :council_reference], applications)
 ```
 
-That `apps` param can be a hash or an array of hashes, which is what gets returned by our `Authority.scrape`.
+That `applications` param can be a hash or an array of hashes, which is what
+gets returned by our `Authority.scrape`.
 
 ### Find authorities by tag
 
@@ -130,11 +159,18 @@ and whatever you'd like to add that would be useful to others.
 
 ### WTF is up with London?
 
-London has got 32 London Boroughs, tagged `londonboroughs`. These are the councils under the authority of the Mayor of London and the Greater London Authority.
+London has got 32 London Boroughs, tagged `londonboroughs`. These are the
+councils under the authority of the Mayor of London and the Greater London
+Authority.
 
-It has 33 councils: the London Boroughs plus the City of London (named `City of London`). We don't currently have a tag for this, but if you want to add `londoncouncils` please go ahead.
+It has 33 councils: the London Boroughs plus the City of London (named `City of
+London`). We don't currently have a tag for this, but if you want to add
+`londoncouncils` please go ahead.
 
-And it's got 35 local planning authorities: the 33 councils plus the two `londondevelopmentcorporations`, named `London Legacy Development Corporation` and `Old Oak and Park Royal Development Corporation`. The tag `london` covers all (and only) the 35 local planning authorities in London.
+And it's got 35 local planning authorities: the 33 councils plus the two
+`londondevelopmentcorporations`, named `London Legacy Development Corporation`
+and `Old Oak and Park Royal Development Corporation`. The tag `london` covers
+all (and only) the 35 local planning authorities in London.
 
 ```ruby
 UKPlanningScraper::Authority.tagged('londonboroughs').size
@@ -151,13 +187,13 @@ UKPlanningScraper::Authority.tagged('london').size
 
 ```ruby
 UKPlanningScraper::Authority.named('Merton').tags
-# => ["england", "london", "londonboroughs", "northgate", "outerlondon", "southlondon"]
+ # => ["england", "london", "londonboroughs", "northgate", "outerlondon", "southlondon"]
 
 UKPlanningScraper::Authority.not_tagged('london')
-# => [...]
+ # => [...]
 
 UKPlanningScraper::Authority.named('Islington').tagged?('southlondon')
-# => false
+ # => false
 ```
 
 ### List all authorities
@@ -177,15 +213,20 @@ The list of authorities is in a CSV file in `/lib/uk_planning_scraper`:
 
 https://github.com/adrianshort/uk_planning_scraper/blob/master/lib/uk_planning_scraper/authorities.csv
 
-The easiest way to add to or edit this list is to edit within GitHub (use the pencil icon) and create a new pull request for your changes. If accepted, your changes will be available to everyone with the next version of the gem.
+The easiest way to add to or edit this list is to edit within GitHub (use the
+ pencil icon) and create a new pull request for your changes. If accepted, your
+ changes will be available to everyone with the next version of the gem.
 
 The file format is one line per authority, with comma-separated:
 
-- Name (omit "the", "council", "borough of", "city of", etc. and write "and" not "&", except for `City of London` which is a special case)
+- Name (omit "the", "council", "borough of", "city of", etc. and write "and" not
+ "&", except for `City of London` which is a special case)
 - URL of the search form (use the advanced search URL if there is one)
-- Tags (use as many comma-separated tags as is reasonable, lowercase and all one word.)
+- Tags (use as many comma-separated tags as is reasonable, lowercase and all one
+ word.)
 
-There's no need to manually add tags to the `authorities.csv` file for the software systems like `idox`, `northgate` etc as these are added automatically.
+There's no need to manually add tags to the `authorities.csv` file for the
+software systems like `idox`, `northgate` etc as these are added automatically.
 
 Please check the tag list before you change anything:
 
@@ -195,10 +236,17 @@ pp UKPlanningScraper::Authority.tags
 
 ## Development
 
-After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+After checking out the repo, run `bin/setup` to install dependencies. You can
+also run `bin/console` for an interactive prompt that will allow you to
+experiment.
 
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+To install this gem onto your local machine, run `bundle exec rake install`. To
+release a new version, update the version number in `version.rb`, and then run
+`bundle exec rake release`, which will create a git tag for the version, push
+git commits and tags, and push the `.gem` file to
+[rubygems.org](https://rubygems.org).
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/adrianshort/uk_planning_scraper.
+Bug reports and pull requests are welcome on GitHub at
+https://github.com/adrianshort/uk_planning_scraper.
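Review note on the README changes: the `received_days(7)` comment says the window includes today, which is why the code computes the start date as `Date.today - (n - 1)` rather than `Date.today - n`. The sketch below isolates that arithmetic so it can be checked on its own. `last_n_days_window` is a hypothetical helper written for this note, not a method the gem provides, and it takes `today` as an argument only to keep the example deterministic.

```ruby
require 'date'

# Hypothetical helper (not part of the gem): the inclusive date range
# covered by an "in the last n days" search, counting `today` as day 1.
def last_n_days_window(n, today = Date.today)
  unless n.is_a?(Integer) && n > 0
    raise ArgumentError, "n must be an integer greater than 0"
  end

  # Subtract n - 1 so that the window holds exactly n calendar days
  [today - (n - 1), today]
end

from, to = last_n_days_window(7, Date.new(2018, 9, 21))
puts "#{from} to #{to}" # prints "2018-09-15 to 2018-09-21"
```

With `n = 1` the window collapses to today alone, which matches the "greater than 0" guard on the `*_days` methods.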
diff --git a/lib/uk_planning_scraper.rb b/lib/uk_planning_scraper.rb
index ece96e1..3dd2d0a 100644
--- a/lib/uk_planning_scraper.rb
+++ b/lib/uk_planning_scraper.rb
@@ -1,5 +1,6 @@
 require "uk_planning_scraper/version"
 require "uk_planning_scraper/authority"
+require "uk_planning_scraper/authority_scrape_params"
 require "uk_planning_scraper/application"
 require 'uk_planning_scraper/idox'
 require 'uk_planning_scraper/northgate'
diff --git a/lib/uk_planning_scraper/authority.rb b/lib/uk_planning_scraper/authority.rb
index 06257d0..bac57ce 100644
--- a/lib/uk_planning_scraper/authority.rb
+++ b/lib/uk_planning_scraper/authority.rb
@@ -3,6 +3,7 @@ require 'csv'
 module UKPlanningScraper
   class Authority
     attr_reader :name, :url
+
     @@authorities = []
 
     def initialize(name, url)
@@ -10,43 +11,25 @@ module UKPlanningScraper
       @url = url.strip
       @tags = [] # Strings in arbitrary order
       @applications = [] # Application objects
+      @scrape_params = {}
     end
 
-    def scrape(params, options = {})
+    def scrape(options = {})
       default_options = {
         delay: 10,
       }
-      options = default_options.merge(options) # The user-supplied options override the defaults
-
-      # Validated within the last n days
-      # Assumes that every scraper/system can do a date range search
-      if params[:validated_days]
-        params[:validated_to] = Date.today
-        params[:validated_from] = Date.today - (params[:validated_days] - 1)
-      end
-
-      # Received within the last n days
-      # Assumes that every scraper/system can do a date range search
-      if params[:received_days]
-        params[:received_to] = Date.today
-        params[:received_from] = Date.today - (params[:received_days] - 1)
-      end
-
-      # Decided within the last n days
-      # Assumes that every scraper/system can do a date range search
-      if params[:decided_days]
-        params[:decided_to] = Date.today
-        params[:decided_from] = Date.today - (params[:decided_days] - 1)
-      end
-
+      # The user-supplied options override the defaults
+      options = default_options.merge(options)
+
       # Select which scraper to use
       case system
       when 'idox'
-        @applications = scrape_idox(params, options)
+        @applications = scrape_idox(@scrape_params, options)
       when 'northgate'
-        @applications = scrape_northgate(params, options)
+        @applications = scrape_northgate(@scrape_params, options)
       else
-        raise SystemNotSupported.new("Planning system not supported for #{@name} at URL: #{@url}")
+        raise SystemNotSupported.new("Planning system not supported for \
+          #{@name} at URL: #{@url}")
       end
 
       # Post processing
@@ -58,6 +41,10 @@ module UKPlanningScraper
       output = []
       # FIXME - silently ignores invalid apps. How should we handle them?
       @applications.each { |app| output << app.to_hash if app.valid? }
+
+      # Reset so that old params don't get used for new scrapes
+      clear_scrape_params
+
       output # Single point of successful exit
     end
 
@@ -82,15 +69,15 @@ module UKPlanningScraper
 
     def system
       if @url.match(/search\.do\?action=advanced/i)
-        s = 'idox'
+        'idox'
       elsif @url.match(/generalsearch\.aspx/i)
-        s = 'northgate'
+        'northgate'
       elsif @url.match(/ocellaweb/i)
-        s = 'ocellaweb'
+        'ocellaweb'
       elsif @url.match(/\/apas\//)
-        s = 'agileplanning'
+        'agileplanning'
       else
-        s = 'unknownsystem'
+        'unknownsystem'
       end
     end
 
@@ -135,7 +122,8 @@ module UKPlanningScraper
     def self.load
       # Don't run this method more than once
       return unless @@authorities.empty?
-      CSV.foreach(File.join(File.dirname(__dir__), 'uk_planning_scraper', 'authorities.csv')) do |line|
+      CSV.foreach(File.join(File.dirname(__dir__), 'uk_planning_scraper', \
+        'authorities.csv')) do |line|
         auth = Authority.new(line[0], line[1])
         auth.add_tags(line[2..-1])
         auth.add_tag(auth.system)
diff --git a/lib/uk_planning_scraper/authority_scrape_params.rb b/lib/uk_planning_scraper/authority_scrape_params.rb
new file mode 100644
index 0000000..819c6e2
--- /dev/null
+++ b/lib/uk_planning_scraper/authority_scrape_params.rb
@@ -0,0 +1,134 @@
+require 'date'
+
+module UKPlanningScraper
+  class Authority
+    # Parameter methods for Authority#scrape
+    # Designed to be method chained, eg:
+    #
+    # applications = UKPlanningScraper::Authority.named("Barnet"). \
+    #   development_type("Q22").keywords("illuminat"). \
+    #   validated_days(30).scrape
+
+    def validated_days(n)
+      # Validated within the last n days
+      # Assumes that every scraper/system can do a date range search
+      check_class(n, Fixnum)
+
+      unless n > 0
+        raise ArgumentError.new("validated_days must be greater than 0")
+      end
+
+      validated_from(Date.today - (n - 1))
+      validated_to(Date.today)
+      self
+    end
+
+    def received_days(n)
+      # received within the last n days
+      # Assumes that every scraper/system can do a date range search
+      check_class(n, Fixnum)
+
+      unless n > 0
+        raise ArgumentError.new("received_days must be greater than 0")
+      end
+
+      received_from(Date.today - (n - 1))
+      received_to(Date.today)
+      self
+    end
+
+    def decided_days(n)
+      # decided within the last n days
+      # Assumes that every scraper/system can do a date range search
+      check_class(n, Fixnum)
+
+      unless n > 0
+        raise ArgumentError.new("decided_days must be greater than 0")
+      end
+
+      decided_from(Date.today - (n - 1))
+      decided_to(Date.today)
+      self
+    end
+
+    def applicant_name(s)
+      unless system == 'idox'
+        raise NoMethodError.new("applicant_name is only implemented for Idox. \
+          This authority (#{@name}) is #{system.capitalize}.")
+      end
+
+      check_class(s, String)
+      @scrape_params[:applicant_name] = s.strip
+      self
+    end
+
+    def application_type(s)
+      unless system == 'idox'
+        raise NoMethodError.new("application_type is only implemented for \
+          Idox. This authority (#{@name}) is #{system.capitalize}.")
+      end
+
+      check_class(s, String)
+      @scrape_params[:application_type] = s.strip
+      self
+    end
+
+    def development_type(s)
+      unless system == 'idox'
+        raise NoMethodError.new("development_type is only implemented for \
+          Idox. This authority (#{@name}) is #{system.capitalize}.")
+      end
+
+      check_class(s, String)
+      @scrape_params[:development_type] = s.strip
+      self
+    end
+
+    private
+
+    # Handle the simple params with this
+    def method_missing(method_name, *args)
+      sc_params = {
+        validated_from: Date,
+        validated_to: Date,
+        received_from: Date,
+        received_to: Date,
+        decided_from: Date,
+        decided_to: Date,
+        keywords: String
+      }
+
+      value = args[0]
+
+      if sc_params[method_name]
+        check_class(value, sc_params[method_name], method_name.to_s)
+        value.strip! if value.class == String
+
+        if value.class == Date && value > Date.today
+          raise ArgumentError.new("#{method_name} can't be a date in the " + \
+            "future (#{value.to_s})")
+        end
+
+        @scrape_params[method_name] = value
+        self
+      else
+        raise NoMethodError.new(method_name.to_s)
+      end
+    end
+
+    def clear_scrape_params
+      @scrape_params = {}
+    end
+
+    # https://stackoverflow.com/questions/5100299/how-to-get-the-name-of-the-calling-method
+    def check_class(
+      param_value,
+      expected_class,
+      param_name = caller_locations(1, 1)[0].label) # name of calling method
+      unless param_value.class == expected_class
+        raise TypeError.new("#{param_name} must be a " \
+          "#{expected_class} not a #{param_value.class.to_s}")
+      end
+    end
+  end
+end
diff --git a/lib/uk_planning_scraper/version.rb b/lib/uk_planning_scraper/version.rb
index 5bcfcc3..c413138 100644
--- a/lib/uk_planning_scraper/version.rb
+++ b/lib/uk_planning_scraper/version.rb
@@ -1,3 +1,3 @@
 module UKPlanningScraper
-  VERSION = "0.3.2"
+  VERSION = "0.4.0"
 end
diff --git a/spec/brighton_spec.rb b/spec/brighton_spec.rb
index ec6cfbc..bb828df 100644
--- a/spec/brighton_spec.rb
+++ b/spec/brighton_spec.rb
@@ -11,7 +11,7 @@ describe UKPlanningScraper::Authority do
 
   it 'returns apps' do
     apps = VCR.use_cassette("#{self.class.description}") {
-      scraper.scrape({ decided_days: 4 }, { delay: 0 })
+      scraper.decided_days(4).scrape({ delay: 0 })
     }
     pp apps
     # expect(authority).to be_a(UKPlanningScraper::Authority)
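
Review note on `authority_scrape_params.rb`: the simple parameters (`keywords`, `decided_to` and so on) are handled by a private `method_missing` that type-checks the value, stores it and returns `self` so calls can be chained. The sketch below shows that pattern in isolation, under a few assumptions that differ from the diff. `ChainableSearch` and `SIMPLE_PARAMS` are illustrative names, the type check uses `is_a?` where the diff compares `value.class`, and `respond_to_missing?` is an addition the diff does not have (it is the usual companion to `method_missing`, so that `respond_to?` agrees with what the object accepts).

```ruby
require 'date'

# Illustrative stand-in for the pattern; this is not the gem's Authority class.
class ChainableSearch
  # Parameter name => class the supplied value must be an instance of
  SIMPLE_PARAMS = {
    decided_from: Date,
    decided_to: Date,
    keywords: String
  }.freeze

  attr_reader :params

  def initialize
    @params = {}
  end

  # Keeps respond_to? consistent with what method_missing accepts
  def respond_to_missing?(method_name, include_private = false)
    SIMPLE_PARAMS.key?(method_name) || super
  end

  private

  def method_missing(method_name, *args)
    expected = SIMPLE_PARAMS[method_name]
    # Unknown names fall through to the default NoMethodError
    return super unless expected

    value = args.first
    unless value.is_a?(expected)
      raise TypeError, "#{method_name} must be a #{expected}, not a #{value.class}"
    end

    # Store a stripped copy rather than mutating the caller's string
    @params[method_name] = value.is_a?(String) ? value.strip : value
    self # returning self is what makes the calls chainable
  end
end

search = ChainableSearch.new
search.keywords(" hip gable ").decided_to(Date.new(2018, 9, 21))
p search.params
```

One design point worth weighing: the diff calls `value.strip!`, which modifies the string the caller passed in and would raise on a frozen string literal, whereas storing `value.strip` leaves the caller's object untouched.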