Matchers

Creating your own matcher

A matcher is a simple function taking the data to be evaluated as argument(s) and returning a boolean value according to its validity.
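As a minimal sketch of that contract, the function below takes the data to evaluate and returns a boolean. The name price_matches and its max_price parameter are hypothetical and are not part of scrapy_mosquitera.

```python
def price_matches(data, max_price=100.0):
    """Hypothetical matcher: valid when data is a number no greater than max_price."""
    try:
        return float(data) <= max_price
    except (TypeError, ValueError):
        # Input that cannot be read as a number is treated as invalid.
        return False
```

Returning False for unparseable input, instead of raising, is a design choice of this sketch; a stricter matcher could let the exception propagate.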

Current matchers

Date Matchers

The date matchers accept several keywords to delimit their date range. The keywords are split into those that set the minimum date and those that set the maximum date. For the minimum date, in order of precedence, they are:

  • min_date
  • on
  • after
  • since

And for the maximum date:

  • max_date
  • on
  • before

Their values can be strings parseable by dateparser, or date or datetime objects. None is also accepted, in which case that limit isn’t checked.
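The keyword precedence and the None behaviour can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names first_present and in_range are hypothetical, only date objects are handled (no dateparser strings), both bounds are assumed to be inclusive, and the first keyword that was passed is assumed to win even when its value is None.

```python
from datetime import date

# Keyword names in the order of precedence listed above.
MIN_KEYS = ("min_date", "on", "after", "since")
MAX_KEYS = ("max_date", "on", "before")


def first_present(kwargs, keys):
    """Return the value of the first keyword that was passed, or None."""
    for key in keys:
        if key in kwargs:
            return kwargs[key]
    return None


def in_range(day, **kwargs):
    """Check a date against the resolved limits; a None limit is skipped."""
    minimum = first_present(kwargs, MIN_KEYS)
    maximum = first_present(kwargs, MAX_KEYS)
    if minimum is not None and day < minimum:
        return False
    if maximum is not None and day > maximum:
        return False
    return True
```

For example, in_range(date(2016, 4, 10), min_date=date(2016, 4, 15), since=date(2016, 4, 1)) is False in this sketch, because min_date takes precedence over since.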

scrapy_mosquitera.matchers.date_matches(data, **kwargs)

Return True if data is a date in the valid date range. Otherwise False.

Parameters:
  • data (string, date or datetime) – the date to validate
  • kwargs (dict) – special delimitation parameters
Return type: bool

scrapy_mosquitera.matchers.date_in_period_matches(data, period='day', check_maximum=True, **kwargs)

Return True if data is a date in the valid date range defined by period. Otherwise False.

This matcher is ideal for cases like the following one.

A forum post is created on 04-10-2016. Then, on 04-28-2016, I try to scrape the forum covering the last few days. However, the forum doesn’t display the post date, only phrases like “X weeks ago”. So, in the forum’s nomenclature, the posts fall into the ranges in the following table:

Start date    End date      Name
04-15-2016    04-21-2016    One week ago
04-08-2016    04-14-2016    Two weeks ago
04-01-2016    04-07-2016    Three weeks ago

On 04-28-2016, calculating “two weeks ago” returns 04-14-2016. In other words, we are working with fixed dates while the forum works with date ranges. So, if I scrape back to 04-10-2016, the crawl will miss the posts from 04-10-2016 to 04-13-2016, since the last valid date would be two weeks ago (three weeks ago is out of scope because 04-07-2016 < 04-10-2016).

This matcher solves the problem: you provide the period (in this case, week) and you won’t miss items because of coverage issues. However, it is inclusive: to satisfy the date 04-10-2016 it includes the full week [04-08-2016, 04-14-2016], so the results should be post-filtered to keep only valid items.
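The widening of a limit to its enclosing week, followed by post-filtering, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: week_bucket and week_period_matches are hypothetical helpers, weeks are counted back in 7-day buckets from the crawl date as in the table above, and only the minimum limit is handled.

```python
from datetime import date, timedelta


def week_bucket(day, today):
    """Return (start, end) of the 7-day bucket, counted back from today, that contains day."""
    weeks = (today - day).days // 7
    end = today - timedelta(weeks=weeks)
    return end - timedelta(days=6), end


def week_period_matches(data, min_date, today):
    """Accept data if it is not older than the start of min_date's bucket."""
    start, _ = week_bucket(min_date, today)
    return data >= start


today = date(2016, 4, 28)
min_date = date(2016, 4, 10)

candidates = [date(2016, 4, 7), date(2016, 4, 8), date(2016, 4, 12), date(2016, 4, 20)]
# The widened check keeps everything from the start of the enclosing week onwards.
kept = [d for d in candidates if week_period_matches(d, min_date, today)]
# Post-filter: drop dates from the widened week that fall before the real limit.
valid = [d for d in kept if d >= min_date]
```

Here kept still contains 04-08-2016, which belongs to the widened week but precedes the real limit; the post-filter removes it and leaves only 04-12-2016 and 04-20-2016.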

Parameters:
  • data (string, date or datetime) – the date to validate
  • period (string) – the period to evaluate (‘day’, ‘month’, ‘year’)
  • check_maximum (bool) – whether to check the maximum date
  • kwargs (dict) – special delimitation parameters
Return type: bool