Matchers
Creating your own matcher
A matcher is a simple function that takes the data to be evaluated as its argument(s) and returns a boolean indicating whether that data is valid.
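As an illustration of the definition above, here is a minimal sketch of a custom matcher. The name `price_matches` and its `max_price` parameter are hypothetical and are not part of scrapy_mosquitera; the sketch only shows the general shape of a function that receives data and returns a boolean.

```python
def price_matches(data, max_price=100.0):
    """Hypothetical matcher: True if `data` is a price between 0 and max_price."""
    try:
        price = float(data)
    except (TypeError, ValueError):
        # Data that cannot be read as a number is treated as invalid.
        return False
    return 0 <= price <= max_price
```

A matcher written this way never raises on bad input; it simply reports the data as invalid, which keeps the calling code free of error handling.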
Current matchers
Date Matchers
The date matchers accept several keyword arguments to delimit their date range. The keywords are split into those that set the minimum date and those that set the maximum date. In order of precedence, the keywords for the minimum date are:
- min_date
- on
- after
- since
And for maximum date:
- max_date
- on
- before
Their values can be date strings parseable by dateparser, or date or datetime objects.
They also accept None, in which case that limit isn’t checked.
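The precedence rules above can be sketched as follows. This is not the library's implementation: the helpers `pick_limit` and `in_range` are hypothetical, the sketch only handles date and datetime values (string parsing with dateparser is left out), and it assumes that both limits are inclusive and that the first keyword present wins even when its value is None.

```python
from datetime import date, datetime

# Keywords in order of precedence, as listed above.
MIN_KEYS = ("min_date", "on", "after", "since")
MAX_KEYS = ("max_date", "on", "before")


def pick_limit(kwargs, keys):
    """Return the value of the first keyword from `keys` found in kwargs, else None."""
    for key in keys:
        if key in kwargs:
            return kwargs[key]
    return None


def to_date(value):
    """Reduce a datetime to a date; leave date and None untouched."""
    if isinstance(value, datetime):
        return value.date()
    return value


def in_range(day, **kwargs):
    """Hypothetical check: True if `day` lies within the limits (assumed inclusive)."""
    lower = to_date(pick_limit(kwargs, MIN_KEYS))
    upper = to_date(pick_limit(kwargs, MAX_KEYS))
    if lower is not None and day < lower:
        return False
    if upper is not None and day > upper:
        return False
    return True
```

For example, `in_range(date(2016, 4, 10), min_date=date(2016, 4, 15), after=date(2016, 4, 1))` is False under these assumptions, because `min_date` takes precedence over `after`.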
scrapy_mosquitera.matchers.date_matches(data, **kwargs)
Return True if data is a date in the valid date range. Otherwise False.
Parameters:
- data (string, date or datetime) – the date to validate
- kwargs (dict) – special delimitation parameters
Return type: bool
scrapy_mosquitera.matchers.date_in_period_matches(data, period='day', check_maximum=True, **kwargs)
Return True if data is a date in the valid date range defined by period. Otherwise False.
This matcher is ideal for cases like the following one.
A forum post is created on 04-10-2016. Then, on 04-28-2016, I try to scrape the forum covering the last few days. However, the forum doesn’t display the post date, only phrases like X weeks ago. In the forum’s nomenclature, posts fall into the ranges in the following table:
Start date   End date     Name
04-15-2016   04-21-2016   One week ago
04-08-2016   04-14-2016   Two weeks ago
04-01-2016   04-07-2016   Three weeks ago

On 04-28-2016, calculating two weeks ago returns 04-14-2016. Compared with what the forum means, we’re working with fixed dates while the forum works with date ranges. So, if I scrape back to 04-10-2016, the crawl will miss the posts from 04-10-2016 to 04-13-2016, since the last valid date would be two weeks ago (three weeks ago is out of scope, because 04-07-2016 < 04-10-2016).
This matcher solves that problem: you provide the period (in this case, week) and you won’t miss items because of coverage gaps. However, the match is inclusive: to satisfy the date 04-10-2016 it accepts the full week [04-08-2016, 04-14-2016], so the results should be post-filtered to keep only valid items.
Parameters:
- data (string, date or datetime) – the date to validate
- period (string) – the period to evaluate (‘day’, ‘month’, ‘year’)
- check_maximum (bool) – check maximum date
- kwargs (dict) – special delimitation parameters
Return type: bool
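The forum example above can be sketched as follows. This is only an illustration of the idea, not the library's implementation: the helpers `week_bucket` and `post_filter` are hypothetical, and the sketch assumes that an "N weeks ago" range is counted back from the crawl date in blocks of seven days, as in the table.

```python
from datetime import date, timedelta


def week_bucket(day, today):
    """Return (start, end) of the relative 'N weeks ago' range containing `day`."""
    weeks_ago = (today - day).days // 7
    end = today - timedelta(days=7 * weeks_ago)
    start = end - timedelta(days=6)
    return start, end


def post_filter(days, min_date):
    """Keep only the dates on or after the real minimum date."""
    return [d for d in days if d >= min_date]


today = date(2016, 4, 28)
min_date = date(2016, 4, 10)

# The whole range containing the minimum date is accepted by the period check.
start, end = week_bucket(min_date, today)

# Dates accepted by the period check still need the post-filtering step.
accepted = [date(2016, 4, 8), date(2016, 4, 10), date(2016, 4, 13)]
valid = post_filter(accepted, min_date)
```

Here `start` and `end` are 04-08-2016 and 04-14-2016, matching the "Two weeks ago" row of the table, and `valid` drops the 04-08-2016 entry that the inclusive period check let through.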