OLD | NEW |
1 crawler | 1 crawler |
2 ======= | 2 ======= |
3 | 3 |
4 Backend for the Adblock Plus Crawler. It provides the following URLs: | 4 Backend for the Adblock Plus Crawler. It provides the following URLs: |
5 | 5 |
6 * */crawlableSites* - Return a list of sites to be crawled | 6 * */crawlableSites* - Return a list of sites to be crawled |
7 * */crawlerData* - Receive data on filtered elements | 7 * */crawlerRequests* - Receive all requests made, and whether they were filtered |
8 | 8 |
9 Required packages | 9 Required packages |
10 ----------------- | 10 ----------------- |
11 | 11 |
12 * [simplejson](http://pypi.python.org/pypi/simplejson/) | 12 * [simplejson](http://pypi.python.org/pypi/simplejson/) |
13 | 13 |
14 Database setup | 14 Database setup |
15 -------------- | 15 -------------- |
16 | 16 |
17 Just execute the statements in _schema.sql_. | 17 Just execute the statements in _schema.sql_. |
18 | 18 |
19 Configuration | 19 Configuration |
20 ------------- | 20 ------------- |
21 | 21 |
22 Just add an empty _crawler_ section to _/etc/sitescripts_ or _.sitescripts_. | 22 Just add an empty _crawler_ section to _/etc/sitescripts_ or _.sitescripts_. |
23 | 23 |
| 24 If you want to import crawlable sites or domain-specific filters from |
| 25 easylist (see below), you need to make _easylist\_repository_ point to |
| 26 the local Mercurial repository of easylist. |
| 27 |
24 Also make sure that the following keys are configured in the _DEFAULT_ | 28 Also make sure that the following keys are configured in the _DEFAULT_ |
25 section: | 29 section: |
26 | 30 |
27 * _database_ | 31 * _database_ |
28 * _dbuser_ | 32 * _dbuser_ |
29 * _dbpassword_ | 33 * _dbpassword_ |
30 * _basic\_auth\_realm_ | 34 * _basic\_auth\_realm_ |
31 * _basic\_auth\_username_ | 35 * _basic\_auth\_username_ |
32 * _basic\_auth\_password_ | 36 * _basic\_auth\_password_ |
33 | 37 |
34 Extracting crawler sites | 38 Importing crawlable sites from easylist |
35 ------------------------ | 39 --------------------------------------- |
36 | 40 |
37 Make _filter\_list\_repository_ in the _crawler_ configuration section | 41 python -m sitescripts.crawler.bin.import_sites |
38 point to the local Mercurial repository of a filter list. | |
39 | 42 |
40 Then execute the following: | 43 Importing domain-specific filters from easylist |
| 44 ----------------------------------------------- |
41 | 45 |
42 python -m sitescripts.crawler.bin.extract_sites > sites.sql | 46 python -m sitescripts.crawler.bin.import_filters |
43 | |
44 Now you can execute the insert statements from _crawler.sql_. | |
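Note on the Configuration section: the NEW README lists the required keys but does not show what a complete file looks like. Below is a minimal sketch of a checker, assuming the sitescripts configuration is an INI-style file readable by Python's `configparser`. Only the section and key names come from the README. Every value in `SAMPLE_CONFIG` is a made-up placeholder, and `missing_settings` is a hypothetical helper, not part of sitescripts.

```python
import configparser

# Placeholder values for illustration only; the section and key names are
# the ones listed in the README, the values are invented.
SAMPLE_CONFIG = """
[DEFAULT]
database = crawler
dbuser = crawler
dbpassword = secret
basic_auth_realm = Crawler
basic_auth_username = crawler
basic_auth_password = secret

[crawler]
easylist_repository = /path/to/easylist
"""

# Keys the README says must be present in the DEFAULT section.
REQUIRED_DEFAULT_KEYS = (
    "database",
    "dbuser",
    "dbpassword",
    "basic_auth_realm",
    "basic_auth_username",
    "basic_auth_password",
)


def missing_settings(config_text):
    """Return the names of required settings absent from config_text.

    Hypothetical helper: reports a missing [crawler] section and any
    missing DEFAULT keys. easylist_repository is not checked because the
    README only requires it for the two import scripts.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing = []
    if not parser.has_section("crawler"):
        missing.append("[crawler]")
    defaults = parser.defaults()
    for key in REQUIRED_DEFAULT_KEYS:
        if key not in defaults:
            missing.append(key)
    return missing


if __name__ == "__main__":
    print(missing_settings(SAMPLE_CONFIG))
```

With the sample above the helper returns an empty list. Dropping the `[crawler]` section or any of the six keys makes the corresponding name appear in the result, which could serve as a quick check before running `import_sites` or `import_filters`.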