Side by Side Diff: sitescripts/crawler/README.md

Issue 8492019: sitescripts: Collect unmatched filters (Closed)
Patch Set: Created Oct. 2, 2012, 5:02 a.m.
OLD | NEW
1 crawler | 1 crawler
2 ======= | 2 =======
3 | 3
4 Backend for the Adblock Plus Crawler. It provides the following URLs: | 4 Backend for the Adblock Plus Crawler. It provides the following URLs:
5 | 5
6 * */crawlableSites* - Return a list of sites to be crawled | 6 * */crawlableSites* - Return a list of sites to be crawled
7 * */crawlerData* - Receive data on filtered elements | 7 * */crawlerRequests* - Receive all requests made, and whether they were filtered
8 | 8
9 Required packages | 9 Required packages
10 ----------------- | 10 -----------------
11 | 11
12 * [simplejson](http://pypi.python.org/pypi/simplejson/) | 12 * [simplejson](http://pypi.python.org/pypi/simplejson/)
13 | 13
14 Database setup | 14 Database setup
15 -------------- | 15 --------------
16 | 16
17 Just execute the statements in _schema.sql_. | 17 Just execute the statements in _schema.sql_.
18 | 18
19 Configuration | 19 Configuration
20 ------------- | 20 -------------
21 | 21
22 Just add an empty _crawler_ section to _/etc/sitescripts_ or _.sitescripts_. | 22 Just add an empty _crawler_ section to _/etc/sitescripts_ or _.sitescripts_.
23 | 23
| 24 If you want to import crawlable sites or domain-specific filters from
| 25 easylist (see below), you need to make _easylist\_repository_ point to
| 26 the local Mercurial repository of easylist.
| 27
24 Also make sure that the following keys are configured in the _DEFAULT_ | 28 Also make sure that the following keys are configured in the _DEFAULT_
25 section: | 29 section:
26 | 30
27 * _database_ | 31 * _database_
28 * _dbuser_ | 32 * _dbuser_
29 * _dbpassword_ | 33 * _dbpassword_
30 * _basic\_auth\_realm_ | 34 * _basic\_auth\_realm_
31 * _basic\_auth\_username_ | 35 * _basic\_auth\_username_
32 * _basic\_auth\_password_ | 36 * _basic\_auth\_password_
33 | 37
34 Extracting crawler sites | 38 Importing crawlable sites from easylist
35 ------------------------ | 39 ---------------------------------------
36 | 40
37 Make _filter\_list\_repository_ in the _crawler_ configuration section | 41 python -m sitescripts.crawler.bin.import_sites
38 point to the local Mercurial repository of a filter list. |
39 | 42
40 Then execute the following: | 43 Importing domain-specific filters from easylist
| 44 -----------------------------------------------
41 | 45
42 python -m sitescripts.crawler.bin.extract_sites > sites.sql | 46 python -m sitescripts.crawler.bin.import_filters
43 |
44 Now you can execute the insert statements from _crawler.sql_. |
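
For anyone trying out the right-hand (new) README locally, the configuration it describes can be sanity-checked with a small script. The following is only a sketch under stated assumptions: it uses Python 3's standard `configparser` rather than whatever config loader sitescripts itself uses, the function name `check_crawler_config` and the sample values are hypothetical, and it assumes _easylist\_repository_ lives in the _crawler_ section, which the README does not state explicitly.

```python
import configparser

# Keys the README says must be present in the DEFAULT section.
REQUIRED_DEFAULT_KEYS = (
    "database",
    "dbuser",
    "dbpassword",
    "basic_auth_realm",
    "basic_auth_username",
    "basic_auth_password",
)


def check_crawler_config(text):
    """Return a list of problems found in the given config text.

    Hypothetical helper, not part of sitescripts.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []

    # The README asks for a (possibly empty) crawler section.
    if not parser.has_section("crawler"):
        problems.append("missing [crawler] section")

    # The database and basic-auth keys are expected in DEFAULT.
    defaults = parser.defaults()
    for key in REQUIRED_DEFAULT_KEYS:
        if key not in defaults:
            problems.append("missing DEFAULT key: " + key)

    # Assumption: easylist_repository is looked up via the crawler section
    # (has_option also falls back to DEFAULT). Only the import scripts need it.
    if parser.has_section("crawler") and not parser.has_option(
        "crawler", "easylist_repository"
    ):
        problems.append("easylist_repository not set (needed for the import scripts)")

    return problems


# Placeholder values for illustration only.
SAMPLE = """
[DEFAULT]
database = crawler
dbuser = crawler
dbpassword = secret
basic_auth_realm = Crawler
basic_auth_username = user
basic_auth_password = secret

[crawler]
easylist_repository = /path/to/easylist
"""

if __name__ == "__main__":
    for problem in check_crawler_config(SAMPLE):
        print(problem)
```

Running such a check before `import_sites` or `import_filters` would surface a missing _easylist\_repository_ early, instead of partway through an import.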