OLD | NEW |
1 # python-abp | 1 python-abp |
| 2 ========== |
2 | 3 |
3 This repository contains a library for working with Adblock Plus filter lists | 4 This repository contains a library for working with Adblock Plus filter lists, |
4 and the script that is used for building Adblock Plus filter lists from the | 5 a script for rendering diffs between filter lists, and the script that is used |
5 form in which they are authored into the format suitable for consumption by the | 6 for building Adblock Plus filter lists from the form in which they are authored |
6 adblocking software. | 7 into the format suitable for consumption by the adblocking software (aka |
| 8 rendering). |
7 | 9 |
8 ## Installation | 10 .. contents:: |
| 11 |
| 12 |
| 13 Installation |
| 14 ------------ |
9 | 15 |
10 Prerequisites: | 16 Prerequisites: |
11 | 17 |
12 * Linux, Mac OS X or Windows (any modern Unix should work too), | 18 * Linux, Mac OS X or Windows (any modern Unix should work too), |
13 * Python (2.7 or 3.5+), | 19 * Python (2.7 or 3.5+), |
14 * pip. | 20 * pip. |
15 | 21 |
16 To install: | 22 To install:: |
17 | 23 |
18 $ pip install -U python-abp | 24 $ pip install --upgrade python-abp |
19 | 25 |
20 ## Rendering of filter lists | 26 |
| 27 Rendering of filter lists |
| 28 ------------------------- |
21 | 29 |
22 The filter lists are originally authored in relatively smaller parts focused | 30 The filter lists are originally authored in relatively smaller parts focused |
23 on a particular type of filters, related to a specific topic or relevant | 31 on particular types of filters, related to a specific topic or relevant for a |
24 for particular geographical area. | 32 particular geographical area. |
25 We call these parts _filter list fragments_ (or just _fragments_) | 33 We call these parts *filter list fragments* (or just *fragments*) to |
26 to distinguish them from full filter lists that are | 34 distinguish them from full filter lists that are consumed by the adblocking |
27 consumed by the adblocking software such as Adblock Plus. | 35 software such as Adblock Plus. |
28 | 36 |
29 Rendering is a process that combines filter list fragments into a filter list. | 37 Rendering is a process that combines filter list fragments into a filter list. |
30 It starts with one fragment that can include other ones and so forth. | 38 It starts with one fragment that can include other ones and so forth. |
31 The produced filter list is marked with a [version and a timestamp][1]. | 39 The produced filter list is marked with a `version and a timestamp <https://adblockplus.org/filters#special-comments>`_. |
32 | 40 |
33 Python-abp contains a script that can do this called `flrender`: | 41 Python-abp contains a script that can do this called ``flrender``:: |
34 | 42 |
35 $ flrender fragment.txt filterlist.txt | 43 $ flrender fragment.txt filterlist.txt |
36 | 44 |
37 This will take the top level fragment in `fragment.txt`, render it and save into | |
38 `filterlist.txt`. | |
39 | 45 |
40 The `flrender` script can also be used by only specifying `fragment.txt`: | 46 This will take the top level fragment in ``fragment.txt``, render it and save it |
41 | 47 into ``filterlist.txt``. |
42 $flrender fragment.txt | |
43 | |
44 in which case the rendering result will be sent to `stdout`. Moreover, when | |
45 it's run with no positional arguments: | |
46 | 48 |
47 $flrender | 49 The ``flrender`` script can also be used by only specifying ``fragment.txt``:: |
48 | 50 |
49 it will read from `stdin` and send the results to `stdout`. | 51 $ flrender fragment.txt |
| 52 |
| 53 |
| 54 in which case the rendering result will be sent to ``stdout``. Moreover, when |
| 55 it's run with no positional arguments:: |
| 56 |
| 57 $ flrender |
| 58 |
| 59 |
| 60 it will read from ``stdin`` and send the results to ``stdout``. |
50 | 61 |
51 Fragments might reference other fragments that should be included into them. | 62 Fragments might reference other fragments that should be included into them. |
52 The references come in two forms: http(s) includes and local includes: | 63 The references come in two forms: http(s) includes and local includes:: |
53 | 64 |
54 %include http://www.server.org/dir/list.txt% | 65 %include http://www.server.org/dir/list.txt% |
55 %include easylist:easylist/easylist_general_block.txt% | 66 %include easylist:easylist/easylist_general_block.txt% |
56 | 67 |
57 The first instruction contains a URL that will be fetched and inserted at the | 68 |
58 point of reference. | 69 The http include contains a URL that will be fetched and inserted at the point |
59 The second one contains a path inside easylist repository. | 70 of reference. |
60 `flrender` needs to be able to find a copy of the repository on the local | 71 The local include contains a path inside the easylist repository. |
61 filesystem. We use `-i` option to point it to to the right directory: | 72 ``flrender`` needs to be able to find a copy of the repository on the local |
| 73 filesystem. We use the ``-i`` option to point it to the right directory:: |
62 | 74 |
63 $ flrender -i easylist=/home/abc/easylist input.txt output.txt | 75 $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
64 | 76 |
65 Now the second reference above will be resolved to | 77 |
66 `/home/abc/easylist/easylist/easylist_general_block.txt` and the fragment will | 78 Now the local include referenced above will be resolved to: |
67 be loaded from this file. | 79 ``/home/abc/easylist/easylist/easylist_general_block.txt`` |
| 80 and the fragment will be loaded from this file. |
68 | 81 |
69 Directories that contain filter list fragments that are used during rendering | 82 Directories that contain filter list fragments that are used during rendering |
70 are called sources. | 83 are called sources. |
71 They are normally working copies of the repositories that contain filter list | 84 They are normally working copies of the repositories that contain filter list |
72 fragments. | 85 fragments. |
73 Each source is identified by a name: that's the part that comes before ":" | 86 Each source is identified by a name: that's the part that comes before ":" in |
74 in the include instruction and it should be the same as what comes before "=" | 87 the include instruction and it should be the same as what comes before "=" in |
75 in the `-i` option. | 88 the ``-i`` option. |
76 | 89 |
77 Commonly used sources have generally accepted names. For example the main | 90 Commonly used sources have generally accepted names. For example the main |
78 EasyList repository is referred to as `easylist`. | 91 EasyList repository is referred to as ``easylist``. |
79 If you don't know all the source names that are needed to render some list, | 92 If you don't know all the source names that are needed to render some list, |
80 just run `flrender` and it will report what it's missing: | 93 just run ``flrender`` and it will report what it's missing:: |
81 | 94 |
82 $ flrender easylist.txt output/easylist.txt | 95 $ flrender easylist.txt output/easylist.txt |
83 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener | 96 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
84 al_block.txt' from 'easylist.txt' | 97 al_block.txt' from 'easylist.txt' |
85 | 98 |
86 You can clone the necessary repositories to a local directory and add `-i` | 99 |
| 100 You can clone the necessary repositories to a local directory and add ``-i`` |
87 options accordingly. | 101 options accordingly. |
88 | 102 |
89 ## Rendering diffs | 103 |
| 104 Generating diffs |
| 105 ---------------- |
90 | 106 |
91 A diff allows a client running ad blocking software such as Adblock Plus to | 107 A diff allows a client running ad blocking software such as Adblock Plus to |
92 update the filter lists incrementally, instead of downloading a new copy of a | 108 update the filter lists incrementally, instead of downloading a new copy of a |
93 full list during each update. This is meant to lessen the amount of resources | 109 full list during each update. This is meant to lessen the amount of resources |
94 used when updating filter lists (e.g. network data, memory usage, battery | 110 used when updating filter lists (e.g. network data, memory usage, battery |
95 consumption, etc.), allowing clients to update their lists more frequently using | 111 consumption, etc.), allowing clients to update their lists more frequently |
96 less resources. | 112 using less resources. |
97 | 113 |
98 Python-abp contains a script called `fldiff` that will find the diff between the | 114 python-abp contains a script called ``fldiff`` that will find the diff between |
99 latest filter list, and any number of previous filter lists: | 115 the latest filter list, and any number of previous filter lists:: |
100 | 116 |
101 $ fldiff -o diffs/easylist easylist.txt archive/* | 117 $ fldiff -o diffs/easylist/ easylist.txt archive/* |
102 | 118 |
103 where `-o diffs/easylist` is the (optional) output directory where the diffs | 119 |
104 should be written, `easylist.txt` is the most recent version of the filter list, | 120 where ``-o diffs/easylist/`` is the (optional) output directory where the diffs |
105 and `archive/*` is the directory where all the archived filter lists are. When | 121 should be written, ``easylist.txt`` is the most recent version of the filter |
106 called like this, the shell should automatically expand the `archive/*` | 122 list, and ``archive/*`` is the directory where all the archived filter lists are. |
| 123 When called like this, the shell should automatically expand the ``archive/*`` |
107 directory, giving the script each of the filenames separately. | 124 directory, giving the script each of the filenames separately. |
108 | 125 |
109 In the above example, the output of each archived `list[version].txt` will be | 126 In the above example, the output of each archived ``list[version].txt`` will be |
110 written to `diffs/diff[version].txt`. If the output argument is omitted, the | 127 written to ``diffs/diff[version].txt``. If the output argument is omitted, the |
111 diffs will be written to the current directory. | 128 diffs will be written to the current directory. |
112 | 129 |
113 The script produces three types of lines, as specified in the [technical | 130 The script produces three types of lines, as specified in the `technical |
114 specification][5]: | 131 specification <https://docs.google.com/document/d/1SoEqaOBZRCfkh1s5Kds5A5RwUC_nqbYYlGH72sbsSgQ/>`_: |
115 | |
116 * Special comments of the form `! <name>:[ <value>]` | |
117 * Added filters of the form `+ <filter-text>` | |
118 * Removed filters of the form `- <filter-text>` | |
119 | 132 |
120 | 133 |
121 ## Library API | 134 * Special comments of the form ``! <name>:[ <value>]`` |
| 135 * Added filters of the form ``+ <filter-text>`` |
| 136 * Removed filters of the form ``- <filter-text>`` |
122 | 137 |
123 Python-abp can also be used as a library for parsing filter lists. For example | 138 |
| 139 Library API |
| 140 ----------- |
| 141 |
| 142 python-abp can also be used as a library for parsing filter lists. For example |
124 to read a filter list (we use Python 3 syntax here but the API is the same): | 143 to read a filter list (we use Python 3 syntax here but the API is the same): |
125 | 144 |
| 145 .. code-block:: python |
| 146 |
126 from abp.filters import parse_filterlist | 147 from abp.filters import parse_filterlist |
127 | 148 |
128 with open('filterlist.txt') as filterlist: | 149 with open('filterlist.txt') as filterlist: |
129 for line in parse_filterlist(filterlist): | 150 for line in parse_filterlist(filterlist): |
130 print(line) | 151 print(line) |
131 | 152 |
132 If `filterlist.txt` contains a filter list: | 153 |
| 154 If ``filterlist.txt`` contains this filter list:: |
133 | 155 |
134 [Adblock Plus 2.0] | 156 [Adblock Plus 2.0] |
135 ! Title: Example list | 157 ! Title: Example list |
136 | 158 |
137 abc.com,cdf.com##div#ad1 | 159 abc.com,cdf.com##div#ad1 |
138 abc.com/ad$image | 160 abc.com/ad$image |
139 @@/abc\.com/ | 161 @@/abc\.com/ |
140 ... | 162 |
141 | 163 |
142 the output will look something like: | 164 the output will look something like: |
143 | 165 |
| 166 .. code-block:: python |
| 167 |
144 Header(version='Adblock Plus 2.0') | 168 Header(version='Adblock Plus 2.0') |
145 Metadata(key='Title', value='Example list') | 169 Metadata(key='Title', value='Example list') |
146 EmptyLine() | 170 EmptyLine() |
147 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) | 171 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) |
148 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) | 172 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) |
149 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) | 173 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) |
150 ... | |
151 | 174 |
152 `abp.filters` module also exports a lower-level function for parsing individual | |
153 lines of a filter list: `parse_line`. It returns a parsed line object just like | |
154 the items in the iterator returned by `parse_filterlist`. | |
155 | 175 |
156 For further information on the library API use `help()` on `abp.filters` and | 176 The ``abp.filters`` module also exports a lower-level function for parsing |
157 its contents in interactive Python session, read the docstrings or look at the | 177 individual lines of a filter list: ``parse_line``. It returns a parsed line |
158 tests for some usage examples. | 178 object just like the items in the iterator returned by ``parse_filterlist``. |
159 | 179 |
160 ## Testing | 180 For further information on the library API use ``help()`` on ``abp.filters`` and |
| 181 its contents in an interactive Python session, read the docstrings, or look at |
| 182 the tests for some usage examples. |
161 | 183 |
162 Unit tests for `python-abp` are located in the `/tests` directory. | |
163 [Pytest][2] is used for quickly running the tests | |
164 during development. | |
165 [Tox][3] is used for testing in different | |
166 environments (Python 2.7, Python 3.5+ and PyPy) and code quality | |
167 reporting. | |
168 | 184 |
169 In order to execute the tests, first create and activate development | 185 Testing |
170 virtualenv: | 186 ------- |
171 | 187 |
172 $ python setup.py devenv | 188 Unit tests for ``python-abp`` are located in the ``/tests`` directory. `Pytest <http://pytest.org/>`_ |
173 $ . devenv/bin/activate | 189 is used for quickly running the tests during development. `Tox <https://tox.readthedocs.org/>`_ is used for |
| 190 testing in different environments (Python 2.7, Python 3.5+ and PyPy) and code |
| 191 quality reporting. |
174 | 192 |
175 With the development virtualenv activated use pytest for a quick test run: | |
176 | 193 |
177 (devenv) $ pytest tests | 194 Development |
| 195 ----------- |
178 | 196 |
179 and tox for a comprehensive report: | 197 When adding new functionality, add tests for it (preferably first). If some |
180 | 198 code will never be reached on a certain version of Python, it may be exempted |
181 (devenv) $ tox | 199 from coverage tests by adding a comment, e.g. ``# pragma: no py2 cover``. |
182 | |
183 ## Development | |
184 | |
185 When adding new functionality, add tests for it (preferably first). Code | |
186 coverage (as measured by `tox -e qa`) should not decrease and the tests | |
187 should pass in all Tox environments. | |
188 | 200 |
189 All public functions, classes and methods should have docstrings compliant with | 201 All public functions, classes and methods should have docstrings compliant with |
190 [NumPy/SciPy documentation guide][4]. One exception is the constructors of | 202 `NumPy/SciPy documentation guide <https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt>`_. |
191 classes that the user is not expected to instantiate (such as exceptions). | 203 One exception is the constructors of classes that the user is not expected to |
| 204 instantiate (such as exceptions). |
192 | 205 |
193 | 206 |
194 ## Using the library with R | 207 Using the library with R |
| 208 ------------------------ |
195 | 209 |
196 Clone the repo to you local machine. Then create a virtualenv and install | 210 Clone the repo to your local machine. Then create a virtualenv and install |
197 python abp there: | 211 python-abp there:: |
198 | 212 |
199 $ cd python-abp | 213 $ cd python-abp |
200 $ virtualenv env | 214 $ virtualenv env |
201 $ pip install --upgrade . | 215 $ pip install --upgrade . |
202 | |
203 Then import it with `reticulate` in R: | |
204 | |
205 > library(reticulate) | |
206 > use_virtualenv("~/python-abp/env", required=TRUE) | |
207 > abp <- import("abp.filters.rpy") | |
208 | |
209 Now you can use the functions with `abp$functionname`, e.g. | |
210 `abp.line2dict("@@||g.doubleclick.net/pagead/$subdocument,domain=hon30.org")` | |
211 | 216 |
212 | 217 |
213 [1]: https://adblockplus.org/filters#special-comments | 218 Then import it with ``reticulate`` in R: |
214 [2]: http://pytest.org/ | 219 |
215 [3]: https://tox.readthedocs.org/ | 220 .. code-block:: R |
216 [4]: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt | 221 |
217 [5]: https://docs.google.com/document/d/1SoEqaOBZRCfkh1s5Kds5A5RwUC_nqbYYlGH72sbsSgQ/ | 222 > library(reticulate) |
| 223 > use_virtualenv("~/python-abp/env", required=TRUE) |
| 224 > abp <- import("abp.filters.rpy") |
| 225 |
| 226 Now you can use the functions with ``abp$functionname``, e.g. |
| 227 ``abp$line2dict("@@||g.doubleclick.net/pagead/$subdocument,domain=hon30.org")``. |
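Note on the include paragraphs in the "Rendering of filter lists" section: the way a source name given with ``-i`` maps a local include to a file can be sketched as below. This is only an illustration of the rule the README describes, not the code ``flrender`` actually runs. The function name ``resolve_include``, the regular expression, and the ``sources`` dictionary are assumptions made for this example and are not part of the python-abp API.

```python
import posixpath
import re

# Hypothetical pattern for illustration: "%include <target>%" on one line.
INCLUDE_RE = re.compile(r'^%include\s+(.+)%$')


def resolve_include(line, sources):
    """Return the URL or local path that an include instruction points to.

    `sources` maps source names to directories, mirroring repeated
    "-i name=directory" options. This helper is hypothetical.
    """
    match = INCLUDE_RE.match(line.strip())
    if match is None:
        raise ValueError('Not an include instruction: {!r}'.format(line))
    target = match.group(1).strip()

    # http(s) includes are returned unchanged; fetching is out of scope here.
    if target.startswith(('http://', 'https://')):
        return target

    # Local includes have the form "<source name>:<path inside the source>".
    name, _, path = target.partition(':')
    if name not in sources:
        raise KeyError("Unknown source: '{}'".format(name))
    # posixpath keeps the result identical on every platform.
    return posixpath.join(sources[name], path)


if __name__ == '__main__':
    sources = {'easylist': '/home/abc/easylist'}
    line = '%include easylist:easylist/easylist_general_block.txt%'
    print(resolve_include(line, sources))
```

With the ``easylist=/home/abc/easylist`` mapping from the README, the local include resolves to ``/home/abc/easylist/easylist/easylist_general_block.txt``, which matches the path quoted in the text. A missing source name raises an error, loosely mirroring the "Unknown source" report shown in the README.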