| OLD | NEW |
| 1 # python-abp | 1 # python-abp |
| 2 | 2 |
| 3 This repository contains a library for working with Adblock Plus filter lists | 3 This repository contains a library for working with Adblock Plus filter lists, |
| 4 and the script that is used for building Adblock Plus filter lists from the | 4 a script for rendering diffs between filter lists, and the script that is used |
| 5 form in which they are authored into the format suitable for consumption by the | 5 for building Adblock Plus filter lists from the form in which they are authored |
| 6 adblocking software. | 6 into the format suitable for consumption by the adblocking software (aka |
| 7 rendering). |
| 7 | 8 |
| 8 ## Installation | 9 ## Installation |
| 9 | 10 |
| 10 Prerequisites: | 11 Prerequisites: |
| 11 | 12 |
| 12 * Linux, Mac OS X or Windows (any modern Unix should work too), | 13 * Linux, Mac OS X or Windows (any modern Unix should work too), |
| 13 * Python (2.7 or 3.5+), | 14 * Python (2.7 or 3.5+), |
| 14 * pip. | 15 * pip. |
| 15 | 16 |
| 16 To install: | 17 To install: |
| 17 | 18 |
| 18 $ pip install -U python-abp | 19 $ pip install --upgrade python-abp |
| 19 | 20 |
| 20 ## Rendering of filter lists | 21 ## Rendering of filter lists |
| 21 | 22 |
| 22 The filter lists are originally authored in relatively smaller parts focused | 23 The filter lists are originally authored in relatively smaller parts focused |
| 23 on a particular type of filters, related to a specific topic or relevant | 24 on particular types of filters, related to a specific topic or relevant for a |
| 24 for particular geographical area. | 25 particular geographical area. |
| 25 We call these parts _filter list fragments_ (or just _fragments_) | 26 We call these parts _filter list fragments_ (or just _fragments_) to |
| 26 to distinguish them from full filter lists that are | 27 distinguish them from full filter lists that are consumed by the adblocking |
| 27 consumed by the adblocking software such as Adblock Plus. | 28 software such as Adblock Plus. |
| 28 | 29 |
| 29 Rendering is a process that combines filter list fragments into a filter list. | 30 Rendering is a process that combines filter list fragments into a filter list. |
| 30 It starts with one fragment that can include other ones and so forth. | 31 It starts with one fragment that can include other ones and so forth. |
| 31 The produced filter list is marked with a [version and a timestamp][1]. | 32 The produced filter list is marked with a [version and a timestamp][1]. |
| 32 | 33 |
| 33 Python-abp contains a script that can do this called `flrender`: | 34 Python-abp contains a script that can do this called `flrender`: |
| 34 | 35 |
| 35 $ flrender fragment.txt filterlist.txt | 36 $ flrender fragment.txt filterlist.txt |
| 36 | 37 |
| 37 This will take the top level fragment in `fragment.txt`, render it and save into | 38 This will take the top level fragment in `fragment.txt`, render it and save it |
| 38 `filterlist.txt`. | 39 into `filterlist.txt`. |
| 39 | 40 |
| 40 The `flrender` script can also be used by only specifying `fragment.txt`: | 41 The `flrender` script can also be used by only specifying `fragment.txt`: |
| 41 | 42 |
| 42 $flrender fragment.txt | 43 $ flrender fragment.txt |
| 43 | 44 |
| 44 in which case the rendering result will be sent to `stdout`. Moreover, when | 45 in which case the rendering result will be sent to `stdout`. Moreover, when |
| 45 it's run with no positional arguments: | 46 it's run with no positional arguments: |
| 46 | 47 |
| 47 $flrender | 48 $ flrender |
| 48 | 49 |
| 49 it will read from `stdin` and send the results to `stdout`. | 50 it will read from `stdin` and send the results to `stdout`. |
| 50 | 51 |
| 51 Fragments might reference other fragments that should be included into them. | 52 Fragments might reference other fragments that should be included into them. |
| 52 The references come in two forms: http(s) includes and local includes: | 53 The references come in two forms: http(s) includes and local includes: |
| 53 | 54 |
| 54 %include http://www.server.org/dir/list.txt% | 55 %include http://www.server.org/dir/list.txt% |
| 55 %include easylist:easylist/easylist_general_block.txt% | 56 %include easylist:easylist/easylist_general_block.txt% |
| 56 | 57 |
| 57 The first instruction contains a URL that will be fetched and inserted at the | 58 The http include contains a URL that will be fetched and inserted at the point |
| 58 point of reference. | 59 of reference. |
| 59 The second one contains a path inside easylist repository. | 60 The local include contains a path inside the easylist repository. |
| 60 `flrender` needs to be able to find a copy of the repository on the local | 61 `flrender` needs to be able to find a copy of the repository on the local |
| 61 filesystem. We use the `-i` option to point it to the right directory: | 62 filesystem. We use the `-i` option to point it to the right directory: |
| 62 | 63 |
| 63 $ flrender -i easylist=/home/abc/easylist input.txt output.txt | 64 $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
| 64 | 65 |
| 65 Now the second reference above will be resolved to | 66 Now the local include referenced above will be resolved to: |
| 66 `/home/abc/easylist/easylist/easylist_general_block.txt` and the fragment will | 67 `/home/abc/easylist/easylist/easylist_general_block.txt` |
| 67 be loaded from this file. | 68 and the fragment will be loaded from this file. |
| 68 | 69 |
| 69 Directories that contain filter list fragments that are used during rendering | 70 Directories that contain filter list fragments that are used during rendering |
| 70 are called sources. | 71 are called sources. |
| 71 They are normally working copies of the repositories that contain filter list | 72 They are normally working copies of the repositories that contain filter list |
| 72 fragments. | 73 fragments. |
| 73 Each source is identified by a name: that's the part that comes before ":" | 74 Each source is identified by a name: that's the part that comes before ":" in |
| 74 in the include instruction and it should be the same as what comes before "=" | 75 the include instruction and it should be the same as what comes before "=" in |
| 75 in the `-i` option. | 76 the `-i` option. |
| 76 | 77 |
| 77 Commonly used sources have generally accepted names. For example the main | 78 Commonly used sources have generally accepted names. For example the main |
| 78 EasyList repository is referred to as `easylist`. | 79 EasyList repository is referred to as `easylist`. |
| 79 If you don't know all the source names that are needed to render some list, | 80 If you don't know all the source names that are needed to render some list, |
| 80 just run `flrender` and it will report what it's missing: | 81 just run `flrender` and it will report what it's missing: |
| 81 | 82 |
| 82 $ flrender easylist.txt output/easylist.txt | 83 $ flrender easylist.txt output/easylist.txt |
| 83 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener | 84 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
| 84 al_block.txt' from 'easylist.txt' | 85 al_block.txt' from 'easylist.txt' |
| 85 | 86 |
| 86 You can clone the necessary repositories to a local directory and add `-i` | 87 You can clone the necessary repositories to a local directory and add `-i` |
| 87 options accordingly. | 88 options accordingly. |
| 88 | 89 |
| 89 ## Rendering diffs | 90 ## Rendering diffs |
| 90 | 91 |
| 91 A diff allows a client running ad blocking software such as Adblock Plus to | 92 A diff allows a client running ad blocking software such as Adblock Plus to |
| 92 update the filter lists incrementally, instead of downloading a new copy of a | 93 update the filter lists incrementally, instead of downloading a new copy of a |
| 93 full list during each update. This is meant to lessen the amount of resources | 94 full list during each update. This is meant to lessen the amount of resources |
| 94 used when updating filter lists (e.g. network data, memory usage, battery | 95 used when updating filter lists (e.g. network data, memory usage, battery |
| 95 consumption, etc.), allowing clients to update their lists more frequently using | 96 consumption, etc.), allowing clients to update their lists more frequently |
| 96 less resources. | 97 using less resources. |
| 97 | 98 |
| 98 Python-abp contains a script called `fldiff` that will find the diff between the | 99 python-abp contains a script called `fldiff` that will find the diff between |
| 99 latest filter list, and any number of previous filter lists: | 100 the latest filter list, and any number of previous filter lists: |
| 100 | 101 |
| 101 $ fldiff -o diffs/easylist easylist.txt archive/* | 102 $ fldiff -o diffs/easylist/ easylist.txt archive/* |
| 102 | 103 |
| 103 where `-o diffs/easylist` is the (optional) output directory where the diffs | 104 where `-o diffs/easylist/` is the (optional) output directory where the diffs |
| 104 should be written, `easylist.txt` is the most recent version of the filter list, | 105 should be written, `easylist.txt` is the most recent version of the filter |
| 105 and `archive/*` is the directory where all the archived filter lists are. When | 106 list, and `archive/*` is the directory where all the archived filter lists are. |
| 106 called like this, the shell should automatically expand the `archive/*` | 107 When called like this, the shell should automatically expand the `archive/*` |
| 107 directory, giving the script each of the filenames separately. | 108 directory, giving the script each of the filenames separately. |
| 108 | 109 |
| 109 In the above example, the output of each archived `list[version].txt` will be | 110 In the above example, the output of each archived `list[version].txt` will be |
| 110 written to `diffs/diff[version].txt`. If the output argument is omitted, the | 111 written to `diffs/diff[version].txt`. If the output argument is omitted, the |
| 111 diffs will be written to the current directory. | 112 diffs will be written to the current directory. |
| 112 | 113 |
| 113 The script produces three types of lines, as specified in the [technical | 114 The script produces three types of lines, as specified in the [technical |
| 114 specification][5]: | 115 specification][5]: |
| 115 | 116 |
| 116 * Special comments of the form `! <name>:[ <value>]` | 117 * Special comments of the form `! <name>:[ <value>]` |
| 117 * Added filters of the form `+ <filter-text>` | 118 * Added filters of the form `+ <filter-text>` |
| 118 * Removed filters of the form `- <filter-text>` | 119 * Removed filters of the form `- <filter-text>` |
| 119 | 120 |
| 120 | |
| 121 ## Library API | 121 ## Library API |
| 122 | 122 |
| 123 Python-abp can also be used as a library for parsing filter lists. For example | 123 python-abp can also be used as a library for parsing filter lists. For example |
| 124 to read a filter list (we use Python 3 syntax here but the API is the same): | 124 to read a filter list (we use Python 3 syntax here but the API is the same): |
| 125 | 125 |
| 126 from abp.filters import parse_filterlist | 126 from abp.filters import parse_filterlist |
| 127 | 127 |
| 128 with open('filterlist.txt') as filterlist: | 128 with open('filterlist.txt') as filterlist: |
| 129 for line in parse_filterlist(filterlist): | 129 for line in parse_filterlist(filterlist): |
| 130 print(line) | 130 print(line) |
| 131 | 131 |
| 132 If `filterlist.txt` contains a filter list: | 132 If `filterlist.txt` contains this filter list: |
| 133 | 133 |
| 134 [Adblock Plus 2.0] | 134 [Adblock Plus 2.0] |
| 135 ! Title: Example list | 135 ! Title: Example list |
| 136 | 136 |
| 137 abc.com,cdf.com##div#ad1 | 137 abc.com,cdf.com##div#ad1 |
| 138 abc.com/ad$image | 138 abc.com/ad$image |
| 139 @@/abc\.com/ | 139 @@/abc\.com/ |
| 140 ... | |
| 141 | 140 |
| 142 the output will look something like: | 141 the output will look something like: |
| 143 | 142 |
| 144 Header(version='Adblock Plus 2.0') | 143 Header(version='Adblock Plus 2.0') |
| 145 Metadata(key='Title', value='Example list') | 144 Metadata(key='Title', value='Example list') |
| 146 EmptyLine() | 145 EmptyLine() |
| 147 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) | 146 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) |
| 148 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) | 147 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) |
| 149 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) | 148 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) |
| 150 ... | |
| 151 | 149 |
| 152 `abp.filters` module also exports a lower-level function for parsing individual | 150 The `abp.filters` module also exports a lower-level function for parsing |
| 153 lines of a filter list: `parse_line`. It returns a parsed line object just like | 151 individual lines of a filter list: `parse_line`. It returns a parsed line |
| 154 the items in the iterator returned by `parse_filterlist`. | 152 object just like the items in the iterator returned by `parse_filterlist`. |
| 155 | 153 |
| 156 For further information on the library API use `help()` on `abp.filters` and | 154 For further information on the library API use `help()` on `abp.filters` and |
| 157 its contents in interactive Python session, read the docstrings or look at the | 155 its contents in an interactive Python session, read the docstrings, or look at |
| 158 tests for some usage examples. | 156 the tests for some usage examples. |
| 159 | 157 |
| 160 ## Testing | 158 ## Testing |
| 161 | 159 |
| 162 Unit tests for `python-abp` are located in the `/tests` directory. | 160 Unit tests for `python-abp` are located in the `/tests` directory. [Pytest][2] |
| 163 [Pytest][2] is used for quickly running the tests | 161 is used for quickly running the tests during development. [Tox][3] is used for |
| 164 during development. | 162 testing in different environments (Python 2.7, Python 3.5+ and PyPy) and code |
| 165 [Tox][3] is used for testing in different | 163 quality reporting. |
| 166 environments (Python 2.7, Python 3.5+ and PyPy) and code quality | |
| 167 reporting. | |
| 168 | 164 |
| 169 In order to execute the tests, first create and activate development | 165 In order to execute the tests, first create and activate a development |
| 170 virtualenv: | 166 virtualenv: |
| 171 | 167 |
| 172 $ python setup.py devenv | 168 $ python setup.py devenv |
| 173 $ . devenv/bin/activate | 169 $ . devenv/bin/activate |
| 174 | 170 |
| 175 With the development virtualenv activated use pytest for a quick test run: | 171 With the development virtualenv activated use pytest for a quick test run: |
| 176 | 172 |
| 177 (devenv) $ pytest tests | 173 (devenv) $ pytest tests |
| 178 | 174 |
| 179 and tox for a comprehensive report: | 175 and tox for a comprehensive report: |
| 180 | 176 |
| 181 (devenv) $ tox | 177 (devenv) $ tox |
| 182 | 178 |
| 183 ## Development | 179 ## Development |
| 184 | 180 |
| 185 When adding new functionality, add tests for it (preferably first). Code | 181 When adding new functionality, add tests for it (preferably first). Code |
| 186 coverage (as measured by `tox -e qa`) should not decrease and the tests | 182 coverage (as measured by `tox -e coverage2` and `tox -e coverage3`) should not |
| 187 should pass in all Tox environments. | 183 decrease and the tests should pass in all tox environments. |
| 188 | 184 |
| 189 All public functions, classes and methods should have docstrings compliant with | 185 All public functions, classes and methods should have docstrings compliant with |
| 190 [NumPy/SciPy documentation guide][4]. One exception is the constructors of | 186 [NumPy/SciPy documentation guide][4]. One exception is the constructors of |
| 191 classes that the user is not expected to instantiate (such as exceptions). | 187 classes that the user is not expected to instantiate (such as exceptions). |
| 192 | 188 |
| 193 | |
| 194 ## Using the library with R | 189 ## Using the library with R |
| 195 | 190 |
| 196 Clone the repo to your local machine. Then create a virtualenv and install | 191 Clone the repo to your local machine. Then create a virtualenv and install |
| 197 python-abp there: | 192 python-abp there: |
| 198 | 193 |
| 199 $ cd python-abp | 194 $ cd python-abp |
| 200 $ virtualenv env | 195 $ virtualenv env |
| 201 $ pip install --upgrade . | 196 $ pip install --upgrade . |
| 202 | 197 |
| 203 Then import it with `reticulate` in R: | 198 Then import it with `reticulate` in R: |
| 204 | 199 |
| 205 > library(reticulate) | 200 > library(reticulate) |
| 206 > use_virtualenv("~/python-abp/env", required=TRUE) | 201 > use_virtualenv("~/python-abp/env", required=TRUE) |
| 207 > abp <- import("abp.filters.rpy") | 202 > abp <- import("abp.filters.rpy") |
| 208 | 203 |
| 209 Now you can use the functions with `abp$functionname`, e.g. | 204 Now you can use the functions with `abp$functionname`, e.g. |
| 210 `abp$line2dict("@@||g.doubleclick.net/pagead/$subdocument,domain=hon30.org")` | 205 `abp$line2dict("@@||g.doubleclick.net/pagead/$subdocument,domain=hon30.org")` |
| 211 | 206 |
| 212 | 207 |
| 213 [1]: https://adblockplus.org/filters#special-comments | 208 [1]: https://adblockplus.org/filters#special-comments |
| 214 [2]: http://pytest.org/ | 209 [2]: http://pytest.org/ |
| 215 [3]: https://tox.readthedocs.org/ | 210 [3]: https://tox.readthedocs.org/ |
| 216 [4]: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt | 211 [4]: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt |
| 217 [5]: https://docs.google.com/document/d/1SoEqaOBZRCfkh1s5Kds5A5RwUC_nqbYYlGH72sbsSgQ/ | 212 [5]: https://docs.google.com/document/d/1SoEqaOBZRCfkh1s5Kds5A5RwUC_nqbYYlGH72sbsSgQ/ |
| OLD | NEW |
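
The README text in this diff describes how a local include such as `easylist:easylist/easylist_general_block.txt` is resolved against a source registered with `-i easylist=/home/abc/easylist`. The sketch below illustrates that mapping only. It is not python-abp's implementation: `parse_sources` and `resolve_include` are hypothetical helper names introduced here, and the error wording is merely modelled on the `Unknown source` message quoted in the README.

```python
import posixpath
import re

# Hypothetical helpers for illustration; they are NOT part of python-abp.
# They only mirror the name-to-directory mapping that the README describes
# for the `-i NAME=PATH` option and the `%include ...%` instruction.
INCLUDE_RE = re.compile(r'^%include\s+(.+)%$')


def parse_sources(options):
    """Turn a list like ['easylist=/home/abc/easylist'] into a dict."""
    sources = {}
    for option in options:
        name, sep, path = option.partition('=')
        if not sep or not name or not path:
            raise ValueError('Expected NAME=PATH, got: {!r}'.format(option))
        sources[name] = path
    return sources


def resolve_include(line, sources):
    """Return ('http', url) or ('local', path) for one include line."""
    match = INCLUDE_RE.match(line.strip())
    if match is None:
        raise ValueError('Not an include instruction: {!r}'.format(line))
    target = match.group(1).strip()

    # http(s) includes are fetched as-is; no source lookup is involved.
    if target.startswith(('http://', 'https://')):
        return 'http', target

    # Local includes have the form NAME:PATH, where NAME must match the
    # part before "=" in one of the -i options.
    name, sep, relative = target.partition(':')
    if not sep:
        raise ValueError('Expected NAME:PATH, got: {!r}'.format(target))
    if name not in sources:
        raise KeyError("Unknown source: '{}'".format(name))

    # posixpath keeps the example output identical on every platform.
    return 'local', posixpath.join(sources[name], relative)
```

With `sources = parse_sources(['easylist=/home/abc/easylist'])`, the local include from the README resolves to `/home/abc/easylist/easylist/easylist_general_block.txt`, matching the path given in the text, while an include naming a source that was not registered raises `KeyError`.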