| OLD | NEW |
|---|---|
| 1 # python-abp | 1 # python-abp |
| 2 | 2 |
| 3 This repository contains the script that is used for building Adblock Plus | 3 This repository contains a library for working with Adblock Plus filter lists |
| 4 filter lists from the form in which they are authored into the format suitable | 4 and the script that is used for building Adblock Plus filter lists from the |
| 5 for consumption by the adblocking software. | 5 form in which they are authored into the format suitable for consumption by the |
| 6 adblocking software. | |
|
mathias
2017/08/08 12:24:35
For an introduction that is a bit too much. How ab
| |
| 6 | 7 |
| 7 ## Installation | 8 ## Installation |
| 8 | 9 |
| 9 Prerequisites: | 10 Prerequisites: |
| 10 | 11 |
| 11 * Linux, Mac OS X or Windows (any modern Unix should work too), | 12 * Linux, Mac OS X or Windows (any modern Unix should work too), |
| 12 * Python (2.7 or 3.5), | 13 * Python (2.7 or 3.5+), |
| 13 * pip. | 14 * pip. |
| 14 | 15 |
| 15 To install: | 16 To install: |
| 16 | 17 |
| 17 $ pip install -U python-abp | 18 $ pip install -U python-abp |
| 18 | 19 |
| 19 ## Rendering of filter lists | 20 ## Rendering of filter lists |
| 20 | 21 |
| 21 The filter lists are originally authored in relatively smaller parts focused | 22 The filter lists are originally authored in relatively smaller parts focused |
| 22 on a particular type of filters, related to a specific topic or relevant | 23 on a particular type of filters, related to a specific topic or relevant |
| 23 for a particular geographical area. | 24 for a particular geographical area. |
| 24 We call these parts _filter list fragments_ (or just _fragments_) | 25 We call these parts _filter list fragments_ (or just _fragments_) |
| 25 to distinguish them from full filter lists that are | 26 to distinguish them from full filter lists that are |
| 26 consumed by the adblocking software such as Adblock Plus. | 27 consumed by the adblocking software such as Adblock Plus. |
| 27 | 28 |
| 28 Rendering is a process that combines filter list fragments into a filter list. | 29 Rendering is a process that combines filter list fragments into a filter list. |
| 29 It starts with one fragment that can include other ones and so forth. | 30 It starts with one fragment that can include other ones and so forth. |
| 30 The produced filter list is marked with a version, a timestamp and | 31 The produced filter list is marked with a version, a timestamp and |
| 31 a [checksum](https://adblockplus.org/filters#special-comments). | 32 a [checksum][1]. |
| 32 | 33 |
| 33 Python-abp contains a script that can do this called `flrender`: | 34 Python-abp contains a script that can do this called `flrender`: |
| 34 | 35 |
| 35 $ flrender fragment.txt filterlist.txt | 36 $ flrender fragment.txt filterlist.txt |
| 36 | 37 |
| 37 This will take the top level fragment in `fragment.txt`, render it and save into | 38 This will take the top level fragment in `fragment.txt`, render it and save into |
| 38 `filterlist.txt`. | 39 `filterlist.txt`. |
| 39 | 40 |
| 40 Fragments might reference other fragments that should be included into them. | 41 Fragments might reference other fragments that should be included into them. |
| 41 The references come in two forms: http(s) includes and local includes: | 42 The references come in two forms: http(s) includes and local includes: |
| 42 | 43 |
| 43 %include http://www.server.org/dir/list.txt% | 44 %include http://www.server.org/dir/list.txt% |
| 44 %include easylist:easylist/easylist_general_block.txt | 45 %include easylist:easylist/easylist_general_block.txt% |
| 45 | 46 |
| 46 The first instruction contains a URL that will be fetched and inserted at the | 47 The first instruction contains a URL that will be fetched and inserted at the |
| 47 point of reference. | 48 point of reference. |
| 48 The second one contains a path inside easylist repository. | 49 The second one contains a path inside easylist repository. |
| 49 `flrender` needs to be able to find a copy of the repository on the local | 50 `flrender` needs to be able to find a copy of the repository on the local |
| 50 filesystem. We use `-i` option to point it to the right directory: | 51 filesystem. We use `-i` option to point it to the right directory: |
| 51 | 52 |
| 52 $ flrender -i easylist=/home/abc/easylist input.txt output.txt | 53 $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
| 53 | 54 |
| 54 Now the second reference above will be resolved to | 55 Now the second reference above will be resolved to |
| (...skipping 13 matching lines...) | |
| 68 If you don't know all the source names that are needed to render some list, | 69 If you don't know all the source names that are needed to render some list, |
| 69 just run `flrender` and it will report what it's missing: | 70 just run `flrender` and it will report what it's missing: |
| 70 | 71 |
| 71 $ flrender easylist.txt output/easylist.txt | 72 $ flrender easylist.txt output/easylist.txt |
| 72 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener | 73 Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
| 73 al_block.txt' from 'easylist.txt' | 74 al_block.txt' from 'easylist.txt' |
| 74 | 75 |
| 75 You can clone the necessary repositories to a local directory and add `-i` | 76 You can clone the necessary repositories to a local directory and add `-i` |
| 76 options accordingly. | 77 options accordingly. |
| 77 | 78 |
| 79 ## Library API | |
| 80 | |
| 81 Python-abp can also be used as a library for parsing filter lists. For example, | |
| 82 to read a filter list (we use Python 3 syntax here but the API is the same): | |
| 83 | |
| 84 from abp.filters import parse_filterlist | |
| 85 | |
| 86 with open('filterlist.txt') as filterlist: | |
| 87 for line in parse_filterlist(filterlist): | |
| 88 print(line) | |
| 89 | |
| 90 If `filterlist.txt` contains a filter list: | |
| 91 | |
| 92 [Adblock Plus 2.0] | |
| 93 ! Title: Example list | |
| 94 | |
| 95 abc.com,cdf.com##div#ad1 | |
| 96 abc.com/ad$image | |
| 97 @@/abc\.com/ | |
| 98 ... | |
| 99 | |
| 100 the output will look similar to the following: | |
| 101 | |
| 102 Header(version='Adblock Plus 2.0') | |
| 103 Metadata(key='Title', value='Example list') | |
| 104 EmptyLine() | |
| 105 Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) | |
| 106 Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) | |
| 107 Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) | |
| 108 ... | |
| 109 | |
| 110 In general `parse_filterlist` takes an iterable of strings (such as a list or | |
| 111 an open file) and returns an iterable of parsed filter list lines. Each line | |
| 112 will have its `.type` attribute set to a string indicating its type. It will | |
| 113 also have a `.to_string()` method that converts it to a unicode string in the | |
| 114 filter list format (most of the time it's the same as the string from which the | |
| 115 filter was parsed). Further attributes depend on the type of the line. | |
| 116 | |
| 117 **Note:** `parse_filterlist` returns an iterator, not a list, and only consumes | |
| 118 the input lines when its output is iterated over. This allows much more memory | |
| 119 efficient handling of large filter lists, however there are two things to watch | |
| 120 out for: | |
| 121 | |
| 122 **Note:** iteration over parsed lines may throw a `ParseError` exception if a | |
| 123 line cannot be parsed. The exception will contain the information about the | |
| 124 error and the original line that failed parsing. | |
|
mathias
2017/08/08 12:24:35
It is not clear what bits this is about (I assume
Vasily Kuznetsov
2017/08/08 14:31:12
Yeah, we've discussed this. But for now that chang
| |
| 125 | |
| 126 - When you're parsing filters from a file, you need to complete the iteration | |
| 127 before you close the file. | |
| 128 - Once you have iterated over the output of `parse_filterlist`, it will be | |
| 129 consumed and you won't be able to iterate over it again. | |
| 130 | |
| 131 If you find that this is bothering you, you probably want to convert the output | |
|
mathias
2017/08/08 12:24:34
Everything in this section from here on, maybe inc
| |
| 132 of `parse_filterlist` to a list: | |
| 133 | |
| 134 lines_list = list(parse_filterlist(filterlist)) | |
| 135 | |
| 136 This will load the whole file into memory but unless you're dealing with a | |
| 137 gigantic filter list that should not be a problem. | |
| 138 | |
| 139 ### Line types | |
| 140 | |
| 141 As mentioned above, lines of different types have different attributes: | |
| 142 | |
| 143 | type | attributes | | |
|
mathias
2017/08/08 12:24:35
Are you sure this kind of table markup is supporte
Vasily Kuznetsov
2017/08/08 14:31:12
Indeed the table markup was not part of the origin
| |
| 144 |------------|------------------------------------------------------------------------| | |
| 145 | header | `version` - plugin version string | | |
| 146 | emptyline | no options | | |
| 147 | comment | `text` - text of the comment | | |
| 148 | metadata | `key` - name of the metadata field, `value` - value of the field | | |
| 149 | include | `target` - url/path of the file to include | | |
| 150 | filter | `text` - text of the filter, `selector` - what to look for, `action` - what to do with selected items, `options` - filter options | | |
| 151 | |
| 152 #### Filter attributes | |
|
mathias
2017/08/08 12:24:35
This section mentions "Selector" but not ".selecto
| |
| 153 | |
| 154 Selector is a dictionary with two keys: | |
| 155 | |
| 156 | key | meaning | | |
| 157 |--------------|------------------------------------------------------------------| | |
| 158 | type | 'css', 'abp-simple', 'url-pattern', 'url-regexp', 'extended-css' | | |
| 159 | value | the selector itself, the meaning is type-dependent | | |
| 160 | |
| 161 It's preferable to import `SELECTOR_TYPE` namespace from `abp.filters` to refer | |
| 162 to filter types instead of using strings. `SELECTOR_TYPE` contains constants | |
| 163 for each filter type: `SELECTOR_TYPE.CSS`, `SELECTOR_TYPE.ABP_SIMPLE`, | |
| 164 `SELECTOR_TYPE.URL_PATTERN`, `SELECTOR_TYPE.URL_REGEXP` and | |
| 165 `SELECTOR_TYPE.XCSS`. | |
| 166 | |
| 167 Action instructs adblocking software on what should be done with the items | |
| 168 matching the selector: | |
| 169 | |
| 170 | action | meaning | | |
| 171 |--------|------------------------------------------------------------------------| | |
| 172 | block | block http(s) request that matches the selector | | |
| 173 | allow | allow http(s) request that matches the filter (whitelist the resource) | | |
| 174 | hide | hide the DOM element that matches the selector | | |
| 175 | show | show the DOM element that matches the selector (whitelist the element) | | |
| 176 | |
| 177 The action constants are contained in `FILTER_ACTION` namespace, which can also | |
| 178 be imported from `abp.filters` (`FILTER_ACTION.BLOCK`, `FILTER_ACTION.ALLOW`, | |
| 179 etc.) | |
| 180 | |
| 181 Options is a list of tuples consisting of option name and option value. The | |
| 182 option value is `True` or `False` for flags or, for options with a value, it's | |
| 183 a string, list of strings or a list of `(string, boolean)` tuples. See | |
| 184 [documentation on authoring the filter rules][2] for the list of existing | |
| 185 options and their meanings. | |
| 186 | |
| 187 ### Other functions | |
| 188 | |
| 189 `abp.filters` module also exports a lower-level function for parsing individual | |
| 190 lines of a filter list: `parse_line`. It returns a parsed line object just like | |
| 191 the items in the iterator returned by `parse_filterlist`. | |
| 192 | |
| 78 ## Testing | 193 ## Testing |
| 79 | 194 |
| 80 Unit tests for `python-abp` are located in the `/tests` directory. | 195 Unit tests for `python-abp` are located in the `/tests` directory. |
| 81 [Pytest](http://pytest.org/) is used for quickly running the tests | 196 [Pytest][3] is used for quickly running the tests |
| 82 during development. | 197 during development. |
| 83 [Tox](https://tox.readthedocs.org/) is used for testing in different | 198 [Tox][4] is used for testing in different |
| 84 environments (Python 2.7, Python 3.5 and PyPy) and code quality | 199 environments (Python 2.7, Python 3.5+ and PyPy) and code quality |
| 85 reporting. | 200 reporting. |
| 86 | 201 |
| 87 In order to execute the tests, first create and activate development | 202 In order to execute the tests, first create and activate development |
| 88 virtualenv: | 203 virtualenv: |
| 89 | 204 |
| 90 $ python setup.py devenv | 205 $ python setup.py devenv |
| 91 $ . devenv/bin/activate | 206 $ . devenv/bin/activate |
| 92 | 207 |
| 93 With the development virtualenv activated use pytest for a quick test run: | 208 With the development virtualenv activated use pytest for a quick test run: |
| 94 | 209 |
| 95 (devenv) $ py.test tests | 210 (devenv) $ pytest tests |
| 96 | 211 |
| 97 and tox for a comprehensive report: | 212 and tox for a comprehensive report: |
| 98 | 213 |
| 99 (devenv) $ tox | 214 (devenv) $ tox |
| 215 | |
| 216 | |
| 217 [1]: https://adblockplus.org/filters#special-comments | |
| 218 [2]: https://adblockplus.org/filters#options | |
| 219 [3]: http://pytest.org/ | |
| 220 [4]: https://tox.readthedocs.org/ | |
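The two caveats that the new "Library API" text lists for the lazy iterator returned by `parse_filterlist` can be illustrated with a small sketch. It does not import python-abp: `lazy_parse` is a hypothetical stand-in generator that only mimics the lazy behaviour described in the README, and `io.StringIO` stands in for an open filter list file. Both are assumptions made for illustration, not the library's implementation.

```python
import io


def lazy_parse(lines):
    """Hypothetical stand-in for a lazy parser: yields one item per input line."""
    for line in lines:
        yield line.strip()


SAMPLE = "[Adblock Plus 2.0]\n! Title: Example list\nabc.com/ad$image\n"

# Caveat 1: nothing is read until the generator is iterated, so iterating
# after the source has been closed fails.
with io.StringIO(SAMPLE) as source:
    parsed = lazy_parse(source)
try:
    list(parsed)
    closed_error = None
except ValueError as exc:  # reading from a closed file-like object
    closed_error = exc

# Caveat 2: the generator is exhausted after one pass; a second pass is empty.
with io.StringIO(SAMPLE) as source:
    parsed = lazy_parse(source)
    first_pass = list(parsed)
    second_pass = list(parsed)

# Workaround suggested in the README: materialize the output into a list
# while the source is still open.
with io.StringIO(SAMPLE) as source:
    lines_list = list(lazy_parse(source))
```

With a real file and the real parser the pattern is the same: build the list inside the `with` block, then reuse `lines_list` as often as needed.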