| Index: README.md |
| =================================================================== |
| --- a/README.md |
| +++ b/README.md |
| @@ -1,20 +1,21 @@ |
| # python-abp |
| -This repository contains the script that is used for building Adblock Plus |
| -filter lists from the form in which they are authored into the format suitable |
| -for consumption by the adblocking software. |
| +This repository contains a library for working with Adblock Plus filter lists |
| +and the script that is used for building Adblock Plus filter lists from the |
| +form in which they are authored into the format suitable for consumption by the |
| +adblocking software. |
|
mathias
2017/08/08 12:24:35
For an introduction that is a bit too much. How ab
|
| ## Installation |
| Prerequisites: |
| * Linux, Mac OS X or Windows (any modern Unix should work too), |
| -* Python (2.7 or 3.5), |
| +* Python (2.7 or 3.5+), |
| * pip. |
| To install: |
| $ pip install -U python-abp |
| ## Rendering of filter lists |
| @@ -23,30 +24,30 @@ |
| for particular geographical area. |
| We call these parts _filter list fragments_ (or just _fragments_) |
| to distinguish them from full filter lists that are |
| consumed by the adblocking software such as Adblock Plus. |
| Rendering is a process that combines filter list fragments into a filter list. |
| It starts with one fragment that can include other ones and so forth. |
| The produced filter list is marked with a version, a timestamp and |
| -a [checksum](https://adblockplus.org/filters#special-comments). |
| +a [checksum][1]. |
| Python-abp contains a script that can do this called `flrender`: |
| $ flrender fragment.txt filterlist.txt |
| This will take the top level fragment in `fragment.txt`, render it and save into |
| `filterlist.txt`. |
| Fragments might reference other fragments that should be included into them. |
| The references come in two forms: http(s) includes and local includes: |
| %include http://www.server.org/dir/list.txt% |
| - %include easylist:easylist/easylist_general_block.txt |
| + %include easylist:easylist/easylist_general_block.txt% |
| The first instruction contains a URL that will be fetched and inserted at the |
| point of reference. |
| The second one contains a path inside the easylist repository. |
| `flrender` needs to be able to find a copy of the repository on the local |
| filesystem. We use the `-i` option to point it to the right directory: |
| $ flrender -i easylist=/home/abc/easylist input.txt output.txt |
| @@ -70,30 +71,150 @@ |
| $ flrender easylist.txt output/easylist.txt |
| Unknown source: 'easylist' when including 'easylist:easylist/easylist_gener |
| al_block.txt' from 'easylist.txt' |
| You can clone the necessary repositories to a local directory and add `-i` |
| options accordingly. |
| +## Library API |
| + |
| +Python-abp can also be used as a library for parsing filter lists. For example, |
| +to read a filter list (we use Python 3 syntax here, but the API is the same in |
| +Python 2.7): |
| + |
| + from abp.filters import parse_filterlist |
| + |
| + with open('filterlist.txt') as filterlist: |
| + for line in parse_filterlist(filterlist): |
| + print(line) |
| + |
| +If `filterlist.txt` contains a filter list: |
| + |
| + [Adblock Plus 2.0] |
| + ! Title: Example list |
| + |
| + abc.com,cdf.com##div#ad1 |
| + abc.com/ad$image |
| + @@/abc\.com/ |
| + ... |
| + |
| +the output will look similar to the following: |
| + |
| + Header(version='Adblock Plus 2.0') |
| + Metadata(key='Title', value='Example list') |
| + EmptyLine() |
| + Filter(text='abc.com,cdf.com##div#ad1', selector={'type': 'css', 'value': 'div#ad1'}, action='hide', options=[('domain', [('abc.com', True), ('cdf.com', True)])]) |
| + Filter(text='abc.com/ad$image', selector={'type': 'url-pattern', 'value': 'abc.com/ad'}, action='block', options=[('image', True)]) |
| + Filter(text='@@/abc\\.com/', selector={'type': 'url-regexp', 'value': 'abc\\.com'}, action='allow', options=[]) |
| + ... |
| + |
| +In general `parse_filterlist` takes an iterable of strings (such as a list or |
| +an open file) and returns an iterable of parsed filter list lines. Each line |
| +will have its `.type` attribute set to a string indicating its type. It will |
| +also have a `.to_string()` method that converts it to a unicode string in the |
| +filter list format (most of the time it's the same as the string from which the |
| +filter was parsed). Further attributes depend on the type of the line. |
| + |
| +**Note:** `parse_filterlist` returns an iterator, not a list, and only consumes |
| +the input lines when its output is iterated over. This allows much more |
| +memory-efficient handling of large filter lists; however, there are two things |
| +to watch out for: |
| + |
| +**Note:** iteration over parsed lines may raise a `ParseError` exception if a |
| +line cannot be parsed. The exception contains information about the error and |
| +the original line that failed parsing. |
|
mathias
2017/08/08 12:24:35
It is not clear what bits this is about (I assume
Vasily Kuznetsov
2017/08/08 14:31:12
Yeah, we've discussed this. But for now that chang
|
| + |
| +- When you're parsing filters from a file, you need to complete the iteration |
| + before you close the file. |
| +- After you iterate over the output of `parse_filterlist` once, it is |
| + consumed and you won't be able to iterate over it again. |
| + |
| +If you find that this is bothering you, you probably want to convert the output |
|
mathias
2017/08/08 12:24:34
Everything in this section from here on, maybe inc
|
| +of `parse_filterlist` to a list: |
| + |
| + lines_list = list(parse_filterlist(filterlist)) |
| + |
| +This will load the whole file into memory, but unless you're dealing with a |
| +gigantic filter list, that should not be a problem. |
| + |
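The pitfalls of a lazy iterator can be reproduced without the library. The sketch below uses `fake_parse`, a hypothetical stand-in generator that is assumed to behave like `parse_filterlist` only in being lazy; it does no real parsing.

```python
import io


def fake_parse(lines):
    """Hypothetical stand-in for parse_filterlist: lazily yields stripped lines."""
    for line in lines:
        yield line.strip()


source = io.StringIO("[Adblock Plus 2.0]\n! Title: Example list\nabc.com/ad$image\n")

parsed = fake_parse(source)
first_pass = list(parsed)   # consumes the iterator
second_pass = list(parsed)  # nothing is left the second time

# Converting to a list up front gives a reusable sequence.
source.seek(0)
lines_list = list(fake_parse(source))

# Closing the input before iterating makes the first read fail.
closed_source = io.StringIO("abc.com/ad$image\n")
lazy = fake_parse(closed_source)
closed_source.close()
try:
    next(lazy)
    failed_after_close = False
except ValueError:
    failed_after_close = True
```

With the real library the same pattern should apply: build the list while the file is still open, then work with the list.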
| +### Line types |
| + |
| +As mentioned above, lines of different types have different attributes: |
| + |
| +| type | attributes | |
|
mathias
2017/08/08 12:24:35
Are you sure this kind of table markup is supporte
Vasily Kuznetsov
2017/08/08 14:31:12
Indeed the table markup was not part of the origin
|
| +|------------|------------------------------------------------------------------------| |
| +| header | `version` - plugin version string | |
| +| emptyline | no attributes | |
| +| comment | `text` - text of the comment | |
| +| metadata | `key` - name of the metadata field, `value` - value of the field | |
| +| include | `target` - url/path of the file to include | |
| +| filter | `text` - text of the filter, `selector` - what to look for, `action` - what to do with selected items, `options` - filter options | |
| + |
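As an illustration of dispatching on `.type`, the sketch below collects metadata fields into a dictionary. The namedtuples are hypothetical stand-ins that carry only the attributes listed in the table; they are not the library's own classes.

```python
from collections import namedtuple

# Hypothetical stand-ins for parsed line objects (assumed shape, not abp classes).
Comment = namedtuple('Comment', 'type text')
Metadata = namedtuple('Metadata', 'type key value')
EmptyLine = namedtuple('EmptyLine', 'type')

lines = [
    Comment('comment', 'An ordinary comment'),
    Metadata('metadata', 'Title', 'Example list'),
    EmptyLine('emptyline'),
]


def collect_metadata(parsed_lines):
    """Return a dict of metadata fields, ignoring every other line type."""
    return {line.key: line.value
            for line in parsed_lines
            if line.type == 'metadata'}


metadata = collect_metadata(lines)
```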
| +#### Filter attributes |
|
mathias
2017/08/08 12:24:35
This section mentions "Selector" but not ".selecto
|
| + |
| +The `selector` attribute is a dictionary with two keys: |
| + |
| +| key | meaning | |
| +|--------------|------------------------------------------------------------------| |
| +| type | 'css', 'abp-simple', 'url-pattern', 'url-regexp', 'extended-css' | |
| +| value | the selector itself, the meaning is type-dependent | |
| + |
| +It's preferable to import the `SELECTOR_TYPE` namespace from `abp.filters` to |
| +refer to selector types instead of using strings. `SELECTOR_TYPE` contains |
| +constants for each selector type: `SELECTOR_TYPE.CSS`, |
| +`SELECTOR_TYPE.ABP_SIMPLE`, `SELECTOR_TYPE.URL_PATTERN`, |
| +`SELECTOR_TYPE.URL_REGEXP` and `SELECTOR_TYPE.XCSS`. |
| + |
| +The `action` attribute tells the adblocking software what should be done with |
| +the items matching the selector: |
| + |
| +| action | meaning | |
| +|--------|------------------------------------------------------------------------| |
| +| block | block http(s) request that matches the selector | |
| +| allow | allow http(s) request that matches the selector (whitelist the resource) | |
| +| hide | hide the DOM element that matches the selector | |
| +| show | show the DOM element that matches the selector (whitelist the element) | |
| + |
| +The action constants are contained in the `FILTER_ACTION` namespace, which can |
| +also be imported from `abp.filters` (`FILTER_ACTION.BLOCK`, |
| +`FILTER_ACTION.ALLOW`, etc.). |
| + |
| +The `options` attribute is a list of tuples, each consisting of an option name |
| +and an option value. The value is `True` or `False` for flags; for options |
| +that take a value, it is a string, a list of strings or a list of |
| +`(string, boolean)` tuples. See the |
| +[documentation on authoring the filter rules][2] for the list of existing |
| +options and their meanings. |
| + |
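Because options are a list of tuples rather than a dictionary, looking one up takes a short loop. The sketch below uses option values copied from the sample output earlier in this README; `option_value` is a hypothetical helper, and reading the boolean in each domain tuple as "filter applies on this domain" is an assumption.

```python
# Option lists copied from the sample output shown earlier.
hide_options = [('domain', [('abc.com', True), ('cdf.com', True)])]
block_options = [('image', True)]


def option_value(options, name, default=None):
    """Return the value of the first option called `name`, or `default`."""
    for key, value in options:
        if key == name:
            return value
    return default


# Assumption: True marks a domain on which the filter applies.
domains = option_value(hide_options, 'domain', [])
included_domains = [domain for domain, enabled in domains if enabled]

is_image_filter = option_value(block_options, 'image', False)
```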
| +### Other functions |
| + |
| +The `abp.filters` module also exports a lower-level function for parsing |
| +individual lines of a filter list: `parse_line`. It returns a parsed line |
| +object just like the items in the iterator returned by `parse_filterlist`. |
| + |
| ## Testing |
| Unit tests for `python-abp` are located in the `/tests` directory. |
| -[Pytest](http://pytest.org/) is used for quickly running the tests |
| +[Pytest][3] is used for quickly running the tests |
| during development. |
| -[Tox](https://tox.readthedocs.org/) is used for testing in different |
| -environments (Python 2.7, Python 3.5 and PyPy) and code quality |
| +[Tox][4] is used for testing in different |
| +environments (Python 2.7, Python 3.5+ and PyPy) and code quality |
| reporting. |
| In order to execute the tests, first create and activate a development |
| virtualenv: |
| $ python setup.py devenv |
| $ . devenv/bin/activate |
| With the development virtualenv activated use pytest for a quick test run: |
| - (devenv) $ py.test tests |
| + (devenv) $ pytest tests |
| and tox for a comprehensive report: |
| (devenv) $ tox |
| + |
| + |
| + [1]: https://adblockplus.org/filters#special-comments |
| + [2]: https://adblockplus.org/filters#options |
| + [3]: http://pytest.org/ |
| + [4]: https://tox.readthedocs.org/ |