Issue 29680684: [$csp1 adblockpluscore] Issue 6329 - Allow whitespace in filter option values

Issue 29680684: [$csp1 adblockpluscore] Issue 6329 - Allow whitespace in filter option values (Closed)

Created:
Jan. 26, 2018, 5:32 p.m. by kzar

Modified:
March 7, 2018, 5:59 a.m.

Reviewers:
Sebastian Noack, Manish Jethani

CC:
sergei

Visibility:
Public.

Description

Issue 6329 - Allow whitespace in filter option values This change is necessary for the addition of the $csp filter option since whitespace in Content Security Policies is significant. I've skipped (most of) the normalisation step, instead doing it at the point the filter is parsed. My key insight there was that we always seem to call Filter.fromText directly afterward the normalise function anyway. This makes the code a little harder to follow perhaps, but has the benefit that we don't wastefully parse element hiding filters during normalisation, only to repeat much of that work again in Filters.fromText. Note that normalisation is not identical as before for RegExpFilters, previously something like "example $ domain=foo.com" might have worked, but no longer.

Patch Set 1 #

Total comments: 22

Patch Set 2 : Addressed some of Manish's initial feedback #

Created: Feb. 13, 2018, 11:55 a.m.

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+66 lines, -50 lines)			Patch
M	lib/filterClasses.js	View	1	6 chunks	+37 lines, -49 lines	0 comments	Download
M	lib/synchronizer.js	View		1 chunk	+1 line, -1 line	0 comments	Download
M	test/filterClasses.js	View	1	1 chunk	+28 lines, -0 lines	0 comments	Download

Messages

Total messages: 9

Expand All Messages | Collapse All Messages

Manish Jethani

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode104 lib/filterClasses.js:104: * basic normalisation and parsing where possible before deferring ...

Feb. 5, 2018, 4:09 p.m. (2018-02-05 16:09:58 UTC) #2

Manish Jethani

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode721 lib/filterClasses.js:721: let options = match[1].replace(/\s/g, ""); On 2018/02/05 16:09:58, Manish ...

Feb. 6, 2018, 5:14 a.m. (2018-02-06 05:14:39 UTC) #3

kzar

Patch Set 2 : Addressed some of Manish's initial feedback https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode104 ...

Feb. 13, 2018, 11:56 a.m. (2018-02-13 11:56:47 UTC) #4

Patch Set 2 : Addressed some of Manish's initial feedback

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js
File lib/filterClasses.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:104: * basic normalisation and parsing where possible
before deferring to the right
On 2018/02/05 16:09:57, Manish Jethani wrote:
> I noticed that the documentation is now using the British spelling of
> "normalization". I don't have a preference either way other than maybe for
> consistency it should use the American spelling.

Kind of ironic heh, Done.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text);
On 2018/02/05 16:09:57, Manish Jethani wrote:
> This means that two otherwise identical filters (post normalization) could be
> treated as different filters after this change.

Any ideas? This whole change was a pain, perhaps there is a better approach
which I missed.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:711: let origText;
On 2018/02/05 16:09:58, Manish Jethani wrote:
> Why not initialize origText to null for consistency?

Done.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:721: let options = match[1].replace(/\s/g, "");
On 2018/02/06 05:14:39, Manish Jethani wrote:
> On 2018/02/05 16:09:58, Manish Jethani wrote:
> > I don't get this part. Doesn't this mean that "$csp=script-src 'none'" will
> > become "$csp=script-src'none'"?
> 
> I see, so the other patch builds on this. Got it.

Sorry I should have left a comment on the review here to explain, but yep you
got it.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:731: option = option.substring(0, separatorIndex);
On 2018/02/05 16:09:58, Manish Jethani wrote:
> Any particular reason this is now using String.substring?

Well since I noticed that `substr` was being used mistakenly all this time, but
happened to work since the offset was always 0. Quite misleading.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:779: return new InvalidFilter(origText,
"filter_invalid_regexp");
On 2018/02/05 16:09:58, Manish Jethani wrote:
> origText is no longer the original text so it has basically lost its meaning.
On
> the other hand it's not very useful either. Maybe we can get rid of the extra
> variable?

The point of `origText` is to show a more useful warning to the user when they
have added an invalid filter. It's still useful for that purpose, though I agree
the name sucks. I'm open to suggestions there.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js
File lib/synchronizer.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/synchronizer.js...
lib/synchronizer.js:283: line = Filter.stripJunk(line);
On 2018/02/05 16:09:58, Manish Jethani wrote:
> Since most of the code has moved out anyway, why not just move the rest of it
> into Filter.fromText and get rid of Filter.stripJunk?
> 
> On the other hand, it seems like there might have been a good reason for
> separating out the normalization from the rest of the parsing, since
> Filter.normalize is called from only two places whereas Filter.fromText is
> called from multiple places.
> 
> Anyway I'm in favor of simplifying this further and just keep everything in
> Filter.fromText unless it's a performance concern.

Yea, perhaps we should do that if we continue with this approach. I wonder if
there's a better way, I quite like normalisation happening in separate step, but
it's a pain to allow the whitespace we need that way.

Manish Jethani

Sorry about the delay. https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode111 lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On ...

Feb. 20, 2018, 4 p.m. (2018-02-20 16:00:58 UTC) #5

Sorry about the delay.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js
File lib/filterClasses.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text);
On 2018/02/13 11:56:46, kzar wrote:
> On 2018/02/05 16:09:57, Manish Jethani wrote:
> > This means that two otherwise identical filters (post normalization) could
be
> > treated as different filters after this change.
> 
> Any ideas? This whole change was a pain, perhaps there is a better approach
> which I missed.

How about leaving the functions as they are, but changing the implementations a
little bit? For example, Filter.normalize could mostly work as it does, but we
could add an additional condition for "$csp=".

Pseudocode:

  Filter.normalize = function(text)
  {
    ...
    else if (textContainsCSPOption)
    {
      stripAllSpacesExceptSpacesInCSPOption;
    }
    return text.replace(/\s/g, "");
  }

So if there's a "$csp=" option in the text then it'll preserve the spaces within
that option's value, otherwise it will strip all spaces.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:731: option = option.substring(0, separatorIndex);
On 2018/02/13 11:56:47, kzar wrote:
> On 2018/02/05 16:09:58, Manish Jethani wrote:
> > Any particular reason this is now using String.substring?
> 
> Well since I noticed that `substr` was being used mistakenly all this time,
but
> happened to work since the offset was always 0. Quite misleading.

Why is it a mistake to use substr?

Anyway, the change is unrelated so I think it should be part of a separate patch
(and then the change should be made all over the code).

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:779: return new InvalidFilter(origText,
"filter_invalid_regexp");
On 2018/02/13 11:56:46, kzar wrote:
> On 2018/02/05 16:09:58, Manish Jethani wrote:
> > origText is no longer the original text so it has basically lost its
meaning.
> On
> > the other hand it's not very useful either. Maybe we can get rid of the
extra
> > variable?
> 
> The point of `origText` is to show a more useful warning to the user when they
> have added an invalid filter. It's still useful for that purpose, though I
agree
> the name sucks. I'm open to suggestions there.

What if we keep origText pointing to text as it was passed in (with no
modifications)?

Manish Jethani

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode111 lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/20 16:00:57, Manish Jethani ...

Feb. 21, 2018, 2:49 p.m. (2018-02-21 14:49:18 UTC) #6

kzar

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode111 lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/02/21 14:49:18, Manish Jethani ...

March 6, 2018, 2:30 p.m. (2018-03-06 14:30:38 UTC) #7

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js
File lib/filterClasses.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text);
On 2018/02/21 14:49:18, Manish Jethani wrote:

>     ...
>     else if (textContainsCSPOption)
>     {
>       return stripAllSpacesExceptSpacesInCSPOption; /* return statement */
>     }
>     return text.replace(/\s/g, "");
>   }

Well while I agree that doing filter normalisation as a separate step is
probably the right option (filter caching etc) I don't think your suggestion is
practical - I mean forget all this, how about we implement the pseudocode of
`blockAllAdverts(website); /* block adverts */` instead?

So far we strip all whitespace from the filters, so let's take an example
filter:

    whatev   er$  c  s P =csp,  do  main=   adblockplus.org

or another one

    whatev   er$  c  s p =csp,,

We don't know if the part following the '$' is a valid options string until we
have stripped the spaces and checked if the regexp matches. Nor do we know if
any of those options are content security policies until the whitespace is
stripped. We also don't want to do any more work than necessary in the (by far)
common case that a filter doesn't contain a $csp option.

So to do what you suggest I figure we'd have to do something like:

 1. Create a copy of the filter with all whitespace stripped.
 2. If /csp=/i doesn't match simply return that stripped copy, otherwise...
 3. Go back and figure out the positions where one or more whitespaces were
stripped relative to the stripped copy.
 4. See if the options regexp matches the stripped string, if so...
 5. Figure out the positions of all the $csp option values in the stripped copy.
 6. For each position that falls within a $csp option value check if some
whitespace was stripped there, if so insert a space back in.
 7. Finally return the stripped copy with the $csp spaces restored!

I don't think that's practical, even a rough implementation which took us to
number 5 was becoming unwieldy. Therefore I propose that instead we trim any
surrounding whitespace from the filter and replace whitespace with single
spaces. That's not as good sure, but its a whole lot simpler and I figure it's
good enough since;

 1.) The `Filter.normalize` function does not fully normalise filters anyway
(e.g. `foo$domain=FOO.COM` vs `foo$domain=foo.com`).
 2.) We can still trim / strip all whitespace from filter values and the regexp
part at the time of parsing, the same as how options are upper cased.
 3.) I think that in practice it's hard to do things like `foo$ c s p= bar` by
mistake. Trailing whitespace like `foo$csp=bar   ` sure but we'd still handle
that. (In my testing it works in practice too, at least with EasyList and AA.)

So anyway this comment is already far too long, sorry about that. I will close
this review and update the next one[1] to also include my suggested tweak to
`Filter.normalize`.

[1] - https://codereview.adblockplus.org/29680689/

Manish Jethani

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js File lib/filterClasses.js (right): https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js#newcode111 lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text); On 2018/03/06 14:30:38, kzar wrote: ...

March 7, 2018, 12:13 a.m. (2018-03-07 00:13:55 UTC) #8

Message was sent while issue was closed.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js
File lib/filterClasses.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text);
On 2018/03/06 14:30:38, kzar wrote:
> On 2018/02/21 14:49:18, Manish Jethani wrote:
> 
> >     ...
> >     else if (textContainsCSPOption)
> >     {
> >       return stripAllSpacesExceptSpacesInCSPOption; /* return statement */
> >     }
> >     return text.replace(/\s/g, "");
> >   }
> 
> Well while I agree that doing filter normalisation as a separate step is
> probably the right option (filter caching etc) I don't think your suggestion
is
> practical - I mean forget all this, how about we implement the pseudocode of
> `blockAllAdverts(website); /* block adverts */` instead?

You know my suggestion was way more specific than "blockAllAdverts" :)

How about adding these lines at the end of Filter.normalize:

  let {index, 1: options} = optionsWithSpacesRegExp.exec(text) || {};
  if (options)
  {
    return text.substring(0, index).replace(/ /g, "") + "$" +
           options.split(",").map(normalizeOption).join(",");
  }

  return text.replace(/ /g, "");

optionsWithSpacesRegExp and normalizeOption are defined as follows:

  const optionsWithSpacesRegExp =
    /\$( *~? ?[ \w-]+(?:=[^,]+)?(?:, *~? ?[ \w-]+(?:=[^,]+)?)*)$/;

  const normalizeOption = option =>
  {
    let [, name, value] = /(.*?)(=.*)/.exec(option) || [, option, ""];
    name = name.replace(/ /g, "");
    if (name.toUpperCase() != "CSP")
      value = value.replace(/ /g, "");
    return name + value;
  };

We could also keep the uppercasing here (rather change it to lowercasing) so we
don't have to do it in other places.

What do you think?

I do like the idea of keeping this patch so that the other one remains easier to
review.

I really don't think we have to change the behavior of Filter.normalize at all
with respect to non-CSP filters (i.e. all filters that are currently supported).

We could easily just test for "csp=" first as an optimization.

i.e.

  let {index, 1: options} = /c *s *p *=/i.test(text) &&
                            optionsWithSpacesRegExp.exec(text) || {};

Manish Jethani

March 7, 2018, 5:59 a.m. (2018-03-07 05:59:59 UTC) #9

Message was sent while issue was closed.

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.js
File lib/filterClasses.js (right):

https://codereview.adblockplus.org/29680684/diff/29680685/lib/filterClasses.j...
lib/filterClasses.js:111: let filter = Filter.knownFilters.get(text);
On 2018/03/07 00:13:55, Manish Jethani wrote:

> How about adding these lines at the end of Filter.normalize:
>
> [snip]

Another way to do it is by splitting on "csp=":

  let index = text.indexOf("$");
  if (index >= 0)
  {
    let fragments = text.substring(index + 1)
                    .split(/((?:^|,) *c *s *p *=)([^,]+)/i);
    let csp = false;
    for (let i = 0; i < fragments.length; i++)
    {
      if (csp)
      {
        csp = false;
      }
      else
      {
        fragments[i] = fragments[i].replace(/ +/g, "");
        csp = /^,?csp=$/i.test(fragments[i]);
      }
    }

    return text.substring(0, index + 1).replace(/ +/g, "") +
           fragments.join("");
  }

Expand All Messages | Collapse All Messages