Issue 29331055: Issue #1234 - Remove 'CString' from PluginFilter.*

Eric

Nov. 29, 2015, 4:03 p.m. (2015-11-29 16:03:56 UTC) #1

sergei

A tremendous amount of work is done here! I have not taken a look at ...

Nov. 30, 2015, 12:37 p.m. (2015-11-30 12:37:55 UTC) #2

Eric

Patch set 2 addresses sergei's comments. On 2015/11/30 12:37:55, sergei wrote: > I have not ...

Nov. 30, 2015, 3:54 p.m. (2015-11-30 15:54:08 UTC) #3

Patch set 2 addresses sergei's comments.

On 2015/11/30 12:37:55, sergei wrote:
> I have not taken a look at TokenSequence.h because as I understand we use
> `wcstok_s` instead of it.

Please go look at it and the accompanying unit tests. I did not change to
'wcstok_s' in the new patch set, nor am I inclined to.

https://codereview.adblockplus.org/29331055/diff/29331528/adblockplus.gyp
File adblockplus.gyp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/adblockplus.gyp#new...
adblockplus.gyp:223: 'libraries': ['-ladvapi32', '-lshell32', '-lole32',
'-loleaut32'],
On 2015/11/30 12:37:51, sergei wrote:
> Why do we need it?

SysStringLen, used in ToWstring(BSTR)

Apparently this library is getting linked in the plugin from some other library
dependency, so we don't need to specify it explicitly there.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:36: std::wstring ToLowerString(const std::wstring&
s)
On 2015/11/30 12:37:53, sergei wrote:
> It should be in anonymous namespace.

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:44: void ElementHideParseError(const std::wstring&
filterText, const std::string& reason)
On 2015/11/30 12:37:52, sergei wrote:
> It seems to know the current position would be also very useful. Could you
> include `filterSuffix` here as well?

I thought about doing that. I decided not to since syntax errors in filter lists
are fairly uncommon. My judgement call was that we could do that later.

I'm not averse to generating a proper syntax error message, but doing so
properly should use a utility class to do so and have associated unit tests. We
can do that later.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:44: void ElementHideParseError(const std::wstring&
filterText, const std::string& reason)
On 2015/11/30 12:37:53, sergei wrote:
> It should be in anonymous namespace.

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:44: void ElementHideParseError(const std::wstring&
filterText, const std::string& reason)
On 2015/11/30 12:37:52, sergei wrote:
> It's really not clear from the name of the method that it throws an exception.

I changed the function to a subclass of std::runtime_error, so that the throw
statement is inline. It's only a few extra lines of extra code, but make the
exception behavior obvious.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:79: m_tag =
ToLowerString(filterSuffix.substr(0,pos));
On 2015/11/30 12:37:52, sergei wrote:
> space after comma is missed

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:171: ElementHideParseError(filterText, "empty
attribute name before " + std::string(1, static_cast<char>(arg[0])) + "'='");
On 2015/11/30 12:37:52, sergei wrote:
> std::string(1, static_cast<char>(arg[0]))
> it seems better to call `ToUtf8String(arg[0])`

ToUtf8String only takes wstring arguments. The only basic_string constructor for
characters is the fill constructor that requires a length. The statement would
have to be thus.

    ToUtf8String(std::wstring(1,arg[0]))

I find the code in the patch set more readable, because the std::string
constructor is on the outside and thus the type is obvious.

I should point out that there are only exactly three characters that arg[0] can
be, one of '^', '*', or '$', so there's no question of locale and ToUtf8String
is overkill. If you'd like a code comment to that effect, I can add one.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:281: TokenSequence<std::wstring> ts(classNameList,
L" ");
On 2015/11/30 12:37:52, sergei wrote:
> Did we decide to use `wcstok_s` instead of custom tokenization?

We're using that function at present in exactly one loop in the plugin. We used
it in that case, as I recall, because the proposed code wasn't any better at the
time.

The class 'TokenSequence', though, is definitely better that using 'wcstok_s'.
-- TokenSequence uses C++ strings natively. 'wcstok_s' uses C-style strings.
Using TokenSequence requires no conversion back and forth to C-style strings.
-- TokenSequence uses iterators in the native C++ pattern.
-- TokenSequence only requires a single specification of delimiters. 'wcstok_s'
always requires two calls with the same delimiter string (one to initialize the
loop, one to increment it), violating DRY.
-- TokenSequence loops are shorter than the alternative.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:313: for (auto attrIt =
m_attributeSelectors.begin(); attrIt != m_attributeSelectors.end(); ++ attrIt)
On 2015/11/30 12:37:52, sergei wrote:
> additional space after `++`

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:513: id = std::wstring(ToWstring(idBstr));
On 2015/11/30 12:37:52, sergei wrote:
> It should be just `id = ToWstring(idBstr);`

Done.

Good catch. That's leftover code from an earlier version before I put in
ToWstring. I removed some others like this, but I missed this one.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:528: for (auto idIt = idItEnum.first; idIt !=
idItEnum.second; idIt ++)
On 2015/11/30 12:37:53, sergei wrote:
> Good chance to change `idIt ++` to `++idIt`.

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:542: for (auto idIt = idItEnum.first; idIt !=
idItEnum.second; idIt ++)
On 2015/11/30 12:37:51, sergei wrote:
> Good chance to change `idIt ++` to `++idIt`

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:558: TokenSequence<std::wstring> ts(classNames, L"
\t\n\r");
On 2015/11/30 12:37:52, sergei wrote:
> Same here, let's use `wcstok_s` instead of custom tokenization.

See above comment.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:593: for (auto tagIt = tagItEnum.first; tagIt !=
tagItEnum.second; ++ tagIt)
On 2015/11/30 12:37:53, sergei wrote:
> additional space after `++`

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
File src/plugin/PluginFilter.h (right):

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.h:77: CFilterElementHide(const std::wstring&
filterText=L"");
On 2015/11/30 12:37:53, sergei wrote:
> spaces around '=' are missed

Done.

I wasn't really looking for existing style problems, but I don't mind rolling
them into the present change set. Strictly speaking, they're unrelated, but
there's no good occasion to fix them unless they tag along with changes to
related code.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.h:128: typedef
std::multimap<std::pair<std::wstring,std::wstring>, CFilterElementHide>
TFilterElementHideTagsNamed;
On 2015/11/30 12:37:54, sergei wrote:
> space after comma is missed

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp
File test/UtilTest.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:22: void TrimTestBody(std::wstring input, std::wstring
expected)
On 2015/11/30 12:37:54, sergei wrote:
> It's not introduces in this change set but the comments are the same, const
> references, EXPECT_EQ instead of ASSERT_EQ and let's rename it.

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:28: void TrimLeftTestBody(std::wstring input, std::wstring
expected)
On 2015/11/30 12:37:54, sergei wrote:
> Why not to pass arguments as constant references?

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:28: void TrimLeftTestBody(std::wstring input, std::wstring
expected)
On 2015/11/30 12:37:54, sergei wrote:
> It would be better to call it like
> ExpectEqualToLeftTrimmed(expected, input);

I'm not going to bother with the renaming. I don't see any good reason to to
make it look like a Google Test primitive.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:31: ASSERT_EQ(expected, trimmed);
On 2015/11/30 12:37:54, sergei wrote:
> It should not stop the tests, could you please use EXPECT_EQ?

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:34: void TrimRightTestBody(std::wstring input, std::wstring
expected)
On 2015/11/30 12:37:54, sergei wrote:
> Same here, constant references

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:37: ASSERT_EQ(expected, trimmed);
On 2015/11/30 12:37:55, sergei wrote:
> EXPECT_EQ

Done.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:165: ASSERT_EQ(y, x);
On 2015/11/30 12:37:54, sergei wrote:
> EXPECT_EQ here and below

This statement is largely documentation. It's here to match the ASSERT at the
end. I did change the middle one.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:166: std::wstring trimmed = TrimString(x); // expect here pass
by value
On 2015/11/30 12:37:54, sergei wrote:
> What if one calls it as `TrimString<std::wstring&>(x)`?

If you do that, you get in-place modification. I searched the code; there's no
occurrence of that syntax anywhere in our code. If this were a general purpose
library, I'd worry about it right now.

> It's actually a bug in the test, we should explicitly specify the argument
type
> as a non-constant reference.

It's not a defect in the test. It's arguably a defect in 'TrimString', but if it
is, it's not one that has any significant consequences.

This test simply ensures that when the argument is a variable that it's being
passed by value so that it doesn't break existing code. I changed the
implementation significantly enough that I wanted verification on this point.

Incidentally, most, but not all, of the uses of TrimString are in-place
modifications. It would be good, in any case, to change them to use the explicit
'...InPlace...' variants in the new code. I didn't do it in this change set
because it's too much out of scope.

If we want library-quality code, we'd want all of the following variant
arguments
- T
- T&
- const T&
- T&&
It would take longer to write the unit tests for all of those than to write all
the variants. I don't see it being worth the time at present to do that. And if
we want to do it, it's a separate change and a new ticket #.

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:171: TEST(TrimTest, TrimBindsOnConstArg)
On 2015/11/30 12:37:55, sergei wrote:
> If it's an attempt to test whether there is some bug which modifies the string
> passed into TrimString then I would say we likely don't need such test.

It verifies the modification, yes. The real purpose of the test is to ensure
that the present function works with const arguments. That's why the test has
the name it does.

> Even more, here is a bug in the test because the parameter of TrimString will
a
> copy of an argument and `TrimString<const std::wstring&>(x)` is not
compilable.

There's no defect in this test. It verifies exactly what it's supposed to do,
that T instantiates as 'std::wstring', that the const argument is used to
initialize a temporary argument object, and that the return value is correct.
The verification that the const variable 'x' hasn't changed is icing on the
cake.

`TrimString<const std::wstring&>(x)` doesn't need to be compilable. As I said
above, 'TrimString' isn't meant a completely general purpose template function.

sergei

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp File src/plugin/PluginFilter.cpp (right): https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp#newcode171 src/plugin/PluginFilter.cpp:171: ElementHideParseError(filterText, "empty attribute name before " + std::string(1, static_cast<char>(arg[0])) ...

Nov. 30, 2015, 4:36 p.m. (2015-11-30 16:36:03 UTC) #4

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:171: ElementHideParseError(filterText, "empty
attribute name before " + std::string(1, static_cast<char>(arg[0])) + "'='");
On 2015/11/30 15:54:06, Eric wrote:
> On 2015/11/30 12:37:52, sergei wrote:
> > std::string(1, static_cast<char>(arg[0]))
> > it seems better to call `ToUtf8String(arg[0])`
> 
> ToUtf8String only takes wstring arguments. The only basic_string constructor
for
> characters is the fill constructor that requires a length. The statement would
> have to be thus.
> 
>     ToUtf8String(std::wstring(1,arg[0]))
> 
> I find the code in the patch set more readable, because the std::string
> constructor is on the outside and thus the type is obvious.
> 
> I should point out that there are only exactly three characters that arg[0]
can
> be, one of '^', '*', or '$', so there's no question of locale and ToUtf8String
> is overkill. If you'd like a code comment to that effect, I can add one.

Acknowledged.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:281: TokenSequence<std::wstring> ts(classNameList,
L" ");
On 2015/11/30 15:54:05, Eric wrote:
> On 2015/11/30 12:37:52, sergei wrote:
> > Did we decide to use `wcstok_s` instead of custom tokenization?
> 
> We're using that function at present in exactly one loop in the plugin. We
used
> it in that case, as I recall, because the proposed code wasn't any better at
the
> time.
> 
> The class 'TokenSequence', though, is definitely better that using 'wcstok_s'.
> -- TokenSequence uses C++ strings natively. 'wcstok_s' uses C-style strings.
> Using TokenSequence requires no conversion back and forth to C-style strings.
> -- TokenSequence uses iterators in the native C++ pattern.
> -- TokenSequence only requires a single specification of delimiters. 
'wcstok_s'
> always requires two calls with the same delimiter string (one to initialize
the
> loop, one to increment it), violating DRY.
> -- TokenSequence loops are shorter than the alternative.
Of course it's better than 'wcstok_s', it's difficult to make something equally
or more bad.
@Oleksandr, previous time you was the originator of using 'wcstok_s', what is
your opinion?

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp
File test/UtilTest.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/test/UtilTest.cpp#n...
test/UtilTest.cpp:166: std::wstring trimmed = TrimString(x); // expect here pass
by value
On 2015/11/30 15:54:08, Eric wrote:
> On 2015/11/30 12:37:54, sergei wrote:
> > What if one calls it as `TrimString<std::wstring&>(x)`?
> 
> If you do that, you get in-place modification. I searched the code; there's no
> occurrence of that syntax anywhere in our code. If this were a general purpose
> library, I'd worry about it right now.
> 
> > It's actually a bug in the test, we should explicitly specify the argument
> type
> > as a non-constant reference.
> 
> It's not a defect in the test. It's arguably a defect in 'TrimString', but if
it
> is, it's not one that has any significant consequences.
> 
> This test simply ensures that when the argument is a variable that it's being
> passed by value so that it doesn't break existing code. I changed the
> implementation significantly enough that I wanted verification on this point.
> 
> Incidentally, most, but not all, of the uses of TrimString are in-place
> modifications. It would be good, in any case, to change them to use the
explicit
> '...InPlace...' variants in the new code. I didn't do it in this change set
> because it's too much out of scope.
> 
> If we want library-quality code, we'd want all of the following variant
> arguments
> - T
> - T&
> - const T&
> - T&&
> It would take longer to write the unit tests for all of those than to write
all
> the variants. I don't see it being worth the time at present to do that. And
if
> we want to do it, it's a separate change and a new ticket #.

Acknowledged. It's another issue, let's leave it.

Eric

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp File src/plugin/PluginFilter.cpp (right): https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp#newcode281 src/plugin/PluginFilter.cpp:281: TokenSequence<std::wstring> ts(classNameList, L" "); On 2015/11/30 16:36:03, sergei wrote: ...

Nov. 30, 2015, 5:05 p.m. (2015-11-30 17:05:47 UTC) #5

Oleksandr

This was a gargantuan review indeed, thanks :) I would simplify it a bit though. ...

Dec. 4, 2015, 1:03 a.m. (2015-12-04 01:03:17 UTC) #6

This was a gargantuan review indeed, thanks :) I would simplify it a bit though.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:281: TokenSequence<std::wstring> ts(classNameList,
L" ");
On 2015/11/30 16:36:03, sergei wrote:
> On 2015/11/30 15:54:05, Eric wrote:
> > On 2015/11/30 12:37:52, sergei wrote:
> > > Did we decide to use `wcstok_s` instead of custom tokenization?
> > 
> > We're using that function at present in exactly one loop in the plugin. We
> used
> > it in that case, as I recall, because the proposed code wasn't any better at
> the
> > time.
> > 
> > The class 'TokenSequence', though, is definitely better that using
'wcstok_s'.
> > -- TokenSequence uses C++ strings natively. 'wcstok_s' uses C-style strings.
> > Using TokenSequence requires no conversion back and forth to C-style
strings.
> > -- TokenSequence uses iterators in the native C++ pattern.
> > -- TokenSequence only requires a single specification of delimiters. 
> 'wcstok_s'
> > always requires two calls with the same delimiter string (one to initialize
> the
> > loop, one to increment it), violating DRY.
> > -- TokenSequence loops are shorter than the alternative.
> Of course it's better than 'wcstok_s', it's difficult to make something
equally
> or more bad.
> @Oleksandr, previous time you was the originator of using 'wcstok_s', what is
> your opinion?

TokenSequence looks beautiful but unnecessary, IMHO. I don't see why we need it.
Even if just using wcstok_s is questionable (because of native-std string
conversions), what's wrong with something like: 
   std::wstringstream classNameListStream(classNameList);
   for (std::wstring iterClassName; std::getline(classNameListStream,
iterClassName, L' ');)
   {
     ...
   }

https://codereview.adblockplus.org/29331055/diff/29331624/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331624/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:40: return ToWstring(ToCString(s).MakeLower());
Converting to CString and back seems redundant. How about _wcslwr_s instead?

https://codereview.adblockplus.org/29331055/diff/29331624/src/shared/Utils.h
File src/shared/Utils.h (right):

https://codereview.adblockplus.org/29331055/diff/29331624/src/shared/Utils.h#...
src/shared/Utils.h:73: * Th method used below for trimming strings is taken from
here:
Nit: "The" method?

Eric

New patch set. Rebase only affect the GYP file. https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp File src/plugin/PluginFilter.cpp (right): https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFilter.cpp#newcode281 src/plugin/PluginFilter.cpp:281: ...

Dec. 6, 2015, 5:39 p.m. (2015-12-06 17:39:57 UTC) #7

New patch set. Rebase only affect the GYP file.

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331528/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:281: TokenSequence<std::wstring> ts(classNameList,
L" ");
On 2015/12/04 01:03:15, Oleksandr wrote:
> TokenSequence looks beautiful but unnecessary, IMHO. I don't see why we need
it.

No one ever needs beautiful code. It does, however, reduce defect rates, improve
maintainability, made code easier to understand and audit. If you don't want
these benefits, please use ugly code. I prefer not to write such code.

> Even if just using wcstok_s is questionable (because of native-std string
> conversions), what's wrong with something like: 
>    std::wstringstream classNameListStream(classNameList);
>    for (std::wstring iterClassName; std::getline(classNameListStream,
> iterClassName, L' ');)
>    {
>      ...
>    }

The performance overhead of stream buffering is rather large, particularly for
something as simple as delimiter-based parsing. This function is called once per
per element (roughly) in the traverser, so performance is at least somewhat a
concern.

https://codereview.adblockplus.org/29331055/diff/29331624/src/plugin/PluginFi...
File src/plugin/PluginFilter.cpp (right):

https://codereview.adblockplus.org/29331055/diff/29331624/src/plugin/PluginFi...
src/plugin/PluginFilter.cpp:40: return ToWstring(ToCString(s).MakeLower());
On 2015/12/04 01:03:16, Oleksandr wrote:
> Converting to CString and back seems redundant. How about _wcslwr_s instead?

I put this in the commit message as out-of-scope.

The particular issue is this. HTML attribute names can contain Unicode
characters. If it were just ASCII, it would be an easy issue. I don't want to
delve into all the associated locale issues here.

Rather than commingle this change into an already-large review, I decided to
just use the same function as the previous change, avoiding a regression on this
point. A future change set can change it. If such a change causes problems, it
will be isolated and easier to deal with.

https://codereview.adblockplus.org/29331055/diff/29331624/src/shared/Utils.h
File src/shared/Utils.h (right):

https://codereview.adblockplus.org/29331055/diff/29331624/src/shared/Utils.h#...
src/shared/Utils.h:73: * Th method used below for trimming strings is taken from
here:
On 2015/12/04 01:03:16, Oleksandr wrote:
> Nit: "The" method?

Done.

Eric

New patch set is a rebase. Its only difference is to remove the declaration and ...

Dec. 14, 2015, 12:12 p.m. (2015-12-14 12:12:50 UTC) #8

Eric

Patch set 5 is the same as previous except for removing the definition of 'ToLowerString()', ...

Dec. 14, 2015, 4:21 p.m. (2015-12-14 16:21:31 UTC) #9

Eric

It's been two weeks since any response on this review. The only sticking point seems ...

Dec. 22, 2015, 12:47 p.m. (2015-12-22 12:47:24 UTC) #10

Eric

Patch set 6 is a rebase only. (The fourth one in a row, for the ...

Jan. 5, 2016, 4:12 p.m. (2016-01-05 16:12:33 UTC) #11

sergei

JIC, review of TokenSequence https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSequence.h File src/plugin/TokenSequence.h (right): https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSequence.h#newcode26 src/plugin/TokenSequence.h:26: typedef TokenSequenceConstIterator<T> ConstIterator; I would ...

Jan. 25, 2016, 2:13 p.m. (2016-01-25 14:13:55 UTC) #12

Eric

I've addressed all of Sergei's comments. No changes to behavior were necessary. There might be ...

Jan. 26, 2016, 3:54 p.m. (2016-01-26 15:54:21 UTC) #13

I've addressed all of Sergei's comments. No changes to behavior were necessary.
There might be room for additional comments, but I'll wait for another round
from Sergei to do so.

I am still waiting for Oleksandr to reply to my comments on 2015-12-06 defending
the presence of this class. He can weigh in also on what further comments would
be useful.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
File src/plugin/TokenSequence.h (right):

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:26: typedef TokenSequenceConstIterator<T>
ConstIterator;
When our style guide at https://adblockplus.org/en/coding-style has a specific
exemption for this, I'll gladly change it.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:76: const Container& sequence;
On 2016/01/25 14:13:54, sergei wrote:
> I'm pretty sure it generates a warning something like "The compiler cannot
> generate copy ctr and assignment operator."

I am completely sure it does not generate a warning, because I've compiled the
code.

> Could you explicitly make this class "uncopyable"?

I would worry about this more if I were writing a proper library class. As it
is, I've now got multiple changes that depend on this code and I am resistant to
making changes that don't involve defects or other major problems.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:77: size_t posFirstCharInToken;
On 2016/01/25 14:13:54, sergei wrote:
> Wouldn't it better to name it as `tokenFirstCharPos[ition]` because it's a
> position of the first character of the token in the container.

It's "position of the first character in the token". It's generally better to
put what kind of thing it is first in the name if you're dealing with an
inadequate type system. (Positions in strings should be their own type, IMO, but
they're not.)

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:78: size_t posFirstCharNotInToken;
On 2016/01/25 14:13:53, sergei wrote:
> And this one is a position of the next delimiter, so something like
> `nextDelimiterPos`.

Your suggested name is not as precise as the current one.

These two variables are the endpoints of a half-closed, half-open interval of
positions that define a token, and their names reflect exactly that.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:101: : sequence(sequence),
posFirstCharInToken(T::npos)
On 2016/01/25 14:13:54, sergei wrote:
> I feel myself uncomfortable when not all members without default constructor
are
> initialized.

If you would like a comment about why 'posFirstCharNotInToken' is not
initialized, I can add one. It's a deliberate choice. The current code
initializes to a valid iterator state; that's all that matters.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:108: void Advance(size_t posStart)
On 2016/01/25 14:13:54, sergei wrote:
> Wouldn't it be better to rename it to something like `FindNextToken` because
> `Advance` refers rather to a sequence of calls of `iterator::operator++` in
> context of iterators.

It's better to name functions "what they do" than "how they work". In this case,
"Advance" is named relative to the constructor and 'operator++', meaning
"advance to the next token".

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:110: posFirstCharInToken =
sequence.value.find_first_not_of(sequence.delimiters, posStart);
On 2016/01/25 14:13:53, sergei wrote:
> That line "eats" all delimiters and skips empty tokens, that's very important
> moment. It should be reflected at least in the comment for `TokenSequence`, it
> also could be good to have it commented here.

From my point of view, it's completely conventional to collapse multiple
whitespace delimiter characters into a single delimiter. Please don't use this
class with "comma" as a delimiter.

If this were a general purpose class, I'd worry about this more.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:140: void operator++()
On 2016/01/25 14:13:53, sergei wrote:
> It seems we are not using it, but it can be very convenient to return `*this`
> from the prefix increment operator.

Right. We're not using it. So no need for a return value.

Again: this is not a general purpose library function. If it ever does, that's
the time to add such things.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:144: return; // Incrementing "end()" yields "end()"
On 2016/01/25 14:13:52, sergei wrote:
> Should we throw an exception here instead?

In fact, the way that this code is used in 'for' loops, this statement is never
actually reached in any code we'll use. The presence of this test is purely
defensive.

And to answer the question directly, no, because that would not be the
documented behavior, which is that incrementing "end()" yields "end()".
Thankfully, this is not a library function, so we don't need to debate what
interface would be better for general purpose use.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:149: posFirstCharInToken = T::npos;
On 2016/01/25 14:13:53, sergei wrote:
> I'm not sure that we actually need to make an assignment here, actually
> `posFirstCharNotInToken` cannot be bigger than `sequence.value.length()`, but
I
> don't object against `>=` in the clause above.

The very simple answer is that we're incrementing from a non-"end()" iterator to
the "end()" iterator, and there's a single canonical value for "end()". That's
valuable because it makes 'operator==' much less complicated.

If we didn't have this statement, loops on the iterator would never terminate.

The class invariant is that if a token is present, both 'posFirstCharNotInToken'
and 'posFirstCharInToken' be string positions. See line 161 below for why. The
design decision here is that these position tests go in the state-change
functions. Access and predicate functions therefore do not perform such tests.

Lines 117-120 above are a related test. Those fix up the 'npos' return value
into a proper length.

Using ">=" is slightly clearer to me than using "==". It's never the case that
this test is reached with a positive value greater than then length.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:150: return;
On 2016/01/25 14:13:54, sergei wrote:
> Should we rather throw an exception?

Never. Incrementing to "end()" is not an error.

sergei

IMO, TokenSequence has too specific implementation, so I would rather prefer to don't have it, ...

Jan. 26, 2016, 10:34 p.m. (2016-01-26 22:34:59 UTC) #14

IMO, TokenSequence has too specific implementation, so I would rather prefer to
don't have it, at least in this commit.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
File src/plugin/TokenSequence.h (right):

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:26: typedef TokenSequenceConstIterator<T>
ConstIterator;
On 2016/01/26 15:54:19, Eric wrote:
> When our style guide at https://adblockplus.org/en/coding-style has a specific
> exemption for this, I'll gladly change it.

I thought that it's natural to adhere to common iterator interface. Why do
`cbegin` and `cend` start from lower letter and have "c"? I really don't want
them to be renamed, but it would be definitely better to have public
`const_iterator`.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:76: const Container& sequence;
On 2016/01/26 15:54:20, Eric wrote:
> On 2016/01/25 14:13:54, sergei wrote:
> > I'm pretty sure it generates a warning something like "The compiler cannot
> > generate copy ctr and assignment operator."
> 
> I am completely sure it does not generate a warning, because I've compiled the
> code.
OK, our warning level does not generate a warning here, but it does not mean
that we should not write a clean code.

> > Could you explicitly make this class "uncopyable"?
> 
> I would worry about this more if I were writing a proper library class. As it
> is, I've now got multiple changes that depend on this code and I am resistant
to
> making changes that don't involve defects or other major problems.

I think that these classes are quite generic and deserve a proper
implementation. If it's so specific, why is it a template class and why is it
not mentioned in the comments that it's not something which can be safely widely
used?
It's not clear what are these multiple changes that depend on this code. Anyway,
it shouldn't be a reason that allows to commit dirty code.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:77: size_t posFirstCharInToken;
On 2016/01/26 15:54:20, Eric wrote:
> On 2016/01/25 14:13:54, sergei wrote:
> > Wouldn't it better to name it as `tokenFirstCharPos[ition]` because it's a
> > position of the first character of the token in the container.
> 
> It's "position of the first character in the token". It's generally better to
> put what kind of thing it is first in the name if you're dealing with an
> inadequate type system. (Positions in strings should be their own type, IMO,
but
> they're not.)

Let's say there is a string "aaa bbb ccc", after tokenization using space as a
delimiter we get three tokens: "aaa", "bbb" and "ccc". The "position of the
first character in token" is apparently always 1 (or zero if we are using
zero-based numbering), however the position of the first character of each token
in the container is different, for "aaa" it's 1, for "bbb" it's 5 and for "ccc"
it's 9.
I don't have a strong opinion regarding position of "pos" but the rest of the
name is confusing.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:78: size_t posFirstCharNotInToken;
On 2016/01/26 15:54:21, Eric wrote:
> On 2016/01/25 14:13:53, sergei wrote:
> > And this one is a position of the next delimiter, so something like
> > `nextDelimiterPos`.
> 
> Your suggested name is not as precise as the current one.
>
> These two variables are the endpoints of a half-closed, half-open interval of
> positions that define a token, and their names reflect exactly that.
That's clear that it's a right-open interval, BTW, you can explain it in the
comments. You are right, the first character of not token is a delimiter or
nothing (can be unknown character) but in my opinion during the reading of the
code it's easier to consider the end of the string as a delimiter than read a
little bit complicated variable name. It would be already better to name it
`posNotTokenFirstChar`.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:101: : sequence(sequence),
posFirstCharInToken(T::npos)
On 2016/01/26 15:54:20, Eric wrote:
> On 2016/01/25 14:13:54, sergei wrote:
> > I feel myself uncomfortable when not all members without default constructor
> are
> > initialized.
> 
> If you would like a comment about why 'posFirstCharNotInToken' is not
> initialized, I can add one. It's a deliberate choice. The current code
> initializes to a valid iterator state; that's all that matters.

Don't forget about readability and maintainability. When I see that the member
field is not initialized the first thing I think is that there is a bug and I
need to check the implementation of the whole class. 
Here comparing `posFirstCharInToken` against `npos` might always tell us whether
we may or not use other fields, but in general when I need to modify the class
(e.g. add some method), again I need to very cautiously check that it will be in
a consistent state and to do it I need dive into details of all algorithms with
all their edge-cases.

Off-topic: do you cross a street driving car on red light when there is nobody?
Basically, it's safe and you achieve the goal, that's all that matters, isn't
it?

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:108: void Advance(size_t posStart)
On 2016/01/26 15:54:19, Eric wrote:
> On 2016/01/25 14:13:54, sergei wrote:
> > Wouldn't it be better to rename it to something like `FindNextToken` because
> > `Advance` refers rather to a sequence of calls of `iterator::operator++` in
> > context of iterators.
> 
> It's better to name functions "what they do" than "how they work". In this
case,
> "Advance" is named relative to the constructor and 'operator++', meaning
> "advance to the next token".
"advance to the next token" actually means `find next token`, does not it?
In general `Advance` is fine for me.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:110: posFirstCharInToken =
sequence.value.find_first_not_of(sequence.delimiters, posStart);
On 2016/01/26 15:54:19, Eric wrote:
> On 2016/01/25 14:13:53, sergei wrote:
> > That line "eats" all delimiters and skips empty tokens, that's very
important
> > moment. It should be reflected at least in the comment for `TokenSequence`,
it
> > also could be good to have it commented here.
> 
> From my point of view, it's completely conventional to collapse multiple
> whitespace delimiter characters into a single delimiter. Please don't use this
> class with "comma" as a delimiter.
Why not to use it with comma as a delimiter and how will one be able to
understand it in a year?
BTW, the actual limitation I see here, it won't work when a delimiter character
consists from several elements (several bytes for UTF-8 or it's a surrogate pair
for UTF-16). I have not commented it before and I don't think we should pay
attention to it, because if we ever need such functionality we will pay
attention to each method which works with the string.

> 
> If this were a general purpose class, I'd worry about this more.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:144: return; // Incrementing "end()" yields "end()"
On 2016/01/26 15:54:20, Eric wrote:
> On 2016/01/25 14:13:52, sergei wrote:
> > Should we throw an exception here instead?
> 
> In fact, the way that this code is used in 'for' loops, this statement is
never
> actually reached in any code we'll use. The presence of this test is purely
> defensive.
It defensive but it protects silently, if we are trying to increase it then
there is likely a bug and it would be better to be notified about it. I would
personally use assert, but in this project we are extensively using pure
exceptions for such cases.
> 
> And to answer the question directly, no, because that would not be the
> documented behavior, which is that incrementing "end()" yields "end()".
> Thankfully, this is not a library function, so we don't need to debate what
> interface would be better for general purpose use.

IMO, incrementing of undereferenceable iterator is an error because it's not
expected in C++.

https://codereview.adblockplus.org/29331055/diff/29333226/src/plugin/TokenSeq...
src/plugin/TokenSequence.h:149: posFirstCharInToken = T::npos;
On 2016/01/26 15:54:19, Eric wrote:
> On 2016/01/25 14:13:53, sergei wrote:
> > I'm not sure that we actually need to make an assignment here, actually
> > `posFirstCharNotInToken` cannot be bigger than `sequence.value.length()`,
but
> I
> > don't object against `>=` in the clause above.
> 
> The very simple answer is that we're incrementing from a non-"end()" iterator
to
> the "end()" iterator, and there's a single canonical value for "end()". That's
> valuable because it makes 'operator==' much less complicated.
Sorry, I have overlooked it, I have wrongly read `posFirstCharInToken` as
`posFirstCharNotInToken`. Anyway, now I think that this assignment should be
inside `Advance` method.
> 
> If we didn't have this statement, loops on the iterator would never terminate.
> 
> The class invariant is that if a token is present, both
'posFirstCharNotInToken'
> and 'posFirstCharInToken' be string positions. See line 161 below for why. The
> design decision here is that these position tests go in the state-change
> functions. Access and predicate functions therefore do not perform such tests.
> 
> Lines 117-120 above are a related test. Those fix up the 'npos' return value
> into a proper length.
> 
> Using ">=" is slightly clearer to me than using "==". It's never the case that
> this test is reached with a positive value greater than then length.

Oleksandr

> The performance overhead of stream buffering is rather large, particularly for > something as ...

Jan. 27, 2016, 1:31 a.m. (2016-01-27 01:31:34 UTC) #15

Eric

On 2016/01/27 01:31:34, Oleksandr wrote: > Let's look at the numbers. In my quick and ...

Feb. 4, 2016, 6:09 p.m. (2016-02-04 18:09:14 UTC) #16

Eric

Patch set 7 removes 'TokenSequence' and replaces it with loops based on 'wcstok_s'.

Feb. 4, 2016, 7:50 p.m. (2016-02-04 19:50:44 UTC) #17

sergei

Beside a couple of nits LGTM. https://codereview.adblockplus.org/29331055/diff/29335758/src/plugin/PluginFilter.cpp File src/plugin/PluginFilter.cpp (right): https://codereview.adblockplus.org/29331055/diff/29335758/src/plugin/PluginFilter.cpp#newcode79 src/plugin/PluginFilter.cpp:79: //TODO: Add support ...

May 2, 2016, 2:33 p.m. (2016-05-02 14:33:49 UTC) #19

Eric

May 17, 2016, 7:30 p.m. (2016-05-17 19:30:39 UTC) #20

Patch set 8 moves one white space character in a comment.

https://codereview.adblockplus.org/29331055/diff/29335758/src/shared/Utils.h
File src/shared/Utils.h (right):

https://codereview.adblockplus.org/29331055/diff/29335758/src/shared/Utils.h#...
src/shared/Utils.h:85: text.erase(text.begin(), std::find_if(text.begin(),
text.end(), isNotWhitespace<T::value_type>));
On 2016/05/02 14:33:48, sergei wrote:
> Does it compile? I expect "typename" before "T::value_type".

Of course it compiles. Do _you_ post code to review that doesn't compile?

It has passed unit tests since the very first patch set. How would that have
happened if it didn't compile?

Issue 29331055: Issue #1234 - Remove 'CString' from PluginFilter.* (Closed)

Description

Patch Set 1 : #

Patch Set 2 : address comments #

Patch Set 3 : rebase #

Patch Set 4 : rebase #

Patch Set 5 : rebase #

Patch Set 6 : rebase #

Patch Set 7 : remove TokenSequence #

Patch Set 8 : fix whitespace nit #

Messages