|
|
DescriptionIssue #1234 - Remove 'CString' From 'ToLowerString()'
Reimplement 'ToLowerString()' to eliminate 'CString'. New implementation uses
'_wcslwr_s()', the same function used in 'CString::MakeLower()' from our prior
version, but without the overhead of two string conversions. Add unit tests
for new implementation.
Patch Set 1 #
Total comments: 20
Patch Set 2 : rewrite to use _wcslwr_s() #
Total comments: 12
Patch Set 3 : address comments #
Total comments: 2
Patch Set 4 : address comments, round two #
MessagesTotal messages: 10
This shouldn't have been so complicated, but apparently it was, so there are "too many" comments and pragmas.
https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:58: * This function '_wcslwr()' uses the current locale to determine case. I don't think we need this line because it's already communicated in the header. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:59: * This is the same function used within 'CStringW::MakeLower()', which this function replaces. The comment on this line is not necessary to be in the code, it can be in the commit message. However, pay attention that `CStringW::MakeLower` finally calls `_wcslwr_s` not `_wcslwr`. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * So we're using the "insecure" version and suppressing the deprecation warning. Although "insecure" version might work good because Microsoft had cared about it already, I have a strong opinion that we should use "secure" version. Pay attention that the current code works fine, so I would propose to use the same trick here. We can construct `std::wstring lower(s.c_str());` which drops the rest of the string in case there is null character in the middle of `s` string and converts only the beginning of the string not causing any assertion failure. We should reflect it in the comment "May not lower-case entire string..." and accordingly adjust the tests. So, I would drop that comment and do something like // Ensure the string is null-terminated string without zero characters in the middle because otherwise _wcslwr_s fails. std::wstring lower(s.c_str()); assert(s.length() == lower.length() && "The argument string will be truncated till zero value character"); errno_t err = ::_wcslwr_s(&lower[0], lower.size() + 1); assert(err == 0 && "Cannot lower string"); if (err == 0) { return std::wstring(); } return lower; https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:74: _wcslwr(const_cast<wchar_t*>(lower.data())); What about returning of an empty string in case of error? https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.h (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.h:36: * Collates according to the current locale. That seems to be correct choice. We can also create something like `ToLowerStringCLocale` which we can use to convert HTML tags, for example. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.h:37: * May not lower-case entire string if it contains embedded null characters. I know that it can theoretically happen, especially because we do store the entire string while we are converting BSTR which is good, but does it actually happen and when? See the comment for the body. https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:77: ASSERT_EQ(std::wstring(), ToLowerString(L"")); It's still not clear why we are using ASSERT in a new code. https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:82: ASSERT_EQ(std::wstring(L"foo"), ToLowerString(L"Foo")); Is it necessary to explicitly construct std::wstring here? https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:87: ASSERT_EQ(std::wstring(L"\0BAR"), ToLowerString(L"\0BAR")); I think this test and below are incorrect. Mainly because the rest of the C-string after zero value character is dropped while `std::wstring` is being constructed. One should use constructor which takes into account the length of the string, thus `std::wstring(L"\0BAR", 4)`. One also should explicitly construct `std::wstring` with the constructor mentioned above to pass it in `ToLowerString`. On the other hand if you wanted to indicate that we are ignoring whether the rest of the string after zero value character is converted or not then it would be better to compare only the beginning of the string, e.g something like EXPECT_TRUE(BeginsWith(ToLowerString(std::wstring(L"\0BAR", 4)), L""); EXPECT_TRUE(BeginsWith(ToLowerString(std::wstring(L"foo\0BAR", 7)), L"Foo"); If we proceed with the version which drops the tail after zero value character then it should be like EXPECT_EQ(std::wstring(), ToLowerString(std::wstring(L"\0BAR", 4))); EXPECT_EQ(L"Foo", ToLowerString(std::wstring(L"foo\0BAR", 7)));
New patch set 2. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:58: * This function '_wcslwr()' uses the current locale to determine case. On 2015/12/16 10:32:44, sergei wrote: > I don't think we need this line because it's already communicated in the header. It's not the same. The header says is it uses the current locale. This note says _how_ is uses the current locale. It's the difference between specification and implementation. The specification says what the function has to do. The implementation note says how it does it. When you audit the code, you can dig through the documentation to find the place that says that '_wcslwr_s()' uses the current locale, or you can just trust the comment. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:59: * This is the same function used within 'CStringW::MakeLower()', which this function replaces. On 2015/12/16 10:32:44, sergei wrote: > The comment on this line is not necessary to be in the code, it can be in the > commit message. I wavered on this one. I had decided to leave it in because of the slim possibility that we'd be introducing defects if we change the implementation method. It's out. Incidentally, the ATL code base, upon further examination, contains both '_wcslwr' and '_wcslwr_s', in different functions with the same name. I had overlooked the second function and didn't see the overload resolution. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * So we're using the "insecure" version and suppressing the deprecation warning. On 2015/12/16 10:32:44, sergei wrote: > Although "insecure" version might work good because Microsoft had cared about it > already, I have a strong opinion that we should use "secure" version. I had tried it once and it had failed. It turns out that was because the documentation for '_wcslwr_s' is incorrect on an important point. Regarding parameter validation, it says "length of string" where it ought to say "length of buffer" (the one containing the string). > Pay attention that the current code works fine, so I would propose to use the > same trick here. We can construct `std::wstring lower(s.c_str());` which drops > the rest of the string in case there is null character in the middle of `s` > string and converts only the beginning of the string not causing any assertion > failure. There's no need for special handling for embedded null. Since we never see it, it's just bloat to take care of it specially. I did change the commit message to something more useful, essentially a form of "don't do that". https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:74: _wcslwr(const_cast<wchar_t*>(lower.data())); On 2015/12/16 10:32:45, sergei wrote: > What about returning of an empty string in case of error? The tests aren't returning an error code, now that I understand how '_wcslwr_s' *actually* validates parameters. With the new patch set, '_wcslwr_s' will always succeed. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.h (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.h:36: * Collates according to the current locale. On 2015/12/16 10:32:45, sergei wrote: > That seems to be correct choice. We can also create something like > `ToLowerStringCLocale` which we can use to convert HTML tags, for example. For my purposes in this change set, it's the choice that doesn't change the previous behavior. Right or wrong isn't the question at the moment. Now, however, its behavior is documented now so that we can make informed decisions about locale issues in the future. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.h:37: * May not lower-case entire string if it contains embedded null characters. On 2015/12/16 10:32:45, sergei wrote: > I know that it can theoretically happen, especially because we do store the > entire string while we are converting BSTR which is good, but does it actually > happen and when? I don't think we ever have a situation where we'll see an input with embedded null. That said, it's important to document the function as it stands, not how it's used. I've changed this specification comment to make fewer guarantees in this situation. https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:77: ASSERT_EQ(std::wstring(), ToLowerString(L"")); On 2015/12/16 10:32:45, sergei wrote: > It's still not clear why we are using ASSERT in a new code. Because I prefer ASSERT. And as I've said before, there's no difference at all between ASSERT and EXPECT for one line tests. And in most cases you want to terminate early if you fail an assertion. It's an unusual case where a second assertion failure in a unit test provides actual value. https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:82: ASSERT_EQ(std::wstring(L"foo"), ToLowerString(L"Foo")); On 2015/12/16 10:32:45, sergei wrote: > Is it necessary to explicitly construct std::wstring here? I checked. No. They were in there because I've had trouble with the ASSERT macros not properly inferring type conversions before. I took them out. https://codereview.adblockplus.org/29332775/diff/29332776/test/plugin/UtilTes... test/plugin/UtilTest.cpp:87: ASSERT_EQ(std::wstring(L"\0BAR"), ToLowerString(L"\0BAR")); On 2015/12/16 10:32:45, sergei wrote: > Mainly because the rest of the > C-string after zero value character is dropped while `std::wstring` is being > constructed. One should use constructor which takes into account the length of > the string, thus `std::wstring(L"\0BAR", 4)`. Yep. Fixed that. I keep forgetting that constructing from 'char[]' doesn't infer the length of the array; it's a legacy behavior. I've rewritten the tests to be more clear about what actually matters in this situation.
https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:74: _wcslwr(const_cast<wchar_t*>(lower.data())); On 2015/12/16 13:53:26, Eric wrote: > On 2015/12/16 10:32:45, sergei wrote: > > What about returning of an empty string in case of error? > > The tests aren't returning an error code, now that I understand how '_wcslwr_s' > *actually* validates parameters. With the new patch set, '_wcslwr_s' will always > succeed. Still I would like to check the return value and have an `assert` here. However it might be better to return the original string instead of an empty string in case of an error. Our tests merely test whether it works as expected with a couple of common cases and with a couple of uncommon cases (what is the value of current locale?), there are even no tests with non-ASCII characters with different locales, so they don't actually cover many possible cases, so they cannot be a prove that we don't need to take into account the return value. Usually such conversion functions fail on a sequence of bytes which is an invalid UTF-16 string, e.g. "0xD900 0xD900", even if it does not fail now it can fail with some locale or start to fail after some update of the locale, and of course it can fail if there is no enough memory, though it might be very unlikely. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:58: * We call '_wcslwr_s()', which take C-style strings and uses the current locale to determine case. It still can be just a one line comment "uses the current locale" next to the function call https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:65: * Thus we can modify in-place after casting away the 'const' modifier from 'c_str()'. These two lines are completely redundant. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * The call below has argument "length + 1" to include the terminating null character in the buffer. It does say "Size of the buffer.", so these three lines can go away. https://codereview.adblockplus.org/29332775/diff/29332817/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/test/plugin/UtilTes... test/plugin/UtilTest.cpp:94: ASSERT_EQ(L"", std::wstring(actual.c_str())); That looks strange, it's good to know what happened with "BAR"?
New patch set 3. https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332776/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:74: _wcslwr(const_cast<wchar_t*>(lower.data())); On 2015/12/16 15:53:16, sergei wrote: > Still I would like to check the return value and have an `assert` here. However > it might be better to return the original string instead of an empty string in > case of an error. Don't need to. According to the documentation, it can't return an error condition. There's only two conditions where it can: | If str is not a valid null-terminated string, the invalid parameter handler is invoked, as described in Parameter Validation . If execution is allowed to continue, the functions return EINVAL and set errno to EINVAL. We're passing the function the address from 'c_str()', which guarantees a null-terminated string. | If numberOfElements is less than the length of the string, the functions also return EINVAL and set errno to EINVAL. I quote this elsewhere, because it should say "buffer" and not "string", but we're passing it the correct number of elements now. In other words, neither of the error conditions can occur. Therefore I decided that not checking the error code was fine. However, I'm not averse to defensive programming with respect to a Microsoft library function. I've added an error check and throw 'logic_error' if it fails. If it fails, we have an implementation defect, so throwing 'logic_error' is the right behavior, and masking the failure is the wrong one. If we ever see such an error, we'll need to re-implement this function, probably not using '_wcslwr_s'. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:58: * We call '_wcslwr_s()', which take C-style strings and uses the current locale to determine case. On 2015/12/16 15:53:16, sergei wrote: > It still can be just a one line comment "uses the current locale" next to the > function call That's reasonable. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:65: * Thus we can modify in-place after casting away the 'const' modifier from 'c_str()'. On 2015/12/16 15:53:16, sergei wrote: > These two lines are completely redundant. Just because you understand it already doesn't mean it doesn't clarify for someone else. You are not supposed to muck about this internals of 'std::basic_string'. We are, and this comment explains it. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * The call below has argument "length + 1" to include the terminating null character in the buffer. On 2015/12/16 15:53:16, sergei wrote: > It does say "Size of the buffer.", so these three lines can go away. It also says "size of string", and it does that (like I said before) in the part about validation. I quote: | If numberOfElements is less than the length of the string, | the functions also return EINVAL and set errno to EINVAL. When you pass 'numberOfElements' equal to the size of the string, you get an exception. Inconsistent documentation is incorrect documentation. The document is _in fact_ incorrect. Proper auditing the code requires a comment to correct the "normative" documentation. https://codereview.adblockplus.org/29332775/diff/29332817/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/test/plugin/UtilTes... test/plugin/UtilTest.cpp:94: ASSERT_EQ(L"", std::wstring(actual.c_str())); On 2015/12/16 15:53:16, sergei wrote: > That looks strange, it's good to know what happened with "BAR"? Don't need to know. The (new) specification for the function only says that it returns, and with an undefined result. I'm only checking the result here because I have white-box access, and it's a way of ensuring the function doesn't do anything harmful for its undefined result. That's why these are labelled "white-box"; they're cheating the black-box principle of ordinary unit testing to verify a behavior (not crashing or harming otherwise) that's hard to check directly. Also, what happens to "BAR" differs between Debug and Release. The debug version of the '*_s' functions fill the buffer with a dummy character and the release version doesn't. No need to bother with that. I just left it out.
Patch set 3 is current; no need for rebase. This is one of two outstanding reviews for #1234. I would really like to close this one out.
Just a couple of nits. Otherwise looks good. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:65: * Thus we can modify in-place after casting away the 'const' modifier from 'c_str()'. On 2015/12/16 16:38:07, Eric wrote: > On 2015/12/16 15:53:16, sergei wrote: > > These two lines are completely redundant. > > Just because you understand it already doesn't mean it doesn't clarify for > someone else. > > You are not supposed to muck about this internals of 'std::basic_string'. We > are, and this comment explains it. From what I remember the consensus on comments is that we should in general aim to make the code itself as readable as possible. Comments should be kept to minimum because of that. In this instance I think one can figure out what's going on without the comment, so I'd agree it would be best to remove these two lines. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * The call below has argument "length + 1" to include the terminating null character in the buffer. On 2015/12/16 16:38:07, Eric wrote: > On 2015/12/16 15:53:16, sergei wrote: > > It does say "Size of the buffer.", so these three lines can go away. > > It also says "size of string", and it does that (like I said before) in the part > about validation. I quote: > > | If numberOfElements is less than the length of the string, > | the functions also return EINVAL and set errno to EINVAL. > > When you pass 'numberOfElements' equal to the size of the string, you get an > exception. > > Inconsistent documentation is incorrect documentation. The document is _in fact_ > incorrect. Proper auditing the code requires a comment to correct the > "normative" documentation. How about just a comment like: //NOTE: second parameter of _wcslwr_s expects length of buffer, not length of string. https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... test/plugin/UtilTest.cpp:82: ASSERT_EQ(L"foo", ToLowerString(L"Foo")); Nit: why not test "FOO" instead? To make sure not only the first character was lower cased.
Patch set 4 addresses Oleksandr's comments. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... File src/plugin/PluginUtil.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:65: * Thus we can modify in-place after casting away the 'const' modifier from 'c_str()'. On 2016/01/05 22:59:00, Oleksandr wrote: > On 2015/12/16 16:38:07, Eric wrote: > > On 2015/12/16 15:53:16, sergei wrote: > > > These two lines are completely redundant. > > > > Just because you understand it already doesn't mean it doesn't clarify for > > someone else. > > > > You are not supposed to muck about this internals of 'std::basic_string'. We > > are, and this comment explains it. > > From what I remember the consensus on comments is that we should in general aim > to make the code itself as readable as possible. Comments should be kept to > minimum because of that. In this instance I think one can figure out what's > going on without the comment, so I'd agree it would be best to remove these two > lines. Done. https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... src/plugin/PluginUtil.cpp:70: * The call below has argument "length + 1" to include the terminating null character in the buffer. On 2016/01/05 22:59:00, Oleksandr wrote: > How about just a comment like: > //NOTE: second parameter of _wcslwr_s expects length of buffer, not length of > string. It's not typical for documentation to be incorrect. We should say so explicitly. Leaving this one as is. https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... File test/plugin/UtilTest.cpp (right): https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... test/plugin/UtilTest.cpp:82: ASSERT_EQ(L"foo", ToLowerString(L"Foo")); On 2016/01/05 22:59:00, Oleksandr wrote: > Nit: why not test "FOO" instead? To make sure not only the first character was > lower cased. Changed. Generally, when you partially modify data, you want to test both that (1) what's supposed to be modified is modified and (2) what's not supposed to be modified is not modified. FWIW, I'd be more extensive here if this weren't a wrapper around a library function.
On 2016/01/07 14:34:05, Eric wrote: > Patch set 4 addresses Oleksandr's comments. > > https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... > File src/plugin/PluginUtil.cpp (right): > > https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... > src/plugin/PluginUtil.cpp:65: * Thus we can modify in-place after casting away > the 'const' modifier from 'c_str()'. > On 2016/01/05 22:59:00, Oleksandr wrote: > > On 2015/12/16 16:38:07, Eric wrote: > > > On 2015/12/16 15:53:16, sergei wrote: > > > > These two lines are completely redundant. > > > > > > Just because you understand it already doesn't mean it doesn't clarify for > > > someone else. > > > > > > You are not supposed to muck about this internals of 'std::basic_string'. We > > > are, and this comment explains it. > > > > From what I remember the consensus on comments is that we should in general > aim > > to make the code itself as readable as possible. Comments should be kept to > > minimum because of that. In this instance I think one can figure out what's > > going on without the comment, so I'd agree it would be best to remove these > two > > lines. > > Done. > > https://codereview.adblockplus.org/29332775/diff/29332817/src/plugin/PluginUt... > src/plugin/PluginUtil.cpp:70: * The call below has argument "length + 1" to > include the terminating null character in the buffer. > On 2016/01/05 22:59:00, Oleksandr wrote: > > How about just a comment like: > > //NOTE: second parameter of _wcslwr_s expects length of buffer, not length of > > string. > > It's not typical for documentation to be incorrect. We should say so explicitly. > > Leaving this one as is. > > https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... > File test/plugin/UtilTest.cpp (right): > > https://codereview.adblockplus.org/29332775/diff/29332831/test/plugin/UtilTes... > test/plugin/UtilTest.cpp:82: ASSERT_EQ(L"foo", ToLowerString(L"Foo")); > On 2016/01/05 22:59:00, Oleksandr wrote: > > Nit: why not test "FOO" instead? To make sure not only the first character was > > lower cased. > > Changed. Generally, when you partially modify data, you want to test both that > (1) what's supposed to be modified is modified and (2) what's not supposed to be > modified is not modified. > > FWIW, I'd be more extensive here if this weren't a wrapper around a library > function. LGTM
LGTM |