|
|
Created:
July 3, 2018, 3:50 p.m. by Tudor Avram Modified:
July 17, 2018, 9:14 a.m. Visibility:
Public. |
DescriptionIssue #6707 - Make the generated malware domain filter list encode domains as Punycode
Patch Set 1 #
Total comments: 4
Patch Set 2 : updated tests #
Total comments: 2
Patch Set 3 : Updated encoding format #
Total comments: 3
Patch Set 4 : Removed uncessary codecs call from updateMalwareDomains.py #
Created: July 9, 2018, 1:08 p.m.
(Patch set is too large to download)
MessagesTotal messages: 16
Here is my first implementation for this issue.
Hi Tudor, This looks pretty good. Just one comment about .gitignore and one about the test (see below). Cheers, Vasily https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore File .gitignore (right): https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore#newcode9 .gitignore:9: env/ This virtualenv doesn't need to be there -- you can have it outside and then it doesn't need to be ignored. https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri... File sitescripts/subscriptions/test/test_updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri... sitescripts/subscriptions/test/test_updateMalwareDomainsList.py:61: idna_url = 'xn--xhq02ykwbp4a.cn\n' I think it would make sense to have 2 domains in the list just to make sure that each domain is treated separately by the recoding process. The reason for this is that if we take something like 'xn--domain1\nxn--domain2\n' and then recode it without breaking into separate lines, the result is different from breaking it into separate lines and then processing each line separately (this is what we want). The checking in line 77 will need to be a bit more sophisticated to check that the result is really what we want.
LGTM Sebastian, could you please check that the output looks right (as in: it's what you expect on the client side). Also we can coordinate the deployment of this with your release for less disruption. Cheers, Vasily
For reference, here's the whole patch combined (for some reason Rietveld doesn't let us download it). Take the part between the lines of "=". ========================================= Index: .gitignore diff --git a/.gitignore b/.gitignore index 00be79377c4e81d7ff3916b25835e8b22ab7fae5..37031079392f2a15409dfc9b1935e678f96f9af5 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,5 @@ /.sitescripts.test /.coverage htmlcov/ +.idea/ +.pytest_cache/ Index: sitescripts/subscriptions/bin/updateMalwareDomainsList.py diff --git a/sitescripts/subscriptions/bin/updateMalwareDomainsList.py b/sitescripts/subscriptions/bin/updateMalwareDomainsList.py index a2cfb5ff8d46bfe4534dc79f0f03f47e31257519..c2f208228717f4e3122a4bc95cbb39f6f66c8a36 100644 --- a/sitescripts/subscriptions/bin/updateMalwareDomainsList.py +++ b/sitescripts/subscriptions/bin/updateMalwareDomainsList.py @@ -60,7 +60,7 @@ def main(): subprocess.check_call(['hg', '-q', 'up', '-R', tempdir, '-r', 'default']) path = os.path.join(tempdir, 'malwaredomains_full.txt') - file = codecs.open(path, 'wb', encoding='utf-8') + file = codecs.open(path, 'wb') print >>file, FILTERLIST_HEADER @@ -80,7 +80,7 @@ def main(): if not domain: continue - print >>file, '||%s^' % domain.decode('idna') + print >>file, '||%s^' % domain.decode('idna').encode('punycode') file.close() if subprocess.check_output(['hg', 'stat', '-R', tempdir]) != '': Index: sitescripts/subscriptions/test/test_updateMalwareDomainsList.py diff --git a/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py b/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py index bd1dd2521aba2c12f550d3dc965c67a78b150386..e0939649ad08aa3e35da86710ba2faa77442c788 100644 --- a/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py +++ b/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py @@ -58,7 +58,8 @@ def urlopen(mocker): if url.startswith('good'): zf_data = io.BytesIO() with zipfile.ZipFile(zf_data, 'w') as zf: - zf.writestr('justdomains', 'success\n') + idna_urls = ['xn--xhq02ykwbp4a.cn', 'xn--fiq80yua78t.tw'] + zf.writestr('justdomains', '\n'.join(idna_urls) + '\n') return io.BytesIO(zf_data.getvalue()) if url.startswith('bad'): raise urllib2.HTTPError('Bad', '42', 'No good', [], None) @@ -73,7 +74,9 @@ def test_good(md_repo): main() subprocess.check_call(['hg', 'up'], cwd=md_repo.strpath) result = md_repo.join('malwaredomains_full.txt').read() - assert 'success' in result + puny_urls = ['.cn-w48dx81cufdl6b', '.tw-u68dw51cgbz81a'] + for url in puny_urls: + assert url in result def test_bad(md_repo, config): =========================================
Argh. No, the previous comment doesn't work, after line wrapping by Rietveld. Sebastian, if you want to apply the patch locally, you can just concatenate the patches for the three files (individual download links are still there).
Argh. No, the previous comment doesn't work, after line wrapping by Rietveld. Sebastian, if you want to apply the patch locally, you can just concatenate the patches for the three files (individual download links are still there).
https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri... File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri... sitescripts/subscriptions/bin/updateMalwareDomainsList.py:83: print >>file, '||%s^' % domain.decode('idna').encode('punycode') This seems incorrect. Let's take "xn--fuson-1sa.org" (actual domain in the Malware Domains List): >>> 'xn--fuson-1sa.org'.decode('idna').encode('punycode') b'fuson.org-i5a' This is incorrect, we need 'xn--fuson-1sa.org' here (with the "xn--" annotation, like it was in the input). As far as I understand, this line should be just changed like this: print >>file, '||%s^' % domain
Amended the code. https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore File .gitignore (right): https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore#newcode9 .gitignore:9: env/ On 2018/07/03 16:01:52, Vasily Kuznetsov wrote: > This virtualenv doesn't need to be there -- you can have it outside and then it > doesn't need to be ignored. Done. https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri... File sitescripts/subscriptions/test/test_updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri... sitescripts/subscriptions/test/test_updateMalwareDomainsList.py:61: idna_url = 'xn--xhq02ykwbp4a.cn\n' On 2018/07/03 16:01:52, Vasily Kuznetsov wrote: > I think it would make sense to have 2 domains in the list just to make sure that > each domain is treated separately by the recoding process. The reason for this > is that if we take something like 'xn--domain1\nxn--domain2\n' and then recode > it without breaking into separate lines, the result is different from breaking > it into separate lines and then processing each line separately (this is what we > want). > > The checking in line 77 will need to be a bit more sophisticated to check that > the result is really what we want. Done. https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri... File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri... sitescripts/subscriptions/bin/updateMalwareDomainsList.py:83: print >>file, '||%s^' % domain.decode('idna').encode('punycode') On 2018/07/05 14:55:28, Sebastian Noack wrote: > This seems incorrect. Let's take "xn--fuson-1sa.org" (actual domain in the > Malware Domains List): > > >>> 'xn--fuson-1sa.org'.decode('idna').encode('punycode') > b'fuson.org-i5a' > > This is incorrect, we need 'xn--fuson-1sa.org' here (with the "xn--" annotation, > like it was in the input). > > As far as I understand, this line should be just changed like this: > > print >>file, '||%s^' % domain Done.
https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = codecs.open(path, 'wb') codecs.open() without specifying an encoding is equivalent to just open().
On 2018/07/06 18:23:25, Sebastian Noack wrote: > https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... > File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): > > https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... > sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = > codecs.open(path, 'wb') > codecs.open() without specifying an encoding is equivalent to just open(). ok. so do you want me to change it to be just open() ?
https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = codecs.open(path, 'wb') On 2018/07/06 18:23:25, Sebastian Noack wrote: > codecs.open() without specifying an encoding is equivalent to just open(). That is what I'm suggesting.
removed unnecessary `codecs` dependency. https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscri... sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = codecs.open(path, 'wb') On 2018/07/06 18:23:25, Sebastian Noack wrote: > codecs.open() without specifying an encoding is equivalent to just open(). Done.
LGTM
On 2018/07/09 13:39:40, Sebastian Noack wrote: > LGTM Sebastian, when should we push this change? I suppose it would be good to synchronize with corresponding release of ABP or should we just go ahead?
On 2018/07/09 14:25:57, Vasily Kuznetsov wrote: > On 2018/07/09 13:39:40, Sebastian Noack wrote: > > LGTM > > Sebastian, when should we push this change? I suppose it would be good to > synchronize with corresponding release of ABP or should we just go ahead? Yes, as discussed on the issue: Before this change lands, filters from the Malware Domains list targeting internationalized domains won't have any effect in Adblock Plus 3.2 (to be released on July 17th; and above). After this change lands, the respective filters will be ignored in older versions of Adblock Plus. However, given that the Malware Domains list is opt-in, and that only few filters in there actually target internationalized domains, I think the impact is small enough, that it doesn't matter to much when exactly this lands.
On 2018/07/09 14:39:59, Sebastian Noack wrote: > On 2018/07/09 14:25:57, Vasily Kuznetsov wrote: > > On 2018/07/09 13:39:40, Sebastian Noack wrote: > > > LGTM > > > > Sebastian, when should we push this change? I suppose it would be good to > > synchronize with corresponding release of ABP or should we just go ahead? > > Yes, as discussed on the issue: Before this change lands, filters from the > Malware Domains list targeting internationalized domains won't have any effect > in Adblock Plus 3.2 (to be released on July 17th; and above). After this change > lands, the respective filters will be ignored in older versions of Adblock Plus. > However, given that the Malware Domains list is opt-in, and that only few > filters in there actually target internationalized domains, I think the impact > is small enough, that it doesn't matter to much when exactly this lands. Thank you. Since there's not much activity currently in sitescripts, I think it should be no problem to just push this on July 17th. |