Issue 29821558: Issue #6707 - Make the generated malware domain filter list encode domains as Punycode

Issue 29821558: Issue #6707 - Make the generated malware domain filter list encode domains as Punycode (Closed)

Created:
July 3, 2018, 3:50 p.m. by Tudor Avram

Modified:
July 17, 2018, 9:14 a.m.

Reviewers:
Sebastian Noack, Vasily Kuznetsov

Visibility:
Public.

Description

Issue #6707 - Make the generated malware domain filter list encode domains as Punycode

Patch Set 3 : Updated encoding format #

Total comments: 3

Patch Set 4 : Removed uncessary codecs call from updateMalwareDomains.py #

Created: July 9, 2018, 1:08 p.m.

(Patch set is too large to download)

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+13 lines, -7 lines)			Patch
M	.gitignore	View	1 2 3	1 chunk	+2 lines, -0 lines	0 comments	Download
M	sitescripts/subscriptions/bin/updateMalwareDomainsList.py	View	1 2 3	3 chunks	+5 lines, -4 lines	0 comments	Download
M	sitescripts/subscriptions/test/test_updateMalwareDomainsList.py	View	1 2 3	2 chunks	+5 lines, -2 lines	0 comments	Download
M	tox.ini	View	1 2 3	1 chunk	+1 line, -1 line	0 comments	Download

Messages

Total messages: 16

Expand All Messages | Collapse All Messages

Vasily Kuznetsov

Hi Tudor, This looks pretty good. Just one comment about .gitignore and one about the ...

July 3, 2018, 4:01 p.m. (2018-07-03 16:01:52 UTC) #2

Vasily Kuznetsov

LGTM Sebastian, could you please check that the output looks right (as in: it's what ...

July 3, 2018, 4:42 p.m. (2018-07-03 16:42:22 UTC) #3

Vasily Kuznetsov

For reference, here's the whole patch combined (for some reason Rietveld doesn't let us download ...

July 5, 2018, 1:45 p.m. (2018-07-05 13:45:32 UTC) #4

For reference, here's the whole patch combined (for some reason Rietveld doesn't
let us download it). Take the part between the lines of "=".
=========================================
Index: .gitignore
diff --git a/.gitignore b/.gitignore
index
00be79377c4e81d7ff3916b25835e8b22ab7fae5..37031079392f2a15409dfc9b1935e678f96f9af5
100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,3 +5,5 @@
 /.sitescripts.test
 /.coverage
 htmlcov/
+.idea/
+.pytest_cache/
Index: sitescripts/subscriptions/bin/updateMalwareDomainsList.py
diff --git a/sitescripts/subscriptions/bin/updateMalwareDomainsList.py
b/sitescripts/subscriptions/bin/updateMalwareDomainsList.py
index
a2cfb5ff8d46bfe4534dc79f0f03f47e31257519..c2f208228717f4e3122a4bc95cbb39f6f66c8a36
100644
--- a/sitescripts/subscriptions/bin/updateMalwareDomainsList.py
+++ b/sitescripts/subscriptions/bin/updateMalwareDomainsList.py
@@ -60,7 +60,7 @@ def main():
         subprocess.check_call(['hg', '-q', 'up', '-R', tempdir, '-r',
'default'])
 
         path = os.path.join(tempdir, 'malwaredomains_full.txt')
-        file = codecs.open(path, 'wb', encoding='utf-8')
+        file = codecs.open(path, 'wb')
 
         print >>file, FILTERLIST_HEADER
 
@@ -80,7 +80,7 @@ def main():
             if not domain:
                 continue
 
-            print >>file, '||%s^' % domain.decode('idna')
+            print >>file, '||%s^' % domain.decode('idna').encode('punycode')
         file.close()
 
         if subprocess.check_output(['hg', 'stat', '-R', tempdir]) != '':
Index: sitescripts/subscriptions/test/test_updateMalwareDomainsList.py
diff --git a/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py
b/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py
index
bd1dd2521aba2c12f550d3dc965c67a78b150386..e0939649ad08aa3e35da86710ba2faa77442c788
100644
--- a/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py
+++ b/sitescripts/subscriptions/test/test_updateMalwareDomainsList.py
@@ -58,7 +58,8 @@ def urlopen(mocker):
         if url.startswith('good'):
             zf_data = io.BytesIO()
             with zipfile.ZipFile(zf_data, 'w') as zf:
-                zf.writestr('justdomains', 'success\n')
+                idna_urls = ['xn--xhq02ykwbp4a.cn', 'xn--fiq80yua78t.tw']
+                zf.writestr('justdomains', '\n'.join(idna_urls) + '\n')
             return io.BytesIO(zf_data.getvalue())
         if url.startswith('bad'):
             raise urllib2.HTTPError('Bad', '42', 'No good', [], None)
@@ -73,7 +74,9 @@ def test_good(md_repo):
     main()
     subprocess.check_call(['hg', 'up'], cwd=md_repo.strpath)
     result = md_repo.join('malwaredomains_full.txt').read()
-    assert 'success' in result
+    puny_urls = ['.cn-w48dx81cufdl6b', '.tw-u68dw51cgbz81a']
+    for url in puny_urls:
+        assert url in result
 
 
 def test_bad(md_repo, config):
=========================================

Vasily Kuznetsov

Argh. No, the previous comment doesn't work, after line wrapping by Rietveld. Sebastian, if you ...

July 5, 2018, 1:48 p.m. (2018-07-05 13:48:07 UTC) #5

Vasily Kuznetsov

Argh. No, the previous comment doesn't work, after line wrapping by Rietveld. Sebastian, if you ...

July 5, 2018, 1:48 p.m. (2018-07-05 13:48:07 UTC) #6

Sebastian Noack

https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscriptions/bin/updateMalwareDomainsList.py File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscriptions/bin/updateMalwareDomainsList.py#newcode83 sitescripts/subscriptions/bin/updateMalwareDomainsList.py:83: print >>file, '||%s^' % domain.decode('idna').encode('punycode') This seems incorrect. Let's ...

July 5, 2018, 2:55 p.m. (2018-07-05 14:55:28 UTC) #7

Tudor Avram

Amended the code. https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore File .gitignore (right): https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore#newcode9 .gitignore:9: env/ On 2018/07/03 16:01:52, Vasily Kuznetsov ...

July 6, 2018, 11:53 a.m. (2018-07-06 11:53:17 UTC) #8

Amended the code.

https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore
File .gitignore (right):

https://codereview.adblockplus.org/29821558/diff/29821559/.gitignore#newcode9
.gitignore:9: env/
On 2018/07/03 16:01:52, Vasily Kuznetsov wrote:
> This virtualenv doesn't need to be there -- you can have it outside and then
it
> doesn't need to be ignored.

Done.

https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri...
File sitescripts/subscriptions/test/test_updateMalwareDomainsList.py (right):

https://codereview.adblockplus.org/29821558/diff/29821559/sitescripts/subscri...
sitescripts/subscriptions/test/test_updateMalwareDomainsList.py:61: idna_url =
'xn--xhq02ykwbp4a.cn\n'
On 2018/07/03 16:01:52, Vasily Kuznetsov wrote:
> I think it would make sense to have 2 domains in the list just to make sure
that
> each domain is treated separately by the recoding process. The reason for this
> is that if we take something like 'xn--domain1\nxn--domain2\n' and then recode
> it without breaking into separate lines, the result is different from breaking
> it into separate lines and then processing each line separately (this is what
we
> want).
> 
> The checking in line 77 will need to be a bit more sophisticated to check that
> the result is really what we want.

Done.

https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri...
File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right):

https://codereview.adblockplus.org/29821558/diff/29821560/sitescripts/subscri...
sitescripts/subscriptions/bin/updateMalwareDomainsList.py:83: print >>file,
'||%s^' % domain.decode('idna').encode('punycode')
On 2018/07/05 14:55:28, Sebastian Noack wrote:
> This seems incorrect. Let's take "xn--fuson-1sa.org" (actual domain in the
> Malware Domains List):
> 
>   >>> 'xn--fuson-1sa.org'.decode('idna').encode('punycode')
>   b'fuson.org-i5a'
> 
> This is incorrect, we need 'xn--fuson-1sa.org' here (with the "xn--"
annotation,
> like it was in the input).
> 
> As far as I understand, this line should be just changed like this:
> 
>   print >>file, '||%s^' % domain

Done.

Sebastian Noack

https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py#newcode63 sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = codecs.open(path, 'wb') codecs.open() without specifying an encoding ...

July 6, 2018, 6:23 p.m. (2018-07-06 18:23:25 UTC) #9

Tudor Avram

On 2018/07/06 18:23:25, Sebastian Noack wrote: > https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py > File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): > > https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py#newcode63 ...

July 9, 2018, 9:15 a.m. (2018-07-09 09:15:40 UTC) #10

Sebastian Noack

July 9, 2018, 11:59 a.m. (2018-07-09 11:59:29 UTC) #11

Tudor Avram

removed unnecessary `codecs` dependency. https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py File sitescripts/subscriptions/bin/updateMalwareDomainsList.py (right): https://codereview.adblockplus.org/29821558/diff/29824562/sitescripts/subscriptions/bin/updateMalwareDomainsList.py#newcode63 sitescripts/subscriptions/bin/updateMalwareDomainsList.py:63: file = codecs.open(path, 'wb') On ...

July 9, 2018, 1:35 p.m. (2018-07-09 13:35:43 UTC) #12

Vasily Kuznetsov

On 2018/07/09 13:39:40, Sebastian Noack wrote: > LGTM Sebastian, when should we push this change? ...

July 9, 2018, 2:25 p.m. (2018-07-09 14:25:57 UTC) #14

Sebastian Noack

On 2018/07/09 14:25:57, Vasily Kuznetsov wrote: > On 2018/07/09 13:39:40, Sebastian Noack wrote: > > ...

July 9, 2018, 2:39 p.m. (2018-07-09 14:39:59 UTC) #15

Vasily Kuznetsov

July 9, 2018, 4:29 p.m. (2018-07-09 16:29:16 UTC) #16

On 2018/07/09 14:39:59, Sebastian Noack wrote:
> On 2018/07/09 14:25:57, Vasily Kuznetsov wrote:
> > On 2018/07/09 13:39:40, Sebastian Noack wrote:
> > > LGTM
> > 
> > Sebastian, when should we push this change? I suppose it would be good to
> > synchronize with corresponding release of ABP or should we just go ahead?
> 
> Yes, as discussed on the issue: Before this change lands, filters from the
> Malware Domains list targeting internationalized domains won't have any effect
> in Adblock Plus 3.2 (to be released on July 17th; and above). After this
change
> lands, the respective filters will be ignored in older versions of Adblock
Plus.
> However, given that the Malware Domains list is opt-in, and that only few
> filters in there actually target internationalized domains, I think the impact
> is small enough, that it doesn't matter to much when exactly this lands.

Thank you. Since there's not much activity currently in sitescripts, I think it
should be no problem to just push this on July 17th.

Expand All Messages | Collapse All Messages