sitescripts/crawler/bin/extract_crawler_sites.py - Issue 8327353: Crawler backend


Delta Between Two Patch Sets: sitescripts/crawler/bin/extract_crawler_sites.py

Issue 8327353: Crawler backend (Closed)

Left Patch Set: Created Sept. 27, 2012, 6:22 a.m.

Right Patch Set: Created Sept. 27, 2012, 2:15 p.m.


LEFT	RIGHT
1 # coding: utf-8	1 # coding: utf-8

2	2

3 # This Source Code is subject to the terms of the Mozilla Public License	3 # This Source Code is subject to the terms of the Mozilla Public License

4 # version 2.0 (the "License"). You can obtain a copy of the License at	4 # version 2.0 (the "License"). You can obtain a copy of the License at

5 # http://mozilla.org/MPL/2.0/.	5 # http://mozilla.org/MPL/2.0/.

6	6

7 import MySQLdb, os, re, subprocess	7 import MySQLdb, os, re, subprocess

8 from sitescripts.utils import get_config	8 from sitescripts.utils import get_config

9	9

10 def hg(args):	10 def hg(args):

11 return subprocess.Popen(["hg"] + args, stdout = subprocess.PIPE)	11 return subprocess.Popen(["hg"] + args, stdout = subprocess.PIPE)

12	12

13 def extract_urls(filter_list_dir):	13 def extract_urls(filter_list_dir):

14 os.chdir(filter_list_dir)	14 os.chdir(filter_list_dir)

15 process = hg(["log", "--template", "{desc}\n"])	15 process = hg(["log", "--template", "{desc}\n"])

16 urls = set([])	16 urls = set([])

17	17

18 for line in process.stdout:	18 for line in process.stdout:

19 matches = re.match(r".*\b(https?://\S*)", line)	19 match = re.search(r"\b(https?://\S*)", line)
Wladimir Palant 2012/09/27 07:34:17
Please use re.search() rather than re.match() here (and then without .* at the start). re.match() will always match the complete string - we are actually interested in a substring.

Generally, I consider re.match() in Python's standard library a bug - it's confusing and typically useless.

Side-note: I would probably name the result variable match and not matches - it is really only one match unlike when calling re.findall().

Felix Dahlke 2012/09/27 09:26:24
Done. You're right, I'm used to calling it "matches" because it's an array with the matched string and all groups in JS. It's really just one though.

20 if not matches:	20 if not match:

21 continue	21 continue

22	22

23 url = matches.group(1).strip()	23 url = match.group(1).strip()

24 urls.add(url)	24 urls.add(url)

25	25

26 return urls	26 return urls

27	27

28 def print_statements(urls):	28 def print_statements(urls):

29 for url in urls:	29 for url in urls:

30 escaped_url = MySQLdb.escape_string(url)	30 escaped_url = MySQLdb.escape_string(url)

31 print "INSERT INTO crawler_sites (url) VALUES ('" + escaped_url + "');"	31 print "INSERT INTO crawler_sites (url) VALUES ('" + escaped_url + "');"

32	32

33 if __name__ == "__main__":	33 if __name__ == "__main__":

34 filter_list_dir = get_config().get("crawler", "filter_list_repository")	34 filter_list_dir = get_config().get("crawler", "filter_list_repository")

35 urls = extract_urls(filter_list_dir)	35 urls = extract_urls(filter_list_dir)

36 print_statements(urls)	36 print_statements(urls)

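The review comment on line 19 turns on where each function looks for the pattern: re.match() only tries the pattern at the start of the string, while re.search() scans for the first occurrence anywhere in it. The following is a minimal sketch of that difference using the pattern from the right-hand patch set. It is written for Python 3 rather than the Python 2 of the patch, and the commit descriptions in sample_log are hypothetical stand-ins for the output of hg log. The extract_urls() here takes a list of lines instead of a repository directory, so it is an illustration of the loop, not the patch's function.

```python
import re

# Pattern from the right-hand patch set: a URL starting at a word boundary.
URL_PATTERN = re.compile(r"\b(https?://\S*)")

def extract_urls(lines):
    """Collect the first URL found on each line (illustrative variant)."""
    urls = set()
    for line in lines:
        # search() scans the whole line, so no leading ".*" is needed.
        match = URL_PATTERN.search(line)
        if not match:
            continue
        urls.add(match.group(1).strip())
    return urls

# Hypothetical commit descriptions, one per line as "hg log" would emit them.
sample_log = [
    "Fixed false positive on http://example.com/page\n",
    "https://example.org/ads blocked\n",
    "Cleanup, no URL here\n",
]

# match() only succeeds when the URL is at the very start of the line.
print(URL_PATTERN.match(sample_log[0]))           # None: URL is mid-line
print(URL_PATTERN.match(sample_log[1]).group(1))  # URL is at position 0
print(sorted(extract_urls(sample_log)))
```

With re.match(), the first sample line would be skipped unless the pattern were prefixed with ".*", which is the workaround the left-hand patch set used.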