Rietveld Code Review Tool

Side by Side Diff: sitescripts/crawler/bin/extract_crawler_sites.py

Issue 8327353: Crawler backend (Closed)
Patch Set: Created Sept. 27, 2012, 6:22 a.m.
# coding: utf-8

# This Source Code is subject to the terms of the Mozilla Public License
# version 2.0 (the "License"). You can obtain a copy of the License at
# http://mozilla.org/MPL/2.0/.

import MySQLdb, os, re, subprocess
from sitescripts.utils import get_config

def hg(args):
  return subprocess.Popen(["hg"] + args, stdout = subprocess.PIPE)

def extract_urls(filter_list_dir):
  os.chdir(filter_list_dir)
  process = hg(["log", "--template", "{desc}\n"])
  urls = set([])

  for line in process.stdout:
    matches = re.match(r".*\b(https?://\S*)", line)
Wladimir Palant 2012/09/27 07:34:17 Please use re.search() rather than re.match() here
Felix Dahlke 2012/09/27 09:26:24 Done. You're right, I'm used to calling it "matche
    if not matches:
      continue

    url = matches.group(1).strip()
    urls.add(url)

  return urls

def print_statements(urls):
  for url in urls:
    escaped_url = MySQLdb.escape_string(url)
    print "INSERT INTO crawler_sites (url) VALUES ('" + escaped_url + "');"

if __name__ == "__main__":
  filter_list_dir = get_config().get("crawler", "filter_list_repository")
  urls = extract_urls(filter_list_dir)
  print_statements(urls)
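The review exchange above is about re.match() versus re.search(): re.match() anchors at the start of the string, so the patch's leading ".*" in the pattern is what makes it scan the whole line. A minimal sketch of the difference (Python 3 here, the patch itself is Python 2; the sample line is made up for illustration):

```python
import re

line = "Fixes issue seen on http://example.com/page (reported upstream)\n"

# re.match() anchors at position 0, so without a leading ".*" it fails
# when the URL is not at the start of the line.
assert re.match(r"\b(https?://\S*)", line) is None

# The patch compensates with a leading ".*", which forces a scan:
m = re.match(r".*\b(https?://\S*)", line)
print(m.group(1))  # http://example.com/page

# re.search() scans the string by itself; no ".*" prefix needed:
s = re.search(r"\b(https?://\S*)", line)
print(s.group(1))  # http://example.com/page
```

Both forms extract the same URL here; re.search() simply states the intent directly, which is what the reviewer asked for.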
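print_statements() emits one INSERT per URL, relying on MySQLdb.escape_string() to quote special characters before string concatenation. A rough sketch of the same statement generation (Python 3, with a hand-rolled escape() as a partial stand-in for MySQLdb.escape_string(), since MySQLdb is a Python 2 dependency; the crawler_sites table and url column come from the patch):

```python
def escape(value):
    # Minimal stand-in for MySQLdb.escape_string(): backslash-escape
    # backslashes and quote characters inside a quoted SQL literal.
    # (The real function also handles NUL bytes, newlines, etc.)
    for ch in ("\\", "'", '"'):
        value = value.replace(ch, "\\" + ch)
    return value

def print_statements(urls):
    # sorted() just makes the output deterministic for this sketch;
    # the patch iterates the set in arbitrary order.
    for url in sorted(urls):
        escaped_url = escape(url)
        print("INSERT INTO crawler_sites (url) VALUES ('" + escaped_url + "');")

print_statements({"http://example.com/a?q='x'"})
# INSERT INTO crawler_sites (url) VALUES ('http://example.com/a?q=\'x\'');
```

Escaping before concatenation is what keeps a quote inside a URL from terminating the SQL string literal early.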
