
Issue 8492019: sitescripts: Collect unmatched filters (Closed)

Created:
Oct. 2, 2012, 5:02 a.m. by Felix Dahlke
Modified:
Dec. 19, 2012, 4:18 p.m.
Reviewers:
Wladimir Palant
Visibility:
Public.

Description

In order to collect all unmatched domain-specific filters, I import all domain-specific filters from easylist and associate them with requests sent by the crawler. As discussed, we will probably import all filters, not just domain-specific ones, from compiled lists in the future. However, since we've decided to put the crawler on hold for now, I think we should leave that part as it is and take it from there in a few months. The part where I actually figure out which domain-specific filters are unused is still missing, but we should be able to extract that from the database, since all the data is there. I also used the opportunity to turn the extract_crawler_sites script into one that doesn't directly import the data into the database (import_sites) and made a few improvements.
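The missing step, finding filters that no request ever matched, could be sketched roughly as follows. This is only an illustration under stated assumptions: the table and column names (`filters`, `requests`, `request_filters`, `filter_text`) are hypothetical and not taken from schema.sql, the filter strings are made up, and an in-memory SQLite database stands in for whatever database the crawler actually uses.

```python
import sqlite3


def find_unmatched_filters(conn):
    """Return the text of every filter that no recorded request matched.

    Assumes a hypothetical schema in which request_filters links a
    request to each filter that matched it.
    """
    rows = conn.execute(
        """
        SELECT f.filter_text
        FROM filters AS f
        LEFT JOIN request_filters AS rf ON rf.filter_id = f.id
        WHERE rf.filter_id IS NULL
        ORDER BY f.filter_text
        """
    )
    return [row[0] for row in rows]


def build_demo_database():
    """Create an in-memory database with made-up filters and requests."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE filters (
            id INTEGER PRIMARY KEY,
            filter_text TEXT NOT NULL UNIQUE
        );
        CREATE TABLE requests (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL
        );
        CREATE TABLE request_filters (
            request_id INTEGER NOT NULL REFERENCES requests(id),
            filter_id INTEGER NOT NULL REFERENCES filters(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO filters (id, filter_text) VALUES (?, ?)",
        [
            (1, "/ads/*$domain=example.com"),
            (2, "/banner.gif$domain=example.org"),
            (3, "/track.js$domain=example.net"),
        ],
    )
    conn.executemany(
        "INSERT INTO requests (id, url) VALUES (?, ?)",
        [
            (1, "http://example.com/ads/top.png"),
            (2, "http://example.net/track.js"),
        ],
    )
    # Requests 1 and 2 matched filters 1 and 3; filter 2 was never matched.
    conn.executemany(
        "INSERT INTO request_filters (request_id, filter_id) VALUES (?, ?)",
        [(1, 1), (2, 3)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for filter_text in find_unmatched_filters(build_demo_database()):
        print(filter_text)
```

With the demo data, only the filter that has no row in `request_filters` is reported. A `LEFT JOIN` with an `IS NULL` check is one common way to express "rows with no match"; a `NOT EXISTS` subquery would work equally well.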

Patch Set 1 #

Stats: +222 lines, -58 lines

Status  File                                      Chunks  Added  Removed  Comments
M       sitescripts/crawler/README.md             3       +11    -9       0
R       sitescripts/crawler/bin/extract_sites.py  1       +0     -36      0
A       sitescripts/crawler/bin/import_filters.py 1       +95    -0       0
A       sitescripts/crawler/bin/import_sites.py   1       +52    -0       0
M       sitescripts/crawler/schema.sql            3       +32    -4       0
M       sitescripts/crawler/web/crawler.py        3       +32    -9       0

Messages

Total messages: 6
Wladimir Palant
How about we agree that we don't want this change at the moment (particularly the ...
Dec. 13, 2012, 4:16 p.m. (2012-12-13 16:16:23 UTC) #1
Felix Dahlke
On 2012/12/13 16:16:23, Wladimir Palant wrote: > How about we agree that we don't want ...
Dec. 14, 2012, 10:11 a.m. (2012-12-14 10:11:33 UTC) #2
Wladimir Palant
Frankly, I am unhappy with the general approach taken here - importing filters into the ...
Dec. 14, 2012, 4:39 p.m. (2012-12-14 16:39:58 UTC) #3
Felix Dahlke
On 2012/12/14 16:39:58, Wladimir Palant wrote: > Frankly, I am unhappy with the general approach ...
Dec. 17, 2012, 2:30 p.m. (2012-12-17 14:30:22 UTC) #4
Wladimir Palant
On 2012/12/17 14:30:22, Felix H. Dahlke wrote: > Wouldn't it still make sense to keep ...
Dec. 19, 2012, 4:12 p.m. (2012-12-19 16:12:10 UTC) #5
Felix Dahlke
Dec. 19, 2012, 4:18 p.m. (2012-12-19 16:18:23 UTC) #6
On 2012/12/19 16:12:10, Wladimir Palant wrote:
> On 2012/12/17 14:30:22, Felix H. Dahlke wrote:
> > Wouldn't it still make sense to keep this version around? How I see it, we will
> > need the code in import_filters.py. I believe it's easier to adapt this version
> > to call it directly rather than rewrite it.
>
> It will still be in this code review but I doubt that it makes sense to commit
> this code to the repository...
>
> > My main reason for putting this in two steps was to have a more flexible data
> > source, especially for testing.
>
> My original idea was having the list of URLs in the Mercurial repositories along
> with the lists themselves - which would allow filter list authors to add/remove
> URLs easily.

I see. Well, in that case we can just close this issue; it's unlikely we'll need
any of this code.
