cms/converters.py - Issue 29516687: Issue 4488 - Add support for JSON page front matter

Side by Side Diff: cms/converters.py

Issue 29516687: Issue 4488 - Add support for JSON page front matter (Closed) Base URL: https://hg.adblockplus.org/cms

Patch Set: Created Aug. 16, 2017, 12:43 a.m.


OLD	NEW
1 # This file is part of the Adblock Plus web scripts,	1 # This file is part of the Adblock Plus web scripts,

2 # Copyright (C) 2006-2017 eyeo GmbH	2 # Copyright (C) 2006-2017 eyeo GmbH

3 #	3 #

4 # Adblock Plus is free software: you can redistribute it and/or modify	4 # Adblock Plus is free software: you can redistribute it and/or modify

5 # it under the terms of the GNU General Public License version 3 as	5 # it under the terms of the GNU General Public License version 3 as

6 # published by the Free Software Foundation.	6 # published by the Free Software Foundation.

7 #	7 #

8 # Adblock Plus is distributed in the hope that it will be useful,	8 # Adblock Plus is distributed in the hope that it will be useful,

9 # but WITHOUT ANY WARRANTY; without even the implied warranty of	9 # but WITHOUT ANY WARRANTY; without even the implied warranty of

10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the	10 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

11 # GNU General Public License for more details.	11 # GNU General Public License for more details.

12 #	12 #

13 # You should have received a copy of the GNU General Public License	13 # You should have received a copy of the GNU General Public License

14 # along with Adblock Plus. If not, see <http://www.gnu.org/licenses/>.	14 # along with Adblock Plus. If not, see <http://www.gnu.org/licenses/>.

15	15

16 from __future__ import unicode_literals	16 from __future__ import unicode_literals

17	17

18 import os	18 import os

19 import HTMLParser	19 import HTMLParser

20 import re	20 import re

21 import urlparse	21 import urlparse

	22 import json

	23 import collections

22	24

23 import jinja2	25 import jinja2

24 import markdown	26 import markdown

25	27

26	28

27 # Monkey-patch Markdown's isBlockLevel function to ensure that no paragraphs	29 # Monkey-patch Markdown's isBlockLevel function to ensure that no paragraphs

28 # are inserted into the <head> tag	30 # are inserted into the <head> tag

29 orig_isBlockLevel = markdown.util.isBlockLevel	31 orig_isBlockLevel = markdown.util.isBlockLevel

30	32

31	33

(...skipping 78 matching lines...)
110 # the document.	112 # the document.

111 self._append_text(data)	113 self._append_text(data)

112	114

113 def handle_entityref(self, name):	115 def handle_entityref(self, name):

114 self._append_text(self.unescape('&{};'.format(name)))	116 self._append_text(self.unescape('&{};'.format(name)))

115	117

116 def handle_charref(self, name):	118 def handle_charref(self, name):

117 self._append_text(self.unescape('&#{};'.format(name)))	119 self._append_text(self.unescape('&#{};'.format(name)))

118	120

119	121

	122 def parse_json(json_data, parent_key='', sep='_'):
	Vasily Kuznetsov 2017/08/16 18:34:56
	This was not very clear from the ticket, but this postprocessing was not actually required by it. I've checked with Julian and updated the ticket to make it clearer.

	rosie 2017/08/19 01:56:34
	Done.
	123 result = []

	124 for key, value in json_data.items():

	125 new_key = parent_key + sep + key if parent_key else key

	126 if isinstance(value, collections.MutableMapping):

	127 result.extend(parse_json(value, new_key, sep=sep).items())

	128 else:

	129 result.append((new_key, value))

	130 return dict(result)
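
For readers unsure what "this postprocessing" refers to: `parse_json` flattens nested JSON objects into a single level, joining nested keys with `sep`. The sketch below is a standalone illustration of that behaviour, not the code from the patch. The name `flatten_metadata` and the sample data are hypothetical, and it uses `collections.abc.MutableMapping` because the `collections.MutableMapping` alias used in the patch was removed in Python 3.10.

```python
import collections.abc


def flatten_metadata(json_data, parent_key='', sep='_'):
    """Flatten nested mappings, joining nested keys with `sep`.

    Hypothetical stand-in for the patch's `parse_json`.
    """
    result = {}
    for key, value in json_data.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, collections.abc.MutableMapping):
            # Recurse into nested objects, prefixing their keys.
            result.update(flatten_metadata(value, new_key, sep=sep))
        else:
            result[new_key] = value
    return result


# Hypothetical front matter with one nested object.
nested = {'title': 'Home', 'og': {'image': 'logo.png', 'type': 'website'}}
flat = flatten_metadata(nested)
```

With flattening, the nested `og` object becomes the top-level keys `og_image` and `og_type`; without it, templates would receive `og` as a dict.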

	131

	132

120 def parse_page_content(page, data):	133 def parse_page_content(page, data):

121 """Separate page content into metadata (dict) and body text (str)"""	134 """Separate page content into metadata (dict) and body text (str)"""

122 page_data = {'page': page}	135 page_data = {'page': page}

123 lines = data.splitlines(True)	136 try:

124 for i, line in enumerate(lines):	137 data = data.replace('<!--', '')
	Vasily Kuznetsov 2017/08/16 18:34:56
	We need to be more careful here: this line (together with the following one) will strip all comment tags in the whole document. That is probably not what we want.

	A safer approach would be to check whether the first non-space characters of data are '<!--'. If so, we take what's between the comment tags and try to parse it as JSON. If this succeeds, we insert the remaining part of the first comment back between the comment tags and return the parsed metadata and updated page content. If parsing fails, we try to parse the content of the first comment as old-style metadata, remove the parsed lines, insert the rest between the comment tags and return.

	If the first non-space characters are not '<!--', just try to parse everything as JSON, remove what's parsed and return the rest. If JSON parsing fails, we fall back to old-style parsing.

	I would probably create another function (let's imagine it's called `parse_metadata`) that does the metadata parsing (first as JSON, then old-style) and returns the metadata as a dict plus the remaining string. Then in `parse_page_content` I would check whether the document starts with an HTML comment and call `parse_metadata` on the content of the first comment, otherwise on the whole page.

	Does this make sense?

	rosie 2017/08/19 01:56:34
	Done.
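
The approach suggested in the comment above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code that was eventually committed: `parse_metadata` is the name proposed by the reviewer, the constants `COMMENT_OPEN` and `COMMENT_CLOSE` are hypothetical, and the sketch trims leading whitespace before parsing, which is a simplification.

```python
import json
import re

# Hypothetical constants, introduced only for this sketch.
COMMENT_OPEN = '<!--'
COMMENT_CLOSE = '-->'


def parse_metadata(page, data):
    """Parse metadata as JSON first, then old-style `name = value` lines.

    Returns a (metadata dict, remaining text) pair.
    """
    # raw_decode does not skip leading whitespace, so trim it first.
    text = data.lstrip()
    try:
        json_data, index = json.JSONDecoder().raw_decode(text)
        if not isinstance(json_data, dict):
            raise ValueError('front matter must be a JSON object')
    except ValueError:
        json_data = None

    if json_data is not None:
        metadata = dict(json_data)
        metadata['page'] = page
        return metadata, text[index:]

    # Fallback: old-style metadata, one `name = value` pair per line.
    metadata = {'page': page}
    lines = text.splitlines(True)
    for i, line in enumerate(lines):
        if not re.search(r'^\s*[\w\-]+\s*=', line):
            break
        name, value = line.split('=', 1)
        value = value.strip()
        if value.startswith('[') and value.endswith(']'):
            value = [element.strip() for element in value[1:-1].split(',')]
        lines[i] = '\n'
        metadata[name.strip()] = value
    return metadata, ''.join(lines)


def parse_page_content(page, data):
    """Separate page content into metadata (dict) and body text (str)."""
    stripped = data.lstrip()
    if stripped.startswith(COMMENT_OPEN):
        end = stripped.find(COMMENT_CLOSE)
        if end != -1:
            # Only the first comment is inspected; later comments in the
            # body are left untouched.
            inner = stripped[len(COMMENT_OPEN):end]
            metadata, rest = parse_metadata(page, inner)
            return metadata, COMMENT_OPEN + rest + stripped[end:]
    return parse_metadata(page, data)


# Hypothetical sample inputs.
json_page = ('<!--\n{"title": "Home", "tags": ["a", "b"]}\nleftover\n-->\n'
             '<p>Body <!-- keep --></p>\n')
json_meta, json_body = parse_page_content('index', json_page)

old_page = 'title = Home\nlist = [a, b]\n\n<p>Body</p>\n'
old_meta, old_body = parse_page_content('index', old_page)
```

Because only the first comment is unwrapped, a comment such as `<!-- keep -->` later in the body survives, which is the problem the two `data.replace(...)` calls in this patch set have.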
125 if line.strip() in {'<!--', '-->'}:	138 data = data.replace('-->', '')

126 lines[i] = ''	139 decoder = json.JSONDecoder()

127 continue	140 json_data, index = decoder.raw_decode(data)

128 if not re.search(r'^\s*[\w\-]+\s*=', line):	141 json_data['page'] = page

129 break	142 return parse_json(json_data), data[index:]

130 name, value = line.split('=', 1)	143 except ValueError:

131 value = value.strip()	144 lines = data.splitlines(True)

132 if value.startswith('[') and value.endswith(']'):	145 for i, line in enumerate(lines):

133 value = [element.strip() for element in value[1:-1].split(',')]	146 if not re.search(r'^\s*[\w\-]+\s*=', line):

134 lines[i] = '\n'	147 break

135 page_data[name.strip()] = value	148 name, value = line.split('=', 1)

136 return page_data, ''.join(lines)	149 value = value.strip()

	150 if value.startswith('[') and value.endswith(']'):

	151 value = [element.strip() for element in value[1:-1].split(',')]

	152 lines[i] = '\n'

	153 page_data[name.strip()] = value

	154 return page_data, ''.join(lines)

137	155

138	156

139 class Converter:	157 class Converter:

140 whitelist = {'a', 'em', 'sup', 'strong', 'code', 'span'}	158 whitelist = {'a', 'em', 'sup', 'strong', 'code', 'span'}

141 missing_translations = 0	159 missing_translations = 0

142 total_translations = 0	160 total_translations = 0

143	161

144 def __init__(self, params, key='pagedata'):	162 def __init__(self, params, key='pagedata'):

145 self._params = params	163 self._params = params

146 self._key = key	164 self._key = key

(...skipping 409 matching lines...)
556 stack.pop()	574 stack.pop()

557 stack[-1]['subitems'].append(item)	575 stack[-1]['subitems'].append(item)

558 stack.append(item)	576 stack.append(item)

559 return structured	577 return structured

560	578

561 converters = {	579 converters = {

562 'html': RawConverter,	580 'html': RawConverter,

563 'md': MarkdownConverter,	581 'md': MarkdownConverter,

564 'tmpl': TemplateConverter,	582 'tmpl': TemplateConverter,

565 }	583 }
