Python Markdown: ideas, bugs; extensions percent_comments and linebreak_plus
Contents
- INFO level logging showing Markdown progress (abandoned)
- Issue in spantable.py when working with Python Markdown 2.6.11
- Progress logging using sys.stderr.write
- Idea for specifying extensions more concisely
- Markdown extension percent_comments
- Markdown extension linebreak_plus
- Running my Markdown extensions in Python 2
INFO level logging showing Markdown progress (abandoned)
The challenge: create a way to log information at INFO level about Markdown’s progress, with newlines where I want them instead of being applied to every log line.
Logging has formatters, handlers, and filters:
- Formatters determine how the message is set up for display
- Handlers determine where log messages are written (console, file, stream …)
- Filters provide finer grained determination for what messages to display
Ergo, I need a special formatter. Next, how do I set up an INFO level logger that uses my formatter when I want it to, but uses the “standard” formatter for everything else?
First, I can also set up a custom logging level. The default levels are:
Level | Numeric value |
---|---|
CRITICAL | 50 |
ERROR | 40 |
WARNING | 30 |
INFO | 20 |
DEBUG | 10 |
NOTSET | 0 |
From the python-docs/howto/logging file:
Levels can also be associated with loggers, being set either by the developer or through loading a saved logging configuration. When a logging method is called on a logger, the logger compares its own level with the level associated with the method call. If the logger’s level is higher than the method call’s, no logging message is actually generated. This is the basic mechanism controlling the verbosity of logging output.
(Put another way, if the method call’s level is >= the logger’s level, then a logging message is generated and sent to the logging system. For example, a log level of 30 will display messages generated by logging.warning, logging.error, and logging.critical, but not those generated by logging.info or logging.debug.)
(“Method call” refers to debug(), info(), warning(), etc.)
If the overall logging level is 25, then logger.warning will generate a log message (because 25 < WARNING [30]) but logger.info will not (because 25 > INFO [20]).
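The threshold rule can be checked with a short sketch. It assumes Python 3; the logger name threshold_demo is made up for the demonstration, and the output goes to an in-memory stream so the result is easy to inspect.

```python
import io
import logging

# Hypothetical logger name; any name works.
logger = logging.getLogger("threshold_demo")
logger.setLevel(25)        # between INFO (20) and WARNING (30)
logger.propagate = False   # keep the root logger out of the demo

stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

logger.info("dropped: 20 < 25")     # below the threshold, no record is created
logger.warning("kept: 30 >= 25")    # at or above the threshold, record is emitted

output = stream.getvalue()
print(output, end="")
```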
Next, can I set a custom formatter for level 21? Maybe. It’s worth noting that setFormatter is a method of the handler. Does this mean I can set up a custom handler for level 21? I think so. I set up a handler that outputs to the console, add a custom formatter to it, then tell that handler to handle messages at level 21, not 20. (In practice this didn’t work: I got duplicate INFO messages because the two handlers each wrote a message to the console.)
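One way to avoid the duplicates is to give each handler a filter, so that a record is written by exactly one of them. This is only a sketch of that idea, not what I ended up using; it assumes Python 3.2 or later (where a plain callable can serve as a filter), and the level number 21, the logger name, and the format strings are all illustrative.

```python
import io
import logging

PROGRESS = 21  # hypothetical custom level, just above INFO

logger = logging.getLogger("handler_demo")
logger.setLevel(logging.INFO)
logger.propagate = False

stream = io.StringIO()

# Handler for ordinary records: its filter rejects PROGRESS records.
normal = logging.StreamHandler(stream)
normal.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
normal.addFilter(lambda record: record.levelno != PROGRESS)

# Handler for PROGRESS records only, with its own formatter.
progress = logging.StreamHandler(stream)
progress.setFormatter(logging.Formatter("[progress] %(message)s"))
progress.addFilter(lambda record: record.levelno == PROGRESS)

logger.addHandler(normal)
logger.addHandler(progress)

logger.info("regular message")
logger.log(PROGRESS, "Preprocessors")

result = stream.getvalue()
```

Each record now appears once: the first through the normal handler, the second through the progress handler.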
Next, can I get the formatter to not output a newline? The logging-cookbook file shows how one can set up a custom class for a message with a __str__ method that is invoked when the logger calls str() on that object. I should be able to set up that class in the main file (markdown/__init__.py) and import it into the classes where I need to invoke it.
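A minimal sketch of the message-object pattern follows. The class name ProgressMessage is hypothetical. The newline is suppressed through the handler's terminator attribute, which StreamHandler has had since Python 3.2; that is a different mechanism from the formatter, and it is an assumption on my part that it fits here.

```python
import io
import logging


class ProgressMessage:
    """Hypothetical message object; formatting is deferred until str() is called."""

    def __init__(self, fmt, *args):
        self.fmt = fmt
        self.args = args

    def __str__(self):
        return self.fmt % self.args


logger = logging.getLogger("message_demo")
logger.setLevel(logging.INFO)
logger.propagate = False

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.terminator = ""   # Python 3.2+: write nothing after each record
logger.addHandler(handler)

logger.info(ProgressMessage("%s:", "Preprocessors"))
logger.info(ProgressMessage(" %s,", "NormalizeWhitespace"))

line = stream.getvalue()
```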
Finally, is there an equivalent to Perl’s $| variable to make stdout and stderr non-buffering? Well, I may not have to worry about this; a comment at StackOverflow indicates stdout in Python is always non-buffered.
After working at this for a while, I determined the following:
- At true INFO level (20), I don’t want any progress messages appearing, only regular INFO messages
- At level 25 I want progress messages and regular INFO messages appearing, but don’t want WARNING messages appearing.
- I need a special level called progress at level 25; then I can call logger.progress for these messages. (Strictly speaking, a logger whose threshold is INFO would still emit them, because 25 >= 20; hiding them at INFO would need a level below 20 or a filter on the handler.)
- However, the logging document says this:
… it is possibly a very bad idea to define custom levels if you are developing a library. That’s because if multiple library authors all define their own custom levels, there is a chance that the logging output from such multiple libraries used together will be difficult for the using developer to control and/or interpret, because a given numeric value might mean different things for different libraries.
- Given that, it’s rather strange that the logging system allows for “in between” levels such as 15 and 25, because practically they’re of no use. Possibly it’s a case of designing for the future, only to discover the future didn’t turn out as expected.
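A minimal sketch of a custom level, assuming the number 25 and the name PROGRESS from the notes above. logging.addLevelName and Logger.log are standard library calls; the emit helper and the logger names are made up for illustration. It records which level names get through at two different thresholds.

```python
import io
import logging

PROGRESS = 25  # the level number proposed in the notes
logging.addLevelName(PROGRESS, "PROGRESS")


def emit(threshold):
    """Return the level names a logger set to `threshold` lets through."""
    logger = logging.getLogger("progress_demo_%d" % threshold)
    logger.setLevel(threshold)
    logger.propagate = False
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    logger.info("info")
    logger.log(PROGRESS, "progress")   # no logger.progress() exists by default
    logger.warning("warning")
    return stream.getvalue().split()


at_info = emit(logging.INFO)
at_progress = emit(PROGRESS)
```

Because the test is record level >= logger level, a logger set to INFO passes the level-25 records too, and a logger set to 25 drops INFO but still passes WARNING.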
I could, of course, bypass the logging mechanism altogether and simply write what I need to stderr. But even that has issues:
- In Python 2, stderr is unbuffered
- Starting in Python 3.0, stderr is buffered, but that can be overridden by starting the interpreter with -u or using the PYTHONUNBUFFERED environment variable
- The help text for python2 indicates it honours PYTHONUNBUFFERED as well.
- Starting in Python 3.3, the print() function accepts a flush=True parameter
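Whatever the buffering mode, an explicit flush after each write sidesteps the question. A minimal sketch, assuming Python 3.3 or later for print's flush parameter; the progress helper is hypothetical, and an in-memory stream stands in for sys.stderr so the result can be inspected.

```python
import io
import sys


def progress(text, stream=None):
    """Write text without a trailing newline and flush immediately."""
    stream = sys.stderr if stream is None else stream
    stream.write(text)
    stream.flush()


buffer = io.StringIO()
progress("Preprocessors:", buffer)
progress(" NormalizeWhitespace,", buffer)
# print() can do the same in one call on Python 3.3+.
print(" HtmlBlock", file=buffer, flush=True)

shown = buffer.getvalue()
```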
This is actually a decent argument for programming in Perl. Perl 5 solved a lot of these issues twenty years ago: Perl 5.004 was released in May 1997. By contrast, there is still a lot of Python 2 code out there, and (for Markdown at least) the expectation is that libraries should support both Python 2 and Python 3.
Issue in spantable.py when working with Python Markdown 2.6.11
Curiously, markdown_py-3 is version 2.6.7 of Python Markdown, while the version installed with Python 2 is a more up-to-date 2.6.11. Between the two versions a few things were changed, which broke the spantable extension I require for better table formatting. The issue was that an older version of the XML Element module did not support a default keyword on an attribute get method:
    colspan = cell_obj.get('colspan', default='1')

I had to update a couple of instances of that to read:

    colspan = cell_obj.get('colspan')
    if not colspan:
        colspan = 1
Finally there was an issue where Python 2 and Python 3 disagreed about being able to convert something to an int, so I had to work around that:
    if text is None:
        if c is not None:
            colspan = c.get('colspan')
            if not colspan:
                colspan = 1
            try:                        # Added
                colspan = int(colspan)  # (Works in v2, not in v3)
            except TypeError:           # Added
                pass                    # Added
            c.set('colspan', str(colspan + 1))
        else:
            # if this is the first cell, then fall back to creating an empty cell
            text = ''
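The same logic can be folded into one helper that behaves identically on both Python versions. This is a sketch rather than the code in spantable.py: the function name bump_colspan is made up, it uses the standard library's xml.etree.ElementTree directly, and it passes the default to get positionally so it does not depend on the default keyword.

```python
import xml.etree.ElementTree as ET


def bump_colspan(cell):
    """Increase a cell's colspan by one, treating a missing or bad value as 1."""
    try:
        # Positional default: no reliance on the `default=` keyword.
        colspan = int(cell.get('colspan', '1'))
    except (TypeError, ValueError):
        colspan = 1
    cell.set('colspan', str(colspan + 1))
    return cell.get('colspan')


plain = ET.Element('td')
wide = ET.Element('td', {'colspan': '2'})
results = [bump_colspan(plain), bump_colspan(wide)]
```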
Progress logging using sys.stderr.write
I first had to determine how to set up a progress or show_progress parameter and get it propagated to the Markdown instance. Note that Markdown much prefers to use kwargs (keyword arguments).
When run as a command line program, the process flow is:
    __main__.py::run() → __init__.py::markdownFromFile() → convertFile() → convert()

When run as a module, the process flow is:

    __init__.py::markdown() → convert()
The process flow for convert is:

    Preprocessors → BlockProcessors → Treeprocessors → serialize → Postprocessors
The actual argument parsing is done in the __init__ function in the __init__.py file. Keyword parameters are stored as properties in the markdown object; ergo, self.show_progress will return True or False.
Now I need a place to stash the terminal width. One possibility is to put the progress logging into a class and store the width as a property of that class, determined in the class’s __init__ function. I put this class into its own module and stored it in markdown_extensions/show_progress.py:
    #!/usr/bin/python
    """
    Progress display for Python Markdown
    ====================================

    Note: This file is in the markdown_extensions directory because it seems to
    be a good home for it. However, this is not a regular extension and cannot
    be included simply by adding '-x markdown_extensions.show_progress' on the
    command line. This code needs additional support in __init__.py,
    __main__.py, and blockparser.py in order to work. See "Adding show_progress
    to Python Markdown" at the end of this file for details.

    When enabled, this code displays Markdown's progress on stderr as it goes
    through its various phases and processors:

        >>> import sys
        >>> sys.path.append('/home/brian/projects/python')
        >>> import markdown
        >>> markdown.markdown("This is a test", show_progress = True)
        Preprocessors: NormalizeWhitespace, HtmlBlock, Reference
        Block processors: Empty, ListIndent, Code, HashHeader, SetextHeader, HR,
         OList, UList, BlockQuote, Paragraph
         Processing 2 blocks
        Tree processors: Inline, Prettify
        Post-processors: RawHtml, AndSubstitute, Unescape
        '<p>This is a test</p>'

    Or from the command line:

        [user@host ~]$ export PYTHONPATH='/home/brian/projects/python'
        [user@host ~]$ echo "This is a test" | markdown_py-3 --progress
    """

    import sys
    import re


    class ShowProgress:
        """Display phases and processor names as they are run"""

        def __init__(self, show_progress):
            self.show_progress = show_progress
            if show_progress:
                self.width, height = self._getTerminalSize()
                self.RE_NORMALIZE = re.compile(r'(([Pp]re|[Bb]lock|[Tt]ree|[Pp]ost)?[Pp]rocessor)?\'>')
                self.comma = ''
                self.number_of_blocks = None

        def phase(self, phase_name):
            """ Display a phase: Pre, Block, Tree, or Post Processors """
            if not self.show_progress:
                return
            if phase_name:
                if phase_name != 'Preprocessors':
                    sys.stderr.write("\n")
                sys.stderr.write("%s:" % phase_name)
            else:
                sys.stderr.write("\n")
            sys.stderr.flush()
            self.pos = len(phase_name)
            self.comma = ''

        def processor(self, p_obj):
            """ Display the name of the passed processor object """
            if not self.show_progress:
                return
            x = self.RE_NORMALIZE.sub('', str(p_obj.__class__).split('.')[-1])
            if self.pos + len(x) + 2 > self.width:
                sys.stderr.write("%s\n " % self.comma)
                self.pos = 0
                self.comma = ''
            sys.stderr.write("%s %s" % (self.comma, x))
            sys.stderr.flush()
            self.comma = ','
            self.pos = self.pos + len(x) + 2

        def block_count(self, blocks):
            """ Display the number of blocks being processed by the block processors """
            if self.show_progress and not self.number_of_blocks:
                self.number_of_blocks = len(blocks)
                sys.stderr.write("\n Processing %i blocks" % self.number_of_blocks)
                sys.stderr.flush()

        # (81 lines pertaining to determining terminal size deleted)
As noted in the source code above, this isn’t a regular extension. Other Markdown modules have to be patched in order for it to work. There’s some work to accept a -p or --progress parameter in __main__.py, and to accept a show_progress parameter in __init__.py, and after that the critical stuff is (for example):
    import markdown_extensions.show_progress

    # (lines skipped)

    # (self.show_progress is the value of the show_progress parameter: True/False)
    progress = markdown_extensions.show_progress.ShowProgress(self.show_progress)

    # (lines skipped)

    progress.phase('Preprocessors')
    self.lines = source.split("\n")
    for prep in self.preprocessors.values():
        progress.processor(prep)
        self.lines = prep.run(self.lines)
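The terminal-size code was left out of the ShowProgress listing. On Python 3.3 and later, the standard library can stand in for it. The sketch below is an assumption about how that could look, not the deleted code; the function name terminal_width and the 80-column fallback are made up.

```python
import shutil


def terminal_width(default=80):
    """Return the terminal width in columns, or `default` if it is unknown."""
    # get_terminal_size consults the COLUMNS environment variable first, then
    # the terminal itself, and uses the fallback when neither is available
    # (for example when output is redirected to a file).
    columns = shutil.get_terminal_size(fallback=(default, 24)).columns
    return columns if columns > 0 else default


width = terminal_width()
```

Python 2 has no shutil.get_terminal_size, so code that must run on both versions still needs its own detection.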
Idea for specifying extensions more concisely
Currently, extensions must be specified by passing a complete path to the extension:
    >>> import sys
    >>> sys.path.append('/home/brian/projects/python')
    >>> import markdown
    >>> markdown.markdown('text', extensions = ['markdown.extensions.extra',
    ... 'markdown.extensions.toc', 'markdown_extensions.auc_headers'
    ...])

    export PYTHONPATH='/home/brian/projects/python'
    markdown_py-3 -x markdown.extensions.extra \
        -x markdown_extensions.gfm_tasklist -x markdown.extensions.meta \
        -x markdown.extensions.sane_lists -x markdown.extensions.smarty \
        -x markdown_extensions.spantable -x markdown.extensions.toc \
        -x markdown_extensions.urlize -x markdown_extensions.gentoc_remove \
        -x markdown_extensions.percent_comments -x markdown_extensions.auc_headers \
        -x markdown_extensions.linebreak_plus -x markdown_extensions.autoxref \
        -x markdown_extensions.toc_fixer filename.md >filename.html
I thought there might be a way to improve this.
- On the command line, provide a comma separated list of extension names
- When calling markdown as a Python module, pass a list of names
- For each name, prefix with markdown.extensions. and try to load it
- If that fails, loop through entries in extensions_mod_prefix; prefix the extension name with the entry and try to load it.
- If that fails, give up and raise an error
    >>> import sys
    >>> sys.path.append('/home/brian/projects/python')
    >>> import markdown
    >>> markdown.markdown('text', extensions_mod_prefix = ['markdown_extensions'],
    ... extensions = ['extra', 'toc', 'auc_headers'])

    export PYTHONPATH='/home/brian/projects/python'
    markdown_py-3 -m markdown_extensions -x extra,gfm_tasklist,meta,sane_lists \
        -x smarty,spantable,toc,urlize,gentoc_remove,percent_comments,auc_headers \
        -x linebreak_plus,autoxref,toc_fixer
Right now it’s just an idea. In practice, if I need to load a lot of extensions, I use a shell script on the command line, and when running as a Python module I’d probably write a wrapper.
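The lookup order described in the list above could be sketched like this. It is only an illustration of the idea, not code from Python Markdown: resolve_extension and its extra_prefixes argument are hypothetical names, and the demonstration uses the standard library's json package as a stand-in prefix so that it runs without any custom extensions installed.

```python
import importlib


def resolve_extension(name, extra_prefixes=()):
    """Return the first importable module for `name`, trying each prefix in order."""
    prefixes = ['markdown.extensions'] + list(extra_prefixes)
    for prefix in prefixes:
        try:
            return importlib.import_module('%s.%s' % (prefix, name))
        except ImportError:
            continue  # not under this prefix; try the next one
    raise ImportError('No extension named %r under prefixes %r' % (name, prefixes))


# Stand-in demonstration: 'decoder' is not under markdown.extensions, so the
# lookup falls through to the extra prefix 'json' and finds json.decoder.
module = resolve_extension('decoder', ['json'])
found_name = module.__name__

try:
    resolve_extension('no_such_extension_xyz', ['json'])
    missing_raises = False
except ImportError:
    missing_raises = True
```

One weakness of this approach is that catching ImportError also hides a genuine import failure inside an extension that does exist, so a real implementation would want to report that case separately.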
Here’s a prototype --help that I wrote for this. Most of the options are already in markdown_py; I added the second -x line and the -m line.
    Usage: markdown_py-3 [options] [INPUTFILE]
           (STDIN is assumed if no INPUTFILE is given)

    A Python implementation of John Gruber's Markdown.
    https://pythonhosted.org/Markdown/

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -f OUTPUT_FILE, --file=OUTPUT_FILE
                            Write output to OUTPUT_FILE. Defaults to STDOUT.
      -e ENCODING, --encoding=ENCODING
                            Encoding for input and output files.
      -s SAFE_MODE, --safe=SAFE_MODE
                            Deprecated! 'replace', 'remove' or 'escape' HTML
                            tags in input
      -o OUTPUT_FORMAT, --output_format=OUTPUT_FORMAT
                            'xhtml1' (default), 'html4' or 'html5'.
      -n, --no_lazy_ol      Observe number of first item of ordered lists.
      -x EXTENSION, --extension=EXTENSION
                            Load extension EXTENSION.
      -x NAME[,NAME...], --extensions NAME[,NAME...]
                            Load list of extension names separated by commas.
                            When resolving names, they are first prefixed with
                            'markdown.extensions', and if not found, prefixes
                            from -m/--extensions_mod_prefix are tried as well.
      -m PREFIX[:PREFIX ...], --extensions_mod_prefix PREFIX[:PREFIX ...]
                            One or more module prefixes to prepend to extension
                            names when searching for them, in addition to the
                            built-in name 'markdown.extensions'. For example,
                            if you have extensions in
                            '/usr/local/lib/md_py_extn', you can pass
                            '-m md_py_extn' (note that you also need to set
                            PYTHONPATH='/usr/local/lib')
      -c CONFIG_FILE, --extension_configs=CONFIG_FILE
                            Read extension configurations from CONFIG_FILE.
                            CONFIG_FILE must be of JSON or YAML format. YAML
                            format requires that a python YAML library be
                            installed. The parsed JSON or YAML must result in a
                            python dictionary which would be accepted by the
                            'extension_configs' keyword on the
                            markdown.Markdown class. The extensions must also
                            be loaded with the `--extension` option.
      -q, --quiet           Suppress all warnings.
      -v, --verbose         Print all warnings.
      -p, --progress        Show markdown progress.
      --noisy               Print debug messages.
I did make a change similar to this, but put it into the jmd wrapper script instead.
Markdown extension percent_comments
    #!/usr/bin/python
    """
    Percent Comments Extension for Python-Markdown
    ==============================================

    Treats lines beginning with '%' in columns 1, 2, or 3 in the document as
    internal comments and removes them from the document so they do not appear
    in the output. Handles such lines found within blockquotes and tables but
    ignores them within fenced code blocks.

    Such comments are useful if you maintain a document intended for public
    consumption but you want to include information that's useful only for you
    or other people who need to make changes to it. For example:

        % Information in the following table is gleaned from this file:
        % NODE"VMS12"::SYS$DISK2:[DOCUMENTS.INTERNAL]NETWORK.TXT
    """

    from __future__ import absolute_import
    from __future__ import unicode_literals
    from markdown import Extension
    from markdown.blockprocessors import BlockProcessor
    from markdown.postprocessors import Postprocessor
    from random import randint
    from hashlib import sha1
    import re


    class PercentCommentsExtension(Extension):
        """ PercentComments Extension. """

        def extendMarkdown(self, md, md_globals):
            """ Add Percent Comments Block processor to Markdown. """
            md.registerExtension(self)

            # Placeholder text to indicate an element that became empty because
            # it was a "%" comment. Append a random hex string to eliminate the
            # possibility of text being replaced by chance. Because this needs
            # to be accessible from both the BlockProcessor and the
            # Postprocessor, we make it a property of the extension itself.
            sha = sha1()
            sha.update(randint(0, 9999999).__str__().encode('UTF-8'))
            self.EMPTY_ELEMENT = "EMPTY_DUE_TO_PERCENT_COMMENT_%s" % sha.hexdigest().upper()[0:8]

            md.parser.blockprocessors.add('percent_comments',
                                          PercentCommentsBlockProcessor(md.parser, self),
                                          '<paragraph')
            # Insert the post-processor before inserting raw HTML
            md.postprocessors.add('percent_comments',
                                  PercentCommentsPostprocessor(self),
                                  '<raw_html')


    class PercentCommentsBlockProcessor(BlockProcessor):
        """ Remove text elements that have a % in column 1, 2, or 3 """

        RE_TEST = re.compile(r'^ ? ? ?\\?%', re.MULTILINE)
        RE_REMOVE = re.compile(r'^ ? ? ?%[^\n]+\n?', re.MULTILINE)
        RE_UNESCAPE = re.compile(r'\\%')
        sw = False  # True = we processed this block on the previous call

        def __init__(self, md_parser, extobj):
            super(PercentCommentsBlockProcessor, self).__init__(md_parser)
            self.EMPTY_ELEMENT = extobj.EMPTY_ELEMENT

        def test(self, parent, block):
            if self.sw:
                self.sw = False
                return False
            return bool(self.RE_TEST.search(block))

        def run(self, parent, blocks):
            raw_block = blocks.pop(0)
            raw_block = self.RE_REMOVE.sub('', raw_block)
            raw_block = self.RE_UNESCAPE.sub('%', raw_block)
            if len(raw_block):
                # Block contains additional lines. Add to master blocks for later.
                blocks.insert(0, raw_block)
                self.sw = True
            elif not parent.tag == 'div':
                blocks.insert(0, self.EMPTY_ELEMENT)


    class PercentCommentsPostprocessor(Postprocessor):
        """ Remove HTML elements made empty by PercentCommentsBlockProcessor """

        def __init__(self, extobj):
            super(PercentCommentsPostprocessor, self).__init__()
            self.EMPTY_ELEMENT = extobj.EMPTY_ELEMENT

        def run(self, text):
            """ Remove lines of the format \n<xx>EMPTY_DUE_TO_PERCENT_COMMENT_XXXXXXXX</xx> """
            return re.sub(r'\n<([a-z]+)>%s</\1>' % self.EMPTY_ELEMENT, '', text, flags=re.MULTILINE)


    def makeExtension(*args, **kwargs):
        return PercentCommentsExtension(*args, **kwargs)
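The two regular expressions that do the real work can be tried on their own, without Python Markdown installed. The sketch below copies RE_REMOVE and RE_UNESCAPE from the listing; the strip_percent_comments helper and the sample text are made up for the demonstration.

```python
import re

# Copied from PercentCommentsBlockProcessor.
RE_REMOVE = re.compile(r'^ ? ? ?%[^\n]+\n?', re.MULTILINE)
RE_UNESCAPE = re.compile(r'\\%')


def strip_percent_comments(block):
    """Drop '%' comment lines, then turn an escaped '\\%' into a literal '%'."""
    block = RE_REMOVE.sub('', block)
    return RE_UNESCAPE.sub('%', block)


sample = (
    "% internal note\n"
    "Visible text\n"
    "  % indented note\n"
    "\\% literal percent\n"
)
cleaned = strip_percent_comments(sample)
```

The comment lines in columns 1 and 3 disappear, while the escaped line survives with its backslash removed.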
Markdown extension linebreak_plus
    #!/usr/bin/python
    """
    Linebreak-Plus Extension for Python-Markdown
    ============================================

    In addition to having two spaces at the end of a line mark a line break,
    adds support for backslash (John MacFarlane's CommonMark) and
    "<space><underscore>" (my idea--I think it looks nicer) at the end of a
    line.
    """

    from __future__ import absolute_import
    from __future__ import unicode_literals
    from markdown import Extension
    from markdown.inlinepatterns import SubstituteTagPattern

    LINE_BREAK_PLUS_RE = r'(\\| _)\n'


    class LinebreakPlusExtension(Extension):

        def extendMarkdown(self, md, md_globals):
            linebreak_plus_tag = SubstituteTagPattern(LINE_BREAK_PLUS_RE, 'br')
            md.inlinePatterns.add('linebreak_plus', linebreak_plus_tag, '>linebreak')


    def makeExtension(*args, **kwargs):
        return LinebreakPlusExtension(*args, **kwargs)
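To see what LINE_BREAK_PLUS_RE matches, it can be applied with a plain re.sub. This only previews the pattern; it does not go through Markdown's inline pattern machinery, and the preview helper and the literal '<br />' replacement are assumptions for the demonstration.

```python
import re

# Copied from the extension: a backslash, or a space plus underscore,
# immediately before a newline.
LINE_BREAK_PLUS_RE = r'(\\| _)\n'


def preview(text):
    """Replace each matched line ending with a <br /> tag and a newline."""
    return re.sub(LINE_BREAK_PLUS_RE, '<br />\n', text)


sample = "first line\\\nsecond line _\nthird line\n"
rendered = preview(sample)
```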
Running my Markdown extensions in Python 2
First, my markdown_extensions directory needed an empty file in it named __init__.py before Python Markdown could import files from it.
Then my H1H2_Uplinks extension failed:
      File "/home/brian/markdown_extensions/h1h2_uplinks.py", line 63, in run
        self.h1h2_id[slugify(str(elem.text), '-')] = target
      File "/usr/lib/python2.7/site-packages/markdown/extensions/headerid.py", line 93, in slugify
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    TypeError: must be unicode, not str
According to Ned Batchelder’s Pragmatic Unicode pages:
Just as in Python 2, Python 3 has two string types, one for Unicode and one for bytes, but they are named differently.
Now the “str” type that you get from a plain string literal stores unicode, and the “bytes” types stores bytes. You can create a bytes literal with a b prefix.
So “str” in Python 2 is now called “bytes,” and “unicode” in Python 2 is now called “str”. This makes more sense than the Python 2 names, since Unicode is how you want all text stored, and byte strings are only for when you are dealing with bytes.
That makes the above error obvious: the function str(elem.text) in Python 3 creates and returns Unicode text. But in Python 2 it returns a simple byte string, which causes slugify to raise an error because it’s expecting Unicode text.
Now I have a problem. The source text I’m dealing with can be either straight ASCII or Unicode. It doesn’t matter for Python 3, because it implicitly treats the text as Unicode. Python 2 does not.
I changed the above problem code to read:
self.h1h2_id[slugify(elem.text, '-')] = target
The change is I removed the str() function that wrapped elem.text. I had it there originally to force it to Unicode, but further testing on both Python 2 and Python 3 shows it already is Unicode.
I may have to revisit this if I start getting errors (again) about non-Unicode
text being passed to slugify.
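If that does happen, one option is a small helper that decodes byte strings and leaves Unicode text alone. This is a sketch I have not needed yet: the name to_text is hypothetical, and UTF-8 is an assumption about the source encoding. In Python 2, bytes is an alias for str, so the same check covers both versions.

```python
def to_text(value, encoding='utf-8'):
    """Return `value` as Unicode text, decoding it first if it is a byte string."""
    if value is None:
        return ''           # elem.text is None for an empty element
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value            # already Unicode text


samples = [to_text(b'caf\xc3\xa9'), to_text('Heading'), to_text(None)]
```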