The moving parts
================

html5lib consists of a number of components, which are responsible for
handling its features.

Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
Several tree representations are supported, as are translations to other formats via *tree adapters*.
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword argument:

.. code-block:: python

  import html5lib
  TreeBuilder = html5lib.getTreeBuilder("dom")
  parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.

html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of bytes.

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...     print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation.


Filters
~~~~~~~

html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the token stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
  ``<pre/>`` or ``<textarea/>`` tags.

To use a filter, simply wrap it around a token stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)


Tree adapters
-------------

Tree adapters can be used to translate between tree formats.
Two adapters are provided by html5lib:

* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.

Encoding discovery
------------------

Parsed trees are always Unicode. However, a large variety of input
encodings are supported. The encoding of the document is determined in
the following way:

* The encoding may be explicitly specified by passing the name of the
  encoding as the encoding parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  :class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  specification).

* If no encoding can be found and the :mod:`chardet` library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
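
The lookup order described in this section can be illustrated with a small
stand-alone sketch. This is not html5lib's implementation: the helper
``guess_encoding`` and the ``META_CHARSET`` pattern are hypothetical names
introduced only for this example, and the regular expression is a deliberately
simplified stand-in for the real ``<meta>`` pre-scan.

.. code-block:: python

  import re

  # Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
  META_CHARSET = re.compile(
      rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
  )


  def guess_encoding(data, explicit=None, default="windows-1252"):
      """Illustrative helper (not part of html5lib) mirroring the four steps."""
      # 1. An explicitly specified encoding always wins.
      if explicit:
          return explicit

      # 2. Look for a <meta> declaration in the first 512 bytes only.
      match = META_CHARSET.search(data[:512])
      if match:
          return match.group(1).decode("ascii").lower()

      # 3. If chardet is installed, sniff the encoding from the byte pattern.
      try:
          import chardet
      except ImportError:
          chardet = None
      if chardet is not None:
          sniffed = chardet.detect(data).get("encoding")
          if sniffed:
              return sniffed.lower()

      # 4. Fall back to the default encoding.
      return default

For instance, ``guess_encoding(b'<meta charset="UTF-8"><p>Hi')`` takes the
second branch, while passing ``explicit="iso-8859-2"`` short-circuits the
remaining steps.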