The moving parts
================

html5lib is made up of several components, each responsible for one stage
of parsing, transforming, or serializing a document.

Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
Several tree representations are supported, as are translations to other formats via *tree adapters*.
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of strings (or of bytes, if an encoding is given).
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
the user can later access. html5lib can build three types of trees:

* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.

You can specify the builder by name when using the shorthand API:

.. code-block:: python

  import html5lib

  # Open the file in binary mode so that html5lib can detect the encoding.
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you can pass a tree builder class via the ``tree`` keyword argument; if it is omitted, the default ``etree`` builder is used:

.. code-block:: python

  import html5lib

  # Look up the builder class by name, then hand it to the parser.
  TreeBuilder = html5lib.getTreeBuilder("dom")
  parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treebuilders>`_.


Tree walkers
------------

In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
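
As a rough illustration of that streaming view, the sketch below walks an ``etree`` tree and collects the names of the start tags it meets. It assumes that each token is a dictionary with ``"type"`` and ``"name"`` keys, which is how the bundled walkers behave; the variable names are made up for this example.

.. code-block:: python

  import html5lib

  # Parse a fragment; the parser adds the implied html, head and body elements.
  document = html5lib.parse("<p>Hello <b>World</b>!")

  # Get the walker class that matches the tree type, then wrap the tree in it.
  walker = html5lib.getTreeWalker("etree")
  stream = walker(document)

  # Each token is a dict; keep only the element names of the start tags.
  start_tags = [token["name"] for token in stream if token["type"] == "StartTag"]
  print(start_tags)

Because the walker is consumed lazily, the same pattern works for large documents without building an intermediate list of tokens.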

html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of strings (or of bytes, if an encoding is given) from a token stream, and several filters which manipulate the stream.

HTMLSerializer
~~~~~~~~~~~~~~

The serializer lets you write HTML back as a stream of strings, or of bytes if you pass an ``encoding``.

.. code-block:: pycon

  >>> import html5lib
  >>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  >>> walker = html5lib.getTreeWalker("etree")
  >>> stream = walker(element)
  >>> s = html5lib.serializer.HTMLSerializer()
  >>> output = s.serialize(stream)
  >>> for item in output:
  ...   print("%r" % item)
  '<p'
  ' '
  'xml:lang'
  '='
  'pl'
  '>'
  'Witam wszystkich'

You can customize the serializer's behaviour through keyword options. Consult
the :class:`~html5lib.serializer.HTMLSerializer` documentation for the full list.
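
As one possible customization, the sketch below turns off the omission of optional tags and asks for attribute values to be quoted, then joins the output into a single string with ``render()``. The option names ``omit_optional_tags`` and ``quote_attr_values`` are taken from the serializer's keyword options; the ``"always"`` value is assumed from html5lib 1.x and may differ in older releases.

.. code-block:: python

  import html5lib
  from html5lib.serializer import HTMLSerializer

  element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
  walker = html5lib.getTreeWalker("etree")

  # Keep optional tags such as <html> and </p>, and always quote attributes.
  serializer = HTMLSerializer(omit_optional_tags=False,
                              quote_attr_values="always")

  # render() joins the serialized pieces into one string.
  html = serializer.render(walker(element))
  print(html)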


Filters
~~~~~~~

html5lib provides several filters:

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
  tags to be in alphabetical order.

* :class:`inject_meta_charset.Filter
  <html5lib.filters.inject_meta_charset.Filter>` sets a user-specified
  encoding in the correct ``<meta>`` tag in the ``<head>`` section of
  the document.

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
  removes tags from the token stream which are not necessary to produce valid
  HTML.

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
  unsafe markup and CSS. Elements that are known to be safe are passed
  through and the rest is converted to visible text. The default
  configuration of the sanitizer follows the `WHATWG Sanitization Rules
  <http://wiki.whatwg.org/wiki/Sanitization_rules>`_.

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses runs of whitespace characters into single spaces unless they're
  in ``<pre>`` or ``<textarea>`` elements.

To use a filter, simply wrap it around a token stream:

.. code-block:: pycon

  >>> import html5lib
  >>> from html5lib.filters import sanitizer
  >>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  >>> walker = html5lib.getTreeWalker("dom")
  >>> stream = walker(dom)
  >>> clean_stream = sanitizer.Filter(stream)
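
A filtered stream is still a token stream, so it can be passed to the serializer like any other. The following is a minimal sketch of that last step, assuming the default serializer options; ``clean_html`` is a name chosen for this example. Disallowed tags such as ``<script>`` should come out as escaped, visible text rather than as markup.

.. code-block:: python

  import html5lib
  from html5lib.filters import sanitizer
  from html5lib.serializer import HTMLSerializer

  dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
  walker = html5lib.getTreeWalker("dom")

  # Wrap the walker's token stream in the filter, then serialize the result.
  clean_stream = sanitizer.Filter(walker(dom))
  clean_html = HTMLSerializer().render(clean_stream)
  print(clean_html)

Filters can be nested in the same way, each one wrapping the stream produced by the previous one.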


Tree adapters
-------------

Tree adapters can be used to translate between tree formats.
Two adapters are provided by html5lib:

* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
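
To give a feel for the SAX adapter, here is a small sketch that feeds a parsed document to a standard-library ``ContentHandler``. It assumes that ``to_sax()`` takes a tree walker and a handler, and that elements are reported through ``startElementNS``; the ``ElementCollector`` class is hypothetical and exists only for this example.

.. code-block:: python

  from xml.sax.handler import ContentHandler

  import html5lib
  from html5lib.treeadapters import sax


  class ElementCollector(ContentHandler):
      """Hypothetical handler that records the local name of each element."""

      def __init__(self):
          super().__init__()
          self.names = []

      def startElementNS(self, name, qname, attrs):
          # name is a (namespace, local name) pair.
          namespace, local_name = name
          self.names.append(local_name)


  document = html5lib.parse("<p>Hello <b>World</b>!")
  walker = html5lib.getTreeWalker("etree")

  collector = ElementCollector()
  sax.to_sax(walker(document), collector)
  print(collector.names)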

Encoding discovery
------------------

Parsed trees are always Unicode. However, a wide variety of input
encodings is supported. The encoding of the document is determined in
the following order:

* The encoding may be explicitly specified by passing the name of the
  encoding as the encoding parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
  :class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
  specification).

* If no encoding can be found and the :mod:`chardet` library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually
  `Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
  a common fallback used by Web browsers.
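
The first step in the list above can be sketched as follows. The keyword name ``transport_encoding`` is an assumption based on html5lib 1.x; older releases may expect ``encoding`` instead, so check the version you are using. The sample text and variable names are invented for this example.

.. code-block:: python

  import html5lib

  # Bytes in a legacy encoding, with no <meta> element to reveal it.
  raw = "<p>Zażółć gęślą jaźń".encode("iso-8859-2")

  # Assumption: html5lib 1.x accepts the encoding as transport_encoding.
  document = html5lib.parse(raw, transport_encoding="iso-8859-2")

  # The resulting tree holds Unicode text whatever the input encoding was.
  text = "".join(document.itertext())
  print(text)

Passing a ``str`` instead of ``bytes`` skips encoding detection altogether, since the input is already Unicode.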