html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.svg?branch=master
   :target: https://travis-ci.org/html5lib/html5lib-python


html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

 import html5lib
 with open("mydocument.html", "rb") as f:
     document = html5lib.parse(f)

or:

.. code-block:: python

 import html5lib
 document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

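As a rough illustration of the default tree type, the sketch below parses a
fragment and looks up the paragraph with the standard ``ElementTree`` API. It
assumes the default ``namespaceHTMLElements=True`` behaviour, under which
element tags carry the XHTML namespace. The ``XHTML`` constant is only a
local helper name, not part of html5lib.

.. code-block:: python

 import html5lib

 # Local helper (not an html5lib name): the XHTML namespace written in
 # ElementTree's "{uri}tag" notation.
 XHTML = "{http://www.w3.org/1999/xhtml}"

 document = html5lib.parse("<p>Hello World!")

 # The parser supplies the implied <html>, <head> and <body> elements,
 # so the paragraph is found under <body>.
 paragraph = document.find(XHTML + "body/" + XHTML + "p")
 print(document.tag)
 print(paragraph.text)

Passing ``namespaceHTMLElements=False`` to ``html5lib.parse`` should yield
plain tag names such as ``html`` and ``p`` instead.
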
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

 import html5lib
 with open("mydocument.html", "rb") as f:
     lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), the charset from the
HTTP headers should be passed to html5lib as follows:

.. code-block:: python

 from contextlib import closing
 from urllib2 import urlopen
 import html5lib

 with closing(urlopen("http://example.com/")) as f:
     document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), the charset from
the HTTP headers should be passed to html5lib as follows:

.. code-block:: python

 from urllib.request import urlopen
 import html5lib

 with urlopen("http://example.com/") as f:
     document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

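The effect of ``transport_encoding`` can be checked without a network
request. The sketch below is only an illustration: the byte string stands in
for a response body, and the encoding label stands in for the charset that
an HTTP ``Content-Type`` header would carry.

.. code-block:: python

 import html5lib

 # Local helper (not an html5lib name): the XHTML namespace prefix.
 XHTML = "{http://www.w3.org/1999/xhtml}"

 # Byte 0xE9 is e-acute in ISO-8859-1. The bytes contain no BOM and no
 # <meta charset> declaration, so nothing in the document itself says
 # how to decode them.
 body = b"<p>caf\xe9"
 document = html5lib.parse(body, transport_encoding="iso-8859-1")

 paragraph = document.find(XHTML + "body/" + XHTML + "p")
 print(repr(paragraph.text))

Passing text (``str``) instead of bytes skips encoding detection altogether,
so ``transport_encoding`` only matters for byte input and binary file
objects.
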
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

 import html5lib
 with open("mydocument.html", "rb") as f:
     parser = html5lib.HTMLParser(strict=True)
     document = parser.parse(f)

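To see what ``strict=True`` changes, the following sketch parses the same
fragment twice. It assumes that the missing doctype counts as a parse error,
that the exception raised is ``html5lib.html5parser.ParseError``, and that a
non-strict parser records the errors it recovers from in its ``errors``
list.

.. code-block:: python

 import html5lib
 from html5lib.html5parser import ParseError

 fragment = "<p>Hello World!"  # no doctype, so not error-free

 # Strict mode: the first parse error is raised as an exception.
 strict_parser = html5lib.HTMLParser(strict=True)
 try:
     strict_parser.parse(fragment)
     strict_failed = False
 except ParseError as error:
     strict_failed = True
     print("strict parser stopped: %s" % error)

 # Default mode: the parser recovers and keeps a record of the errors.
 lenient_parser = html5lib.HTMLParser()
 document = lenient_parser.parse(fragment)
 print("%d parse error(s) recorded" % len(lenient_parser.errors))

Strict mode is mostly useful for validating input. For real-world HTML the
default, error-recovering mode is usually what is wanted.
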
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

 import html5lib
 parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
 minidom_document = parser.parse("<p>Hello World!")

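As a small follow-up, the tree returned by the ``dom`` treebuilder can be
inspected with the standard ``xml.dom.minidom`` API. This is a sketch that
assumes the default ``minidom`` implementation is the one selected.

.. code-block:: python

 import html5lib

 parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
 minidom_document = parser.parse("<p>Hello World!")

 # Ordinary DOM calls work on the result.
 paragraph = minidom_document.getElementsByTagName("p")[0]
 print(paragraph.firstChild.data)

The same ``getTreeBuilder`` call also accepts ``"etree"`` and ``"lxml"``.
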
More documentation is available at https://html5lib.readthedocs.io/.


Installation
------------

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

   $ pip install html5lib

The goal is to support a (non-strict) superset of the versions that `pip
supports
<https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot
  be determined.


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``py.test`` command in the root directory.

Test data are kept in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository, which is included
as a submodule. For git checkouts, the submodule must be initialized::

 $ git submodule init
 $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.