html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.svg?branch=master
    :target: https://travis-ci.org/html5lib/html5lib-python


html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), the charset from the
HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), the charset from
the HTTP response should be passed to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.


Installation
------------

html5lib works on CPython 2.7+, CPython 3.5+ and PyPy. To install:

.. code-block:: bash

    $ pip install html5lib

The goal is to support a (non-strict) superset of the versions that `pip
supports
<https://pip.pypa.io/en/stable/installing/#python-and-os-compatibility>`_.

Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot
  be determined.


Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be
run using the ``py.test`` command in the root directory.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.