= 4.12.3 (20240117)

* The Beautiful Soup documentation now has a Spanish translation, thanks
  to Carlos Romero. Delong Wang's Chinese translation has been updated
  to cover Beautiful Soup 4.12.0.

* Fixed a regression such that if you set .hidden on a tag, the tag
  becomes invisible but its contents are still visible. User manipulation
  of .hidden is not a documented or supported feature, so don't do this,
  but it wasn't too difficult to keep the old behavior working.

* Fixed a case found by Mengyuhan where html.parser giving up on
  markup would result in an AssertionError instead of a
  ParserRejectedMarkup exception.

* Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning.
  [bug=2034451]

* Corrected the syntax of the license definition in pyproject.toml. Patch
  by Louis Maddox. [bug=2032848]

* Corrected a typo in a test that was causing test failures when run against
  libxml2 2.12.1. [bug=2045481]

= 4.12.2 (20230407)

* Fixed an unhandled exception in BeautifulSoup.decode_contents
  and methods that call it. [bug=2015545]

= 4.12.1 (20230405)

NOTE: the following things are likely to be dropped in the next
feature release of Beautiful Soup:

 Official support for Python 3.6.
 Inclusion of unit tests and test data in the wheel file.
 Two scripts: demonstrate_parser_differences.py and test-all-versions.

Changes:

* This version of Beautiful Soup replaces setup.py and setup.cfg
  with pyproject.toml. Beautiful Soup now uses tox as its test backend
  and hatch to do builds.

* The main functional improvement in this version is a nonrecursive technique
  for regenerating a tree. This technique is used to avoid situations where,
  in previous versions, doing something to a very deeply nested tree
  would overflow the Python interpreter stack:

  1. Outputting a tree as a string, e.g. with
     BeautifulSoup.encode() [bug=1471755]

  2. Making copies of trees (copy.copy() and
     copy.deepcopy() from the Python standard library). [bug=1709837]

  3. Pickling a BeautifulSoup object. (Note that pickling a Tag
     object can still cause an overflow.)

* Making a copy of a BeautifulSoup object no longer parses the
  document again, which should improve performance significantly.

* When a BeautifulSoup object is unpickled, Beautiful Soup now
  tries to associate an appropriate TreeBuilder object with it.

* Tag.prettify() will now consistently end prettified markup with
  a newline.

* Added unit tests for fuzz test cases created by third
  parties. Some of these tests are skipped since they point
  to problems outside of Beautiful Soup, but this change
  puts them all in one convenient place.

* PageElement now implements the known_xml attribute. (This was technically
  a bug, but it shouldn't be an issue in normal use.) [bug=2007895]

* The demonstrate_parser_differences.py script was still written in
  Python 2. I've converted it to Python 3, but since no one has
  mentioned this over the years, it's a sign that no one uses this
  script and it's not serving its purpose.
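
The nonrecursive technique mentioned in this release can be illustrated
with a toy tree. This is only a sketch of the general idea, not Beautiful
Soup's actual code: Node and render() are hypothetical names. Recursive
calls are replaced by an explicit stack of open/close events, so the depth
of the tree is limited by memory rather than by the interpreter's recursion
limit.

```python
class Node:
    """Hypothetical stand-in for a tag: a name plus child nodes."""

    def __init__(self, name):
        self.name = name
        self.children = []


def render(root):
    """Serialize a tree using an explicit stack instead of recursion."""
    out = []
    stack = [("open", root)]
    while stack:
        event, node = stack.pop()
        if event == "open":
            out.append(f"<{node.name}>")
            # Schedule the closing tag, then the children in document order.
            stack.append(("close", node))
            for child in reversed(node.children):
                stack.append(("open", child))
        else:
            out.append(f"</{node.name}>")
    return "".join(out)


# Build a chain ten times deeper than CPython's default recursion limit.
root = Node("d")
node = root
for _ in range(10_000):
    child = Node("d")
    node.children.append(child)
    node = child

html = render(root)
```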

= 4.12.0 (20230320)

* Introduced the .css property, which centralizes all access to
  the Soup Sieve API. This allows Beautiful Soup to give direct
  access to as much of Soup Sieve as makes sense, without cluttering
  the BeautifulSoup and Tag classes with a lot of new methods.

  This does mean one addition to the BeautifulSoup and Tag classes
  (the .css property itself), so this might be a breaking change if you
  happen to use Beautiful Soup to parse XML that includes a tag called
  <css>. In particular, code like this will stop working in 4.12.0:

    soup.css['id']

  Code like this will work just as before:

    soup.find('css')['id']

  The Soup Sieve methods supported through the .css property are
  select(), select_one(), iselect(), closest(), match(), filter(),
  escape(), and compile(). The BeautifulSoup and Tag classes still
  support the select() and select_one() methods; they have not been
  deprecated, but they have been demoted to convenience methods.

  [bug=2003677]

* When the html.parser parser decides it can't parse a document, Beautiful
  Soup now consistently propagates this fact by raising a
  ParserRejectedMarkup error. [bug=2007343]

* Removed some error checking code from diagnose(), which is redundant with
  similar (but more Pythonic) code in the BeautifulSoup constructor.
  [bug=2007344]

* Added intersphinx references to the documentation so that other
  projects have a target to point to when they reference Beautiful
  Soup classes. [bug=1453370]

= 4.11.2 (20230131)

* Fixed test failures caused by nondeterministic behavior of
  UnicodeDammit's character detection, depending on the platform setup.
  [bug=1973072]

* Fixed another crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* The HTMLFormatter and XMLFormatter constructors no longer return a
  value. [bug=1992693]

* Tag.interesting_string_types is now propagated when a tag is
  copied. [bug=1990400]

* Warnings now do their best to provide an appropriate stacklevel,
  improving the usefulness of the message. [bug=1978744]

* Passing a Tag's .contents into PageElement.extend() now works the
  same way as passing the Tag itself.

* Soup Sieve tests will be skipped if the library is not installed.

= 4.11.1 (20220408)

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
  packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
  but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
  code.

= 4.11.0 (20220407)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
  warnings to omit untrusted input and make the warnings less
  judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python 3.
  I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use this is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value must be set as None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]

* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers, such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]

* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]
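
Several of the behavior changes in this release are easier to see in code.
The sketch below assumes Beautiful Soup 4.10.0 or later and the html.parser
tree builder. The markup and the byte string are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

soup = BeautifulSoup(
    "<div><p>Hi</p><script>var x = 1;</script></div>", "html.parser"
)

# The script body counts as 'text' for the <script> tag itself...
script_text = soup.script.get_text()
# ...but not for the tag's parent.
div_text = soup.div.get_text()

# replace_with() accepts several replacements at once.
soup.p.replace_with("Hello, ", soup.new_tag("b"), "!")
after_replace = str(soup.div)

# The 'html5' formatter renders an empty-string value as a boolean attribute.
soup.b["hidden"] = ""
as_html5 = soup.b.decode(formatter="html5")

# Encodings in known_definite_encodings are tried before any sniffing.
dammit = UnicodeDammit(b"caf\xe9", known_definite_encodings=["latin-1"])
decoded = dammit.unicode_markup
```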

= 4.9.3 (20201003)

This is the final release of Beautiful Soup to support Python 2.
Beautiful Soup's official support for Python 2 ended on 01 January, 2021.
In the Launchpad Git repository, the final revision to support
Python 2 was revision 70f546b1e689a70e2f103795efce6d261a3dadf7; it is
tagged as "python2".

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]

* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk. [bug=1873787]

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]
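
A sketch of the 'on_duplicate_attribute' argument from this release,
passed through the BeautifulSoup constructor to the html.parser tree
builder. The URLs are placeholders and collect() is a hypothetical helper
name.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://url1/" href="http://url2/">link</a>'

# Default behavior: the last value wins.
default = BeautifulSoup(markup, "html.parser").a["href"]

# 'ignore' keeps the first value and discards later ones.
first = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
).a["href"]


def collect(attributes_so_far, key, value):
    """Hypothetical handler: gather every duplicate value into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)


both = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute=collect
).a["href"]
```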

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script string objects, which are ignored by methods like get_text()
  since most people don't consider this sort of content to be 'text'.
  This feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.
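
The .decomposed property added in this release can be checked like this.
A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Keep <i>drop</i> this</p>", "html.parser")
i_tag = soup.i
p_tag = soup.p

i_tag.decompose()  # destroy the <i> tag and its contents

removed = i_tag.decomposed      # the tag we just destroyed
still_alive = p_tag.decomposed  # a tag that was left alone
remaining = str(soup)
```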

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo. [bug=1742921]

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]
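
The position tracking added in 4.8.1 can be read back as shown below. This
sketch assumes the html.parser tree builder, which counts lines from 1 and
reports the column where each tag starts. The markup is made up.

```python
from bs4 import BeautifulSoup

markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect (line, column, text) for every <p> tag in the document.
positions = [
    (tag.sourceline, tag.sourcepos, tag.string)
    for tag in soup.find_all("p")
]
```

html5lib may report a different column for the same tag, so code that
depends on sourcepos should not assume the two parsers agree.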

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]
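
The 'multi_valued_attributes' argument and Tag.smooth() from this release
can be sketched as follows, using made-up markup and the html.parser tree
builder:

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout">A one</p>'

# By default 'class' is multi-valued, so its value is split into a list.
as_list = BeautifulSoup(markup, "html.parser").p["class"]

# Passing None turns that treatment off for every attribute.
as_string = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]

# Appending a string leaves two adjacent NavigableStrings in the tag...
soup = BeautifulSoup(markup, "html.parser")
soup.p.append(", a two")
before = len(soup.p.contents)

# ...which smooth() consolidates into one.
soup.smooth()
after = len(soup.p.contents)
text = soup.p.string
```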

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupsieve. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]

* Fixed a number of problems with the tree builder that produced
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.
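
The tree-editing additions in this release, extend() and the
variable-argument insert_before() and insert_after(), can be sketched like
this. The markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")

# extend() adds each item of a list to the end of the tag's contents.
soup.p.extend([" one", " two"])

# insert_before() takes any number of elements and keeps them in order.
soup.b.insert_before("x", "y")
soup.b.insert_after("z")

result = str(soup)
```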

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml. [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.
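
The `attrs` argument to new_tag described in this release matters because
new_tag's first argument is itself called 'name'. A minimal sketch, with a
made-up form field:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<form></form>", "html.parser")

# An HTML attribute called 'name' would collide with new_tag's own 'name'
# argument if passed as a keyword, so it goes through attrs instead.
field = soup.new_tag("input", attrs={"name": "q", "type": "text"})
soup.form.append(field)

tag_name = field.name           # the tag's name
attribute_value = field["name"] # the HTML attribute called 'name'
```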

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]

* It's now possible to use a tag's namespace prefix when searching,
  e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
  html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
  element tags) correctly. [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
  to Vikas for a patch and test. [bug=1685172]
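
The difference between `Tag.get` and `Tag.get_attribute_list` from this
release, sketched with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="a b">x</p>', "html.parser")

# get() returns a string for 'id' but a list for the multi-valued 'class'.
id_value = soup.p.get("id")
class_value = soup.p.get("class")

# get_attribute_list() returns a list in both cases.
id_list = soup.p.get_attribute_list("id")
class_list = soup.p.get_attribute_list("class")
```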

= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python 3.
  [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]


= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new object that's
  equal to the old one but not connected to the parse tree. Patch by
  Martijn Pieters. [bug=1307490]

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas [bug=1191917]
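
The selector features listed above, select_one() and selector grouping,
can be sketched as follows. This assumes a current Beautiful Soup, where
CSS selection is delegated to Soup Sieve. The markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>T</h1><p>a</p><p>b</p>", "html.parser")

# select() returns every match; select_one() returns only the first.
all_paragraphs = [tag.get_text() for tag in soup.select("p")]
first_paragraph = soup.select_one("p").get_text()

# A comma groups selectors: anything matching either one is returned.
grouped = [tag.name for tag in soup.select("h1, p")]

# select_one() returns None when nothing matches.
missing = soup.select_one("table")
```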
712
713Bug fixes:
714
715* Fixed yet another problem that caused the html5lib tree builder to
716  create a disconnected parse tree. [bug=1237763]
717
718* Force object_was_parsed() to keep the tree intact even when an element
719  from later in the document is moved into place. [bug=1430633]
720
721* Fixed yet another bug that caused a disconnected tree when html5lib
722  copied an element from one part of the tree to another. [bug=1270611]
723
724* Fixed a bug where Element.extract() could create an infinite loop in
725  the remaining tree.
726
727* The select() method can now find tags whose names contain
728  dashes. Patch by Francisco Canas. [bug=1276211]
729
730* The select() method can now find tags with attributes whose names
731  contain dashes. Patch by Marek Kapolka. [bug=1304007]
732
733* Improved the lxml tree builder's handling of processing
734  instructions. [bug=1294645]
735
736* Restored the helpful syntax error that happens when you try to
737  import the Python 2 edition of Beautiful Soup under Python
738  3. [bug=1213387]
739
740* In Python 3.4 and above, set the new convert_charrefs argument to
741  the html.parser constructor to avoid a warning and future
742  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]
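
A minimal sketch of the pickling round trip from the last entry,
assuming the built-in 'html.parser' tree builder. The markup is made
up for illustration, and the example deliberately says nothing about
soup.builder, since what survives there depends on the tree builder
and the library version in use.

```python
import pickle

from bs4 import BeautifulSoup

# Hypothetical markup, parsed with the built-in tree builder.
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")

# Round-trip the whole object through pickle.
restored = pickle.loads(pickle.dumps(soup))

# The document content survives the round trip.
print(restored.p.get_text())
```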

= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]
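
The first entry in this release describes making successive guesses at
an encoding. The general idea can be sketched with the standard
library alone. This is not Beautiful Soup's actual implementation: the
function name decode_with_guesses and the candidate list are
hypothetical, and the real code hands each guess to lxml instead of
decoding the bytes itself.

```python
from typing import Iterable, Optional, Tuple


def decode_with_guesses(
    data: bytes, candidates: Iterable[str]
) -> Tuple[str, Optional[str]]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess (or unknown codec name): try the next one.
            continue
    # No guess worked: fall back to lossy UTF-8 decoding.
    return data.decode("utf-8", errors="replace"), None


raw = "Sacr\u00e9 bleu".encode("latin-1")
text, used = decode_with_guesses(raw, ["ascii", "utf-8", "latin-1"])
print(used, text)
```

The order of the candidates matters: a permissive encoding such as
latin-1 accepts any byte sequence, so it belongs at the end of the
list.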

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]
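
The difference between the new default and the old behavior mentioned
in the first entry of this release can be seen by calling the two
EntitySubstitution methods directly. This is a small sketch with a
made-up input string, assuming both class methods are importable from
bs4.dammit as they are in current releases.

```python
from bs4.dammit import EntitySubstitution

# Hypothetical input: one bare ampersand plus two existing entities.
text = "AT&T &lt;b&gt;"

# Default behavior since 4.2.1: every ampersand is escaped.
escaped = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that already start an entity are left alone.
preserved = EntitySubstitution.substitute_xml_containing_entities(text)

print(escaped)
print(preserved)
```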

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

 - Added support for the adjacent sibling combinator (+) and the
   general sibling combinator (~). Tests by "liquider". [bug=1082144]

 - The combinators (>, +, and ~) can now combine with any supported
   selector, not just one that selects based on tag name.

 - Added limited support for the "nth-of-type" pseudo-class. Code
   by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]
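
The sibling combinators and the "nth-of-type" pseudo-class listed at
the top of this release can be tried out as follows. The markup is
made up for illustration, and the example assumes a current bs4
installation, where select() is backed by the soupsieve package and is
no longer the 4.2.0 implementation.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading followed by a mix of siblings.
html = (
    "<div><h2>Heading</h2>"
    "<p>one</p><p>two</p><span>x</span><p>three</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# "+" matches only the <p> immediately after the <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# "~" matches every later <p> sibling of the <h2>.
general = [p.get_text() for p in soup.select("h2 ~ p")]

# nth-of-type counts only siblings with the same tag name.
second = [p.get_text() for p in soup.select("p:nth-of-type(2)")]

print(adjacent, general, second)
```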

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]
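
The 'class_' keyword argument from the first entry of this release
exists because class is a reserved word in Python. A short sketch,
with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two different CSS classes.
html = '<p class="lead">Intro</p><p class="body">Text</p>'
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) searches by CSS class.
leads = soup.find_all("p", class_="lead")
texts = [p.get_text() for p in leads]
print(texts)
```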

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and that also made the parser wait too long to close tags
  with XML namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.
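
UnicodeDammit.detwingle(), mentioned in the first entry of this
release, works on bytes rather than on a parsed document. The sketch
below follows the pattern used in the Beautiful Soup documentation:
it builds a UTF-8 byte string with a Windows-1252 fragment appended,
then repairs it. The variable names are illustrative.

```python
from bs4.dammit import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)

# A UTF-8 document with a Windows-1252 fragment pasted into it.
mixed = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8.
fixed = UnicodeDammit.detwingle(mixed)
print(fixed.decode("utf8"))
```

Without the detwingle() step, decoding the mixed bytes as UTF-8 would
fail on the Windows-1252 quotation marks.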

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.
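
The wrap() and unwrap() methods from the first two entries of this
release are inverses of a sort: one adds a layer of markup around an
element, the other removes a layer while keeping its contents. A
small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# wrap(): put the string inside <b> into a new <i> tag.
soup.b.string.wrap(soup.new_tag("i"))
after_wrap = str(soup)

# unwrap(): remove the <b> tag but keep what it contains.
soup.b.unwrap()
after_unwrap = str(soup)

print(after_wrap)
print(after_unwrap)
```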

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").
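
A sketch of the smart-quote conversion from the last entry, modeled
on the example in the Beautiful Soup documentation. Passing
["windows-1252"] as the list of encodings to try is an addition for
this example, so that the result does not depend on encoding
detection.

```python
from bs4.dammit import UnicodeDammit

# Bytes containing Windows-1252 smart quotes (0x93, 0x94, 0x92).
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
print(converted)
```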

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.
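
The first and last entries of this release, on multi-valued
attributes, can be illustrated together. The markup is made up, and
the search uses the class_ keyword, which was only added later (in
4.1.2), so this is a sketch against a current bs4 rather than against
4.0.0b6 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one single-class tag, one two-class tag.
html = '<p class="body">x</p><a class="nav external" href="#">Click here</a>'
soup = BeautifulSoup(html, "html.parser")

# "class" is always a list, even when it holds a single value.
single = soup.p["class"]
multiple = soup.a["class"]

# A search string containing spaces must match the attribute exactly.
exact = soup.find_all(class_="nav external")
reordered = soup.find_all(class_="external nav")

print(single, multiple, len(exact), len(reordered))
```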

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.
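
The tree-building methods added in this release (new_tag(),
new_string(), insert_before(), and insert_after()) are usually used
together. A minimal sketch with made-up markup, using the HTML tree
builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>middle</p>", "html.parser")

# Create a new tag and give it some text.
heading = soup.new_tag("h1")
heading.string = "Title"

# Place new elements relative to the existing <p> tag.
soup.p.insert_before(heading)
soup.p.insert_after(soup.new_string(" trailing text"))

result = str(soup)
print(result)
```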


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
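
The third point can be checked with the standard library alone. The
sketch below uses Python 3's html.parser directly, with no Beautiful
Soup involved; the HrefCollector class is a hypothetical helper
written for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Record the href value of every start tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


parser = HrefCollector()
parser.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
parser.close()

# By the time handle_starttag() runs, &eacute; is already a character.
href = parser.hrefs[0]
print(ascii(href))
```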


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.
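
Searching with True as an attribute value still works the same way in
Beautiful Soup 4. The sketch below is written against the bs4 API
(find_all instead of the 3.0 name findAll), which is an assumption on
top of this entry, and the markup is made up. Only the True case is
shown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one anchor without.
html = '<a href="/home">Home</a><a name="top">Anchor</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True matches any tag that has an href attribute at all.
with_href = soup.find_all("a", href=True)
texts = [a.get_text() for a in with_href]
print(texts)
```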

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(attrs={"id" : "5"}) with soup(id="5"). You can still use attrs if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]
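Here is a minimal sketch of how keyword arguments and an attribute
dictionary can coexist, including the name-clash case. It assumes tags
are plain dicts and uses a made-up find_all function; it is not the
library's implementation.

```python
def find_all(tags, name=None, attrs=None, **kwargs):
    """Search a flat list of tag dicts (hypothetical stand-in for a search method)."""
    criteria = dict(attrs or {})
    criteria.update(kwargs)  # keyword arguments are treated as attribute filters
    return [
        tag for tag in tags
        if (name is None or tag["name"] == name)
        and all(tag["attrs"].get(key) == value for key, value in criteria.items())
    ]


tags = [
    {"name": "input", "attrs": {"name": "email", "id": "5"}},
    {"name": "p", "attrs": {"id": "6"}},
]
by_keyword = find_all(tags, id="5")
# "name" is already an argument of find_all, so an attribute called
# "name" has to go through the dictionary instead.
by_dict = find_all(tags, attrs={"name": "email"})
```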

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (eg. findNextSiblings), then it finds many elements. Otherwise,
it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, instead of calling
feed after the soup has been created. There is still a feed method,
but it's the feed method implemented by SGMLParser, and calling it
will bypass Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.
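One plausible shape for such a cleanup pass is a list of (pattern,
replacement) pairs applied in order before parsing. The sketch below is
an assumption about the general technique, not the library's actual
rule set; CLEANUP, SMART_QUOTES and clean are names invented here.

```python
import re

# Windows-1252 "smart quote" bytes and the entities standing in for them.
SMART_QUOTES = {
    b"\x91": b"&lsquo;", b"\x92": b"&rsquo;",
    b"\x93": b"&ldquo;", b"\x94": b"&rdquo;",
}

CLEANUP = [
    # "<br/>" -> "<br />": put a space before the closing slash.
    (re.compile(rb"(<[^<>]*)/>"), lambda m: m.group(1) + b" />"),
    # Smart-quote bytes -> HTML entities.
    (re.compile(rb"[\x91-\x94]"), lambda m: SMART_QUOTES[m.group(0)]),
]


def clean(markup, operations=CLEANUP):
    """Run each cleanup operation in order; pass [] to turn cleanup off."""
    for pattern, replacement in operations:
        markup = pattern.sub(replacement, markup)
    return markup


cleaned = clean(b"<br/>\x93hi\x94")
untouched = clean(b"<br/>\x93hi\x94", operations=[])
```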

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.
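The four kinds of criteria in the list above can be handled by one
small dispatch function. This is a sketch of the idea using a
hypothetical matches helper that works on plain strings; the real
matching code may differ in detail (for instance in how it applies the
regular expression).

```python
import re


def matches(criterion, value):
    """Return True if value satisfies criterion (hypothetical helper)."""
    if callable(criterion):
        return bool(criterion(value))               # None/false/"" means no match
    if hasattr(criterion, "search"):                # compiled regular expression
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        return value in criterion                   # any string in the list
    return value == criterion                       # plain string


heading = re.compile(r"^h[1-6]$")
results = [
    matches("b", "b"),
    matches(["a", "b"], "b"),
    matches(heading, "h2"),
    matches(lambda s: len(s) == 2, "td"),
]
```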

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.
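A sketch of that shorthand, assuming the attrs argument is normalized
into a dictionary before matching. normalize_attrs is a made-up name
for illustration.

```python
def normalize_attrs(attrs):
    """Turn a non-map attrs value into a "class" filter (hypothetical helper)."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return dict(attrs)
    return {"class": attrs}  # a bare string, regex, etc. is matched against "class"


shorthand = normalize_attrs("sidebar")
explicit = normalize_attrs({"class": "sidebar"})
```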

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code:

  for header in soup.bodyTag.pTag.tableTag('th'):
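Both shortcuts map naturally onto Python's __call__ and __getattr__
hooks. The toy Tag class below is an assumption-laden sketch of that
mechanism, not the library's class; in particular it returns None where
2.0 would return Null.

```python
class Tag:
    """Toy tree node, only meant to show the two shortcuts."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def fetch(self, name):
        """Return every descendant tag with the given name, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    def first(self, name):
        results = self.fetch(name)
        return results[0] if results else None

    def __call__(self, name):
        return self.fetch(name)              # foo("bar") -> foo.fetch("bar")

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        if attr.endswith("Tag") and len(attr) > 3:
            return self.first(attr[:-3])     # foo.barTag -> foo.first("bar")
        raise AttributeError(attr)


soup = Tag("html", [Tag("body", [Tag("table", [Tag("th"), Tag("th")])])])
headers = soup.bodyTag.tableTag("th")
```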

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
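An object with this behaviour is usually called a null object. Here is
a minimal sketch of the pattern; the class name NullType is an
assumption for this example, and it makes no claim about how the
library's own Null is written.

```python
class NullType:
    """Behaves like a falsy value that absorbs attribute access and calls."""

    def __getattr__(self, name):
        return self          # Null.anything is still Null

    def __call__(self, *args, **kwargs):
        return self          # Null(...) is still Null

    def __bool__(self):
        return False         # so "if result:" works as it would with None

    def __repr__(self):
        return "Null"


Null = NullType()

# A missing intermediate stage no longer raises AttributeError.
title = Null.headTag.titleTag
```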

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.
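The sibling links for the list part of that example can be reproduced
with a toy node class. This is only a sketch: the tree is built by hand
rather than parsed, text is modelled as a node too, and None stands in
for Null.

```python
class Node:
    """Toy tree node that wires up sibling links between its children."""

    def __init__(self, name, children=()):
        self.name = name
        self.previousSibling = None
        self.nextSibling = None
        self.children = list(children)
        # Link each child to its neighbours at the same level.
        for left, right in zip(self.children, self.children[1:]):
            left.nextSibling = right
            right.previousSibling = left


first_li = Node("li", [Node("Foo"), Node("br")])
second_li = Node("li", [Node("Bar")])
ul = Node("ul", [first_li, second_li])
foo = first_li.children[0]
```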

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass in a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.
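A sketch of how such a shorthand could work, reading "only one
string-owning child" as "exactly one child" for simplicity. The Text
and Tag classes here are toy stand-ins, not the patched code.

```python
class Text(str):
    """Toy text node: a string whose .string is itself."""

    @property
    def string(self):
        return self


class Tag:
    """Toy tag: .string delegates to a lone child, otherwise gives None."""

    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)

    @property
    def string(self):
        if len(self.contents) == 1:
            return self.contents[0].string   # same as tag.contents[0].string
        return None


bold = Tag("b", [Text("hi")])
nested = Tag("p", [Tag("b", [Text("hi")])])
mixed = Tag("p", [Text("a"), Text("b")])
```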

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.
