= 4.12.3 (20240117)

* The Beautiful Soup documentation now has a Spanish translation, thanks
  to Carlos Romero. Delong Wang's Chinese translation has been updated
  to cover Beautiful Soup 4.12.0.

* Fixed a regression such that if you set .hidden on a tag, the tag
  becomes invisible but its contents are still visible. User manipulation
  of .hidden is not a documented or supported feature, so don't do this,
  but it wasn't too difficult to keep the old behavior working.

* Fixed a case found by Mengyuhan where html.parser giving up on
  markup would result in an AssertionError instead of a
  ParserRejectedMarkup exception.

* Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning.
  [bug=2034451]

* Corrected the syntax of the license definition in pyproject.toml. Patch
  by Louis Maddox. [bug=2032848]

* Corrected a typo in a test that was causing test failures when run against
  libxml2 2.12.1. [bug=2045481]

= 4.12.2 (20230407)

* Fixed an unhandled exception in BeautifulSoup.decode_contents
  and methods that call it. [bug=2015545]

= 4.12.1 (20230405)

NOTE: the following things are likely to be dropped in the next
feature release of Beautiful Soup:

 Official support for Python 3.6.
 Inclusion of unit tests and test data in the wheel file.
 Two scripts: demonstrate_parser_differences.py and test-all-versions.

Changes:

* This version of Beautiful Soup replaces setup.py and setup.cfg
  with pyproject.toml. Beautiful Soup now uses tox as its test backend
  and hatch to do builds.

* The main functional improvement in this version is a nonrecursive technique
  for regenerating a tree. This technique is used to avoid situations where,
  in previous versions, doing something to a very deeply nested tree
  would overflow the Python interpreter stack:

  1. Outputting a tree as a string, e.g. with
     BeautifulSoup.encode() [bug=1471755]

  2.
     Making copies of trees (copy.copy() and
     copy.deepcopy() from the Python standard library). [bug=1709837]

  3. Pickling a BeautifulSoup object. (Note that pickling a Tag
     object can still cause an overflow.)

* Making a copy of a BeautifulSoup object no longer parses the
  document again, which should improve performance significantly.

* When a BeautifulSoup object is unpickled, Beautiful Soup now
  tries to associate an appropriate TreeBuilder object with it.

* Tag.prettify() will now consistently end prettified markup with
  a newline.

* Added unit tests for fuzz test cases created by third
  parties. Some of these tests are skipped since they point
  to problems outside of Beautiful Soup, but this change
  puts them all in one convenient place.

* PageElement now implements the known_xml attribute. (Its absence was
  technically a bug, but it shouldn't be an issue in normal use.)
  [bug=2007895]

* The demonstrate_parser_differences.py script was still written in
  Python 2. I've converted it to Python 3, but since no one has
  mentioned this over the years, it's a sign that no one uses this
  script and it's not serving its purpose.

= 4.12.0 (20230320)

* Introduced the .css property, which centralizes all access to
  the Soup Sieve API. This allows Beautiful Soup to give direct
  access to as much of Soup Sieve as makes sense, without cluttering
  the BeautifulSoup and Tag classes with a lot of new methods.

  This does mean one addition to the BeautifulSoup and Tag classes
  (the .css property itself), so this might be a breaking change if you
  happen to use Beautiful Soup to parse XML that includes a tag called
  <css>.
  In particular, code like this will stop working in 4.12.0:

   soup.css['id']

  Code like this will work just as before:

   soup.find('css')['id']

  The Soup Sieve methods supported through the .css property are
  select(), select_one(), iselect(), closest(), match(), filter(),
  escape(), and compile(). The BeautifulSoup and Tag classes still
  support the select() and select_one() methods; they have not been
  deprecated, but they have been demoted to convenience methods.

  [bug=2003677]

* When the html.parser parser decides it can't parse a document, Beautiful
  Soup now consistently propagates this fact by raising a
  ParserRejectedMarkup error. [bug=2007343]

* Removed some error checking code from diagnose(), which is redundant with
  similar (but more Pythonic) code in the BeautifulSoup constructor.
  [bug=2007344]

* Added intersphinx references to the documentation so that other
  projects have a target to point to when they reference Beautiful
  Soup classes. [bug=1453370]

= 4.11.2 (20230131)

* Fixed test failures caused by nondeterministic behavior of
  UnicodeDammit's character detection, depending on the platform setup.
  [bug=1973072]

* Fixed another crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* The HTMLFormatter and XMLFormatter constructors no longer return a
  value. [bug=1992693]

* Tag.interesting_string_types is now propagated when a tag is
  copied. [bug=1990400]

* Warnings now do their best to provide an appropriate stacklevel,
  improving the usefulness of the message. [bug=1978744]

* Passing a Tag's .contents into PageElement.extend() now works the
  same way as passing the Tag itself.

* Soup Sieve tests will be skipped if the library is not installed.

= 4.11.1 (20220408)

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
  packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
  but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
  code.

= 4.11.0 (20220407)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML.
  [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
  warnings to omit untrusted input and make the warnings less
  judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These will either return the string
  itself, or nothing, so the only reason to use them is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value had to be set to None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]

* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti.
  [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers, such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]

* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

= 4.9.3 (20201003)

This is the final release of Beautiful Soup to support Python
2. Beautiful Soup's official support for Python 2 ended on 01 January,
2021. In the Launchpad Git repository, the final revision to support
Python 2 was revision 70f546b1e689a70e2f103795efce6d261a3dadf7; it is
tagged as "python2".

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti.
  [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]

* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk. [bug=1873787]

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupselect 2.0. Patch by Tomáš
  Chvátal.
  [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

* Embedded CSS and JavaScript are now stored in distinct Stylesheet and
  Script string objects, which are ignored by methods like get_text()
  since most people don't consider this sort of content to be 'text'.
  This feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo.
  [bug=1742921]

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output.
    [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupselect. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments.
  [bug=1514970]

* Fix a number of problems with the tree builder that caused
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml.
  [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]

* It's now possible to use a tag's namespace prefix when searching,
  e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
  html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
  element tags) correctly.
  [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
  to Vikas for a patch and test. [bug=1685172]

= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions.
  [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser.
  [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new object that's
  equal to the old one but not connected to the parse tree. Patch by
  Martijn Pieters. [bug=1307490]

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas [bug=1191917]

Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
  from later in the document is moved into place.
  [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
  copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
  the remaining tree.

* The select() method can now find tags whose names contain
  dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
  contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
  instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
  import the Python 2 edition of Beautiful Soup under Python
  3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
  the html.parser constructor to avoid a warning and future
  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]

= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8.
    [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3.
  Fixed some Python 2-specific code in the html5lib test suite.
  [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

    from bs4 import _s
  or
    from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before.
  [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and that also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.
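
To illustrate the UnicodeDammit.detwingle() entry under 4.1.0, here is
a minimal sketch. It assumes a byte string that mixes UTF-8 and
Windows-1252; the sample data is made up for illustration. Call
detwingle() on the raw bytes before handing them to BeautifulSoup or
to the UnicodeDammit constructor.

```python
from bs4 import UnicodeDammit

# Three snowmen encoded as UTF-8, followed by a curly-quoted phrase
# encoded as Windows-1252: a deliberately mixed-encoding document.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows-1252")

# As it stands, the byte string is not valid UTF-8.
try:
    doc.decode("utf8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# detwingle() rewrites the Windows-1252 bytes as UTF-8 and leaves the
# bytes that are already UTF-8 alone.
fixed = UnicodeDammit.detwingle(doc)
text = fixed.decode("utf8")
print(text)
```

detwingle() only knows how to handle Windows-1252 embedded in UTF-8
(or the reverse), so it's a targeted fix rather than a general
encoding repair tool.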

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").
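
A short sketch of the smart_quotes_to="ascii" option described above.
The sample markup is invented for illustration, and the list passed as
the second argument is an assumption made so the example doesn't
depend on encoding detection: it tells UnicodeDammit to try
Windows-1252 first.

```python
from bs4 import UnicodeDammit

# \x93, \x94 and \x92 are Windows-1252 "smart" quote characters.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# The second positional argument lists encodings to try first.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")

# The curly quotes come out as plain ASCII quote characters.
print(dammit.unicode_markup)
```

Leaving smart_quotes_to unset converts the same bytes to the
corresponding Unicode curly quotes instead.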

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

 <?xml version="1.0" encoding="utf-8"?>
 <markup>
  ...
 </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get
Unicode, and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.
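
The case-insensitive comparison of encoding names mentioned above can
be sketched with the standard library alone. This is not
UnicodeDammit's own code, just an illustration of the idea;
same_encoding() is a hypothetical helper name.

```python
import codecs


def same_encoding(a: str, b: str) -> bool:
    """Return True if two encoding names resolve to the same codec.

    codecs.lookup() ignores case, so "Windows-1252" and
    "windows-1252" resolve to the same codec.
    """
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # At least one of the names isn't a known encoding.
        return False


print(same_encoding("Windows-1252", "windows-1252"))
print(same_encoding("utf-8", "windows-1252"))
```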

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.
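
The keyword problem behind the 3.0.1 "fetch by CSS class" entry can be
shown without Beautiful Soup at all. In this sketch, describe_search()
is a hypothetical stand-in for a search method that takes attribute
filters as keyword arguments; it is not part of any Beautiful Soup
API.

```python
import keyword


def describe_search(name, **attrs):
    """Hypothetical stand-in for a search method taking attribute filters."""
    return name, attrs


# "class" is reserved, so it can't be written as a keyword argument.
print(keyword.iskeyword("class"))

try:
    compile('describe_search("a", class="foo")', "<example>", "eval")
    is_valid = True
except SyntaxError:
    is_valid = False

# Unpacking a dictionary sidesteps the syntax restriction.
result = describe_search("a", **{"class": "foo"})
print(is_valid, result)
```

The 4.1.2 entry earlier in this file describes the eventual fix on the
Beautiful Soup side: a 'class_' keyword argument.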

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document.
This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then it finds many
elements. Otherwise, it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, not with feed after
the soup has been created. There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.
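
As a rough illustration of the idea behind the 2.1.0 search methods, the
sketch below follows one navigation member over and over, collects the
elements that match, and honors an optional limit. The Node class and
the link_siblings, fetch_along, and find_along helpers are hypothetical
names invented for this example; they are not part of Beautiful Soup.
The sketch also returns None when nothing matches, whereas first() in
2.x returned Null.

```python
# Hypothetical minimal node type; NOT Beautiful Soup's own classes.
class Node:
    def __init__(self, name):
        self.name = name
        self.nextSibling = None
        self.previousSibling = None


def link_siblings(nodes):
    """Wire up nextSibling/previousSibling for nodes at one tree level."""
    for left, right in zip(nodes, nodes[1:]):
        left.nextSibling = right
        right.previousSibling = left
    return nodes


def fetch_along(start, member, matches, limit=None):
    """Follow `member` from `start` repeatedly, collecting matching nodes.

    `matches` is any callable that takes a node; `limit` caps the number
    of results (None means no cap).
    """
    results = []
    node = getattr(start, member)
    while node is not None:
        if matches(node):
            results.append(node)
            if limit is not None and len(results) >= limit:
                break
        node = getattr(node, member)
    return results


def find_along(start, member, matches):
    """Return only the first match along `member`, or None."""
    found = fetch_along(start, member, matches, limit=1)
    return found[0] if found else None


a, b, c, d = link_siblings([Node("li"), Node("br"), Node("li"), Node("li")])


def is_li(node):
    return node.name == "li"


later_items = fetch_along(a, "nextSibling", is_li)          # both later 'li' nodes
first_only = fetch_along(a, "nextSibling", is_li, limit=1)  # just the nearest one
earlier = find_along(d, "previousSibling", lambda n: n.name == "br")
```

Because the member name is a parameter, the same loop covers the next,
previous, sibling, and parent variants.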

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string.
   It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code:

 for header in soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there.
Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

 <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect.
This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags.
Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.