python - html - how to modify code by converting text outside of a tag into a tag

How can I replace or convert a string that represents a tag into an actual tag?

I have the example below, where I need to clean up parts of the markup and convert strings like &lt;/div&gt; into proper tags:

html = """
 <html>
 <body>
 <div>
 &lt;/div&gt; <----- how to convert the line into </div>
 <div class="first_class">
 <h1 id="Header_1">
 Header_1
 </h1>
 </div>
 </body>
 </html> 
"""

I tried

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

tag = soup.find(text="&lt;")
tag.replace_with("<")

print(soup.prettify())

but this doesn't work: the find function doesn't pick up the string. The fact that the text is outside of any tag makes it more difficult. How can this be achieved?

asked Nov 16 '18 at 2:30

Chris


Did you try soup.find(text="<")? The string was encoded in the original HTML, but BeautifulSoup should have decoded it when parsing, and would therefore use the decoded version when matching in find.

– Lie Ryan
Nov 16 '18 at 5:50
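
The comment's point, that a parser hands back entity-decoded text, can be seen with the standard library's html.parser alone. This is a sketch that swaps in HTMLParser for BeautifulSoup, so it illustrates the decoding step rather than BeautifulSoup's own search behavior; the class name TextCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text chunks the parser reports between tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<p>&lt;/div&gt;</p>")
collector.close()

# Join the chunks in case the parser reports the text in several pieces.
text = "".join(collector.chunks)
print(text)            # </div>
print("&lt;" in text)  # False
```

Once the document is parsed, the text no longer contains the literal characters &lt;, so a search for that exact string has nothing to match.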

python html beautifulsoup


3 Answers

Using str.replace

print(html.replace('&lt;', '<').replace('&gt;', '>'))

 <html>
 <body>
 <div>
 </div>
 <div class="first_class">
 <h1 id="Header_1">
 Header_1
 </h1>
 </div>
 </body>
 </html>

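To load the repaired markup into BeautifulSoup from a file, open the file first, replace the escaped text, and then pass the contents to BeautifulSoup. Something like this:

import bs4

with open('malformed.html') as f:
    malformed = f.read()

# Restore the escaped angle brackets before parsing
html = malformed.replace('&lt;', '<').replace('&gt;', '>')

# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(html, 'html.parser')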
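
As a minimal sketch of the file-based workflow described above, the replacement can be wrapped in a small helper and applied to the raw text before any parser sees it. The helper name restore_angle_brackets, the sample markup, and the temporary file are assumptions made for illustration; they are not part of the original answer.

```python
import tempfile
from pathlib import Path


def restore_angle_brackets(text: str) -> str:
    """Turn the entities &lt; and &gt; back into literal angle brackets."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


# Hypothetical input file, created here only so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "malformed.html"
    path.write_text("<div>\n&lt;/div&gt;\n<div>text</div>\n", encoding="utf-8")

    # Read the raw markup as text and repair it before parsing.
    repaired = restore_angle_brackets(path.read_text(encoding="utf-8"))

print(repaired)
```

The repaired string can then be handed to the parser, which is the order the answer recommends: fix the text first, parse second.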

edited Nov 17 '18 at 23:01

answered Nov 16 '18 at 3:14

aydow


@aydow That works on the self-contained example. However, when I load the HTML from a file into BeautifulSoup first and then try to replace, I get the error 'NoneType' object is not callable. Do you know how to work around that?

– Chris
Nov 17 '18 at 0:54

@Chris see updated answer

– aydow
Nov 17 '18 at 23:01


I think you need a function to decode them, such as unescape from the html module. (HTMLParser().unescape was deprecated and then removed in Python 3.9; html.unescape is its replacement.)

from html import unescape

html = """
 <html>
 <body>
 <div>
 &lt;/div&gt; <----- how to convert the line into </div>
 <div class="first_class">
 <h1 id="Header_1">
 Header_1
 </h1>
 </div>
 </body>
 </html> 
"""

print(unescape(html))

Output

<html>
 <body>
 <div>
 </div> <----- how to convert the line into </div>
 <div class="first_class">
 <h1 id="Header_1">
 Header_1
 </h1>
 </div>
 </body>
</html>
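
One caveat worth illustrating: html.unescape from the standard library decodes every entity it recognizes, not only &lt; and &gt;. The sample string below is made up for this sketch, and it compares full unescaping with a replacement that targets only the two angle-bracket entities.

```python
from html import unescape

# Made-up sample: an escaped closing tag next to an escaped ampersand.
sample = "Tom &amp; Jerry &lt;/div&gt;"

# html.unescape decodes all entities, including &amp;.
fully_decoded = unescape(sample)

# A targeted replacement only touches the angle-bracket entities.
brackets_only = sample.replace("&lt;", "<").replace("&gt;", ">")

print(fully_decoded)   # Tom & Jerry </div>
print(brackets_only)   # Tom &amp; Jerry </div>
```

If the document contains other entities that should stay escaped, the targeted replacement is the more conservative choice.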

answered Nov 16 '18 at 5:41

kcorlidy


@kcorlidy, the logic doesn't work when the HTML is first parsed into BeautifulSoup, like html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')). Error: a bytes-like object is required, not 'str'

– Chris
Nov 17 '18 at 0:01

I ran html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')) but did not get that error. If you want to read the file as bytes, use open('C:\FolderTest.html','rb'). By the way, you must close the file when you have finished reading. Use with open('Test.html',"rb") as fd: html = BeautifulSoup(fd.read(), 'html.parser')

– kcorlidy
Nov 17 '18 at 2:07


Try using regular expressions instead.

Something like:

html = re.sub("&lt;", "<", html)

for less-than and

html = re.sub("&gt;", ">", html)

for greater-than.

Make sure you import re first.

Edit: for reference on how to use re.sub - https://lzone.de/examples/Python%20re.sub

Edit2: After some further research it seems like str.replace() is faster, so you may want to use that instead.
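
Where a regular expression does add something over str.replace is in being selective. The sketch below, which is an illustration rather than part of the answer, restores only entity pairs that look like a bare opening or closing tag and leaves a lone &lt; in running text alone. The pattern and the names ESCAPED_TAG and restore_escaped_tags are assumptions, and the pattern deliberately ignores escaped tags that carry attributes.

```python
import re

# Matches &lt;name&gt; or &lt;/name&gt; where name looks like a tag name.
ESCAPED_TAG = re.compile(r"&lt;(/?[A-Za-z][A-Za-z0-9]*)&gt;")


def restore_escaped_tags(text: str) -> str:
    """Replace escaped bare tags with real ones; leave other entities alone."""
    return ESCAPED_TAG.sub(r"<\1>", text)


sample = "<div>if a &lt; b</div> &lt;/div&gt; &lt;p&gt;"
print(restore_escaped_tags(sample))
# <div>if a &lt; b</div> </div> <p>
```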

edited Nov 16 '18 at 6:06

answered Nov 16 '18 at 2:51

jwoff


@jwoff, this doesn't work when the HTML is loaded from the file into BeautifulSoup first, some operations are performed, and the replacement is attempted at the end. I tried converting the BeautifulSoup object into a string with str(html), replacing, and converting back to BeautifulSoup for nicely formatted output, but there were some small unexpected changes to the structure of the code.

– Chris
Nov 17 '18 at 1:01


