Accessing Unicode content from a DataFrame returns content with an additional backslash in Python 3

I have a CSV file containing tweets downloaded through the API. The tweets contain some Unicode characters, and I have a fairly good idea of how to decode them.

I load the CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns


One of the tweets is:

b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'

But when I access this tweet with the command

df['tweet'][0]

the output is returned in the following format:

"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"


I cannot figure out why this extra backslash is being added to the tweet. As a result, the content is not getting decoded. Below are a few rows from the DataFrame:

                  time  tweet
0  2018-11-02 05:55:46  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1  2018-11-02 05:46:41  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2  2018-11-02 03:44:35  b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3  2018-11-02 03:37:03  b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'

[Image: screenshot of 'sample.csv']



As I mentioned before, whenever any of these tweets is accessed directly, an extra backslash appears in the output.

Can anyone please explain why this is happening and how to avoid it?

Thanks










python-3.x pandas dataframe twitter unicode






asked yesterday by Nakul Sharma (edited 19 hours ago)











  • Show some sample lines from the original .CSV. It looks like it was written incorrectly in the first place. If you wrote the CSV, you might ask a new question about how to read from the API and write it to the CSV correctly. This looks like an XY Problem.
    – Mark Tolonen
    22 hours ago
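
A small sketch of what this comment is pointing at. It assumes (a guess, not something the question confirms) that the file was produced by handing UTF-8 encoded bytes to Python's csv writer, which converts non-string values with str(). The sample text and the helper name rows_to_csv are hypothetical.

```python
import csv
import io

# Stand-in for text returned by an API (hypothetical sample, not real data)
tweet_text = "a employee \U0001F602"


def rows_to_csv(rows):
    """Write rows to an in-memory CSV and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


# Passing bytes: the writer calls str() on them, so the file receives
# the characters b'...\xf0\x9f...' instead of the emoji itself.
wrong = rows_to_csv([["2018-11-02 05:55:46", tweet_text.encode("utf-8")]])

# Passing the str unchanged keeps the real characters in the file.
right = rows_to_csv([["2018-11-02 05:55:46", tweet_text]])

print(wrong)
print(right)
```

If the file is written to disk instead, opening it with open(path, 'w', encoding='utf-8', newline='') and writing the str values directly should avoid the problem at the source.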

















1 Answer (accepted)

You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter. That is, inside the CSV file itself you will find the literal characters b'\xff...'.

So, when you read it in Python, each value looks like a bytes object when printed (the ones represented with b'...'), but it is actually a string whose content is that representation. The doubled backslash in your output is only how Python displays a literal backslash in the repr of a string; the string itself contains a single backslash.
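
To make the difference visible, here is a minimal sketch. The sample string is a shortened, hypothetical stand-in for one cell of the tweet column.

```python
# A str whose *content* is the text of a bytes literal, as read from the CSV
s = r"b'employee \xf0\x9f\x98\x82'"

print(type(s))   # it is a str, not bytes
print(s)         # print() shows the content: one backslash per escape
print(repr(s))   # repr() escapes each backslash, so they appear doubled

# Counting confirms the string holds one backslash per escape sequence
backslashes = s.count("\\")
print(backslashes)
```

Evaluating df['tweet'][0] in an interactive session shows the repr, which is why the backslashes look doubled there but not in the printed DataFrame.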



One way to turn these back into proper strings is to let Python evaluate their content: they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval for this, as eval can execute arbitrary code.

So, after you have loaded your data into your DataFrame, this could fix your tweet column:



import ast

df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
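
A slightly more defensive variant of the same idea, as a sketch only. The function name decode_bytes_repr and the sample values are hypothetical. It additionally assumes that some cells may use the b"..." form (Python chooses double quotes when the bytes contain an apostrophe), may be missing, or may be malformed, and it replaces undecodable bytes instead of raising, which is a choice rather than a requirement.

```python
import ast


def decode_bytes_repr(value):
    """Turn the text "b'...'" back into a decoded str; pass anything else through."""
    if isinstance(value, str) and value.startswith(("b'", 'b"')):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # malformed literal (e.g. missing closing quote): leave as is
        if isinstance(parsed, bytes):
            # errors="replace" avoids an exception on tweets cut mid-character
            return parsed.decode("utf-8", errors="replace")
    return value


samples = [
    r"b'a employee \xf0\x9f\x98\x82 (via )'",
    r'b"that\xe2\x80\x99s confusing"',
    "plain text",
    None,
    "b'unterminated",
]
decoded = [decode_bytes_repr(v) for v in samples]
print(decoded)
```

On the DataFrame it would be applied the same way as above: df['tweet'] = df['tweet'].map(decode_bytes_repr).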





answered yesterday by jsbueno (edited 14 hours ago)











  • Thank you so much @jsbueno. Your solution worked like a charm.
    – Nakul Sharma
    19 hours ago

  • Since you asked, I wanted to mention that the content was the same as that of the CSV file; please see my edited post. It now includes a screenshot of the sample.csv file. Thank you once again.
    – Nakul Sharma
    19 hours ago















