Read in a CSV and recognize em dash (u'\u2014') and en dash (u'\u2013') in Python
I am trying to read in a file containing a lot of text with em dashes and/or en dashes; these are not to be confused with the regular hyphen (minus sign). The problem is that every time I read in this CSV, the dashes are turned into the replacement character (�). If I try to encode or decode the file, I just get error messages saying that UTF-8 doesn't recognize the dashes. Should I just try to write the CSV file from Python instead? This seems like a really dumb problem that should be easy to fix.
My code is:
import pandas as pd
df = pd.read_csv('csv file with em dash or en dash')
print(df)
My output is:
col_name
� �
I have tried replacing the dashes after the file has been read in, but that isn't working. I have also tried replacing the replacement character, but that hasn't worked either. My ideal solution would be for the dashes to show up just as they are in the CSV file. I think it has something to do with how the file is being read into Python, but whenever I try an encoder/decoder, I just get errors that the dashes aren't supported.
python unicode utf-8 ascii special-characters
1
Python 2 or Python 3? What happens if you write print(u"\u2014")? Is the dash output correctly? In case you are on Windows, do you know about chcp? See superuser.com/questions/269818/…
– quant
Nov 15 '18 at 21:47
stackoverflow.com/questions/33307690/…
– Xogle
Nov 15 '18 at 21:47
You need to determine the actual encoding of the file; it seems it's not UTF-8.
– Mark Ransom
Nov 15 '18 at 21:50
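One way to act on the comment above is to try a short list of candidate encodings on the raw bytes and keep the first one that decodes without error. The following is only a sketch: the helper name decode_with_fallback and the candidate list are hypothetical, and cp1252 (Windows-1252) is merely a guess at the file's real encoding, chosen because it stores the en dash and em dash as the single bytes 0x96 and 0x97, which are not valid on their own in UTF-8.

```python
# Hypothetical candidate list; cp1252 is an assumption, not a known fact
# about the asker's file.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_with_fallback(raw_bytes, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return raw_bytes.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Stand-in for the file contents: a header line, then an em dash and an
# en dash as cp1252 bytes (0x97 and 0x96).
sample = b"col_name\n\x97 \x96\n"

text, used = decode_with_fallback(sample)
print(used)
print(repr(text))
```

If a candidate such as cp1252 turns out to match, the same name can be passed to pandas through the encoding parameter of read_csv, for example pd.read_csv(path, encoding="cp1252"), where path stands for the real file name. Order matters in the candidate list: a permissive encoding such as latin-1 accepts any byte sequence, so it would hide a wrong guess if it were tried first.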
This is in python 2.7.
– mgh5021
Nov 15 '18 at 22:07
1
@mgh5021 No, because it is Python 2.7, and Python 2.7's internal default encoding is not UTF-8! But at least output of the character already works correctly, which is not always the case ...
– quant
Nov 15 '18 at 22:12
asked Nov 15 '18 at 21:42
mgh5021