How to remove extra | (pipe) separator from rows when loading | (pipe)-separated text into R

























I am reading text from a file in which the text is separated by | (pipes).



The text table looks like this (tweet id|date and time|tweet):



545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx


I am reading this information using the following code:



nyt <- read.table(file = ".../nytimeshealth.txt",
                  sep = "|",
                  header = F,
                  quote = "",
                  fill = T,
                  stringsAsFactors = F,
                  numerals = "no.loss",
                  encoding = "UTF-8",
                  na.strings = "NA")


Now, while most of the rows in the original file have 3 columns, each separated by a '|', a few of the rows have an additional '|' separator. That is to say, they have four columns, because some of the tweets themselves contain a | symbol.



545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx


I know that using the fill=T option in the read.table function above allows me to read rows of unequal length (blank fields are implicitly added in the empty cells).



So, the row above becomes



71 545074589374881792 Wed Dec 17 04:34:43 +0000 2014 National Briefing
72 New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx


However, now column 3 of row 71 has incomplete information, and columns 2 and 3 of row 72 are empty while column 1 does not contain the tweet ID but a part of the tweet. Is there any way I can avoid this? I would like to remove the extra | separator wherever it appears, so that I do not lose any information.



Is this possible while reading the text file into R, or is it something I will have to take care of before I start loading the text? What would be my best course of action?
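The wrapping described above can be reproduced on a small scale. The following is only a sketch with made-up rows in a temporary file (none of the names or rows come from the real data). It relies on the documented behaviour that read.table() decides the number of columns from the first five lines of input, so a row with an extra | further down is split across two rows when fill is on.

```r
# Six made-up rows; only the last one has an extra | inside the tweet.
rows <- c(
  paste0(1:5, "|Wed Dec 17 16:25:40 +0000 2014|plain tweet ", 1:5),
  "6|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: rest of tweet"
)
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

wrapped <- read.table(tmp, sep = "|", header = FALSE, quote = "",
                      fill = TRUE, stringsAsFactors = FALSE)

# read.table() sized the table from the first five lines (3 columns),
# so the 4th field of line 6 spills onto an extra row.
dim(wrapped)
wrapped[6:7, ]
```

Looking at the last two rows of `wrapped` shows the same pattern as rows 71 and 72 in the question.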
































  • @hrbrmstr: point noted regarding the use of T and F for TRUE and FALSE. Please also see other comment about your suggested method.

    – Anonymouse
    Nov 15 '18 at 2:56















asked Nov 15 '18 at 1:48 by Anonymouse
























2 Answers














I created a text file called text.txt containing the 3 lines you gave as examples of your data (the 2 straightforward lines without any | in the tweet, plus the one that has a | inside the tweet).



Here is the content of this file:



545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx


Code



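library(tidyverse)

# Read the raw lines, then split each one at the first two pipes only
# (n = 3 pieces), so any further | stays inside the tweet text.
readLines("text.txt", encoding = "UTF-8") %>%
  map(str_split_fixed, "\\|", 3) %>%
  map_df(as_tibble)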


Result



# A tibble: 3 x 3
V1 V2
<chr> <chr>
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
<chr>
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Say…
3 National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to …
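The same idea (split only at the first two pipes and leave the rest of the line untouched) can also be sketched in base R, without the tidyverse. This is an illustrative variant, not part of the original answer: the column names id, date and tweet, the variable names and the shortened sample lines are assumptions made for the example. In practice raw_lines would come from readLines().

```r
# Shortened sample lines; in practice: raw_lines <- readLines("text.txt")
raw_lines <- c(
  "545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts"
)

# Capture: everything up to the 1st pipe, everything up to the 2nd pipe,
# then the rest of the line (which may itself contain pipes).
pattern <- "^([^|]*)\\|([^|]*)\\|(.*)$"
parts <- regmatches(raw_lines, regexec(pattern, raw_lines))

# Element 1 of each match is the whole line; elements 2 to 4 are the groups.
# The id is kept as text so that the long tweet ids lose no digits.
tweets <- data.frame(
  id    = vapply(parts, `[`, character(1), 2),
  date  = vapply(parts, `[`, character(1), 3),
  tweet = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
tweets
```

A line with fewer than two pipes does not match the pattern and ends up as a row of NA values, which makes malformed lines easy to spot.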






























  • but why not just join any extra columns ex post facto?

    – hrbrmstr
    Nov 15 '18 at 2:14











  • Sorry, I am not sure I understand your question

    – prosoitos
    Nov 15 '18 at 2:15






  • Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

    – prosoitos
    Nov 15 '18 at 2:21











  • This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

    – prosoitos
    Nov 15 '18 at 2:22












  • @hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way, that is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72? Furthermore, how do I check for this in all lines that may be in this format (i.e., have more than 2 '|' separators) across the entire data frame? Thanks!

    – Anonymouse
    Nov 15 '18 at 2:52
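For the second part of that question (finding every line that has more than 2 '|' separators), one possible check is to count the pipes on the raw lines before any parsing. This is a base R sketch on made-up lines; the variable names are illustrative, and in practice raw_lines would come from readLines() on the real file.

```r
# Made-up lines; in practice: raw_lines <- readLines(".../nytimeshealth.txt")
raw_lines <- c(
  "1|Wed Dec 17 16:25:40 +0000 2014|no pipe in this tweet",
  "2|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England",
  "3|Wed Dec 17 05:00:00 +0000 2014|a | b | c"
)

# Count literal | characters on each line (fixed = TRUE, so no regex escaping).
n_pipes <- lengths(regmatches(raw_lines, gregexpr("|", raw_lines, fixed = TRUE)))

# A well-formed row has exactly 2 separators; anything above that means
# the tweet text itself contains at least one pipe.
extra <- which(n_pipes > 2)
raw_lines[extra]
```

The indices in extra are line numbers in the file, so they stay valid even if read.table() later wraps some of those lines onto additional rows.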

































Here is another solution, which gets back to your comment and reuses your initial code. It only works if the tweets contain at most one | each and at least one tweet actually contains a | (otherwise there is no V4 column to paste). If none of your tweets contain a |, or if some tweets contain more than one, it will break and you will have to edit it. So the other answer, which works regardless of the structure of your tweets, is better IMO.



I am still using my text.txt file:



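library(dplyr)  # provides %>%, mutate() and select()

df <- read.table(file = "text.txt",
                 sep = "|",
                 header = FALSE,
                 quote = "",
                 fill = TRUE,
                 stringsAsFactors = FALSE,
                 numerals = "no.loss",
                 encoding = "UTF-8",
                 na.strings = "NA")

# V4 holds whatever followed the extra | (blank for tweets without one);
# append it to V3, then drop it.
df %>%
  mutate(V3 = paste0(V3, V4)) %>%
  select(-V4)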


Result



 V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
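One caveat worth spelling out: read.table() decides the number of columns from the first five lines of input, so V4 only exists when a tweet containing a | happens to sit in those lines (as it does in the 3-line test file). A possible way around this, and around tweets with several pipes, is to pass col.names wide enough for the widest row and then glue everything from column 3 onwards back together. The following is a sketch on made-up rows under those assumptions, not a tested replacement for the code above; the names n_fields, wide and joined are illustrative.

```r
# Made-up rows with 0, 1 and 2 pipes inside the tweet text.
rows <- c(
  "1|Wed Dec 17 16:25:40 +0000 2014|no pipe here",
  "2|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England",
  "3|Wed Dec 17 05:00:00 +0000 2014|a | b | c"
)
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# Widest row in the whole file, not just in the first five lines.
n_fields <- max(count.fields(tmp, sep = "|", quote = "", comment.char = ""))

wide <- read.table(tmp, sep = "|", header = FALSE, quote = "", fill = TRUE,
                   comment.char = "", colClasses = "character",
                   col.names = paste0("V", seq_len(n_fields)))

# Glue columns 3..n back together with the pipe, skipping the blank
# padding that fill = TRUE added to the shorter rows.
tweet <- apply(wide[, 3:n_fields, drop = FALSE], 1,
               function(p) paste(p[!is.na(p) & p != ""], collapse = "|"))

joined <- data.frame(id = wide$V1, date = wide$V2, tweet = unname(tweet),
                     stringsAsFactors = FALSE)
joined
```

Unlike paste0(V3, V4), this keeps the | inside the tweet. One limitation of the sketch: an empty piece between two consecutive pipes (such as "a||b") is treated as padding and dropped.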






























  • Thanks for your reply, @prosoitos. I am running into another problem: when using the read.table function, a row like 585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods – from leeks to beets: http://xxxxxxxx is read in as 585625669805219840 Wed Apr 08 02:10:14 +0000 2015 12 spring. All the text from the hashtag #superfoods onwards is missing from the data frame. The same is true for other rows containing hashtags. I cannot understand why this would be the case.

    – Anonymouse
    Nov 15 '18 at 3:29












  • That's because read.table() treats # as a comment character by default. But if you use my first answer, you won't have this problem

    – prosoitos
    Nov 15 '18 at 3:30











  • Thanks for clarifying that...seems obvious now that you have mentioned it.

    – Anonymouse
    Nov 15 '18 at 3:31











  • read.table() interprets your data, so anything after # is considered a comment and is thus omitted. readLines(), however, reads lines of strings as is, without any interpretation. So I really suggest that you use my first answer and forget about read.table() in your case. There are countless things that could go wrong with read.table() if you have funky characters in your tweets, while the readLines() answer is pretty much bomb-proof

    – prosoitos
    Nov 15 '18 at 3:31
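For anyone who still wants to stay with read.table(), the hashtag truncation can be switched off through its comment.char argument. The sketch below uses a simplified version of the row from the comment above (the dash and the URL are left out) and reads it from a string rather than a file; the variable names are illustrative.

```r
sample_row <- "585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods from leeks to beets"

# Default comment.char = "#": the rest of the line after # is discarded.
cut_off <- read.table(text = sample_row, sep = "|", quote = "",
                      colClasses = "character")

# comment.char = "" turns comment handling off, so hashtags survive.
kept <- read.table(text = sample_row, sep = "|", quote = "",
                   comment.char = "", colClasses = "character")

cut_off$V3
kept$V3
```

This only addresses the # issue; extra pipes inside tweets still need one of the approaches from the answers.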







  • Was just going to write that... I did check ?readLines and read that. Thanks a lot for your time!

    – Anonymouse
    Nov 15 '18 at 3:45










First answer (readLines): answered Nov 15 '18 at 2:12 by prosoitos, edited Nov 15 '18 at 3:44












  • but why not just join any extra columns ex post facto?

    – hrbrmstr
    Nov 15 '18 at 2:14











  • Sorry, I am not sure I understand your question

    – prosoitos
    Nov 15 '18 at 2:15






  • 1





    Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

    – prosoitos
    Nov 15 '18 at 2:21











  • This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

    – prosoitos
    Nov 15 '18 at 2:22












  • @hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!

    – Anonymouse
    Nov 15 '18 at 2:52


















  • but why not just join any extra columns ex post facto?

    – hrbrmstr
    Nov 15 '18 at 2:14











  • Sorry, I am not sure I understand your question

    – prosoitos
    Nov 15 '18 at 2:15






  • 1





    Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

    – prosoitos
    Nov 15 '18 at 2:21











  • This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

    – prosoitos
    Nov 15 '18 at 2:22












  • @hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!

    – Anonymouse
    Nov 15 '18 at 2:52

















but why not just join any extra columns ex post facto?

– hrbrmstr
Nov 15 '18 at 2:14





but why not just join any extra columns ex post facto?

– hrbrmstr
Nov 15 '18 at 2:14













Sorry, I am not sure I understand your question

– prosoitos
Nov 15 '18 at 2:15





Sorry, I am not sure I understand your question

– prosoitos
Nov 15 '18 at 2:15




1




1





Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

– prosoitos
Nov 15 '18 at 2:21





Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

– prosoitos
Nov 15 '18 at 2:21













This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

– prosoitos
Nov 15 '18 at 2:22






This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

– prosoitos
Nov 15 '18 at 2:22














@hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!

– Anonymouse
Nov 15 '18 at 2:52






@hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!

– Anonymouse
Nov 15 '18 at 2:52














0














Here is another solution, to get back to your comment and to use your initial code. But this solution will only work if you have one | per tweet (you can have tweets with none as long as at least one tweet has one |). If you don't have any | in your tweets, or if some tweets have more than one |, it will break and you will have to edit it. So the other answer, which will work regardless of the structure of your tweets is better IMO.



I am still using my text.txt file:



df <- read.table(file = "text.txt", 
sep = "|",
header = F,
quote = "",
fill = T,
stringsAsFactors = F,
numerals = "no.loss",
encoding = "UTF-8",
na.strings = "NA")

df %>%
mutate(V3 = paste0(V3, V4)) %>%
select(- V4)


Result



 V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx





share|improve this answer

























  • Thanks for your reply, @prosoitos. I am running into another problem: when using the read.table function, a row like this 585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods – from leeks to beets: http://xxxxxxxx appears as 585625669805219840 Wed Apr 08 02:10:14 +0000 2015 12 spring All the text after the hastag #superfoods is not read into the data frame. This is true for other rows containing hashtags as well. Cannot understand why this would be the case.

    – Anonymouse
    Nov 15 '18 at 3:29












  • That's because # is a comment in R. But if you use my first answer, you won't have this problem

    – prosoitos
    Nov 15 '18 at 3:30











  • Thanks for clarifying that...seems obvious now that you have mentioned it.

    – Anonymouse
    Nov 15 '18 at 3:31











  • read.table() interprets your data. So anything after # is considered a comment and is thus omitted. readLines() however, read lines of strings as is, without any interpretation. So I really suggest that you use my first answer and forget about read.table() in your case. There are countless things that could go wrong with read.table() if you have funky characters in your tweets. While the readLines() answer is pretty bomb proof

    – prosoitos
    Nov 15 '18 at 3:31







  • 1





    Was just going to write that... I did check ?readLines and read that. Thanks a lot for your time!

    – Anonymouse
    Nov 15 '18 at 3:45















0














Here is another solution, to get back to your comment and to use your initial code. But this solution will only work if you have one | per tweet (you can have tweets with none as long as at least one tweet has one |). If you don't have any | in your tweets, or if some tweets have more than one |, it will break and you will have to edit it. So the other answer, which will work regardless of the structure of your tweets is better IMO.



I am still using my text.txt file:



df <- read.table(file = "text.txt", 
sep = "|",
header = F,
quote = "",
fill = T,
stringsAsFactors = F,
numerals = "no.loss",
encoding = "UTF-8",
na.strings = "NA")

df %>%
mutate(V3 = paste0(V3, V4)) %>%
select(- V4)


Result



                  V1                             V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
  V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
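
For files where a tweet may contain any number of | characters, a line-based approach avoids the column-count problem altogether. The sketch below is an illustration under stated assumptions and is not necessarily the code of the other answer: `parse_tweets()` is a hypothetical helper, and it assumes every line has the layout id|date|tweet, so only the first two | are treated as separators.

```r
# Hypothetical helper: split each line at the first two "|" only
parse_tweets <- function(lines) {
  id    <- sub("\\|.*$", "", lines)     # text before the first "|"
  rest  <- sub("^[^|]*\\|", "", lines)  # text after the first "|"
  date  <- sub("\\|.*$", "", rest)      # text before the next "|"
  tweet <- sub("^[^|]*\\|", "", rest)   # everything else, extra "|" included
  data.frame(id = id, date = date, tweet = tweet, stringsAsFactors = FALSE)
}

# Sample rows taken from the question
sample_lines <- c(
  "545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx"
)
tweets <- parse_tweets(sample_lines)

# With a real file (path is an assumption):
# tweets <- parse_tweets(readLines("text.txt", encoding = "UTF-8"))
```

The tweet column keeps the extra | as part of the text; if you would rather drop it, something like `gsub("|", "", tweets$tweet, fixed = TRUE)` should do. Because the lines are never passed through read.table(), hashtags are not treated as comments.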





edited Nov 15 '18 at 5:43

























answered Nov 15 '18 at 3:09









prosoitos
