How to remove extra | (pipe) separator from rows when loading | (pipe)-separated text into R

I am reading text from a file in which the text is separated by | (pipes).

The text table looks like this (tweet id|date and time|tweet):

545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx

I am reading this information using the following code:

nyt <- read.table(file=".../nytimeshealth.txt", 
 sep="|", 
 header = F, 
 quote="", 
 fill=T, 
 stringsAsFactors = F,
 numerals ="no.loss",
 encoding = "UTF-8",
 na.strings = "NA")

Now, while most of the rows in the original file have 3 columns, each separated by a '|', a few of the rows have an additional '|' separator. That is to say, they have four columns, because some of the tweets themselves contain a | symbol.

545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx

I know that using the fill=T option in the read.table call above lets me read rows of unequal length (blank fields are implicitly added in the empty cells).

So, the row above becomes

71 545074589374881792 Wed Dec 17 04:34:43 +0000 2014 National Briefing
72 New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx

However, now column 3 of row 71 has incomplete information, and columns 2 and 3 of row 72 are empty while column 1 does not contain the tweet ID but a part of the tweet. Is there any way I can avoid this? I would like to remove the extra | separator wherever it appears, so that I do not lose any information.

Is this possible while reading the text file into R, or is it something I will have to take care of before I start loading the text? What would be my best course of action?

asked Nov 15 '18 at 1:48

Anonymouse


@hrbrmstr: point noted regarding the use of T and F for TRUE and FALSE. Please also see other comment about your suggested method.

– Anonymouse
Nov 15 '18 at 2:56


2 Answers

I created a text file called text.txt with the 3 lines you provided as an example of your data (the 2 easy lines without any | in the tweet, as well as the one that has a | inside the tweet).

Here is the content of this file:

545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx
545235402156937217|Wed Dec 17 15:13:44 +0000 2014|For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx
545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx

Code

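library(tidyverse)

# Read the raw lines, then split each one on the first two pipes only,
# so any further pipes stay inside the third field (the tweet).
readLines("text.txt", encoding = "UTF-8") %>%
  map(., str_split_fixed, "\\|", 3) %>%
  map_df(., as_tibble)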

Result

# A tibble: 3 x 3
 V1 V2 
 <chr> <chr> 
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
 V3 
 <chr> 
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx 
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Say…
3 National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to …

edited Nov 15 '18 at 3:44

answered Nov 15 '18 at 2:12

prosoitos


but why not just join any extra columns ex post facto?

– hrbrmstr
Nov 15 '18 at 2:14

Sorry, I am not sure I understand your question

– prosoitos
Nov 15 '18 at 2:15

Do you mean, split at every |, then join extra columns? That would work too. There are many solutions. I like this one because it is compact and avoids unnecessary steps (such as splitting to join again later)

– prosoitos
Nov 15 '18 at 2:21

This also works regardless of how many | you have in the tweets. So you don't have to worry about some tweets having more complex structures

– prosoitos
Nov 15 '18 at 2:22
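
To illustrate the point about tweets that contain several pipes, here is a minimal base R sketch of the same idea: split on the first two pipes only and keep the rest of the line intact. The two sample lines and the column names id, date and tweet are made up for the illustration; they are not from the original file.

```r
# Hypothetical sample lines; the second tweet contains two pipes.
lines <- c(
  "1|Wed Dec 17 16:25:40 +0000 2014|No pipe here",
  "2|Wed Dec 17 04:34:43 +0000 2014|A | B | C"
)

# Capture the text before the first pipe, the text before the second pipe,
# and then the remainder of the line (which may contain further pipes).
pattern <- "^([^|]*)\\|([^|]*)\\|(.*)$"
parts <- regmatches(lines, regexec(pattern, lines))

# Each element of `parts` is c(full match, group 1, group 2, group 3).
# IDs stay character, since 18-digit tweet IDs do not fit exactly in a double.
tweets <- data.frame(
  id    = vapply(parts, `[`, character(1), 2),
  date  = vapply(parts, `[`, character(1), 3),
  tweet = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

tweets
```

A line with fewer than two pipes does not match the pattern and ends up as a row of NA values, which makes such lines easy to spot afterwards.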

@hrbrmstr: While the answer by prosoitos solves the issue, I would also like to know how to do this the other way? That is, by joining the columns. How would I go about joining column 3 of row 71 with column 1 of row 72, while discarding columns 2 and 3 of row 72. Furthermore, how do I check this in all lines that may be in this format (i.e., have more than 2 '|' separators) for the entire data frame? Thanks!

– Anonymouse
Nov 15 '18 at 2:52
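
For the second part of this comment (finding every line that has more than 2 '|' separators), one possible approach is to count the pipes in the raw lines before any parsing happens. This is only a sketch: the temporary file below stands in for the real nytimeshealth.txt, and the names n_pipes and suspect are made up for the example.

```r
# Hypothetical stand-in for the real file: two of the lines from the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "545253503963516928|Wed Dec 17 16:25:40 +0000 2014|Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx",
  "545074589374881792|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx"
), path)

lines <- readLines(path, encoding = "UTF-8")

# Count literal pipes per line; fixed = TRUE avoids regex escaping.
n_pipes <- lengths(regmatches(lines, gregexpr("|", lines, fixed = TRUE)))

# Line numbers with more than the two expected separators.
suspect <- which(n_pipes > 2)

lines[suspect]
```

Because the count is taken on the untouched lines, it is not affected by how read.table would wrap or pad the rows.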


Here is another solution, to get back to your comment and to reuse your initial code. Note that it only works if no tweet contains more than one |, and at least one tweet contains exactly one (tweets with none are fine as long as that holds). If none of your tweets contain a |, or if some contain more than one, it will break and you will have to edit it. So the other answer, which works regardless of the structure of your tweets, is better IMO.

I am still using my text.txt file:

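library(dplyr)

df <- read.table(file = "text.txt",
                 sep = "|",
                 header = FALSE,
                 quote = "",
                 fill = TRUE,
                 stringsAsFactors = FALSE,
                 numerals = "no.loss",
                 encoding = "UTF-8",
                 na.strings = "NA")

# Glue the overflow column V4 back onto the tweet, then drop it
# (the | itself is not restored by paste0)
df %>%
  mutate(V3 = paste0(V3, V4)) %>%
  select(-V4)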

Result

 V1 V2
1 545253503963516928 Wed Dec 17 16:25:40 +0000 2014
2 545235402156937217 Wed Dec 17 15:13:44 +0000 2014
3 545074589374881792 Wed Dec 17 04:34:43 +0000 2014
 V3
1 Massachusetts Pharmacy Owners Arrested in Meningitis Deaths http://xxxxxxxxx 
2 For First Time, Treatment Helps Patients With Worst Kind of Stroke, Study Says http://xxxxxxxxx 
3 National Briefing New England: Massachusetts: Sex-Change Surgery Denied to Inmate http://xxxxxxxxx
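
The join idea can probably be generalised to any number of extra pipes. The sketch below is one way to do it, not a tested recipe for the real file. It relies on two documented behaviours: read.table decides the number of columns from the first five lines of input unless col.names is longer (which would explain why row 71 in the question wrapped onto a new row), and count.fields reports the number of fields on each line. The sample lines are made up, each line is assumed to have at least three fields, and the names n_fields, max_fields and result are hypothetical.

```r
# Hypothetical sample with 0, 1 and 2 extra pipes inside the tweet.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1|Wed Dec 17 16:25:40 +0000 2014|No pipe in this tweet",
  "2|Wed Dec 17 04:34:43 +0000 2014|National Briefing | New England",
  "3|Wed Apr 08 02:10:14 +0000 2015|A | B | C #tag"
), path)

# Number of fields on every line; comment.char = "" keeps hashtags intact.
n_fields <- count.fields(path, sep = "|", quote = "", comment.char = "")
max_fields <- max(n_fields)

# Supplying enough col.names stops long rows from wrapping onto a new row.
df <- read.table(path, sep = "|", header = FALSE, quote = "", fill = TRUE,
                 comment.char = "", colClasses = "character",
                 blank.lines.skip = TRUE,
                 col.names = paste0("V", seq_len(max_fields)))

# For each row, re-join the fields that belong to the tweet, putting back
# the pipe that read.table consumed as a separator.
tweet <- vapply(seq_len(nrow(df)), function(i) {
  last <- max(3, n_fields[i])
  paste(unlist(df[i, 3:last]), collapse = "|")
}, character(1))

result <- data.frame(id = df$V1, date = df$V2, tweet = tweet,
                     stringsAsFactors = FALSE)
result
```

Using the per-line field count, rather than dropping empty strings, keeps a tweet that legitimately contains an empty segment such as "A||B" from being altered.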

edited Nov 15 '18 at 5:43

answered Nov 15 '18 at 3:09

prosoitos


Thanks for your reply, @prosoitos. I am running into another problem: when using the read.table function, a row like this: 585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods – from leeks to beets: http://xxxxxxxx appears as: 585625669805219840 Wed Apr 08 02:10:14 +0000 2015 12 spring. All the text from the hashtag #superfoods onwards is not read into the data frame. This is true for other rows containing hashtags as well. I cannot understand why this would be the case.

– Anonymouse
Nov 15 '18 at 3:29

That's because # is a comment in R. But if you use my first answer, you won't have this problem

– prosoitos
Nov 15 '18 at 3:30

Thanks for clarifying that...seems obvious now that you have mentioned it.

– Anonymouse
Nov 15 '18 at 3:31

read.table() interprets your data, so anything after # is considered a comment and is thus omitted. readLines(), however, reads lines of strings as is, without any interpretation. So I really suggest that you use my first answer and forget about read.table() in your case. There are countless things that could go wrong with read.table() if you have funky characters in your tweets, while the readLines() answer is pretty bombproof.

– prosoitos
Nov 15 '18 at 3:31
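
For anyone who does want to stay with read.table(), the comment handling described here comes from its comment.char argument, which defaults to "#". A small sketch of switching it off follows; the one-line sample is a simplified, ASCII-only version of the row quoted above, and the names default_read and full_read are made up for the example.

```r
# Simplified, hypothetical version of the row with a hashtag.
path <- tempfile(fileext = ".txt")
writeLines(
  "585625669805219840|Wed Apr 08 02:10:14 +0000 2015|12 spring #superfoods from leeks to beets",
  path
)

# Default behaviour: everything from "#" onwards is treated as a comment.
default_read <- read.table(path, sep = "|", quote = "", header = FALSE,
                           colClasses = "character")

# comment.char = "" turns comment handling off, so the hashtag survives.
full_read <- read.table(path, sep = "|", quote = "", header = FALSE,
                        colClasses = "character", comment.char = "")

default_read$V3
full_read$V3
```

This only addresses the hashtag truncation; the extra-pipe problem from the question still has to be handled separately.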

Was just going to write that... I did check ?readLines and read that. Thanks a lot for your time!

– Anonymouse
Nov 15 '18 at 3:45
