Filtering the dataframe by matching values of two columns in different rows
This is a problem pretty similar to Filtering the dataframe by matching values of two columns, but I think it is still different: in that example the two columns only need to match within the same row. I am looking to filter a data.table on matching values in two columns, where the matches can be scattered across the whole range of the data.table.
The resulting data.table should look like the one below; it should keep only the rows that have matching values between msg_seq_nb and orig_msg_seq_nb. I know that I could achieve this by merging the data.table with itself on the two columns, but I thought there must be an easier way that avoids the merge and the additional overhead it might create.
dt <- structure(list(
  cusip_id = c("00208J702", "00208J702", "00208J702", "00208J702",
               "010284AH0", "010284AH0", "010284AH0", "U9220JAC4", "U9220JAC4"),
  orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"),
  msg_seq_nb = c("0008911", "0009000", "0009075", "0009152", "0031914",
                 "0003081", "0003105", "0012889", "0017534")),
  row.names = c(NA, -9L), class = c("data.table", "data.frame"))
> dt
cusip_id orig_msg_seq_nb msg_seq_nb
1: 00208J702 <NA> 0008911
2: 00208J702 <NA> 0009000
3: 00208J702 <NA> 0009075
4: 00208J702 <NA> 0009152
5: 010284AH0 <NA> 0031914
6: 010284AH0 <NA> 0003081
7: 010284AH0 0003081 0003105
8: U9220JAC4 <NA> 0012889
9: U9220JAC4 0012889 0017534
The result should look like this:
cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 <NA> 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 <NA> 0012889
4: U9220JAC4 0012889 0017534
Note that msg_seq_nb and orig_msg_seq_nb are not unique in this dataset.
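For reference, the merge the question mentions could be sketched as a self-join on cusip_id plus the two sequence-number columns. This is only a sketch: the name pairs and the ".orig" suffix are illustrative and not from the original post.

```r
library(data.table)

dt <- structure(list(
  cusip_id = c("00208J702", "00208J702", "00208J702", "00208J702",
               "010284AH0", "010284AH0", "010284AH0", "U9220JAC4", "U9220JAC4"),
  orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"),
  msg_seq_nb = c("0008911", "0009000", "0009075", "0009152", "0031914",
                 "0003081", "0003105", "0012889", "0017534")),
  row.names = c(NA, -9L), class = c("data.table", "data.frame"))
setDT(dt)  # repair the structure()-built object for data.table use

# Pair each row with the row whose msg_seq_nb equals this row's
# orig_msg_seq_nb, within the same cusip_id. Rows whose orig_msg_seq_nb
# is NA reference nothing and drop out of the join.
pairs <- merge(dt, dt,
               by.x = c("cusip_id", "orig_msg_seq_nb"),
               by.y = c("cusip_id", "msg_seq_nb"),
               suffixes = c("", ".orig"))
```

With the example data this yields the two follow-up rows (msg_seq_nb 0003105 and 0017534) joined to the rows they reference; taking the union of both sides of the join would reproduce the four-row desired output.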
r dplyr data.table
So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output. – Sotos, Nov 16 '18 at 9:51
The cusip doesn't matter for the matching to work. The dt is basically the desired output; the initial data.table is the same but with many more entries in the columns. I don't know if that would make it easier to get. – hannes101, Nov 16 '18 at 10:05
It is better to also give the initial table as well. No need for a lot of entries, just a few more that did not make the filtering. – Sotos, Nov 16 '18 at 10:17
asked Nov 16 '18 at 9:43 by hannes101, edited Nov 16 '18 at 11:46
1 Answer
The easiest way I can think of:
dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]
cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534
Unfortunately, this is not possible, since the msg_seq_nb values are not unique, and there will be many more matches than the correct ones. – hannes101, Nov 16 '18 at 11:46
@hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the number of rows, then you need an additional condition. Do you want to filter those rows where cusip_id matches as well? – MRau, Nov 16 '18 at 11:53
I just read that the msg_seq_nb are only unique per trading day, so yes, I would need an additional condition. I think I'll just go with a merge between these two tables. – hannes101, Nov 16 '18 at 11:59
@hannes101 The example you provided doesn't make it possible to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ] – MRau, Nov 16 '18 at 12:45
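Following up on MRau's suggestion, one way to add the extra condition without a merge is to apply the same filter within each cusip_id group, so that sequence numbers from different cusips can never match each other. This is a sketch under the assumption, taken from the comment thread, that grouping on cusip_id is enough; if msg_seq_nb is only unique per trading day, the date column would have to be added to by as well.

```r
library(data.table)

dt <- structure(list(
  cusip_id = c("00208J702", "00208J702", "00208J702", "00208J702",
               "010284AH0", "010284AH0", "010284AH0", "U9220JAC4", "U9220JAC4"),
  orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"),
  msg_seq_nb = c("0008911", "0009000", "0009075", "0009152", "0031914",
                 "0003081", "0003105", "0012889", "0017534")),
  row.names = c(NA, -9L), class = c("data.table", "data.frame"))
setDT(dt)  # repair the structure()-built object for data.table use

# Grouped variant of the answer's filter: %in% now only sees the
# sequence numbers of the current cusip_id group, not the whole table.
res <- dt[, .SD[orig_msg_seq_nb %in% msg_seq_nb |
                msg_seq_nb %in% orig_msg_seq_nb],
          by = cusip_id]
```

On the example data this returns the same four rows as the ungrouped filter, but it would no longer match a msg_seq_nb against an orig_msg_seq_nb that belongs to a different cusip.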
answered Nov 16 '18 at 10:31 by MRau