Filtering the dataframe by matching values of two columns in different rows










0















This is a problem pretty similar to this one Filtering the dataframe by matching values of two columns, but I think still different. In the example there it just is required to match the two columns on values for the same row. I am looking to filter a data.table on matching values in two columns, which can be scattered across the whole range of the data.table.



The resulting data.table should look like this, and it should basically filter the rows, which have matching values between msg_seq_nb and orig_msg_seq_nb. I know that I could achieve that by merging the two data.tables on the two columns, but I thought there must be an easier way, which would prevent the merging and the additional overhead this might create.



 dt <- structure(list(cusip_id = c("00208J702", "00208J702", "00208J702", 
"00208J702", "010284AH0", "010284AH0", "010284AH0", "U9220JAC4",
"U9220JAC4"), orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"), msg_seq_nb = c("0008911", "0009000", "0009075","0009152", "0031914", "0003081", "0003105", "0012889", "0017534" )), row.names = c(NA, -9L), class = c("data.table", "data.frame" ))

> dt
cusip_id orig_msg_seq_nb msg_seq_nb
1: 00208J702 <NA> 0008911
2: 00208J702 <NA> 0009000
3: 00208J702 <NA> 0009075
4: 00208J702 <NA> 0009152
5: 010284AH0 <NA> 0031914
6: 010284AH0 <NA> 0003081
7: 010284AH0 0003081 0003105
8: U9220JAC4 <NA> 0012889
9: U9220JAC4 0012889 0017534


The result should look like this



 cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 <NA> 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 <NA> 0012889
4: U9220JAC4 0012889 0017534


Note that the msg_seq_nb and the orig_msg_seq_nb are not unique in this dataset.










share|improve this question
























  • So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

    – Sotos
    Nov 16 '18 at 9:51











  • The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

    – hannes101
    Nov 16 '18 at 10:05






  • 1





    It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

    – Sotos
    Nov 16 '18 at 10:17















0















This is a problem pretty similar to this one Filtering the dataframe by matching values of two columns, but I think still different. In the example there it just is required to match the two columns on values for the same row. I am looking to filter a data.table on matching values in two columns, which can be scattered across the whole range of the data.table.



The resulting data.table should look like this, and it should basically filter the rows, which have matching values between msg_seq_nb and orig_msg_seq_nb. I know that I could achieve that by merging the two data.tables on the two columns, but I thought there must be an easier way, which would prevent the merging and the additional overhead this might create.



 dt <- structure(list(cusip_id = c("00208J702", "00208J702", "00208J702", 
"00208J702", "010284AH0", "010284AH0", "010284AH0", "U9220JAC4",
"U9220JAC4"), orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"), msg_seq_nb = c("0008911", "0009000", "0009075","0009152", "0031914", "0003081", "0003105", "0012889", "0017534" )), row.names = c(NA, -9L), class = c("data.table", "data.frame" ))

> dt
cusip_id orig_msg_seq_nb msg_seq_nb
1: 00208J702 <NA> 0008911
2: 00208J702 <NA> 0009000
3: 00208J702 <NA> 0009075
4: 00208J702 <NA> 0009152
5: 010284AH0 <NA> 0031914
6: 010284AH0 <NA> 0003081
7: 010284AH0 0003081 0003105
8: U9220JAC4 <NA> 0012889
9: U9220JAC4 0012889 0017534


The result should look like this



 cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 <NA> 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 <NA> 0012889
4: U9220JAC4 0012889 0017534


Note that the msg_seq_nb and the orig_msg_seq_nb are not unique in this dataset.










share|improve this question
























  • So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

    – Sotos
    Nov 16 '18 at 9:51











  • The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

    – hannes101
    Nov 16 '18 at 10:05






  • 1





    It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

    – Sotos
    Nov 16 '18 at 10:17













0












0








0








This is a problem pretty similar to this one Filtering the dataframe by matching values of two columns, but I think still different. In the example there it just is required to match the two columns on values for the same row. I am looking to filter a data.table on matching values in two columns, which can be scattered across the whole range of the data.table.



The resulting data.table should look like this, and it should basically filter the rows, which have matching values between msg_seq_nb and orig_msg_seq_nb. I know that I could achieve that by merging the two data.tables on the two columns, but I thought there must be an easier way, which would prevent the merging and the additional overhead this might create.



 dt <- structure(list(cusip_id = c("00208J702", "00208J702", "00208J702", 
"00208J702", "010284AH0", "010284AH0", "010284AH0", "U9220JAC4",
"U9220JAC4"), orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"), msg_seq_nb = c("0008911", "0009000", "0009075","0009152", "0031914", "0003081", "0003105", "0012889", "0017534" )), row.names = c(NA, -9L), class = c("data.table", "data.frame" ))

> dt
cusip_id orig_msg_seq_nb msg_seq_nb
1: 00208J702 <NA> 0008911
2: 00208J702 <NA> 0009000
3: 00208J702 <NA> 0009075
4: 00208J702 <NA> 0009152
5: 010284AH0 <NA> 0031914
6: 010284AH0 <NA> 0003081
7: 010284AH0 0003081 0003105
8: U9220JAC4 <NA> 0012889
9: U9220JAC4 0012889 0017534


The result should look like this



 cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 <NA> 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 <NA> 0012889
4: U9220JAC4 0012889 0017534


Note that the msg_seq_nb and the orig_msg_seq_nb are not unique in this dataset.










share|improve this question
















This is a problem pretty similar to this one Filtering the dataframe by matching values of two columns, but I think still different. In the example there it just is required to match the two columns on values for the same row. I am looking to filter a data.table on matching values in two columns, which can be scattered across the whole range of the data.table.



The resulting data.table should look like this, and it should basically filter the rows, which have matching values between msg_seq_nb and orig_msg_seq_nb. I know that I could achieve that by merging the two data.tables on the two columns, but I thought there must be an easier way, which would prevent the merging and the additional overhead this might create.



 dt <- structure(list(cusip_id = c("00208J702", "00208J702", "00208J702", 
"00208J702", "010284AH0", "010284AH0", "010284AH0", "U9220JAC4",
"U9220JAC4"), orig_msg_seq_nb = c(NA, NA, NA, NA, NA, NA, "0003081", NA, "0012889"), msg_seq_nb = c("0008911", "0009000", "0009075","0009152", "0031914", "0003081", "0003105", "0012889", "0017534" )), row.names = c(NA, -9L), class = c("data.table", "data.frame" ))

> dt
cusip_id orig_msg_seq_nb msg_seq_nb
1: 00208J702 <NA> 0008911
2: 00208J702 <NA> 0009000
3: 00208J702 <NA> 0009075
4: 00208J702 <NA> 0009152
5: 010284AH0 <NA> 0031914
6: 010284AH0 <NA> 0003081
7: 010284AH0 0003081 0003105
8: U9220JAC4 <NA> 0012889
9: U9220JAC4 0012889 0017534


The result should look like this



 cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 <NA> 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 <NA> 0012889
4: U9220JAC4 0012889 0017534


Note that the msg_seq_nb and the orig_msg_seq_nb are not unique in this dataset.







r dplyr data.table






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 16 '18 at 11:46







hannes101

















asked Nov 16 '18 at 9:43









hannes101hannes101

732719




732719












  • So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

    – Sotos
    Nov 16 '18 at 9:51











  • The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

    – hannes101
    Nov 16 '18 at 10:05






  • 1





    It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

    – Sotos
    Nov 16 '18 at 10:17

















  • So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

    – Sotos
    Nov 16 '18 at 9:51











  • The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

    – hannes101
    Nov 16 '18 at 10:05






  • 1





    It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

    – Sotos
    Nov 16 '18 at 10:17
















So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

– Sotos
Nov 16 '18 at 9:51





So for your example above, you would filter out all observations of cusip_id == 010284AH0? Maybe add a bigger example along with expected output

– Sotos
Nov 16 '18 at 9:51













The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

– hannes101
Nov 16 '18 at 10:05





The cusip doesn't matter for the matching to work. The dt is basically the desired output, and the initial data.table is just the same but with a lot more entries in the columns, don't know if that would make it easier to get.

– hannes101
Nov 16 '18 at 10:05




1




1





It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

– Sotos
Nov 16 '18 at 10:17





It is better to also give the initial table as well...no need for a lot of entries. just a few more that did not make the filtering

– Sotos
Nov 16 '18 at 10:17












1 Answer
1






active

oldest

votes


















1














The easiest way I can think of:



dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]

cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534





share|improve this answer























  • Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

    – hannes101
    Nov 16 '18 at 11:46











  • @hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

    – MRau
    Nov 16 '18 at 11:53











  • I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

    – hannes101
    Nov 16 '18 at 11:59











  • @hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

    – MRau
    Nov 16 '18 at 12:45












Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53335142%2ffiltering-the-dataframe-by-matching-values-of-two-columns-in-different-rows%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














The easiest way I can think of:



dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]

cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534





share|improve this answer























  • Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

    – hannes101
    Nov 16 '18 at 11:46











  • @hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

    – MRau
    Nov 16 '18 at 11:53











  • I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

    – hannes101
    Nov 16 '18 at 11:59











  • @hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

    – MRau
    Nov 16 '18 at 12:45
















1














The easiest way I can think of:



dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]

cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534





share|improve this answer























  • Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

    – hannes101
    Nov 16 '18 at 11:46











  • @hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

    – MRau
    Nov 16 '18 at 11:53











  • I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

    – hannes101
    Nov 16 '18 at 11:59











  • @hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

    – MRau
    Nov 16 '18 at 12:45














1












1








1







The easiest way I can think of:



dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]

cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534





share|improve this answer













The easiest way I can think of:



dt[orig_msg_seq_nb %in% msg_seq_nb | msg_seq_nb %in% orig_msg_seq_nb, ]

cusip_id orig_msg_seq_nb msg_seq_nb
1: 010284AH0 NA 0003081
2: 010284AH0 0003081 0003105
3: U9220JAC4 NA 0012889
4: U9220JAC4 0012889 0017534






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 16 '18 at 10:31









MRauMRau

32318




32318












  • Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

    – hannes101
    Nov 16 '18 at 11:46











  • @hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

    – MRau
    Nov 16 '18 at 11:53











  • I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

    – hannes101
    Nov 16 '18 at 11:59











  • @hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

    – MRau
    Nov 16 '18 at 12:45


















  • Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

    – hannes101
    Nov 16 '18 at 11:46











  • @hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

    – MRau
    Nov 16 '18 at 11:53











  • I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

    – hannes101
    Nov 16 '18 at 11:59











  • @hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

    – MRau
    Nov 16 '18 at 12:45

















Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

– hannes101
Nov 16 '18 at 11:46





Unfortunately, this is not possible, since the msg_seq_nb are not unique and it will be much more matches than the correct ones.

– hannes101
Nov 16 '18 at 11:46













@hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

– MRau
Nov 16 '18 at 11:53





@hannes101 If you look for matching values between msg_seq_nb and orig_msg_seq_nb, then this behavior is correct. If you want to further reduce the numer of rows, then you need an additional condition. Do you want to filter those rows, where cusip_id matches as well?

– MRau
Nov 16 '18 at 11:53













I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

– hannes101
Nov 16 '18 at 11:59





I just read that the msg_seq_nb are only unique per trading day, so yes I would need an additional condition. I think I'll just go with a merge between these two tables.

– hannes101
Nov 16 '18 at 11:59













@hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

– MRau
Nov 16 '18 at 12:45






@hannes101 The example you provided doesn't enable to test this additional condition, so I can only guess, but you can try this: ids <- dt[orig_msg_seq_nb %in% msg_seq_nb, cusip_id]; dt[orig_msg_seq_nb %in% msg_seq_nb | (msg_seq_nb %in% orig_msg_seq_nb & cusip_id %in% ids), ]

– MRau
Nov 16 '18 at 12:45




















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53335142%2ffiltering-the-dataframe-by-matching-values-of-two-columns-in-different-rows%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Top Tejano songwriter Luis Silva dead of heart attack at 64

政党

天津地下鉄3号線