Coding Help on Stata










I have an unbalanced panel data set that gives me information on how much banks lend in different areas. The geography ID and the bank ID are numeric variables that were created with a Stata command like egen id = group(var).



The geography id goes from 1 to n and the bank id goes from 1 to k. To give you a more concrete idea of how my data look:



Geography ID (gid) | Bank ID (bid) | lending
---------------------------------------------
                 1 |             1 |      25
                 1 |             2 |      32
                 1 |             4 |      83
                 2 |             1 |      76
                 2 |             3 |      22
                 3 |             2 |      42
                 3 |             3 |      12
                 3 |             5 |      22


My final goal is to create a dataframe that has all the pairwise combinations of the geographical areas such that:



      1      2      3    ...    n
  ---------------------------------
1 | (1,1)  (1,2)  (1,3)  ...  (1,n)
2 | (2,1)  (2,2)  (2,3)  ...  (2,n)
. |   .      .      .
n | (n,1)    .           ...  (n,n)


Such that entry (i,j) gives me:



(i,j) = (lending from banks operating in both area i and area j) / (total lending in areas i and j)


So, for instance, given the data above:

(1,1) = 1
(1,2) = (25 + 76) / (25 + 32 + 83 + 76 + 22) = 101/238 ≈ 0.424
(1,3) = (32 + 42) / (25 + 32 + 83 + 42 + 12 + 22) = 74/216 ≈ 0.343

since bank 1 is the only bank operating in both areas 1 and 2, and bank 2 is the only bank operating in both areas 1 and 3.


I have a feeling that, as a first step, I should use levelsof and bysort in a loop, but I am unsure how exactly to tackle the problem.



Even if you can't provide an exact solution, I would be extremely grateful for any help or suggestions. Although I prefer Stata, I also have some knowledge of Matlab/R, so if you think one of those is more suitable for this problem, I am open to suggestions.










r matlab stata






edited Nov 15 '18 at 9:12 by Nick Cox
asked Nov 14 '18 at 23:56 by nda












  • In R that could be accomplished with xtabs(Lending ~ Geograph_ID + Bank_ID, data=dat), or possibly something with tapply if there are multiple entries in any given cell (which is not present in the data schema you displayed). That would be an N x K matrix, not an N x N one. It's not clear how there could be a different total in an i-j combo than a single item in that same combo.

    – 42-
    Nov 15 '18 at 0:09
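
For concreteness, here is a minimal sketch of the N x K table that xtabs call builds from the question's example data (the variable names Geograph_ID, Bank_ID and Lending follow the comment, not the asker's actual dataset):

dat <- data.frame(
  Geograph_ID = c(1, 1, 1, 2, 2, 3, 3, 3),
  Bank_ID     = c(1, 2, 4, 1, 3, 2, 3, 5),
  Lending     = c(25, 32, 83, 76, 22, 42, 12, 22)
)
xtabs(Lending ~ Geograph_ID + Bank_ID, data = dat)
#            Bank_ID
# Geograph_ID  1  2  3  4  5
#           1 25 32  0 83  0
#           2 76  0 22  0  0
#           3  0 42 12  0 22

Each row is a geography and each column a bank, so this is the N x K lending table rather than the N x N area-by-area matrix the question asks for.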












  • Thank you for your answer. Sorry if I wasn't clear, but it should be N x N, since I want all the pairwise combinations of the N different areas and not the pairwise combinations of area-bank (which would be N x K).

    – nda
    Nov 15 '18 at 0:16

















1 Answer
Here's an R method:



x <- data.frame(
  geoid   = c(1, 1, 1, 2, 2, 3, 3, 3),
  bankid  = c(1, 2, 4, 1, 3, 2, 3, 5),
  lending = c(25, 32, 83, 76, 22, 42, 12, 22)
)

myfunc <- function(x, i, j) {
  # rows belonging to either area, and banks present in both areas
  geos  <- x$geoid %in% c(i, j)
  banks <- with(x, intersect(bankid[geoid == i], bankid[geoid == j]))
  # lending by the shared banks, divided by total lending in the two areas
  with(x, sum(lending[geos & bankid %in% banks]) / sum(lending[geos]))
}

outer(unique(x$geoid), unique(x$geoid),
      function(i, j) mapply(myfunc, list(x), i, j))
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.4243697 0.3425926
# [2,] 0.4243697 1.0000000 0.1954023
# [3,] 0.3425926 0.1954023 1.0000000


It's not the most efficient, but it's a start. It's difficult (I think) to do this truly vectorized, since each subset requires intersections, though I'm sure it could be optimized so that intersect(bankid...) is not re-calculated twice for each equivalent pair (if performance is a factor).




Edit: a slightly more efficient process that does not re-calculate equivalent pairs of geoid:



Split the data by geo:



geox <- split(x, x$geoid)

myfunc <- function(i, j) {
  # only compute the upper triangle; the matrix is symmetric
  if (i >= j) return(NA)
  banks <- intersect(geox[[i]]$bankid, geox[[j]]$bankid)
  sum(with(geox[[i]], lending[bankid %in% banks]),
      with(geox[[j]], lending[bankid %in% banks])) /
    sum(geox[[i]]$lending, geox[[j]]$lending)
}

o <- outer(seq_along(geox), seq_along(geox),
           function(i, j) mapply(myfunc, i, j))
o
#      [,1]      [,2]      [,3]
# [1,]   NA 0.4243697 0.3425926
# [2,]   NA        NA 0.1954023
# [3,]   NA        NA        NA


(Just to show we only calculated the minimum set.) Now, mirror the upper triangle's data into the lower triangle:



o[lower.tri(o)] <- t(o)[lower.tri(o)]  # mirror the upper triangle; works for any matrix size
o
#           [,1]      [,2]      [,3]
# [1,]        NA 0.4243697 0.3425926
# [2,] 0.4243697        NA 0.1954023
# [3,] 0.3425926 0.1954023        NA


And assign the known value of 1 to the diagonal:



diag(o) <- 1
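
As an aside, if lending is strictly positive wherever a bank operates (so that lending > 0 identifies presence), the same N x N matrix can be computed without any explicit loop over pairs, using matrix algebra on the geography-by-bank lending table mentioned in the comments. A sketch under that assumption, reusing x from above:

# Sketch, assuming lending > 0 marks every area where a bank operates
L <- xtabs(lending ~ geoid + bankid, data = x)  # N x K lending by area and bank
A <- (L > 0) * 1                                # N x K indicator: bank operates in area
num <- (A * L) %*% t(A) + A %*% t(A * L)        # lending by banks present in both i and j
den <- outer(rowSums(L), rowSums(L), "+")       # total lending in areas i and j
unname(num / den)
#           [,1]      [,2]      [,3]
# [1,] 1.0000000 0.4243697 0.3425926
# [2,] 0.4243697 1.0000000 0.1954023
# [3,] 0.3425926 0.1954023 1.0000000

This scales to any number of areas and banks, since it avoids the pairwise intersect calls entirely.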





edited Nov 15 '18 at 0:46
answered Nov 15 '18 at 0:14 by r2evans