Get overall smallest elements' distribution in dataframe with sorted columns more efficiently










I have a dataframe with sorted columns, something like this:

    df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})

       blue  green   red
    0 -2.15  -0.76 -2.62
    1 -0.88  -0.62 -1.65
    2 -0.77  -0.55 -1.51
    3 -0.73  -0.17 -1.14
    4 -0.06  -0.16 -0.75
    5 -0.03   0.05 -0.08
    6  0.06   0.38  0.37
    7  0.41   0.76  1.04
    8  0.56   0.89  1.16
    9  0.97   2.94  1.79

What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:

    is_small = df.isin(np.partition(df.values.flatten(), n)[:n])

With n=10 it looks like this:

        blue  green    red
    0   True   True   True
    1   True  False   True
    2   True  False   True
    3   True  False   True
    4  False  False   True
    5  False  False  False
    6  False  False  False
    7  False  False  False
    8  False  False  False
    9  False  False  False

Then by applying np.sum I get the number corresponding to each column.
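For reference, the steps described so far can be put together as one runnable sketch. The frame is typed out from the table shown in the question rather than generated randomly, so the numbers are reproducible; the names `smallest` and `counts` are only illustrative.

```python
import numpy as np
import pandas as pd

# The example frame from the question, typed out by hand (each column sorted ascending).
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})
n = 10

# np.partition places the n smallest values (in no particular order) in the first n slots.
smallest = np.partition(df.to_numpy().ravel(), n)[:n]

# Boolean mask of the cells whose value is among the n smallest.
is_small = df.isin(smallest)

# Summing the mask column by column gives the per-column counts.
counts = is_small.sum()
print(counts)  # blue 4, green 1, red 5 for this frame
```

Note that this relies on exact value matching, so if the n-th smallest value is duplicated across the frame, the counts can add up to more than n.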



I'm dissatisfied with this solution because it in no way utilizes the sortedness of the original data. All the data gets partitioned and all the data is then checked for whether it's in the partition. It seems wasteful, and I can't seem to figure out a better way.







Tags: python, python-3.x, pandas, numpy

asked Nov 15 '18 at 19:17 by MegaBluejay






















2 Answers


















I think you can take the largest of the n smallest values from the partitioned array, compare the frame against it, and then use idxmin to leverage the sorted columns:

    # Find largest of n smallest numbers
    N = (np.partition(df.values.flatten(), n)[:n]).max()
    out = (df<=N).idxmin(axis=0)

Sample run:

    In [152]: np.random.seed(0)

    In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2))
         ...:                    for q in ['blue', 'green', 'red']})

    In [154]: df
    Out[154]:
       blue  green   red
    0 -0.98  -0.85 -2.55
    1 -0.15  -0.21 -1.45
    2 -0.10   0.12 -0.74
    3  0.40   0.14 -0.19
    4  0.41   0.31  0.05
    5  0.95   0.33  0.65
    6  0.98   0.44  0.86
    7  1.76   0.76  1.47
    8  1.87   1.45  1.53
    9  2.24   1.49  2.27

    In [198]: n = 5

    In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()

    In [200]: (df<=N).idxmin(axis=0)
    Out[200]:
    blue     1
    green    1
    red      3
    dtype: int64
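One caveat with reading the count off idxmin: as far as I can tell, when every value in a column is at or below the threshold there is no False to find, so idxmin falls back to the first label (0) instead of the column length. A possible variant, sketched here as an assumption rather than as part of the original answer, is to binary-search each sorted column with np.searchsorted. The frame is typed out from the sample run above, and `threshold` and `counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Frame from the sample run (np.random.seed(0)), typed out by hand.
df = pd.DataFrame({
    "blue":  [-0.98, -0.15, -0.10, 0.40, 0.41, 0.95, 0.98, 1.76, 1.87, 2.24],
    "green": [-0.85, -0.21, 0.12, 0.14, 0.31, 0.33, 0.44, 0.76, 1.45, 1.49],
    "red":   [-2.55, -1.45, -0.74, -0.19, 0.05, 0.65, 0.86, 1.47, 1.53, 2.27],
})
n = 5

# With kth = n - 1, index n - 1 holds the n-th smallest value,
# i.e. the largest of the n smallest.
threshold = np.partition(df.to_numpy().ravel(), n - 1)[n - 1]

# Each column is sorted ascending, so a binary search with side="right"
# returns how many of its values are <= threshold, including the
# "whole column qualifies" case.
counts = pd.Series({
    col: int(np.searchsorted(df[col].to_numpy(), threshold, side="right"))
    for col in df.columns
})
print(counts)  # blue 1, green 1, red 3 for this frame
```

The threshold is still found with a partition over the whole frame, so only the counting step benefits from the sorted columns here. Ties at the threshold can again push the total above n.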






answered Nov 15 '18 at 19:33 by Divakar (edited Nov 15 '18 at 19:59)












• There appears to be a problem with this: for example, with the same data and n=5 its results are 1, 0, 2, while their sum should be 5. – MegaBluejay, Nov 15 '18 at 19:52

• Pretty sure it works with (df<np.partition(df.values.flatten(), n)[n]).idxmin(axis=0) though. – MegaBluejay, Nov 15 '18 at 19:55

• @MegaBluejay The issue was that np.partition does not return the partition in sorted order, so we need to either sort the smallest partition or take its max before proceeding. Fixed. – Divakar, Nov 15 '18 at 20:00


















Let's say you are looking for the 10 smallest values. You can stack the frame, take the 10 smallest entries, and call value_counts on their column labels:

    df.stack().nsmallest(10).index.get_level_values(1).value_counts()

You get:

    red      5
    blue     4
    green    1
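The one-liner above still looks at every value in the stacked frame. If the goal is to touch only about n elements, one option is a lazy k-way merge of the already sorted columns. This is a sketch using heapq.merge from the standard library, not something proposed in the thread; the frame is the one from the question, typed out by hand, and `streams` and `counts` are illustrative names.

```python
import heapq
from collections import Counter
from itertools import islice, repeat

import pandas as pd

# The example frame from the question (each column sorted ascending).
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})
n = 10

# One lazy stream of (value, column name) pairs per column;
# each stream is already in ascending order of value.
streams = [zip(df[col].to_numpy(), repeat(col)) for col in df.columns]

# heapq.merge yields the pairs in globally sorted order without
# materialising them; islice stops after the first n.
counts = Counter(col for _, col in islice(heapq.merge(*streams), n))
print(counts)  # red 5, blue 4, green 1 for this frame
```

With k columns this does roughly n heap operations of cost O(log k) each, but it runs in pure Python, so for small frames the vectorised approaches are likely faster in practice. Columns that contribute nothing simply do not appear in the Counter.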






answered Nov 15 '18 at 19:22 by Vaishali


























