Get overall smallest elements' distribution in dataframe with sorted columns more efficiently

I have a dataframe with sorted columns, something like this:

df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})

   blue  green   red
0 -2.15  -0.76 -2.62
1 -0.88  -0.62 -1.65
2 -0.77  -0.55 -1.51
3 -0.73  -0.17 -1.14
4 -0.06  -0.16 -0.75
5 -0.03   0.05 -0.08
6  0.06   0.38  0.37
7  0.41   0.76  1.04
8  0.56   0.89  1.16
9  0.97   2.94  1.79

What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:

is_small = df.isin(np.partition(df.values.flatten(), n)[:n])

With n=10, is_small looks like this:

    blue  green    red
0   True   True   True
1   True  False   True
2   True  False   True
3   True  False   True
4  False  False   True
5  False  False  False
6  False  False  False
7  False  False  False
8  False  False  False
9  False  False  False

Then, by applying np.sum to is_small, I get the count for each column.

I'm dissatisfied with this solution because it in no way utilizes the sortedness of the original data. All the data gets partitioned and all the data is then checked for whether it's in the partition. It seems wasteful, and I can't seem to figure out a better way.
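One way to use the sortedness directly is a k-way merge: treat each column as an already-sorted stream and pull only the first n items from the merged stream. The sketch below is not from the thread; the function name count_n_smallest is hypothetical, it assumes every column is sorted ascending and contains no NaN, and the frame is a literal copy of the sample above. Ties are broken by column name.

```python
import heapq
from collections import Counter
from itertools import islice, repeat

import pandas as pd


def count_n_smallest(df, n):
    """Count, per column, how many of the n smallest values it holds.

    Hypothetical helper: assumes each column is sorted ascending, no NaN.
    """
    # One lazy stream of (value, column_name) pairs per sorted column.
    streams = [zip(df[col].to_numpy(), repeat(col)) for col in df.columns]
    # heapq.merge yields the pairs in ascending order without
    # materialising the whole merged sequence.
    merged = heapq.merge(*streams)
    counts = Counter(col for _, col in islice(merged, n))
    # Counter returns 0 for columns that contributed nothing.
    return pd.Series({col: counts[col] for col in df.columns})


# Literal copy of the sample frame (the original was randomly generated).
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})

counts = count_n_smallest(df, 10)  # blue 4, green 1, red 5 for this frame
```

The merge only consumes about n items plus one look-ahead per column, so it may pay off when n is much smaller than the frame. It is a pure-Python loop, though, so for small frames the vectorised np.partition version could well be faster; timing on real data is the only way to know.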

asked Nov 15 '18 at 19:17

MegaBluejay


python python-3.x pandas numpy


2 Answers

I think you can take the largest of the n smallest values (obtained with np.partition), compare the frame against it, and then use idxmin to leverage the sorted nature of the columns:

# Find largest of n smallest numbers
N = (np.partition(df.values.flatten(), n)[:n]).max()
out = (df<=N).idxmin(axis=0)

Sample run -

In [152]: np.random.seed(0)

In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2))
     ...:                    for q in ['blue', 'green', 'red']})

In [154]: df
Out[154]: 
   blue  green   red
0 -0.98  -0.85 -2.55
1 -0.15  -0.21 -1.45
2 -0.10   0.12 -0.74
3  0.40   0.14 -0.19
4  0.41   0.31  0.05
5  0.95   0.33  0.65
6  0.98   0.44  0.86
7  1.76   0.76  1.47
8  1.87   1.45  1.53
9  2.24   1.49  2.27

In [198]: n = 5

In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()

In [200]: (df<=N).idxmin(axis=0)
Out[200]: 
blue 1
green 1
red 3
dtype: int64
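A possible variant of the same threshold idea, sketched here as an assumption rather than as part of the answer: since each column is sorted, np.searchsorted can binary-search the threshold per column instead of building a boolean frame. The function name counts_via_searchsorted is hypothetical, and the frame is a literal copy of the sample run above. As far as I can tell, this also sidesteps the case where a whole column lies below the threshold, in which idxmin on an all-True column would report 0.

```python
import numpy as np
import pandas as pd


def counts_via_searchsorted(df, n):
    """Per-column count of values <= the n-th smallest value overall.

    Hypothetical helper: assumes sorted columns and 1 <= n <= df.size.
    """
    values = df.to_numpy()
    # With kth = n - 1, index n - 1 holds the largest of the n smallest.
    threshold = np.partition(values.ravel(), n - 1)[n - 1]
    # side="right" counts entries equal to the threshold as well.
    return pd.Series({
        col: int(np.searchsorted(df[col].to_numpy(), threshold, side="right"))
        for col in df.columns
    })


# Literal copy of the sample-run frame (np.random.seed(0)).
df = pd.DataFrame({
    "blue":  [-0.98, -0.15, -0.10, 0.40, 0.41, 0.95, 0.98, 1.76, 1.87, 2.24],
    "green": [-0.85, -0.21, 0.12, 0.14, 0.31, 0.33, 0.44, 0.76, 1.45, 1.49],
    "red":   [-2.55, -1.45, -0.74, -0.19, 0.05, 0.65, 0.86, 1.47, 1.53, 2.27],
})

counts = counts_via_searchsorted(df, 5)  # blue 1, green 1, red 3
```

Note that np.partition still scans the whole frame to find the threshold; only the counting step uses the sortedness. If the threshold value is duplicated across columns, the counts can add up to more than n.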

edited Nov 15 '18 at 19:59

answered Nov 15 '18 at 19:33

Divakar


There appears to be a problem with this: for example with the same data and n=5 its results are 1, 0, 2, while their sum should be 5

– MegaBluejay
Nov 15 '18 at 19:52

Pretty sure it works with (df<np.partition(df.values.flatten(), n)[n]).idxmin(axis=0) though

– MegaBluejay
Nov 15 '18 at 19:55

@MegaBluejay The issue was that np.partition does not return the partition in sorted order, so we need to sort the smallest partition (or simply take its max) and then proceed. Fixed.

– Divakar
Nov 15 '18 at 20:00


Let's say you are looking for the 10 smallest: you can stack the frame, take the 10 smallest values, and call value_counts on their column labels

df.stack().nsmallest(10).index.get_level_values(1).value_counts()

You get

red 5
blue 4
green 1
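For reference, here is the one-liner spelled out as a runnable sketch on a literal copy of the question's sample frame. The final reindex step is my own addition (an assumption about what is wanted): value_counts drops columns that contribute nothing, and reindexing puts them back with a count of 0.

```python
import pandas as pd

# Literal copy of the question's sample frame.
df = pd.DataFrame({
    "blue":  [-2.15, -0.88, -0.77, -0.73, -0.06, -0.03, 0.06, 0.41, 0.56, 0.97],
    "green": [-0.76, -0.62, -0.55, -0.17, -0.16, 0.05, 0.38, 0.76, 0.89, 2.94],
    "red":   [-2.62, -1.65, -1.51, -1.14, -0.75, -0.08, 0.37, 1.04, 1.16, 1.79],
})
n = 10

# stack() gives a Series indexed by (row label, column name).
stacked = df.stack()
# Level 1 of that index is the column each small value came from.
counts = stacked.nsmallest(n).index.get_level_values(1).value_counts()
# Optional: restore the original column order and include zero counts.
counts_all = counts.reindex(df.columns, fill_value=0)
```

Like the np.partition approach, this looks at every value in the frame, so it does not take advantage of the columns being sorted.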

answered Nov 15 '18 at 19:22

Vaishali

