Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:

| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |

I want to build a new dataframe from results derived from this one. As a simple example, the new dataframe should contain one row per sample, holding the averages over that sample's runs:

| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |

At the moment I do this with a loop:

# samples: iterable of sample_id values; index: the column(s) used as the pivot index
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=['x', 'y', 'z'], aggfunc='mean')

    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)

# create new dataframe
pd.concat(pivots)

So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do it all at once with a single pivot call instead of calling pivot_table iteratively? If there is, is it more performant?
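
To make the first question concrete, here is a minimal sketch of what a single call over the whole frame might look like, assuming the pivot index is sample_id (the loop above leaves the index variable unspecified). The flattened column names such as x_1 are illustrative only. A single call avoids filtering the full frame once per sample, but whether it is faster in practice would need to be timed on real data.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id": [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

# One pivot over the whole frame: one row per sample_id,
# one column per (value, run_id) pair. Assumes index="sample_id".
wide = df_samples.pivot_table(
    index="sample_id", columns="run_id", values=["x", "y", "z"], aggfunc="mean"
)

# Flatten the (value, run_id) MultiIndex columns into names like "x_1", "x_2".
wide.columns = [f"{value}_{run}" for value, run in wide.columns]
print(wide)
```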

My second question concerns the more complicated case: is it possible to perform multiple pivots at once when the new dataframe also gains columns that did not exist in the original? For example, it might look like:

| s_id | avg_x | avg_y | avg_z | new_feature_1 | new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |

The functions essentially perform individual operations on the data per sample_id to create new features.
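
One possible shape for this, sketched with groupby and apply rather than a pivot: each group is the full set of original rows for one sample_id, so a per-sample function can use both the raw values and the averages. The bodies of f and g below are hypothetical stand-ins, since the real functions are not shown, and build_row is a helper name invented for this example. Note that apply still calls Python once per sample, so any gain would come mainly from dropping the repeated filtering and the final concat.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id": [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

def f(group):
    # Hypothetical stand-in for f: uses every raw x, y, z value of one sample.
    return float(group[["x", "y", "z"]].to_numpy().sum())

def g(means):
    # Hypothetical stand-in for g: uses only the per-sample averages.
    return float(means.max() - means.min())

def build_row(group):
    # group holds the original rows of a single sample_id.
    means = group[["x", "y", "z"]].mean()
    row = means.add_prefix("avg_").to_dict()
    row["new_feature_1"] = f(group)
    row["new_feature_2"] = g(means)
    return pd.Series(row)

features = (
    df_samples.groupby("sample_id")[["run_id", "x", "y", "z"]]
    .apply(build_row)
    .reset_index()
)
print(features)
```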

Aside: I am looking for a good resource on working with large pandas dataframes and on constructing new ones or running queries performantly. I can almost always get the result I want using pandas, but my implementations are often inefficient and resemble how the task might be done in a lower-level language like C++. I would like to improve my working knowledge, which may involve some theory on dataframes and tables that I do not know. A recommendation for a resource would be welcome. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without one.

edited Nov 16 '18 at 5:53

asked Nov 16 '18 at 5:48

Aesir

51811023

1

Have you tried df.groupby(['sample_id']).mean()? See the groupby documentation.

– Kevin Fang
Nov 16 '18 at 5:53

Yes, I have tried groupby calls before, and for some reason I did not use them in my current implementation. There was probably a reason that I have now forgotten. I assume it was because, when creating the new features, I want access to the information from the original dataframe, not just the grouped results, which meant I had to pivot the initial dataframe anyway.

– Aesir
Nov 16 '18 at 5:57
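
On the concern raised in the comment above, a minimal sketch of one way to keep the original rows next to the grouped results: transform returns the per-sample means aligned to the original index, so they can be joined back onto the unaggregated data. The x_dev column is a hypothetical feature added only to show both kinds of value being used together.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id": [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

value_cols = ["x", "y", "z"]

# Per-sample means, broadcast back to one value per original row.
group_means = (
    df_samples.groupby("sample_id")[value_cols].transform("mean").add_prefix("avg_")
)

# Original rows and grouped results side by side.
combined = df_samples.join(group_means)

# Hypothetical feature: how far each run's x is from its sample average.
combined["x_dev"] = combined["x"] - combined["avg_x"]
print(combined)
```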


python pandas dataframe
