Keeping only one record from a Spark DataFrame, not necessarily a distinct record
I have the following data:
id type price
1 us 1.99
1 ca 2.99
2 ca 1.99
3 au 2.99
3 us 3.99
3 ca 2.99
Basically, I want to get only one record per id, and if there is more than one, I want to keep the one with type 'us' over the others.
Desired output:
id type price
1 us 1.99
2 ca 1.99
3 us 3.99
How can I do this using either PySpark or SparkSQL against a Spark DataFrame or a table?
apache-spark dataframe pyspark pyspark-sql databricks
Why don't you try this?
cols = ['id', 'type', 'price']
vals = [(1, 'us', 1.99), (1, 'ca', 2.99), (2, 'ca', 1.99),
        (3, 'au', 2.99), (3, 'us', 2.99), (3, 'us', 3.99), (3, 'ca', 2.99)]
df = spark.createDataFrame(vals, cols)
d = df.groupBy('id').agg(first('type').alias('type'), first('price').alias('price')).orderBy('id')
d.show()
– karma4917
Nov 15 '18 at 19:01
Thanks, but it's not recognizing 'first'. I even tried from pyspark.sql import functions as F and then F.first; still the same thing. And how is the above query going to yield 'us' first?
– dunkky
Nov 15 '18 at 21:38
Can you paste in the code you have tried so far?
– karma4917
Nov 16 '18 at 0:36
It's the exact same code you gave me above; I'm running it in a Databricks notebook. I'll post it tomorrow, but it's literally the same. I'm thinking I might have to take this route: get all the 'us' rows from df and save them to df1; get all the 'ca' rows from df whose id does not exist in df1 and save them to df2; then merge df1 and df2. Would that work, assuming there are only 'us' and 'ca' types?
– dunkky
Nov 16 '18 at 2:22
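For reference, a minimal sketch (in PySpark, untested) of the route proposed in the last comment, reusing the df built in the first comment. It assumes only 'us' and 'ca' types exist; rows of any other type (such as 'au' in the sample data) would be dropped, so the window-function answer below is more general.

from pyspark.sql import functions as F

df1 = df.where(F.col("type") == "us")  # all 'us' rows
# 'ca' rows whose id has no 'us' row (a left anti join keeps non-matching ids)
df2 = df.where(F.col("type") == "ca").join(df1, on="id", how="left_anti")
merged = df1.union(df2)  # one row per id, preferring 'us'
merged.show()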
1 Answer
A window function with custom sorting can be used. The example below is in Scala, but it can presumably be translated to Python:
// imports needed for toDF, the $ column syntax, when and row_number
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}
import spark.implicits._

// data
val df = List(
  (1, "us", 1.99),
  (1, "ca", 2.99),
  (2, "ca", 1.99),
  (3, "au", 2.99),
  (3, "us", 3.99),
  (3, "ca", 2.99))
  .toDF("id", "type", "price")

// action: rows with type "us" sort first within each id,
// so row_number() == 1 keeps the preferred record
val sorting = when($"type" === "us", 0).otherwise(1)
val window = Window.partitionBy($"id").orderBy(sorting)
val result = df
  .withColumn("sortIndex", row_number().over(window))
  .where($"sortIndex" === 1)
  .drop("sortIndex")
Result:
+---+----+-----+
|id |type|price|
+---+----+-----+
|1 |us |1.99 |
|2 |ca |1.99 |
|3 |us |3.99 |
+---+----+-----+
answered Nov 17 '18 at 20:43 by pasha701
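Since the question asks for PySpark, a sketch of the same window approach translated to Python follows (untested; it assumes an existing SparkSession named spark, as in the comments above):

from pyspark.sql import Window
from pyspark.sql import functions as F

cols = ["id", "type", "price"]
vals = [(1, "us", 1.99), (1, "ca", 2.99), (2, "ca", 1.99),
        (3, "au", 2.99), (3, "us", 3.99), (3, "ca", 2.99)]
df = spark.createDataFrame(vals, cols)

# rows with type 'us' sort first within each id; row_number() picks one row per id
sorting = F.when(F.col("type") == "us", 0).otherwise(1)
window = Window.partitionBy("id").orderBy(sorting)
result = (df
          .withColumn("sortIndex", F.row_number().over(window))
          .where(F.col("sortIndex") == 1)
          .drop("sortIndex"))
result.show()

And a Spark SQL equivalent over a temp view (also a sketch; the view name prices is illustrative):

df.createOrReplaceTempView("prices")
spark.sql("""
    SELECT id, type, price
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN type = 'us' THEN 0 ELSE 1 END
               ) AS rn
        FROM prices
    ) t
    WHERE rn = 1
    ORDER BY id
""").show()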