Keeping only one record from a Spark DataFrame, not necessarily a distinct record
I have the following data:
id type price
1 us 1.99
1 ca 2.99
2 ca 1.99
3 au 2.99
3 us 3.99
3 ca 2.99
Basically, I want to get only one record per id, and if there is more than one, I want to keep the one with type 'us' over the others.
Desired output:
id type price
1 us 1.99
2 ca 1.99
3 us 3.99
How can I do this using either PySpark or SparkSQL against a Spark DataFrame or a table?
apache-spark dataframe pyspark pyspark-sql databricks
Why don't you try this?
cols = ['id', 'type', 'price']
vals = [(1, 'us', 1.99), (1, 'ca', 2.99), (2, 'ca', 1.99),
        (3, 'au', 2.99), (3, 'us', 2.99), (3, 'us', 3.99), (3, 'ca', 2.99)]
df = spark.createDataFrame(vals, cols)
d = df.groupBy('id').agg(first('type').alias('type'), first('price').alias('price')).orderBy('id')
d.show()
– karma4917
Nov 15 '18 at 19:01
Thanks, but it's not recognizing 'first'. I even tried from pyspark.sql import functions as F and then F.first; still the same thing. And how is the above query going to yield 'us' first?
– dunkky
Nov 15 '18 at 21:38
Can you paste in the code you have tried so far?
– karma4917
Nov 16 '18 at 0:36
It's the exact same code you gave me above; I'm running it in a Databricks notebook. I'll post it tomorrow, but it's literally the same. I'm thinking I might have to take this route: get all the 'us' rows from df and save them to df1; get all the 'ca' rows from df whose id does not exist in df1 and save them to df2; then merge df1 and df2. Would that work, assuming there are only 'us' and 'ca' types?
– dunkky
Nov 16 '18 at 2:22
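For reference, a minimal sketch (in PySpark, untested) of the route proposed in the last comment, reusing the df built in the first comment. It assumes only 'us' and 'ca' types exist; rows of any other type (such as 'au' in the sample data) would be dropped, so the window-function answer below is more general.

from pyspark.sql import functions as F

df1 = df.where(F.col("type") == "us")  # all 'us' rows
# 'ca' rows whose id has no 'us' row (a left anti join keeps non-matching ids)
df2 = df.where(F.col("type") == "ca").join(df1, on="id", how="left_anti")
merged = df1.union(df2)  # one row per id, preferring 'us'
merged.show()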
1 Answer
A window function with custom sorting can be used. The example below is in Scala, but it can presumably be translated to Python:
// imports needed for toDF, the $ column syntax, when and row_number
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}
import spark.implicits._

// data
val df = List(
  (1, "us", 1.99),
  (1, "ca", 2.99),
  (2, "ca", 1.99),
  (3, "au", 2.99),
  (3, "us", 3.99),
  (3, "ca", 2.99))
  .toDF("id", "type", "price")

// action: rows with type "us" sort first within each id,
// so row_number() == 1 keeps the preferred record
val sorting = when($"type" === "us", 0).otherwise(1)
val window = Window.partitionBy($"id").orderBy(sorting)
val result = df
  .withColumn("sortIndex", row_number().over(window))
  .where($"sortIndex" === 1)
  .drop("sortIndex")
Result:
+---+----+-----+
|id |type|price|
+---+----+-----+
|1 |us |1.99 |
|2 |ca |1.99 |
|3 |us |3.99 |
+---+----+-----+
answered Nov 17 '18 at 20:43 by pasha701
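Since the question asks for PySpark, a sketch of the same window approach translated to Python follows (untested; it assumes an existing SparkSession named spark, as in the comments above):

from pyspark.sql import Window
from pyspark.sql import functions as F

cols = ["id", "type", "price"]
vals = [(1, "us", 1.99), (1, "ca", 2.99), (2, "ca", 1.99),
        (3, "au", 2.99), (3, "us", 3.99), (3, "ca", 2.99)]
df = spark.createDataFrame(vals, cols)

# rows with type 'us' sort first within each id; row_number() picks one row per id
sorting = F.when(F.col("type") == "us", 0).otherwise(1)
window = Window.partitionBy("id").orderBy(sorting)
result = (df
          .withColumn("sortIndex", F.row_number().over(window))
          .where(F.col("sortIndex") == 1)
          .drop("sortIndex"))
result.show()

And a Spark SQL equivalent over a temp view (also a sketch; the view name prices is illustrative):

df.createOrReplaceTempView("prices")
spark.sql("""
    SELECT id, type, price
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN type = 'us' THEN 0 ELSE 1 END
               ) AS rn
        FROM prices
    ) t
    WHERE rn = 1
    ORDER BY id
""").show()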