Keeping only one record from a Spark DataFrame, not necessarily a distinct record










I have the following data:



id  type  price
1   us    1.99
1   ca    2.99
2   ca    1.99
3   au    2.99
3   us    3.99
3   ca    2.99


Basically, I want to get only one record per id, and if there is more than one, I want to keep the one with type 'us' over the others.



Desired output:



id  type  price
1   us    1.99
2   ca    1.99
3   us    3.99


How can I do this using either PySpark or SparkSQL against a Spark DataFrame or a table?










apache-spark dataframe pyspark pyspark-sql databricks






edited Nov 15 '18 at 21:47 – karma4917
asked Nov 15 '18 at 16:44 – dunkky
  • Why don't you try this? cols = ['id', 'type', 'price'] vals = [(1, 'us', 1.99), (1, 'ca', 2.99), (2, 'ca', 1.99), (3, 'au', 2.99), (3, 'us', 2.99), (3, 'us', 3.99), (3, 'ca', 2.99)] df = spark.createDataFrame(vals, cols) d = df.groupBy('id').agg(first('type').alias('type'), first('price').alias('price')).orderBy('id') d.show()

    – karma4917
    Nov 15 '18 at 19:01












  • Thanks, but it's not recognizing 'first'. I even tried from pyspark.sql import functions as F and F.first; still the same thing. And how is the above query going to yield 'us' first?

    – dunkky
    Nov 15 '18 at 21:38












  • Can you paste in the code you have tried so far?

    – karma4917
    Nov 16 '18 at 0:36











  • It's the exact same code you gave me above; I'm running it in a Databricks notebook. I'll post it tomorrow, but it's literally the same. I'm thinking I might have to take this route: get all the 'us' rows from df and save them to df1; get all the 'ca' rows from df whose id does not exist in df1 and save them to df2; then merge df1 and df2 (see the sketch after these comments). Would that work, assuming there are only 'us' and 'ca' types?

    – dunkky
    Nov 16 '18 at 2:22
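
Two notes on the thread above: first() lives in pyspark.sql.functions, so it must be imported (e.g. from pyspark.sql.functions import first); and even with the import, groupBy().agg(first(...)) gives no guarantee that the 'us' row is the one picked, since Spark makes no ordering promise inside an aggregation. For reference, here is a minimal sketch of the filter-and-union route proposed in the last comment, under the commenter's own assumption that only 'us' and 'ca' types occur (df is the DataFrame from the question; assumes an active SparkSession):

from pyspark.sql import functions as F

# rows that already have type 'us'
df1 = df.filter(F.col('type') == 'us')

# 'ca' rows whose id has no 'us' row; a left anti join keeps only non-matching ids
df2 = df.filter(F.col('type') == 'ca') \
        .join(df1.select('id'), on='id', how='left_anti')

# merge the two sets
result = df1.union(df2)

Note that this drops any id whose only rows have some other type, such as 'au', which is why the window-function answer below is more general.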


















1 Answer






A window function with custom sorting can be used. The example below is in Scala, but it can be translated to Python (a PySpark sketch follows this answer):



// imports (assumes a SparkSession named spark, as in spark-shell)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}
import spark.implicits._

// data
val df = List(
  (1, "us", 1.99),
  (1, "ca", 2.99),
  (2, "ca", 1.99),
  (3, "au", 2.99),
  (3, "us", 3.99),
  (3, "ca", 2.99))
  .toDF("id", "type", "price")

// action: sort "us" rows first within each id, then keep only the first row per id
val sorting = when($"type" === "us", 0).otherwise(1)
val window = Window.partitionBy($"id").orderBy(sorting)
val result = df
  .withColumn("sortIndex", row_number().over(window))
  .where($"sortIndex" === 1)
  .drop("sortIndex")


Result:



+---+----+-----+
|id |type|price|
+---+----+-----+
|1 |us |1.99 |
|2 |ca |1.99 |
|3 |us |3.99 |
+---+----+-----+





answered Nov 17 '18 at 20:43 – pasha701
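
A possible PySpark translation of the answer above (a sketch; assumes an active SparkSession named spark):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [(1, 'us', 1.99), (1, 'ca', 2.99), (2, 'ca', 1.99),
     (3, 'au', 2.99), (3, 'us', 3.99), (3, 'ca', 2.99)],
    ['id', 'type', 'price'])

# sort 'us' rows first within each id, then keep only the first row per id
sorting = F.when(F.col('type') == 'us', 0).otherwise(1)
window = Window.partitionBy('id').orderBy(sorting)

result = (df.withColumn('sortIndex', F.row_number().over(window))
            .where(F.col('sortIndex') == 1)
            .drop('sortIndex'))
result.show()

The same idea also works from SparkSQL, as the question asks; here the DataFrame is registered under a hypothetical view name, prices:

# register a temporary view so the data can be queried with SQL
df.createOrReplaceTempView('prices')

result = spark.sql("""
    SELECT id, type, price
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY CASE WHEN type = 'us' THEN 0 ELSE 1 END
               ) AS rn
        FROM prices
    ) t
    WHERE rn = 1
""")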




























