Dataframe column substring based on the value during join

I have a dataframe with a column whose values look like either "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx".

I need to compare this column with a column in a different dataframe, and the comparison depends on the value:

  1. If the value has the form "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")).

  2. If the value has the form "xxxxx-xx-xxxx", I can compare directly without using substring.

For example:



val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))


I am not sure how to add this condition to the join. Could anyone please let me know how to achieve this with a Spark dataframe?

scala apache-spark dataframe join apache-spark-sql

asked Nov 14 '18 at 5:16 by Bab
edited Nov 14 '18 at 5:42 by Shaido

          2 Answers

You can add a new column based on the condition and join on that column. Something like this:

import org.apache.spark.sql.functions.{col, expr, trim, upper, when}
// toDF also needs the session's implicits in scope.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
// Strip the 5-character "COR//" prefix where present; keep other values unchanged.
val DF4 = DF2.withColumn("joinCol",
  when(col("column1").like("COR//%"), expr("substring(column1, 6, length(column1) - 5)"))
    .otherwise(col("column1")))

DF4.show(false)

The new column will have values like this:

+------------------+-------------+
|column1           |joinCol      |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx     |xxxxx-xx-xxxx|
+------------------+-------------+

You can now join on the new column:

val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))

Hope this helps.
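
For anyone who wants to run this approach end to end, here is a minimal sketch. It assumes a throwaway local SparkSession named spark, and the contents of DF3 (including its "label" column), the sample values, and the helper name withJoinKey are made up for illustration; none of them come from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr, trim, upper, when}

object JoinOnDerivedColumn {
  // Hypothetical helper: adds "joinCol", which is column1 without a leading
  // "COR//" (5 characters), or column1 itself when the prefix is absent.
  def withJoinKey(df: DataFrame): DataFrame =
    df.withColumn("joinCol",
      when(col("column1").like("COR//%"), expr("substring(column1, 6, length(column1) - 5)"))
        .otherwise(col("column1")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; reuse the existing one in a real job.
    val spark = SparkSession.builder().master("local[1]").appName("join-sketch").getOrCreate()
    import spark.implicits._

    val DF2 = Seq("COR//abcde-12-3456", "fghij-78-9012").toDF("column1")
    // Made-up lookup data standing in for DF3.
    val DF3 = Seq(("ABCDE-12-3456", "first"), ("FGHIJ-78-9012", "second")).toDF("column1", "label")

    val DF4 = withJoinKey(DF2)
    // Case-insensitive, whitespace-tolerant comparison, as in the question.
    val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
    DF1.select(DF4("column1"), DF4("joinCol"), DF3("label")).show(false)

    spark.stop()
  }
}
```

Keeping the key derivation in its own function makes it easy to check on a handful of sample values before running the full join.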







answered Nov 14 '18 at 5:58 by Apurba Pandey
edited Nov 14 '18 at 8:22

Simply create a new column to use in the join:

// when(...) must be closed before .otherwise is chained onto it.
DF2.withColumn("column2",
  when($"column1" rlike "^COR//.*", $"column1".substr(lit(4), length($"column1")))
    .otherwise($"column1"))

Then use column2 in the join. It is also possible to put the whole when clause directly in the join condition, but it would look very messy.

Note that substr takes either two Ints or two Columns, so a constant start position combined with length(...) has to be wrapped in lit. Also, if you want to remove the whole "COR//" part, use 6 instead of 4.
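
As a rough illustration of putting the condition directly in the join, the sketch below wraps the when clause in a small function so the join condition stays readable. It uses start position 6, per the note above, to drop the whole "COR//" prefix. The local SparkSession, the sample rows, and the name stripCorPrefix are assumptions made for this example, not part of the original answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{length, lit, trim, upper, when}

object JoinWithInlineCondition {
  // Hypothetical helper: removes a leading "COR//" from c, else returns c unchanged.
  // substr(Column, Column) needs both arguments as Columns, hence lit(6).
  // The length argument overshoots the remainder; substr then returns the rest of the string.
  def stripCorPrefix(c: Column): Column =
    when(c.rlike("^COR//"), c.substr(lit(6), length(c))).otherwise(c)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting.
    val spark = SparkSession.builder().master("local[1]").appName("inline-join-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val DF2 = Seq("COR//abcde-12-3456", "fghij-78-9012").toDF("column1")
    val DF3 = Seq("ABCDE-12-3456", "FGHIJ-78-9012").toDF("column1")

    // The conditional expression is evaluated per row inside the join condition itself.
    val DF1 = DF2.join(DF3,
      upper(trim(stripCorPrefix(DF2("column1")))) === upper(trim(DF3("column1"))))
    DF1.show(false)

    spark.stop()
  }
}
```

This avoids the extra column, at the cost of a join condition that Spark cannot treat as a plain column-to-column equality on stored values.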







answered Nov 14 '18 at 5:56 by Shaido

• It's perfect. Thank you for your response. – Bab, Nov 14 '18 at 19:04