Split Spark Dataframe string column into multiple columns
32
I've seen various people suggesting that DataFrame.explode is a useful way to do this, but it results in more rows than the original DataFrame, which isn't what I want at all. I simply want to do the DataFrame equivalent of the very simple:

rdd.map(lambda row: row + [row.my_str_col.split('-')])

which takes something looking like:

col1 | my_str_col
-----+-----------
  18 | 856-yygrm
 201 | 777-psgdg

and converts it to this:

col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
  18 | 856-yygrm  |   856 | yygrm
 201 | 777-psgdg  |   777 | psgdg

I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
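A minimal illustration of that nested-array behaviour, assuming the DataFrame is called df (the pieces column name is only for illustration):

import pyspark.sql.functions as F

# split() yields one ArrayType(StringType()) column, not two top-level string columns
df.withColumn("pieces", F.split("my_str_col", "-")).printSchema()
# ...
#  |-- pieces: array (nullable = true)
#  |    |-- element: string (containsNull = true)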



Ideally, I want these new columns to be named as well.
apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql
asked Aug 30 '16 at 19:32
Peter Gaultney
3 Answers
53

pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

import pyspark.sql.functions  # needed for the fully-qualified calls below

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))

The result will be:

col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
  18 | 856-yygrm  |   856 | yygrm
 201 | 777-psgdg  |   777 | psgdg

I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.

answered Aug 30 '16 at 19:32
Peter Gaultney
• did you find a solution for the general uneven case? – Ezer K, Apr 5 '17 at 10:37

• unfortunately I never did. – Peter Gaultney, Apr 6 '17 at 12:10

• ended up using a python loop, i.e - for i in range(max(len_of_split): df = df.withcolumn(split.getItem(i)) – Ezer K, Apr 6 '17 at 21:39

• Is there a way to put the remaining items in a single column? i.e. split_col.getItem(2 - n) in a third column. I guess something like the above loop to make columns for all items then concatenating them might work, but I don't know if that's very efficient or not. – Chris, Oct 18 '17 at 19:28

• Curious - is anybody aware of how to specify taking the last item of the array? – EntryLevelR, Mar 23 '18 at 21:52
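A minimal sketch along the lines of the loop mentioned in the comments, for the uneven case; the part_ column prefix and the extra action to find the longest split are assumptions, and element_at (used here for the last element) requires Spark 2.4+:

import pyspark.sql.functions as F

split_col = F.split(df['my_str_col'], '-')

# one action to find the longest split, then one column per position;
# getItem() past the end of a shorter array simply returns null
max_parts = df.select(F.max(F.size(split_col)).alias('n')).first()['n']
for i in range(max_parts):
    df = df.withColumn('part_{}'.format(i), split_col.getItem(i))

# last element of the array; element_at is 1-based and -1 counts from the end (Spark 2.4+)
df = df.withColumn('last_part', F.element_at(split_col, -1))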
12

Here's a solution to the general case that doesn't require knowing the length of the array ahead of time, using collect, or using udfs. Unfortunately this only works for Spark version 2.1 and above, because it requires the posexplode function.

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num|   letters|
#+---+----------+
#|  1|A, B, C, D|
#|  2|   E, F, G|
#|  3|      H, I|
#|  4|         J|
#+---+----------+

Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.

import pyspark.sql.functions as f

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .show()
#+---+------------+---+---+
#|num|     letters|pos|val|
#+---+------------+---+---+
#|  1|[A, B, C, D]|  0|  A|
#|  1|[A, B, C, D]|  1|  B|
#|  1|[A, B, C, D]|  2|  C|
#|  1|[A, B, C, D]|  3|  D|
#|  2|   [E, F, G]|  0|  E|
#|  2|   [E, F, G]|  1|  F|
#|  2|   [E, F, G]|  2|  G|
#|  3|      [H, I]|  0|  H|
#|  3|      [H, I]|  1|  I|
#|  4|         [J]|  0|  J|
#+---+------------+---+---+

Now we create two new columns from this result. The first one is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "num",
        f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
        f.expr("letters[pos]").alias("val")
    )\
    .show()
#+---+-------+---+
#|num|   name|val|
#+---+-------+---+
#|  1|letter0|  A|
#|  1|letter1|  B|
#|  1|letter2|  C|
#|  1|letter3|  D|
#|  2|letter0|  E|
#|  2|letter1|  F|
#|  2|letter2|  G|
#|  3|letter0|  H|
#|  3|letter1|  I|
#|  4|letter0|  J|
#+---+-------+---+

Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "num",
        f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
        f.expr("letters[pos]").alias("val")
    )\
    .groupBy("num").pivot("name").agg(f.first("val"))\
    .show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#|  1|      A|      B|      C|      D|
#|  3|      H|      I|   null|   null|
#|  2|      E|      F|      G|   null|
#|  4|      J|   null|   null|   null|
#+---+-------+-------+-------+-------+

answered Aug 3 '18 at 21:29
pault
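One possible refinement, not from the original answer: when the maximum number of pieces is known up front, the expected column names can be passed to pivot, which spares Spark the extra pass it otherwise needs to discover the distinct values of name. A sketch, with the four letter columns from the example above assumed:

df.select(
        "num",
        f.split("letters", ", ").alias("letters"),
        f.posexplode(f.split("letters", ", ")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "num",
        f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
        f.expr("letters[pos]").alias("val")
    )\
    .groupBy("num")\
    .pivot("name", ["letter0", "letter1", "letter2", "letter3"])\
    .agg(f.first("val"))\
    .show()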
0

I found a solution for the general uneven case (or when you get nested columns, obtained with the .split() function):

import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

@f.udf(StructType([StructField('col_3', StringType(), True),
                   StructField('col_4', StringType(), True)]))
def splitCols(array):
    # first piece in col_3, the remaining pieces joined back together in col_4
    return array[0], ''.join(array[1:len(array)])

df = df.withColumn("name", splitCols(f.split(f.col("my_str_col"), '-')))\
       .select(df.columns + ['name.*'])

Basically, you just need to select all the preceding columns + the nested ones 'column_name.*' and you will get them as two top-level columns in this case.

edited Nov 15 '18 at 2:46
answered Nov 15 '18 at 2:33
Jasminyas
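For comparison, a sketch of the same two columns built from native functions instead of a udf; this is not part of the original answer, the col_3/col_4 names follow the snippet above, and slice/array_join require Spark 2.4+:

import pyspark.sql.functions as f

parts = f.split(f.col("my_str_col"), '-')

# first piece as col_3, everything after it re-joined with no separator (matching ''.join above) as col_4
df = df.withColumn("col_3", parts.getItem(0))\
       .withColumn("col_4",
                   f.expr("array_join(slice(split(my_str_col, '-'), 2, size(split(my_str_col, '-')) - 1), '')"))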