Split Spark Dataframe string column into multiple columns
I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql
asked Aug 30 '16 at 19:32
Peter Gaultney
3 Answers
pyspark.sql.functions.split() is the right approach here; you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array contains only 2 items, it's very easy: use Column.getItem() to retrieve each part of the array as a column in its own right:
import pyspark.sql.functions

# split the string column on '-' into an ArrayType column, then pull out each element
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
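One way to handle that uneven case (the loop approach mentioned in the comments below) is to first compute the maximum number of parts and then add one column per position; getItem returns null where a row's array is shorter. A rough sketch, assuming a '-' separator and hypothetical part_N column names:

import pyspark.sql.functions as F

split_col = F.split(df['my_str_col'], '-')

# one extra pass over the data to find the longest split
max_parts = df.select(F.max(F.size(split_col)).alias('n')).collect()[0]['n']

# one column per position; positions beyond a row's length come back as null
for i in range(max_parts):
    df = df.withColumn('part_{}'.format(i), split_col.getItem(i))

# for just the last element, Spark 2.4+ also offers F.element_at(split_col, -1)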
1
did you find a solution for the general uneven case?
– Ezer K
Apr 5 '17 at 10:37
unfortunately I never did.
– Peter Gaultney
Apr 6 '17 at 12:10
2
ended up using a python loop, i.e. for i in range(max_len_of_split): df = df.withColumn('col' + str(i), split_col.getItem(i))
– Ezer K
Apr 6 '17 at 21:39
1
Is there a way to put the remaining items in a single column? i.e. split_col.getItem(2 - n) in a third column. I guess something like the above loop to make columns for all items then concatenating them might work, but I don't know if that's very efficient or not.
– Chris
Oct 18 '17 at 19:28
1
Curious - is anybody aware of how to specify taking the last item of the array?
– EntryLevelR
Mar 23 '18 at 21:52
Here's a solution to the general case that doesn't require knowing the length of the array ahead of time, using collect, or using UDFs. Unfortunately this only works for Spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f

df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. The first is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num column and pivot the DataFrame. Putting that all together, we get:
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).groupBy("num").pivot("name").agg(f.first("val")).show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+
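If you need to do this for more than one column, the same posexplode-and-pivot chain can be wrapped in a small helper. A sketch under the assumption that the grouping column uniquely identifies a row (the split_to_columns name and the part prefix are just illustrative); it keeps posexplode's own val column instead of re-indexing with expr:

import pyspark.sql.functions as f

def split_to_columns(df, id_col, str_col, sep=', ', prefix='part'):
    # explode each part of the split together with its position in the array
    exploded = df.select(
        id_col,
        f.posexplode(f.split(str_col, sep)).alias('pos', 'val')
    )
    # build a column name per position, then pivot positions into columns
    named = exploded.select(
        id_col,
        f.concat(f.lit(prefix), f.col('pos').cast('string')).alias('name'),
        'val'
    )
    return named.groupBy(id_col).pivot('name').agg(f.first('val'))

# e.g. split_to_columns(df, 'num', 'letters') gives a wide result like the
# pivoted table above, with columns part0, part1, ...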
answered Aug 3 '18 at 21:29 by pault
I found a solution for the general uneven case (or when you already have the nested array column, obtained with the .split() function):
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

@f.udf(StructType([StructField('col_3', StringType(), True),
                   StructField('col_4', StringType(), True)]))
def splitCols(array):
    # first element goes to col_3; the remaining elements are concatenated into col_4
    return array[0], ''.join(array[1:])

df = (df.withColumn('name', splitCols(f.split(f.col('my_str_col'), '-')))
        .select(df.columns + ['name.*']))
Basically, you just need to select all the pre-existing columns plus the nested ones ('column_name.*'), and you will get them as two top-level columns in this case.
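Applied to the sample data from the question, this should give something like the following (assuming the struct fields are named col_3 and col_4 as in the UDF's schema):

col1 | my_str_col | col_3 | col_4
-----+------------+-------+------
18   | 856-yygrm  | 856   | yygrm
201  | 777-psgdg  | 777   | psgdg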
edited Nov 15 '18 at 2:46, answered Nov 15 '18 at 2:33 by Jasminyas