Split Spark Dataframe string column into multiple columns
I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
apache-spark pyspark apache-spark-sql spark-dataframe pyspark-sql
asked Aug 30 '16 at 19:32
Peter Gaultney
3 Answers
pyspark.sql.functions.split() is the right approach here; you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array contains only 2 items, it's very easy: use Column.getItem() to retrieve each part of the array as a column in its own right:
import pyspark.sql.functions

# split the string column on '-' into an ArrayType column, then pull out each element
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
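One way to handle that uneven case (the loop approach mentioned in the comments below) is to first compute the maximum number of parts and then add one column per position; getItem returns null where a row's array is shorter. A rough sketch, assuming a '-' separator and hypothetical part_N column names:

import pyspark.sql.functions as F

split_col = F.split(df['my_str_col'], '-')

# one extra pass over the data to find the longest split
max_parts = df.select(F.max(F.size(split_col)).alias('n')).collect()[0]['n']

# one column per position; positions beyond a row's length come back as null
for i in range(max_parts):
    df = df.withColumn('part_{}'.format(i), split_col.getItem(i))

# for just the last element, Spark 2.4+ also offers F.element_at(split_col, -1)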
1
did you find a solution for the general uneven case?
– Ezer K
Apr 5 '17 at 10:37
unfortunately I never did.
– Peter Gaultney
Apr 6 '17 at 12:10
2
ended up using a python loop, i.e. for i in range(max_len_of_split): df = df.withColumn('col' + str(i), split_col.getItem(i))
– Ezer K
Apr 6 '17 at 21:39
1
Is there a way to put the remaining items in a single column? i.e. split_col.getItem(2 - n) in a third column. I guess something like the above loop to make columns for all items then concatenating them might work, but I don't know if that's very efficient or not.
– Chris
Oct 18 '17 at 19:28
1
Curious - is anybody aware of how to specify taking the last item of the array?
– EntryLevelR
Mar 23 '18 at 21:52
Here's a solution to the general case that doesn't require knowing the length of the array ahead of time, using collect, or using UDFs. Unfortunately this only works for Spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
    [
        [1, 'A, B, C, D'],
        [2, 'E, F, G'],
        [3, 'H, I'],
        [4, 'J']
    ],
    ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f

df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. The first is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr, which allows us to use column values as parameters.
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num column and pivot the DataFrame. Putting that all together, we get:
df.select(
    "num",
    f.split("letters", ", ").alias("letters"),
    f.posexplode(f.split("letters", ", ")).alias("pos", "val")
).drop("val").select(
    "num",
    f.concat(f.lit("letter"), f.col("pos").cast("string")).alias("name"),
    f.expr("letters[pos]").alias("val")
).groupBy("num").pivot("name").agg(f.first("val")).show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+
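If you need to do this for more than one column, the same posexplode-and-pivot chain can be wrapped in a small helper. A sketch under the assumption that the grouping column uniquely identifies a row (the split_to_columns name and the part prefix are just illustrative); it keeps posexplode's own val column instead of re-indexing with expr:

import pyspark.sql.functions as f

def split_to_columns(df, id_col, str_col, sep=', ', prefix='part'):
    # explode each part of the split together with its position in the array
    exploded = df.select(
        id_col,
        f.posexplode(f.split(str_col, sep)).alias('pos', 'val')
    )
    # build a column name per position, then pivot positions into columns
    named = exploded.select(
        id_col,
        f.concat(f.lit(prefix), f.col('pos').cast('string')).alias('name'),
        'val'
    )
    return named.groupBy(id_col).pivot('name').agg(f.first('val'))

# e.g. split_to_columns(df, 'num', 'letters') gives a wide result like the
# pivoted table above, with columns part0, part1, ...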
answered Aug 3 '18 at 21:29 by pault
I found a solution for the general uneven case (or when you already have the nested array column, obtained with the .split() function):
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType

@f.udf(StructType([StructField('col_3', StringType(), True),
                   StructField('col_4', StringType(), True)]))
def splitCols(array):
    # first element goes to col_3; the remaining elements are concatenated into col_4
    return array[0], ''.join(array[1:])

df = (df.withColumn('name', splitCols(f.split(f.col('my_str_col'), '-')))
        .select(df.columns + ['name.*']))
Basically, you just need to select all the pre-existing columns plus the nested ones ('column_name.*'), and you will get them as two top-level columns in this case.
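Applied to the sample data from the question, this should give something like the following (assuming the struct fields are named col_3 and col_4 as in the UDF's schema):

col1 | my_str_col | col_3 | col_4
-----+------------+-------+------
18   | 856-yygrm  | 856   | yygrm
201  | 777-psgdg  | 777   | psgdg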
edited Nov 15 '18 at 2:46, answered Nov 15 '18 at 2:33 by Jasminyas