apache-spark: How do I split a pipe-delimited column into multiple rows?



I have a DataFrame that contains the following:

movieId / movieName / genre
1         example1    action|thriller|romance
2         example2    fantastic|action

I want to obtain a second DataFrame (derived from the first one) that contains the following:

movieId / movieName / genre
1         example1    action
1         example1    thriller
1         example1    romance
2         example2    fantastic
2         example2    action

How can I do this?

2 Answers


I would use the standard split function together with explode.

scala> movies.show(truncate = false)
+-------+---------+-----------------------+
|movieId|movieName|genre                  |
+-------+---------+-----------------------+
|1      |example1 |action|thriller|romance|
|2      |example2 |fantastic|action       |
+-------+---------+-----------------------+

scala> movies.withColumn("genre", explode(split($"genre", "[|]"))).show
+-------+---------+---------+
|movieId|movieName|    genre|
+-------+---------+---------+
|      1| example1|   action|
|      1| example1| thriller|
|      1| example1|  romance|
|      2| example2|fantastic|
|      2| example2|   action|
+-------+---------+---------+

// Alternatively, escape the pipe as \\| instead of using a character class
scala> movies.withColumn("genre", explode(split($"genre", "\\|"))).show
+-------+---------+---------+
|movieId|movieName|    genre|
+-------+---------+---------+
|      1| example1|   action|
|      1| example1| thriller|
|      1| example1|  romance|
|      2| example2|fantastic|
|      2| example2|   action|
+-------+---------+---------+
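
The second argument of split is a Java regular expression, so the pipe has to be escaped or wrapped in a character class. A bare "|" is an alternation between two empty patterns and matches at every position. The following is a minimal plain-Scala sketch of the difference (no Spark needed, since String.split goes through the same java.util.regex engine). The object name PipeSplitDemo and its field names are made up for illustration.

```scala
import java.util.regex.Pattern

// Hypothetical demo object; only String.split and Pattern.quote are real APIs.
object PipeSplitDemo {
  val genres = "action|thriller|romance"

  // Character class: inside [...] the pipe is a literal character.
  val byCharClass: Array[String] = genres.split("[|]")

  // Backslash escape: "\\|" in Scala source is the regex \| (a literal pipe).
  val byEscape: Array[String] = genres.split("\\|")

  // Pattern.quote builds a literal pattern for an arbitrary delimiter.
  val byQuote: Array[String] = genres.split(Pattern.quote("|"))

  // Bare pipe: matches the empty string everywhere, so the input is
  // broken up character by character and the pipes survive as elements.
  val byBarePipe: Array[String] = genres.split("|")

  def main(args: Array[String]): Unit = {
    println(byCharClass.mkString(", ")) // action, thriller, romance
    println(byEscape.mkString(", "))    // action, thriller, romance
    println(byQuote.mkString(", "))     // action, thriller, romance
    println(byBarePipe.length)          // far more than 3 elements
  }
}
```

Either "[|]" or "\\|" works as the pattern; the character class avoids having to think about how many backslashes the surrounding string literal needs.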

Alternatively, you can use Dataset.flatMap to get the same result, which I believe Scala developers will prefer.
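
The Dataset.flatMap approach is only mentioned in passing, so here is one way it might look. This is a sketch under assumptions: the case class Movie, the object FlatMapGenres, the helper splitGenres, and the local[*] master are all invented for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the columns in the question.
// It is declared at top level so Spark can derive an Encoder for it.
case class Movie(movieId: Int, movieName: String, genre: String)

object FlatMapGenres {
  // Pure helper: one output Movie per pipe-separated genre.
  def splitGenres(m: Movie): Seq[Movie] =
    m.genre.split("[|]").toSeq.map(g => m.copy(genre = g))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatMap-genres")
      .getOrCreate()
    import spark.implicits._

    val movies: Dataset[Movie] = Seq(
      Movie(1, "example1", "action|thriller|romance"),
      Movie(2, "example2", "fantastic|action")
    ).toDS()

    // Each input row expands into as many rows as it has genres.
    val exploded: Dataset[Movie] = movies.flatMap(m => splitGenres(m))
    exploded.show(truncate = false)

    spark.stop()
  }
}
```

Keeping the splitting logic in a plain function makes it easy to unit test without a SparkSession. The trade-off is that a typed flatMap is opaque to the query optimizer, whereas explode(split(...)) stays within the built-in column functions.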



Using an RDD

val df = Seq((1,"example1","action|thriller|romance"),(2,"example2","fantastic|action")).toDF("Id","name","genre")
df.rdd.flatMap( x=>{ val p = x.getAs[String]("genre"); for { a <- p.split("[|]") } yield (x(0),x(1),a)} ).foreach(println)

Result (the row order can vary, because foreach runs on each partition independently):

(1,example1,action)
(2,example2,fantastic)
(1,example1,thriller)
(2,example2,action)
(1,example1,romance)

Converting back to a DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}
val rdd1 = df.rdd.flatMap( x=>{ val p = x.getAs[String]("genre"); for { a <- p.split("[|]") } yield Row(x(0),x(1),a)} )
spark.createDataFrame(rdd1,df.schema.copy(Array(StructField("Id",IntegerType),StructField("name",StringType))).add(StructField("genre2",StringType))).show(false)
