scala - How to print the elements of a specific RDD partition in Spark?


How do I print the elements of a single partition by itself, say the 5th one?

val distData = sc.parallelize(1 to 50, 10)
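For context, `sc.parallelize` splits the sequence into evenly sized contiguous slices, so with 1 to 50 over 10 partitions the 5th partition (0-indexed) holds 26 to 30. Here is a minimal plain-Scala sketch of that slicing, no Spark required (the `slice` helper is hypothetical, mirroring the even-split behavior of Spark's ParallelCollectionRDD):

```scala
// Hypothetical helper that mimics how sc.parallelize(data, numSlices)
// divides a sequence into numSlices contiguous partitions.
def slice(data: Seq[Int], numSlices: Int): Seq[Seq[Int]] = {
  val n = data.length
  (0 until numSlices).map { i =>
    val start = i * n / numSlices      // inclusive start index of slice i
    val end = (i + 1) * n / numSlices  // exclusive end index of slice i
    data.slice(start, end)
  }
}

val parts = slice(1 to 50, 10)
println(parts(5).mkString(", "))  // partition 5 holds: 26, 27, 28, 29, 30
```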

3 Answers

Using Spark/Scala:

val data = 1 to 50
val distData = sc.parallelize(data,10)
distData.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) => it.toList.map(x => if (index == 5) println(x)).iterator).collect

Produces:

26
27
28
29
30

You can achieve this with the foreachPartition() API.

Here is a Java program that prints the contents of each partition.

JavaRDD<Integer> myArray = context.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
JavaRDD<Integer> partitionedArray = myArray.repartition(2);

System.out.println("partitioned array size is " + partitionedArray.count());

partitionedArray.foreachPartition(new VoidFunction<Iterator<Integer>>() {
    public void call(Iterator<Integer> arg0) throws Exception {
        while (arg0.hasNext()) {
            System.out.println(arg0.next());
        }
    }
});
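The per-partition loop above can be sketched without a cluster by grouping the data locally (plain Scala; the modulo grouping below is only a stand-in for repartition(2), whose actual element placement Spark decides at runtime):

```scala
// Stand-in for repartition(2): bucket the elements into 2 local groups.
val data = 1 to 9
val partitions: Seq[Seq[Int]] = data.groupBy(_ % 2).values.toSeq

// foreachPartition hands each partition's elements to us as an Iterator;
// drain it the same way the Java VoidFunction above does.
partitions.foreach { part =>
  val it = part.iterator
  while (it.hasNext) println(it.next())
}
```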


Assuming you are doing this only for test purposes, use glom(). See the Spark documentation: https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.RDD.glom

>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
>>> rdd.glom().collect()[1]
[3, 4]

Edit: an example in Scala:

scala> val distData = sc.parallelize(1 to 50, 10)
scala> distData.glom().collect()(4)
res2: Array[Int] = Array(21, 22, 23, 24, 25)
