Python — count the number of lines with fewer than 5 words



Using PySpark, I wrote this code, but I can't figure out what's going wrong.

from pyspark import SparkConf
from pyspark.sql import SparkSession, Row


spark = SparkSession.builder.master("spark://master:7077").appName('test').config(conf=SparkConf()).getOrCreate()
df = spark.read.text('text.txt')
rdd = df.rdd
print(df.count())
rdd1=rdd.filter(lambda line: len((line.split(" "))<5)).collect()
print(rdd1.count())

This is a small part of the error:

Py4JJavaError Traceback (most recent call last)
<ipython-input-48-27233afa0b82> in <module>()
9 rdd = df.rdd
10 print(df.count())
---> 11 rdd1=rdd.filter(lambda line: len((line.split(" "))<5)).collect()
12 print(rdd1.count())
13
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 144.0 failed 1 times, most recent failure: Lost task 0.0 in stage 144.0 (TID 144, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/ff/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1497, in getattr
idx = self.fields.index(item)
ValueError: 'split' is not in list

2 Answers


I think you have some parentheses in the wrong place in this expression:

rdd1=rdd.filter(lambda line: len((line.split(" "))<5)).collect()

Right now you are computing:

len(... < 5)

instead of:

len(...) < 5
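
With the parentheses moved, the filter line would read like this (note this still assumes each element is a plain string; as the next answer shows, df.rdd actually yields Row objects):

rdd1 = rdd.filter(lambda line: len(line.split(" ")) < 5)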


I solved it. The problem was that each element of the RDD is a Row (a list-like object), not a string, so I had to take the first field before splitting:

rdd = rdd.filter(lambda line: len(line[0].split(" ")) < 5).collect()
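
For completeness, a minimal self-contained sketch of the whole job (assumptions: a local text.txt and a local master, with the single text column accessed as row[0]). Note that collect() turns the RDD into a plain Python list, so it is simpler to call count() on the RDD itself:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName('test').getOrCreate()
df = spark.read.text('text.txt')

# Each element of df.rdd is a Row with a single 'value' field holding the line text
short_lines = df.rdd.filter(lambda row: len(row[0].split(" ")) < 5)
print(short_lines.count())  # count on the RDD, before any collect()

The same filter can also stay entirely in the DataFrame API, which sidesteps the Row-vs-string confusion:

from pyspark.sql.functions import split, size, col

print(df.filter(size(split(col("value"), " ")) < 5).count())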
