How do I compare schemas when the column order differs?


Following this question, I am now running the following code:

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);

fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);

finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));

The output is as follows:

root
 |-- A: long (nullable = true)
 |-- B: double (nullable = true)

root
 |-- B: double (nullable = true)
 |-- A: long (nullable = true)

StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false

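Although the columns are in a different order, the two Datasets have exactly the same columns and column types. What kind of comparison is needed here so that they are treated as equal?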

3 Answers

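If the columns are in a different order, the schemas are not the same, even when both have the same number of columns and the same names. If you want to check whether two schemas have the same column names, get the field names of both schemas as lists and write code to compare them. See the Java example below.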

public static void main(String[] args)
{
    // Collect the field names of each schema into a list.
    List<String> firstSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
    List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());

    if (schemasHaveTheSameColumnNames(firstSchema, secondSchema))
    {
        System.out.println("Yes, schemas have the same column names");
    }
    else
    {
        System.out.println("No, schemas do not have the same column names");
    }
}

// Order-insensitive name check. Assumes column names are unique within each schema.
private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
{
    if (firstSchema.size() != secondSchema.size())
    {
        return false;
    }
    // Every name in the second schema must also appear in the first.
    for (String column : secondSchema)
    {
        if (!firstSchema.contains(column))
            return false;
    }
    return true;
}


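This assumes that the column order does not match, that identical names mean identical semantics, and that both sides must have the same number of columns.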

An example in Scala, which you should be able to adapt to Java:

import spark.implicits._
val df = sc.parallelize(Seq(
        ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
        )).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns

val df2 = sc.parallelize(Seq(
("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns

names.sortWith(_ < _) sameElements names2.sortWith(_ < _)

This returns true or false; try it with your own input.
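
The Scala comparison above can be adapted to Java roughly as follows. This is only a minimal sketch on plain string arrays so that it runs without Spark; the class and method names are hypothetical. With Spark, the arrays would come from `Dataset.columns()`. The comparison is case-sensitive.

```java
import java.util.Arrays;

final class SortedColumnNames {
    // Returns true when both arrays hold the same names, ignoring order.
    // Works on copies so the callers' arrays are not reordered.
    static boolean sameColumnNames(String[] first, String[] second) {
        String[] a = first.clone();
        String[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        String[] names = {"c1", "c2", "Val1", "Val2"};
        String[] names2 = {"Val2", "c2", "c1", "Val1"};
        System.out.println(sameColumnNames(names, names2)); // prints true
    }
}
```

Because both arrays are sorted before being compared element by element, duplicate column names are handled as well: the names must match in number as well as in value.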


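Following on from the previous answers, the quickest way to compare the struct fields (columns and types) rather than just the names seems to be the following: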

Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
boolean result = set1.equals(set2);
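
One caveat with the set comparison above: `StructField` equality also covers nullability and metadata, so two schemas that differ only in `nullable` compare as different. If only names and types should matter, one option is to compare name-to-type maps instead. The sketch below uses a hypothetical `Field` record as a stand-in for `StructField` so that it runs without Spark (records need Java 16 or later); with Spark, the values would come from `field.name()` and `field.dataType()`. It assumes column names are unique within a schema.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SchemaByName {
    // Hypothetical stand-in for Spark's StructField: name, type name, nullability.
    record Field(String name, String type, boolean nullable) {}

    // Maps column name -> type name, deliberately dropping nullability.
    static Map<String, String> nameToType(List<Field> fields) {
        Map<String, String> byName = new HashMap<>();
        for (Field f : fields) {
            byName.put(f.name(), f.type());
        }
        return byName;
    }

    // Order-insensitive comparison of names and types only.
    static boolean sameNamesAndTypes(List<Field> first, List<Field> second) {
        return first.size() == second.size()
            && nameToType(first).equals(nameToType(second));
    }

    public static void main(String[] args) {
        List<Field> schema1 = List.of(
            new Field("A", "long", true), new Field("B", "double", true));
        List<Field> schema2 = List.of(
            new Field("B", "double", false), new Field("A", "long", true));
        System.out.println(sameNamesAndTypes(schema1, schema2)); // prints true
    }
}
```

`Map.equals` ignores insertion order, which is what makes the column order irrelevant here.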
