Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
434 views
in Technique[技术] by (71.8m points)

Spark DataFrame Schema Nullable Fields

I wrote the following code in both Scala & Python, however the DataFrame that is returned doesn't appear to apply the non-nullable fields in my schema that I am applying. italianVotes.csv is a csv file with '~' as a separator and four fields. I'm using Spark 2.1.0.

italianVotes.csv

2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0

Scala

import org.apache.spark.sql.types._
val schema =  StructType(
StructField("id", IntegerType, false) ::
StructField("postId", IntegerType, false) ::
StructField("voteType", IntegerType, true) ::
StructField("time", TimestampType, true) :: Nil)

val fileName = "italianVotes.csv"

val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)

italianDF.printSchema()

// output
root
 |-- id: integer (nullable = true)
 |-- postId: integer (nullable = true)
 |-- voteType: integer (nullable = true)
 |-- time: timestamp (nullable = true)

Python

from pyspark.sql.types import *

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("postId", IntegerType(), False),
    StructField("voteType", IntegerType(), True),
    StructField("time", TimestampType(), True),
])

file_name = "italianVotes.csv"

italian_df = spark.read.csv(file_name, schema = schema, sep = "~")

# print schema
italian_df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- postId: integer (nullable = true)
 |-- voteType: integer (nullable = true)
 |-- time: timestamp (nullable = true)

My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

In general Spark Datasets either inherit nullable property from its parents, or infer based on the external data types.

You can argue if it is a good approach or not but ultimately it is sensible. If semantics of a data source doesn't support nullability constraints, then application of a schema cannot either. At the end of the day it is always better to assume that things can be null, than fail on the runtime if this the opposite assumption turns out to be incorrect.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...