We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
The following works fine for writing df to hdfs and reading it back:
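(The write call that produced the hdfs copy is not shown; assuming it is the same call as the local write further down, only with an hdfs path, it would look roughly like this:)
scala> df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")  // assumed reconstruction of the hdfs write step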
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or csv file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
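By contrast, the hdfs copy written above can be listed from the same spark-shell session. A quick sketch using the Hadoop FileSystem API (this check is illustrative, not part of the original run; the successful read-back above suggests it would show part files alongside _SUCCESS):
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)  // default (hdfs) filesystem on EMR
scala> fs.listStatus(new Path("/tmp/topVendors")).foreach(s => println(s.getPath + " " + s.getLen))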
I have tried this maybe twenty times, for both csv and parquet, and on two different EMR servers: the same behavior is exhibited in all cases.
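For completeness, the csv attempt is the same write with the csv format; it would look roughly like this (the output path and options here are illustrative, not necessarily the exact ones used):
scala> df.repartition(1).write.mode("overwrite").option("header", "true").csv("file:///tmp/topVendorsCsv")  // hypothetical path; csv shows the same empty output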
Is this an EMR-specific bug? A more general EC2 bug? Something else? The same code works with Spark on macOS.
In case it matters, here is the versioning info:
Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3