Setup
- Spark spark-2.4.5-bin-without-hadoop, configured with hadoop-2.8.5.
- Spark is running in standalone mode.
- The PySpark application runs in its own container and connects to the master in another container.
What needs to happen?
- The PySpark application should be able to connect to an AWS S3 bucket using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider.
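For context, the end goal is that a read along these lines works (the bucket and key here are made-up placeholders):

# Hypothetical smoke test: read a file from the bucket through the s3a connector.
df = spark.read.text("s3a://my-bucket/some-key.txt")
df.show()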
Attempted solution #1
- Master, workers, and the PySpark application all have the following jars at /:
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
aws-java-sdk-1.10.6.jar
jackson-core-2.2.3.jar
aws-java-sdk-s3-1.10.6.jar
jackson-databind-2.2.3.jar
jackson-annotations-2.2.3.jar
joda-time-2.9.4.jar
- The PySpark application is configured the following way:
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder
# Prefer the jars supplied below over the classes that ship with Spark.
builder = builder.config("spark.executor.userClassPathFirst", "true")
builder = builder.config("spark.driver.userClassPathFirst", "true")
class_path = "/hadoop-aws-2.8.5.jar:/hadoop-common-2.8.5.jar:/aws-java-sdk-1.10.6.jar:/jackson-core-2.2.3.jar:/aws-java-sdk-s3-1.10.6.jar:/jackson-databind-2.2.3.jar:/jackson-annotations-2.2.3.jar:/joda-time-2.9.4.jar"
builder = builder.config("spark.driver.extraClassPath", class_path)
builder = builder.config("spark.executor.extraClassPath", class_path)
# Use temporary (session) credentials taken from the environment.
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
spark = builder.getOrCreate()
Output
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.util.KerberosUtil.hasKerberosTicket(Ljavax/security/auth/Subject;)Z
....
- This was happening because hadoop-auth was missing. I downloaded hadoop-auth and got another error, which led me to conclude that I would need to provide all the dependent jars.
- Instead of going through Maven manually, I imported hadoop-aws into IntelliJ IDEA to see whether it would pull in the other dependencies, and it gave me this list. (I don't know if there is a mvn command that would do the same thing.)
accessors-smart-1.2.jar
activation-1.1.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
asm-3.1.jar
asm-5.0.4.jar
avro-1.7.4.jar
aws-java-sdk-core-1.10.6.jar
aws-java-sdk-kms-1.10.6.jar
aws-java-sdk-s3-1.10.6.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2.jar
commons-codec-1.4.jar
commons-collections-3.2.2.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-logging-1.1.3.jar
commons-math3-3.1.1.jar
commons-net-3.1.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
gson-2.2.4.jar
guava-11.0.2.jar
hadoop-annotations-2.8.5.jar
hadoop-auth-2.8.5.jar
hadoop-aws-2.8.5.jar
hadoop-common-2.8.5.jar
htrace-core4-4.0.1-incubating.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
jackson-annotations-2.2.3.jar
jackson-core-2.2.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.2.3.jar
jackson-jaxrs-1.8.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-xc-1.8.3.jar
java-xmlbuilder-0.4.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jcip-annotations-1.0-1.jar
jersey-core-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jets3t-0.9.0.jar
jettison-1.1.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
joda-time-2.9.4.jar
jsch-0.1.54.jar
json-smart-2.3.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
log4j-1.2.17.jar
netty-3.7.0.Final.jar
nimbus-jose-jwt-4.41.1.jar
paranamer-2.3.jar
protobuf-java-2.5.0.jar
servlet-api-2.5.jar
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
snappy-java-1.0.4.1.jar
stax-api-1.0-2.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.6.jar
- I suspect these need to be included in the PySpark application as well. Downloading each .jar separately would take too long and is probably not the way to go.
Attempted solution #2
import os
from pyspark.sql import SparkSession

builder = SparkSession.builder
# Let Spark resolve hadoop-aws and the AWS SDK (plus transitive dependencies) via Ivy.
builder = builder.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.5,org.apache.hadoop:hadoop-common:2.8.5,com.amazonaws:aws-java-sdk-s3:1.10.6")
builder = builder.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
builder = builder.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
builder = builder.config("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
builder = builder.config("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
builder = builder.config("spark.hadoop.fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
spark = builder.getOrCreate()
- Instead of providing the jars myself, I use --packages (spark.jars.packages), which installs the transitive dependencies as well. However, I end up with the following error when trying to read from S3:
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
...
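To double-check what --packages actually resolved, I believe the downloaded jars land in the local Ivy cache (assuming the default spark.jars.ivy location; this is just a sketch):

import glob, os
# List the jars that Spark's Ivy resolution fetched for --packages.
for jar in sorted(glob.glob(os.path.expanduser("~/.ivy2/jars/*.jar"))):
    print(jar)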
My research
From that, I conclude that my problem has to do with the hadoop-aws dependencies. However, I'm having trouble verifying this: the Hadoop version running with Spark matches the version of the jars I use when connecting, and I'm providing all the dependencies.
- Any tools or commands to run for debugging this issue would be helpful.
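One check I'm considering (a py4j sketch, untested): asking the JVM which jar a suspect class was actually loaded from, to spot a version mixup between Spark's Hadoop classes and the jars I supply:

# Print the jar (or directory) a class was loaded from, via the driver JVM.
jvm = spark.sparkContext._jvm
cls = jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AInstrumentation")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())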
question from:
https://stackoverflow.com/questions/65909273/python-connecting-to-spark-in-standalone-mode-and-accessing-s3-with-s3afilesyste