scala - Hash function in spark

Question


asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to add a column to a DataFrame that contains a hash of another column.

I've found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax outside spark-shell
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

But which hash function does hash() actually use? Is it Murmur, SHA, MD5, or something else?

The value I get in this column is an integer, so the range of values is presumably [-2^31 ... 2^31 - 1].
Can I get a long value instead? Can I get a string hash?
How can I specify a concrete hashing algorithm?
Can I use a custom hash function?


1 Answer

深蓝 · Answer 1 · 2021-10-23T17:59:16+0000

It is Murmur3, judging by the source code of hash() in org.apache.spark.sql.functions:

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
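
The remaining questions (long values, string hashes, a specific algorithm, a custom function) can be approached roughly as sketched below. This is a minimal sketch, not a definitive recipe: it assumes Spark 3.0 or later for xxhash64, a local SparkSession, and a small in-memory DataFrame standing in for the Parquet input. The names HashColumnsSketch and sha256ToLong are hypothetical helpers invented for this illustration; md5, sha2 and xxhash64 are existing functions in org.apache.spark.sql.functions.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HashColumnsSketch {

  // Hypothetical custom hash: SHA-256 of the UTF-8 bytes, with the first
  // 8 bytes of the digest packed (big-endian) into a Long.
  def sha256ToLong(s: String): Long = {
    val digest = MessageDigest
      .getInstance("SHA-256")
      .digest(s.getBytes(StandardCharsets.UTF_8))
    ByteBuffer.wrap(digest).getLong
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session purely for demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hash-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame read from Parquet in the question.
    val df = Seq("a", "b", "c").toDF("my_column")

    // Wrap the custom function as a UDF; Option maps null input to null output.
    val customHash = udf((s: String) => Option(s).map(sha256ToLong))

    val result = df
      .withColumn("murmur3_int", hash($"my_column"))       // IntegerType
      .withColumn("xxhash64_long", xxhash64($"my_column")) // LongType, Spark 3.0+
      .withColumn("md5_hex", md5($"my_column"))            // hex string
      .withColumn("sha256_hex", sha2($"my_column", 256))   // hex string
      .withColumn("custom_long", customHash($"my_column")) // LongType via UDF

    result.printSchema()
    result.show(truncate = false)

    spark.stop()
  }
}
```

For non-string columns, casting to string first (for example $"my_column".cast("string")) is one way to feed md5 or sha2. A UDF is generally slower than the built-in functions, so it is probably worth reaching for only when none of the built-ins fits.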
