You have two options:
1. Create a singleton object with a lazy val representing the data:
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process
  }
}
Then you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcast is performed for data.
elementsRDD.map { element =>
  // use WekaModel.data here
}
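The once-per-JVM behaviour can be checked without a cluster. The following is a minimal plain-Scala sketch, not Spark code: LazyModel, its initCount counter, and the Map standing in for the model are all hypothetical names introduced for illustration, and the threads only simulate several tasks sharing one executor JVM.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LazyModel {
  // Counts how many times the initializer body actually runs.
  val initCount = new AtomicInteger(0)

  // Hypothetical stand-in for an expensive model load.
  lazy val data: Map[String, Double] = {
    initCount.incrementAndGet()
    Map("bias" -> 0.5, "weight" -> 2.0)
  }
}

object LazyModelDemo {
  // Simulates four tasks running concurrently in the same JVM.
  def run(): (List[Double], Int) = {
    val out = new Array[Double](4)
    val threads = (1 to 4).map { i =>
      new Thread(() => {
        val d = LazyModel.data // first access triggers initialization
        out(i - 1) = d("weight") * i + d("bias")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join()) // join makes the writes to `out` visible
    (out.toList, LazyModel.initCount.get())
  }
}
```

Calling LazyModelDemo.run() should report an initialization count of 1 no matter how many threads touched LazyModel.data, because Scala lazy vals are initialized in a thread-safe way.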
Advantages
- More efficient, since the data is initialized only once per JVM instance. This approach is a good choice when, for example, you need to initialize a database connection pool.
Disadvantages
- Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
- You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map. This allows you to initialize whatever you need once for the entire partition.
elementsRDD.mapPartitions { elements =>
  // here WekaModel is a class you instantiate, not the singleton object from option 1
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. There is a single instance of model per partition.
  }
}
Advantages
- Provides more flexibility in the initialization and deinitialization of objects.
Disadvantages
- Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
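One thing to watch when you also want to release the object: the iterator handed to the partition function is lazy, so cleaning up before it has been consumed means the object gets used after it was closed. Below is a minimal plain-Scala sketch of one way around that, assuming the partition's results fit in memory. PartitionModel, its close() method, and the created counter are hypothetical names for illustration, and the two hand-built iterators only simulate partitions.

```scala
// Hypothetical per-partition resource; stands in for a model or connection.
final class PartitionModel(val scale: Int) {
  private var closed = false
  def predict(x: Int): Int = {
    require(!closed, "model used after close()")
    x * scale
  }
  def close(): Unit = closed = true
}

object PartitionDemo {
  // Number of models created, to show one instance per partition.
  var created = 0

  // Has the Iterator[T] => Iterator[U] shape that mapPartitions expects.
  def processPartition(elements: Iterator[Int]): Iterator[Int] = {
    val model = new PartitionModel(scale = 10)
    created += 1
    try {
      // Force evaluation before releasing the model; a lazy iterator
      // would otherwise call predict() after close().
      elements.map(model.predict).toList.iterator
    } finally {
      model.close()
    }
  }

  // Simulates two partitions without a Spark cluster.
  def run(): List[Int] = {
    val partitions = List(Iterator(1, 2, 3), Iterator(4, 5))
    partitions.flatMap(p => processPartition(p).toList)
  }
}
```

In Spark, a function of this shape would presumably be passed as elementsRDD.mapPartitions(PartitionDemo.processPartition). Materializing with toList trades memory for a safe place to call close(); for large partitions a different cleanup strategy may be preferable.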