UPDATE for version >= 2.0.0
As of version 2.0.0, as you can see here, featureImportances is available for Random Forest.
In fact, you can read here that:
The DataFrame API supports two major tree ensemble algorithms: Random Forests and Gradient-Boosted Trees (GBTs). Both use spark.ml decision trees as their base models.
Users can find more information about ensemble algorithms in the MLlib Ensemble guide.
In this section, we demonstrate the DataFrame API for ensembles.
The main differences between this API and the original MLlib ensembles API are:
- support for DataFrames and ML Pipelines
- separation of classification vs. regression
- use of DataFrame metadata to distinguish continuous and categorical features
- more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.
If you want feature importance values, you have to work with the ml package (pyspark.ml), not mllib, and use DataFrames.
Below is an example, which you can find here:
# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier
# PREPARE DATA
>>> df = spark.createDataFrame([
... (1.0, Vectors.dense(1.0)),
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0})
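The vector returned by featureImportances is indexed by position in the features column, so it is often handy to pair each value with a column name. Below is a minimal sketch of that idea. The helper rank_importances and the feature names are hypothetical (they are not part of Spark), and plain numbers stand in for the model output so the snippet runs without a Spark session. With a fitted model you would pass model.featureImportances.toArray() instead.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, highest first.

    Hypothetical helper, not part of pyspark. `importances` can be any
    sequence of numbers, e.g. model.featureImportances.toArray().
    """
    values = [float(v) for v in importances]
    if len(feature_names) != len(values):
        raise ValueError("expected exactly one name per feature")
    return sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)


# Made-up names and scores, used only to illustrate the call.
ranked = rank_importances(["age", "income", "clicks"], [0.2, 0.5, 0.3])
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The names must be listed in the same order as the columns that were assembled into the features vector, otherwise the pairing is meaningless.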