Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.9k views
in Technique[技术] by (71.8m points)

scala - Predict Class Probabilities in Spark RandomForestClassifier

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predict probabilities from the models but I only saw prediction classes instead of the probabilities. According to this issue link, the issue is resolved and it leads to this github pull request and this. However, It seems it's resolved in the version 1.5. I'm using the AWS EMR which provides Spark 1.4.1 and sill have no idea how to get the predict probabilities. If anyone knows how to do it, please share your thought or solutions. Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I have already answered a similar question before.

Unfortunately, with MLLIb you can't get the probabilities per instance for classification models till version 1.4.1.

There is JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic which is IN PROGRESS as I'm writing the answer now. Nevertheless, the issue seems to be on hold since November 2014

There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label.

And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:

This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.

Reference : source.

This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.

Concerning AWS, there is not much you can do now for that. A solution might be if you can fork the emr-bootstrap-actions for spark and configure it for you needs, then you'll be able to install Spark on AWS using the bootstrap step.

Nevertheless, this might seem a little complicated.

There is some thing you might need to consider :

  • update the spark/config.file to install you spark-1.5. Something like :

    +3  1.5.0   python  s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz
    
  • this file list above must be a proper build of spark located in an specified s3 bucket you own for the time being.

  • To build your spark, I advice you reading about it in the examples section about building-spark-for-emr and also the official documentation. That should be about it! (I hope I haven't forgotten anything)

EDIT : Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...