Classification in Spark 2.0: “Input validation failed” and other wondrous tales

Christos - Iraklis Tsatsoulis | Data Science, Spark | 7 Comments

Spark 2.0 was released last July but, despite its numerous improvements and new features, several annoyances remain and can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as …

How to use SparkR in Cloudera Hadoop

Christos - Iraklis Tsatsoulis | Big Data, R, Spark | 20 Comments

Suppose you are an avid R user and would like to use SparkR in Cloudera Hadoop. Unfortunately, as of the latest CDH version (5.7), SparkR is still not supported (and, according to a recent discussion in the Cloudera forums, we shouldn’t expect this to change anytime soon). Is there anything you can do? Well, indeed there is. In this …