Classification in Spark 2.0: “Input validation failed” and other wondrous tales

Christos - Iraklis Tsatsoulis | Data Science, Spark

Spark 2.0 was released last July but, despite the numerous improvements and new features, several annoyances remain that can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as we’ll see, the issue persists in Spark 2.0).

We’ll first try a simple binary classification problem in PySpark using Spark MLlib, but, before doing so, let’s have a look at the current status of the machine learning APIs in Spark 2.0.

Spark MLlib was the older machine learning API for Spark, intended to be gradually replaced by the newer Spark ML library; in Spark 2.0 this terminology has changed (enough, in my opinion, to cause unnecessary confusion): the whole machine learning functionality is now termed “MLlib”, with the old MLlib becoming the so-called “RDD-based API” and the (old) Spark ML library becoming the “MLlib DataFrame-based API”. The older, RDD-based API has now entered maintenance mode, heading for gradual deprecation.

Truth is, whatever Databricks and the Spark architects may like to believe, there is some essential machine learning functionality which is still only available in the old MLlib RDD-based API, good examples being multinomial logistic regression and SVM models.

Now that we have hopefully justified our interest in the “old” RDD-based API, let us proceed to our first example.

(In the code snippets below, pyspark.mllib corresponds to the old, RDD-based API, while pyspark.ml corresponds to the new DataFrame-based API.)

The following code is slightly adapted from the documentation example of logistic regression in PySpark (can you spot the difference?):

>>> print spark.version
2.0.0
>>> from pyspark.mllib.classification import LogisticRegressionModel, LogisticRegressionWithSGD
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = [
...     LabeledPoint(2.0, [0.0, 1.0]),
...     LabeledPoint(1.0, [1.0, 0.0])]
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10)
[...]
: org.apache.spark.SparkException: Input validation failed.

As we can see, this simple code snippet gives the super-informative error “Input validation failed”. And the situation is the same with other classifiers – here is an SVM with the same data:

>>> from pyspark.mllib.classification import SVMModel, SVMWithSGD
>>> svm = SVMWithSGD.train(sc.parallelize(data), iterations=10)
[...]
: org.apache.spark.SparkException: Input validation failed.

No matter how long you stare at the relevant PySpark documentation, you will not find the cause of the error, simply because it is not documented – to get a hint, you have to dig into the Scala (!) source code of LogisticRegressionWithSGD, where it is mentioned:

 * NOTE: Labels used in Logistic Regression should be {0, 1}

So, in binary classification, your labels cannot be whatever you like; they need to be {0, 1} or {0.0, 1.0}:

>>> data = [
...     LabeledPoint(0.0, [0.0, 1.0]), # changed the label here from 2.0 to 0.0
...     LabeledPoint(1.0, [1.0, 0.0])]
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10) 
>>> lrm.predict([0.2, 0.5])
0
>>> svm = SVMWithSGD.train(sc.parallelize(data), iterations=10)
>>> svm.predict([0.2, 0.5])
0

A similar (and again undocumented) constraint applies to multi-class classification, where for k classes you must use the labels {0, 1, …, k-1}.
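For instance, multinomial logistic regression – available in the RDD-based API via LogisticRegressionWithLBFGS – will train a 3-class model only if the labels are exactly {0.0, 1.0, 2.0}; here is a minimal sketch with made-up toy data (labels like {1.0, 2.0, 3.0} trigger the same “Input validation failed” error):

>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> multi_data = [
...     LabeledPoint(0.0, [0.0, 1.0]),  # for numClasses=3 the labels
...     LabeledPoint(1.0, [1.0, 0.0]),  # must be exactly {0.0, 1.0, 2.0}
...     LabeledPoint(2.0, [1.0, 1.0])]
>>> mlr = LogisticRegressionWithLBFGS.train(sc.parallelize(multi_data), iterations=10, numClasses=3)  # trains OK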

Interestingly enough, SparkR does not suffer from such a limitation; try the following code interactively from RStudio (where we use the iris dataset with one class removed, since spark.glm does not support multi-class classification):

library(SparkR, lib.loc = "/home/ctsats/spark-2.0.0-bin-hadoop2.6/R/lib") # change the path accordingly here

sparkR.session(sparkHome = "/home/ctsats/spark-2.0.0-bin-hadoop2.6")      # and here

df <- as.DataFrame(iris[1:100,]) # keep 2 classes only
head(df)

model <- spark.glm(df, Species ~ .,  family="binomial")
summary(model)

pred <- predict(model, df)
showDF(pred, 100, truncate = FALSE)

sparkR.session.stop()

Since we are interested in the (very) basic functionality only, we don’t bother splitting into training and test sets – we just run the predict function on the dataset we used for training; here is the partial output from showDF:

+------------+-----------+------------+-----------+----------+-----+----------------------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species   |label|prediction            |
+------------+-----------+------------+-----------+----------+-----+----------------------+
|5.1         |3.5        |1.4         |0.2        |setosa    |1.0  |0.9999999999999999    |
|4.9         |3.0        |1.4         |0.2        |setosa    |1.0  |0.999999999999992     |
|4.7         |3.2        |1.3         |0.2        |setosa    |1.0  |0.999999999999998     |
|4.6         |3.1        |1.5         |0.2        |setosa    |1.0  |0.9999999999994831    |
[...]
|5.7         |2.9        |4.2         |1.3        |versicolor|0.0  |1.0E-16               |
|6.2         |2.9        |4.3         |1.3        |versicolor|0.0  |1.0E-16               |
|5.1         |2.5        |3.0         |1.1        |versicolor|0.0  |1.9333693386853254E-10|
|5.7         |2.8        |4.1         |1.3        |versicolor|0.0  |1.0E-16               |
+------------+-----------+------------+-----------+----------+-----+----------------------+

from which it is apparent that SparkR has internally performed the label mapping setosa -> 1.0 and versicolor -> 0.0 for us.
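(For the record, PySpark does not do this mapping for you; you have to index the string labels yourself, e.g. with a StringIndexer – a minimal sketch with a made-up toy DataFrame; note that StringIndexer assigns indices by descending label frequency, so the particular 0/1 assignment need not coincide with the one chosen by SparkR above.)

>>> from pyspark.ml.feature import StringIndexer
>>> toy = sqlContext.createDataFrame(
...     [("setosa",), ("versicolor",), ("setosa",)], ["Species"])
>>> indexer = StringIndexer(inputCol="Species", outputCol="label")
>>> indexed = indexer.fit(toy).transform(toy)  # adds a double-typed 'label' column with values in {0.0, 1.0}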

There is an explanation for this difference in behavior: under the hood, and unlike the PySpark examples shown above, SparkR uses the newer DataFrame-based API for its machine learning functionality; so, let’s have a quick look at this API from a PySpark point of view as well.

We recreate our first code snippet above, with the data now as a DataFrame instead of LabeledPoints:

>>> print spark.version
2.0.0
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (2.0, Vectors.dense(0.0, 1.0)),
...     (1.0, Vectors.dense(1.0, 0.0))], 
...     ["label", "features"])
>>> df.show()
+-----+---------+
|label| features|
+-----+---------+
|  2.0|[0.0,1.0]|
|  1.0|[1.0,0.0]|
+-----+---------+
>>> lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
>>> model = lr.fit(df)
[...]
: org.apache.spark.SparkException: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset.

Well… as you can see, our classifier complains that it has found 3 classes in the data, despite this evidently not being the case…

Changing the labels above from {1.0, 2.0} to {0.0, 1.0} resolves the issue (presumably because the number of classes is inferred from the maximum label value, so labels {1.0, 2.0} imply the three classes {0, 1, 2}); see the quick sketch right below. Again, this requirement is documented nowhere in PySpark, and the error message does little to help locate the actual problem.
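For completeness, here is the recoded snippet (a quick sketch, continuing the same session as above):

>>> df = sqlContext.createDataFrame([
...     (0.0, Vectors.dense(0.0, 1.0)),  # label 2.0 recoded to 0.0
...     (1.0, Vectors.dense(1.0, 0.0))],
...     ["label", "features"])
>>> model = lr.fit(df)  # no exception this time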

And here is our last take on weird, counter-intuitive, and undocumented features of the new DataFrame-based machine learning API…

One could easily argue that encoding class labels (i.e. what is actually a factor) as floating-point numbers is unnatural; and indeed, the old, RDD-based API permits the more natural choice of encoding the class labels as integers instead:

>>> print spark.version
2.0.0
>>> from pyspark.mllib.classification import LogisticRegressionModel, LogisticRegressionWithSGD
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = [
...     LabeledPoint(0, [0.0, 1.0]), # integer labels instead
...     LabeledPoint(1, [1.0, 0.0])] # of float
>>> lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10) 
>>> lrm.predict([0.2, 0.5])
0

What’s more, the binary LogisticRegression classifier from the new, DataFrame-based API also allows for integer labels:

>>> print spark.version
2.0.0
>>> from pyspark.ml.classification import LogisticRegression
>>> from pyspark.ml.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (0, Vectors.dense(0.0, 1.0)),  # integer labels instead
...     (1, Vectors.dense(1.0, 0.0))], # of float
...     ["label", "features"])
>>> df.show()
+-----+---------+
|label| features|
+-----+---------+
|    0|[0.0,1.0]|
|    1|[1.0,0.0]|
+-----+---------+
>>> lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
>>> model = lr.fit(df) # works OK

But here is what happens when we try a DecisionTreeClassifier from the very same module (namely pyspark.ml.classification):

>>> print spark.version
2.0.0
>>> from pyspark.ml.classification import DecisionTreeClassifier
>>> df = sqlContext.createDataFrame([
...     (0, Vectors.dense(0.0, 1.0)),  # integer labels instead
...     (1, Vectors.dense(1.0, 0.0))], # of float
...     ["label", "features"])
>>> dt = DecisionTreeClassifier(maxDepth=2, labelCol="label")
>>> model = dt.fit(df) 
[...]
: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double

As you might have guessed by now, changing the labels back to floating-point numbers resolves the issue – but if you expect to find a reference in the relevant PySpark documentation, or at least a hint in the general Spark Machine Learning Library Guide, well, good luck…
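For the record, here is a minimal sketch of the fix – either recreate the DataFrame with float labels, or cast the existing integer label column to double (the cast is plain Spark SQL, nothing MLlib-specific):

>>> df = sqlContext.createDataFrame([
...     (0.0, Vectors.dense(0.0, 1.0)),  # float labels again
...     (1.0, Vectors.dense(1.0, 0.0))],
...     ["label", "features"])
>>> model = dt.fit(df)  # trains without the ClassCastException
>>> # alternatively, keep the integer-labeled DataFrame and cast the column:
>>> # df = df.withColumn("label", df["label"].cast("double"))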

* * *

I will argue that

  • Such unexpected and counter-intuitive behavior in Spark abounds
  • The documentation, especially for the Python API (PySpark), is hopelessly uninformative on such issues
  • This can cause considerable strain and frustration to novice and seasoned data scientists alike, especially since such users are naturally expected to rely on PySpark rather than on the Scala or Java APIs

Consider the following issue:

>>> print spark.version
2.0.0
>>> from pyspark.ml.linalg import Vectors
>>> x = Vectors.dense([0.0, 1.0])
>>> x
DenseVector([0.0, 1.0])
>>> -x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: func() takes exactly 2 arguments (1 given)
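(For what it is worth, a workaround – just a sketch, relying on the fact that toArray() returns a NumPy array and that the binary operators of DenseVector are delegated to NumPy, which is exactly why only the unary minus misbehaves:)

>>> y = Vectors.dense(-x.toArray())  # negate via NumPy, then rebuild a DenseVector
>>> z = x * (-1.0)                   # binary operators are delegated and do work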

I have written in the past about this, but back then it concerned the “old” pyspark.mllib.linalg module; the reaction from the Spark community clearly implied something along the lines of “It is well-known that […]”; as I counter-argued (after I did my research), since it seems not to be so well-known, we might even consider adding it to the documentation – so I opened a documentation issue in Spark JIRA. Not only does it remain unresolved but, as I have just shown above, the same behavior has been inherited by the newer pyspark.ml.linalg module, again without any relevant mention in the documentation.

Wondrous tales indeed…

