Limitations of Spark MLlib linear algebra module

Christos - Iraklis Tsatsoulis | Spark

A couple of days ago I stumbled upon some unexpected behavior in Spark MLlib (v. 1.5.2) while trying some ultra-simple operations on vectors. Consider the following PySpark snippet:

>>> from pyspark.mllib.linalg import Vectors
>>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
>>> x
DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
>>> -x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: func() takes exactly 2 arguments (1 given)
>>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0])
>>> y
DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
>>> x-y
DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0])
>>> -y+x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: func() takes exactly 2 arguments (1 given)
>>> -1*x
DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])

Clearly, the unary operator - (minus) fails for vectors, giving errors for expressions like -x and -y+x, although the binary subtraction x-y behaves as expected. The result of the last operation, -1*x, although mathematically “correct”, shows the zero entries as -0.0 (the floating-point negative zero), which again is not what one would normally expect to see.
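The -0.0 entries are not specific to MLlib; they come from ordinary IEEE 754 floating-point arithmetic. A minimal sketch in plain Python (no Spark involved; the variable names are only illustrative) of how negative zero behaves and one way to get rid of the sign:

```python
import math

# Multiplying a floating-point zero by -1 yields negative zero.
neg_zero = -1 * 0.0
print(neg_zero)  # displayed with a minus sign

# Negative zero compares equal to positive zero...
is_equal = (neg_zero == 0.0)

# ...but it carries a sign bit, which copysign can expose.
has_negative_sign = math.copysign(1.0, neg_zero) < 0

# Adding positive zero gives positive zero, which drops the sign.
normalized = neg_zero + 0.0
normalized_is_positive = math.copysign(1.0, normalized) > 0
```

So the minus signs in the output of -1*x are a display matter rather than a numerical error: the entries still compare equal to 0.0.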

I thought I had discovered a bug, so I opened an issue in Spark JIRA, titled Unary operator “-” fails for MLlib vectors.

One hour later, the issue was closed as “Not an issue”.

Joseph K. Bradley, a frequent Spark contributor working for Databricks, who closed the issue, commented as follows:

There simply isn’t a unary operation. There are ongoing discussions about turning MLlib vectors and matrices into a full-fledged local linear algebra library, but currently, you could convert to numpy/scipy and use those library for pyspark.

That was a surprise. I had definitely performed a fair amount of searching in the documentation and googling before raising the issue, without results; I tried again, hoping that I would now find some trace of the issue or of the “ongoing discussions” mentioned in Bradley’s comment.

This second search attempt was fruitless, just like the first one; I could not find any mention of this supposedly known issue, even in the recently published academic papers on MLlib and its linalg submodule (Bradley is a co-author of the former).

I commented back in the Spark JIRA:

If this is the case, then a warning/clarification in the documentation wouldn’t hurt – Spark users are not supposed to be aware of the internal “ongoing discussions” between Spark developers (BTW, any relevant link would be very welcome – I could not find any mention in MLlib & Breeze docs, neither in the recent preprint papers on linalg & MLlib).
All in all, I suggest you re-open the issue with a different type (it’s not a bug, as you say), and the required resolution being a notification in the relevant docs (“don’t try this…, because…”).

Fortunately, Bradley considered this a good point, so he reopened the issue, changing its title to Document limitations of MLlib local linear algebra.

So, until this clarification finds its way into the documentation, you now know that Spark MLlib’s local linear algebra types are meant to provide only simple functionality, and are not a full-fledged local linear algebra library.
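The workaround suggested in Bradley’s comment, converting to NumPy, can be sketched as follows. This is only an illustration, not an official recipe: it relies on the toArray() method of MLlib vectors, and the try/except fallback to plain NumPy arrays is my own addition so that the snippet also runs on a machine without PySpark installed.

```python
import numpy as np

try:
    # Local MLlib vectors; toArray() returns their contents as a NumPy array.
    from pyspark.mllib.linalg import Vectors
    x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]).toArray()
    y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]).toArray()
except ImportError:
    # Fallback (an assumption for illustration): same values as plain NumPy arrays.
    x = np.array([0.0, 1.0, 0.0, 7.0, 0.0])
    y = np.array([2.0, 0.0, 3.0, 4.0, 5.0])

# Unary minus is defined for NumPy arrays, so both expressions
# that failed on the MLlib vectors now work.
neg_x = -x
diff = -y + x

print(neg_x)
print(diff)
```

If an MLlib vector is needed again afterwards, the resulting array can be passed back to Vectors.dense. Note that NumPy’s negation also yields -0.0 for the zero entries, so the sign display discussed earlier is not avoided by this route.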

Christos - Iraklis Tsatsoulis

Christos - Iraklis is one of our resident Data Scientists. He holds advanced graduate degrees in applied mathematics, engineering, and computing. He has been awarded both Chartered Engineer and Chartered Manager status in the UK, as well as Master status in Kaggle.com due to "consistent and stellar results" in predictive analytics contests.