Limitations of Spark MLlib linear algebra module

Christos - Iraklis Tsatsoulis

A couple of days ago I stumbled upon some unexpected behavior in Spark MLlib (v. 1.5.2) while trying some ultra-simple operations on vectors. Consider the following PySpark snippet:

>>> from pyspark.mllib.linalg import Vectors
>>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
>>> x
DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
>>> -x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: func() takes exactly 2 arguments (1 given)
>>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0])
>>> y
DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
>>> x-y
DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0])
>>> -y+x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: func() takes exactly 2 arguments (1 given)
>>> -1*x
DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])

Clearly, the unary operator - (minus) fails for vectors: expressions like -x and -y+x raise a TypeError, although the binary subtraction x-y behaves as expected. The result of the last operation, -1*x, although mathematically “correct”, shows minus signs on the zero entries, which is not what one would normally expect either.
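As far as I can tell, the minus signs on the zero entries are not specific to MLlib: they look like ordinary IEEE 754 signed zeros, which plain Python floats show as well. The sketch below uses only the standard library (no Spark involved) to illustrate that multiplying a zero by -1 gives a negative zero, which still compares equal to 0.0, and that adding 0.0 brings back the plain zero.

```python
import math

zero = 0.0
neg_zero = -1 * zero  # same scalar operation as one zero entry of -1*x

# A negative zero is displayed with a minus sign...
print(neg_zero)

# ...but it compares equal to the ordinary zero.
print(neg_zero == zero)

# copysign exposes the sign bit that "==" ignores.
print(math.copysign(1.0, neg_zero))

# Adding 0.0 yields a positive zero under the default rounding mode.
print(neg_zero + 0.0)
```

So the values in -1*x are numerically equal to what one would expect; only their printed form differs.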

I thought I had discovered a bug, so I opened an issue in Spark JIRA, titled Unary operator “-” fails for MLlib vectors.

One hour later, the issue was closed as “Not an issue”.

Joseph K. Bradley, a frequent Spark contributor working for Databricks, who had closed the issue, had commented as follows:

There simply isn’t a unary operation. There are ongoing discussions about turning MLlib vectors and matrices into a full-fledged local linear algebra library, but currently, you could convert to numpy/scipy and use those library for pyspark.

That was a surprise. I had definitely performed a fair amount of searching in the documentation and googling before raising the issue, without results; I tried again, hoping that I would now find some trace of the issue or of the “ongoing discussions” mentioned in Bradley’s comment.

This second search attempt was fruitless, just like the first one; I could not find any mention of this supposedly known issue, even in the recently published academic papers on MLlib and its linalg submodule (Bradley is a co-author of the former).

I commented back in the Spark JIRA:

If this is the case, then a warning/clarification in the documentation wouldn’t hurt – Spark users are not supposed to be aware of the internal “ongoing discussions” between Spark developers (BTW, any relevant link would be very welcome – I could not find any mention in MLlib & Breeze docs, neither in the recent preprint papers on linalg & MLlib).
All in all, I suggest you re-open the issue with a different type (it’s not a bug, as you say), and the required resolution being a notification in the relevant docs (“don’t try this…, because…”).

Fortunately, Bradley considered this a good point, so he reopened the issue, changing its title to Document limitations of MLlib local linear algebra.

So, until this clarification finds its way through to the documentation, you now know that Spark MLlib’s local linear algebra types are supposed to provide only simple functionality, without being a full-fledged local linear algebra library.
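In the meantime, the numpy route suggested in the JIRA comment can be sketched roughly as follows. The helper name negated_array is my own (hypothetical, not part of MLlib); it only assumes that the vector exposes a toArray() method returning a numpy array, as MLlib's DenseVector and SparseVector do, and it adds 0.0 so that zero entries do not come back as negative zeros.

```python
import numpy as np


def negated_array(vec):
    """Return the element-wise negation of `vec` as a numpy array.

    `vec` is assumed to be either an object with a toArray() method
    (e.g. an MLlib vector) or anything numpy can convert to an array.
    """
    if hasattr(vec, "toArray"):
        arr = vec.toArray()
    else:
        arr = np.asarray(vec, dtype=float)
    # Negate, then add 0.0 so that -0.0 entries become plain 0.0.
    return -arr + 0.0


# Works on plain sequences too, which makes it easy to try without Spark.
print(negated_array([0.0, 1.0, 0.0, 7.0, 0.0]))
```

In a PySpark session, the result can be wrapped back into an MLlib vector with Vectors.dense(negated_array(x)), so that -y+x from the snippet above would become Vectors.dense(negated_array(y)) + x.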
