Unexpected behavior of Spark dataframe filter method

Christos - Iraklis Tsatsoulis | Big Data, Spark | 4 Comments

[EDIT: Thanks to this post, the issue reported here has been resolved as of Spark 1.4.1 – see the comments below] While writing the previous post on Spark dataframes, I encountered some unexpected behavior in the dataframe .filter method; however, on the one hand, I needed more time to experiment and confirm it and, on the other hand, I knew …

Spark data frames from CSV files: handling headers & column types

Christos - Iraklis Tsatsoulis | Big Data, Spark | 16 Comments

If you come from the R (or Python/pandas) universe, like me, you probably assume that working with CSV files is one of the most natural and straightforward things to do in a data analysis context. Indeed, if you have your data in a CSV file, practically the only thing you have to do from R is to fire …