Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration

Christos - Iraklis TsatsoulisBig Data, Spark 25 Comments

In a previous post, we glimpsed briefly at creating and manipulating Spark dataframes from CSV files. In the couple of months since, Spark has already gone from version 1.3.0 to 1.5, with more than 100 built-in functions introduced in Spark 1.5 alone; so, we thought it is a good time for revisiting the subject, this time also utilizing the external …

Unexpected behavior of Spark dataframe filter method

Christos - Iraklis TsatsoulisBig Data, Spark 4 Comments

[EDIT: Thanks to this post, the issue reported here has been resolved since Spark 1.4.1 – see the comments below] While writing the previous post on Spark dataframes, I encountered an unexpected behavior of the respective .filter method; but, on the one hand, I needed some more time to experiment and confirm it and, on the other hand, I knew …

Spark data frames from CSV files: handling headers & column types

Christos - Iraklis TsatsoulisBig Data, Spark 16 Comments

If you come from the R (or Python/pandas) universe, like me, you must implicitly think that working with CSV files must be one of the most natural and straightforward things to happen in a data analysis context. Indeed, if you have your data in a CSV file, practically the only thing you have to do from R is to fire …