Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration

Christos - Iraklis Tsatsoulis | Big Data, Spark | 25 Comments

In a previous post, we took a brief look at creating and manipulating Spark dataframes from CSV files. In the couple of months since, Spark has already gone from version 1.3.0 to 1.5, with more than 100 built-in functions introduced in Spark 1.5 alone; so, we thought it was a good time to revisit the subject, this time also utilizing the external …

Unexpected behavior of Spark dataframe filter method

Christos - Iraklis Tsatsoulis | Big Data, Spark | 4 Comments

[EDIT: Thanks to this post, the issue reported here has been resolved since Spark 1.4.1 – see the comments below] While writing the previous post on Spark dataframes, I encountered an unexpected behavior of the respective .filter method; but, on the one hand, I needed some more time to experiment and confirm it and, on the other hand, I knew …

Installing rJava R package in Oracle Linux

Christos - Iraklis Tsatsoulis | R | 1 Comment

The Oracle Big Data Lite (BDLite) VM is a handy and convenient platform for testing, development, and training on the related tools and technologies, such as Cloudera Hadoop, Oracle NoSQL database, Oracle SQL Developer & Data Modeler etc. Among other things, it includes a full distribution of the Oracle R Enterprise (ORE) and the Oracle R Connectors for Hadoop (ORCH). …

Spark data frames from CSV files: handling headers & column types

Christos - Iraklis Tsatsoulis | Big Data, Spark | 16 Comments

If you come from the R (or Python/pandas) universe, like me, you probably take it for granted that working with CSV files is one of the most natural and straightforward things in a data analysis context. Indeed, if you have your data in a CSV file, practically the only thing you have to do from R is to fire …

Undocumented behavior of ore.make.names() function in Oracle R Enterprise

Christos - Iraklis Tsatsoulis | Oracle R | 2 Comments

While working with some data in Hive recently using the Oracle R Connectors for Hadoop (ORCH), I tried to use the ore.make.names function (of package OREbase). The function creates valid column names for ore.frame objects. Here is a reproducible example, copied straight from the function documentation: Experimenting a little, I discovered that ore.make.names becomes functional after executing ore.connect. Indeed, …

Oracle R Enterprise issues in Oracle Big Data Lite VM 4.1.0

Christos - Iraklis Tsatsoulis | Oracle R | 4 Comments

In the previous post, we examined some configuration issues with Cloudera Manager and Hadoop services in the latest release of Oracle Big Data Lite VM (4.1.0). In this post we report issues with Oracle R Enterprise, and the remedies we applied. It turns out that if we load the ORE package in R, we subsequently cannot use the help system …

Cloudera Manager configuration issues in Oracle Big Data Lite VM 4.1.0

Christos - Iraklis Tsatsoulis | Big Data, Hadoop | 2 Comments

Oracle has recently announced the release of a new version (4.1.0) of its Big Data Lite VM. Compared to the previous release (4.0.1), we now have more recent versions of Oracle Enterprise Linux (6.5), Oracle NoSQL database (3.2.5), Cloudera distribution of Apache Hadoop (CDH 5.3.0) and Cloudera Manager (5.3.0). The new version of CDH, by itself, also brings forward several …