sparklyr: a test drive on YARN

Christos - Iraklis Tsatsoulis R, Spark 2 Comments

sparklyr is a new R front-end for Apache Spark, developed by the good people at RStudio. It offers much more functionality than the existing SparkR interface by Databricks, allowing dplyr-based data transformations as well as access to the machine learning libraries of both Spark and H2O Sparkling Water. Moreover, the latest RStudio IDE v1.0 now offers native support …

Classification in Spark 2.0: “Input validation failed” and other wondrous tales

Christos - Iraklis Tsatsoulis Data Science, Spark 4 Comments

Spark 2.0 was released last July but, despite the numerous improvements and new features, several annoyances remain and can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as …

Installing the additional R packages in Oracle Big Data Lite VM 4.5.0

Christos - Iraklis Tsatsoulis R 2 Comments

Oracle has just released version 4.5.0 of the Big Data Lite VM which, when it comes to R, still suffers from the issues we had pinpointed for the previous version 4.4.0 (and then some). The first attempt to install the additional packages fails with a ‘cannot open URL’ error: Fortunately, the warning about the proxy helps to locate the issue, …

How to use SparkR in Cloudera Hadoop

Christos - Iraklis Tsatsoulis Big Data, R, Spark 13 Comments

Suppose you are an avid R user, and you would like to use SparkR in Cloudera Hadoop; unfortunately, as of the latest CDH version (5.7), SparkR is still not supported (and, according to a recent discussion in the Cloudera forums, we shouldn’t expect this to happen anytime soon). Is there anything you can do? Well, indeed there is. In this …

Bulk load data to HBase in Oracle Big Data Appliance

Christos - Iraklis Tsatsoulis Big Data, HBase 0 Comments

I ran into an issue recently, while trying to bulk load some data to HBase in Oracle Big Data Appliance. Following is a reproducible description and solution using the current version of Oracle Big Data Lite VM (4.4.0). Enabling HBase in Oracle Big Data Lite VM (Feel free to skip this section if you do not use Oracle Big Data …

Installing the additional R packages in Oracle Big Data Lite VM 4.4.0

Christos - Iraklis Tsatsoulis R 0 Comments

In the just-released version 4.4.0 of Oracle Big Data Lite VM, as in the previous one (4.3.0.1), there is a rather large number of additional R packages to be installed by the provided script install_additional_packages.sh, i.e. 28 packages without counting their dependencies (the respective number in version 4.2.1 was only 10). Unfortunately, what has also changed is the form of …

Using ROracle with Oracle Instant Client 12c

Christos - Iraklis Tsatsoulis Oracle R, R 0 Comments

The other day, while setting up the new Oracle R Enterprise (ORE) 1.5 client packages on a Linux server, we installed the Oracle DB Instant Client v. 12.1, as advised in the relevant documentation. Problem was, ORE failed to load, in fact due to an ROracle failure: Truth is, the file libclntsh.so.11.1 did not exist, but this was expected, simply due …

Querying Big Data SQL tables with Oracle R Enterprise

Christos - Iraklis Tsatsoulis Big Data, Oracle Big Data SQL, Oracle R 0 Comments

I was wondering recently if I could use Oracle R Enterprise (ORE) to query Big Data SQL tables (i.e. Oracle Database external tables based on HDFS or Hive data), since I have never seen such a combination mentioned in the relevant Oracle documentation and white papers. I am happy to announce that the answer is an unconditional yes. In this …

Caution when installing Oracle R Distribution in Oracle Linux using Yum

Christos - Iraklis Tsatsoulis Oracle R 0 Comments

Last week we tried to install Oracle R Distribution (ORD) in Oracle Linux 7.1 using Yum, which is the installation method recommended by Oracle. After closely following the instructions provided in the documentation, instead of the Oracle R Distribution 3.2.0, we found ourselves with the latest (3.2.3) version of GNU R installed. What had happened is that in our /etc/yum.repos.d, …