Enabling the Green-Marl compiler for Parallel Graph Analytics in Oracle Big Data Lite VM

Panagiotis Konstantinidis Oracle Big Data Spatial & Graph 0 Comments

Recently, I began working with Parallel Graph Analytics (PGX) on my Oracle Big Data Lite (BDL) VM version 4.7.0.1. I was especially intrigued by the capabilities of a PGX component called Green-Marl (GM), a domain-specific language designed for graph data analysis. It was stated to extend PGX’s capabilities and “implement algorithms with no limit”. Especially the last argument …

Streaming data from Raspberry Pi to Oracle NoSQL via Node-RED

Christos - Iraklis Tsatsoulis Internet of Things, Node-RED, Oracle NoSQL, Raspberry Pi 0 Comments

Starting from version 4.2, Oracle NoSQL offers drivers for Node.js and Python, in addition to the existing ones for Java, C, and C++; this is good news for data science people, like myself, since we are normally not accustomed to coding in Java or C/C++. So, I thought I would build a short demo project, putting to the test both the …

sparklyr: a test drive on YARN

Christos - Iraklis Tsatsoulis R, Spark 2 Comments

sparklyr is a new R front-end for Apache Spark, developed by the good people at RStudio. It offers much more functionality than the existing SparkR interface by Databricks, allowing both dplyr-based data transformations and access to the machine learning libraries of both Spark and H2O Sparkling Water. Moreover, the latest RStudio IDE v1.0 now offers native support …

Nonlinear regression using Spark – Part 2: sum-of-squares objective functions

Constantinos Voglis Data Science, Spark 4 Comments

This post is the second in a series that discusses algorithmic and implementation issues in nonlinear regression using Spark. In the previous post we identified a small window for contribution to Spark MLlib by adding methods for nonlinear regression, starting with the definition and implementation of a general nonlinear model. We remind the reader that regression is essentially an …

Classification in Spark 2.0: “Input validation failed” and other wondrous tales

Christos - Iraklis Tsatsoulis Data Science, Spark 6 Comments

Spark 2.0 was released last July but, despite the numerous improvements and new features, several annoyances remain and can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as …

How to use SparkR in Cloudera Hadoop

Christos - Iraklis Tsatsoulis Big Data, R, Spark 17 Comments

Suppose you are an avid R user, and you would like to use SparkR in Cloudera Hadoop; unfortunately, as of the latest CDH version (5.7), SparkR is still not supported (and, according to a recent discussion in the Cloudera forums, we shouldn’t expect this to happen anytime soon). Is there anything you can do? Well, indeed there is. In this …

Bulk load data to HBase in Oracle Big Data Appliance

Christos - Iraklis Tsatsoulis Big Data, HBase 0 Comments

I recently ran into an issue while trying to bulk load some data to HBase in Oracle Big Data Appliance. What follows is a reproducible description and solution using the current version of Oracle Big Data Lite VM (4.4.0). Enabling HBase in Oracle Big Data Lite VM (Feel free to skip this section if you do not use Oracle Big Data …

Querying Big Data SQL tables with Oracle R Enterprise

Christos - Iraklis Tsatsoulis Big Data, Oracle Big Data SQL, Oracle R 0 Comments

I was wondering recently if I could use Oracle R Enterprise (ORE) to query Big Data SQL tables (i.e. Oracle Database external tables based on HDFS or Hive data), since I have never seen such a combination mentioned in the relevant Oracle documentation and white papers. I am happy to announce that the answer is an unconditional yes. In this …

Nonlinear regression using Spark – Part 1: Nonlinear models

Constantinos Voglis Spark 1 Comment

Regression constitutes a very important topic in supervised learning. Its goal is to predict the value of one or more continuous target variables (responses) given the value of a vector of input variables (predictors). More specifically, given a training data set comprising a number of observations together with their corresponding target values, the goal is to predict the …

Limitations of Spark MLlib linear algebra module

Christos - Iraklis Tsatsoulis Spark 0 Comments

A couple of days ago I stumbled upon some unexpected behavior of Spark MLlib (v. 1.5.2), while trying some ultra-simple operations on vectors. Consider the following PySpark snippet: Clearly, what happens is that the unary minus operator (-) for vectors fails, giving errors for expressions like -x and -y+x, although x-y behaves as expected. The result of the last operation, …
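
The behavior described in the last excerpt can be illustrated in plain Python, without Spark. This is only a minimal sketch: the `TinyVector` class and the `try_negate` helper are hypothetical stand-ins invented for illustration, not the MLlib implementation, and the sketch assumes the cause is simply a type that defines binary subtraction but no unary negation.

```python
class TinyVector:
    """Hypothetical stand-in for a vector type (NOT the MLlib class).

    It defines binary + and - but deliberately omits __neg__,
    so the unary minus operator is unsupported.
    """

    def __init__(self, values):
        self.values = list(values)

    def __sub__(self, other):
        # Element-wise binary subtraction: x - y
        return TinyVector(a - b for a, b in zip(self.values, other.values))

    def __add__(self, other):
        # Element-wise binary addition: x + y
        return TinyVector(a + b for a, b in zip(self.values, other.values))

    def __repr__(self):
        return f"TinyVector({self.values})"


def try_negate(v):
    """Apply unary minus and report a TypeError instead of raising it."""
    try:
        return -v
    except TypeError as exc:
        return f"TypeError: {exc}"


x = TinyVector([1.0, 2.0])
y = TinyVector([3.0, 5.0])

diff = x - y                 # works: __sub__ is defined
neg_result = try_negate(x)   # fails: no __neg__ on the class

print(diff)
print(neg_result)
# An expression like -y + x fails for the same reason:
# Python evaluates -y first, before __add__ is ever called.
```

Under this assumption, adding a `__neg__` method to the class would be enough to make `-x` and `-y + x` work; whether that matches the actual MLlib code in v. 1.5.2 should be checked against the Spark source.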