Nonlinear regression using Spark – Part 1: Nonlinear models

Constantinos Voglis · Spark · 2 Comments

Regression constitutes a very important topic in supervised learning. Its goal is to predict the value of one or more continuous target variables (responses) given the value of a D-dimensional vector x of input variables (predictors). More specifically, given a training data set comprising N observations {x_n}, where n = 1, …, N, together with corresponding target values {t_n}, the goal is to predict the …
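In standard notation (a sketch using the conventional symbols; these are assumed, not quoted from the post), the setup the excerpt describes reads:

    \mathcal{D} = \{ (\mathbf{x}_n, t_n) \}_{n=1}^{N}, \qquad \mathbf{x}_n \in \mathbb{R}^{D}, \quad t_n \in \mathbb{R}

    % goal: learn a function y such that, for a new input x,
    \hat{t} = y(\mathbf{x}, \mathbf{w})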

Limitations of Spark MLlib linear algebra module

Christos - Iraklis Tsatsoulis · Spark · 1 Comment

A couple of days ago I stumbled upon some unexpected behavior of Spark MLlib (v. 1.5.2), while trying some ultra-simple operations on vectors. Consider the following PySpark snippet: Clearly, what happens is that the unary minus operator (-) for vectors fails, giving errors for expressions like -x and -y+x, although x-y behaves as expected. The result of the last operation, …
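For reference, here is a minimal illustration of the behavior described above (an assumed reconstruction, not necessarily the post's exact snippet; it presumes Spark 1.5.x and pyspark.mllib.linalg):

    from pyspark.mllib.linalg import Vectors

    x = Vectors.dense([1.0, 2.0, 3.0])
    y = Vectors.dense([4.0, 5.0, 6.0])

    print(x - y)   # works as expected: DenseVector([-3.0, -3.0, -3.0])
    print(-x)      # fails in 1.5.2: unary minus is not defined for DenseVector
    print(-y + x)  # fails for the same reason, since -y is evaluated first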

Manipulating Hive tables with Oracle R connectors for Hadoop

Christos - Iraklis Tsatsoulis · Hadoop, Hive, Oracle R · 2 Comments

In this post, we’ll have a look at how easy it is to manipulate Hive tables using Oracle R connectors for Hadoop (ORCH, presently known as Oracle R Advanced Analytics for Hadoop – ORAAH). We will use the weblog data from Athens Datathon 2015, which we have already loaded in a Hive table named weblogs, as described in more detail …

Augmenting PCA functionality in Spark 1.5

Christos - Iraklis Tsatsoulis · Dimensionality Reduction, Spark · 7 Comments

Surprisingly enough, although the relatively new Spark ML library (not to be confused with Spark MLlib) includes a method for principal components analysis (PCA), there is no way to extract some very useful information regarding the PCA transformation, namely the resulting eigenvalues (check the Python API documentation); and, without the eigenvalues, one cannot compute the proportion of variance explained (PVE), …
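As a rough sketch of the kind of workaround developed in the post (the function name and approach here are assumptions for illustration): estimate the covariance matrix from the data, obtain its eigenvalues with NumPy, and derive the PVE from them.

    import numpy as np

    def proportion_of_variance_explained(features, k):
        """features: an RDD of NumPy arrays (the feature vectors); k: number of components kept."""
        n = features.count()
        m = features.sum() / n                                           # column means
        cov = features.map(lambda v: np.outer(v - m, v - m)).sum() / n   # covariance estimate
        eig_vals = np.linalg.eigvalsh(cov)[::-1]                         # eigenvalues, largest first
        return eig_vals[:k].sum() / eig_vals.sum()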

Log files exploration with Oracle Big Data Discovery 1.1

Christos - Iraklis Tsatsoulis · Big Data, Exploratory Data Analysis, Oracle Big Data Discovery · 1 Comment

In a previous post, we described how we performed exploratory data analysis (EDA) on real-world log files, as provided by Skroutz.gr, Greece's leading online price comparison service, in the context of Athens Datathon 2015. In the present post we will have a look at the same job as performed with Oracle Big Data Discovery (v. 1.1), …

Big Data Discovery configuration in Oracle Big Data Lite VM 4.2.1

Christos - Iraklis Tsatsoulis · Big Data, Oracle Big Data Discovery · 1 Comment

The latest version (4.2.1) of Oracle Big Data Lite VM, among many additions, now also includes the much-anticipated Oracle Big Data Discovery (v. 1.1), which I had not played with so far (it is a new product); so I thought I'd take it for a spin. Since my test data included geolocation attributes (latitude/longitude), one of the first things I …

Dataframes from CSV files in Spark 1.5: automatic schema extraction, neat summary statistics, & elementary data exploration

Christos - Iraklis Tsatsoulis · Big Data, Spark · 25 Comments

In a previous post, we took a brief look at creating and manipulating Spark dataframes from CSV files. In the couple of months since, Spark has already gone from version 1.3.0 to 1.5, with more than 100 built-in functions introduced in Spark 1.5 alone; so, we thought it was a good time to revisit the subject, this time also utilizing the external …
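For context, the pattern the post builds on looks roughly like the following (a sketch assuming Spark 1.5 with the com.databricks:spark-csv external package; the file name is hypothetical):

    # Read a CSV file into a dataframe, letting spark-csv infer the schema
    df = sqlContext.read.format('com.databricks.spark.csv') \
                        .options(header='true', inferSchema='true') \
                        .load('data.csv')       # hypothetical file name
    df.printSchema()        # the automatically extracted schema
    df.describe().show()    # summary statistics for the numeric columns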

Development and deployment of Spark applications with Scala, Eclipse, and sbt – Part 2: A Recommender System

Constantinos Voglis · Big Data, Spark · 11 Comments

In our previous post, we demonstrated how to set up the necessary software components, so that we can develop and deploy Spark applications with Scala, Eclipse, and sbt. We also included the example of a simple application. In this post, we are taking this demonstration one step further. We discuss a more serious application, a recommender system, and present the …

Development and deployment of Spark applications with Scala, Eclipse, and sbt – Part 1: Installation & configuration

Constantinos Voglis · Big Data, Spark · 23 Comments

The purpose of this tutorial is to set up the necessary environment for the development and deployment of Spark applications with Scala. Specifically, we are going to use the Eclipse IDE for the development of applications and deploy them with spark-submit. The glue that ties everything together is the sbt interactive build tool. The sbt tool provides plugins used to: Create an Eclipse …

Unexpected behavior of Spark dataframe filter method

Christos - Iraklis Tsatsoulis · Big Data, Spark · 4 Comments

[EDIT: Thanks to this post, the issue reported here has been resolved since Spark 1.4.1 – see the comments below] While writing the previous post on Spark dataframes, I encountered an unexpected behavior of the respective .filter method; but, on the one hand, I needed some more time to experiment and confirm it and, on the other hand, I knew …