Nonlinear regression using Spark – Part 2: sum-of-squares objective functions

Constantinos Voglis Data Science, Spark 4 Comments

This post is the second in a series discussing algorithmic and implementation issues in nonlinear regression using Spark. In the previous post we identified a small window for contributing to Spark MLlib by adding methods for nonlinear regression, starting with the definition and implementation of a general nonlinear model. We remind the reader that regression is essentially an …

Classification in Spark 2.0: “Input validation failed” and other wondrous tales

Christos - Iraklis Tsatsoulis Data Science, Spark 4 Comments

Spark 2.0 was released last July but, despite the numerous improvements and new features, several annoyances remain and can cause headaches, especially in the Spark machine learning APIs. Today we’ll have a look at some of them, inspired by a recent answer of mine to a Stack Overflow question (the question was about Spark 1.6 but, as …

How to evaluate R models in Azure Machine Learning Studio

Constantinos Voglis Azure Machine Learning Studio, Data Science, R 5 Comments

Azure Machine Learning Studio is a GUI-based integrated development environment for constructing and operationalizing machine learning workflows. The basic computational unit of an Azure ML Studio workflow (or Experiment) is a module, which implements machine learning algorithms, data conversion and transformation functions, etc. Modules can be connected by data flows, thus implementing a machine learning pipeline. A typical pipeline in …

How NOT to perform feature selection!

Christos - Iraklis Tsatsoulis Data Science 2 Comments

Cross-validation (CV) is now widely used for model assessment in predictive analytics tasks; nevertheless, cases where it is incorrectly applied are not uncommon, especially when building the predictive model includes a feature selection stage. I was reminded of such a situation while reading this recent Revolution Analytics blog post, where CV is used to assess both the feature selection …

Augmenting PCA functionality in Spark 1.5

Christos - Iraklis Tsatsoulis Dimensionality Reduction, Spark 7 Comments

Surprisingly enough, although the relatively new Spark ML library (not to be confused with Spark MLlib) includes a method for principal components analysis (PCA), there is no way to extract some very useful information regarding the PCA transformation, namely the resulting eigenvalues (check the Python API documentation); and, without the eigenvalues, one cannot compute the proportion of variance explained (PVE), …

Log files exploration with Oracle Big Data Discovery 1.1

Christos - Iraklis Tsatsoulis Big Data, Exploratory Data Analysis, Oracle Big Data Discovery 1 Comment

In a previous post, we described how we performed exploratory data analysis (EDA) on real-world log files provided by Skroutz.gr, the leading online price comparison company in Greece, in the context of Athens Datathon 2015. In the present post we will have a look at the same job as performed with Oracle Big Data Discovery (v. 1.1), …

Athens Datathon 2015: exploratory data analysis for anomaly detection & data quality

Christos - Iraklis Tsatsoulis Data Science, Exploratory Data Analysis, R 8 Comments

My friend and former colleague Georgios Kaiafas and I formed a team to participate in the Athens Datathon 2015, organized by ThinkBiz on October 3; the datathon took place at the premises of Skroutz.gr, which was also the major sponsor and the data provider. It was the second such event organized in Athens, and you can see the Datathon …