How NOT to perform feature selection!

Christos - Iraklis Tsatsoulis Data Science 3 Comments

Cross-validation (CV) is nowadays widely used for model assessment in predictive analytics tasks; nevertheless, cases where it is applied incorrectly are not uncommon, especially when building the predictive model includes a feature selection stage. I was reminded of such a situation while reading this recent Revolution Analytics blog post, where CV is used to assess both the feature selection …

Oracle R Enterprise 1.4: ore.make.names does not work for Oracle DB connections

Christos - Iraklis Tsatsoulis Oracle R 0 Comments

I have reported in the past on some unexpected behavior of the ore.make.names function in Oracle R Enterprise 1.4; back then, however, I had only tried it with Hive connections. I tried to use it today with an Oracle database connection, and it doesn’t seem to work. Here is a reproducible example in Oracle Big Data Lite VM 4.2.1, using the …

Manipulating Hive tables with Oracle R connectors for Hadoop

Christos - Iraklis Tsatsoulis Hadoop, Hive, Oracle R 0 Comments

In this post, we’ll have a look at how easy it is to manipulate Hive tables using Oracle R connectors for Hadoop (ORCH, presently known as Oracle R Advanced Analytics for Hadoop – ORAAH). We will use the weblog data from Athens Datathon 2015, which we have already loaded in a Hive table named weblogs, as described in more detail …

Using Ansible to install WebLogic 12c R2 and Fusion Middleware

Chris Vezalis Ansible, DEVOPS, Fusion Middleware, Linux, Oracle ADF, Oracle Linux, Vagrant, WebLogic 0 Comments

A couple of days ago Oracle released WebLogic 12c R2 (12.2.1). There are a lot of cool features, like Java EE 7 support and multitenancy support for WebLogic domains. Installing WebLogic Server along with the ADF runtime (Fusion Middleware Infrastructure) is not hard, but it requires a lot of parameters to be configured and a significant amount of time when you need to …

Augmenting PCA functionality in Spark 1.5

Christos - Iraklis Tsatsoulis Dimensionality Reduction, Spark 7 Comments

Surprisingly enough, although the relatively new Spark ML library (not to be confused with Spark MLlib) includes a method for principal components analysis (PCA), there is no way to extract some very useful information regarding the PCA transformation, namely the resulting eigenvalues (check the Python API documentation); and, without the eigenvalues, one cannot compute the proportion of variance explained (PVE), …
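For reference, the PVE mentioned in the excerpt above follows directly from the eigenvalues: each component's share is its eigenvalue divided by the sum of all eigenvalues. Below is a minimal sketch in plain Python, assuming the eigenvalues are already available as a list. The function name `proportion_variance_explained` and the sample values are hypothetical and are not part of the Spark API.

```python
def proportion_variance_explained(eigenvalues):
    """Return each eigenvalue's share of the total variance."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [ev / total for ev in eigenvalues]


# Hypothetical eigenvalues of a covariance matrix, largest first.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
pve = proportion_variance_explained(eigenvalues)

# Cumulative PVE is a common aid for deciding how many components to keep.
cumulative = []
running = 0.0
for share in pve:
    running += share
    cumulative.append(running)
```

Here `pve` holds the per-component shares and `cumulative` their running total, which reaches 1 (up to floating-point rounding) once all components are included.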

Log files exploration with Oracle Big Data Discovery 1.1

Christos - Iraklis Tsatsoulis Big Data, Exploratory Data Analysis, Oracle Big Data Discovery 1 Comment

In a previous post, we described how we performed exploratory data analysis (EDA) on real-world log files provided by Skroutz.gr, the leading online price-comparison company in Greece, in the context of Athens Datathon 2015. In the present post we will have a look at the same job as performed with Oracle Big Data Discovery (v. 1.1), …

Using Ansible to configure an Oracle Linux 7.1 server with Oracle 12c R1 Enterprise Edition Database

Chris Vezalis Ansible, DEVOPS, Linux, Oracle Database, Oracle Linux, Vagrant 16 Comments

Ansible is a leading tool for configuring software and various parameters on servers. Unlike other popular tools (Puppet or Chef), it does not require agents or other software to be installed on the managed nodes. It is also modular and already has hundreds of modules that help us configure our servers in several ways. In this article I will demonstrate how we can install …

Athens Datathon 2015: exploratory data analysis for anomaly detection & data quality

Christos - Iraklis Tsatsoulis Data Science, Exploratory Data Analysis, R 8 Comments

Together with my friend and former colleague Georgios Kaiafas, I formed a team to participate in the Athens Datathon 2015, organized by ThinkBiz on October 3; the datathon took place at the premises of Skroutz.gr, which was also the major sponsor and the data provider. It was the second such event organized in Athens, and you can see the Datathon …

Big Data Discovery configuration in Oracle Big Data Lite VM 4.2.1

Christos - Iraklis Tsatsoulis Big Data, Oracle Big Data Discovery 1 Comment

The latest version (4.2.1) of Oracle Big Data Lite VM, among many additions, now also includes the much-anticipated Oracle Big Data Discovery (v. 1.1), which I had not played with so far (it is a new product); so I decided to take it for a spin. Since my test data included geolocation attributes (latitude/longitude), one of the first things I …