Graph analysis of Stack Overflow tags with Oracle PGX

This is the second part of our three-part blog post series (see the first part here), and it deals with incremental data updates. In our scenario, we assume that small batches of data updates arrive through some kind of web scraping mechanism; the details of that mechanism are beyond the scope of this post. Each batch either creates new vertices or updates existing ones, which in turn yields new or updated graph edges. Make sure you have Java 8 installed, as we will use Java streams to count first the occurrences of each tag in our data and then the occurrences of the tag pairs we create.

The data will be stored in a text file, with each line containing the tags of one Stack Overflow question, separated by spaces (recall that each Stack Overflow question contains one to five tags). An example of the data is shown below:

c# winforms forms type-conversion opacity
html css css3 internet-explorer
c# code-generation j# visualj# winforms
c# .net datetime winforms
datediff winforms
html browser timezone timezoneoffset
.net visual-studio math msdn
big-data hdfs
java hadoop nosql
pgx apache-zepellin
cli terminal
oracle-linux
etl map-reduce java hdfs
nosql property-graph
oracle-linux xquery
...

The starting point of our process is the above data stored in an HDFS file named newTags.txt. In the general case, there will be tags that already exist in our graph, as well as new ones which will have to be created. As we have already mentioned in the first part of our blog series, all the work is done on our Big Data Lite Virtual Machine, version 4.8.
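
To make the tag-counting step more concrete, here is a minimal sketch of how the occurrences of each tag could be counted with Java 8 streams. It works on lines already loaded into memory; reading the file from HDFS is left out. The class and method names (TagCounter, countTags) are hypothetical and are not taken from Update.java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class TagCounter {

    // Counts how many times each tag appears across all lines.
    // Each line holds the space-separated tags of one question.
    static Map<String, Long> countTags(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "c# winforms forms type-conversion opacity",
                "c# .net datetime winforms",
                "datediff winforms");
        Map<String, Long> counts = countTags(lines);
        System.out.println(counts.get("winforms")); // prints 3
        System.out.println(counts.get("c#"));       // prints 2
    }
}
```

The resulting map (tag to count) is the kind of input the vertex update needs: one entry per tag seen in the batch.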

After retrieving the newTags.txt file stored in HDFS, we first update the vertex data residing in our Oracle Database and Oracle NoSQL (property graph format). The process is quite simple:

Connect to the respective database
Iterate over the tags present in our file
If a tag exists, update its count
If a tag doesn’t exist, add it and set its count accordingly
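
The update-or-insert logic of the steps above can be sketched as follows. This is only an illustration: a plain in-memory map stands in for the vertex data stored in the database, and we assume that "updating the count" means adding the batch count to the stored one. The names VertexUpdateSketch and applyCounts are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; 'stored' is an in-memory stand-in for the vertex data.
final class VertexUpdateSketch {

    // For every tag in the batch: if the tag already exists, add the batch
    // count to its stored count; otherwise insert it with the batch count.
    static void applyCounts(Map<String, Long> stored, Map<String, Long> batch) {
        batch.forEach((tag, count) -> stored.merge(tag, count, Long::sum));
    }

    public static void main(String[] args) {
        Map<String, Long> stored = new HashMap<>();
        stored.put("java", 10L);

        Map<String, Long> batch = new HashMap<>();
        batch.put("java", 2L);  // existing tag
        batch.put("pgx", 1L);   // new tag

        applyCounts(stored, batch);
        System.out.println(stored.get("java")); // prints 12
        System.out.println(stored.get("pgx"));  // prints 1
    }
}
```

Map.merge covers both branches in one call, which is why the sketch has no explicit "exists" check; against a real database the same decision would be made per tag with a lookup followed by an update or an insert.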

After updating the vertices, the next task is to update the edge data of our property graph. Here the procedure is similar to the vertex update above, with one additional step – creation of tag pairs:

Connect to Oracle NoSQL
Create tag pairs from the file
Iterate over the tag pairs (potential edges)
If the corresponding edge exists, update its weight
If the corresponding edge doesn’t exist, create it and set its weight accordingly
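
As a rough sketch of the extra step, the code below builds tag pairs from each line with Java 8 streams, counts them, and merges the counts into existing edge weights. Several things here are assumptions rather than facts about Update.java: edges are treated as undirected (each pair is stored in sorted order), the weight is taken to be the number of questions in which both tags appear, and an in-memory map stands in for the edge data in Oracle NoSQL. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Hypothetical sketch; pairs are keyed as "tagA tagB" with tagA < tagB.
final class TagPairSketch {

    // Returns every unordered pair of distinct tags found on one line.
    static Stream<String> pairsOf(String line) {
        String[] tags = Arrays.stream(line.trim().split("\\s+"))
                .filter(tag -> !tag.isEmpty())
                .distinct()
                .sorted()
                .toArray(String[]::new);
        return IntStream.range(0, tags.length).boxed()
                .flatMap(i -> IntStream.range(i + 1, tags.length)
                        .mapToObj(j -> tags[i] + " " + tags[j]));
    }

    // Counts in how many lines each pair occurs.
    static Map<String, Long> countPairs(List<String> lines) {
        return lines.stream()
                .flatMap(TagPairSketch::pairsOf)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Existing edge: increase its weight. Missing edge: create it.
    static void applyWeights(Map<String, Long> storedEdges, Map<String, Long> batchPairs) {
        batchPairs.forEach((pair, weight) -> storedEdges.merge(pair, weight, Long::sum));
    }

    public static void main(String[] args) {
        Map<String, Long> pairs = countPairs(Arrays.asList(
                "c# winforms forms type-conversion opacity",
                "c# .net datetime winforms",
                "datediff winforms"));
        Map<String, Long> edges = new HashMap<>();
        edges.put("c# winforms", 5L); // pretend this edge already exists
        applyWeights(edges, pairs);
        System.out.println(edges.get("c# winforms"));       // prints 7
        System.out.println(edges.get("datediff winforms")); // prints 1
    }
}
```

A line with a single tag, such as "oracle-linux", produces no pairs and therefore no edges, which matches the intuition that a lone tag only contributes to its vertex count.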

The Java code can be found in the Update.java file in my GitHub repo.

To run the Update class, add the following two lines of code to the main method of our program (in CreateGraph.java):

Update update = new Update();
update.updateOracleNoSQLGraph("/user/oracle/pgx/newTags.txt");

If you have chosen to store the newTags.txt file in some other directory, change the path provided to the updateOracleNoSQLGraph method accordingly.

In conclusion, the kind of data we are working with is never static: it is constantly updated and enriched, which raises the need to keep track of the newly added information. With the above procedure taking care of these updates, our use case scenario is now complete: we always work with the most recent data, and our property graph stays up to date.

Having finished with the data processing and formatting, it’s time to deal with data analysis, which will be the third and last part of this blog series.

As always, stay tuned!

Author
Panagiotis Konstantinidis

Panagiotis is our resident Big Data Engineer. He holds a Bachelor's degree in Computer Science, and he is an Oracle Certified Implementation Specialist for Big Data, ADF (11g & 12c), and Linux.

Graph analysis of Stack Overflow tags with Oracle PGX – Part 2: Incremental Updates - August 30, 2017