Graph analysis of Stack Overflow tags with Oracle PGX – Part 1: Data Engineering

Panagiotis Konstantinidis | Big Data, Data Engineering, Oracle NoSQL, Oracle PGX

Introduction

Oracle Parallel Graph Analytics (PGX) is a toolkit for graph analysis, both for running algorithms such as PageRank and for performing SQL-like pattern matching against graphs. It offers extreme performance through algorithm parallelization, and graphs can be loaded from a variety of sources, such as flat files and SQL or NoSQL databases. To get a better feel for its capabilities, we have put together a demo use case: analyzing Stack Overflow posts with their tags viewed as a graph.

Stack Overflow is a site devoted to questions about computer programming and related topics, with tags being an important part of the submitted questions. Tags allow answering users to monitor questions related to their field of expertise and respond to them promptly. Only users with “advanced” status are allowed to add new tags, while the rest of the community is limited to using the existing tags for their questions. Tagging a submitted question is mandatory, with a minimum of one and a maximum of five tags per question.

We will replicate the scenario presented in Empirical Analysis of the Stack Overflow Tags Network, written by our Lead Data Scientist Christos Iraklis Tsatsoulis and published as a chapter in Text Mining and Visualization: Case Studies Using Open-Source Tools. To elaborate, our aim will be to gain some insight into the Stack Overflow community by representing tags and their co-occurrences as a graph. In our graph, nodes are the tags themselves, while an edge between two nodes exists if the corresponding tags are found together in the same question. The resulting graph edges are weighted, i.e. the more questions exist with two tags co-occurring, the higher the weight of the relevant edge between them will be; for example, if the tag oracle is found to co-exist with the tag java in 1000 questions, the weight of the edge between the nodes oracle and java in our graph will be 1000. In graph terminology, the resulting graph is a weighted undirected one.

In this first part of a three-part post series, we will extract the needed data out of the raw posts and tags xml files, in order to create our graph in Oracle PGX. We will employ a variety of technologies along the way, namely Oracle Database, XQuery, HDFS/Hadoop, Hive, Java, and Oracle NoSQL. In summary, our steps will be:

  • Store the tag data into Oracle Database
  • Extract the necessary information from the raw post data with XQuery
  • Store the cleaned-up post data into HDFS
  • Process and transform the post data using Java and Hive
  • Import the transformed data into Oracle NoSQL, in property graph format
  • Load our graph into Oracle PGX for further analysis

The steps mentioned are presented in more detail below. All the work will be done on a Big Data Lite VM (v 4.8).

The files needed are the raw Posts and Tags xml files located in stackoverflow.com-Posts.7z and stackoverflow.com-Tags.7z respectively. You can download both of them here – just make sure you have sufficient storage space, as the Posts file takes up ~ 56 GB when unzipped.

Loading the Tags data into Oracle Database

In the latest dump of the Stack Overflow data linked above, there are about 48,000 unique tags in the Tags.xml file, along with their total occurrence counts. The size and nature of this data make Oracle Database a suitable option for their storage.

First, we create a new Oracle Database user, pgx, with the appropriate privileges. An external database directory is also needed, so that we can reference our Tags.xml file through the database. The path to this external directory in our scenario is /home/oracle/external_directories/xml_dir.

Furthermore, in order to gain access to the data stored inside the Tags.xml file, we need to use either an XMLTYPE table or a table with an XMLTYPE column. We go with the XMLTYPE column option here, although both options work perfectly fine for our scenario.

Finally, we construct the database TAGS table, which contains two columns: the tag name and the total number of times the tag appears in the posts.

The steps mentioned above, as well as the respective code, are presented below.

  1. Create the “pgx” user with the appropriate privileges:
     SQL> CREATE USER pgx IDENTIFIED BY pgx DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP QUOTA UNLIMITED ON USERS; 
    SQL> GRANT CREATE SESSION, CREATE TABLE, CREATE ANY PROCEDURE, CREATE ANY DIRECTORY TO pgx; 
  2. Create the external directory and copy the Tags.xml file in it:
     $ mkdir /home/oracle/external_directories/xml_dir 
    $ cp /media/sf_BigDataLite-4.8/Tags.xml /home/oracle/external_directories/xml_dir/
     SQL> CREATE OR REPLACE DIRECTORY xml_dir AS '/home/oracle/external_directories/xml_dir';
     SQL> GRANT READ, WRITE ON DIRECTORY xml_dir TO pgx; 
  3. Create a table with XMLTYPE column and populate it:
     SQL> CREATE TABLE xml_table (xml_col XMLTYPE);
    SQL> INSERT INTO xml_table VALUES (XMLTYPE(BFILENAME('XML_DIR','Tags.xml'), NLS_CHARSET_ID('AL32UTF8')));
    SQL> SELECT x.xml_col.getClobVal() FROM xml_table x; 
  4. Preview the desired format of the data:
     SQL> SELECT xt.*
              FROM xml_table x,
              XMLTABLE('/tags/row' PASSING x.xml_col
              COLUMNS "TAGNAME" VARCHAR2(100 CHAR) PATH '@TagName',
                      "COUNT" NUMBER(10) PATH '@Count') xt; 

    Data:

    TAGNAME      COUNT
    .net         248858
    html         632135
    javascript   1339816
    css          455125
    php          1047779
    c            244538
    c#           1068888
    c++          502132
    ruby         178146
    lisp         4978
    python       715171
    java         1223220
    ....
    
  5. Create the database Tags table:
     SQL> CREATE TABLE tags AS 
              SELECT xt.*
              FROM xml_table x,
              XMLTABLE('/tags/row' PASSING x.xml_col
              COLUMNS "TAGNAME" VARCHAR2(100 char) PATH '@TagName',
                      "COUNT" NUMBER(10) PATH '@Count') xt; 

Loading the Posts data into HDFS

As mentioned above, the uncompressed Posts.xml file is ~ 56 GB in size, with most of the information contained being useless for our purposes, since we are only interested in tag co-occurrences. So, it makes sense to load it into HDFS for performing the necessary ETL.

Posts.xml contains all the information about posts, including the answers. From the provided README file, we can see that PostTypeId=1 refers to questions, while PostTypeId=2 refers to answers. Since tags are only included in questions, we are not interested in the latter.

After copying the file into HDFS, we create an XQuery script file, filterposts.xq. The purpose of this script is to filter each row on its PostTypeId and isolate the Id and Tags attributes of each entry. We also add a filter to the script, which checks if the filtered lines contain only one tag; these lines (close to 1.7 million) will not contribute to the creation of any edge in our graph, so it would be wise to exclude them as early as possible in the process. The format of the Tags attribute is the following: Tags="<tag1><tag2><tag3>". If the Tags attribute does not contain “><“, it means that it only has one tag and should be excluded.

By running the XQuery script, a file (named part-m-00000) is created in HDFS at the location specified in the command line. In our case this location is /user/oracle/pgx/myout. As we mentioned above, this file will contain only the Id and Tags attributes separated by “|”.

Be patient when running the scripts and queries as they may take a while… Using the single-node Big Data Lite VM, copying the file into HDFS took about 30 minutes, while running the XQuery script took 1 hour and 17 minutes.

Here is the detailed code:

  1. Create the following XQuery script and save it as filterposts.xq:
     import module "oxh:xmlf";
    import module "oxh:text";
           
    for $x in xmlf:collection("pgx/Posts.xml")/posts/row
    where $x/@PostTypeId eq "1" and fn:contains($x/@Tags, "><")
    return text:put($x/@Id || "|" || $x/@Tags) 
  2. Copy the Posts.xml file into HDFS and run the created XQuery script:
     $ hadoop fs -copyFromLocal /media/sf_BigDataLite-4.8/Posts.xml /user/oracle/pgx
    $ hadoop jar $OXH_HOME/lib/oxh.jar filterposts.xq -output ./pgx/myout 

Processing data with Java

Now that we have both our Tags and Posts data in place, it’s time to process them using Java. We used JDeveloper to create an application named tagprocessing and a Java project inside it named dataparser. If you are not familiar with JDeveloper, you can use any other IDE, such as Eclipse. What is performed by the Java classes is explained here, while the classes themselves can be found through this link.

We used the Blueprints for Oracle NoSQL Database API to create, access, and manipulate the graph stored in Oracle NoSQL. The jar files needed in the project’s “Libraries and Classpath”, as well as the links to their download pages, are listed below:

antlr4-runtime-4.5.jar avro-1.8.2.jar blueprints-core-2.6.0.jar
commons-cli-1.2.jar commons-collections-3.2.2.jar commons-configuration-1.6.jar
commons-io-2.4.jar commons-lang3-3.1.jar commons-lang-2.4.jar
commons-logging-1.1.3.jar commons-vfs2-2.0.jar guava-11.0.2.jar
hadoop-auth-2.6.0.jar hadoop-common-2.6.0.jar hadoop-hdfs-2.6.0.jar
hive-exec-0.13.0.jar hive-jdbc-1.1.0.jar hive-service-0.14.0.jar
hive-service-rpc-2.1.1.jar htrace-core-3.0.4.jar httpclient-4.2.3.jar
httpclient-osgi-4.3.jar httpcore-4.3.jar jackson-core-asl.jar
jackson-mapper-asl.jar libthrift-0.9.3.jar log4j-1.2.17.jar
ojdbc8.jar pgx-api-2.4.1.jar protobuf-java-2.5.0.jar
slf4j-api-1.7.6.jar threetenbp-0.8.jar thrift-1.0.jar

There are four more needed jar files – pgx-common-2.4.0.jar, sdopgdal.jar, kvclient.jar and kvstore.jar. The first two can be found under /opt/oracle/oracle-spatial-graph/property_graph/lib, while the last two under /u01/nosql/kv-ee/lib on your Oracle Big Data Lite VM.

Next, we will create a Connect.java file which takes care of the necessary connections to our various data sources (Oracle Database, Oracle NoSQL, Hadoop, and Hive). Regarding Hive, we must connect as an authorized user in order to run MapReduce jobs; otherwise we will be presented with the following error:

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

This error basically informs us that an unauthorized user cannot run a MapReduce job. As this is a demonstration, we provide the credentials directly in the code; if this were a production environment, the credentials should be provided through Oracle Wallet or some other secure mechanism. If your username and password are different from the default ones, change them in the following line of the Connect.java file:

connect = DriverManager.getConnection(Config.HIVE_SERVER_URL, "<username>", "<password>");
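
For reference, here is a minimal sketch of how the Hive connection inside Connect.java could be established; the HiveConnectSketch class and connectHive method name are illustrative (not the project’s actual code), and Config.HIVE_SERVER_URL referenced above would be passed in as the url argument.

// Illustrative sketch only; not the actual Connect.java from the project.
import java.sql.Connection;
import java.sql.DriverManager;

public class HiveConnectSketch {
    public static Connection connectHive(String url, String user, String password) throws Exception {
        // Register the HiveServer2 JDBC driver (shipped in hive-jdbc) before opening the connection
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // url is expected in the usual HiveServer2 form, e.g. jdbc:hive2://localhost:10000/default
        return DriverManager.getConnection(url, user, password);
    }
}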

Next, we create the ImportToOracleNoSQL.java file. Here we create the Oracle NoSQL property graph named my_graph. The graph vertices will be created from the TAGS table stored in Oracle Database. If you have chosen a different username and password for your Oracle Database user, change them accordingly in the following line of code in the ImportToOracleNoSQL.java file:

Connection dbConn = connect.connectRDB(Config.JDBC_RDBMS_URL, "<username>", "<password>");

The id of each vertex will be its corresponding tag name (translated behind the scenes to a long value), for easy retrieval when needed. The vertices will have two properties, tagName and count, with the tag name also stored as a property for easier visual recognition. Everything mentioned above is implemented in the importVertices method.
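
To illustrate the idea, here is a sketch of vertex creation with the Blueprints API; this is an assumption-laden sketch, not the project’s actual importVertices code. The OraclePropertyGraph class comes from the BDSG property graph library for Oracle NoSQL listed above, and the hash-based tag-name-to-long mapping is purely hypothetical, since the post does not show how the translation is done.

// Illustrative sketch only: one vertex per row of the TAGS table.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import com.tinkerpop.blueprints.Vertex;
import oracle.pg.nosql.OraclePropertyGraph;

public class ImportVerticesSketch {
    public void importVertices(Connection dbConn, OraclePropertyGraph graph) throws Exception {
        try (Statement stmt = dbConn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT tagname, \"COUNT\" FROM tags")) {
            while (rs.next()) {
                String tagName = rs.getString(1);
                int count = rs.getInt(2);
                // Hypothetical tag-name -> long id mapping; the real translation is not shown in the post
                long vertexId = Integer.toUnsignedLong(tagName.hashCode());
                Vertex v = graph.addVertex(vertexId);  // Blueprints: add a vertex with an explicit id
                v.setProperty("tagName", tagName);     // tag name kept as a property for readability
                v.setProperty("count", count);         // total occurrences of the tag
            }
        }
        graph.commit();                                // persist the new vertices in Oracle NoSQL
    }
}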

Finally, we implement the importEdges method to create the edges of the graph. This method retrieves the HDFS data produced by the filterposts.xq XQuery script, pairs each tag in a line with the ones following it, and ends up with a list of tag pairs. Notice that, across different questions, we will have instances of both <tag1><tag2> and <tag2><tag1> for the same two tags, something we will take care of later on, since our graph will be undirected and such pairs represent essentially the same edge.
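
To make the pairing step concrete, here is a small, self-contained sketch of the idea (again not the actual importEdges code; the sample input line is made up):

// Illustrative sketch: turn one filtered line of the form "<id>|<tag1><tag2>...<tagN>" into tag pairs.
import java.util.ArrayList;
import java.util.List;

public class TagPairSketch {
    public static List<String[]> toPairs(String line) {
        String tagsField = line.substring(line.indexOf('|') + 1);  // drop the question id
        // Strip the outer angle brackets and split on "><" to get the individual tags
        String[] tags = tagsField.substring(1, tagsField.length() - 1).split("><");
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < tags.length; i++) {
            for (int j = i + 1; j < tags.length; j++) {
                pairs.add(new String[]{tags[i], tags[j]});          // pair each tag with the ones following it
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample line; prints oracle,java / oracle,jdbc / java,jdbc
        for (String[] p : toPairs("42|<oracle><java><jdbc>")) {
            System.out.println(p[0] + "," + p[1]);
        }
    }
}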

Two staging tables are created next in Hive: one to hold all of the tag pairs, and another to aggregate them and get the count of each pair. The second staging table is the one we will finally use to create the graph edges in Oracle NoSQL. The edges are created with an incremented long value as their id and contain only one property, their weight, which is the count of the corresponding pair in the second staging table.
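
The aggregation into the second staging table could be expressed roughly as follows over the Hive JDBC connection; the table and column names (tag_pairs, tag_pairs_agg, tag1, tag2) are made up for illustration, since the post does not list them.

// Illustrative sketch only: build the second staging table with a count per distinct tag pair.
import java.sql.Connection;
import java.sql.Statement;

public class HiveStagingSketch {
    public static void aggregatePairs(Connection hiveConn) throws Exception {
        try (Statement stmt = hiveConn.createStatement()) {
            stmt.execute("CREATE TABLE tag_pairs_agg AS "
                       + "SELECT tag1, tag2, COUNT(*) AS pair_count "
                       + "FROM tag_pairs "
                       + "GROUP BY tag1, tag2");       // pair_count later becomes the edge weight
        }
    }
}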

The rest of the Java files are Config.java, HiveQueries.java and CreateGraph.java. The Config.java file is an interface which holds various configuration parameters, and HiveQueries.java holds and executes the queries to create and then drop the staging tables. The main method of our program is inside CreateGraph.java:

import java.util.concurrent.TimeUnit;
import com.google.common.base.Stopwatch;

public class CreateGraph {
    public static void main (String[] args) {
        // Time the whole import (Guava 11 Stopwatch API)
        Stopwatch timer = new Stopwatch().start();
        ImportToOracleNoSQL imp = new ImportToOracleNoSQL();
        
        imp.importVertices();
        imp.importEdges("/user/oracle/pgx/myout/part-m-00000");
        
        timer.stop();
        
        System.out.println("Time: " + timer.elapsedTime(TimeUnit.MINUTES) + " min");
    }
}

The HDFS path passed as a parameter to the importEdges method should be the same as the one used when running the XQuery script filterposts.xq above; if you provided a different path when running that script, change the HDFS path here accordingly.

In our VM, the whole process required about 1.5 hours to process and load about 4.2 million edges.

Creating the Property Graph in PGX

The final step in this first part of the Tags Network Analysis is to create the Property Graph in PGX. If you decide to work via the CLI, you can use either version of PGX, meaning either the OTN (Oracle Technology Network) or the BDSG (Big Data Spatial and Graph) one – just don’t forget to start your NoSQL database. If, however, you decide to work via Zeppelin, be careful to use the BDSG version of the PGX server, as the OTN one does not support loading a property graph stored in Oracle NoSQL.

At this point, some edges of our graph exist in both directions, which is not desirable. As we already mentioned above, this is because we consider <tag1><tag2> and <tag2><tag1> to be the same, but that kind of check has not been implemented in our code so far. Fortunately, PGX allows us to undirect the loaded graph and sum up the numeric properties of the edges using a MergingStrategy. The resulting graph will be reduced to about 4 million edges.

The steps and the respective commands are presented below:

  1. Load and undirect the graph:
    pgx> config = new PgNosqlGraphConfigBuilder().addVertexProperty("count", PropertyType.INTEGER).addVertexProperty("tagName", PropertyType.STRING).addEdgeProperty("weight", PropertyType.INTEGER).setHosts(["localhost:5000"].toList()).setStoreName("kvstore").setName("my_graph").build()
    pgx> nosqlgraph = session.readGraphWithProperties(config)
    pgx> nosqlgraph_und = nosqlgraph.undirect()
  2. Create the merging strategy and undirect the newly created graph using the strategy:
    pgx> msb = new MergingStrategyBuilder(nosqlgraph_und).setPropertyMergingStrategy("weight", MergingFunction.SUM).build()
    pgx> nosqlgraph_merge = nosqlgraph_und.undirect(msb)

Bear in mind that if you try to undirect the nosqlgraph graph created above directly with the msb MergingStrategy, you will get the following error:

ERROR Task - >> [ERROR] ParallelExecutionException on analysis-pool: UNDIRECT_GRAPH failed

When loading a property graph from Oracle NoSQL into PGX, you have to do it in two steps as showcased above.

Summary

To summarize, we combined quite a lot of things in order to end up with our final graph, and the tasks at hand proved far from trivial:

  • Oracle Database and HDFS were our storage options for parts of the initial data
  • Java and the Oracle NoSQL Blueprints API offered easy data manipulation and surprisingly fast performance
  • Hive queries were used to construct staging tables to build our final graph
  • The property graph format of Oracle NoSQL proved a perfect fit for storing our graph
  • Finally, PGX provided a neat solution for undirecting the graph and merging the unwanted bidirectional edges

At this stage, the graph is ready for further exploration and querying by data analysts. But before doing so, in the second part of this three-part post series, we will first see how to perform incremental updates of our graph as more data are added to the initial source.

Stay tuned!
