More often than not, data scientists work with huge amounts of data that are produced by a business. They typically use numerous mathematical, statistical and programmatic techniques to extract valuable information and get untapped value out of data.
As a data scientist, you might face a challenge at some point of an insufficient amount of historical and/or labeled data. As a result, the majority of knowledge discovery methods can’t be used effectively. Without being able to produce extra information for business inquiries, we tend to move on from the project. However, I’d like to challenge you to look at your data from a different point of view. I can show you the value of graphs by showcasing examples in anomaly detection applications where one is interested in finding the most unusual data occurrences.
A great deal of information can be discovered not only from data records themselves, but also from analysing the relationship between the data points. Analysts can begin utilizing more out of what they already have by trying to understand the relationship between data. So-called graphs and graph analysis (a.k.a social networks or networks) can help you to discover different interactions, communications and relationships between one, or multiple data points. This enables you to imply more measures and context onto the data.
What is graph?
Graph (a.k.a social network) is an ordered (for directional graph) or unordered (for undirected graphs) pair of vertices and edges (a.k.a. nodes and links), where vertices can be represented as a circle, and edges as a line connecting circles. The most common graph examples are family trees, tournament brackets, organisational charts, IOT charts, or Facebook and Linkedin friends/colleague networks. Most commonly, graphs are interpreted as an adjacency matrix. This means that relationships can be processed by any conventional programming language and stored in any database.
Graph data structure represents each data point as an individual instance. In comparison to Entity Relation (ER), the data structure relationship between individual nodes can be unique. Graphs can be seen as a different paradigm to the traditional ER representation. Yet, it also comes with its own pros and cons.
As once said by D. M. Hawkins: “An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism”. We can use graphs to identify and extract even more possible ‘deviations’, various types of measures and characteristics from the normal observations. Methods we can take to extract these measures from a graph-like structure is being researched by graph theory and the subfield of social sciences that focuses on social networks.
These methods can be split into two parts: traditional and graph methods, where graph methods can be statistical, machine learning, and any other knowledge discovery-based approach that utilises graph structure and/or its characteristics. graphs can be looked at from a fixed (static) or dynamic (for example describe change of interactions in time) standpoint. Not only we are able to extract proximity, statistics, angle, relation, isolation-based and other metrics for more traditionally structured data, but we also are able to explore graph based metrics that describe individual vertices, edges, graphs or subgraphs. Although, it might sound complicated for beginners, measuring different characteristics in graph structures are made easier by using researched algorithms that can be easily applied to your graph.
By understanding different measures like connectivity, distance, centrality, similarity etc., we can build a better understanding of our data. Arguably, the most popular graph algorithm, in my opinion, is PageRank which determines the importance of each individual element in the graph by measuring the connectivity and the quality of connections of a particular vertex. This algorithm was the first Google Search algorithm. Some readers might be familiar with Dijkstra’s shortest path algorithm or A* (A star – shortest path with heuristics) which is able to find the most optimal path between two points (very handy in navigation). To be honest, there are a dozen more algorithms and approaches on how to solve and gather information for a particular problem. Try to find out what algorithm or approach would be better for your problem.
Analytical approaches to utilize graph structure
Graphs are often used to give more specific context to your inquiries. For example, you could apply community algorithms to narrow down a specific subgraph (subgroup of interest). Filtering is applied on the instance level. By giving context to the dataset, you might find local anomalies within each community. It’s a very powerful concept because many outliers could globally be considered ‘normal’. By identifying specific communities one might discover how individual community members breach into other communities (and which ones).
Just like with regular SQL reporting queries, in order to find predefined ‘anomalies’ or points of interest, one might want to apply specific filters (patterns) to a graph. For example, applying ego graph filter (vertex of interest with its neighboring vertices) with specific incoming or outgoing connection count OR [(point of interest)-(personal identifier)-(secondary point of interest)] like patterns where a researcher is trying to extract specific linking points. Pattern matching could help detect previously identified “anomalies”.
Edge and node, meta-path prediction
Not only we are able to model instance-level relationships, and set corresponding vertices, but we are able to predict the probability of a link, node, or a pattern. It gives us the possibility to build a recommender, find potential conflicts of interest, or future collaborators, for example. Similar to more traditional data structures, the same knowledge discovery methods can be applied to graph-like data structure. Having graph data can provide us with more features to train our models. It’s most likely that advanced knowledge discovery methods are already utilizing instance level relationships between data points.
Graph Databases (gdb) are databases that are optimised for graph structure data storage and processing. My experience and opinion mostly is biased by a particular database that I’ve been using – neo4j. Graph databases focuses on storing instantiated data structure in a more efficient structure. The main difference between graph and relational db’s is that ERDB is modelled by (ER diagram) where each entity is a table and complex relationships (n:m; m:n) are described in additional tables (transaction tables). This provides a very effective and efficient way to store large amounts of data, yet it’s more difficult to describe unique relationships. It builds on another obstruction layer on top of the data. The only reason why we still accept obstructions is that we’ve been using ER approach for multiple decades, and it has become the “normal” approach. On the other hand, Graph databases are storing relevant information generarly in two tables – node and link tables which are partitioned by labels you can give to nodes. For this reason when you compare Graph Databases, ERDB is much harder to scale and distribute evenly because splitting graphs into multiple subgraphs has to take into consideration connectivity, path traversal, and amount of nodes in a partition.
Querying is yet another major difference which should be utilized if you’re working with graphs. You no longer need to join tables together in order to expand your analysis. You build your queries based on the graph meta-model (schema). Graph meta-model describes possible node-link-node connection options. Query returns all possible outcomes that matched your desired pattern. Patterns are important features for graph data structures because data points are laid out on an individual node and relationship instance level. Understanding this difference will enable you to understand this very different data structure paradigm. For example, Cypher Query Language (CQL) is using MATCH operator apply pattern you’re looking for. In the table below I demonstrate the difference between a regular SQL query and CQL.
Parsing through data records
Parsing through data points are very different as well. GDB are looping through data records in a particular table only once to find the first node in your match statement. Then the match statement is traversed through using list of available outgoing relationships for that particular node. This process is roughly illustrated in the animation below. Traversal through available space using only available paths can significantly increase query performance because engine is only looking at relevant and available neighboring nodes. By limiting maximal hops, you make sure that you’re trying to find only ‘close connections’. Parsing through data points are very different as well. GDB are looping through data records in a particular table only once to find the first node in your match statement. Then the match statement is traversed through using list of available outgoing relationships for that particular node. This process is roughly illustrated in the animation below. By limiting maximal hops, you make sure that you’re trying to find only ‘close connections’. A very amusing example of understanding connection and traversal between two points of interest is called Six Degrees of Kevin Bacon. This game states that any two people are six or fewer links apart from each other.
To sum up graph analysis and it’s analytical tools are very beneficial for any data explorist and data scientist. You might be more effective in your work by being able to work with, or at least understand concepts and the different paradigm of graphs. Now you have been introduced to graphs, looked at their purpose and use cases, seen the drawbacks and advantages of instantiated and highly relational data, and learnt a bit of the technical side of graph databases. I hope you will be more aware of your available toolkit while solving new and unknown challenges.