Implementing your own algorithms
In the previous tutorial we looked at the algorithms within the standard Raphtory library and the different ways we can apply them to a graph. While we have also written some simple custom functions, here we’re going to go over how to write a new multi-step algorithm. This example algorithm will calculate the degrees of separation from one a character to every other, defaulting to
As background: Six degrees of separation is “the idea that all people on average are six, or fewer, social connections away from each other.”
How analysis works in Raphtory
Raphtory’s analysis engine is based on vertex centric computation where algorithms are executed in bulk synchronous parallel
supersteps. Within each superstep all vertices complete a function, within which they may execute any of the follow
Gather: Receive information sent from other nodes in the graph (normally neighbouring nodes).
Apply: Calculate some new state and store these values.
Scatter: Send this state to neighbours to be used in the next step.
While Raphtory does have APIs for global aggregation and histograms, individual vertices have limited knowledge of the graph and must converge to their final state over several supersteps. Each vertex knows:
Its own update history, property set and property history e.g. values and update times.
The history/properties of the edges attached to it - both incoming and outgoing.
Its own algorithmic state - this is a map of temporary values that can be set and modified during analysis.
Writing algorithms in this manner allows them to be run on both your laptop and a cluster without requiring any change. Let’s see how it looks to implement the proposed Degree Separation algorithm in this format.
To begin first import pyraphtory, create a local context and ingest the Lord of The Rings data into a graph.
from pyraphtory.algorithm import PyAlgorithm from pyraphtory.graph import TemporalGraph, Row, Table from pyraphtory.vertex import Vertex import csv from pyraphtory.context import PyRaphtory from pyraphtory.graph import Row structure_file = "/tmp/lotr.csv" ctx = PyRaphtory.local() graph = ctx.new_graph() with open(structure_file, 'r') as csvfile: datareader = csv.reader(csvfile) for row in datareader: source_node = row destination_node = row timestamp = int(row) graph.add_vertex(timestamp, source_node, vertex_type="Character") graph.add_vertex(timestamp, destination_node, vertex_type="Character") graph.add_edge(timestamp, source_node, destination_node, edge_type="Character_Co-occurence")
Now that we have the graph set up we can create a new algorithm.
Algorithms are classes which extend
PyAlgorithm. This class has two functions to implement,
__call__ is where we will be doing the majority of our computation. This takes a
TemporalGraphand applies a set of operations to it (the functions we will define) and returns a new
tabularise is where we will choose what data on our vertices to return. This takes a
TemporalGraphand returns a
Tablewhich can either be turned into a dataframe or output to a
In this tutorial we will implement both as we want to compute a new property and output the results, but in many instances we may only need one or the other. For instance, we chained multiple algorithms together in the last tutorial by only applying the __call__ functions of each algorithm. No tabularise was then required as we finished off with our own call to
select (returning a
There are many functions which may be called on the
TemporalGraph to build algorithms as complex as you like; these can be found in the PythonDocs on the left. The main two of these are
step takes a function which will execute on all vertices once. This is normally to set up an algorithm or called in succession when you have a known number of steps to complete an algorithm.
iterate takes a function which will be applied on vertices until some convergence state is met. This is normally used for the bulk of an algorithm when the number of required steps is unknown.
This function also requires the user to set a maximum number of steps (to bound the computation) and to decide if all vertices need to execute every superstep or if only messaged vertices are required to, minimising wasted cycles.
As tabularise receives a TemporalGraph we can continue calling the same functions here, however, be aware that these won’t be applied if you use this algorithm in a chain or run your own
select. Once you have called select, as we have seen before, we can apply several
Table steps, such as
filter, before returning our data in the desired format.
Degree Separation from 10,000ft
For our algorithm we want to take these building blocks and set it up such that we start at Gandalf who will tell all of his neighbours they are 1 hop away. These nodes should then message all their neighbours telling them they are 2 hops away, and so on until no more nodes can be reached. Importantly, nodes should only update their state once, as we want the lowest number of hops.
Lets take a look at an example implementation of this algorithm, we can then go through what is happening step by step.
state_name = 'separation' #name the state we are going to be creating on each vertex class DegreeSeparation(PyAlgorithm): #our new class extends PyAlgorithm #create our __call__ function which takes the TemporalGraph and a name (defaulting to Gandalf) def __call__(self, graph: TemporalGraph,name = "Gandalf") -> TemporalGraph: #create a setup step function def step(v: Vertex): #find our starting node - the one with the correct name if v.get_property_or_else("name","") == name: #Set the `separation` of this node to 0 and send this value to all its neighbours v[state_name] = 0 v.message_all_neighbours(0) else: #if this isn't the starting node, set its separation to -1, which we can imagine means 'unreached' v[state_name] = -1 #create our iteration function def iterate(v: Vertex): #if the node hasn't been reached before (current state is still -1) if v[state_name] == -1: #take the minimum value of the messages received and add 1 sep_state = min(v.message_queue()) + 1 #set its state to the new separation value v[state_name] = sep_state #and tell its neighbours v.message_all_neighbours(sep_state) #if the node has been reached before we can ignore the messages and vote to halt else: v.vote_to_halt() #finally we apply these two functions to the graph #for iterate we set the maximum supersteps to 6 and # execute_messaged_only to True as nodes which receive no new information don't need to do anything. return graph.step(step).iterate(iterate, iterations=6,execute_messaged_only=True) #define our tabularise function which calls select on the graph and returns the vertex name and its separation value def tabularise(self, graph: TemporalGraph) -> Table: return graph.select("name", "separation") #execute our new algorithm on the graph and return a dataframe graph \ .execute(DegreeSeparation()) \ .to_df()
139 rows × 3 columns
Digging in the steps
As you can see from the output we now have a separation value for all vertices in the graph. Most of these scores are quite low with Gandalf being a central character, but we still have a couple with
-1 who were unreachable. Lets dig into how we calculated these scores.
To begin we have one setup
step in which we:
First check if a vertex is our starting point, where in the default case we are looking for Gandalf.
Once this node is found we:
Create a new state on the vertex representing the separation score and set it to
0- i.e. no separation
Message all of the node’s neighbours to tell them they are directly connected.
If a node is not the starting point their state is initialised to
-1. We can read this as currently
Next we implemented the bulk of the algorithm inside of a singular
iterate function. Within this we:
Check if the vertex has been reached before (its state is still
If it hasn’t we take the minimum value it has been messaged by its neighbours and increment this by one.
This is its separation from the starting node which it can then tell to its neighbours.
If it has already been contacted it ignores all messages and
votes to haltmeaning the node is happy with its current state and it thinks the algorithm has converged.
This function is then set to execute only on vertices that have been sent a message (
execute_messaged_only=True) and runs up to 6 times (
iterations = 6) for our six degrees of separation).
The Return of The King
Following the definition of
__call__() we implemented a
tabularise() which goes through all vertices and extracts the node name and the final degree of separation.
You could additionally add a filter here to ignore any nodes with a score of
-1. This would exclude nodes that are not at all connected to Gandalf or whose shortest path to Gandalf is longer than 6 hops
What is next?
You should now be equipped with all the tools you need to do some very cool graph analysis with Raphtory. Your next stop should be the Python docs on the left where you can dig into the variety of functionality available on the
Table. We also have several example projects with some interesting algorithms which will be a good inspiration!
If you have any issues or just want to tell us about your awesome new project, come and join us on Slack where there are always many members of the Raphtory team hanging out.