Installing PyRaphtory

PyRaphtory can be installed via pip. This pulls in all of the background dependencies for Raphtory and automatically sets up any system paths to point at the correct locations. The only requirement is that you are running Python 3.9.13.

Install

pip install pyraphtory
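If you are unsure which interpreter pip will install into, a quick check before installing can save some confusion. The snippet below is a minimal sketch using only the standard library; `REQUIRED_VERSION` and `is_required_python` are hypothetical names chosen for illustration and are not part of PyRaphtory.

```python
import sys

# The version this page states as the requirement for PyRaphtory.
REQUIRED_VERSION = (3, 9, 13)


def is_required_python(version_info=None):
    """Return True if the given (or the running) interpreter matches REQUIRED_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major, minor and micro; ignore release level and serial.
    return tuple(version_info[:3]) == REQUIRED_VERSION


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_required_python():
        print(f"Python {found}: matches the stated requirement")
    else:
        print(f"Python {found}: this page asks for 3.9.13")
```

Run it with the same interpreter you intend to use for the notebook, so that the check and the install refer to the same environment.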


Running PyRaphtory

Once installed, let’s set up the most bare-bones PyRaphtory graph, test that we can add some data to it, and run our first query. Once this is all working we can move on to some much more exciting examples in the next section!

Before we start, however, you may have noticed that this page looks oddly like an IPython notebook. That is because it is! If you click the open on GitHub link in the top right of the page, you can follow along on your own machine. Right, back to the code!

First we need to import PyRaphtory. You may see some references to Java in the logs here; this is because, under the hood, Raphtory is written in Scala. You don’t have to worry about any of that though, as it’s all hidden away!

[1]:

import pyraphtory


Once Raphtory is installed we can create our first graph! To do this we first need a context which we can get from the PyRaphtory object.

Our two options here are local and remote. As we are just testing on a laptop, we can use local, meaning the Raphtory code will run within your Python process. We will dig into remote contexts later, for when you want to deploy in a separate process or scale your graph past what your laptop can handle.

Once we have our context we can call new_graph() to create a graph, which we can add data to and run queries on.

[2]:

context = pyraphtory.local()
graph = context.new_graph()


Once a graph is created, we need to add some data to it if we want to run anything interesting. There are loads of ways of doing this in Raphtory, which we will cover in the next section, but for simplicity let’s just add some vertices and edges without any properties.

As Raphtory is focused on dynamic and temporal analysis, all events in the graph’s history (adding, updating or deleting nodes/edges) must happen at a given time. This can all be at the same time (if, for example, you are working with snapshots) but we still need a time.

As such, when we add a vertex we have two arguments: the timestamp and the vertex ID. Similarly, when adding an edge, we have three arguments: the timestamp, the source vertex and the destination vertex.

Note: All graphs are directed by default in Raphtory, but can be projected into an undirected graph - we will go into graph projections in depth later in the tutorial.

In the following code block we have five updates for our graph, adding three vertices (1, 2, 3) at time 1 and two edges (1->2, 1->3) at time 2.

[3]:

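graph.add_vertex(1, 1)  # (timestamp, vertex ID)
graph.add_vertex(1, 2)
graph.add_vertex(1, 3)
graph.add_edge(2, 1, 2)  # (timestamp, source, destination)
graph.add_edge(2, 1, 3)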
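To make the idea of timestamped updates concrete, the five updates described above can be modelled without the library at all. This is a simplified sketch in plain Python, not PyRaphtory’s implementation: it handles additions only (no deletions or properties), and the `updates` list and `graph_at` helper are hypothetical names introduced for illustration.

```python
# Each update is (timestamp, kind, payload), with the timestamp first
# to mirror the argument order described in the text.
updates = [
    (1, "vertex", 1),
    (1, "vertex", 2),
    (1, "vertex", 3),
    (2, "edge", (1, 2)),  # directed edge 1 -> 2
    (2, "edge", (1, 3)),  # directed edge 1 -> 3
]


def graph_at(updates, time):
    """Return the (vertices, edges) that have been added at or before `time`."""
    vertices, edges = set(), set()
    for timestamp, kind, payload in updates:
        if timestamp > time:
            continue  # this update has not happened yet at `time`
        if kind == "vertex":
            vertices.add(payload)
        else:
            edges.add(payload)
    return vertices, edges


print(graph_at(updates, 1))  # the three vertices, no edges yet
print(graph_at(updates, 2))  # the three vertices plus both edges
```

Viewing the graph as a log of events is what lets the same data be inspected at time 1 (vertices only) or at time 2 (vertices and edges).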


Now that our data is loaded we can start interrogating it!

While we can write some very complicated algorithms in Raphtory, let’s start off with something simple: getting the indegree and outdegree of our nodes.

For this we call select on the graph, which takes the names of the properties we want to extract and runs on every vertex to obtain the respective values. This returns a Table full of Rows, each representing the result for one node. Note that providing no names is treated as the equivalent of select *, returning all properties for the vertices. Following a call to select we can either write our results to a Sink (file, database, etc.), which we will cover later in the tutorial, or convert them into a dataframe for further analysis.

In this example we first use step to store each vertex’s outdegree and indegree as state, then call select followed by to_df to get a dataframe.

If you have a look in the logs you can see that your query is given a Job ID, and Raphtory will report how long it took to run.

[4]:

df = graph \
    .step(lambda vertex: vertex.set_state("outdegree", vertex.out_degree())) \
    .step(lambda vertex: vertex.set_state("indegree", vertex.in_degree())) \
    .select("name", "outdegree", "indegree") \
    .to_df()
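For intuition about what this query computes, the same degrees can be worked out by hand from the edge list described earlier. The sketch below is plain Python rather than the PyRaphtory API; `degree_rows` is a hypothetical helper, and the vertex and edge lists are simply the data described on this page.

```python
from collections import Counter

vertices = [1, 2, 3]
edges = [(1, 2), (1, 3)]  # directed: (source, destination)
latest_time = 2           # time of the most recent update


def degree_rows(vertices, edges, timestamp):
    """Build one row per vertex with its out-degree and in-degree."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(destination for _, destination in edges)
    return [
        {
            "timestamp": timestamp,
            "name": vertex,
            "outdegree": out_degree[vertex],  # Counter gives 0 for missing keys
            "indegree": in_degree[vertex],
        }
        for vertex in vertices
    ]


for row in degree_rows(vertices, edges, latest_time):
    print(row)
```

Each row carries the time at which the degrees were computed alongside the three requested values, which is a handy thing to compare against the dataframe the real query returns.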


Checking out the output

Finally, once our query has run and we have got our dataframe, we can take a look at the results.

One notable aspect here is that we requested three variables, but we have four columns. This is because algorithms in Raphtory run at set points in time, meaning the values for each vertex must be associated with a timestamp (in this case the most recent one, 2).

As with every other cool feature I have hinted at, you will soon be an expert in queries, time-analysis and much more. All you have to do is continue on to the next page!

[5]:

df

[5]:

   timestamp  name  outdegree  indegree
0          2     1          2         0
1          2     2          0         1
2          2     3          0         1