Installing PyRaphtory
PyRaphtory can be installed via pip. This pulls in all of the background dependencies for Raphtory and automatically sets up any system paths to point at the correct location. The only requirement is that you are running Python version 3.9.13.
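If you are unsure which interpreter pip will install into, a quick check like the sketch below can help. It treats the 3.9.13 requirement as an exact match, which is an assumption based on the wording above; the helper name is_supported_python is illustrative and not part of PyRaphtory.

```python
import sys

# Assumption: the requirement is an exact match on 3.9.13, as worded above.
# REQUIRED_PYTHON and is_supported_python are illustrative names only.
REQUIRED_PYTHON = (3, 9, 13)


def is_supported_python(version_info=sys.version_info):
    """Return True when the (major, minor, micro) triple equals REQUIRED_PYTHON."""
    return tuple(version_info[:3]) == REQUIRED_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported_python():
        print(f"Python {current} matches the required version.")
    else:
        print(f"Python {current} found; this page asks for 3.9.13.")
```

Run it with the same interpreter you intend to use for the tutorial, since pip installs into whichever Python it is attached to.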
Install
pip install pyraphtory
Running PyRaphtory
Once installed, let’s set up the most bare bones PyRaphtory graph, test that we can add some data to it and run our first query. Once this is all working we can move on to some much more exciting examples in the next section!
Before we start, however, you may have noticed that this page looks oddly like an IPython notebook. That is because it is! If you click the "open on github" link in the top right of the page you can follow along on your own machine. Right, back to the code!
First we need to import PyRaphtory. You may see some references to Java in the logs here; this is because, under the hood, Raphtory is written in Scala. You don’t have to worry about any of that though, as it’s all hidden away!
[1]:
import pyraphtory
Creating your first graph
Once Raphtory is installed we can create our first graph! To do this we first need a context, which we can get from the PyRaphtory object.
Our two options here are local and remote. As we are just testing on our laptops we can use local, meaning the Raphtory code will run within your Python process. We will dig into remote contexts later, when you want to deploy in a separate process or scale your graph past what your laptop can handle.
Once we have our context we can call new_graph(), which gives us a graph that we can add data into and run queries on.
[2]:
context = pyraphtory.local()
graph = context.new_graph()
Adding data to your Graph
Once a graph is created, we need to add some data to it if we want to run anything interesting. There are loads of ways of doing this in Raphtory, which we will cover in the next section, but for simplicity let’s just add some vertices and edges without any properties.
As Raphtory is focused on dynamic and temporal analysis, all events in the graph’s history (adding, updating or deleting nodes/edges) must happen at a given time. This can all be at the same time (if, for example, you are working with snapshots) but we still need a time.
As such, when we add a vertex we have two arguments: the timestamp and the vertex ID. Similarly, when adding an edge, we have three arguments: the timestamp, the source vertex and the destination vertex.
Note: All graphs are directed by default in Raphtory, but can be projected into an undirected graph. We will go in-depth into graph projections later in the tutorial.
In the following code block we have five updates for our graph, adding three vertices (1, 2, 3) at time 1 and two edges (1->2, 1->3) at time 2.
[3]:
graph.add_vertex(1, 1)
graph.add_vertex(1, 2)
graph.add_vertex(1, 3)
graph.add_edge(2, 1, 2)
graph.add_edge(2, 1, 3)
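To make the time-first argument order concrete, here is a plain-Python sketch that records the same five updates as timestamped tuples and lists what exists as of a given time. It does not use PyRaphtory and says nothing about how Raphtory stores history internally; the names updates and graph_as_of are illustrative.

```python
# Plain-Python model of the five updates above; this is NOT PyRaphtory code.
# Each entry is (timestamp, kind, payload), mirroring the time-first argument order.
updates = [
    (1, "vertex", (1,)),
    (1, "vertex", (2,)),
    (1, "vertex", (3,)),
    (2, "edge", (1, 2)),
    (2, "edge", (1, 3)),
]


def graph_as_of(updates, time):
    """Return the (vertices, edges) added at or before the given time."""
    vertices = set()
    edges = set()
    for timestamp, kind, payload in updates:
        if timestamp > time:
            continue  # this update has not happened yet at the requested time
        if kind == "vertex":
            vertices.add(payload[0])
        else:
            edges.add(payload)  # payload is the directed pair (source, destination)
    return vertices, edges


if __name__ == "__main__":
    print(graph_as_of(updates, 1))  # vertices only, no edges yet
    print(graph_as_of(updates, 2))  # vertices plus both edges
```

Viewing the graph at time 1 shows the three vertices with no edges, while viewing it at time 2 includes both edges, which is why every update needs a timestamp even when you only have a single snapshot.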
Running your first Query
Now that our data is loaded we can start interrogating it!
While we can write some very complicated algorithms in Raphtory, let’s start off with something simple: getting the indegree and outdegree of our nodes.
For this we first use step to store each vertex’s out-degree and in-degree as vertex state. We then call select on the graph, which takes the names of the properties we want to extract and runs on every vertex to obtain the respective values. This returns a Table full of Rows, each of which represents the result for one node. Note that providing no names is treated as the equivalent of select *, returning all properties for the vertices. Following a call to select we can either write our results to a Sink (file, database, etc.), which we will cover later in the tutorial, or convert them into a dataframe for further analysis.
In this example we have called to_df to get a dataframe.
If you have a look in the logs you can see that your query is given a Job ID and that Raphtory reports how long it took to run.
[4]:
df = graph \
    .step(lambda vertex: vertex.set_state("outdegree", vertex.out_degree())) \
    .step(lambda vertex: vertex.set_state("indegree", vertex.in_degree())) \
    .select("name", "outdegree", "indegree") \
    .to_df()
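As a sanity check on what this query should return, the degrees can also be worked out by hand. The sketch below is plain Python, not PyRaphtory: it counts outgoing and incoming edges for the two edges added earlier and pairs each vertex with the latest timestamp, 2. The variable names are illustrative.

```python
from collections import Counter

# Plain-Python cross-check of the query above; this is NOT PyRaphtory code.
vertices = [1, 2, 3]
edges = [(1, 2), (1, 3)]  # directed (source, destination) pairs added at time 2
latest_time = 2

# Out-degree counts how often a vertex is a source, in-degree how often it is a destination.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

# One row per vertex: (timestamp, name, outdegree, indegree).
rows = [(latest_time, v, out_degree[v], in_degree[v]) for v in vertices]

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Vertex 1 has two outgoing edges and no incoming ones, while vertices 2 and 3 each have a single incoming edge, so those are the values to expect in the dataframe.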
Checking out the output
Finally, once our query has run and we have got our dataframe, we can take a look at the results.
One notable aspect here is that we requested three variables, but we have four columns. This is because algorithms in Raphtory run at set points in time, meaning the values for each vertex must be associated with a timestamp (in this case the most recent one, 2).
As with every other cool feature I have hinted at, you will soon be an expert in queries, time-analysis and much more. All you have to do is continue on to the next page!
[5]:
df
[5]:
| | timestamp | name | outdegree | indegree |
|---|---|---|---|---|
| 0 | 2 | 1 | 2 | 0 |
| 1 | 2 | 2 | 0 | 1 |
| 2 | 2 | 3 | 0 | 1 |