Getting data into a graph
Now that we know PyRaphtory is installed and running, let’s look at the different ways to get some real data into a graph.
For this first set of tutorials we are going to be building graphs from a Lord of the Rings dataset, looking at when characters interact throughout the trilogy 🧝🏻♀️🧙🏻♂️💍.
As with the quick start install guide, this and all following Python pages are built as iPython notebooks. If you want to follow along on your own machine, click the "open on github" link in the top right of this page.
Let’s have a look at the example data
The data we are going to use is two csv files which will be pulled from our Github data repository. These contain the structure of the graph (lotr.csv) and some metadata about the characters (lotr_properties.csv).
In the structure file, each line contains two characters that appeared in the same sentence, along with the sentence number, which we will use as a timestamp. The first line of the file is Gandalf,Elrond,33, which tells us that Gandalf and Elrond appear together in sentence 33.
In the properties file, each line gives a character's name, their race and their gender. For example Gimli,dwarf,male.
Downloading the csv from Github 💾
The following curl commands will download the csv files and save them in the /tmp directory on your computer. These files will be deleted when you restart your computer, but they are only a couple of KB in any case.
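If curl isn't available on your machine, the same two files can be fetched with Python's standard library instead; this is just an alternative sketch using the URLs and paths from the cell below.
import urllib.request

# Alternative to the curl commands below: download both csv files with the
# Python standard library and save them to the same /tmp paths.
urllib.request.urlretrieve("https://raw.githubusercontent.com/Raphtory/Data/main/lotr.csv", "/tmp/lotr.csv")
urllib.request.urlretrieve("https://raw.githubusercontent.com/Raphtory/Data/main/lotr_properties.csv", "/tmp/lotr_properties.csv")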
[1]:
print("****Downloading Data****")
!curl -o /tmp/lotr.csv https://raw.githubusercontent.com/Raphtory/Data/main/lotr.csv
!curl -o /tmp/lotr_properties.csv https://raw.githubusercontent.com/Raphtory/Data/main/lotr_properties.csv
print("****LOTR GRAPH STRUCTURE****")
!head -n 3 /tmp/lotr.csv
print("****LOTR GRAPH PROPERTIES****")
!head -n 3 /tmp/lotr_properties.csv
****Downloading Data****
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 52206 100 52206 0 0 926k 0 --:--:-- --:--:-- --:--:-- 926k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 686 100 686 0 0 1889 0 --:--:-- --:--:-- --:--:-- 1895
****LOTR GRAPH STRUCTURE****
Gandalf,Elrond,33
Frodo,Bilbo,114
Blanco,Marcho,146
****LOTR GRAPH PROPERTIES****
Aragorn,men,male
Gandalf,ainur,male
Goldberry,ainur,female
Setting up our imports and Raphtory Context
Now that we have our data, we can sort out our imports and create the Raphtory Context which we will use to build our graphs.
The imports are for parsing CSV files, accessing pandas dataframes, and bringing in all the Raphtory classes we will use in the tutorial.
The filenames are pointing at the data we just downloaded. If you change the download location above, make sure to change them here as well.
[2]:
import csv
import pandas as pd
from pyraphtory.context import PyRaphtory
from pyraphtory.input import ImmutableString
from pyraphtory.input import GraphBuilder
from pyraphtory.spouts import FileSpout
from pyraphtory.sources import CSVEdgeListSource
from pyraphtory.sources import Source
from pyraphtory.graph import Row
structure_file = "/tmp/lotr.csv"
properties_file = "/tmp/lotr_properties.csv"
ctx = PyRaphtory.local()
Adding data directly into the Graph
The simplest way to add data into a graph is to directly call the add_vertex and add_edge functions, which we saw in the quick start guide. These have required arguments defining the time the addition occurred and an identifier for the entity being updated. These functions, however, also have several optional arguments allowing us to add properties and types on top of the base structure. Raphtory also allows for a secondary time index for disambiguating event ordering; this defaults to the number of prior updates sent + 1.
Function | Required Arguments | Optional Arguments
---|---|---
add_vertex | timestamp, vertex id | properties, vertex type, secondary index
add_edge | timestamp, source id, destination id | properties, edge type, secondary index
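As a quick sketch of how these arguments fit together (using a throwaway graph and a made-up timestamp rather than the real dataset, and reusing the property and type keywords that appear in the cells later in this tutorial):
sketch_graph = ctx.new_graph()

# A single hand-written update: required arguments first (timestamp, identifier),
# then optional properties and a type.
sketch_graph.add_vertex(1, "Gandalf", properties=[ImmutableString("race", "ainur")], vertex_type="Character")
sketch_graph.add_vertex(1, "Elrond", vertex_type="Character")
# Edges take a timestamp plus source and destination identifiers, with an optional type.
sketch_graph.add_edge(1, "Gandalf", "Elrond", edge_type="Character_Co-occurence")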
Let's take a look at this with our example data. In the below code we are opening The Lord of The Rings structural data via the csv reader and looping through each line.
To insert the data we:
1. Extract the two characters' names, referring to them as the source_node and destination_node.
2. Extract the sentence number, referring to it as the timestamp. This is then cast to an int as timestamps in Raphtory must be a number.
3. Call add_vertex for both nodes, setting their type to Character.
4. Create an edge between them via add_edge and label it a Co-occurence.
[3]:
graph = ctx.new_graph()
with open(structure_file, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        source_node = row[0]
        destination_node = row[1]
        timestamp = int(row[2])
        graph.add_vertex(timestamp, source_node, vertex_type="Character")
        graph.add_vertex(timestamp, destination_node, vertex_type="Character")
        graph.add_edge(timestamp, source_node, destination_node, edge_type="Character_Co-occurence")
Let’s see if the data has ingested
To do this, much like the quick start, we can run a query on our graph. As Raphtory allows us to explore the network's history, let's add a bit of this in as well.
Below we create a function to extract the first appearance of a character. This takes a vertex and calls name() and earliest_activity(). These return the name we gave in the add_vertex calls above and a HistoricEvent object. This object contains the sentence in which the character was introduced (.time()) and the position of this update in our data (.index()). From our function we return a Row with all the elements we are interested in.
Once defined, we can call select on our graph and apply this function to all vertices, followed by a call to to_df which returns a dataframe with our results.
You will see in the results that we have a timestamp column; this is because both updates and queries must happen at a given time. This defaults to the latest time in the data, 32674 in our case. Don't worry too much about the details of Raphtory queries here, we will get into them in the coming tutorials.
[4]:
def characters_first_appearance(vertex):
    name = vertex.name()
    event = vertex.earliest_activity()
    earliest_appearance = event.time()
    index = event.index()
    return Row(name, earliest_appearance, index)

first_appearance_df = graph \
    .select(characters_first_appearance) \
    .to_df(["name", "earliest_appearance", "index"])

first_appearance_df
[4]:
  | timestamp | name | earliest_appearance | index |
---|---|---|---|---|
0 | 32674 | Hirgon | 26628 | 4965 |
1 | 32674 | Hador | 8105 | 1708 |
2 | 32674 | Horn | 28044 | 5329 |
3 | 32674 | Galadriel | 374 | 63 |
4 | 32674 | Isildur | 1309 | 213 |
... | ... | ... | ... | ... |
134 | 32674 | Faramir | 359 | 52 |
135 | 32674 | Bain | 6717 | 1258 |
136 | 32674 | Walda | 31162 | 7090 |
137 | 32674 | Thranduil | 7053 | 1414 |
138 | 32674 | Boromir | 7059 | 1423 |
139 rows × 4 columns
Updating graphs, merging datasets and adding properties
One cool thing about Raphtory is that we can freely insert new information at any point in time and it will automatically be slotted into chronological order. This makes it really easy to merge datasets or ingest out-of-order data.
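For example, the order in which we send updates does not need to match the order of their timestamps; a minimal sketch (on a throwaway graph with made-up timestamps) would be:
demo_graph = ctx.new_graph()

# The later event is sent first, the earlier one second; Raphtory orders
# the resulting history by timestamp rather than by arrival order.
demo_graph.add_vertex(200, "Gandalf")
demo_graph.add_vertex(100, "Gandalf")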
Raphtory currently supports several types of mutable properties, which can change throughout the lifetime of a vertex or edge, giving them a history to be explored. We also allow the user to specify immutable properties, which only ever have one value; these are useful for metadata and for saving memory! All property objects require the user to specify a name and value. The currently supported properties include:
MutableString()
MutableLong()
MutableDouble()
MutableBoolean()
ImmutableString()
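As a hedged sketch of the difference between the two kinds (the MutableLong import path and constructor are assumed here to mirror ImmutableString, and the "mentions" property is made up purely for illustration):
from pyraphtory.input import MutableLong  # assumed to live alongside ImmutableString

property_sketch_graph = ctx.new_graph()
# Immutable property: a single value for the lifetime of the vertex (good for metadata).
# Mutable property: can be re-sent at later timestamps, building up a history of values.
property_sketch_graph.add_vertex(1, "Gimli", properties=[ImmutableString("race", "dwarf"),
                                                         MutableLong("mentions", 1)])
property_sketch_graph.add_vertex(2, "Gimli", properties=[MutableLong("mentions", 2)])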
To explore this and to add some properties to our graph, let's load our second dataset!
Below we are opening our properties file the same way as the first. As we don't have any time to work with in this data, we will have to create some of our own. We have two options: we can say it all happens at time 1, or we can use the results of our earliest appearance query to decide when to insert the properties.
For the latter we have zipped the name and earliest_appearance columns from our dataframe and turned them into a dict where we can look up the best timestamp for each character.
For each line we then:
1. Get the name and look it up in our dict to get the timestamp.
2. Get the race and gender from the data and wrap them in an ImmutableString, as they are unchanging metadata, so there is no need to maintain a history.
3. Call add_vertex, passing all of this information.
Now it's worth noting that we aren't calling a function called update_vertex or something similar, even though we know the vertex already exists. This is because everything is considered an addition into the history and Raphtory sorts out the ordering internally!
[5]:
earliest_appearence = dict(zip(first_appearance_df.name, first_appearance_df.earliest_appearance))

with open(properties_file, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    for row in datareader:
        name = row[0]
        timestamp = earliest_appearence[name]
        race = ImmutableString("race", row[1])
        gender = ImmutableString("gender", row[2])
        graph.add_vertex(timestamp, name, properties=[race, gender])
Using our properties as part of a query
To quickly see if our new properties are included in the graph we can write a new query! Let's have a look at the dwarves who have the most interactions.
To start, we can create a function which, for each vertex, returns its name and the length of its history(), i.e. the number of updates it has had. As we have one update per interaction, this gives us a quick count of the total interactions throughout the books.
This function can be given to select and applied to the graph as before, but first let's apply a vertex_filter() which will check the value of the race property and remove anyone who isn't a dwarf.
Finally, we can sort our dataframe by the number of interactions to see Gimli has by far the most!
[6]:
def and_my_axe(vertex):
    name = vertex.name()
    interactions = len(vertex.history())
    return Row(name, interactions)

popular_dwarves = graph \
    .vertex_filter(lambda vertex: vertex.get_property_or_else("race", "unknown") == "dwarf") \
    .select(and_my_axe) \
    .to_df(["name", "interactions"])

popular_dwarves.sort_values(by="interactions", ascending=False)
[6]:
  | timestamp | name | interactions |
---|---|---|---|
3 | 32674 | Gimli | 371 |
0 | 32674 | Glóin | 63 |
1 | 32674 | Balin | 29 |
2 | 32674 | Thorin | 11 |
Ingesting data with Sources
Inserting updates one by one works for small datasets like this Lord of The Rings graph, but it isn't the most efficient way to parse your data. To enable you to work with large datasets we provide the Source API. Sources let you define where to pull data from and how to convert each tuple into graph updates. Raphtory can then handle batching and other speed-ups internally.
Sources take two arguments:
1. A Spout, which defines the location of the data.
2. A GraphBuilder, which contains your parsing function.
We will come onto custom builders in a second because, if your data exists in a standard graph format, there is a good chance Raphtory already has one defined! For instance, the lotr.csv file we used above is in an edge list format, so we can use the CSVEdgeListSource. This particular source will parse each line as two vertex additions and an edge addition at the given timestamp. By default the timestamp is assumed to be at the end of the line, but this can be changed via arguments.
In the below code we:
1. Create a new graph called edge_list_graph.
2. Create a FileSpout, giving it the structure_file.
3. Create the CSVEdgeListSource and hand it the FileSpout which it will use to pull the data.
4. Connect the source to the graph by calling the load() function.
5. Check the Source has ingested the data by running our earliest appearance query.
It is worth noting here:
- We can pass the FileSpout several more advanced options, such as a directory of files or a filepath regex, and it will pull in all files which match.
- load() can be called as many times as you like on the graph, with different spouts and builders, allowing you to merge data from multiple sources.
[7]:
edge_list_graph = ctx.new_graph()

spout = FileSpout(structure_file)
source = CSVEdgeListSource(spout, source_index=0, target_index=1, time_index=2, delimiter=",", header=False)
edge_list_graph.load(source)

edge_list_graph \
    .select(characters_first_appearance) \
    .to_df(["name", "earliest_appearance", "index"])
[7]:
  | timestamp | name | earliest_appearance | index |
---|---|---|---|---|
0 | 32674 | Hirgon | 26628 | 1656 |
1 | 32674 | Hador | 8105 | 570 |
2 | 32674 | Horn | 28044 | 1777 |
3 | 32674 | Galadriel | 374 | 22 |
4 | 32674 | Isildur | 1309 | 72 |
... | ... | ... | ... | ... |
134 | 32674 | Faramir | 359 | 18 |
135 | 32674 | Bain | 6717 | 420 |
136 | 32674 | Walda | 31162 | 2364 |
137 | 32674 | Thranduil | 7053 | 472 |
138 | 32674 | Boromir | 7059 | 475 |
139 rows × 4 columns
Creating custom Sources
Finally, let's wrap up this tutorial by combining everything we have already learnt with some custom sources! The next bit of code is chunkier than before, so we have put comments inline to make it easier to follow along.
As explained above, the generic Source object takes two arguments. We have already worked with the FileSpout, which we can reuse here, so let's focus on the second argument, the GraphBuilder.
GraphBuilders require you to provide a function which takes two arguments: a graph and a tuple. The graph here is the same class that we have been using to add updates individually, so the functions are exactly the same. The tuple is a singular piece of data output by the Spout. In our case the spout is going to produce strings, one for each line in the file we give it.
Below we create two custom sources, one for each file we have been working with, requiring two parsing functions. These functions are almost an exact copy and paste from above; however, we don't need the for-loop as we only need to think at the level of a single line.
The full pipeline of analysis has been recreated to enable this to run as a standalone script.
Once you are comfortable with everything here, continue onto the next tutorial to get started on some real temporal queries.
[8]:
#First define our query functions
def characters_first_appearance(vertex):
    name = vertex.name()
    event = vertex.earliest_activity()
    earliest_appearance = event.time()
    index = event.index()
    return Row(name, earliest_appearance, index)

def and_my_axe(vertex):
    name = vertex.name()
    interactions = len(vertex.history())
    return Row(name, interactions)

#Create a new graph
custom_source_graph = ctx.new_graph()

#Define the first graph builder parsing function which is going to handle the structure_file
def parse_structure(graph, tuple: str):
    row = [v.strip() for v in tuple.split(",")]
    source_node = row[0]
    destination_node = row[1]
    timestamp = int(row[2])
    graph.add_vertex(timestamp, source_node, vertex_type="Character")
    graph.add_vertex(timestamp, destination_node, vertex_type="Character")
    graph.add_edge(timestamp, source_node, destination_node, edge_type="Character_Co-occurence")

#Create a new FileSpout for the structure file
structure_spout = FileSpout(structure_file)
#Create a custom source, giving it the structure_spout and a GraphBuilder with our function to parse the structure_file
structure_graph_builder = GraphBuilder(parse_structure)
structure_source = Source(structure_spout, structure_graph_builder)
#Connect our structure_source to our graph
custom_source_graph.load(structure_source)

#Run the earliest appearance query on our new graph so we can use it in the second parser
first_appearance_df = custom_source_graph \
    .select(characters_first_appearance) \
    .to_df(["name", "earliest_appearance", "index"])
earliest_appearence = dict(zip(first_appearance_df.name, first_appearance_df.earliest_appearance))

#Define the second parsing function to handle the properties_file
def parse_properties(graph, tuple: str):
    row = [v.strip() for v in tuple.split(",")]
    name = row[0]
    timestamp = earliest_appearence[name]
    race = ImmutableString("race", row[1])
    gender = ImmutableString("gender", row[2])
    graph.add_vertex(timestamp, name, properties=[race, gender])

#Create a second FileSpout for the properties_file
property_spout = FileSpout(properties_file)
#Create a source for the property_spout with a graph builder which uses our second parsing function
property_graph_builder = GraphBuilder(parse_properties)
property_source = Source(property_spout, property_graph_builder)
#Load our properties_file into the graph
custom_source_graph.load(property_source)

#Finally, we can run our popular_dwarves query on the new graph and get out the result!
popular_dwarves = custom_source_graph \
    .vertex_filter(lambda vertex: vertex.get_property_or_else("race", "unknown") == "dwarf") \
    .select(and_my_axe) \
    .to_df(["name", "interactions"])
popular_dwarves.sort_values(by="interactions", ascending=False)
popular_dwarves.sort_values(by="interactions",ascending=False)
[8]:
  | timestamp | name | interactions |
---|---|---|---|
3 | 32674 | Gimli | 371 |
0 | 32674 | Glóin | 63 |
1 | 32674 | Balin | 29 |
2 | 32674 | Thorin | 11 |