One of the most valuable features of a network graph is the ability to see existing connections. An even more valuable feature is the ability to see possible connections that don’t exist yet. Facebook is valuable not only because you can see your own friend network, but also because it can suggest new friends to add to that network. Google is valuable not only because it can answer your questions, but also because it can suggest new ways of asking a question based on what people similar to you have asked. This article shows how we can explore a network by asking not only about existing connections, but also about potential ones.
For example, the network below contains persons A, B, C, D, and E, connected by “is friends with” links.
Everyone except person E is directly connected to someone else; E does not have any friends in this network. Person A is the best connected, with three friends. Persons B, C, and D are all friends with person A, but not with each other. In theory, it should be easy to create a friendship link between any pairing of B, C, and D because of their indirect connection via person A. It would be difficult to connect person E with anyone, because E has no connections at all.
Let’s add item 1 to the network. Item 1 is an object that persons A, B, C, and E like.
Now we can see that person E is indirectly connected with persons A, B, and C. We can say that there is a reasonable chance that E can form a friendship link within the network.
In addition, persons B and C now have two indirect connections (via person A and via item 1). This suggests that they are even more likely to form a successful friendship link.
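One rough way to put numbers on these indirect connections is to count the neighbors that two nodes share. The sketch below is only an illustration in base R: toy_links re-creates the example network described above, and neighbors_of() and shared_neighbors() are helper names made up for this example.

```r
# Toy network from the example: three friendships plus four "likes" links.
toy_links <- data.frame(
  source = c("A", "A", "A", "A", "B", "C", "E"),
  target = c("B", "C", "D", "Item 1", "Item 1", "Item 1", "Item 1"),
  type   = c(rep("is friends with", 3), rep("likes", 4)),
  stringsAsFactors = FALSE
)

# All nodes directly linked to `node`, ignoring link direction.
neighbors_of <- function(node, links) {
  unique(c(links$target[links$source == node],
           links$source[links$target == node]))
}

# Nodes that both `a` and `b` are linked to, i.e. their indirect connections.
shared_neighbors <- function(a, b, links) {
  sort(intersect(neighbors_of(a, links), neighbors_of(b, links)))
}

shared_neighbors("B", "C", toy_links)  # "A" "Item 1"
shared_neighbors("B", "D", toy_links)  # "A"
shared_neighbors("E", "D", toy_links)  # character(0)
```

B and C share two neighbors, B and D share one, and E and D share none, which matches the ranking described above.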
Facebook and LinkedIn are valuable because they have enough information about people that they can accurately suggest links between two people via their friendship network, their professional network, and their interests. These suggestions can be derived by querying network graphs.
In this article we will explore a simple social network of people, their friends, and the type of television they like watching. We are going to attempt to identify possible connections between nodes.
Let’s load some packages that we know we’ll need:
packages <- c('dplyr', 'stringr', 'tidyr', 'igraph')
sapply(packages, library, character.only = TRUE)
We’ll also work with some custom functions I made to traverse link lists and nodes:
devtools::install_github("beemyfriend/tinkR")
library(tinkR)
This will be the last time I use a made-up example for these articles on graphs. Below we’ll create a tibble where each observation includes a person, that person’s friends, and the type of television that person likes to watch. While this fake dataset is pretty ugly, it actually looks a lot like data I see in the wild. It’s not uncommon to see a dataset where one of the variables is actually a delimited list packed into a single string (semicolon-separated here):
social_network <- tibble(
  person = c(
    'Darnold',
    'Josh',
    'Ridley',
    'Rhonda',
    'Tremaine',
    'Miesha'
  ),
  friendsWith = c(
    "Lamar;Jackson;Steve;Logan;Miesha",
    "Billy;Mayerfield;Allen;Todd;Rhonda",
    'Marcell;Ozuna;Brandon;Morrow;Josh',
    "Darnold;Ridley;Marcell;Allen;Tremaine;Logan",
    "Baker;Mayerfield;Ridley;Josh",
    "Logan;Todd;Brandon;Macell;Baker"
  ),
  likesWatching = c(
    "MMA;Boxing;Kickball;Football;Rugby",
    "Table Tennis;Football;WWF;Rugby",
    "Underwater Basket Weaving;Monster Trucks;Rugby",
    "My Little Pony;Football;MMA;Table Tennis;Rugby",
    "Underwater Basket Weaving",
    "Rugby;MMA;The Cooking Channel"
  )
)
tidyr is awesome in cases like this because unnesting this data naturally creates linked pairs. Let’s create two link lists: one with person-friend links and another with person-likes links. We will then combine them into a single links variable to use for our network.
# Split each semicolon-delimited string into a list column, then unnest it
# so that every person-friend pair becomes its own row.
person_friends <- social_network %>%
  select(person, friendsWith) %>%
  mutate(friendsWith = str_split(friendsWith, ';')) %>%
  unnest(cols = friendsWith) %>%
  mutate(type = 'is friends with') %>%
  rename(source = person,
         target = friendsWith)
# Same pattern for the person-likes pairs.
person_likes <- social_network %>%
  select(person, likesWatching) %>%
  mutate(likesWatching = str_split(likesWatching, ';')) %>%
  unnest(cols = likesWatching) %>%
  mutate(type = 'likes to watch') %>%
  rename(source = person,
         target = likesWatching)
links <- rbind(person_likes, person_friends) %>%
  distinct()
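To see what the split-and-unnest step is doing, here is a rough base R equivalent on a two-row example. The two_rows data frame and the variable names are made up for illustration; this is a sketch of the idea, not a replacement for the tidyr pipeline.

```r
# Two observations, each with a semicolon-delimited list of friends.
two_rows <- data.frame(
  person      = c("Darnold", "Josh"),
  friendsWith = c("Lamar;Jackson;Steve", "Billy;Allen"),
  stringsAsFactors = FALSE
)

# Split every string into its pieces: one character vector per person.
pieces <- strsplit(two_rows$friendsWith, ";", fixed = TRUE)

# Repeat each person once per friend so the two columns line up row by row.
expanded <- data.frame(
  source = rep(two_rows$person, lengths(pieces)),
  target = unlist(pieces),
  type   = "is friends with",
  stringsAsFactors = FALSE
)

expanded  # five rows: three for Darnold, two for Josh
```

Each delimited string becomes as many rows as it has entries, which is exactly the shape a link list needs.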
Let’s take a look at links:
source | target | type |
---|---|---|
Josh | Table Tennis | likes to watch |
Ridley | Marcell | is friends with |
Josh | Todd | is friends with |
Miesha | Brandon | is friends with |
Rhonda | Tremaine | is friends with |
Miesha | Todd | is friends with |
Darnold | MMA | likes to watch |
Ridley | Monster Trucks | likes to watch |
Josh | Allen | is friends with |
Darnold | Lamar | is friends with |
We can naturally derive a node list from this link list by pulling all the unique values from the source and target columns. We will also give each node a type and a color. The type is an important attribute in larger network graphs because it lets us filter down to only the nodes we want to query and ignore the nodes we know we won’t use. This helps with performance.
people_nodes <- create_nodes(c(person_friends$source, person_friends$target) %>% unique,
                             type = 'Person',
                             color = 'skyblue')
tv_nodes <- create_nodes(person_likes$target %>% unique,
                         type = 'Television',
                         color = 'purple')
nodes <- rbind(people_nodes, tv_nodes)
Let’s take a look at the nodes we created:
name | type | color |
---|---|---|
Boxing | Television | purple |
Brandon | Person | skyblue |
Steve | Person | skyblue |
Table Tennis | Television | purple |
Jackson | Person | skyblue |
The Cooking Channel | Television | purple |
WWF | Television | purple |
Lamar | Person | skyblue |
Tremaine | Person | skyblue |
Miesha | Person | skyblue |
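If you want to see the idea without tinkR, a node list can be sketched in base R as below. make_nodes() is a hypothetical helper written for this example (I am not claiming it matches how create_nodes() is implemented), and demo_links is a made-up three-row link list.

```r
# A tiny link list: two friendships and one "likes to watch" link.
demo_links <- data.frame(
  source = c("Josh", "Josh", "Ridley"),
  target = c("Rhonda", "WWF", "Josh"),
  type   = c("is friends with", "likes to watch", "is friends with"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per unique name, with a shared type and color.
make_nodes <- function(names, type, color) {
  data.frame(name = unique(names), type = type, color = color,
             stringsAsFactors = FALSE)
}

is_friend <- demo_links$type == "is friends with"
demo_nodes <- rbind(
  make_nodes(c(demo_links$source[is_friend], demo_links$target[is_friend]),
             type = "Person", color = "skyblue"),
  make_nodes(demo_links$target[!is_friend],
             type = "Television", color = "purple")
)

# Sanity check: every link endpoint should have a row in the node list.
unmatched <- setdiff(c(demo_links$source, demo_links$target), demo_nodes$name)
length(unmatched)  # 0
```

The final check matters because igraph’s graph_from_data_frame() raises an error when a link refers to a name that is missing from the vertex data frame.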
We will use igraph’s graph_from_data_frame() to create the network, and then we will plot it with a wrapper around the plot() function that we will use a lot:
net <- igraph::graph_from_data_frame(d = links,
                                     directed = FALSE,
                                     vertices = nodes)
# Wrapper around plot() with the defaults used throughout this article.
# Setting the seed keeps the layout reproducible between runs.
create_net_plot <- function(net,
                            vertex.size = 4,
                            edge.arrow.size = .3,
                            vertex.label.cex = .65,
                            vertex.label.color = 'black',
                            vertex.label.dist = 1.1,
                            main = 'Social Network',
                            frame = TRUE,
                            layout = layout_nicely,
                            seed = 1234){
  set.seed(seed)
  plot(net,
       vertex.size = vertex.size,
       edge.arrow.size = edge.arrow.size,
       vertex.label.cex = vertex.label.cex,
       vertex.label.color = vertex.label.color,
       vertex.label.dist = vertex.label.dist,
       main = main,
       frame = frame,
       layout = layout)
}
create_net_plot(net)
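Before relying on a plot, it can help to confirm that the graph object contains what we expect. The following is a small sketch on a made-up four-node graph (demo_links and demo_nodes are illustrative, not the article’s data), using igraph’s vcount(), ecount(), degree(), and V().

```r
library(igraph)

# Made-up data: three people who like MMA, two of whom are friends.
demo_links <- data.frame(
  source = c("Darnold", "Rhonda", "Miesha", "Darnold"),
  target = c("MMA", "MMA", "MMA", "Miesha"),
  type   = c(rep("likes to watch", 3), "is friends with"),
  stringsAsFactors = FALSE
)
demo_nodes <- data.frame(
  name = c("Darnold", "Rhonda", "Miesha", "MMA"),
  type = c("Person", "Person", "Person", "Television"),
  stringsAsFactors = FALSE
)

# The first two columns of `d` are the link endpoints; the first column of
# `vertices` is the node name. Remaining columns become attributes.
demo_net <- graph_from_data_frame(d = demo_links,
                                  directed = FALSE,
                                  vertices = demo_nodes)

vcount(demo_net)                                  # number of nodes: 4
ecount(demo_net)                                  # number of links: 4
degree(demo_net)["MMA"]                           # links attached to MMA: 3
V(demo_net)$name[V(demo_net)$type == "Person"]    # filter nodes by type
```

Checks like these are cheap, and they catch a mismatch between the link list and the node list before it turns into a confusing plot.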
Looking at an entire network can be very unwieldy and difficult to navigate. In order to better understand a network, it is often best to filter the network down to only the nodes we are interested in. tinkR has a function called find_links() which finds all the links attached to a node or nodes of interest. Let’s find all the links attached to the MMA node:
likes_mma <- find_links(nodes = 'MMA', links = links)
We can repeat this process to find the links of the nodes connected to MMA:
connection_to_likes_mma <- find_links(nodes = likes_mma$source, links)
Now let’s consolidate the link list into one sublinks variable:
mma_sublinks <- rbind(likes_mma, connection_to_likes_mma) %>%
distinct()
This filtered link list does not have any information about the nodes involved. tinkR has a function called find_nodes() that pulls all the relevant node information from a master node list:
mma_subnodes <- find_nodes(nodes, mma_sublinks)
Now, let’s see the subnetwork:
mma_subnet <- igraph::graph_from_data_frame(mma_sublinks, directed = FALSE, vertices = mma_subnodes)
create_net_plot(mma_subnet)
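The filtering steps above boil down to two simple operations on data frames. Here is a base R sketch of the idea; links_touching() and nodes_in_links() are hypothetical stand-ins written for this example, and I am assuming (not asserting) that find_links() and find_nodes() behave along these lines.

```r
# Hypothetical stand-in: keep every link with a node of interest at either end.
links_touching <- function(node_names, links) {
  links[links$source %in% node_names | links$target %in% node_names, ,
        drop = FALSE]
}

# Hypothetical stand-in: keep only the nodes that appear in a link list.
nodes_in_links <- function(nodes, links) {
  nodes[nodes$name %in% c(links$source, links$target), , drop = FALSE]
}

# Made-up data for the demonstration.
demo_links <- data.frame(
  source = c("Darnold", "Rhonda", "Josh", "Josh"),
  target = c("MMA", "MMA", "Rhonda", "WWF"),
  stringsAsFactors = FALSE
)
demo_nodes <- data.frame(
  name = c("Darnold", "Rhonda", "Josh", "MMA", "WWF"),
  type = c("Person", "Person", "Person", "Television", "Television"),
  stringsAsFactors = FALSE
)

first_hop  <- links_touching("MMA", demo_links)             # who likes MMA
second_hop <- links_touching(first_hop$source, demo_links)  # their other links
sub_links  <- unique(rbind(first_hop, second_hop))          # drop repeated rows
sub_nodes  <- nodes_in_links(demo_nodes, sub_links)

sub_links        # Darnold-MMA, Rhonda-MMA, Josh-Rhonda
sub_nodes$name   # "Darnold" "Rhonda" "Josh" "MMA"
```

Josh’s link to WWF is left out because neither end of it touches MMA or someone who likes MMA.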
Two things that stand out in this subnetwork are the Rugby node and the Logan node: each one is connected to every person who likes MMA.
Let’s explore this further by looking at the connections these nodes have and the connections of those connections. The above pattern of finding links of links is a common one, so I consolidated it into a function called find_links_n(), which finds links of links n times. I’ve also created a wrapper function called create_network() that creates a subnetwork directly from a link list:
# The pipeline ends in a plot, so there is nothing useful to assign.
find_links_n('Logan', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()
Logan’s personal network is very similar to the MMA network. This strongly suggests that Logan would be a person who would like MMA, or at least fit in with the community of MMA lovers.
find_links_n('Rugby', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()
The Rugby network, however, is very different from the MMA network. While it seems that everyone who likes MMA likes Rugby, it does not seem that everyone who likes Rugby likes MMA.
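For readers curious about the links-of-links pattern itself, here is a base R sketch of repeating a link search n times. links_within_n() is a hypothetical helper, and the way it counts hops is my assumption for illustration; find_links_n() may define n differently.

```r
# Hypothetical helper: starting from `start`, collect links hop by hop.
links_within_n <- function(start, links, n) {
  frontier <- start                      # nodes to expand on this pass
  found <- links[0, , drop = FALSE]      # empty link list with the same columns
  for (i in seq_len(n)) {
    hit <- links[links$source %in% frontier | links$target %in% frontier, ,
                 drop = FALSE]
    found <- unique(rbind(found, hit))
    # Every node touched so far becomes a starting point for the next pass.
    frontier <- unique(c(hit$source, hit$target))
  }
  found
}

# Made-up chain: Logan - Darnold - MMA - Rhonda - Josh.
demo_links <- data.frame(
  source = c("Darnold", "Darnold", "Rhonda", "Josh"),
  target = c("Logan", "MMA", "MMA", "Rhonda"),
  stringsAsFactors = FALSE
)

nrow(links_within_n("Logan", demo_links, 1))  # 1: only Darnold-Logan
nrow(links_within_n("Logan", demo_links, 2))  # 2: adds Darnold-MMA
nrow(links_within_n("Logan", demo_links, 3))  # 3: adds Rhonda-MMA
```

Each extra pass reaches one step further out, which is why a small n is usually enough to reveal a node’s immediate community.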
Let’s explore how Rugby and MMA are different by using tinkR’s different_first_layer() function. This checks for differences between two indirectly connected nodes and only shows the nodes that are connected to the first node, but not the second:
different_first_layer(wantNodes = 'Rugby',
                      notNodes = 'MMA',
                      links = links) %>%
  create_network(nodes) %>%
  create_net_plot()
Josh and Ridley are two people who like Rugby but do not like MMA. Let’s check out their networks to see if either of them is indirectly connected to MMA:
find_links_n(c('Ridley', 'Josh'), links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()
It seems that both Ridley and Josh are indirectly connected to MMA via Rhonda. Let’s check out Rhonda’s network to confirm this:
find_links_n('Rhonda', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()
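The Rugby versus MMA comparison earlier in this section is, at its core, a set difference between two first layers. Here is a base R sketch of that idea using only the “likes to watch” links for Rugby and MMA from our dataset. neighbors_of() and only_first() are hypothetical helpers for this example, not tinkR functions.

```r
# The Rugby and MMA "likes to watch" links from the dataset above.
likes <- data.frame(
  source = c("Darnold", "Josh", "Ridley", "Rhonda", "Miesha",
             "Darnold", "Rhonda", "Miesha"),
  target = c(rep("Rugby", 5), rep("MMA", 3)),
  stringsAsFactors = FALSE
)

# All nodes directly linked to `node`, ignoring link direction.
neighbors_of <- function(node, links) {
  unique(c(links$target[links$source %in% node],
           links$source[links$target %in% node]))
}

# Nodes linked to `want` but not linked to `not`.
only_first <- function(want, not, links) {
  setdiff(neighbors_of(want, links), neighbors_of(not, links))
}

only_first("Rugby", "MMA", likes)  # "Josh" "Ridley"
only_first("MMA", "Rugby", likes)  # character(0)
```

The empty second result is the asymmetry we saw in the plots: everyone who likes MMA also likes Rugby, but not the other way around.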
As mentioned before, one of the most important features of a network is identifying potential connections that don’t exist yet. We just went through a process of filtering the graph down to nodes and links we think are important. However, looking at multiple network graphs can be exhausting. Sometimes we just want to know what the strongest “almost” connections are without having to navigate a plotted graph. We can do this with tinkR’s get_secondary_link_count() function:
get_secondary_link_count('MMA', links) %>%
  head()
## Joining, by = c("source", "target", "type")
node | connections |
---|---|
Rugby | 3 |
Logan | 3 |
Football | 2 |
Josh | 1 |
Boxing | 1 |
Kickball | 1 |
With regard to MMA, it is worth exploring possible connections to Rugby, Logan, and Football.
get_secondary_link_count('Rugby', links) %>%
  head()
## Joining, by = c("source", "target", "type")
node | connections |
---|---|
Tremaine | 3 |
MMA | 3 |
Football | 3 |
Logan | 3 |
Table Tennis | 2 |
Allen | 2 |
With regard to Rugby, it is worth exploring possible connections to Tremaine, MMA, Football, and Logan.
get_secondary_link_count('Logan', links) %>%
  head()
## Joining, by = c("source", "target", "type")
node | connections |
---|---|
MMA | 3 |
Rugby | 3 |
Football | 2 |
Josh | 1 |
Boxing | 1 |
Kickball | 1 |
With regard to Logan, it is worth exploring possible connections to MMA, Rugby, and Football.
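To make the counting idea concrete, here is a base R sketch that rebuilds the article’s link list and counts, for a starting node, how many of its direct neighbors each other node is linked to. The helper names (expand_links(), neighbors_of(), count_second_layer()) are made up for this example. Excluding the starting node and its direct neighbors from the result is my assumption; I have not checked how get_secondary_link_count() handles them.

```r
people <- c("Darnold", "Josh", "Ridley", "Rhonda", "Tremaine", "Miesha")
friends_with <- c(
  "Lamar;Jackson;Steve;Logan;Miesha",
  "Billy;Mayerfield;Allen;Todd;Rhonda",
  "Marcell;Ozuna;Brandon;Morrow;Josh",
  "Darnold;Ridley;Marcell;Allen;Tremaine;Logan",
  "Baker;Mayerfield;Ridley;Josh",
  "Logan;Todd;Brandon;Macell;Baker"
)
likes_watching <- c(
  "MMA;Boxing;Kickball;Football;Rugby",
  "Table Tennis;Football;WWF;Rugby",
  "Underwater Basket Weaving;Monster Trucks;Rugby",
  "My Little Pony;Football;MMA;Table Tennis;Rugby",
  "Underwater Basket Weaving",
  "Rugby;MMA;The Cooking Channel"
)

# Expand one delimited column into a two-column link list.
expand_links <- function(source, delimited) {
  pieces <- strsplit(delimited, ";", fixed = TRUE)
  data.frame(source = rep(source, lengths(pieces)),
             target = unlist(pieces),
             stringsAsFactors = FALSE)
}
all_links <- rbind(expand_links(people, friends_with),
                   expand_links(people, likes_watching))

# All nodes directly linked to `node`, ignoring link direction.
neighbors_of <- function(node, links) {
  unique(c(links$target[links$source %in% node],
           links$source[links$target %in% node]))
}

# For every node two steps away, count the first-layer nodes that reach it.
count_second_layer <- function(node, links) {
  first_layer <- neighbors_of(node, links)
  reached <- unlist(lapply(first_layer, neighbors_of, links = links))
  reached <- reached[!reached %in% c(node, first_layer)]
  sort(table(reached), decreasing = TRUE)
}

mma_counts <- count_second_layer("MMA", all_links)
mma_counts[c("Rugby", "Logan", "Football")]  # 3, 3, 2
```

On this data the sketch gives Rugby and Logan three connections each and Football two, in line with the MMA table above.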
So far we’ve learned how to explore networks by manipulating the link lists and node lists. However, we haven’t actually learned how to manipulate a true igraph network graph. igraph is optimized for graph searching, so we should learn how to do that. Also, we haven’t been storing any information in the nodes and links other than type and color. In a future article we will explore how we can store data in the attributes of nodes and links.
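As a small preview of what querying a true igraph object can look like, here is a sketch on a made-up graph (demo_links is illustrative only). It uses igraph’s neighbors() for direct connections and ego() for everything within two steps of a node.

```r
library(igraph)

# Made-up links: three MMA fans who all know Logan, plus Josh via Rhonda.
demo_links <- data.frame(
  source = c("Darnold", "Rhonda", "Miesha", "Darnold", "Rhonda", "Miesha", "Josh"),
  target = c("MMA", "MMA", "MMA", "Logan", "Logan", "Logan", "Rhonda"),
  stringsAsFactors = FALSE
)
demo_net <- graph_from_data_frame(demo_links, directed = FALSE)

# Direct connections of MMA.
first_layer <- neighbors(demo_net, "MMA")$name

# Everything within two steps of MMA (the result includes MMA itself).
within_two <- ego(demo_net, order = 2, nodes = "MMA")[[1]]$name

# Nodes exactly two steps away: candidates for a new connection.
setdiff(within_two, c("MMA", first_layer))
```

Here the two-step candidates are Logan and Josh, found without touching the link list by hand.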