Playing with Graphs: Part 2

Intro

One of the most valuable features of a network graph is the ability to see existing connections. An even more valuable feature is the ability to see possible connections that don’t exist yet. There is a lot of value in being able to create these logical connections. Facebook is valuable not only because you can see your own friend network, but also because it can suggest new friends to add to that network. Google is valuable not only because it can answer your questions, but also because it can suggest new ways of asking a question based on what people similar to you have asked. This article shows how we can explore a network by asking not only about existing connections, but also about potential ones.

For example, in the network below, we have persons A, B, C, D, and E connected by “is friends with” links.

Everyone except person E is connected directly to someone else. E does not have any friends in this network. Person A is connected to the most people, with three friends. Persons B, C, and D are all friends with person A, but not friends with each other. In theory, it should be easy to create a friendship link between any pairing of B, C, and D because of their indirect connection via person A. It would be difficult to connect person E with anyone because E has no connections at all.

Let’s add item 1 to the network. Item 1 is an object that persons A, B, C, and E like.

Now we can see that person E is indirectly connected with persons A, B, and C. We can say that there is a reasonable chance that E can form a friendship link within the network.

Something else has changed too: persons B and C now have two indirect connections (one via person A and one via item 1). This suggests that they are even more likely to form a successful friendship link.
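To make the idea of counting indirect connections concrete, here is a minimal base R sketch of the toy network described above. The names toy_links, neighbors_of(), and shared_neighbors() are made up for this illustration and are not part of tinkR or igraph.

```r
# Toy network from the example: friendships plus "likes" links to Item 1.
toy_links <- data.frame(
  source = c("A", "A", "A", "A", "B", "C", "E"),
  target = c("B", "C", "D", "Item 1", "Item 1", "Item 1", "Item 1"),
  stringsAsFactors = FALSE
)

# Neighbors of a node, treating every link as undirected.
neighbors_of <- function(node, links) {
  unique(c(links$target[links$source == node],
           links$source[links$target == node]))
}

# Number of indirect (two-step) connections between two nodes.
shared_neighbors <- function(a, b, links) {
  length(intersect(neighbors_of(a, links), neighbors_of(b, links)))
}

shared_neighbors("B", "C", toy_links)  # 2: via A and via Item 1
shared_neighbors("B", "D", toy_links)  # 1: via A only
shared_neighbors("E", "D", toy_links)  # 0: nothing in common
```

The more neighbors two unlinked nodes share, the stronger the case for suggesting a link between them.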

Facebook and LinkedIn are valuable because they have enough information about people that they can accurately suggest links between two people via their friendship network, their professional network, and their interests. These suggestions can be derived by querying network graphs.

In this article we will explore a simple social network of people, their friends, and the type of television they like watching. We are going to attempt to identify possible connections between nodes.

Setup

Let’s load some packages that we know we’ll need:

packages <- c('dplyr', 'stringr', 'tidyr', 'igraph')
sapply(packages, library, character.only = TRUE)

We’ll also work with some custom functions I made to traverse link lists and nodes:

devtools::install_github("beemyfriend/tinkR")

library(tinkR)

This will be the last time I use a made-up example for these articles on graphs. Below we’ll create a tibble where each observation includes a person, that person’s friends, and the type of television that person likes to watch. While this fake dataset is pretty ugly, it actually looks a lot like data I see in the wild. It’s not uncommon to see a dataset where one of the variables is itself a delimited list (here separated by semicolons):

social_network <- tibble(
  person = c(
    'Darnold',
    'Josh',
    'Ridley',
    'Rhonda',
    'Tremaine',
    'Miesha'
  ),
  friendsWith = c(
    "Lamar;Jackson;Steve;Logan;Miesha",
    "Billy;Mayerfield;Allen;Todd;Rhonda",
    'Marcell;Ozuna;Brandon;Morrow;Josh',
    "Darnold;Ridley;Marcell;Allen;Tremaine;Logan",
    "Baker;Mayerfield;Ridley;Josh",
    "Logan;Todd;Brandon;Macell;Baker"
  ),
  likesWatching = c(
    "MMA;Boxing;Kickball;Football;Rugby",
    "Table Tennis;Football;WWF;Rugby",
    "Underwater Basket Weaving;Monster Trucks;Rugby",
    "My Little Pony;Football;MMA;Table Tennis;Rugby",
    "Underwater Basket Weaving",
    "Rugby;MMA;The Cooking Channel"
  )
)

tidyr is awesome in cases like this because unnesting this data naturally creates linked pairings. Let’s create two link lists: one that holds person-friend links and another that holds person-likes links. We will then combine them into a single links variable to use for our network.

person_friends <- social_network %>%
  select(person, friendsWith) %>%
  mutate(friendsWith = str_split(friendsWith, ';')) %>%
  unnest(friendsWith) %>%
  mutate(type = 'is friends with') %>%
  rename(source = person,
         target = friendsWith)

person_likes <- social_network %>%
  select(person, likesWatching) %>%
  mutate(likesWatching = str_split(likesWatching, ';')) %>%
  unnest(likesWatching) %>%
  mutate(type = 'likes to watch') %>%
  rename(source = person,
         target = likesWatching)

links <- rbind(person_likes, person_friends) %>%
  distinct()

Let’s take a look at links:

source	target	type
Josh	Table Tennis	likes to watch
Ridley	Marcell	is friends with
Josh	Todd	is friends with
Miesha	Brandon	is friends with
Rhonda	Tremaine	is friends with
Miesha	Todd	is friends with
Darnold	MMA	likes to watch
Ridley	Monster Trucks	likes to watch
Josh	Allen	is friends with
Darnold	Lamar	is friends with
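As a side note, the split-then-unnest step used to build the link lists can also be written with tidyr's separate_rows(). This is only an alternative sketch on a two-row stand-in table (demo_network and demo_friends are made-up names), not the code used for the rest of the article.

```r
library(dplyr)
library(tidyr)

# A two-row stand-in with the same column layout as social_network.
demo_network <- tibble(
  person      = c("Darnold", "Josh"),
  friendsWith = c("Lamar;Logan", "Todd;Rhonda")
)

# separate_rows() splits the delimited column and creates one row per value.
demo_friends <- demo_network %>%
  separate_rows(friendsWith, sep = ";") %>%
  transmute(source = person,
            target = friendsWith,
            type   = "is friends with")

demo_friends
```

Either approach ends with one row per person-friend pair, which is the shape a link list needs.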

We can naturally derive a node list from this link list by pulling all the unique values from the source and target columns. We will also give each node a type and a color. The type is an important attribute in larger network graphs because we can filter the node list by type, keep only the nodes we want to query, and ignore the nodes we know we won’t use. This helps with performance.

people_nodes <- create_nodes(c(person_friends$source, person_friends$target) %>% unique, 
                             type = 'Person', 
                             color = 'skyblue')

tv_nodes <- create_nodes(person_likes$target %>% unique, 
                         type = "Televison", 
                         color = 'purple')

nodes <- rbind(people_nodes, tv_nodes)

Let’s take a look at the nodes we created:

name	type	color
Boxing	Televison	purple
Brandon	Person	skyblue
Steve	Person	skyblue
Table Tennis	Televison	purple
Jackson	Person	skyblue
The Cooking Channel	Televison	purple
WWF	Televison	purple
Lamar	Person	skyblue
Tremaine	Person	skyblue
Miesha	Person	skyblue
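If you would rather not depend on tinkR for this step, a node list can be put together with plain dplyr. The helper make_nodes() below is a hypothetical stand-in written for this sketch; I am assuming it mirrors what create_nodes() returns (one row per unique name, plus type and color), which may not match the real function in every detail. The type label is spelled the way it appears in the node list above.

```r
library(dplyr)

# Hypothetical stand-in for create_nodes(): one row per unique name.
make_nodes <- function(node_names, type, color) {
  tibble(name = unique(node_names), type = type, color = color)
}

demo_links <- tibble(
  source = c("Darnold", "Darnold", "Josh"),
  target = c("Logan", "MMA", "Rugby"),
  type   = c("is friends with", "likes to watch", "likes to watch")
)

friend_links <- filter(demo_links, type == "is friends with")
like_links   <- filter(demo_links, type == "likes to watch")

demo_nodes <- bind_rows(
  make_nodes(c(demo_links$source, friend_links$target), "Person", "skyblue"),
  make_nodes(like_links$target, "Televison", "purple")
)

# Filtering by type narrows a later query to the nodes that matter.
person_names <- demo_nodes %>%
  filter(type == "Person") %>%
  pull(name)
```

Here person_names holds only the people, so a query about friendships never has to touch the television nodes.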

We will use igraph’s graph_from_data_frame() to create the network, and then we will plot it with a wrapper around the plot() function that we will use a lot:

net <- igraph::graph_from_data_frame(d = links, 
                                     directed = FALSE, 
                                     vertices = nodes)

create_net_plot <- function(net, 
                            vertex.size = 4,
                            edge.arrow.size = .3,
                            vertex.label.cex = .65,
                            vertex.label.color ='black',
                            vertex.label.dist = 1.1,
                            main = 'Social Network',
                            frame = TRUE,
                            layout = layout_nicely,
                            seed = 1234){
  
  set.seed(seed)
  plot(net,
       vertex.size = vertex.size,
       edge.arrow.size = edge.arrow.size,
       vertex.label.cex = vertex.label.cex,
       vertex.label.color = vertex.label.color,
       vertex.label.dist = vertex.label.dist,
       main = main,
       frame = frame,
       layout =  layout)
}

create_net_plot(net)
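Before reading too much into a plot, it can help to check that the graph was built the way we expect. The sketch below uses a tiny made-up link list and node list (demo_links, demo_nodes) rather than the article's data, and relies only on igraph's vcount(), ecount(), and degree().

```r
library(igraph)

demo_links <- data.frame(
  source = c("Darnold", "Rhonda", "Darnold"),
  target = c("MMA", "MMA", "Logan"),
  type   = c("likes to watch", "likes to watch", "is friends with")
)
demo_nodes <- data.frame(
  name = c("Darnold", "Rhonda", "Logan", "MMA", "Todd"),
  type = c("Person", "Person", "Person", "Televison", "Person")
)

demo_net <- graph_from_data_frame(d = demo_links,
                                  directed = FALSE,
                                  vertices = demo_nodes)

vcount(demo_net)             # number of nodes, including unlinked ones such as Todd
ecount(demo_net)             # number of links
degree(demo_net)["Darnold"]  # how many links touch Darnold
```

Because the vertices argument is supplied, nodes that never appear in the link list (Todd here) are still part of the graph, just with degree zero.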

Querying

Looking at an entire network can be very unwieldy and difficult to navigate. In order to better understand a network, it is often best to filter the network down to only the nodes we are interested in. tinkR has a function called find_links() which finds all the links attached to a node or nodes of interest. Let’s find all the links attached to the MMA node:

likes_mma <- find_links(nodes = 'MMA', links = links)

We can repeat this process to find the links of the nodes connected to MMA:

connection_to_likes_mma <- find_links(nodes = likes_mma$source, links)

Now let’s consolidate the link list into one sublinks variable:

mma_sublinks <- rbind(likes_mma, connection_to_likes_mma) %>%
  distinct()

This filtered link list does not have any information about the nodes involved. tinkR has a function called find_nodes() that pulls all the relevant node information from a master node list:

mma_subnodes <- find_nodes(nodes, mma_sublinks)
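For readers who want to see what this filtering amounts to, here is a rough dplyr sketch. The helpers links_touching() and nodes_in_links() are names I made up for this example; they reflect my assumption about what find_links() and find_nodes() do, and are not tinkR's actual source.

```r
library(dplyr)

demo_links <- tibble(
  source = c("Darnold", "Rhonda", "Darnold", "Josh"),
  target = c("MMA", "MMA", "Logan", "Todd"),
  type   = c("likes to watch", "likes to watch",
             "is friends with", "is friends with")
)
demo_nodes <- tibble(
  name = c("Darnold", "Rhonda", "Josh", "Logan", "Todd", "MMA"),
  type = c(rep("Person", 5), "Televison")
)

# Assumed behavior of find_links(): keep links that touch any node of interest.
links_touching <- function(node_names, links) {
  filter(links, source %in% node_names | target %in% node_names)
}

# Assumed behavior of find_nodes(): keep node rows that appear in the links.
nodes_in_links <- function(nodes, links) {
  filter(nodes, name %in% c(links$source, links$target))
}

first_hop  <- links_touching("MMA", demo_links)
second_hop <- links_touching(first_hop$source, demo_links)
sub_links  <- distinct(bind_rows(first_hop, second_hop))
sub_nodes  <- nodes_in_links(demo_nodes, sub_links)
```

In this small example the Josh-Todd friendship is dropped, because neither person is within reach of MMA.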

Now, let’s see the subnetwork:

mma_subnet <- igraph::graph_from_data_frame(d = mma_sublinks, directed = FALSE, vertices = mma_subnodes)
create_net_plot(mma_subnet)

Two things that stand out in this subnetwork are:

Rugby is connected to all of the same people as MMA, which could lead us to believe that if someone likes Rugby then he or she will probably like MMA.
Logan is connected to all of the same people as MMA, so there is a strong chance that Logan might also like watching MMA.

Let’s explore this further by looking at the connections these nodes have and the connections of those connections. Finding links of links, as we did above, is a common pattern. I consolidated that pattern into a function called find_links_n(), which finds links of links n times. I’ve also created a wrapper function called create_network() that creates a subnetwork directly from a link list:

find_links_n('Logan', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()

Logan’s personal network is very similar to the MMA network. This strongly suggests that Logan would be a person who would like MMA, or at least fit in with the community of MMA lovers.
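The "links of links n times" idea can be sketched as a small loop. The function expand_links() below is a made-up name, and it encodes my assumption about how find_links_n() behaves: each pass keeps every link that touches a node reached so far. The real function may differ.

```r
library(dplyr)

# Assumed behavior of find_links_n(): repeat the "links of links" step n times.
expand_links <- function(start, links, n) {
  reached <- start
  found <- links[0, ]
  for (i in seq_len(n)) {
    # Keep every link that touches a node we have reached so far.
    found <- filter(links, source %in% reached | target %in% reached)
    reached <- unique(c(reached, found$source, found$target))
  }
  found
}

# A small chain: Logan - Darnold - MMA - Rhonda - Josh
chain_links <- tibble(
  source = c("Darnold", "Darnold", "Rhonda", "Josh"),
  target = c("Logan", "MMA", "MMA", "Rhonda")
)

expand_links("Logan", chain_links, 1)  # only the Darnold-Logan link
expand_links("Logan", chain_links, 2)  # adds Darnold-MMA
expand_links("Logan", chain_links, 3)  # adds Rhonda-MMA
```

Each extra pass pushes the boundary of the subnetwork one step further from the starting node.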

find_links_n('Rugby', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()

The Rugby network, however, is very different from the MMA network. While it seems that everyone who likes MMA likes Rugby, it does not seem that everyone who likes Rugby likes MMA.

Let’s explore how Rugby and MMA are different by using tinkR’s different_first_layer() function. This checks for differences between two indirectly connected nodes and only shows the nodes that are connected to the first node, but not the second:

different_first_layer(wantNodes = 'Rugby', 
                      notNodes = 'MMA', 
                      links = links) %>%
  create_network(nodes) %>%
  create_net_plot()
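Underneath, this kind of comparison is a set difference on neighbors. The sketch below uses made-up helper names (neighbors_of(), links_only_first()) and a small stand-in link list; it is my guess at the logic behind different_first_layer(), not its actual implementation.

```r
library(dplyr)

demo_links <- tibble(
  source = c("Darnold", "Darnold", "Josh", "Ridley", "Rhonda", "Rhonda"),
  target = c("MMA", "Rugby", "Rugby", "Rugby", "MMA", "Rugby"),
  type   = "likes to watch"
)

# Neighbors of one or more nodes, treating links as undirected.
neighbors_of <- function(node, links) {
  unique(c(links$target[links$source %in% node],
           links$source[links$target %in% node]))
}

# Links touching `want` whose other end is not also linked to `not_want`.
links_only_first <- function(want, not_want, links) {
  only_first <- setdiff(neighbors_of(want, links),
                        neighbors_of(not_want, links))
  filter(links,
         (source %in% want & target %in% only_first) |
           (target %in% want & source %in% only_first))
}

rugby_not_mma <- links_only_first("Rugby", "MMA", demo_links)
rugby_not_mma
```

On this stand-in data the result keeps only the Josh-Rugby and Ridley-Rugby links, since Darnold and Rhonda are linked to both nodes.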

Josh and Ridley are two people who like Rugby, but do not like MMA. Let’s check out their networks to see if either of them is indirectly connected to MMA:

find_links_n(c('Ridley', 'Josh'), links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()

It seems that both Ridley and Josh are indirectly connected to MMA via Rhonda. Let’s check out Rhonda’s network to confirm this:

find_links_n('Rhonda', links, 2) %>%
  create_network(nodes) %>%
  create_net_plot()

Quantifying missing connections

As mentioned before, one of the most important features of a network is identifying potential connections that don’t exist yet. We just went through a process of filtering the graph down to nodes and links we think are important. However, looking at multiple network graphs can be exhausting. Sometimes we just want to know what the strongest “almost” connections are without having to navigate a plotted graph. We can do this with tinkR’s get_secondary_link_count() function:

get_secondary_link_count('MMA', links) %>% 
  head()

## Joining, by = c("source", "target", "type")

node	connections
Rugby	3
Logan	3
Football	2
Josh	1
Boxing	1
Kickball	1

With regard to MMA, it is worth exploring possible connections to Rugby, Logan, and Football.
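The counts above are consistent with counting shared neighbors: for each node that is not directly linked to MMA, how many of MMA's neighbors is it linked to? Here is a hedged dplyr sketch of that reading on a small stand-in link list. The function count_shared_neighbors() is a name invented for this example and is not how tinkR necessarily computes its result.

```r
library(dplyr)

demo_links <- tibble(
  source = c("Darnold", "Rhonda", "Darnold", "Rhonda", "Josh",
             "Darnold", "Rhonda", "Josh"),
  target = c("MMA", "MMA", "Rugby", "Rugby", "Rugby",
             "Logan", "Logan", "Rhonda")
)

# For every node not directly linked to `focus`, count the neighbors it shares.
count_shared_neighbors <- function(focus, links) {
  # List every link in both directions so neighbors are easy to look up.
  both_ways <- bind_rows(
    transmute(links, from = source, to = target),
    transmute(links, from = target, to = source)
  ) %>%
    distinct()

  direct <- both_ways$to[both_ways$from == focus]

  both_ways %>%
    filter(from %in% direct, to != focus, !(to %in% direct)) %>%
    count(to, name = "connections", sort = TRUE) %>%
    rename(node = to)
}

mma_candidates <- count_shared_neighbors("MMA", demo_links)
mma_candidates
```

In this stand-in data Rugby and Logan each share two neighbors with MMA, while Josh shares one, so Rugby and Logan would be the first candidates to look at.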

get_secondary_link_count('Rugby', links) %>% 
  head()

## Joining, by = c("source", "target", "type")

node	connections
Tremaine	3
MMA	3
Football	3
Logan	3
Table Tennis	2
Allen	2

With regard to Rugby, it is worth exploring possible connections to Tremaine, MMA, Football, and Logan.

get_secondary_link_count('Logan', links) %>% 
  head()

## Joining, by = c("source", "target", "type")

node	connections
MMA	3
Rugby	3
Football	2
Josh	1
Boxing	1
Kickball	1

With regard to Logan, it is worth exploring possible connections to MMA, Rugby, and Football.

Moving Forward

So far we’ve learned how to explore networks by manipulating link lists and node lists. However, we haven’t actually learned how to manipulate a true igraph network graph. igraph is optimized for graph searching, so we should learn how to use it directly. Also, we haven’t been storing any information in the nodes and links other than type and color. In a future article we will explore how we can store data in the attributes of nodes and links.