Ok, so we can now access information inside nodes and edges, as well as create subgraphs. What else can we do with these objects?

Exploratory functions

Other than querying the nodes and edges directly, igraph provides a number of functions to explore graphs. igraph is a very large package that offers a lot of functionality, but here we will focus on

Functions that provide information specific to each component
Functions that provide information on how each component relates to the overall graph

Setup

library(igraph)
library(igraphdata)

data("UKfaculty")

UKfaculty

## IGRAPH 6f42903 D-W- 81 817 -- 
## + attr: Type (g/c), Date (g/c), Citation (g/c), Author (g/c),
## | Group (v/n), weight (e/n)
## + edges from 6f42903:
##  [1] 57->52 76->42 12->69 43->34 28->47 58->51  7->29 40->71  5->37 48->55
## [11]  6->58 21-> 8 28->69 43->21 67->58 65->42  5->67 52->75 37->64  4->36
## [21] 12->49 19->46 37-> 9 74->36 62-> 1 15-> 2 72->49 46->62  2->29 40->12
## [31] 22->29 71->69  4-> 3 37->69  5-> 6 77->13 23->49 52->35 20->14 62->70
## [41] 34->35 76->72  7->42 37->42 51->80 38->45 62->64 36->53 62->77 17->61
## [51]  7->68 46->29 44->53 18->58 12->16 72->42 52->32 58->21 38->17 15->51
## [61] 22-> 7 22->69  5->13 29-> 2 77->12 37->35 18->46 10->71 22->47 20->19
## + ... omitted several edges

If we examine the graph, we know that this is both a directed and weighted graph with the vertex attribute of Group and the edge attribute of weight. Let’s take a look at these attributes.

UKfaculty %>%
  V() %>%
  .$Group %>%
  summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.877   3.000   4.000

UKfaculty %>%
  E() %>%
  .$weight %>%
  summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   4.565   6.000  16.000

UKfaculty %>%
  V() %>%
  .$Group %>%
  unique() %>%
  sort()

## [1] 1 2 3 4

UKfaculty %>%
  E() %>%
  .$weight %>%
  unique() %>% 
  sort()

##  [1]  1  2  3  4  5  6  7  8 10 12 14 16

finally, let’s make a quick visualization of this graph

set.seed(4321)
plot(UKfaculty,
     vertex.label = V(UKfaculty)$Group,
     edge.curved = T,
     edge.arrow.size = .3)

We will be filtering this graph down into a subgraph to show the power of some of these functions. In order to keep things consistent between the graph and the subgraph, we will need to name the vertices.

V(UKfaculty)$name <- paste('person', V(UKfaculty), sep = '_')

Functions that provide information specific to the components

Functions that provide information specific to edges

Sometimes we know the edges we want, but don’t know the nodes they are connected to. This often happens when we query the edges for a specific attribute. We can get get this information.

Let’s look at the relationships that have strong weights. Let’s look for the number of quality connections. The above analysis showed the weight’s 3rd quartile is 6. So we will use this as the criteria to define a relationship as strong.

For directed edges, we can find the node that is at the head (what the edge is pointing to) by using the head_of() function. For undirected graphs, the head is simply the second node mentioned in the edge description.

topLiked <- UKfaculty %>%
  E() %>%
  .[weight >= 6] %>%
  head_of(UKfaculty, .)

topLiked %>% sort

## + 265/81 vertices, named, from 6f42903:
##   [1] person_1  person_2  person_2  person_2  person_2  person_2  person_2 
##   [8] person_4  person_4  person_4  person_5  person_5  person_5  person_5 
##  [15] person_6  person_6  person_7  person_7  person_7  person_7  person_7 
##  [22] person_7  person_8  person_8  person_9  person_10 person_10 person_10
##  [29] person_10 person_10 person_10 person_10 person_10 person_12 person_12
##  [36] person_12 person_12 person_12 person_13 person_13 person_13 person_13
##  [43] person_13 person_13 person_14 person_14 person_14 person_14 person_16
##  [50] person_16 person_16 person_16 person_17 person_18 person_18 person_18
##  [57] person_18 person_19 person_19 person_19 person_19 person_20 person_20
##  [64] person_20 person_20 person_20 person_21 person_21 person_21 person_21
## + ... omitted several vertices

hist(topLiked, 
     vcount(UKfaculty),
     col = 'lightblue')

topLiked$name %>%
  table %>%
  sort(T) %>%
  head()

## .
## person_31 person_77 person_21 person_29 person_10 person_27 
##        12        11         9         9         8         8

Alternatively, we can get the nodes at the tail of a directed edge (where the edge is coming from) by using the tail_of() function. For undirected graphs, the tail is simply the first node mentioned in the edge description.

topLikers <- UKfaculty %>%
  E() %>%
  .[weight >= 6] %>%
  tail_of(UKfaculty, .)

hist(topLikers, 
     vcount(UKfaculty), 
     col = 'lightblue')

topLikers$name %>%
  table() %>%
  sort(T) %>%
  head()

## .
## person_29 person_37  person_7 person_21 person_13 person_46 
##        19        11         9         8         7         7

You can also get a matrix of vertex ids. The first column represents the tail of the edges and the second column represents the head of the edges.

UKfaculty %>%
  E() %>%
  .[weight >= 6] %>%
  ends(UKfaculty, .) %>%
  head()

##      [,1]        [,2]       
## [1,] "person_76" "person_42"
## [2,] "person_28" "person_47"
## [3,] "person_7"  "person_29"
## [4,] "person_5"  "person_67"
## [5,] "person_12" "person_49"
## [6,] "person_19" "person_46"

Let’s go ahead and identify who the most liked person is and who likes the most people. Remember, we only care about edges that have a weight of greater than or equal to 6.The results would be different if we included all edges.

mostLikedPerson <- topLiked$name %>%
  table() %>%
  sort(T) %>%
  .[1] %>%
  names()

mostLikedPerson

## [1] "person_31"

likesTheMost <- topLikers$name %>%
  table() %>%
  sort(T) %>%
  .[1] %>%
  names() 

likesTheMost

## [1] "person_29"

Functions that provide information specific to nodes

ego()

The ego() function provides all the nodes connected to a node or nodes of interest. The connected nodes can either be directly connected or connected within a certain order of edges. Let’s find all the people that liked the mostLikedPerson and let’s also find all the people the person who likesTheMost likes.

mlp_likedBy <- ego(UKfaculty, 1, mostLikedPerson, 'in')
mlp_likedBy

## [[1]]
## + 20/81 vertices, named, from 6f42903:
##  [1] person_31 person_2  person_8  person_15 person_19 person_20 person_21
##  [8] person_26 person_29 person_34 person_35 person_37 person_39 person_41
## [15] person_43 person_46 person_51 person_57 person_62 person_79

ltm_likes <- ego(UKfaculty, 1, likesTheMost, 'out')
ltm_likes

## [[1]]
## + 42/81 vertices, named, from 6f42903:
##  [1] person_29 person_2  person_4  person_6  person_7  person_8  person_14
##  [8] person_15 person_18 person_19 person_20 person_21 person_22 person_23
## [15] person_24 person_26 person_27 person_31 person_32 person_34 person_35
## [22] person_37 person_39 person_41 person_43 person_46 person_50 person_51
## [29] person_52 person_54 person_55 person_57 person_58 person_62 person_64
## [36] person_66 person_69 person_70 person_71 person_77 person_79 person_80

You can enter any number of vertices in the ego() function and the order of the list output will correspond to the order the vertices were entered. Here we will look for all connections (both incomming and outgoing) for each node.

ego(UKfaculty, 1, c(mostLikedPerson, likesTheMost), 'all')

## [[1]]
## + 22/81 vertices, named, from 6f42903:
##  [1] person_31 person_2  person_8  person_15 person_19 person_20 person_21
##  [8] person_25 person_26 person_29 person_32 person_34 person_35 person_37
## [15] person_39 person_41 person_43 person_46 person_51 person_57 person_62
## [22] person_79
## 
## [[2]]
## + 42/81 vertices, named, from 6f42903:
##  [1] person_29 person_2  person_4  person_6  person_7  person_8  person_14
##  [8] person_15 person_18 person_19 person_20 person_21 person_22 person_23
## [15] person_24 person_26 person_27 person_31 person_32 person_34 person_35
## [22] person_37 person_39 person_41 person_43 person_46 person_50 person_51
## [29] person_52 person_54 person_55 person_57 person_58 person_62 person_64
## [36] person_66 person_69 person_70 person_71 person_77 person_79 person_80

let’s combine these two people together and see their subnetwork

mlp_ltm <- c(mlp_likedBy[[1]], ltm_likes[[1]]) %>%
  induced_subgraph(UKfaculty, .)
mlp_ltm

## IGRAPH b9217ff DNW- 42 410 -- 
## + attr: Type (g/c), Date (g/c), Citation (g/c), Author (g/c),
## | Group (v/n), name (v/c), weight (e/n)
## + edges from b9217ff (vertex names):
##  [1] person_57->person_52 person_43->person_34 person_58->person_51
##  [4] person_7 ->person_29 person_6 ->person_58 person_21->person_8 
##  [7] person_43->person_21 person_37->person_64 person_19->person_46
## [10] person_15->person_2  person_46->person_62 person_2 ->person_29
## [13] person_22->person_29 person_71->person_69 person_37->person_69
## [16] person_52->person_35 person_20->person_14 person_62->person_70
## [19] person_34->person_35 person_51->person_80 person_62->person_64
## + ... omitted several edges

set.seed(1234)
plot(mlp_ltm,
     vertex.color = sapply(V(mlp_ltm)$name, function(x){
       if(x %in% c(mostLikedPerson, likesTheMost)){
         'lightblue'
       } else {
         'lightgrey'
       }
     }),
     edge.curved = T,
     edge.arrow.size = .4,
     vertex.label = V(mlp_ltm)$Group)

induced.subgraphs include all edges between the included nodes, but we’re just interested in the edges of interest. Particularly, we only want to see incoming edges for the mostLikedPerson and we only want to see outgoing edges for the person who likesTheMost. Let’s add a filtering step.

mlp_ltm <- mlp_ltm %>%
  E() %>%
  .[likesTheMost %->% V(mlp_ltm) | mostLikedPerson %<-% V(mlp_ltm)] %>%
  .[weight >= 6] %>%
  subgraph.edges(mlp_ltm, .) 

mlp_ltm

## IGRAPH 8929de7 DNW- 24 30 -- 
## + attr: Type (g/c), Date (g/c), Citation (g/c), Author (g/c),
## | Group (v/n), name (v/c), weight (e/n)
## + edges from 8929de7 (vertex names):
##  [1] person_29->person_2  person_19->person_31 person_29->person_69
##  [4] person_29->person_35 person_29->person_54 person_21->person_31
##  [7] person_29->person_18 person_29->person_7  person_29->person_21
## [10] person_29->person_51 person_8 ->person_31 person_43->person_31
## [13] person_39->person_31 person_29->person_43 person_29->person_27
## [16] person_41->person_31 person_29->person_34 person_29->person_79
## [19] person_29->person_46 person_79->person_31 person_29->person_31
## + ... omitted several edges

set.seed(1234)
plot(mlp_ltm,
     vertex.color = sapply(V(mlp_ltm)$name, function(x){
       if(x %in% c(mostLikedPerson, likesTheMost)){
         'lightblue'
       } else {
         'lightgrey'
       }
     }),
     edge.curved = T,
     edge.arrow.size = .4,
     vertex.label = V(mlp_ltm)$Group)

degree()

A degree of a node is the simply the raw number of edges incident to it. You can generate this number by using the degree() function and can distinguish what direction of edge you want to count.

degree(UKfaculty, mode = 'out') %>%
  sort(T) %>%
  head()

## person_29 person_37 person_62  person_5 person_52 person_43 
##        41        36        34        28        27        23

degree(UKfaculty, mode = 'in') %>%
  sort(T) %>%
  head()

## person_69 person_77 person_54 person_21 person_29  person_2 
##        24        24        23        22        21        19

degree(UKfaculty, c(likesTheMost, mostLikedPerson)) %>%
  sort(T)

## person_29 person_31 
##        62        35

strength() - Weighted Degree

Rather than getting the raw count of incident edges, we can also add the edge attributes stored in them. This is done by using the strength() function. This is valuable, because as we seen earlier, the raw number of edges doesn’t necessarily show the strength of those relationships. Remember that the node with the most incoming edges doesn’t match the person we identified as the person with the most quality connections.

strength(UKfaculty, mode = 'out', weights = E(UKfaculty)$weight) %>%
  sort(T) %>%
  head()

## person_29 person_37 person_31 person_21  person_5 person_62 
##       243       131       112       100        96        96

strength(UKfaculty, mode = 'in', weights = E(UKfaculty)$weight) %>%
  sort(T) %>%
  head()

## person_29 person_31 person_21 person_77 person_69 person_10 
##       136       133       119       119       105       100

strength(UKfaculty, c('person_31', 'person_29', 'person_69'),  weights = E(UKfaculty)$weight)

## person_31 person_29 person_69 
##       245       379       168

Using Strength and Degree together

We can get the average weight of edges incident to a vertex by combining the information strength() and degree() functions together.

strength(UKfaculty, mode = 'out') %>%
  { ./degree(UKfaculty, mode = 'out')} %>%
  sort(T) %>% 
  head

## person_18 person_14 person_32 person_76 person_23  person_8 
## 14.666667 13.200000 12.000000  9.500000  8.857143  8.000000

strength(UKfaculty, mode = 'in') %>%
  { ./degree(UKfaculty, mode = 'in')} %>%
  sort(T) %>% 
  head

## person_20 person_26 person_56 person_75 person_31 person_73 
##  9.428571  9.428571  9.333333  8.142857  7.000000  7.000000

Functions that provide information on how each component relates to the overall graph

Walks

shortest_paths

The shortest path between two vertices is the path that uses the least edges. This can be caluculated by using the shortest_paths() function

sp <- shortest_paths(UKfaculty, 
               likesTheMost, 
               mostLikedPerson, 
               'out', 
               output = 'both')
sp

## $vpath
## $vpath[[1]]
## + 3/81 vertices, named, from 6f42903:
## [1] person_29 person_20 person_31
## 
## 
## $epath
## $epath[[1]]
## + 2/817 edges from 6f42903 (vertex names):
## [1] person_29->person_20 person_20->person_31
## 
## 
## $predecessors
## NULL
## 
## $inbound_edges
## NULL

UKfaculty %>%
  E() %>%
  .[[. %in% sp$epath[[1]]]]

## + 2/817 edges from 6f42903 (vertex names):
##          tail      head tid hid weight
## 370 person_29 person_20  29  20      1
## 816 person_20 person_31  20  31      1

It’s important to note, however, that shortest paths use the weight attribute of edges. It also interpret the weights as euclidian distances - that is, paths with less accumulated weight are prioritized over paths with more accumulated weight. Sometimes, we want to ignore the weight attribute because it doesn’t necessarily make sense in the context of our data.

shortest_paths(UKfaculty,
               likesTheMost,
               mostLikedPerson,
               'out',
               output = 'both',
               weights = NA) %>%
  .$epath %>%
  .[[1]] %>%
  {E(UKfaculty)[[E(UKfaculty) %in% .]]}

## + 1/817 edge from 6f42903 (vertex names):
##          tail      head tid hid weight
## 604 person_29 person_31  29  31     16

distances

If you don’t care so much about the actual path and only want to know know the magnitude of the walks, then you can use the distance() function. Again, weight matters.

distances(UKfaculty, c(likesTheMost , mostLikedPerson), c(likesTheMost, mostLikedPerson), 'out')

##           person_29 person_31
## person_29         0         2
## person_31         4         0

distances(UKfaculty, c(likesTheMost , mostLikedPerson), c(likesTheMost, mostLikedPerson), 'in')

##           person_29 person_31
## person_29         0         4
## person_31         2         0

distances(UKfaculty, c(likesTheMost , mostLikedPerson), c(likesTheMost, mostLikedPerson), 'all', weights = NA)

##           person_29 person_31
## person_29         0         1
## person_31         1         0

Random Walks

From the given start vertex, take the given number of steps, choosing an edge from the actual vertex uniformly randomly. If the graph is directed, then the random walk only uses edges that traverse outwards. Random walks are fundamental in both community detection and centrality measurements.

set.seed(1234)
random_walk(UKfaculty, likesTheMost, 3)

## + 3/81 vertices, named, from 6f42903:
## [1] person_29 person_8  person_41

sapply(1:100, function(i){
  random_walk(UKfaculty, likesTheMost, 3)[[2]]
}) %>% 
  table() %>%
  sort(T)

## .
## 23 14 15 32  7 22 24 80  8 18 19 20 27 37 46 52 55 66 71  2 21 26 34 35 57 
##  7  6  5  5  4  4  4  4  3  3  3  3  3  3  3  3  3  3  3  2  2  2  2  2  2 
## 69 77 79  4  6 39 41 50 51 54 58 62 64 
##  2  2  2  1  1  1  1  1  1  1  1  1  1

Centrality Scores

betweenness()

The betweenness value is roughly defined as the number of shortest paths going through a vertex or an edge. A betweenness value of 0 means that no shortest paths go through that vertex - that vertex is not essential for connecting any two vertices. Weights are interpreted as distances and a path with a lower weight will be chosen over a path with a higher weight. The betweenness value of a vertex can be calculated with the betweenness() function.

betweenness(UKfaculty, likesTheMost)

## person_29 
##  684.1951

V(UKfaculty)$betweenness <- betweenness(UKfaculty)
V(UKfaculty)[[betweenness == max(betweenness)]]

## + 1/81 vertex, named, from 6f42903:
##    Group      name betweenness
## 37     1 person_37    1223.084

V(UKfaculty)$unweightedBetweenness <- betweenness(UKfaculty, weights = NA)
V(UKfaculty)[[unweightedBetweenness == max(unweightedBetweenness)]]

## + 1/81 vertex, named, from 6f42903:
##    Group      name betweenness unweightedBetweenness
## 62     3 person_62     965.596              1098.767

Betweenness can also be calculated for edges by using the edge_betweeness() function:

E(UKfaculty)$betweenness <- edge_betweenness(UKfaculty)
E(UKfaculty)[[betweenness == max(betweenness)]]

## + 1/817 edge from 6f42903 (vertex names):
##          tail      head tid hid weight betweenness
## 520 person_17 person_62  17  62      1    561.5816

E(UKfaculty)$unweightedBetweenness <- edge_betweenness(UKfaculty, weights = NA)
E(UKfaculty)[[unweightedBetweenness == max(unweightedBetweenness)]]

## + 1/817 edge from 6f42903 (vertex names):
##          tail      head tid hid weight betweenness unweightedBetweenness
## 520 person_17 person_62  17  62      1    561.5816                342.03

closeness()

Closeness measures how many steps is required to access every other vertex from a given vertex.

closeness(UKfaculty, mode = 'out') %>% 
  sort(T) %>% 
  head

##   person_62   person_29   person_37    person_5   person_52   person_34 
## 0.004651163 0.004347826 0.004255319 0.004219409 0.004016064 0.003831418

closeness(UKfaculty, mode = 'in') %>%
  sort(T) %>%
  head

##   person_11   person_18   person_10   person_54   person_58    person_7 
## 0.002283105 0.002277904 0.002262443 0.002262443 0.002257336 0.002247191

eigen_centrality()

In general, vertices with high eigenvector centralities are those which are connected to many other vertices which are, in turn, connected to many others (and so on).

eigen_centrality(UKfaculty, directed = F)$vector %>%
  sort(T) %>%
  head()

## person_29 person_31 person_21 person_79 person_35 person_19 
## 1.0000000 0.8221039 0.7400463 0.5895395 0.5872218 0.5145590

authority_score() & hub_score()

Keinberg’s Hyperlink-Induced Topic Search (HITS) - also known as Hubs and Authorities - stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs. These can be calculated using the authority_score() and hub_score() functions.

authority_score(UKfaculty)$vector %>% 
  sort(T) %>%
  head()

## person_31 person_21 person_29 person_35 person_79  person_2 
## 1.0000000 0.9086895 0.8235752 0.7727795 0.7607748 0.6349021

page_rank(UKfaculty)$vector %>%
  sort(T) %>%
  head()

##  person_77  person_31  person_10  person_75  person_69  person_50 
## 0.03050407 0.02968359 0.02740006 0.02611524 0.02604082 0.02549718

page_rank()

PageRank is a centrality metric created by Google.PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. For undirected graphs, PageRank treats each edge as a bidirectional edge. PageRank can be calculated by using page_rank()

page_rank(UKfaculty, directed = F)$vector %>%
  sort(T) %>%
  head()

##  person_29  person_77  person_31  person_10  person_21  person_62 
## 0.02464145 0.01861998 0.01844505 0.01760688 0.01737333 0.01681448

page_rank(UKfaculty, weights = NA, directed = F)$vector %>%
  sort(T) %>%
  head()

##  person_29  person_37  person_62  person_77  person_52   person_5 
## 0.03273052 0.02919849 0.02530033 0.02346460 0.02169196 0.02143179

Part 3: Exploratory Functions