If you check out the CRAN website you will find a page of curated task views. These are lists of packages important to particular topics. A topic could be NaturalLanguageProcessing, Finance, or even Graphics. I mined these task views and CRAN for these topic-package connections. I’m very grateful to all the task view maintainers who organized these well-thought-out lists.
This article describes how I created and explored a CRAN Task View Network. You can find the data here
packages <- c('tidyverse', 'rvest', 'stringr', 'igraph')
# attach each package by name; invisible() hides the list that lapply() returns
invisible(lapply(packages, library, character.only = TRUE))
tv_d <- read_csv('topicPackageLinkTitleAuthorDesc.csv')
Let’s take a look at the data:
tv_d %>%
  head(10)
## # A tibble: 10 x 6
## package topic
## <chr> <chr>
## 1 arm Bayesian
## 2 arm SocialSciences
## 3 BACCO Bayesian
## 4 bayesm Bayesian
## 5 bayesm Cluster
## 6 bayesm Distributions
## 7 bayesm Econometrics
## 8 bayesm HighPerformanceComputing
## 9 bayesm MedicalImaging
## 10 bayesm Multivariate
## # ... with 4 more variables: pTitle <chr>, pDescription <chr>,
## # pAuthor <chr>, link <chr>
If you explore the data a bit, you might find that some packages and topics share the same name. This will cause problems if we want to represent each package and topic with its own node. Let’s go ahead and add some identification information to the package names:
tv_d <- tv_d %>%
  mutate(package = str_c(package, ' (package)'))
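To see why the suffix matters, here is a minimal sketch on made-up data. The `toy` tibble and the name `Foo` are hypothetical and not taken from the real task view data; the check itself is just an `intersect()` of the two columns.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: 'Foo' is both a package name and a topic name.
toy <- tibble(
  package = c("Foo", "bar", "bar"),
  topic   = c("Stats", "Foo", "Stats")
)

# Names that appear in both columns would collapse into a single node.
clashes <- intersect(toy$package, toy$topic)

# Tagging the package names removes the overlap.
toy_tagged <- toy %>%
  mutate(package = str_c(package, " (package)"))
clashes_after <- intersect(toy_tagged$package, toy_tagged$topic)
```

Running the same `intersect()` on the real `tv_d` before the `mutate()` should list whichever names actually collide.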
Let’s create an edge list connecting packages to their topics:
pt_links <- tv_d %>%
  group_by(topic, package) %>%
  nest() %>%
  select(-data) %>%
  rename(source = topic,
         target = package) %>%
  mutate(label = 'includes')
Let’s take a look at the edge list we created:
| source | target | label |
|---|---|---|
| Bayesian | arm (package) | includes |
| SocialSciences | arm (package) | includes |
| Bayesian | BACCO (package) | includes |
| Bayesian | bayesm (package) | includes |
| Cluster | bayesm (package) | includes |
| Distributions | bayesm (package) | includes |
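If all you need is one row per topic-package pair, `distinct()` is a possible shortcut for the `group_by()`/`nest()`/`select()` steps. The sketch below uses a small made-up tibble (`toy`) with a deliberately duplicated row; it is an alternative I'm suggesting, not the code used for the article's results.

```r
library(dplyr)

# Hypothetical toy data with one duplicated topic-package pair.
toy <- tibble(
  package = c("arm (package)", "arm (package)",
              "bayesm (package)", "bayesm (package)"),
  topic   = c("Bayesian", "SocialSciences", "Bayesian", "Bayesian")
)

# One row per unique pair, with the same column names as the edge list.
toy_links <- toy %>%
  distinct(topic, package) %>%
  rename(source = topic,
         target = package) %>%
  mutate(label = "includes")
```

The duplicated Bayesian/bayesm row collapses into a single edge, which is the property the edge list needs.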
Now let’s create the node list. The nodes we create will all have a name property that will be unique to each node, a label to identify the type of node it is, a color to show graphically, and data that is unique to that particular node type.
# we don't have unique data for the topics,
# but we still need to create a data column
# for when we bind all the nodes together
t_nodes <- tv_d %>%
  mutate(placeholder = 'fake') %>%
  select(topic, placeholder) %>%
  distinct() %>%
  group_by(topic) %>%
  nest() %>%
  ungroup() %>%
  mutate(label = 'topic',
         color = '#7570b3') %>%
  rename(name = topic)
p_nodes <- tv_d %>%
  select(-topic) %>%
  group_by(package) %>%
  nest() %>%
  ungroup() %>%
  mutate(label = 'package',
         color = '#d95f02',
         data = lapply(data, distinct)) %>%
  rename(name = package)
# bind_rows() matches columns by name and keeps the list column intact
pt_nodes <- bind_rows(p_nodes, t_nodes) %>%
  select(name, label, color, data)
Let’s take a look at these nodes:
## # A tibble: 12 x 4
## name label color data
## <chr> <chr> <chr> <list>
## 1 arm (package) package #d95f02 <tibble [1 x 4]>
## 2 BACCO (package) package #d95f02 <tibble [1 x 4]>
## 3 bayesm (package) package #d95f02 <tibble [1 x 4]>
## 4 bayesSurv (package) package #d95f02 <tibble [1 x 4]>
## 5 DPpackage (package) package #d95f02 <tibble [1 x 4]>
## 6 LaplacesDemon (package) package #d95f02 <tibble [1 x 4]>
## 7 Robust topic #7570b3 <tibble [1 x 1]>
## 8 Pharmacokinetics topic #7570b3 <tibble [1 x 1]>
## 9 NaturalLanguageProcessing topic #7570b3 <tibble [1 x 1]>
## 10 Graphics topic #7570b3 <tibble [1 x 1]>
## 11 NumericalMathematics topic #7570b3 <tibble [1 x 1]>
## 12 WebTechnologies topic #7570b3 <tibble [1 x 1]>
Now let’s create a network and take a look at what kind of mess comes out:
pt_network <- graph_from_data_frame(pt_links, directed = TRUE, vertices = pt_nodes)
set.seed(1234)
plot(
  pt_network,
  vertex.size = 3,
  # only label the topic nodes; package labels would clutter the plot
  vertex.label = sapply(seq_along(V(pt_network)), function(i){
    if(V(pt_network)$label[i] == 'topic'){
      V(pt_network)$name[i]
    } else {
      NA
    }
  }),
  main = 'R CRAN Task View Network',
  vertex.label.cex = .65,
  vertex.label.color = 'black',
  vertex.label.dist = 1.1,
  edge.label = NA,
  edge.arrow.size = .25
)
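The topic-only labelling logic is reused for every plot in this article, so it could be pulled into a small helper. This is a sketch under my own naming: `topic_labels()` is a hypothetical function, not part of igraph, and the three-node graph is made up for illustration.

```r
library(igraph)

# Hypothetical helper: node name for topics, NA for everything else.
topic_labels <- function(g) {
  ifelse(V(g)$label == "topic", V(g)$name, NA_character_)
}

# Made-up edge and node lists shaped like pt_links and pt_nodes.
toy_edges <- data.frame(
  source = c("Bayesian", "Bayesian"),
  target = c("arm (package)", "bayesm (package)"),
  stringsAsFactors = FALSE
)
toy_nodes <- data.frame(
  name  = c("Bayesian", "arm (package)", "bayesm (package)"),
  label = c("topic", "package", "package"),
  stringsAsFactors = FALSE
)
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE, vertices = toy_nodes)

toy_labels <- topic_labels(toy_net)
# plot(toy_net, vertex.label = toy_labels)
```

Because `ifelse()` is vectorised, the helper does the same job as the `sapply()` loop without indexing each vertex by hand.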
Now that we have a network graph set up, let’s play with some of the tools igraph provides to explore networks:
The degree() function counts the links attached to a node. Through its mode argument we can count only the links that go into the node, only the links that go out of the node, or all of the links. The default is all:
ptn_degree <- degree(pt_network)
ptn_degree %>% head
## arm (package) BACCO (package) bayesm (package)
## 2 1 7
## bayesSurv (package) DPpackage (package) LaplacesDemon (package)
## 2 2 1
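To make the difference between the modes concrete, here is a small sketch on a made-up directed graph where edges point from topics to packages, as they do in `pt_network`. The graph and variable names are illustrative only.

```r
library(igraph)

# Made-up directed graph: two topics, two packages, three edges.
toy_net <- graph_from_data_frame(
  data.frame(
    source = c("Bayesian", "Bayesian", "Cluster"),
    target = c("arm (package)", "bayesm (package)", "bayesm (package)"),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)

deg_all <- degree(toy_net, mode = "all")  # incoming plus outgoing links
deg_in  <- degree(toy_net, mode = "in")   # links pointing at the node
deg_out <- degree(toy_net, mode = "out")  # links leaving the node
```

In this layout a package's in-degree is the number of topics that list it, and a topic's out-degree is the number of packages it lists.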
You can sort this list to find the name of the node with the most connections:
top_degree <- ptn_degree %>%
  sort(decreasing = TRUE) %>%
  names() %>%
  .[1]
top_degree
## [1] "Survival"
The ego() function finds all the nodes within a selected distance of a node of interest. Let’s find all the nodes within 2 edge connections of Survival:
top_ego <- ego(pt_network, order = 2, nodes = top_degree)
top_ego
## [[1]]
## + 295/2847 vertices, named, from 93414c9:
## [1] Survival bayesSurv (package)
## [3] DPpackage (package) MCMCpack (package)
## [5] BayHaz (package) BMA (package)
## [7] MCMCglmm (package) PReMiuM (package)
## [9] LearnBayes (package) clinfun (package)
## [11] coin (package) InformativeCensoring (package)
## [13] multcomp (package) survival (package)
## [15] mixAK (package) mixPHM (package)
## [17] VGAM (package) flexsurv (package)
## [19] fitdistrplus (package) msm (package)
## + ... omitted several vertices
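The effect of the distance argument is easier to see on a tiny example. The sketch below uses a made-up graph; with `order = 1` only the topic's own packages come back, while `order = 2` also reaches other topics that share one of those packages.

```r
library(igraph)

# Made-up directed graph: bayesm is shared by Bayesian and Cluster.
toy_net <- graph_from_data_frame(
  data.frame(
    source = c("Bayesian", "Bayesian", "Cluster"),
    target = c("arm (package)", "bayesm (package)", "bayesm (package)"),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)

# ego() returns a list with one vertex sequence per requested node.
ego1 <- ego(toy_net, order = 1, nodes = "Bayesian")
ego2 <- ego(toy_net, order = 2, nodes = "Bayesian")

ego1_names <- ego1[[1]]$name
ego2_names <- ego2[[1]]$name
```

`ego()` ignores edge direction by default, which is why Cluster is reachable from Bayesian through the shared package.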
The induced_subgraph() function filters a network graph down to a subgraph that contains only the nodes of interest, along with the edges between them. Let’s create a subgraph from the Survival ego node list:
top_ego_net <- induced_subgraph(pt_network, top_ego[[1]])
set.seed(1234)
plot(top_ego_net,
     vertex.size = 3,
     vertex.label = sapply(seq_along(V(top_ego_net)), function(i){
       if(V(top_ego_net)$label[i] == 'topic'){
         V(top_ego_net)$name[i]
       } else {
         NA
       }
     }),
     main = 'R CRAN Task View: Topic with most packages Subgraph',
     vertex.label.cex = .65,
     vertex.label.color = 'black',
     vertex.label.dist = 1.1,
     edge.label = NA,
     edge.arrow.size = .25)
This subgraph points out something important: most packages in the Survival topic don’t belong to any other topic. The purpose of this article is to explore connections, so let’s keep only the nodes with a degree greater than 1. For packages, that means belonging to more than one topic:
gt1_deg_net <- induced_subgraph(pt_network, V(pt_network)[ptn_degree > 1])
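One thing worth knowing about this filter is that it applies to every node, so a topic with only one link would be dropped as well. The sketch below shows this on a made-up graph, along with a hypothetical variant that keeps all topics; the variant is my own suggestion and is not used in the rest of the article.

```r
library(igraph)

# Made-up edge and node lists shaped like pt_links and pt_nodes.
toy_edges <- data.frame(
  source = c("Bayesian", "Bayesian", "Cluster"),
  target = c("arm (package)", "bayesm (package)", "bayesm (package)"),
  stringsAsFactors = FALSE
)
toy_nodes <- data.frame(
  name  = c("Bayesian", "Cluster", "arm (package)", "bayesm (package)"),
  label = c("topic", "topic", "package", "package"),
  stringsAsFactors = FALSE
)
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE, vertices = toy_nodes)
toy_deg <- degree(toy_net)

# Same filter as in the article: Cluster (degree 1) is removed too.
strict_net <- induced_subgraph(toy_net, V(toy_net)[toy_deg > 1])

# Hypothetical variant: keep every topic, drop only single-topic packages.
keep <- toy_deg > 1 | V(toy_net)$label == "topic"
topics_kept_net <- induced_subgraph(toy_net, V(toy_net)[keep])
```

Which version is appropriate depends on whether sparsely connected topics should stay visible in the plots.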
Now that we’ve excluded packages that belong to only one topic, let’s see which topic has the highest degree:
new_top_degree <- degree(gt1_deg_net) %>%
  sort(decreasing = TRUE) %>%
  names() %>%
  .[1]
new_top_degree
## [1] "Multivariate"
Let’s find the subgraph for Multivariate:
new_top_ego <- ego(gt1_deg_net, 2, new_top_degree)
new_top_ego_net <- induced_subgraph(gt1_deg_net, new_top_ego[[1]])
plot(new_top_ego_net,
     vertex.size = 3,
     vertex.label = sapply(seq_along(V(new_top_ego_net)), function(i){
       if(V(new_top_ego_net)$label[i] == 'topic'){
         V(new_top_ego_net)$name[i]
       } else {
         NA
       }
     }),
     main = 'R CRAN Task View: Most Connected Topic Subgraph',
     vertex.label.cex = .65,
     vertex.label.color = 'black',
     vertex.label.dist = 1.1,
     edge.label = NA,
     edge.arrow.size = .25)
The ego() function can also look at the ego of more than one node at a time:
top2_degree <- degree(gt1_deg_net) %>%
  sort(decreasing = TRUE) %>%
  names() %>%
  .[1:2]
top2_ego <- ego(gt1_deg_net, order = 2, nodes = top2_degree)
top2_ego
## [[1]]
## + 118/521 vertices, named, from 654eda4:
## [1] Multivariate bayesm (package)
## [3] MCMCpack (package) Hmisc (package)
## [5] MNP (package) monomvn (package)
## [7] PTAk (package) pls (package)
## [9] ppls (package) psy (package)
## [11] homals (package) pcaPP (package)
## [13] fastICA (package) clustvarsel (package)
## [15] kohonen (package) cluster (package)
## [17] hybridHclust (package) clusterSim (package)
## [19] kernlab (package) trimcluster (package)
## + ... omitted several vertices
##
## [[2]]
## + 91/521 vertices, named, from 654eda4:
## [1] TimeSeries BAYSTAR (package)
## [3] ensembleBMA (package) bspec (package)
## [5] bsts (package) dlm (package)
## [7] MSBVAR (package) spTimer (package)
## [9] stochvol (package) depmix (package)
## [11] depmixS4 (package) sde (package)
## [13] pomp (package) Sim.DiffProc (package)
## [15] TSA (package) AER (package)
## [17] zoo (package) xts (package)
## [19] forecast (package) dynlm (package)
## + ... omitted several vertices
top2_ego_net <- induced_subgraph(gt1_deg_net, top2_ego %>% unlist %>% names() %>% unique)
set.seed(1234)
plot(top2_ego_net,
     vertex.size = 3,
     vertex.label = sapply(seq_along(V(top2_ego_net)), function(i){
       if(V(top2_ego_net)$label[i] == 'topic'){
         V(top2_ego_net)$name[i]
       } else {
         NA
       }
     }),
     main = 'R CRAN Task View: Top 2 Connected Topics Subgraph',
     vertex.label.cex = .65,
     vertex.label.color = 'black',
     vertex.label.dist = 1,
     edge.label = NA,
     edge.arrow.size = .25)
The all_shortest_paths() function returns every shortest path that connects two nodes. Let’s see the quickest way to connect Multivariate and TimeSeries:
top2_connection <- all_shortest_paths(top2_ego_net,
                                      from = top2_degree[1],
                                      to = top2_degree[2],
                                      mode = 'all')
top2_connection$res
## [[1]]
## + 3/188 vertices, named, from b0c9463:
## [1] Multivariate mAr (package) TimeSeries
##
## [[2]]
## + 3/188 vertices, named, from b0c9463:
## [1] Multivariate tsfa (package) TimeSeries
connection_nodes <- top2_connection$res %>%
  unlist() %>%
  names() %>%
  unique()
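The mode = 'all' argument matters here because every edge points from a topic to a package, so a path between two topics has to travel against the arrow on its second hop. The sketch below reproduces the idea on a made-up graph; the edges are illustrative and only loosely modelled on the output above.

```r
library(igraph)

# Made-up directed graph: two topics share mAr and tsfa, pls is a dead end.
toy_net <- graph_from_data_frame(
  data.frame(
    source = c("Multivariate", "Multivariate", "Multivariate",
               "TimeSeries", "TimeSeries"),
    target = c("mAr (package)", "tsfa (package)", "pls (package)",
               "mAr (package)", "tsfa (package)"),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)

# mode = "all" treats edges as undirected while searching.
paths_all <- all_shortest_paths(toy_net,
                                from = "Multivariate",
                                to = "TimeSeries",
                                mode = "all")$res

# The middle vertex of each path is a package shared by both topics.
bridges <- sapply(paths_all, function(p) p$name[2])
```

Each returned path has three vertices, and the packages in `bridges` are the ones that tie the two topics together.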
Let’s visualize this connection:
top2_connection_net <- induced_subgraph(gt1_deg_net, connection_nodes)
set.seed(1234)
plot(top2_connection_net,
     vertex.size = 5,
     main = 'R CRAN Task View: Top 2 Connected Topics Shortest Path',
     vertex.label = V(top2_connection_net)$name,
     vertex.label.cex = .65,
     vertex.label.color = 'black',
     vertex.label.dist = 1,
     edge.arrow.size = 1,
     edge.label.cex = .5
)
There is a lot more information available in the CRAN task view network. For instance, each package has a description field. I plan on using tidy text mining tools to explore this network even further by using this description field.