Traversing a graph
- We’re now going to talk about how to query a graph, which is also known as traversing a graph
- Quick definitions:
- Adjacent:
- A node is adjacent to another node if they share a similar edge
- An edge is adjacent to another edge if they share a similar node
- Here link1 is adjacent to link2 and node2 is adjacent to both node1 and node3, but node1 is not adjacent to node 3
- in directed graphs node3 is outwardly adjacent to node 2 and node1 is inwardly adjacent to node 2
- Incident:
- A node that is attached to an edge is said to be incident to that edge
- Node3 is at the head of link2 and node2 is at the tail of link2
- A walk is a sequence of nodes and edges that are incident to eachother
- A path is a walk that has no repeating nodes or edges
- we often care about shortest paths between two nodes
- Traversal:
- While we can query for specific nodes or edges, we often want to know what surrounds them
- What we need to do is walk around the nodes or edges of interest to derive more information
- Let’s say we have a teacher of interest.
- If we want to know what courses a teacher teaches, then we need to traverse the graph for any edge that represents “teaches”
- Once we learn what courses the teacher teaches, then we can traverse the graph to see what students take that teachers course.
- We might want to do the same traversal for two teachers at once.
- Now we can see if there any overlaps in the two courses
- This kind of traversal is important because:
- we can semantically pull information from our data
- we can also infer new information about the teacher, courses, and students because of these connections.
- We can make these inferences without explicitly knowing the attributes or information of the nodes
- However, the more information we have, the better our inferences can be.
Centrality Scores
- Centrality scores are measurements derived simply by understanding the structure of the graph. While these measurements are simple and intuitive, it turns out that there is a lot of power behind them.
- Different centrality scores emphasize different strengths and weaknesses of a node
- “Nepusz T., Petroczi A., Negyessy L., Bazso F.: Fuzzy communities and the concept of bridgeness in complex networks. Physical Review E, 77:016107, 2008.”
- Degree:
- This is simply the number of edges incident to a node.
- In social media you might care about the raw number of followers
- Strength:
- In igraph, the strength of a node is the sum of all the weights of it’s incident edges
- In epidimiology, a weight can represent the amount of time two people spend with eachother, increasing the risk of spreading a conatagious disease.
- Shortest Paths:
- As the name suggests, the shortest paths between two vertices is the most direct route to get between two nodes.
- If weights are involved, then the shortest path isn’t necessarily the one with the least amount of edges, but the least total weight
- In maps, roads can be represented as edges, and locations can be represented as addresses.
- The weight of a road could be traffic. The shortest distance isn’t necessarilty the shortest travel time
- betweeness:
- If you look at all the shortest paths between all the nodes, you’ll notice that some nodes show up a lot and other nodes never show up other than as end points.
- These nodes are given a high betweenness score because they are essential for quickly connecting the most nodes.
- if they were removed, then the shortest paths would grow larger in distances, that is, the shortest paths would now require more nodes and edges to connect
- For military purposes, if we analyze the command and control network of a terrorist group, where all the members are the nodes and the edges are their interactions, then we can identify who the leaders are using betweeness.
- in a hierarchical tree structures, the leaders tend to have a higher betweeness score
- closeness:
- The closeness of a node is the shortest distance between it and all the other nodes.
- If a node has the highest closeness score, then all adjacent nodes will have a relatively high closeness score as well, because they are at most one degree farther from any given node.
- Such vertices might have better access to information at other vertices or more direct influence on other vertices.
- In a social network, for instance, a person with lower mean distance to others might find that their opinions reach others in the community more quickly than the opinion of someone with higher mean distance.
- Eigenvector Centrality:
- Relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
- A high eigenvector score means that a node is connected to many nodes who themselves have high scores.
- A high EigenCentrality score indicates a strong influence over other nodes in the network. It is useful because it indicates not just direct influence, but also implies influence over nodes more than one ‘hop’ away.
- Eigenvector centrality is the basis for many popular centrality measurements.
- Kleinberg’s HITS (hyperlink-induced topic search):
- even if trying to find pages that contain the query words should be the starting point, a different ranking system is needed in order to find those pages that are authoritative for a given query.
- Page i is called an authority for the query “automobile makers” if it contains valuable information on the subject.
- Official web sites of car manufacturers, such as www.bmw.com, HyundaiUSA.com, www.mercedes-benz.com would be authorities for this search. Commercial web sites selling cars might be authorities on the subject as well.
- there is a second category of pages relevant to the process of finding the authoritative pages, called hubs. Their role is to advertise the authoritative pages. They contain useful links towards the authoritative pages.
- In other words, hubs point the search engine in the “right direction”. In real life, when you buy a car, you are more inclined to purchase it from a certain dealer that your friend recommends. Following the analogy, the authority in this case would be the car dealer, and the hub would be your friend. You trust your friend, therefore you trust what your friend recommends.
- In the world wide web, hubs for our query about automobiles might be pages that contain rankings of the cars, blogs where people discuss about the cars that they purchased, and so on.
- PageRank:
- PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the “random surfer” will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages.
- Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it.