## Data Science from Scratch: First Principles with Python (2015)

### Chapter 21. Network Analysis

> Your connections to all the things around you literally define who you are.
>
> Aaron O’Connell

## Betweenness Centrality

```python
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

for user in users:
    user["friends"] = []

for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j])  # add j as a friend of i
    users[j]["friends"].append(users[i])  # add i as a friend of j
```

1. Our goal is a function that takes a `from_user` and finds *all* shortest paths to every other user.

2. We’ll represent a path as a `list` of user IDs. Since every path starts at `from_user`, we won’t include her ID in the list. This means that the length of the list representing the path will be the length of the path itself.

3. We’ll maintain a dictionary `shortest_paths_to` where the keys are user IDs and the values are lists of paths that end at the user with the specified ID. If there is a unique shortest path, the list will just contain that one path. If there are multiple shortest paths, the list will contain all of them.

4. We’ll also maintain a queue `frontier` that contains the users we want to explore in the order we want to explore them. We’ll store them as pairs `(prev_user, user)` so that we know how we got to each one. We initialize the queue with all the neighbors of `from_user`. (We haven’t ever talked about queues, which are data structures optimized for “add to the end” and “remove from the front” operations. In Python, they are implemented as `collections.deque`, which is actually a double-ended queue.)

5. As we explore the graph, whenever we find new neighbors that we don’t already know shortest paths to, we add them to the end of the queue to explore later, with the current user as `prev_user`.

6. When we take a user off the queue, and we’ve never encountered that user before, we’ve definitely found one or more shortest paths to him: each of the shortest paths to `prev_user` with one extra step added.

7. When we take a user off the queue and we *have* encountered that user before, then either we’ve found another shortest path (in which case we should add it) or we’ve found a longer path (in which case we shouldn’t).

8. When no more users are left on the queue, we’ve explored the whole graph (or, at least, the parts of it that are reachable from the starting user) and we’re done.
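Since the queue is the one new data structure here, a quick standalone illustration of `collections.deque` (the values are just for the demo):

```python
from collections import deque

q = deque()
q.append("a")        # "add to the end" is cheap
q.append("b")
q.append("c")

first = q.popleft()  # "remove from the front" is also cheap
# first is "a", and q still holds "b" and "c"
```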

```python
from collections import deque

def shortest_paths_from(from_user):

    # a dictionary from user_id to *all* shortest paths to that user
    shortest_paths_to = { from_user["id"] : [[]] }

    # a queue of (previous user, next user) that we need to check.
    # starts out with all pairs (from_user, friend_of_from_user)
    frontier = deque((from_user, friend)
                     for friend in from_user["friends"])

    # keep going until we empty the queue
    while frontier:

        prev_user, user = frontier.popleft()  # remove the user who's
        user_id = user["id"]                  # first in the queue

        # because of the way we're adding to the queue,
        # necessarily we already know some shortest paths to prev_user
        paths_to_prev_user = shortest_paths_to[prev_user["id"]]
        new_paths_to_user = [path + [user_id] for path in paths_to_prev_user]

        # it's possible we already know a shortest path
        old_paths_to_user = shortest_paths_to.get(user_id, [])

        # what's the shortest path to here that we've seen so far?
        if old_paths_to_user:
            min_path_length = len(old_paths_to_user[0])
        else:
            min_path_length = float('inf')

        # only keep paths that aren't too long and are actually new
        new_paths_to_user = [path
                             for path in new_paths_to_user
                             if len(path) <= min_path_length
                             and path not in old_paths_to_user]

        shortest_paths_to[user_id] = old_paths_to_user + new_paths_to_user

        # add never-seen neighbors to the frontier
        frontier.extend((user, friend)
                        for friend in user["friends"]
                        if friend["id"] not in shortest_paths_to)

    return shortest_paths_to
```

```python
for user in users:
    user["shortest_paths"] = shortest_paths_from(user)
```
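As a quick sanity check, here is the same logic applied to a hypothetical four-node "diamond" graph, where there are two equally short routes from node 0 to node 3 (the function body is condensed and repeated so the sketch runs on its own):

```python
from collections import deque

# hypothetical "diamond" graph: 0 - 1 - 3 and 0 - 2 - 3,
# so there are two equally short paths from user 0 to user 3
toy_users = [{"id": i, "friends": []} for i in range(4)]
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    toy_users[i]["friends"].append(toy_users[j])
    toy_users[j]["friends"].append(toy_users[i])

def shortest_paths_from(from_user):
    # same logic as above, repeated so this sketch is standalone
    shortest_paths_to = {from_user["id"]: [[]]}
    frontier = deque((from_user, friend)
                     for friend in from_user["friends"])
    while frontier:
        prev_user, user = frontier.popleft()
        user_id = user["id"]
        new_paths = [path + [user_id]
                     for path in shortest_paths_to[prev_user["id"]]]
        old_paths = shortest_paths_to.get(user_id, [])
        min_length = len(old_paths[0]) if old_paths else float('inf')
        new_paths = [path for path in new_paths
                     if len(path) <= min_length and path not in old_paths]
        shortest_paths_to[user_id] = old_paths + new_paths
        frontier.extend((user, friend)
                        for friend in user["friends"]
                        if friend["id"] not in shortest_paths_to)
    return shortest_paths_to

paths = shortest_paths_from(toy_users[0])
# paths[3] contains both [1, 3] and [2, 3]
```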

```python
for user in users:
    user["betweenness_centrality"] = 0.0

for source in users:
    source_id = source["id"]
    for target_id, paths in source["shortest_paths"].items():
        if source_id < target_id:    # don't double count
            num_paths = len(paths)   # how many shortest paths?
            contrib = 1 / num_paths  # contribution to centrality
            for path in paths:
                for between_id in path:
                    if between_id not in [source_id, target_id]:
                        users[between_id]["betweenness_centrality"] += contrib
```
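To see the bookkeeping in miniature, consider a hypothetical three-node path graph 0 - 1 - 2 with the shortest paths written out by hand (keyed here by `(source, target)` pairs, and, as above, excluding the source's own ID). User 1 sits on the only shortest path between users 0 and 2, so all of the betweenness lands on it:

```python
# hand-written shortest paths for the path graph 0 - 1 - 2
shortest_paths = {
    (0, 1): [[1]],
    (0, 2): [[1, 2]],
    (1, 2): [[2]],
}

betweenness = {node: 0.0 for node in range(3)}
for (source, target), paths in shortest_paths.items():
    contrib = 1 / len(paths)  # split credit across the shortest paths
    for path in paths:
        for node in path:
            if node not in (source, target):
                betweenness[node] += contrib
# only the middle node accumulates any betweenness
```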


```python
def farness(user):
    """the sum of the lengths of the shortest paths to each other user"""
    return sum(len(paths[0])
               for paths in user["shortest_paths"].values())

for user in users:
    user["closeness_centrality"] = 1 / farness(user)
```
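The same hypothetical path graph 0 - 1 - 2 (shortest-path lengths written out by hand) shows why the center node scores highest on closeness:

```python
# shortest-path lengths for the path graph 0 - 1 - 2, by hand
path_lengths = {
    0: {0: 0, 1: 1, 2: 2},
    1: {0: 1, 1: 0, 2: 1},
    2: {0: 2, 1: 1, 2: 0},
}

farness = {user: sum(lengths.values())
           for user, lengths in path_lengths.items()}
closeness = {user: 1 / farness[user] for user in farness}
# the center user 1 has the smallest farness, hence the largest closeness
```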

## Matrix Multiplication

```python
from functools import partial

# dot, get_row, get_column, shape, and make_matrix are the
# linear algebra helpers from Chapter 4
def matrix_product_entry(A, B, i, j):
    return dot(get_row(A, i), get_column(B, j))

def matrix_multiply(A, B):
    n1, k1 = shape(A)
    n2, k2 = shape(B)
    if k1 != n2:
        raise ArithmeticError("incompatible shapes!")

    return make_matrix(n1, k2, partial(matrix_product_entry, A, B))
```
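Here is `matrix_multiply` in action on a pair of 2 x 2 matrices; the Chapter 4 helpers (`dot`, `get_row`, `get_column`, `shape`, `make_matrix`) are given minimal stand-in definitions so the example runs on its own:

```python
from functools import partial

# minimal stand-ins for the Chapter 4 helpers
def shape(A):      return len(A), len(A[0])
def get_row(A, i): return A[i]
def get_column(A, j): return [A_i[j] for A_i in A]
def dot(v, w):     return sum(v_i * w_i for v_i, w_i in zip(v, w))
def make_matrix(num_rows, num_cols, entry_fn):
    return [[entry_fn(i, j) for j in range(num_cols)]
            for i in range(num_rows)]

def matrix_product_entry(A, B, i, j):
    return dot(get_row(A, i), get_column(B, j))

def matrix_multiply(A, B):
    n1, k1 = shape(A)
    n2, k2 = shape(B)
    if k1 != n2:
        raise ArithmeticError("incompatible shapes!")
    return make_matrix(n1, k2, partial(matrix_product_entry, A, B))

product = matrix_multiply([[1, 2], [3, 4]],
                          [[5, 6], [7, 8]])
# product == [[19, 22], [43, 50]]
```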

```python
v = [1, 2, 3]

v_as_matrix = [[1],
               [2],
               [3]]

def vector_as_matrix(v):
    """returns the vector v (represented as a list) as an n x 1 matrix"""
    return [[v_i] for v_i in v]

def vector_from_matrix(v_as_matrix):
    """returns the n x 1 matrix as a list of values"""
    return [row[0] for row in v_as_matrix]

def matrix_operate(A, v):
    v_as_matrix = vector_as_matrix(v)
    product = matrix_multiply(A, v_as_matrix)
    return vector_from_matrix(product)
```

```python
import random

# magnitude, scalar_multiply, and distance are also from Chapter 4
def find_eigenvector(A, tolerance=0.00001):
    guess = [random.random() for __ in A]

    while True:
        result = matrix_operate(A, guess)
        length = magnitude(result)
        next_guess = scalar_multiply(1/length, result)

        if distance(guess, next_guess) < tolerance:
            return next_guess, length  # eigenvector, eigenvalue

        guess = next_guess
```
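A compact standalone sketch of the same power-iteration idea (with inlined stand-ins for the matrix helpers), applied to the symmetric matrix `[[1, 2], [2, 1]]`, whose largest eigenvalue is 3 with eigenvector proportional to `[1, 1]`:

```python
import math
import random

def mat_vec(A, v):
    # multiply matrix A by vector v
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def magnitude(v):
    return math.sqrt(sum(v_i ** 2 for v_i in v))

def distance(v, w):
    return magnitude([v_i - w_i for v_i, w_i in zip(v, w)])

def find_eigenvector(A, tolerance=0.00001):
    guess = [random.random() for _ in A]
    while True:
        result = mat_vec(A, guess)
        length = magnitude(result)
        next_guess = [r / length for r in result]
        if distance(guess, next_guess) < tolerance:
            return next_guess, length  # eigenvector, eigenvalue
        guess = next_guess

vector, value = find_eigenvector([[1, 2], [2, 1]])
# value is close to 3, and the two components of vector are nearly equal
```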

Not every matrix is so well behaved. For example, the matrix

```python
rotate = [[ 0, 1],
          [-1, 0]]
```

rotates vectors 90 degrees, so the only vector it maps to a scalar multiple of itself is the zero vector, and `find_eigenvector(rotate)` will run forever. Even matrices that have eigenvectors can sometimes get stuck in cycles:

```python
flip = [[0, 1],
        [1, 0]]
```

maps `[x, y]` to `[y, x]`, so `[1, 1]` is an eigenvector. But if you start with a random guess whose coordinates are unequal, `find_eigenvector` will just swap the coordinates forever.

## Centrality

```python
def entry_fn(i, j):
    return 1 if (i, j) in friendships or (j, i) in friendships else 0

n = len(users)
adjacency_matrix = make_matrix(n, n, entry_fn)
```


```python
eigenvector_centralities, _ = find_eigenvector(adjacency_matrix)
```

The vector this returns has the defining property that `matrix_operate(adjacency_matrix, eigenvector_centralities)` is a scalar multiple of `eigenvector_centralities`. Since its ith entry is `dot(get_row(adjacency_matrix, i), eigenvector_centralities)`, each user's eigenvector centrality is proportional to the sum of his neighbors' centralities. That is exactly the fixed point of the following iteration:

1. Give each node a new centrality score that equals the sum of its neighbors’ (old) centrality scores.

2. Rescale the vector of centralities to have magnitude 1.

```python
endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
                (1, 3), (2, 3), (3, 4), (5, 4), (5, 6), (7, 5),
                (6, 8), (8, 7), (8, 9)]
```

```python
for user in users:
    user["endorses"] = []     # add one list to track outgoing endorsements
    user["endorsed_by"] = []  # and another to track incoming ones

for source_id, target_id in endorsements:
    users[source_id]["endorses"].append(users[target_id])
    users[target_id]["endorsed_by"].append(users[source_id])

endorsements_by_id = [(user["id"], len(user["endorsed_by"]))
                      for user in users]

sorted(endorsements_by_id,
       key=lambda pair: pair[1],  # sort by num_endorsements
       reverse=True)
```

1. There is a total of 1.0 (or 100%) PageRank in the network.

2. Initially this PageRank is equally distributed among nodes.

3. At each step, a large fraction of each node’s PageRank is distributed evenly among its outgoing links.

4. At each step, the remainder of each node’s PageRank is distributed evenly among all nodes.

```python
def page_rank(users, damping = 0.85, num_iters = 100):

    # initially distribute PageRank evenly
    num_users = len(users)
    pr = { user["id"] : 1 / num_users for user in users }

    # this is the small fraction of PageRank
    # that each node gets each iteration
    base_pr = (1 - damping) / num_users

    for __ in range(num_iters):
        next_pr = { user["id"] : base_pr for user in users }

        for user in users:
            # distribute PageRank to outgoing links
            links_pr = pr[user["id"]] * damping
            for endorsee in user["endorses"]:
                next_pr[endorsee["id"]] += links_pr / len(user["endorses"])

        pr = next_pr

    return pr
```
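As a sanity check, here is `page_rank` applied to a hypothetical three-user endorsement cycle (0 endorses 1, 1 endorses 2, 2 endorses 0). Since every user has exactly one incoming and one outgoing endorsement, by symmetry each should end up with one third of the PageRank (the function body is repeated so the sketch runs on its own):

```python
def page_rank(users, damping=0.85, num_iters=100):
    # same function as above, repeated so this sketch is standalone
    num_users = len(users)
    pr = {user["id"]: 1 / num_users for user in users}
    base_pr = (1 - damping) / num_users
    for _ in range(num_iters):
        next_pr = {user["id"]: base_pr for user in users}
        for user in users:
            links_pr = pr[user["id"]] * damping
            for endorsee in user["endorses"]:
                next_pr[endorsee["id"]] += links_pr / len(user["endorses"])
        pr = next_pr
    return pr

# hypothetical three-user endorsement cycle: 0 -> 1 -> 2 -> 0
toy = [{"id": i} for i in range(3)]
toy[0]["endorses"] = [toy[1]]
toy[1]["endorses"] = [toy[2]]
toy[2]["endorses"] = [toy[0]]

pr = page_rank(toy)
# each user gets 1/3, and the total PageRank stays at 1.0
```

Note that because every toy user has an outgoing endorsement, no PageRank leaks out of the network; a user who endorses no one would let the damped share of his PageRank disappear.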

## For Further Exploration

- There are many other notions of centrality besides the ones we used (although the ones we used are pretty much the most popular ones).

- NetworkX is a Python library for network analysis. It has functions for computing centralities and for visualizing graphs.

- Gephi is a love-it/hate-it GUI-based network-visualization tool.