Twitter Interactions Analysis using Twinetverse

Introduction

The goal of the twinetverse is to provide everything one might need to view Twitter interactions, from data collection to visualisation. This can be a powerful tool for social media analysis, since it helps visualise how users communicate with one another on a given topic or how information spreads throughout the Twitter network.

In this article, we’re going to briefly explore the twinetverse by creating a graph that links users to the users they retweet, to visualise how information spreads throughout Twitter.

Packages

The twinetverse package is available on GitHub:

# install.packages("devtools")
devtools::install_github("JohnCoene/twinetverse") # github

The twinetverse includes three packages:

  • rtweet (Kearney 2018): wraps the Twitter API, giving R users easy access to tweets. It covers the first step in visualising Twitter interactions: collecting the data.
  • graphTweets (Coene 2019a): extracts nodes and edges from tweets collected with rtweet. It covers the second step: building the graphs from the collected data.
  • sigmajs (Coene 2019b): visualises the networks built with graphTweets. It is the last piece of the puzzle.

Within the context of visualising Twitter interactions, each of the packages listed above fills a specific need and a distinct step of the process: 1) collecting the data, 2) building the graphs and finally 3) visualising the graphs of said interactions.

The packages are pipe (%>%) friendly, making it easy to go from building a graph to visualising it.

library(twinetverse)
## -- Attaching twinetverse ------------------------------------------------ twinetverse 0.0.2 --
library(tidyverse)

Prerequisites

API Authorization

All users must be authorized to interact with Twitter’s APIs. To access the API, you will need to create a Twitter Developer Account here: https://developer.twitter.com/en/apps.

After creating one, you can create an “app” to get the keys and access tokens used by the rtweet package. The setup process is rather simple, but if you need further explanation, you can head over to rtweet’s official website.

Notes on filling in the “app” form:
1. Website: simply put a valid website; you can link to your Twitter profile if you do not have one, e.g. https://twitter.com/jdatap
2. Callback URL: this is important. Put the following in there, exactly as is: http://127.0.0.1:1410

Create & Save Token

You’re now set up with an app. Take note of your app’s credentials under “Keys and Access Tokens”, as you will need them to create your token and fetch tweets:

mytoken <- create_token(
  "My Application Name",
  consumer_key = "XxxxXxXXxXx",
  consumer_secret = "XxxxXxXXxXx",
  access_token = "XxxxXxXXxXx",
  access_secret = "XxxxXxXXxXx"
)

Ideally, also save it. There is no need to re-create a token every time you want to download data.

saveRDS(mytoken, file = "mytoken.rds")
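
As a minimal sketch of this save-and-reload pattern, the idea is to build the object once and read it back from disk on later runs. The helper load_or_create and the placeholder list standing in for a real token are hypothetical, used only so the example runs without Twitter credentials:

```r
# Hypothetical helper: read a cached object if it exists,
# otherwise create it and cache it for next time.
load_or_create <- function(path, create) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- create()
    saveRDS(obj, file = path)
    obj
  }
}

# Placeholder standing in for create_token(...); counts how often it is called.
calls <- 0
make_placeholder_token <- function() {
  calls <<- calls + 1
  list(app = "My Application Name")
}

token_path <- file.path(tempdir(), "mytoken_example.rds")
unlink(token_path)  # start from a clean state

first  <- load_or_create(token_path, make_placeholder_token)  # creates and saves
second <- load_or_create(token_path, make_placeholder_token)  # reads from disk
```

In practice, the create argument would wrap the create_token() call shown above.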

Retweets Analysis

There are several types of graphs that the twinetverse, through graphTweets, allows us to build. In this article, our focus will be on the retweets type, which will help us understand how information spreads throughout the Twitter network.

Collect

We’ll start by collecting our tweets. I’m going to use the hashtag #TheyAreUs, which trended on Twitter after the Christchurch twin mosque shootings, as our example.

# load the saved API token (adjust the path to wherever you saved it)
mytoken <- readRDS("data_input/mytoken.rds")
tweets <- search_tweets("#TheyAreUs filter:retweets", n = 1000, include_rts = TRUE, token = mytoken)

Note:
If you want to skip the API authorization process and prefer to practice on existing Twitter data, you can instead import the Twitter data CSV from this directory:

tweets <- read_csv("data_input/tweets_twinet.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   created_at = col_datetime(format = ""),
##   display_text_width = col_double(),
##   reply_to_status_id = col_logical(),
##   reply_to_user_id = col_logical(),
##   reply_to_screen_name = col_logical(),
##   is_quote = col_logical(),
##   is_retweet = col_logical(),
##   favorite_count = col_double(),
##   retweet_count = col_double(),
##   symbols = col_logical(),
##   ext_media_type = col_logical(),
##   quoted_status_id = col_logical(),
##   quoted_text = col_logical(),
##   quoted_created_at = col_logical(),
##   quoted_source = col_logical(),
##   quoted_favorite_count = col_logical(),
##   quoted_retweet_count = col_logical(),
##   quoted_user_id = col_logical(),
##   quoted_screen_name = col_logical(),
##   quoted_name = col_logical()
##   # ... with 27 more columns
## )
## See spec(...) for full column specifications.

The search_tweets function takes a few arguments. Above, we fetch 1,000 tweets matching “#TheyAreUs”; since we want to focus on retweets, we restrict the query with filter:retweets and make sure retweets are kept in the results with include_rts = TRUE.

Each row is a tweet. rtweet returns quite a lot of variables (88); we’ll only look at a select few.

names(tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

Build

A network consists of nodes and edges: this is just what graphTweets returns.

In this graph, each node is a user, connected to the other users whom they retweeted. Functions in graphTweets are meant to be run in a specific order:

  • Extract the edges
  • Extract the nodes

net <- tweets %>%
  gt_edges(source = screen_name, target = retweet_screen_name) %>% # get edges
  gt_nodes() # get nodes

We called gt_edges on our tweets data frame, passing a few bare column names. The source of the tweets (the user posting the tweets) is also the source of our edges, so we pass source = screen_name. The target of these edges is the user whom they retweeted, which the API gives as retweet_screen_name.
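
To make the source/target idea concrete, here is a rough base-R sketch of the kind of table gt_edges produces. It is not graphTweets’ actual implementation; the toy_tweets data and the counting via aggregate are assumptions for illustration only.

```r
# Toy stand-in for the tweets data frame (hypothetical handles).
toy_tweets <- data.frame(
  screen_name         = c("alice", "bob", "alice", "carol"),
  retweet_screen_name = c("dave",  "dave", "dave",  "alice"),
  stringsAsFactors = FALSE
)

# One row per tweet, with a weight of 1 for each source/target pair.
pairs <- data.frame(
  source = toy_tweets$screen_name,
  target = toy_tweets$retweet_screen_name,
  n      = 1L,
  stringsAsFactors = FALSE
)

# Sum the weights: n becomes the number of tweets linking source to target.
toy_edges <- aggregate(n ~ source + target, data = pairs, FUN = sum)
toy_edges
```

Here alice retweeted dave twice, so that pair ends up with n equal to 2, while the other pairs have n equal to 1.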

The object returned is of an unfamiliar class.

class(net)
## [1] "graphTweets"

To extract the results from graphTweets, run gt_collect; this will work at any point in the chain of pipes (%>%).

net <- net %>% 
  gt_collect()

class(net)
## [1] "list"

Visualise

We can visualise the network with sigmajs. Again, it’s very easy and follows the same idea as graphTweets: we pipe our nodes and edges through. Before we do so, for the sake of clarity, let’s unpack our network using the %<-% operator from the zeallot package (Teetor 2018), imported by the twinetverse.

c(edges, nodes) %<-% net

Note: You can always unpack the network with edges <- net$edges and nodes <- net$nodes if you are not comfortable with the above.

Let’s take a look at the edges.

head(edges)
## # A tibble: 6 x 3
##   source        target              n
##   <chr>         <chr>           <int>
## 1 __choeeey     mmmmaggy            1
## 2 _19bm         hurricanesrugby     1
## 3 _alleiahmalik voicesofyouth       1
## 4 _denchtastic  hurricanesrugby     1
## 5 10dubai       fahimaq             1
## 6 1cuteone      intactive           1

Edges simply consist of source and target. As explained earlier, source corresponds to the screen_name passed to gt_edges: it is the user who posted the tweet. In contrast, target holds the users whom they retweeted in that tweet. The n variable indicates how many tweets connect the source to the target.

Now let’s take a look at the nodes:

head(nodes)
## # A tibble: 6 x 3
##   nodes          type      n
##   <chr>          <chr> <int>
## 1 __choeeey      user      1
## 2 __interfaith__ user      6
## 3 _19bm          user      1
## 4 _alleiahmalik  user      1
## 5 _denchtastic   user      1
## 6 10dubai        user      1

In the nodes data frame, the column n is the number of times the node appears (whether as source or as target), while the nodes column holds the Twitter handles of both the authors of the tweets and those who retweeted them.
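
As a rough illustration of that count, the sketch below tallies how often each handle shows up across the source and target columns of a toy edge list. Whether graphTweets counts edge rows or weights them by n is not stated here, so the row-counting approach and the toy data are assumptions, not the package's method.

```r
# Toy edge list (hypothetical handles).
toy_edges <- data.frame(
  source = c("carol", "alice", "bob"),
  target = c("alice", "dave",  "dave"),
  stringsAsFactors = FALSE
)

# Count appearances of each handle, whether as source or as target.
appearances <- table(c(toy_edges$source, toy_edges$target))

toy_nodes <- data.frame(
  nodes = names(appearances),
  type  = "user",
  n     = as.integer(appearances),
  stringsAsFactors = FALSE
)
toy_nodes
```

In this toy data alice appears once as a target and once as a source, so her n is 2.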

Below we rename a few columns to meet the sigmajs naming convention.

  1. We add ids to our nodes, this can be a string and thus simply corresponds to our nodes column.
  2. We essentially rename n to size as this is what sigmajs understands.
  3. We add ids to our edges as sigmajs requires each edge to have a unique id.

sigmajs has a specific but sensible naming convention as well as basic minimal requirements:

  • Nodes must at least include id, and size.
  • Edges must at least include id, source, and target.
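
The renaming steps listed above can be sketched by hand in base R. This is only an illustration on toy data of what meeting the convention looks like; it is not a description of how any helper function is implemented, and the label column is included on the assumption that labels should simply repeat the handle.

```r
# Toy nodes and edges in the shape returned by graphTweets (hypothetical handles).
toy_nodes <- data.frame(
  nodes = c("alice", "bob", "dave"),
  type  = "user",
  n     = c(1L, 1L, 2L),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(
  source = c("alice", "bob"),
  target = c("dave",  "dave"),
  n      = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Nodes: id and label come from the handle, size comes from n.
nodes_sg <- data.frame(
  id    = toy_nodes$nodes,
  label = toy_nodes$nodes,
  size  = toy_nodes$n,
  stringsAsFactors = FALSE
)

# Edges: add a unique id per row, keep source and target.
edges_sg <- data.frame(
  id     = seq_len(nrow(toy_edges)),
  source = toy_edges$source,
  target = toy_edges$target,
  stringsAsFactors = FALSE
)
```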

Now, the twinetverse comes with helper functions to prepare the nodes and edges built with graphTweets for use in sigmajs (these are the only functions the ’verse provides).

nodes <- nodes2sg(nodes)
edges <- edges2sg(edges)

Let’s visualise that. Every sigmajs graph must be initialised with the sigmajs function; we then add our nodes with sg_nodes, passing the column names mentioned previously, id and size, to meet sigmajs’ minimum requirements. In sigmajs, with the exception of the function called sigmajs itself, all functions start with sg_.

sigmajs actually allows us to build graphs using only nodes or only edges. Unlike graphTweets, where edges are extracted before nodes, sigmajs functions are run nodes first, then edges.

Let’s begin by mapping our nodes:

sigmajs() %>% 
  sg_nodes(nodes, id, size) 

Then, let’s add the edges:

sigmajs() %>% 
  sg_nodes(nodes, id, size) %>% 
  sg_edges(edges, id, source, target)

Each disk/point on the graph is a Twitter user; two users are connected when one has retweeted the other.

The graph above doesn’t look very informative yet, but sigmajs is highly customisable. We’re going to beautify it a bit, starting by adding an appropriate layout to the graph. The layout used in the following code is one of igraph’s layout algorithms.

We’ll also add labels that will display on hover by simply passing the label column to sg_nodes.

sigmajs() %>% 
  sg_nodes(nodes, id, label, size) %>% 
  sg_edges(edges, id, source, target) %>% 
  sg_layout(layout = igraph::layout_components)

Looks a lot better, doesn’t it? Next, we color the nodes by cluster with sg_cluster:

sigmajs() %>% 
  sg_nodes(nodes, id, label, size) %>% 
  sg_edges(edges, id, source, target) %>% 
  sg_layout(layout = igraph::layout_components) %>% 
  sg_cluster(
    colors = c(
      "#60dd8e",
      "#3f9f7f",
      "#188a8d",
      "#17577e",
      "#141163"
      )
  ) %>% 
  sg_settings(
    minNodeSize = 1,
    maxNodeSize = 2.5,
    edgeColor = "default",
    defaultEdgeColor = "#d3d3d3"
  )

From the visualisation above, we can identify each cluster of “interactions” and see which users have the most influence within the #TheyAreUs campaign.
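
One simple way to back up the visual impression with numbers is to rank users by how often they were retweeted. The sketch below does this on a toy edge list; treating the retweet count as a proxy for influence is an assumption made for illustration, and the handles are hypothetical.

```r
# Toy edge list: source retweeted target n times.
toy_edges <- data.frame(
  source = c("alice", "bob", "carol", "dave"),
  target = c("erin",  "erin", "erin", "alice"),
  n      = c(1L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Total number of retweets received by each target.
retweeted <- aggregate(n ~ target, data = toy_edges, FUN = sum)

# Most retweeted users first.
retweeted <- retweeted[order(retweeted$n, decreasing = TRUE), ]
top_user <- as.character(retweeted$target[1])
top_user
```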

[Optional]: Dynamic Edges

We’ve been visualising Twitter interactions in a static manner, but they are dynamic when you think of it. Twitter conversations happen over time; thus far, we’ve just been drawing all-encompassing snapshots. So let’s take the time factor into account to make a graph where the edges appear at different time steps.

Let’s use the same tweets data:

tweets <- read_csv("data_input/tweets_twinet.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   created_at = col_datetime(format = ""),
##   display_text_width = col_double(),
##   reply_to_status_id = col_logical(),
##   reply_to_user_id = col_logical(),
##   reply_to_screen_name = col_logical(),
##   is_quote = col_logical(),
##   is_retweet = col_logical(),
##   favorite_count = col_double(),
##   retweet_count = col_double(),
##   symbols = col_logical(),
##   ext_media_type = col_logical(),
##   quoted_status_id = col_logical(),
##   quoted_text = col_logical(),
##   quoted_created_at = col_logical(),
##   quoted_source = col_logical(),
##   quoted_favorite_count = col_logical(),
##   quoted_retweet_count = col_logical(),
##   quoted_user_id = col_logical(),
##   quoted_screen_name = col_logical(),
##   quoted_name = col_logical()
##   # ... with 27 more columns
## )
## See spec(...) for full column specifications.

Build

Now onto building the graph.

net <- tweets %>% 
  gt_edges(screen_name, mentions_screen_name, created_at) %>% 
  gt_nodes() %>% 
  gt_dyn() %>% 
  gt_collect()

Quite a few things differ from previous graphs we have built.

  1. We pass created_at in gt_edges. This in effect adds the created_at column to our edges, so that we know the creation time of the post in which the edge appears.
  2. We use gt_dyn which stands for dynamic, to essentially compute the time at which edges and nodes should appear on the graph.
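
To give an intuition for the second point, the sketch below works out, for a toy edge list, the earliest time at which each user is involved in an edge. This is only a guess at the kind of computation involved, not gt_dyn’s actual implementation; the toy data and the start column name are assumptions.

```r
# Toy edges with the time of the tweet that created them (hypothetical handles).
toy_edges <- data.frame(
  source = c("alice", "bob", "carol"),
  target = c("dave",  "dave", "alice"),
  created_at = as.POSIXct(
    c("2019-03-23 10:00:00", "2019-03-23 09:00:00", "2019-03-24 08:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Stack sources and targets so each row is one appearance of a user.
node  <- c(toy_edges$source, toy_edges$target)
stamp <- rep(as.numeric(toy_edges$created_at), times = 2)

# Earliest appearance per user, converted back to a date-time.
first_seen <- vapply(split(stamp, node), min, numeric(1))
toy_nodes <- data.frame(
  id    = names(first_seen),
  start = as.POSIXct(unname(first_seen), origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)
toy_nodes
```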

Visualise

As we did earlier, we first need to unpack both edges and nodes:

c(edges, nodes) %<-% net # unpack

nodes <- nodes2sg(nodes)

Notice that after unpacking them, we have only prepared our nodes for the sigmajs visualisation. This is because the edges need a different preparation in order to appear dynamically on the graph.

The way this works in sigmajs is by specifying the delay in milliseconds before each respective edge should be added. Therefore, we need to turn the dates into numbers and rescale them to a reasonable range: we don’t want the edges to actually take 15 hours to appear on the graph.
1. We change the date-time column (POSIXct actually) to a numeric, which gives the number of seconds since 1970-01-01 UTC.
2. We rescale between 0 and 1, then multiply by 10,000 (milliseconds) so that the edges are added over 10 seconds.

edges <- edges %>% 
  mutate(
    id = 1:n(),
    created_at = as.numeric(created_at),
    created_at = (created_at - min(created_at)) / (max(created_at) - min(created_at)),
    created_at = created_at * 10000
  ) %>% 
  select(id, source, target, created_at)
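
The rescaling can be checked by hand on a few timestamps. The sketch below uses three made-up date-times spanning 15 hours; the values are assumptions chosen so the arithmetic is easy to follow.

```r
# Three toy timestamps: start, 7.5 hours later, 15 hours later.
created_at <- as.POSIXct(
  c("2019-03-23 00:00:00", "2019-03-23 07:30:00", "2019-03-23 15:00:00"),
  tz = "UTC"
)

# as.numeric() on POSIXct gives seconds since 1970-01-01 UTC.
secs <- as.numeric(created_at)
diff(secs)  # 27000 27000, i.e. 7.5 hours in seconds

# Min-max rescale to [0, 1], then stretch to 10,000 milliseconds.
delay <- (secs - min(secs)) / (max(secs) - min(secs)) * 10000
delay  # 0 5000 10000
```

Because the values are rescaled to the 0 to 1 range first, the original unit does not affect the result: the earliest edge gets a delay of 0 and the latest a delay of 10,000 milliseconds.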

Now, the actual visualisation. As mentioned at the beginning of this section, we’ll plot the nodes then add edges dynamically. Let’s break it down step by step.

First, we plot the nodes.

sigmajs() %>% 
  sg_nodes(nodes, id, size, label) 

We’ll add the layout, as the graph looks a bit messy with nodes randomly scattered across the canvas. We have to compute the layout differently this time: we cannot simply use sg_layout, as it requires both nodes and edges and we only have nodes on the graph (since edges are to be added later on, dynamically). Instead we use sg_get_layout.

nodes <- sg_get_layout(nodes, edges, layout = igraph::layout_components)
## Warning in if (class(newval) == "factor") {: the condition has length > 1 and
## only the first element will be used

## Warning in if (class(newval) == "factor") {: the condition has length > 1 and
## only the first element will be used
head(nodes)
##               id          label               start                 end type
## 1      __choeeey      __choeeey 2019-03-24 00:55:35 2019-03-28 15:50:45 user
## 2 __interfaith__ __interfaith__ 2019-03-27 11:47:53 2019-03-28 15:50:45 user
## 3          _19bm          _19bm 2019-03-23 07:37:26 2019-03-28 15:50:45 user
## 4  _alleiahmalik  _alleiahmalik 2019-03-23 13:43:37 2019-03-28 15:50:45 user
## 5   _denchtastic   _denchtastic 2019-03-24 08:31:10 2019-03-28 15:50:45 user
## 6        10dubai        10dubai 2019-03-23 13:32:56 2019-03-28 15:50:45 user
##   size           x          y
## 1    1  49.8069419 -102.04333
## 2    7  32.5864390 -139.74546
## 3    1  91.9909276   72.36279
## 4    1 -93.8590137   13.77834
## 5    1  99.7119963   66.74064
## 6    1   0.7785031  -38.10881

Notice that sg_get_layout computes the coordinates of the nodes (x and y) and adds them to our nodes dataframe.
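
For intuition, the same kind of coordinates can be obtained by calling igraph directly on a toy edge list, as sketched below. This assumes the igraph package is installed; it illustrates the idea of precomputing a layout and is not a description of what sg_get_layout does internally.

```r
library(igraph)

# Toy edge list (hypothetical handles).
toy_edges <- data.frame(
  source = c("alice", "bob", "carol"),
  target = c("dave",  "dave", "alice"),
  stringsAsFactors = FALSE
)

# Build a graph from the edge list and compute a two-column layout matrix.
g <- graph_from_data_frame(toy_edges, directed = TRUE)
coords <- layout_components(g)

# Attach the coordinates to a nodes data frame.
toy_nodes <- data.frame(
  id = V(g)$name,
  x  = coords[, 1],
  y  = coords[, 2],
  stringsAsFactors = FALSE
)
toy_nodes
```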

Now we can simply pass the coordinates x and y to sg_nodes.

sigmajs() %>% 
  sg_nodes(nodes, id, size, label, x, y) 

Now we have something that looks like a graph, except it’s missing edges. Let’s add them.

We add the edges almost exactly as we did before, except that we use sg_add_edges instead of sg_edges. Other than the function name, the only difference is that we pass created_at as the delay. We also set cumsum = FALSE; otherwise the function would compute the cumulative sum of the delay column. Our created_at values are already absolute delays measured from the start, so they must not be accumulated.

sigmajs() %>% 
  sg_nodes(nodes, id, size, label, x, y) %>% 
  sg_add_edges(edges, created_at, id, source, target, cumsum = FALSE, refresh = TRUE)
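
To see why the cumulative sum would distort the timing here, compare the delays as computed with their running total. The numbers below are made up for illustration.

```r
# Delays already measured from the start of the animation, in milliseconds.
delay <- c(0, 2500, 5000, 10000)

# What a cumulative sum would turn them into.
accumulated <- cumsum(delay)
accumulated  # 0 2500 7500 17500
```

With the cumulative sum, the last edge would be scheduled at 17.5 seconds rather than the intended 10 seconds. A cumulative sum makes sense when each delay is a wait relative to the previous edge, which is not the case for our rescaled created_at column.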