Introduction

In the two previous notebooks, we trained a sentiment analysis model and created an API to query live tweets, analyse them and store them in a NoSQL database. This pipeline ran during the France vs Argentina match of 30/06/2018 at the 2018 World Cup. In this notebook, we will dig a bit deeper into those data and analyse several aspects of them.

In [1]:
import numpy as np
import json
import datetime
import tqdm

import seaborn as sns

from collections import Counter
from nltk.corpus import stopwords

import pandas as pd
from bson import json_util

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as md
from matplotlib import dates

from wordcloud import WordCloud

import pymongo
from pymongo import MongoClient

Creating the DataFrame

With so little data, and not being very comfortable with NoSQL databases, we will first load everything into a DataFrame.

In [2]:
client = MongoClient('localhost', 27017)

db = client['Twitter_db']
collection_clean = db['tweets_clean']
In [3]:
print(json.dumps(collection_clean.find_one(), indent=4, default=json_util.default))
{
    "_id": 1013053260956098561,
    "text": "Come on France. #worldcup2018 #FRAARG",
    "time": {
        "$date": 1530372893000
    },
    "hashtags": [
        "worldcup2018",
        "FRAARG"
    ],
    "sentiment": 0.6849161215932269,
    "tokens": [
        "come",
        "on",
        "franc"
    ]
}

This is what is stored for every tweet: the complete text, the tokens produced with a TweetTokenizer and a Stemmer from nltk, all hashtags, and the predicted sentiment. Now we can create an empty DataFrame to be filled with the records from MongoDB.
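As a reminder of how such tokens can be obtained, here is a minimal sketch using nltk; the exact tokenizer options and the choice of PorterStemmer are assumptions, the real pipeline lives in the previous notebook.

from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stemmer = PorterStemmer()

def tokenize_tweet(text):
    # keep only purely alphabetic tokens (drops hashtags, URLs and punctuation)
    words = [t for t in tokenizer.tokenize(text) if t.isalpha()]
    return [stemmer.stem(w) for w in words]

tokenize_tweet("Come on France. #worldcup2018 #FRAARG")
# -> ['come', 'on', 'franc']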

In [4]:
nb = collection_clean.count_documents({})
df = pd.DataFrame(index=range(nb), columns=['ID','Text', "Time",'Hashtags','Sentiment','tokens'])

for i, record in enumerate(collection_clean.find()):
    obj = {
        'ID' : record["_id"],
        'Text' : record["text"],
        'Time' : record["time"],
        'Hashtags' : "-".join(record["hashtags"]),
        'Sentiment' : record["sentiment"],
        'tokens' : "-".join(record["tokens"])
    }

    df.iloc[i, :] = obj
    
df = df.set_index("ID")
In [5]:
client.close()
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 112700 entries, 1013053260956098561 to 1013096849245290496
Data columns (total 5 columns):
Text         112700 non-null object
Time         112700 non-null object
Hashtags     112700 non-null object
Sentiment    112700 non-null object
tokens       112700 non-null object
dtypes: object(5)
memory usage: 5.2+ MB
In [7]:
df.to_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%")

We have now converted our 112700 tweets to a DataFrame and saved it to CSV so we don't have to repeat those steps later.

Exploration

Now we will explore the content from several angles, but first let's prepare the required data.

In [8]:
df = pd.read_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%", index_col=0)
df.tokens = df.tokens.fillna("N/A")
df.Hashtags = df.Hashtags.fillna("N/A")
df['Time'] = pd.to_datetime(df['Time'])
In [9]:
df.head(5)
Out[9]:
Text Time Hashtags Sentiment tokens
ID
1013053260956098561 Come on France. #worldcup2018 #FRAARG 2018-06-30 15:34:53 worldcup2018-FRAARG 0.684916 come-on-franc
1013053268950315008 This argentina bench actually has more talent ... 2018-06-30 15:34:55 N/A 0.832588 this-argentina-bench-actual-has-more-talent-th...
1013053270313619456 Let them know Messi, let them know. #WorldCup1... 2018-06-30 15:34:55 WorldCup18-FRAARG 0.824827 let-them-know-messi-let-them-know
1013053272758759425 #FRAARG \r\n\r\nI would be surprised if #Arg b... 2018-06-30 15:34:56 FRAARG-Arg-FRA 0.466349 would-be-surpris-if-beat-that-would-be-an-upse...
1013053274872799232 Today šŸ˜\r\n\r\n#FRAARG #URUPOR https://t.co/7t... 2018-06-30 15:34:57 FRAARG-URUPOR 0.541423 today

First, we can look at all tokens and their frequencies. To do so, we remove the standard English stopwords as well as the two artificial "words" created during preprocessing, "three_dot" and "exc_mark".

Most Common Words

In [10]:
stopWords = stopwords.words('english')
stopWords += ['three_dot', 'exc_mark', "N/A"]
results = Counter()
df['tokens'].str.split("-").apply(results.update)

for word in stopWords:
    if word in results:
        del results[word]
In [11]:
results.most_common(50)
Out[11]:
[('argentina', 17392),
 ('franc', 15465),
 ('game', 13305),
 ('goal', 11682),
 ('mbapp', 11275),
 ('messi', 11196),
 ('match', 6240),
 ('world', 5906),
 ('go', 5467),
 ('2', 4832),
 ('team', 4831),
 ('di', 4772),
 ('maria', 4707),
 ('like', 4573),
 ('play', 4510),
 ('cup', 4272),
 ('1', 3930),
 ('one', 3660),
 ('win', 3653),
 ('vote', 3537),
 ('score', 3533),
 ('get', 3377),
 ('time', 3238),
 ('good', 3075),
 ('player', 3008),
 ('watch', 2932),
 ('best', 2735),
 ('footbal', 2701),
 ('see', 2693),
 ('look', 2693),
 ('pavard', 2642),
 ('4', 2606),
 ('come', 2592),
 ('today', 2496),
 ('penalti', 2319),
 ('great', 2314),
 ('3', 2252),
 ('fuck', 2217),
 ('back', 2193),
 ('tap', 2094),
 ('well', 2083),
 ('vs', 2049),
 ('maradona', 2036),
 ('half', 1947),
 ('far', 1917),
 ('make', 1903),
 ('live', 1882),
 ('strike', 1868),
 ('take', 1858),
 ('french', 1833)]

If we explore the result (I did the check on the top 1000 but only display the top 50 for readability), we can see that several players, for example, appear under multiple spellings. The worst case is Mbappe, which is written in five different ways.

In [12]:
print("mbapp:", results["mbapp"])
print("mbappe:", results["mbappe"])
print("mbappƩ:", results["mbappƩ"])
print("bappe:", results["bappe"])
print("bappƩ:", results["bappƩ"])
print("kilian:", results["kilian"])
print("killian:", results["killian"])
mbapp: 11275
mbappe: 322
mbappƩ: 755
bappe: 11
bappƩ: 4
kilian: 7
killian: 26

Due to all the cleanup required, this step will be continued a bit later.
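As a hint of what that cleanup could look like, here is a minimal sketch that folds the spelling variants found above into a single count (the variant list is only the one identified here, not an exhaustive mapping):

# merge the hypothetical list of Mbappe spelling variants into the main token count
mbappe_variants = ["mbappe", "mbappé", "bappe", "bappé", "kilian", "killian"]

for variant in mbappe_variants:
    if variant in results:
        results["mbapp"] += results.pop(variant)

print("mbapp (merged):", results["mbapp"])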

Most Common Hashtags

We can do the same with hashtags, but in this case no cleanup is required and we can look at the "balance" using a word cloud.

In [13]:
results_tag = Counter()
df['Hashtags'].str.split("-").apply(results_tag.update)

for word in stopWords:
    if word in results_tag:
        del results_tag[word]

wordcloud = WordCloud().generate_from_frequencies(results_tag)

plt.figure(figsize=(20,12))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Analysis of tweet frequencies during the match

We can group the DataFrame by minute and, for now, just count the number of tweets. We can then plot the result and overlay some specific events (the goals) to see their effect.

In [14]:
agg = df.Time.groupby([df.Time.dt.hour, df.Time.dt.minute]).agg(["min", "count"])
x = agg["min"].values
y = agg["count"].values

agg.head()
Out[14]:
min count
Time Time
15 34 2018-06-30 15:34:53 8
35 2018-06-30 15:35:00 64
36 2018-06-30 15:36:01 71
37 2018-06-30 15:37:00 65
38 2018-06-30 15:38:00 78
In [15]:
start_first = datetime.datetime(2018, 6, 30, 16, 0)
end_first = datetime.datetime(2018, 6, 30, 16, 47)
start_second = datetime.datetime(2018, 6, 30, 17, 2)
end_second = datetime.datetime(2018, 6, 30, 17, 51)

goal_time = [
    datetime.datetime(2018, 6, 30, 16, 13), 
    datetime.datetime(2018, 6, 30, 16, 40),
    datetime.datetime(2018, 6, 30, 17, 5),
    datetime.datetime(2018, 6, 30, 17, 14),
    datetime.datetime(2018, 6, 30, 17, 21),
    datetime.datetime(2018, 6, 30, 17, 25),
    datetime.datetime(2018, 6, 30, 17, 50)
]

goal_tweet = []
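# for each goal, find the per-minute tweet count whose timestamp is within 60 s of the goal (used to place the annotations)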
for goal in np.array(goal_time, dtype='datetime64[ns]'):
    for nb, time in zip(y, x):
        if abs((time - goal) / np.timedelta64(1, 's')) < 60:
            goal_tweet.append(nb)
            break
            
goal_team = ["Fr", "Arg", "Arg", "Fr", "Fr", "Fr", "Arg"]
goal_color = ["red" if team == "Fr" else "blue" for team in goal_team]
goal_player = ["Griezmann", "Di Maria", "Mercado", "Pavard", "Mbappe", "Mbappe", "Aguero"]
delta_x = [-0.020, -0.005, 0.010, 0.005, 0.005, 0.005, 0.01]
delta_y = [100,    -200,    -100,  -200,  -600,  -650,  0.050]

fault = datetime.datetime(2018, 6, 30, 16, 10)
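# same lookup for the foul on Mbappe that led to the penalty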
for nb, time in zip(y, x):
    if abs((time - np.array(fault, dtype='datetime64[ns]')) / np.timedelta64(1, 's')) < 60:
        tweet_fault = nb
        break
In [16]:
fig = plt.figure(figsize=(20, 12))
ax = fig.add_subplot(111)

plt.plot(x, y)

plt.axvline(x=start_first)
plt.axvline(x=end_first)
plt.axvline(x=start_second)
plt.axvline(x=end_second)

plt.axvspan(start_first, end_first, alpha=0.3, color='green', label="first half")
plt.axvspan(start_second, end_second, alpha=0.3, color='orange', label="second half")

plt.scatter(goal_time, goal_tweet, c=goal_color)


for i in range(7):
    ax.annotate(goal_player[i], 
                xy=(md.date2num(goal_time[i]), goal_tweet[i]), 
                xytext=(md.date2num(goal_time[i])+delta_x[i], goal_tweet[i] + delta_y[i]),
                arrowprops=dict(facecolor=goal_color[i], shrink=0.05),
                color=goal_color[i],
                fontsize=20
            )

ax.annotate("Fault on Mbappe (Penalty)", 
            xy=(md.date2num(fault), tweet_fault), 
            xytext=(md.date2num(fault) - 0.001, tweet_fault -300),
            arrowprops=dict(facecolor="black", shrink=0.05),
            color="black",
            fontsize=20
    )


ax.xaxis.set_major_locator(dates.MinuteLocator(byminute=[0,15,30,45], interval = 1))
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))

plt.ylabel("Number of Tweets", fontsize=15)
plt.title("Evolution of tweets posted during the match", fontsize=20)
plt.legend()
ax.grid(True)
plt.show()

A good way to find the start of each peak is to differentiate the curve and look for local maxima. Once a peak is found, we can extract the surrounding period and see which tokens are the most frequent there.

In [17]:
dy = []
dx = []
time_to_explore = []

# finite differences: tweets-per-second slope between consecutive minute buckets
for i in range(1, len(x)-1):
    dy.append((y[i+1] - y[i])/((x[i+1] - x[i]) / np.timedelta64(1, 's')))
    dx.append(x[i] + (x[i+1] - x[i])/2)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (20,12))

ax1.plot(dx, dy)
ax2.plot(x, y)

# flag points where the slope is high and still increasing (likely the start of a spike)
# and keep a 6-minute window after each one as a period to explore
for i in range(1, len(dy)-1):
    if dy[i] > 5:
        if dy[i] - dy[i-1] > 1:
            ax1.scatter(dx[i], dy[i])
            ax2.scatter(dx[i], y[i+1])
            time_to_explore.append((dx[i], dx[i] + np.timedelta64(360, 's')))
            plt.axvspan(dx[i], dx[i] + np.timedelta64(360, 's'), alpha=0.3, color='green', label="periods to explore")

plt.legend()
plt.show()