Introduction

In this notebook, we will set up a model for sentiment analysis. To do so, we will use a well-known dataset named "sentiment140". It was built from 1.6M tweets selected based on smileys: if a tweet contains a "happy smiley" it is assumed to be positive, otherwise negative. The smileys are then removed. I mention this point because it will be relevant at the end.

In [1]:
import pandas as pd
import numpy as np

import re
import tqdm

import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

Cleaning the Dataset

To do so, we only keep the text and the target. Then, based on exploration, we apply regular expressions to clean the sentences:

  • remove hashtags
  • remove user names (@foo)
  • remove URLs
  • merge some vocabulary (for example "can't"/"cant", "isn't"/"isnt")
  • replace "..." by "three_dots" so it is kept as a token by the model
  • replace "!!!!" by "exc_mark" so it is kept as a token by the model
  • collapse repeated characters, keeping at most two of them (for example looooooooooooooool = lool != lol); I keep two rather than one for safety, so that "foot" does not become "fot" (see the quick demo below)
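
As a quick illustration of that last rule, here is a minimal sketch using the same regular expression as the my_regex helper defined below (the example strings are only for demonstration):

import re

def collapse(s):
    # keep at most two consecutive occurrences of any character
    return re.sub(r'(.)\1+', r'\1\1', s)

print(collapse("looooooooooool"))  # -> 'lool' (still distinguishable from 'lol')
print(collapse("foot"))            # -> 'foot' (legitimate double letters survive)
print(collapse("!!!!!!"))          # -> '!!'   (later replaced by 'exc_mark')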
In [2]:
df = pd.read_csv("F:/Twitter_data/dataset/sentiment_140.csv", encoding='latin1', header=None)
df.columns = ["target", "id", "date", "flag", "user", "text"]
In [3]:
df.head()
Out[3]:
target id date flag user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
target    1600000 non-null int64
id        1600000 non-null int64
date      1600000 non-null object
flag      1600000 non-null object
user      1600000 non-null object
text      1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB
In [5]:
df.target /= 4  # labels are 0 (negative) and 4 (positive); rescale them to 0/1
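
As a quick optional check (a minimal sketch), the labels should now only take the values 0 and 1:

df.target.value_counts()  # expected to show only 0.0 and 1.0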
In [6]:
X = df["text"]
y = df["target"]
del df
In [10]:
X
Out[10]:
0          @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          is upset that he can't update his Facebook by ...
2          @Kenichan I dived many times for the ball. Man...
3            my whole body feels itchy and like its on fire 
4          @nationwideclass no, it's not behaving at all....
5                              @Kwesidei not the whole crew 
6                                                Need a hug 
7          @LOLTrish hey  long time no see! Yes.. Rains a...
8                       @Tatiana_K nope they didn't have it 
9                                  @twittera que me muera ? 
10               spring break in plain city... it's snowing 
11                                I just re-pierced my ears 
12         @caregiving I couldn't bear to watch it.  And ...
13         @octolinz16 It it counts, idk why I did either...
14         @smarrison i would've been the first, but i di...
15         @iamjazzyfizzle I wish I got to watch it with ...
16         Hollis' death scene will hurt me severely to w...
17                                      about to file taxes 
18         @LettyA ahh ive always wanted to see rent  lov...
19         @FakerPattyPattz Oh dear. Were you drinking ou...
20         @alydesigns i was out most of the day so didn'...
21         one of my friend called me, and asked to meet ...
22          @angry_barista I baked you a cake but I ated it 
23                    this week is not going as i had hoped 
24                                blagh class at 8 tomorrow 
25            I hate when I have to call and wake people up 
26         Just going to cry myself to sleep after watchi...
27                                    im sad now  Miss.Lilly
28         ooooh.... LOL  that leslie.... and ok I won't ...
29         Meh... Almost Lover is the exception... this t...
                                 ...                        
1599970    Thanks @eastwestchic &amp; @wangyip Thanks! Th...
1599971    @marttn thanks Martin. not the most imaginativ...
1599972            @MikeJonesPhoto Congrats Mike  Way to go!
1599973    http://twitpic.com/7jp4n - OMG! Office Space.....
1599974    @yrclndstnlvr ahaha nooo you were just away fr...
1599975    @BizCoachDeb  Hey, I'm baack! And, thanks so m...
1599976    @mattycus Yeah, my conscience would be clear i...
1599977    @MayorDorisWolfe Thats my girl - dishing out t...
1599978                            @shebbs123 i second that 
1599979                                       In the garden 
1599980    @myheartandmind jo jen by nemuselo zrovna té ...
1599981    Another Commenting Contest! [;: Yay!!!  http:/...
1599982    @thrillmesoon i figured out how to see my twee...
1599983    @oxhot theri tomorrow, drinking coffee, talkin...
1599984    You heard it here first -- We're having a girl...
1599985    if ur the lead singer in a band, beware fallin...
1599986                @tarayqueen too much ads on my blog. 
1599987    @La_r_a NEVEER  I think that you both will get...
1599988    @Roy_Everitt ha- good job. that's right - we g...
1599989                   @Ms_Hip_Hop im glad ur doing well 
1599990                                WOOOOO! Xbox is back 
1599991    @rmedina @LaTati Mmmm  That sounds absolutely ...
1599992                    ReCoVeRiNg FrOm ThE lOnG wEeKeNd 
1599993                                    @SCOOBY_GRITBOYS 
1599994    @Cliff_Forster Yeah, that does work better tha...
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of alll time!!! ...
1599999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, Length: 1600000, dtype: object
In [8]:
def my_regex(x):
    # collapse any character repeated more than twice down to two occurrences
    return re.sub(r'(.)\1+', r'\1\1', x)
In [12]:
X1 = X.str.replace("@\S+", "")  # remove name
X1 = X1.str.replace("#\S+", "") # remove hashtag

X1 = X1.apply(my_regex)  # remove all repeated characters

X1 = X1.str.replace("\.\.", " three_dots ") # convert ... to three_dots to be like a word
X1 = X1.str.replace("\!\!", " exc_mark ") # convert ... to exc_mark to be like a word

X1 = X1.str.replace("'t", "t")

X1 = X1.str.replace("https?://\S+", "")
X1 = X1.str.replace("www.\S+.\S{2, 4}", "")

X1 = X1.str.replace("\s([0-9.,]+)\s", "") # remove numbers
In [13]:
X1
Out[13]:
0            - Aww, that's a bummer.  You shoulda got Dav...
1          is upset that he cant update his Facebook by t...
2           I dived many times for the ball. Managed to s...
3            my whole body feels itchy and like its on fire 
4           no, it's not behaving at all. i'm mad. why am...
5                                        not the whole crew 
6                                                Need a hug 
7           hey  long time no see! Yes three_dots  Rains ...
8                                   nope they didnt have it 
9                                            que me muera ? 
10         spring break in plain city three_dots  it's sn...
11                                I just re-pierced my ears 
12          I couldnt bear to watch it.  And I thought th...
13          It it counts, idk why I did either. you never...
14          i would've been the first, but i didnt have a...
15          I wish I got to watch it with you exc_mark  I...
16         Hollis' death scene will hurt me severely to w...
17                                      about to file taxes 
18          ahh ive always wanted to see rent  love the s...
19          Oh dear. Were you drinking out of the forgott...
20          i was out most of the day so didnt get much d...
21         one of my friend called me, and asked to meet ...
22                         I baked you a cake but I ated it 
23                    this week is not going as i had hoped 
24                                   blagh class attomorrow 
25            I hate when I have to call and wake people up 
26         Just going to cry myself to sleep after watchi...
27                                    im sad now  Miss.Lilly
28         ooh three_dots  LOL  that leslie three_dots  a...
29         Meh three_dots  Almost Lover is the exception ...
                                 ...                        
1599970    Thanks  &amp;  Thanks! That was just what I wa...
1599971     thanks Martin. not the most imaginative inter...
1599972                            Congrats Mike  Way to go!
1599973     - OMG! Office Space three_dots  I wanna steal...
1599974     ahaha noo you were just away from everyone el...
1599975      Hey, I'm baack! And, thanks so much for all ...
1599976     Yeah, my conscience would be clear in that ca...
1599977     Thats my girl - dishing out the &quot;advice&...
1599978                                       i second that 
1599979                                       In the garden 
1599980      jo jen by nemuselo zrovna té holce ael co nic 
1599981      Another Commenting Contest! [;: Yay exc_mark   
1599982     i figured out how to see my tweets and facebo...
1599983     theri tomorrow, drinking coffee, talking abou...
1599984    You heard it here first -- We're having a girl...
1599985    if ur the lead singer in a band, beware fallin...
1599986                            too much ads on my blog. 
1599987     NEVEER  I think that you both will get on wel...
1599988     ha- good job. that's right - we gotta throw t...
1599989                               im glad ur doing well 
1599990                                   WOO! Xbox is back 
1599991      Mmm  That sounds absolutely perfect three_do...
1599992                    ReCoVeRiNg FrOm ThE lOnG wEeKeNd 
1599993                                                     
1599994     Yeah, that does work better than just waiting...
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of all time exc_...
1599999                                              happy  
Name: text, Length: 1600000, dtype: object

Baseline Model

First, we will not go into more detail on the tweets and will simply apply a TF-IDF vectorizer and a model to get a baseline.

In [25]:
tfv=TfidfVectorizer(min_df=0, max_features=None, strip_accents='unicode',lowercase =True,
                    analyzer='word', token_pattern=r'\w{3,}', ngram_range=(1,1),
                    use_idf=True,smooth_idf=True, sublinear_tf=True, stop_words = "english")  
In [27]:
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=42)
In [28]:
X_train_transform = tfv.fit_transform(X_train)
X_test_transform = tfv.transform(X_test)
In [30]:
X_train_transform.shape
Out[30]:
(1280000, 586147)

Our matrix only uses 1-gram words and still has a vocabulary of 586,147 terms.
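
As an optional check (a small sketch reusing the objects fitted above), we can confirm the vocabulary size and see how sparse the TF-IDF matrix is:

print(len(tfv.vocabulary_))  # number of terms, matches the second dimension above
print(X_train_transform.nnz / (X_train_transform.shape[0] * X_train_transform.shape[1]))  # fraction of non-zero entries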

In [ ]:
#specify model and parameters
model=LogisticRegression(C=1.)

#fit model
model.fit(X_train_transform, y_train)
In [40]:
#make predictions on the train and test data
pred_train=model.predict_proba(X_train_transform)[:,1]
pred_test=model.predict_proba(X_test_transform)[:,1]

#check the AUC (Area Under the ROC Curve) to see how well the score discriminates between negative and positive
print (" auc " , roc_auc_score(y_train, pred_train))
print (" auc " , roc_auc_score(y_test, pred_test))

#print the first 10 scores as a sanity check
print (pred_test[:10])
 auc  0.8956017262751066
 auc  0.8617762997561205
[0.73061013 0.62599728 0.67317318 0.05010197 0.4804858  0.63243171
 0.6851576  0.43986609 0.00974582 0.99346639]
In [49]:
idx_weight = np.argsort(model.coef_[0])  # feature indices sorted from most negative to most positive weight
In [45]:
inversed = { value: key for key, value in tfv.vocabulary_.items() }  # map feature index back to its word
In [51]:
for idx in idx_weight[:10]:  # 10 most negative words
    print(inversed[idx], " => ", model.coef_[0][idx])
sad  =>  -12.778844971279486
miss  =>  -8.761970325884333
sadly  =>  -8.671690392259233
poor  =>  -8.170278714517776
unfortunately  =>  -7.617925255903621
sucks  =>  -7.218759033954542
bummed  =>  -7.151845350574452
missing  =>  -7.087200029087839
gutted  =>  -7.065545068431576
sick  =>  -6.889312048800282
In [54]:
for idx in idx_weight[-10:]:  # 10 most positive words
    print(inversed[idx], " => ", model.coef_[0][idx])
pleasure  =>  4.383097453236274
tweeteradder  =>  4.45996142398864
smiling  =>  4.53934889692321
congrats  =>  4.573540572191535
glad  =>  4.665180957066852
smile  =>  4.924781976595472
congratulations  =>  5.1171661936577895
welcome  =>  6.1123945200887855
thank  =>  6.463801443258966
thanks  =>  6.527945847658128

Now we have a reasonably good AUC with little overfitting. Nevertheless, if we look at the words with the most impact on the prediction, we can see duplicates such as "thank"/"thanks" and "congrats"/"congratulations". That is why we need a more in-depth preparation.

Better pre-processing

For this step, we will use the tweet tokenizer, which is better suited to tweets and their more unusual grammar. Then, to reduce the number of words, we will apply a stemmer to keep only the root of each word, merging words like "queries", "query" and "querying" into the same token, "queri" in that case.

Because of the TweetTokenizer, we may end up with single punctuation marks or single letters as tokens (it tries to keep smileys). We remove those as well.
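
As a quick illustration (a minimal sketch; the example sentence is made up), the TweetTokenizer keeps smileys together as single tokens, and the Snowball stemmer reduces inflected forms to a common root:

tok = TweetTokenizer()
stem = SnowballStemmer("english")

print(tok.tokenize("I love querying tweets :)"))
# e.g. ['I', 'love', 'querying', 'tweets', ':)'] -- the smiley stays in one piece
print([stem.stem(w) for w in ["queries", "query", "querying"]])
# -> ['queri', 'queri', 'queri']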

In [14]:
def dummy_fun(doc):
    # identity function: the TF-IDF vectorizer will receive already-tokenized documents
    return doc

# build the tokenizer and stemmer once instead of once per tweet
tokenizer = TweetTokenizer()
stemmer = SnowballStemmer("english")

def clean(x):
    # tokenize, drop single-character tokens (stray punctuation/letters) and stem the rest
    tokens = tokenizer.tokenize(x)
    return [stemmer.stem(word) for word in tokens if len(word) > 1]
In [15]:
X_1 = []
for sentence in tqdm.tqdm(X1):
    X_1.append(clean(sentence))
    
X1 = X.tolist()  # keep the original raw tweets so we can inspect examples later
100%|██████████████████████████████████████████████████████████████████████| 1600000/1600000 [04:30<00:00, 5910.83it/s]

Now we should have far fewer words. We can apply the TF-IDF and, in addition, drop every word that appears in fewer than 5 documents. This brings the unigram vocabulary down to around 30k words, so we can afford to add 2-grams to the vocabulary as well. It is slower but gives better results.

In [16]:
tfidf = TfidfVectorizer(min_df=5, 
                        max_features=None, 
                        strip_accents='unicode',
                        lowercase =True,
                        analyzer='word', 
                        ngram_range=(1,2),
                        use_idf=True, 
                        smooth_idf=True, 
                        sublinear_tf=True, 
                        stop_words = None, #"english", 
                        tokenizer=dummy_fun, 
                        preprocessor=dummy_fun)  
In [17]:
X_train, X_test, y_train, y_test, init_train, init_test = train_test_split(X_1, y, X1, test_size=0.2, random_state=42)

X_train_transform = tfidf.fit_transform(X_train)
X_test_transform = tfidf.transform(X_test)
In [18]:
X_train_transform.shape
Out[18]:
(1280000, 295052)

Even with 2-grams, our dictionary is much lighter, at about 295,000 terms. Now we can train models and fine-tune them.

I did not keep all the experiments, but I tried L1 and L2 penalties with several values of C. The best result is reached with an L1 penalty and C = 0.3: we have almost no overfitting and the AUC is above the previous model. A sketch of the kind of comparison loop used is shown below.
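
A minimal sketch of that comparison (the grid of values here is illustrative, not the exact one used):

# illustrative hyper-parameter search; liblinear supports both l1 and l2 penalties
for penalty in ["l1", "l2"]:
    for C in [0.1, 0.3, 1.0, 3.0]:
        m = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        m.fit(X_train_transform, y_train)
        auc = roc_auc_score(y_test, m.predict_proba(X_test_transform)[:, 1])
        print(penalty, C, round(auc, 4))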

In [20]:
#specify model and parameters
model=LogisticRegression(penalty='l1', C=0.3)  # on recent scikit-learn, also pass solver='liblinear' (or 'saga') for an L1 penalty

#fit model
model.fit(X_train_transform, y_train)
Out[20]:
LogisticRegression(C=0.3, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [21]:
pred_train=model.predict_proba(X_train_transform)[:,1]
pred_test=model.predict_proba(X_test_transform)[:,1]

print ("Train AUC : " , roc_auc_score(y_train, pred_train))
print ("Test AUC : " , roc_auc_score(y_test, pred_test))
Train AUC :  0.8977395289151319
Test AUC :  0.8944455770417095
In [22]:
from sklearn.metrics import accuracy_score

for t in range(40, 60, 5):
    print("Rate", t/100)
    print ("Train Acc. : " , accuracy_score(y_train, pred_train>t/100 ))
    print ("Test Acc. : " , accuracy_score(y_test, pred_test>t/100 ))
Rate 0.4
Train Acc. :  0.80464375
Test Acc. :  0.801515625
Rate 0.45
Train Acc. :  0.81315546875
Test Acc. :  0.80950625
Rate 0.5
Train Acc. :  0.8180921875
Test Acc. :  0.8148125
Rate 0.55
Train Acc. :  0.81872890625
Test Acc. :  0.815359375

In terms of accuracy, the result is not great. We will check why shortly, but first let's look at the words and their weights, as we did previously.

In [23]:
idx_weight = np.argsort(model.coef_[0])
inversed = { value: key for key, value in tfidf.vocabulary_.items() }
In [26]:
for idx in idx_weight[:50]:
    print(inversed[idx], " => ", model.coef_[0][idx])
sad  =>  -21.21056124724744
miss  =>  -14.716917432255471
poor  =>  -12.870395639744682
not happi  =>  -12.563533452059742
cant  =>  -12.081573217801306
sick  =>  -11.88037997943026
unfortun  =>  -10.944235784615682
depress  =>  -10.607840797101318
disappoint  =>  -10.551721093413594
hurt  =>  -10.405286120145677
hate  =>  -10.38763446225119
upset  =>  -10.250304218824947
not look  =>  -10.112668100651998
wish  =>  -10.07693420054282
cri  =>  -10.019714452212156
bummer  =>  -9.85788570737082
cancel  =>  -9.830784279154846
suck  =>  -9.786252314870064
broke  =>  -9.585485520110876
headach  =>  -9.534027443128409
lost  =>  -9.401723711727291
rip  =>  -9.229089286813736
air franc  =>  -9.224771814291797
sadden  =>  -9.207044905850355
pass away  =>  -9.104886054315351
not good  =>  -8.991351965485975
not cool  =>  -8.798331030267612
ugh  =>  -8.77816362303622
sadd  =>  -8.618030177100175
broken  =>  -8.59458566763673
lone  =>  -8.470450117633918
father day  =>  -8.393575757561562
saddest  =>  -8.348713584138153
bad  =>  -8.273044425956842
missin  =>  -8.220725865157835
horribl  =>  -8.20613572225263
not fun  =>  -8.204359143084057
sold out  =>  -8.02089748394297
boo  =>  -7.9984023178477806
die  =>  -7.94123446698408
funer  =>  -7.936886086508879
no  =>  -7.9040370564252855
noo  =>  -7.901483582082177
fail  =>  -7.696261615939354
not nice  =>  -7.666301903850101
iran  =>  -7.644679125042084
want  =>  -7.639434949264665
isnt  =>  -7.455857794759793
didnt  =>  -7.443137666530462
unfair  =>  -7.413446194449365
In [27]:
for idx in idx_weight[-50:]:
    print(inversed[idx], " => ", model.coef_[0][idx])
great  =>  5.351644348941758
didnt miss  =>  5.3985093250432605
pleasur  =>  5.41503465567103
couldnt resist  =>  5.44641887737757
heheh  =>  5.469564852498508
enjoy  =>  5.55688323154081
congrat  =>  5.590964171832272
noth wrong  =>  5.625265555491154
dont miss  =>  5.625571098327797
miss me  =>  5.669256543955395
no pain  =>  5.677395675773511
dont forget  =>  5.7098465281800115
nice  =>  5.828553893130498
not alon  =>  5.829690939208242
dont hate  =>  5.896274749638175
no school  =>  5.910846106539679
hehe  =>  5.955756071230851
dont need  =>  5.9638867686941115
cute  =>  5.971062802364617
cool  =>  6.009077592413637
congratul  =>  6.093079638388798
amaz  =>  6.130528656249716
no doubt  =>  6.1720584794851465
yay  =>  6.2201379739388285
never fail  =>  6.238140754977784
wont hurt  =>  6.267905717766572
never too  =>  6.479048101923512
awesom  =>  6.512105550458961
not problem  =>  6.564228846598438
good  =>  6.757880602348531
glad  =>  6.813956091130689
love  =>  6.917969396525922
excit  =>  7.012052179033998
no need  =>  7.013737326551238
proud  =>  7.1680357169266475
to worri  =>  7.338427927658341
welcom  =>  7.3803047464257245
thank  =>  7.678684761489882
happi  =>  8.145662186004994
=(  =>  8.26196364255093
no prob  =>  8.845679450094266
smile  =>  9.28889159764574
cannot wait  =>  9.68972720268853
dont worri  =>  9.763674530289258
not sad  =>  9.800908142841603
doesnt hurt  =>  10.1670147558745
not bad  =>  10.680289202658724
no problem  =>  12.606565964257802
no worri  =>  13.716947335534774
cant wait  =>  14.93130643436979

We no longer see duplicate words, which is a good sign, and most words make sense. The strange one is "=(", which is considered positive. We can also see that the 2-grams help capture some of the grammar, for example:

  • "good" is positive with +6.75
  • "not good" is negative with -8.99

We can look these weights up directly (see the sketch below).
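
A small lookup sketch for the two n-grams quoted above (it assumes both are present in the fitted vocabulary, which the weight listing confirms):

for ngram in ["good", "not good"]:
    print(ngram, " => ", model.coef_[0][tfidf.vocabulary_[ngram]])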
In [28]:
model.coef_[0][tfidf.vocabulary_["exc_mark"]]
Out[28]:
1.3194047289395907
In [37]:
model.coef_[0][tfidf.vocabulary_["three_dot"]]
Out[37]:
-2.071659496958665

As we might expect, "..." is considered negative and "!!!" positive. The second is less clear-cut: "!!!" acts more like an intensifier, since "yes!!!" is more positive than "yes" and "no!!!" is more negative than "no".

In [29]:
error = pred_test - y_test  # signed prediction error (kept for exploration, not used below)

tweet_text = np.array(init_test)  # original raw tweets of the test set

# positions of the negative (0) and positive (1) tweets in the test set
index_0 = np.argwhere(y_test == 0)
index_1 = np.argwhere(y_test == 1)

tweet_0 = tweet_text[index_0]
tweet_1 = tweet_text[index_1]

# predicted probabilities of being positive, split by true label
my_pred_0 = pred_test[index_0].flatten()
my_pred_1 = pred_test[index_1].flatten()

Check

Success Positive

Let's first look at cases where we correctly predict a positive sentiment for tweets labelled positive.

In [30]:
for i in np.argsort(my_pred_1)[-5:]:  # the last 5 indices have the highest predicted probability
    print("Predict {:.03f} - True {}".format(my_pred_1[i], "1"))
    print(tweet_1[i][0])
    print("\n")
Predict 1.000 - True 1
@MGiraudOfficial http://twitpic.com/7epvz - You look great. and happy. smiling is good.  haha. i love your smile.


Predict 1.000 - True 1
i like smiling 


Predict 1.000 - True 1
@COACHPARSELLS  smiles


Predict 1.000 - True 1
@Amy_LaRee &quot;&quot;&quot;&quot;SMILES &quot;&quot;&quot;&quot;  


Predict 1.000 - True 1
@itsuber smile. 


Success Negative

Now, cases where we correctly predict a negative sentiment for tweets labelled negative.

In [31]:
for i in np.argsort(my_pred_0)[:5]:  # the first 5 indices have the lowest predicted probability
    print("Predict {:.03f} - True {}".format(my_pred_0[i], "0"))
    print(tweet_0[i][0])
    print("\n")
Predict 0.000 - True 0
@Elliecopter_  Ellish sad?


Predict 0.000 - True 0
@RyanSeacrest @vianacoke 9how sad 


Predict 0.000 - True 0
@Nailhead SAD FAEC 


Predict 0.000 - True 0
@teffysnedgehead Sadness. 


Predict 0.000 - True 0
@marissalindh sadness 


Error Positive

Now let's look at cases where we predict a negative sentiment but the tweet is labelled positive.

In [32]:
for i in np.argsort(my_pred_1)[:5]:  # the first 5 indices have the lowest predicted probability
    print("Predict {:.03f} - True {}".format(my_pred_1[i], "1"))
    print(tweet_1[i][0])
    print("\n")
Predict 0.000 - True 1
@triciabuck I wish! Airports suck! 


Predict 0.000 - True 1
@GF_Steph I miss my Stephie-poo 


Predict 0.000 - True 1
@LabattBoo i miss my chocobo   ? http://blip.fm/~5ei99


Predict 0.000 - True 1
@mygaysecrets  I wish!


Predict 0.000 - True 1
@ScottSharman I wish! 


We can see that most of these errors are probably due to the dataset bias. Some tweets are labelled positive even though some of them are clearly negative (the first one, for example); they were most likely labelled positive because they ended with a ":)" (perhaps sarcastically). The last two are genuinely positive but predicted negative because of the weight of "wish" (-10.07).

Error Negative

Now let's look at cases where we predict a positive sentiment but the tweet is labelled negative.

In [33]:
for i in np.argsort(my_pred_0)[-5:]:  # the last 5 indices have the highest predicted probability
    print("Predict {:.03f} - True {}".format(my_pred_0[i], "0"))
    print(tweet_0[i][0])
    print("\n")
Predict 1.000 - True 0
@DuckDrake Thanks. 


Predict 1.000 - True 0
@jay_f_k thank u 


Predict 1.000 - True 0
@KristianaNKOTB  THANKS


Predict 1.000 - True 0
@oshidori Thanks. 


Predict 1.000 - True 0
@howbo15 thanks 


This is less logical... All of them are simply "Thanks" yet labelled negative. This is clearly a bias of the dataset caused by using smileys as labels.

No sentiment

Now let's look at the sentences the model considers to carry no sentiment (prediction closest to 0.5).

In [34]:
idx = np.argsort(np.abs(pred_test-0.5))  # sort by distance from the neutral score 0.5
for i in idx[:5]:  # the 5 most neutral predictions
    print("Positive : Predict {:.03f} %".format(pred_test[i]))
    print(init_test[i])
    print("\n")
Positive : Predict 0.500 %
@Hibippytea  how's ur sat been?


Positive : Predict 0.500 %
Sitting at Chillies, with folks, waiting on the food. I just want to sleep! 


Positive : Predict 0.500 %
I've gained some weight since I've been in Houston.  


Positive : Predict 0.500 %
Oh and Brad Roudebush is officially a nerd. Maybe I am just jelous because he is following my Dad, but not me 


Positive : Predict 0.500 %
@pleasurep thee only time we talk is on myspace lol what happened to tweeting me?? 


For some of them, it is simply because the sentence contains both positive and negative words that roughly balance out. For example, the second one has:

  • "just want", which is positive
  • "waiting", which is negative

We can verify this by decomposing its score into per-n-gram contributions (see the sketch below).

Save for reuse

In [35]:
from sklearn.externals import joblib  # on recent scikit-learn versions, use `import joblib` directly

joblib.dump(tfidf, 'F:/Twitter_data/models/tfidf.pkl') 
joblib.dump(model, 'F:/Twitter_data/models/log_reg.pkl') 
Out[35]:
['F:/Twitter_data/models/log_reg.pkl']
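
To reuse them later, here is a minimal reload sketch (note that the pickled vectorizer references dummy_fun, so that function must be importable when loading; the example tweets are made up):

import joblib

tfidf = joblib.load('F:/Twitter_data/models/tfidf.pkl')
model = joblib.load('F:/Twitter_data/models/log_reg.pkl')

new_tweets = ["I love this!", "so sad today"]
# in practice, apply the same regex cleaning as above before calling clean()
tokens = [clean(t) for t in new_tweets]
print(model.predict_proba(tfidf.transform(tokens))[:, 1])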

Conclusion

In this notebook, we learned how to clean tweets and train a model to predict the sentiment of a sentence. It will later be used to predict the sentiment of tweets about a specific subject.

We used several NLP tools (mainly from nltk) to deal with the approximate grammar of tweets compared to, say, newspaper text.