
Sentiment Analysis with Deep Neural Networks

October 6th, 2018 sentiment-analysis

In this post, we will train a Keras deep neural network on the Twitter sentiment corpus. This is a well-known dataset of over a million tweets labeled with sentiment, coming in at a hefty 150 MB.

There are many ways to train a sentiment analysis model on the Twitter corpus; in this tutorial, we will focus on a deep neural network.

If you want to follow along with the tutorial, I have hosted the dataset in its entirety at datadreams.ai/uploads/data/twitter-corpus.csv.
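
If you would rather fetch it from a script, here is a minimal sketch (it assumes the URL above is reachable over plain HTTPS, and it saves the file under the name the rest of the post expects):


import urllib.request

# Download the corpus and store it under the filename used later in the post.
url = 'https://datadreams.ai/uploads/data/twitter-corpus.csv'
urllib.request.urlretrieve(url, 'twitter-sentiment.csv')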

Taking a Peek at the Data

I have attached a mini version of the dataset with all the features below.

ID Sentiment Source Text

As you can see above, the source of the tweet has been anonymized. The two fields we care about are Sentiment and Text. Looking through a few example rows, you can see that the Sentiment column is a binary label, where 1 indicates positive sentiment and 0 indicates negative.
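
If you would like to inspect the raw file yourself before training, here is a quick sketch (it assumes you have saved the corpus locally as twitter-sentiment.csv, and it guesses at a latin-1 encoding, which is common for this dataset; adjust the path and encoding to taste):


import csv

# Print the header row and the first few tweets so we can eyeball the columns.
with open('twitter-sentiment.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 4:
            break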

A word about quality

This list was put together by people. Different people have different views on what constitutes positive or negative sentiment, and those judgments can depend heavily on one's mood that day. When you think about it, sentiment is by its very nature subjective. Colloquialisms and sarcasm make the picture even murkier. Humans disagree with each other on the sentiment of a piece of text more than 10% of the time, so I hope you will be impressed by how accurate the simple model in this tutorial can get in such an opaque, complex domain.

Imports


import numpy as np
import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
import h5py  # needed so Keras can save weights in HDF5 format

Features and Tokenizing

First we open the Twitter corpus and set our training and target fields. Then we create a new tokenizer, fit it on our text, and dump the resulting word index to a dictionary.


# column 1 is the sentiment label, column 3 is the tweet text
training = np.genfromtxt('twitter-sentiment.csv', delimiter=',', skip_header=1, usecols=(1, 3), dtype=None)

train_x = [str(x[1]) for x in training]
train_y = np.asarray([x[0] for x in training])

# only keep the 5000 most frequent words
max_words = 5000

# new tokenizer
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_x)

dictionary = tokenizer.word_index
# Let's save this out so we can use it later
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

Our Dictionary

Below is a snippet of our dictionary.json. Each token, or word, is mapped to its index, with lower indices assigned to more frequent words, so these are frequency ranks rather than raw counts. (The stray b', xc3, and xc2 tokens are byte-string artifacts from reading the raw CSV.)

{"b'": 1, "i": 2, "to": 3, "the": 4, "'": 5, "b": 6, "a": 7, "my": 8, "and": 9, "you": 10, "for": 11, "it": 12, "is": 13, "in": 14, "of": 15, "on": 16, "me": 17, "so": 18, "that": 19, "have": 20, "with": 21, "at": 22, "i'm": 23, "just": 24, "be": 25, "but": 26, "was": 27, "not": 28, "this": 29, "up": 30, "good": 31, "get": 32, "day": 33, "out": 34, "now": 35, "are": 36, "like": 37, "all": 38, "go": 39, "quot": 40, "no": 41, "b'i": 42, "your": 43, "http": 44, "love": 45, "do": 46, "got": 47, "going": 48, "from": 49, "work": 50, "today": 51, "too": 52, "u": 53, "com": 54, "xc3": 55, "it's": 56, "what": 57, "back": 58, "we": 59, "time": 60, "xc2": 61, "can": 62}

Clean up for training


# convert each tweet to a list of word indices
def convert_text_to_index_array(text):
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

indices = []
for text in train_x:
    word_indices = convert_text_to_index_array(text)
    indices.append(word_indices)

indices = np.asarray(indices)

# turn each tweet into a binary bag-of-words vector
train_x = tokenizer.sequences_to_matrix(indices, mode='binary')
# and one-hot encode the labels
train_y = keras.utils.to_categorical(train_y, 2)
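
To make the binary representation concrete, here is roughly what sequences_to_matrix(mode='binary') does for a single tweet. This is a hand-rolled sketch for illustration only, not something you need to run:


# Each tweet becomes a max_words-long vector with a 1 at every word index
# that occurs in it; indices >= max_words (rare words) are simply ignored.
def to_binary_row(word_indices, num_words=max_words):
    row = np.zeros(num_words)
    for j in word_indices:
        if j < num_words:
            row[j] = 1.0
    return row

# e.g. a tweet tokenized to indices [2, 45, 45, 17] gets a 1.0 at positions 2, 17 and 45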

Training

Now we can use Keras to build a TensorFlow-backed neural network. The dropout layers help prevent overfitting and make our model more robust.


model = Sequential()
model.add(Dense(600, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(300, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_x, train_y,
  epochs=10,
  verbose=1,
  validation_split=0.1,
  shuffle=True)
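
If you want to sanity-check the architecture, model.summary() prints each layer's output shape and parameter count; with max_words = 5000 this network has roughly 3.2 million trainable parameters.


# 5000-wide binary input -> Dense(600) -> Dropout -> Dense(300) -> Dropout
# -> Dense(2, softmax); summary() lists output shapes and parameter counts.
model.summary()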

Save the model


# serialize the model architecture to JSON
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)

# serialize the learned weights to HDF5
model.save_weights('twitter-sentiment.h5')

Prediction

You may want to put the prediction code in its own file (predict.py below). This code lets you run sentiment queries interactively from the command line.


import json
import numpy as np
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json

tokenizer = Tokenizer(num_words=5000)
labels = ['negative', 'positive']

# read saved dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

# only keep words that are registered in the dictionary
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
        else:
            print("'%s' not in training corpus; ignoring." % (word))
    return wordIndices

# read in your saved model structure
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# weigh your nodes with your saved values
model.load_weights('twitter-sentiment.h5')

while True:
    evaluation = input("Input a sentence to be evaluated, or Enter to quit: ")

    if len(evaluation) == 0:
        break

    testArr = convert_text_to_index_array(evaluation)
    test_matrix = tokenizer.sequences_to_matrix([testArr], mode='binary')
    # predict which bucket your input belongs in
    pred = model.predict(test_matrix)
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))

Query your model


$ python3 predict.py
Input a sentence to be evaluated, or Enter to quit: no-good-dirty-rotten-pig-stealing-great-great-grandfather
negative sentiment; 63.360399% confidence
