2. METHODS
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

def confusion_matrices(training_data, num_folds=10):
    text_training_data = np.array([row[0] for row in training_data])
    class_training_data = np.array([row[1] for row in training_data])
    kf = KFold(n_splits=num_folds, random_state=42, shuffle=True)
    cnf_matrix_test = np.zeros((2, 2), dtype=int)
    for train_index, test_index in kf.split(text_training_data):
        text_train, text_test = (text_training_data[train_index],
                                 text_training_data[test_index])
        class_train, class_test = (class_training_data[train_index],
                                   class_training_data[test_index])

        # Fit the pipeline (defined earlier) on this fold's training
        # split and accumulate the confusion matrix from its test split
        sentiment_pipeline.fit(text_train, class_train)
        predictions_test = sentiment_pipeline.predict(text_test)
        cnf_matrix_test += confusion_matrix(class_test, predictions_test)
    return cnf_matrix_test
In each iteration of the 10 folds, the program above splits the data into a training set and a test set.
The classifier is trained on the training set using sentiment_pipeline.fit, and its predictions for
that fold’s test set are accumulated into a confusion matrix that visualizes the classifier’s
performance by category. We also calculate the precision, recall, and F1 score for each fold’s
individual test set, as sketched below.
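As a sketch, these per-fold metrics can be computed inside the cross-validation loop with
scikit-learn’s precision_recall_fscore_support; the variable names mirror the listing above, and
treating label 1 as the positive “subtweet” class is an assumption, not taken from the listing.

from sklearn.metrics import precision_recall_fscore_support

# Inside the loop, after predict(): average="binary" reports the
# metrics for the positive class only. pos_label=1 (the "subtweet"
# class) is an assumption, not taken from the listing above.
precision, recall, f1, _ = precision_recall_fscore_support(
    class_test, predictions_test, average="binary", pos_label=1)
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(
    precision, recall, f1))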
2.3 Creating the Twitter Bot
We use Tweepy to interact with the Twitter API. It provides a convenient object for streaming
Twitter data in real time. The StreamListener class can track tweets by searching for specific
users, locations, and keywords. For our purposes, we extend it to track subtweets.
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        id_str = status.id_str
        screen_name = status.user.screen_name
        created_at = status.created_at
        retweeted = status.retweeted
        in_reply_to = status.in_reply_to_status_id_str
        text = status.full_text

        # Genericize extra features and clean up the text
        text = (urls_pattern.sub("GENERIC_URL",
                at_mentions_pattern.sub("GENERIC_MENTION",
                names_pattern.sub("GENERIC_NAME",
                                  text)))
                .replace("\u2018", "'")
                .replace("\u2019", "'")
                .replace("\u201c", "\"")
                .replace("\u201d", "\"")
                .replace("&quot;", "\"")
                .replace("&amp;", "&")
                .replace("&gt;", ">")
                .replace("&lt;", "<"))

        tokens = tokenizer.tokenize(text)

        # Flag which tokens are English dictionary words
        english_tokens = [english_dict.check(token) for token in tokens]
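The excerpt above ends before the listener is attached to a live stream. A minimal sketch of the
usual Tweepy (v3) wiring follows; the credentials and track terms are placeholders, assumed for
illustration rather than taken from our configuration.

import tweepy

# Placeholder credentials; the real keys are supplied elsewhere.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=StreamListener())

# filter() opens the connection and delivers each matching tweet to
# on_status in real time; the track terms here are placeholders.
stream.filter(track=["subtweet"], languages=["en"])

Note that filter() blocks while the connection stays open, so the rest of the bot’s work happens
inside the listener’s callbacks.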