The dataset I used is the plain-text file of Votes for Women, written by Elizabeth Robins. It is available from Project Gutenberg.
I first used OpenRefine to clean my dataset. Since the file was plain text, I didn’t want to remove any words from it. While exploring the dataset, I found some “-” and “_” characters that might interfere with word identification, so I deleted them using regular expressions in the transform section. Here are the two commands that I entered:
With these two commands, 750 cells were modified to remove “-” and over 1000 cells were modified to remove “_”. This cleaning step saves a lot of time: without it, some words, such as prepositions, that carry these special characters would be counted as separate unique words and could appear in the word cloud.
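For readers without OpenRefine, the same kind of cleanup can be sketched in Python. This is not the OpenRefine command I used, only a rough equivalent; the function name strip_marks and the sample lines are made up for illustration.

```python
import re


def strip_marks(line: str) -> str:
    """Delete every hyphen and underscore from one line of text."""
    return re.sub(r"[-_]", "", line)


# Hypothetical sample lines, not taken from the actual file.
sample = ["_Lady John_ enters.", "a well-known speaker", "No marks here"]
cleaned = [strip_marks(s) for s in sample]

# Count how many lines actually changed, similar to the
# "cells modified" figure that OpenRefine reports.
changed = sum(1 for before, after in zip(sample, cleaned) if before != after)

print(cleaned)
print(changed)
```

Note that deleting a hyphen joins the two halves of a word ("well-known" becomes "wellknown"). Replacing the character with a space instead would keep the halves as separate words.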
After exporting the clean dataset from OpenRefine, I used Voyant to conduct a text analysis. I set the stop word list to English and ran a first round of analysis, which showed some words that should be added to the stop word list. I added contractions such as “it’s” and “you’re” so that they would be removed. I also added the word “gutenberg”, which Voyant counted as a high-frequency word even though it appears only in the text added by the source provider. After these changes to the stop word list, Voyant was able to generate word clouds and other relevant plots for visualizing the text analysis of the dataset.
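The effect of the extended stop word list can be illustrated with a small Python sketch. This is not how Voyant works internally; the tiny BASE_STOPWORDS set stands in for Voyant's much longer English list, and the function name top_words and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for an English stop word list (assumption, not Voyant's list).
BASE_STOPWORDS = {"the", "a", "and", "to", "of", "is"}
# The extra words described above.
EXTRA_STOPWORDS = {"it's", "you're", "gutenberg"}


def top_words(text, stopwords, n=5):
    """Return the n most frequent words that are not stop words."""
    # Keep contractions such as "it's" together as one token.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    # Normalize curly apostrophes so they match the stop word list.
    tokens = [t.replace("’", "'") for t in tokens]
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Hypothetical sample text, not taken from the play.
text = "It's the women and the men. Women vote; you're to vote. Gutenberg women."
result = top_words(text, BASE_STOPWORDS | EXTRA_STOPWORDS, n=2)
print(result)
```

Without EXTRA_STOPWORDS, “gutenberg” and the contractions would stay in the counts and compete with the words that actually describe the text.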
Voyant can generate interactive word clouds and other plots that can be shared. Here are some of the plots that I generated for Votes for Women.
This is a word cloud containing the most frequent words. The main characters, such as John, Lady John, Mrs Jean, all appear in this word cloud. The words “men” and “women” are also mentioned a lot, since the text is about women’s right to vote. You can drag the slider in the lower left corner to adjust the number of words that appear in the word cloud.
This analysis shows how often the most frequent words appear throughout the book. Some words, such as “Stonor”, “John”, and “lady”, are quite frequent at the beginning and the end. Others, for example “men” and “women”, appear at a relatively stable frequency.
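A trend line like this can be approximated by cutting the text into roughly equal segments and measuring how often a word appears in each one. The sketch below only illustrates that idea and makes no claim about Voyant's exact method; the function name segment_frequencies and the token list are made up.

```python
def segment_frequencies(tokens, word, segments=10):
    """Relative frequency of `word` in each slice of the token list.

    The list is cut into at most `segments` slices of equal size
    (the last slice may be shorter).
    """
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    freqs = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        freqs.append(chunk.count(word) / len(chunk))
    return freqs


# Hypothetical token list, not taken from the play.
tokens = "stonor women men women john women men stonor".split()
stonor_trend = segment_frequencies(tokens, "stonor", segments=4)
women_trend = segment_frequencies(tokens, "women", segments=4)
print(stonor_trend)
print(women_trend)
```

In this toy example, “stonor” appears only in the first and last segments, while “women” stays at the same level across most of the text, which is the kind of contrast the plot shows.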
This network analysis shows how the words are linked and related to one another.
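One simple way to build such links is to count how often two words occur close to each other. The sketch below is an assumption about the general approach, not a description of Voyant's implementation; the function name cooccurrences, the window size, and the token list are hypothetical.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens after the current one.
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) are counted together.
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Hypothetical token list, not taken from the play.
tokens = "lady john women vote lady john".split()
links = cooccurrences(tokens, window=1)
print(links.most_common(3))
```

Pairs with high counts would become the strongest links in the network, so names that are repeatedly used together, such as “lady” and “john” here, end up connected.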
Applying text analysis to my data, which is a play script, is very useful for identifying the main characters and the main themes discussed in the document. A play script has a unique characteristic: every time a character speaks, the character's name appears once. Therefore, characters with more lines, who are more likely to be main characters, are more likely to show up in the word cloud and the high-frequency list. Meanwhile, the words that they mention frequently, in this case “men” and “women”, are more likely to be the main topics discussed in the script, so we can use these words to infer some connection to the main theme of the data. The network analysis of the words also gives some clues about possible character relations, and it helps us better analyze the text from a data perspective.
Digging deeper, text analysis of general texts, such as books and play scripts, can give us a different view of the work, and it can help us find important words that are easy to overlook when just reading the text. It also gives clues about the author's style: which words the author likes to use, and how they are used.