
Monday, 29 September 2014

Sentiment Analysis

Guess what, a student's life is not an easy one. Classes, assignments, presentations and exams suck up all of your time. My first encounter with R, which I used for a "popularity comparison of two football teams", hiked my curiosity, and I was waiting for some spare time to get hold of something new. Now that I am enjoying my Durga Puja vacation, I tried to explore the concepts of "Sentiment Analysis".

As the name suggests, Sentiment Analysis is a technique in which opinions expressed by people about a particular statement, product, person, etc. are analysed to understand what people feel about that entity. What are the positive aspects and what are the negatives? How many are in favor of the statement and how many are against it?

You might be wondering how we capture sentiments.
To keep it simple, look at the positive and negative words in statements like these:

Some positive statements are:
* I am happy
* I loved watching this movie and had a great time
* I can dance all day long

Some negative statements are:
* The movie was pathetic
* I hate Chinese food 
* I don't like playing soccer

So I hope you now have some idea of what Sentiment Analysis means.
Now we will look at how R can help us get hold of this topic. For this post:

Goal:        To find which telecom operator is liked/disliked by its customers.
Source:    Twitter tweets
Tool:         R

One more thing: if you are new to conversations between R and Twitter, I request you to go through my earlier posts, which will help you configure your R environment to communicate with Twitter.

Ok so let's get started...
First let's look at the code, then the execution and output, and then I will explain the code...

Step 1: Get the list of positive and negative words from here. Extract the .rar file, as we will need the positive-words.txt and negative-words.txt files in our analysis.
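
To get a feel for what the scan() call in Step 2 does with such a file, here is a minimal sketch using only base R. The file contents below are a made-up stand-in, and I am assuming (from the comment.char = ";" argument) that the real word lists mark their header lines with a semicolon.

```r
# Write a tiny stand-in lexicon to a temporary file.
# Assumption: the real positive-words.txt starts with ';' comment lines.
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; header line that scan() should skip",
             ";",
             "",
             "good",
             "great",
             "happy"), lexicon_file)

# Same arguments as in Step 2; quiet = TRUE only silences the "Read N items" message.
pos_words <- scan(file = lexicon_file, what = "character",
                  comment.char = ";", quiet = TRUE)
print(pos_words)

# Clean up the temporary file.
unlink(lexicon_file)
```

Lines starting with ";" and blank lines are dropped, so pos_words ends up as a plain character vector of words.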

Step 2: Open a blank R script and paste the following code:

                  pos_words=scan(file="path for positive-words.txt",what="character",comment.char = ";")
                  neg_words=scan(file="path for negative-words.txt",what="character",comment.char = ";")

                  #Fetch tweets from twitter...

                  #Now we need to convert the list to a dataframe...
                  #So load the package plyr, which provides ldply()...
                  library(plyr)

                  objectdf=ldply(object,function(t) t$toDataFrame())



                fill=Operator), binwidth=1) +facet_grid(Operator~.)  

Step 3: Save the script.
Step 4:  Open another script and paste the following code:

               score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
               {
                    # laply comes from plyr, str_split from stringr
                    require(plyr)
                    require(stringr)

                    # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
                    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
                    scores = laply(sentences, function(sentence, pos.words, neg.words) {

                         # clean up sentences with R's regex-driven global substitute, gsub():
                         sentence = gsub('[[:punct:]]', '', sentence)
                         sentence = gsub('[[:cntrl:]]', '', sentence)
                         sentence = gsub('\\d+', '', sentence)

                         # and convert to lower case:
                         sentence = tolower(sentence)

                         # split into words. str_split is in the stringr package
                         word.list = str_split(sentence, '\\s+')

                         # sometimes a list() is one level of hierarchy too much
                         words = unlist(word.list)

                         # compare our words to the dictionaries of positive & negative terms
                         pos.matches = match(words, pos.words)
                         neg.matches = match(words, neg.words)

                         # match() returns the position of the matched term or NA
                         # we just want a TRUE/FALSE:
                         pos.matches = !is.na(pos.matches)
                         neg.matches = !is.na(neg.matches)

                         # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
                         score = sum(pos.matches) - sum(neg.matches)

                         return(score)
                    }, pos.words, neg.words, .progress=.progress )

                    # one row per sentence: its score and the original text
                    scores.df = data.frame(score=scores, text=sentences)
                    return(scores.df)
               }

Step 5: Save the above script.
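
If you want to see what the cleaning lines inside score.sentiment() do to a single tweet, here is a small sketch. The tweet is made up for illustration, and I use base R's strsplit() in place of stringr's str_split() so that the snippet runs without extra packages.

```r
# A made-up tweet, for illustration only.
tweet <- "@VodafoneIN network down AGAIN for 2 days, not happy!"

# The same cleaning steps as in score.sentiment():
cleaned <- gsub("[[:punct:]]", "", tweet)    # drop punctuation (@ , !)
cleaned <- gsub("[[:cntrl:]]", "", cleaned)  # drop control characters
cleaned <- gsub("\\d+", "", cleaned)         # drop digits
cleaned <- tolower(cleaned)                  # lower case

# Split on runs of whitespace (base R stand-in for str_split).
words <- unlist(strsplit(cleaned, "\\s+"))
print(words)
# "vodafonein" "network" "down" "again" "for" "days" "not" "happy"
```

What is left is a plain vector of lower-case words, which is what gets compared against the positive and negative word lists.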


Now type the following commands in your R console...

Fig 1: Run Pos_Neg_Words( ) function.
Fig 2: Load Vodafone data using "@VodafoneIN" twitter handle and show first 6 observations.

Fig 3: Extracting scores using the GetSentimentScoreDF(...) function. Please note that we have added two more columns, 'Operator' and 'Code', which will help us when we do the comparison.

Fig 4: Plot of score values

Similarly, we will run the above commands for the other operators. I am not attaching the output screenshots for the rest. Please do add these two columns for all the operators with appropriate values, otherwise you will have a tough time scratching your head.

Other operators for my experiment are:
  • Airtel (Handles: "@airtel_care", "@Airtel_Presence")
  • Idea (Handle: "@ideacellular")

Fig 5: Summarized screenshot after loading the data into the respective dataframes.


Fig 6: Bind scores of all operators into one.
Fig 7: Compare Score Distributions (Result)
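
Since the commands live in the screenshots, here is a rough sketch of the two bookkeeping steps from Fig 3 and Fig 6: tagging each score dataframe with 'Operator' and 'Code', and binding them into one. The scores, the dataframe names and the code values ("VF", "AT", "ID") are all made up for illustration; they are not the values from my run.

```r
# Made-up scores standing in for the output of the scoring step.
vodafone_scores <- data.frame(score = c(-2, -1, 0, 1),
                              text = paste("vodafone tweet", 1:4))
airtel_scores <- data.frame(score = c(-1, 0, 1),
                            text = paste("airtel tweet", 1:3))
idea_scores <- data.frame(score = c(0, 1, 2),
                          text = paste("idea tweet", 1:3))

# Add the two extra columns (hypothetical code values).
vodafone_scores$Operator <- "Vodafone"
vodafone_scores$Code <- "VF"
airtel_scores$Operator <- "Airtel"
airtel_scores$Code <- "AT"
idea_scores$Operator <- "Idea"
idea_scores$Code <- "ID"

# Bind everything into one dataframe; rbind() needs identical column names,
# which is why every operator must get both columns.
all_scores <- rbind(vodafone_scores, airtel_scores, idea_scores)
print(table(all_scores$Operator))
```

If one of the dataframes is missing a column, rbind() stops with an error, so it pays to add both columns right after scoring each operator.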

Now it's time to understand the code with the help of the screenshots shown above.

  • The code written in Pos_Neg_Words() extracts the lists of positive and negative words and stores them in two suitable variables.

  • Then the Getdata() function takes a search term (here, our Twitter handles) and a limit (here, 1000). It extracts the tweets matching the search term and creates a dataframe to store the result. We have kept only 500 observations because you may not always get 1000 tweets at a given time. You may even get fewer than 500; in that case, select the same number of available tweets for all contenders. In my case I got fewer than 500 tweets for "Idea".
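
One way to keep the same number of tweets for every contender is sketched below. The dataframes are tiny made-up stand-ins for what Getdata() returns, and the names are hypothetical.

```r
# Made-up tweet dataframes of unequal size (stand-ins for the Getdata() results).
tweets <- list(
  Vodafone = data.frame(text = paste("vodafone tweet", 1:7)),
  Airtel   = data.frame(text = paste("airtel tweet", 1:6)),
  Idea     = data.frame(text = paste("idea tweet", 1:4))
)

# The smallest sample decides how many tweets everyone keeps.
n_common <- min(vapply(tweets, nrow, integer(1)))

# Keep only the first n_common rows of each dataframe.
tweets_equal <- lapply(tweets, head, n = n_common)
print(vapply(tweets_equal, nrow, integer(1)))
```

Here every operator ends up with 4 rows, the size of the smallest sample.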

  • Now we will compute the score. What is this? Well, if you look at the positive and negative statements discussed above, you can count the number of positive-sounding and negative-sounding words in each sentence. For example, suppose a sentence has:

Positive words=2
Negative words=4
Therefore, Score=2-4, i.e. -2

This score of -2 says that the statement is expressed in a negative sense.

Yes, you are thinking right. There might be cases where sentences use negative words to convey a positive meaning. In those cases the above model would fail to capture the true essence of the statements. This method is discussed just to give the reader a feel for the topic. The computation is done by well-written code from Jeffrey Breen: the score.sentiment() function in Step 4. It strips punctuation, control characters and digits from each tweet, converts it to lower case, and breaks it into words. It then looks up each word in both the positive and negative word lists, stores the matches in appropriate variables, and computes the score for each sentence as discussed above. score.sentiment() is called from GetSentimentScoreDF(dataframe).
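
To make the scoring and its blind spot concrete, here is a stripped-down sketch of the same idea in base R. This is not Breen's function; score_sentence() and the two tiny word lists are hypothetical and only for illustration.

```r
# Hypothetical mini-lexicons (not the downloaded word lists).
pos_words <- c("happy", "loved", "great", "like")
neg_words <- c("pathetic", "hate")

# Score = number of positive words minus number of negative words.
score_sentence <- function(sentence, pos_words, neg_words) {
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unlist(strsplit(cleaned, "\\s+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

sentences <- c(
  "I am happy",
  "I loved watching this movie and had a great time",
  "The movie was pathetic",
  "I hate Chinese food",
  "I don't like playing soccer"
)

scores <- vapply(sentences, score_sentence, integer(1),
                 pos_words = pos_words, neg_words = neg_words,
                 USE.NAMES = FALSE)
print(data.frame(score = scores, text = sentences))
# scores: 1, 2, -1, -1, 1
```

Note the last sentence: "I don't like playing soccer" gets +1 because "like" is counted as a positive word and the negation is ignored. This is exactly the kind of case where simple word counting misreads the sentiment.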

  • Looking at Figure 4, we can see that the bars are higher on the negative side, which might suggest that customers are complaining about Vodafone's services. This is what we planned to find out, right?

  • Look at Figure 5. Here we have extracted the data and score values for the other two operators and explicitly added two columns, Operator and Code. These two columns make the comparison easy.

  • The code in Figure 6 binds all the score dataframes into one dataframe, which acts as the input to the ggplot() function. Please install the ggplot2 package.
  • The code in Figure 7 plots the score values operator-wise. This is the reason we added the Operator column. The plot is shown below the code.
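
For readers who cannot see the screenshots, the comparison can be sketched roughly as follows. The scores are made up, and geom_histogram(binwidth = 1) is my substitution for the plotting call in the screenshot; only fill=Operator, binwidth=1 and facet_grid(Operator~.) are visible in the fragment in Step 2. The plot is drawn only if ggplot2 is installed.

```r
# Made-up combined scores, standing in for the dataframe built in Figure 6.
all_scores <- data.frame(
  score = c(-2, -1, -1, 0, 1, -1, 0, 0, 1, 2, 0, 1, 1),
  Operator = rep(c("Vodafone", "Airtel", "Idea"), times = c(5, 4, 4))
)

# Numeric view of the same comparison: tweet counts per operator and score.
score_counts <- table(all_scores$Operator, all_scores$score)
print(score_counts)

# One histogram panel per operator, stacked vertically.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(all_scores, aes(x = score, fill = Operator)) +
    geom_histogram(binwidth = 1) +
    facet_grid(Operator ~ .)
  print(p)
}
```

Stacking the panels with facet_grid(Operator ~ .) puts all operators on the same score axis, which makes it easy to see whose bars lean towards the negative side.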
Looking at the final plot, we can say that Vodafone customers complain the most about the services, followed by Airtel. I guess the tagline "No ULLU Banaowing" is working fine for IDEA users.
Again, I would say these inferences are just for demonstration and learning. Nothing can be inferred without statistical tests, and I hope there are some for Sentiment Analysis too. Well, that brings me to the end of this post. I hope this article was worth reading. You can also check the sources from which I borrowed these concepts.
Happy Learning...
