Search This Blog

Monday, 29 September 2014

Sentiment Analysis

Guess what student's life is not an easy one. Classes, assignments, presentations, exams are things which sucks up all of your time. After having an encounter with R which I used for doing a "popularity comparison of two football teams" hiked my curiosity and was waiting for some spare time to get hold of  something new. Now that I am enjoying my Durga Puja vacations, I tried to explore the concepts of "Sentiment Analysis".

As the words suggests, Sentiment Analysis is a technique in which opinions expressed by people on a particular statement, product, person etc are analysed to understand what people feel about that respective entity? What are the positive aspects and what are the negatives? How many are in the favor of the statement and how many against of it? 

You might be thinking that how do we capture the sentiments?
To keep it simple look at the positive and negative words like:

Some positive statements are:
* I am happy
* I loved watching this movie and had a great time
* I can dance all day long

Some negatives statements are:
* The movie was pathetic
* I hate Chinese food 
* I don't like playing soccer

So I hope now you have some idea of what Sentiment Analysis means.
Now we will look at how R can help us in getting hold of this topic. In this post my

Goal:        To find which telecom operator is liked/disliked by its customers.
Source:    Twitter tweets
Tool:         R

One more thing, If you are new to conversations between R and Twitter, I will request you to please go through my earlier posts  that will help you configure your R environment in order to communicate with twitter. 

Ok so let's get started...
At first let's look at the code then the execution and output and then I will explain the code...

Step 1: Get the list of Positive and negative words from here. Extract the .rar file as we will need the positive-words.txt and negative-words.txt files in our analysis.

Step 2: Open a blank R script and paste the following code:

                  pos_words=scan(file="path for positive-       
                  words.txt",what="character",comment.char = ";")
                  neg_words=scan(file="path for negative-
                  words.txt",what="character",comment.char = ";")

                  #Fetch tweets from twitter...

                  #Now we need to convert the  list to a dataframe...
                  #So select the package plyr...

                   objectdf=ldply(object,function(t) t$toDataFrame())



                fill=Operator), binwidth=1) +facet_grid(Operator~.)  

Step 3: Save the script.
Step 4:  Open another script and paste the following code:

               score.sentiment = function(sentences, pos.words, neg.words, .progress='none')

                        # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
                        # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
                       scores = laply(sentences, function(sentence, pos.words, neg.words) {

                       # clean up sentences with R's regex-driven global substitute, gsub():
                       sentence = gsub('[[:punct:]]', '', sentence)
                       sentence = gsub('[[:cntrl:]]', '', sentence)
                       sentence = gsub('\\d+', '', sentence)

                      # and convert to lower case:
                      sentence = tolower(sentence)

                     # split into words. str_split is in the stringr package
                     word.list = str_split(sentence, '\\s+')

                     # sometimes a list() is one level of hierarchy too much
                     words = unlist(word.list)

                     # compare our words to the dictionaries of positive & negative terms
                     pos.matches = match(words, pos.words)
                     neg.matches = match(words, neg.words)

                    # match() returns the position of the matched term or NA
                    # we just want a TRUE/FALSE:

                    pos.matches = !
                    neg.matches = !

                   # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
                   score = sum(pos.matches) - sum(neg.matches)
                   }, pos.words, neg.words, .progress=.progress )
                  scores.df = data.frame(score=scores, text=sentences)

Step 5: Save the above script.


Now in your R console type the following codes...

Fig 1: Run Pos_Neg_Words( ) function.
Fig 2: Load Vodafone data using "@VodafoneIN" twitter handle and show first 6 observations.

Fig 3: Extracting Scores using GetSentimentScoreDF(...) function. Please note we have added two more columns 'Operator' and 'Code' which will help us when we will do the comparison.

Fig 4: Plot of score values

Similarly we will run the above commands for other operators. I am not attaching the output screenshots for the rest. And I request you to please do add these two columns for all the operators with appropriate values otherwise you will be having tough time scratching your head.

Other operators for my experiment are:
  • Airtel (Handle: "@airtel_care","@Airtel_Presence")
  • Idea    (Handle: "@ideacellular")
Fig 5: Summarized Screenshot after loading the data to respective dataframes.


Fig 6: Bind scores of all operators into one.
Fig 7: Compare Score Distributions (Result)

Now its time we understand the codes with the help of screenshots shown above.

  • The code written in Pos_Neg_Words() extracts the list of positive and negative words and store it in two suitable variables. 

  • Then the Getdata() function takes a searchterm (here our twitter handles) and the limit (here 1000). It extracts the tweets based on the search term and creates a dataframe to store the result. We have taken only 500 observations because sometimes you may not get 1000 tweets at that time. It might also happen that you may get less than 500 and at that point of time select same number of available tweets for all contenders. In my case I got less than 500 tweets for "Idea".

  • Now we will compute the score. What is this??? Well if you look at the positive and negative statements discussed above you can find the number of positive sounding and negative sounding words from each sentence.For ex-

Positive words=2
Negative words=4
Therefore,  Score=2-4 i.e -2

This score of -2 says that the statement is expressed in negative sense.

Yes you are thinking right. There might be cases when sentences are written using negative words to depict some positive meaning. In those cases the above model would fail to capture the true essence of  statements. Well this method is discussed just to give reader a feel of this topic. The above computation is done using a well written code by "Jeffrey Breen" - score.sentiment() written in step 4. The code removes all the redundant and useless text from the tweets and then breaks each tweet sentence into words. It then tries to find a match for each word from both the positive and negative words list and stores the result of match in appropriate variables and from there the score for each sentence is computed as I discussed. The score.sentiment is called from GetSentimentScoreDF(dataframe)

  • Looking at Figure 4, we can see that bars are high in negative side that might suggest that the customers are complaining about the Vodafone services. This is what we planned to achieve right???

  • Look at Figure 5. Here we have extracted the data and score values for other two operators and added two columns explicitly i.e. Operators and Code. Depending on these two columns the comparison would be done easily. 

  • The code in Figure 6, binds all the score dataframes into one dataframe which will act as an input to the ggplot() function. Please install the package ggplot2.
  • The code in Figure 7  is for plotting the score values Operator-wise. This was the reason we added the column Operator. The plot is shown below the code.
Looking at the final plot we can say that Vodafone customers are highly complaining about the services followed by Airtel. I guess the tagline "No ULLU Banaowing" is working fine for IDEA users. 
Again I would say this inferences are just for demonstration and learning. Nothing can be inferred without statistical tests and I hope there are some for Sentiment Analysis too. Well having said that brings me to the end of this post. I hope, it was of some worth reading this article. You can also check the sources from where I borrowed this concepts.
Happy Learning...

Monday, 8 September 2014

Dimensional Datawarehousing using Mysql

This time I am getting my hands on Dimensional Data Warehousing....
The topic is very new to me but being familiar to Databases is the reason I am exploring this concepts.. And in exploring the concepts I am going to use Mysql as my data support.

Before we begin with the technical things let us understand why we need to study Data Warehousing? What is the difference between typical RDBMS and Data Warehouse?

Little bit  of googling can give you many answers. There is one basic difference between these two is that the data storage technique is little bit different in a  DW as compared to RDBMS.

Generally, RDBMS are designed for transactional purposes whereas DW are designed for analysis purpose. And for analysis, the data from traditional RDBMS systems are Extracted, Transformed in a way which is best suited for analysis and then Loaded into the DW, also called ETL. Once you are done with ETL you are ready to go for analysis.

This is the basic difference known to me. You can add your answers in replies. So there are many things which we will be exploring one by one as and when needed. Let's have a glance at basic fundamentals(please go through the following terms from this link in case you find it difficult to understand) The contents of this post are borrowed from this link.

Star Schema is basically the database structure of  DDW and Surrogate key is a column we need to add to DW table that acts as a primary key.
Fact table consists of facts(or measures) of the business and Dimension table contains the descriptions for querying the database.

There are many arrangements for Fact and Dimension table and one of them is "One Star Table" which has one Fact table and two or more than two Dimensional tables.

Here we are going to start with 4 little tasks.

         1. Create a new user in Mysql environment.
         2. Creating a relational database for DW and one Source database ( subject to ETL).
         3. Creating tables for DW.
         4. Generating Surrogate Keys ( please note unlike primary keys we do not enter values for surrogate keys, as we will make the system to generate one).
Following are the snapshots for task 2 to 4. Creating a new user is hardly any challenge for SQL users.
 Task 2

 Task 3

Task 4

**End Of chapter 1**

********Chapter 2********

In this chapter we are going to learn how to modify the dimension tables. Since DW is different from typical OLTP RDBMS w.r.t maintaining Dimension History, let us see how does it help us.

We will be dealing with two aspects of DW here. SCD1 and SCD2.
SCD1->Slowly Changing Dimension of type 1 in which the original data is replaced with the  updated data and no track/history is maintained for the previous data. 

SCD2->Slowly Changing Dimension of type 2 in  which the original data is made history data by changing its expiry date to one day before the origination of new data and its entrance to the table. In this way the previous record is kept whereas the change is also reflected in form of new entry.

Applying SCD1 to customer_dim table where the customer name for customer_number=1 is changed from "Big Customers" to "Really Big Customers" and a new record is inserted. By definition of SCD1 we are not keeping any track of updation done.

The result of above code
Now let us apply SCD2.

Here we have two records in product_dim table. Snapshot is given below.
Now we have two records in 
product_dim table. We will change the name of the product from Hard_Disk to Hard Disk Drive and we will insert one more record for LCD Monitors. Since we are changing the name we need to keep the track of previous record. Let us look at the following snapshot to understand how it is done.

**End of Chapter 2**

*******Chapter 3*******

Now we will discuss about additive measures. If you remember we have two types of tables i.e. fact table and dimension tables. So a measure is an attribute of fact table that is numeric and subject to arithmetic operations i.e sum, difference, avg, min, max etc. It can be fully additive as well as semi-additive.

Fully additive means you can sum the values of a measure in all situations and semi-additive means you can add up the values of the measure in some situations.

Before exploring the concepts of additive measures let us fill the tables
1. Order_dim
2. Date_dim
3. Sales_order_fact

Because there is no data in these tables. Following are the snapshots that explains how to fill in the data.

Now let us check for the whether the measure "order_amount" in sales_order_fact table is fully additive or not. It means it should give the same value every time a query is run on it. The query may be run on different table but it should show the same value every time. The total  value of this measure should be Consistent.

1. Sum from the sales_order_fact. Sum is 58000

2. Sum from customer_dim. If we add all the values of sum_of_order_amount it will give 58000

3.  Sum from product_dim. The values of the last column adds up to 58000

4.  Sum from all the dimension tables. If all the values from the last column are added it turns up to   58000.

Clearly we can see the order_amount is fully additive.

**End of chapter 3**

*********Chapter 4**********

In this chapter we are going to query the dimensional data warehouse. Before that let us insert few more records into


2        Date_dim


The snapshot for the above data insertion is.

Now we will run the aggregate functions on these tables;

Here we will run a query that give daily sales summary.

Now we will look into the annual sales.

Specific Queries

A specific query selects and aggregates the facts on a specific dimension value. The two

examples show the application of dimensional queries in specific queries.

The query aggregates sales amounts and the number of orders every month.

Quarterly Sales in Mechanisburg

An another specific query. It produces the quarterly aggregation of the order amounts in Mechanicsburg.

Inside-Out Queries

While the preceding queries have dimensional constraints (the selection of facts is based on

dimensions), an inside out dimensional query selects the facts based on one or more

measure values. In other words, your query starts from the fact (the centre of the star

schema) towards the dimensions, hence the name inside-out query. The following two

examples are inside-out dimensional queries.

Product Performer

The dimensional query gives you the sales orders of products that have a monthly sales amount of 75,000 or more.

Loyal Customer

The following query  is a more complex inside out query than the previous one.. If your users would like to know which customer(s) placed more than five orders annually in the past eighteen months, you can use the query in. This query shows that even for such a complex query, you can still use a dimensional query.

**End of chapter 4**

**********Chapter 5*********

This following chapter will  deal with the ETL tasks. Extract, Load and Transform the data from source tables(generally OLTP RDBMS tables) to Dimensional Data Warehouse tables.

Here we will do the Extraction and Loading parts. We need to deal with which data we need to extract and load into DW tables.

Data can be extracted in two ways. “Whole Source” in which all the records from the source tables are loaded into DW tables, and another one is “Change Data Capture” in which only the new and changed records are loaded since last extraction.

Direction of data extraction. “Pull by Data” in which the source simply waits for the DW to extract it. “Push by Source” in which the source pushes the data as soon as it is available.

Push-by-source CDC on Sales Orders Extraction

In this section we will look on how push-by-source CDC works on sales order source data.

Push-by-source CDC means the source system extracts only the changes since the last

extraction. In the case of sales order source data, the source system is the source

database created in Chapter 1.

But before doing any extraction we need to create tables in source database if you remember. So lets create the sales_order table there.

Now let us write a stored-procedure which we can run daily to extract the data from source table.

Now let us insert some records into order_dim and date_dim

Now we need to insert some data into source table i.e sales_order table.

Now let us switch to source database as we want to push the data from source to DW and we will do that with the help of stored procedure written earlier.

**End of chapter 5**

*********Chapter 6**********

Populating the date dimension

A date dimension has a special role in dimensional data warehousing. First and foremost, a date dimension contains times and times are of utmost importance because one of the primary functions of a data warehouse is to store historical data. Therefore, the data in a data warehouse always has a time aspect. Another unique aspect of a date dimension is that you have the option to generate the dates to populate the date dimension from within

the data warehouse.

There are three techniques to populate data in date dimension.

The three techniques are

·        Pre-population

·        One date everyday

·         Loading the date from the source data


Among the three techniques, pre-population is the easiest one for populating the date

dimension. With pre-population, you insert all dates for a period of time. For example, you

can pre-populate the date dimension with dates in ten years, such as from January 1, 2005

to December 31, 2015. Using this technique you may pre-populate the date dimension once

only for the life of your data warehouse.

the following snapshot shows a stored procedure you can use for pre-population. The stored procedure accepts two arguments, start_dt (start date) and end_dt (end date). The WHILE loop in the stored procedure incrementally generates all the dates from start_dt to end_dt and inserts these dates into the date_dim table.


The second technique for populating the date dimension is one-date-every-day, which is

similar to pre-population. However, with one-date-every-day you pre-populate one date

every day, not all dates in a period. The following snapshot loads the current date into the date_dim table.


Loading Dates from the Source

The following snap loads the sales order dates from the sales_order table in the

source database into the date_dim table. We will use the DISTINCT keyword in the script to

make sure no duplicates are loaded.

Now let us add more records to source sales order table and re run  command.

**End of chapter 6**

**********Chapter 7**********

Initial Population

In chapter 6 we learned how to populate date dimension table. In this chapter we will learn how to populate other dimension tables .

Clearing the tables

Pre-populating the Date Dimension

call pre_populate_date ('2005-03-01', '2010-12-31'); as we did in chapter 6. This procedure will populate the date dimension with all the dates between 01/03/2005 to 31/12/2010.

Preparing the Sales Orders

Running initial population

**End of chapter 7**

**********Chapter 8**********

Regular Population

Regular population differs from initial population where the data is loaded at one go whereas in regular population as name suggests populates the data into tables from source tables on a regular basis.

Before running the Regular population codes, we need to load new data to sales_order table of source.

Now let us run the scripts for loading data into dimension tables in "dw "database using regular method.

Results to verify whether the data has been loaded into the respective tables or not.

Customer_dim table

 product_dim table
**End of chapter 8**