Tuesday, November 25, 2014

Use Python for Recommendation Engine



Python For Recommendation Engine

well here Page1 we got the data and we fetched associated details with each of the dataset , alright now lets roll , will just write the algorithm and its done...

Good Morning .. its real world and nothing happens so easily here ..

Why did I say so ?

   Var1 Freq
    N/A  405

Because I have 405 NA values in imdbRating .. it does mean I have no records available for 405 among the list of 1173 total movies.. That sick because then almost 40% of the dataset is invalid.

So what should I do , should I simply dump the data and why it happened.

I can't dump that much data and it happened because there are movies which has no details in IMDB, may be because of some movies are based on particular zone only nt globally known or any other reason.


What Should I Do ?

I will cleanse and shape my data.

Lets consider a movie named KNHN and now I will try to find how many fb friends' like the movie lets say 10, now I can compare that number to the movie having highest viewer among my list ie 3 idiots which is 30.

So now I know among highest number 30% liked KNHN as well , so weightage of this movie is bit high. So naturally is above average rating ie 5.0 , so again I added 30% of highest rating 8.4 to 5.0 and finally I am close to some value around 7.0 (example).

So are we finished here , not possible , there are lot more fields to address like imdbUserVotes , rottenUserVotes. So what should I do here.

Simple I will use linear Regression or precisely Machine Learning ml to do this . Bingo !!!! I used scipy library of python to run polynomial regression-

  result = sm.ols(  
           formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) +I(genre2 ** 1.0)+I(genre3 ** 4.0)", data=df2).fit()  
         pretomatoRating = result.predict(  
           pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating],'genre1':[0],'genre2':[0],'genre3':[0]}))  
         pretomatoRating = int(round(pretomatoRating))  
         if pretomatoRating < 0:  
           pretomatoRating = 10000  
         self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating  


So now my dataset is quite preety




imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews
1 6.20 89960 6.20 3504356
2 6.40 5108 tt2678948 6.30 152
3 6.80 35 tt0375174 6.30 125
4 7.20 99763 tt0087538 6.90 314496
5 6.90 231169 tt1068680 5.30 316060
6 7.70 291519 tt0289879 4.80 621210
7 6.90 174047 tt1401152 5.80 74879
8 7.20 1038 tt0452580 6.30 253
9 4.90 10381 6.30 6006
10 7.20 28 tt0297197 6.30 6006
11 4.90 10381 tt1278160 6.30 6006
12 4.90 10381 6.30 6006
13 4.90 10381 6.30 6006
14 3.20 690 tt2186731 6.30 862
15 6.20 5594 tt1252596 6.30 2436

So now all of my movies have Ratings and user reviews (almost).

So Now I can run my recommendation




Shaping The Final Outcome

Now I have all the required variables in the dataset I can finally run my distance algorithm to conclude which is the next best movie.

For rating as the data is consistent I used euclidean but votes are very sparse so I used consine distance to get the data.

   def calculatedistance(self, movie1, movie2):  
     FEATURES = [  
       'imdbRating', 'imdbVotes', 'tomatoRating', 'tomatoUserReviews']  
     movie1DF = df2[df2.MOVIE == movie1]  
     movie2DF = df2[df2.MOVIE == movie2]  
     rating1 = (float(movie1DF.iloc[0]['imdbRating']), float(  
       movie1DF.iloc[0]['tomatoRating']))  
     rating2 = (float(movie2DF.iloc[0]['imdbRating']), float(  
       movie2DF.iloc[0]['tomatoRating']))  
     review1 = (long(movie1DF.iloc[0]['imdbVotes']), long(  
       movie1DF.iloc[0]['tomatoUserReviews']))  
     review2 = (long(movie2DF.iloc[0]['imdbVotes']), long(  
       movie2DF.iloc[0]['tomatoUserReviews']))  
     # ValueError  
     distances = []  
     distances.append(round(distance.euclidean(rating1, rating2), 2))  
     '''  
                Since votes have sparse data , so i preffered to use cosine rather euclidean..  
                http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data  
           '''  
     distances.append(round(distance.cosine(review1, review2), 2))  
     return distances  


Python code has -

User Recommendation - recommendMovie
Distance Comparison between 2 movies - calculatedistance
Distance comparison between multiple movies - compareOneMovieWithMultiple
Distance Comparision Among Movies - findSimilarityBetweenMovies

and few more.

One of sample I have uploaded in cloud - Distance Calculated between Movies 

Entire Source Code - git link


Few More analytic in R will be uploaded.



What More Can be done ?

A Lot more like lots of movies I couldn't extract the Genre , so I can get that from using text extraction from twitter and get the best result.

Lots of basic ratings as well can be derived from Sentiment Analysis of the movie name in different crawler.

Recommendation Algorithm can be made more versatile .












1 comment:

Boro Dega said...

Great blog. I have a question

1: Your first model states:
model <- lm(imdbVotes ~ imdbRating + tomatoRating + tomatoUserReviews+ I(genre1 ** 3.0) +I(genre2 ** 2.0)+I(genre3 ** 1.0), data = movies)

How can you run a regression model which includes NAs included)?

Br