Analyzing TIMDB (The Indian Movie Database)

The complete database can be found on GitHub.

This notebook was created by the author of the dataset.

This database contains a list of Indian movies and their metadata, with release years ranging from 1950 to 2019. As the database is large, we will start by analyzing the primary dataset: ./1950-2019/bollywood_full.csv

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from difflib import get_close_matches
from surprise import Reader, Dataset, KNNBaseline

file = "./1950-2019/bollywood_full.csv"
df = pd.read_csv(file)

Initial Analysis

In [2]:
df.head()
Out[2]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of_release runtime genres imdb_rating imdb_votes story summary tagline actors wins_nominations release_date
0 Uri: The Surgical Strike tt8291224 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Uri:_The_Surgica... Uri: The Surgical Strike Uri: The Surgical Strike 0 2019 138 Action|Drama|War 8.4 35112.0 Divided over five chapters the film chronicle... Indian army special forces execute a covert op... NaN Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga... 4 wins 11 January 2019 (USA)
1 Battalion 609 tt9472208 NaN https://en.wikipedia.org/wiki/Battalion_609 Battalion 609 Battalion 609 0 2019 131 War 4.1 73.0 The story revolves around a cricket match betw... The story of Battalion 609 revolves around a c... NaN Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen... NaN 11 January 2019 (India)
2 The Accidental Prime Minister (film) tt6986710 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/The_Accidental_P... The Accidental Prime Minister The Accidental Prime Minister 0 2019 112 Biography|Drama 6.1 5549.0 Based on the memoir by Indian policy analyst S... Explores Manmohan Singh's tenure as the Prime ... NaN Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S... NaN 11 January 2019 (USA)
3 Why Cheat India tt8108208 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Why_Cheat_India Why Cheat India Why Cheat India 0 2019 121 Crime|Drama 6.0 1891.0 The movie focuses on existing malpractices in ... The movie focuses on existing malpractices in ... NaN Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ... NaN 18 January 2019 (USA)
4 Evening Shadows tt6028796 NaN https://en.wikipedia.org/wiki/Evening_Shadows Evening Shadows Evening Shadows 0 2018 102 Drama 7.3 280.0 While gay rights and marriage equality has bee... Under the 'Evening Shadows' truth often plays... NaN Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva... 17 wins & 1 nomination 11 January 2019 (India)
In [3]:
df.describe(include="all")
Out[3]:
title_x imdb_id poster_path wiki_link title_y original_title is_adult year_of_release runtime genres imdb_rating imdb_votes story summary tagline actors wins_nominations release_date
count 4330 4330 3580 4330 4330 4330 4330.0 4330 4330 4330 4317.000000 4317.000000 4065 4329 685 4320 1344 3049
unique 4288 4284 3529 4330 4052 4047 NaN 71 149 264 NaN NaN 4020 4021 678 4271 250 2278
top Aan tt0347416 https://upload.wikimedia.org/wikipedia/en/thum... https://en.wikipedia.org/wiki/Aakraman Anari Dushman NaN 2014 \N Drama NaN NaN A Royal Indian family consists of the Emperor ... Add a Plot » A million-dollar fake Mala Sinha|Dharmendra|Kumkum|Sujit Kumar|Mehmo... 1 nomination 1977 (India)
freq 2 2 7 1 4 4 NaN 110 919 513 NaN NaN 2 264 2 2 229 8
mean NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 5.911744 2387.874913 NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 1.330077 9404.126400 NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 0.000000 0.000000 NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 5.100000 32.000000 NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 6.100000 131.000000 NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 6.900000 966.000000 NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN 9.400000 310481.000000 NaN NaN NaN NaN NaN NaN
In [4]:
df.shape
Out[4]:
(4330, 18)

Looking at the above information, a few elementary conclusions can be drawn:

  • There are 4330 movies in this dataset
  • As the mean of is_adult is 0.0, the value is the same for all movies, so we can drop that attribute
  • actors and genres are separated by |
  • Many movies lack tagline and wins_nominations information (sparse wins_nominations is expected, as not every film wins awards or gets nominated 😄)
  • runtime has dtype object because null values are stored as the character sequence '\N'; we need to convert it to float64
  • The mean IMDB rating is 5.91 and the highest rating any film has received is 9.4

A note from documentation (for ./1950-2019/bollywood_full.csv)

(This dataset is merged ON "imdb_id", hence if you find a niche-movie missing, see the respective year's directory).

i.e. if a movie released in 1994 is missing from this dataset, check the directory ./1990-2009/
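As a hedged illustration of that lookup, a small helper could scan whatever CSVs live in the relevant year-range directory (the glob pattern and the original_title column below are assumptions about the repository layout, not something the documentation spells out):

import glob

def find_in_year_range(title, directory="./1990-2009"):
    # Scan every CSV in a year-range directory for a matching title.
    # NOTE: file names and the presence of an "original_title" column are assumptions.
    for path in glob.glob(directory + "/*.csv"):
        frame = pd.read_csv(path)
        if "original_title" in frame.columns:
            hits = frame[frame["original_title"].str.contains(title, case=False, na=False)]
            if not hits.empty:
                return path, hits
    return None, pd.DataFrame()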

In [5]:
df = df.drop(['is_adult'], axis=1)

df["actors"] = df["actors"].apply(lambda actors: actors.strip().split("|") if isinstance(actors, str) else [])
df["genres"] = df["genres"].apply(lambda genres: genres.strip().split("|") if isinstance(genres, str) else [])

df["runtime"] = pd.to_numeric(df["runtime"].replace("\\N", np.nan))
In [6]:
df["runtime"].dtype
Out[6]:
dtype('float64')
In [7]:
df[["actors", "genres"]].head()
Out[7]:
actors genres
0 [Vicky Kaushal, Paresh Rawal, Mohit Raina, Yam... [Action, Drama, War]
1 [Vicky Ahuja, Shoaib Ibrahim, Shrikant Kamat, ... [War]
2 [Anupam Kher, Akshaye Khanna, Aahana Kumra, At... [Biography, Drama]
3 [Emraan Hashmi, Shreya Dhanwanthary, Snighdade... [Crime, Drama]
4 [Mona Ambegaonkar, Ananth Narayan Mahadevan, D... [Drama]

Before we continue, I've got to know which film scored a 9.4 on IMDB 😁

In [8]:
df[df["imdb_rating"] == 9.4][["original_title", "year_of_release", "imdb_id"]]
Out[8]:
original_title year_of_release imdb_id
41 Family of Thakurganj 2019 tt8897986

How? What... Apparently this film really does have a rating of 9.4! (as of December 27, 2019)

Let's start playing with the dataset. First, we will look at statistics for the runtime attribute.

In [9]:
(df["runtime"].describe(), "Total hours of content created by Indian cinema: " + str(sum(df["runtime"].dropna().tolist())/60))
Out[9]:
(count    3411.000000
 mean      140.250660
 std        23.763995
 min         7.000000
 25%       126.000000
 50%       140.000000
 75%       155.000000
 max       321.000000
 Name: runtime, dtype: float64,
 'Total hours of content created by Indian cinema: 7973.25')

As we can see, the average runtime of a Bollywood movie is around 140 minutes.

Also, roughly 7,973 hours of content have been produced by Indian cinema. That's impressive!

In [10]:
year_mapping = dict()
year_counts = df["year_of_release"].dropna().value_counts()

for year in range(1950, 2020, 10):
    year_mapping[year] = 0
    for key, count in year_counts.items():
        # Skip the '\N' placeholder and only count releases that fall inside this decade
        if key != "\\N" and 0 <= int(key) - year < 10:
            year_mapping[year] += count

plt.pie(list(year_mapping.values()), labels=list(year_mapping.keys()), autopct="%.2f")
plt.show()

The above pie chart shows the percentage of movies released in each decade. Almost 75% of all movies were released in the last three decades; the film industry grew by roughly 5% every decade!

We will now look at the blockbusters produced by Indian cinema. For this, I will be using IMDB's weighted rating formula:

weighted rating (WR) = (v ÷ (v + m)) × R + (m ÷ (v + m)) × C
Where:
R = the movie's average rating (imdb_rating)
v = the number of votes for the movie (imdb_votes)
m = the minimum votes required to be listed in the Top Rated list (currently 25,000 on IMDB)
C = the mean rating across the whole dataset
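As a quick worked example with round, purely illustrative numbers (v = 25,000 votes, R = 8.5, m = 2,500, C = 5.9):

v, R, m, C = 25000, 8.5, 2500, 5.9
wr = (v / (v + m)) * R + (m / (v + m)) * C
print(round(wr, 3))  # 8.264 -- even a highly rated film gets pulled slightly towards the global mean C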

All the genres in the dataset are:

In [11]:
all_genres = set(genre for genres in df["genres"] for genre in genres)
all_genres.discard("\\N")
all_genres
Out[11]:
{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Thriller',
 'War',
 'Western'}
In [12]:
def imdb_rating(film, C, m):
    vote = film["imdb_votes"]
    rating = film["imdb_rating"]

    return ((vote / (vote + m)) * rating) + ((m / (vote + m)) * C)
In [13]:
def top_movies(df, percentile=0.85, limit=10, offset=0, genre=None):
    if genre is not None:
        top_df = df[df["genres"].apply(lambda genres: genre in genres)]
    else:
        top_df = df

    rating = df[df["imdb_rating"].notnull()]["imdb_rating"]
    votes = df[df["imdb_votes"].notnull()]["imdb_votes"]
    imdb_C = rating.mean()
    # For the dataset, percentile = 0.85 roughly corresponds to 2500
    # To match IMDB's parameter, percentile should be around 0.98026
    imdb_m = votes.quantile(percentile)

    top_df = top_df[top_df["imdb_votes"] >= imdb_m]

    pd.options.mode.chained_assignment = None
    top_df["rating"] = top_df.apply(imdb_rating, args=(imdb_C, imdb_m), axis=1)
    top_df = top_df.sort_values("rating", ascending=False)
    
    return top_df.iloc[offset: offset+limit]
In [14]:
top_movies(df)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
Out[14]:
original_title imdb_rating imdb_votes rating year_of_release imdb_id
3615 Anand 8.8 23953.0 8.527217 1971 tt0066763
916 3 Idiots 8.4 310481.0 8.380139 2009 tt1187043
1131 Taare Zameen Par 8.4 148498.0 8.358832 2007 tt0986264
354 Dangal 8.4 131338.0 8.353554 2016 tt5074352
146 Andhadhun 8.4 51615.0 8.285127 2018 tt8108198
3119 Gol Maal 8.6 16086.0 8.238628 1979 tt0079221
0 Uri: The Surgical Strike 8.4 35112.0 8.234721 2019 tt8291224
1209 Bacheha-Ye aseman 8.3 56293.0 8.198516 1997 tt0118849
1062 Black Friday 8.5 16761.0 8.164265 2004 tt0400234
1201 Rang De Basanti 8.2 103071.0 8.145850 2006 tt0405508

To find the top movies of a particular genre, simply pass the genre argument:

In [15]:
top_movies(df, genre="Mystery")[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
Out[15]:
original_title imdb_rating imdb_votes rating year_of_release imdb_id
426 Drishyam 8.2 58340.0 8.106037 2015 tt4430212
436 Talvar 8.2 26612.0 8.003625 2015 tt4934950
673 Kahaani 8.1 53181.0 8.001818 2012 tt1821480
568 Ugly 8.1 17483.0 7.826408 2013 tt2882328
15 Badla 7.9 15499.0 7.624009 2019 tt8130968
1095 Johnny Gaddaar 7.8 10612.0 7.440185 2007 tt1077248
393 Detective Byomkesh Bakshy! 7.6 14674.0 7.354394 2015 tt3447364
27 The Tashkent Files 8.0 5524.0 7.349695 2019 tt8108268
983 Ghajini 7.3 53086.0 7.237606 2008 tt1166100
1103 Manorama Six Feet Under 7.7 6013.0 7.175113 2007 tt0920464

You can get the next 10 films of the above list by simply changing the offset value!

In [16]:
top_movies(df, genre="Mystery", offset=10)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
Out[16]:
original_title imdb_rating imdb_votes rating year_of_release imdb_id
731 Talaash 7.2 36801.0 7.118107 2012 tt1787988
1350 Ek Hasina Thi 7.6 6000.0 7.103707 2004 tt0352314
1056 Bhool Bhulaiyaa 7.2 19469.0 7.053494 2007 tt0995031
1729 Kaun? 7.8 3071.0 6.952977 1999 tt0195002
296 Te3n 7.2 10451.0 6.951466 2016 tt4814290
376 Rahasya 7.6 3950.0 6.945926 2015 tt3337550
691 Shanghai 7.2 9143.0 6.923540 2012 tt2072227
1114 No Smoking 7.2 5817.0 6.812959 2007 tt0995740
835 Karthik Calling Karthik 7.0 9451.0 6.772480 2010 tt1373156
1417 Bhoot 6.5 3017.0 6.233540 2003 tt0341266

Let us now find the top writers and directors of all time. For this we will need three new datasets!

In [17]:
crew_movie_path = "./1950-2019/bollywood_crew.csv"
director_path = "./1950-2019/bollywood_crew_data.csv"
writer_path = "./1950-2019/bollywood_writers_data.csv"

crew_df = pd.read_csv(crew_movie_path)
director_df = pd.read_csv(director_path)
writer_df = pd.read_csv(writer_path)

crew_df.head()
Out[17]:
imdb_id directors writers
0 tt0042184 nm0025608 nm0025608|nm0324690
1 tt0042207 nm0490178 nm0161032|nm1879927
2 tt0042225 nm0707533 \N
3 tt0042233 nm0788880 nm0592578|nm0788880
4 tt0042380 nm0439074 nm1278450|nm0438022|nm1301772
In [18]:
director_df.head()
Out[18]:
crew_id name born_year death_year profession known_for
0 nm0001408 Shekhar Kapur 1945 \N actor|director|producer tt0240510|tt0414055|tt0109206|tt0127536
1 nm0002172 Mukul Anand 1951 1997 director|writer|producer tt0104607|tt0102201|tt0098999|tt0092026
2 nm0002411 Mani Kaul 1944 2011 director|writer|actor tt0207626|tt0066514|tt0070009|tt0102515
3 nm0003939 Vikramaditya Motwane 1976 \N producer|writer|director tt0238936|tt3322420|tt1639426|tt1327035
4 nm0004072 Kaizad Gustad 1968 \N director|writer|miscellaneous tt0330082|tt3309662|tt0168529|tt0819646
In [19]:
writer_df.head()
Out[19]:
crew_id name born_year death_year profession known_for
0 nm0000636 William Shakespeare 1564 1616 writer|soundtrack|miscellaneous tt3894536|tt5377528|tt5932378|tt8632012
1 nm0002005 Agatha Christie 1890 1976 writer|camera_department tt3402236|tt0029171|tt0051201|tt1349600
2 nm0002042 Charles Dickens 1812 1870 writer|soundtrack|miscellaneous tt0096061|tt0063385|tt0095776|tt0119223
3 nm0002172 Mukul Anand 1951 1997 director|writer|producer tt0104607|tt0102201|tt0098999|tt0092026
4 nm0002411 Mani Kaul 1944 2011 director|writer|actor tt0207626|tt0066514|tt0070009|tt0102515

As in the main dataset, attributes such as directors, writers, profession and known_for are separated by |. As this is getting repetitive, we will create a method to deal with such situations.

In [20]:
def split_attribute(df, attr, split_by="|"):
    return df[attr].apply(lambda row: row.split(split_by) if isinstance(row, str) else [])

to_clean_attr = [(crew_df, "directors"), (crew_df, "writers"), (director_df, "profession"), (director_df, "known_for"), (writer_df, "profession"), (writer_df, "known_for")]

for frame, attr in to_clean_attr:
    frame[attr] = split_attribute(frame, attr)
In [21]:
def get_crew_freq(crew_type="directors"):
    # As we want list of top directors and writers
    # We are only considering the top 250 movies generated by method top_movies()  
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    freq = dict()

    for imdb_id in top_imdb_ids:
        info = crew_df[crew_df["imdb_id"] == imdb_id][crew_type].all()
        for crew_id in info:
            if crew_type == "directors":
                name = director_df[director_df["crew_id"] == crew_id]["name"].all()
            else:
                name = writer_df[writer_df["crew_id"] == crew_id]["name"].all()
            freq[name] = freq.get(name, 0) + 1
    
    return freq

LIMIT = 20
director_freq = get_crew_freq(crew_type="directors")
director_freq = {k: v for k,v in sorted(director_freq.items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
In [22]:
plt.bar(director_freq.keys(), director_freq.values())
plt.xticks(rotation='vertical')
plt.show()

Looks like Anurag Kashyap is critically the most popular director (critical views correlate strongly with those of the IMDB audience). A similar histogram can be plotted for writers and actors; a quick sketch for actors follows below.
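Actors don't even need the crew datasets, since df["actors"] already holds lists of names. Here is a minimal sketch (not run here) that counts appearances across the same top 250 movies:

from collections import Counter

actor_freq = Counter(
    actor
    for actors in top_movies(df, limit=250)["actors"]
    for actor in actors
    if actor
)
top_actors = dict(actor_freq.most_common(LIMIT))

plt.bar(top_actors.keys(), top_actors.values())
plt.xticks(rotation="vertical")
plt.show()

And here is the list of top writers: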

In [23]:
writer_freq = {k: v for k,v in sorted(get_crew_freq(crew_type="writers").items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(writer_freq.keys(), writer_freq.values())
plt.xticks(rotation='vertical')
plt.show()

It is surprising that Anurag Kashyap tops this list as well! Let's try something else: we will look at the correlation between top directors and top writers, the perfect recipe for success!

i.e. we are finding which top directors and writers have collaborated with each other.

In [24]:
top_directors = list(director_freq.keys())
top_writers = list(writer_freq.keys())

crew_corr = np.zeros(shape=(len(top_writers), len(top_directors)))  # rows: writers, columns: directors

def fill_crew_corr(crew_corr, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()

    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()        

        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
 
        for director in directors:
            for writer in writers:
                if writer in top_writers and director in top_directors:
                    crew_corr[top_writers.index(writer)][top_directors.index(director)] += 1
    
    return crew_corr

crew_corr = fill_crew_corr(crew_corr, top_directors, top_writers)
In [25]:
cmap = sns.light_palette("Blue", as_cmap=True)

sns.heatmap(crew_corr, xticklabels=top_directors, yticklabels=top_writers, linewidth=0.5, cmap=cmap)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e8bb38240>

The heatmap is mostly empty, but we can observe that some popular directors and writers collaborate frequently! Let's take this analysis further by creating a 4D model comparing the relationships between top directors, writers and actors. To create it, we need the frequency of each (director, writer, actor) triple (per movie).

In [26]:
relation_dwa = dict() # This dictionary will hold the frequency count of {(director, writer, actor): freq, ..}

def find_dwa_reln(relation_dwa, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()

    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()

        all_actors = df[df["imdb_id"] == imdb_id]["actors"].all()
        
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]

        for director in directors:
            for writer in writers:
                for actor in [actor for actor in all_actors if actor != ""]:
                    key = (director, writer, actor)
                    relation_dwa[key] = relation_dwa.get(key, 0) + 1
         
    return relation_dwa

relation_dwa = {k: v for k,v in sorted(find_dwa_reln(relation_dwa, top_directors, top_writers).items(), key=lambda x: x[1], reverse=True)[:40]}
In [27]:
Xuniques, X = np.unique([rel[0] for rel in relation_dwa], return_inverse=True)
Yuniques, Y = np.unique([rel[1] for rel in relation_dwa], return_inverse=True)
Zuniques, Z = np.unique([rel[2] for rel in relation_dwa], return_inverse=True)

fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection='3d')

img = ax.scatter(X, Y, Z, c=[relation_dwa[rel] for rel in relation_dwa], cmap=cmap)
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
       yticks=range(len(Yuniques)), yticklabels=Yuniques,
       zticks=range(len(Zuniques)), zticklabels=Zuniques)

ax.set_xlabel("directors"); ax.set_ylabel("writers"); ax.set_zlabel("actors");
ax.zaxis.labelpad = ax.xaxis.labelpad = ax.yaxis.labelpad = 50
plt.rcParams["axes.labelweight"] = "bold"

fig.colorbar(img)
plt.show()

Just like the heatmap, this model appears sparse! In the model above, the color scheme (c) acts as the fourth dimension (showing frequency).

Content-Based Filtering

We will now perform content-based filtering using actors, director, writer, title, genres, story, summary and tagline (if any). The strategy is a simple one: we compute cosine similarity over the attributes mentioned above (after removing stopwords and doing some minor cleaning). Since fetching director and writer information keeps coming up, we will write a method for it first.

In [28]:
def get_director_writer(df, person="director"):
    if person == "director":
        d_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["directors"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        return directors
    elif person == "writer":
        w_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["writers"].all()
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        return writers
In [29]:
def content_based(title, df, limit, offset):
    df["desc"] = df["original_title"].fillna("").apply(lambda x: x.lower()) + df["story"].fillna("").apply(lambda x: x.lower()) + df["summary"].fillna("").apply(lambda x: x.lower()) + df["tagline"].fillna("").apply(lambda x: x.lower()) + df["genres"].apply(lambda x: (" ".join(x)).lower())

    df["crew"] = df["actors"].fillna("").apply(lambda actors: (" ".join([actor.replace(" ", "") for actor in actors[:5]])).lower())

    #Let's try guessing the title index!
    try:
        index = df.index[df["original_title"] == title][0]
    except:
        try:
            title = get_close_matches(title, [movie for movie in df["original_title"].tolist()])[0]
            index = df.index[df["original_title"] == title][0]
        except:
            return None

    df["directors"] = df.apply(get_director_writer, args=("director",), axis=1)
    df["writers"] = df.apply(get_director_writer, args=("writer",), axis=1)

    df["crew"] += " "+df["directors"].fillna("").apply(lambda direcs: (" ".join([direc.replace(" ", "") for direc in DIRECTOR_WT*direcs if isinstance(direc, bool) == False])).lower())
    df["crew"] += " "+df["writers"].fillna("").apply(lambda writers: (" ".join([writer.replace(" ", "") for writer in WRITER_WT*writers if isinstance(writer, bool) == False])).lower())
    vectorizer = CountVectorizer(analyzer="word", stop_words=stopwords.words("english"), ngram_range=(1, 2))
    count_mat = vectorizer.fit_transform(df["desc"] + df["crew"])
    cosine_sim = cosine_similarity(count_mat, count_mat)
        
    rec_movie = cosine_sim[index]
    ids = rec_movie.argsort()[::-1][1 : limit]
    return top_movies(df.iloc[ids], percentile=0.2)[["original_title", "genres", "runtime", "imdb_rating", "imdb_votes", "year_of_release", "wins_nominations", "release_date"]]

# You can emphasize the director's and writer's significance by increasing their weights
DIRECTOR_WT = 1
WRITER_WT = 1
content_df = content_based("MS Dhoni An untold story", df, 10, 0)
content_df["genres"] = content_df["genres"].apply(lambda x: ", ".join(x))
content_df
Out[29]:
original_title genres runtime imdb_rating imdb_votes year_of_release wins_nominations release_date
1070 Chak De! India Drama, Family, Sport 153.0 8.2 68421.0 2007 28 wins & 13 nominations 10 August 2007 (India)
1259 Iqbal Drama, Sport 132.0 8.1 14864.0 2005 9 wins & 13 nominations 20 January 2006 (USA)
692 Ferrari Ki Sawaari Comedy, Drama, Family 140.0 6.4 4693.0 2012 4 nominations 15 June 2012 (India)
1126 Say Salaam India: 'Let's Bring the Cup Home' Drama, Sport NaN 6.4 84.0 2007 NaN NaN
749 Patiala House Drama, Sport 140.0 5.6 8301.0 2011 NaN 11 February 2011 (India)
1044 Meerabai Not Out Comedy, Drama, Romance NaN 4.0 86.0 2008 NaN NaN
960 Dil Bole Hadippa! Comedy, Drama, Sport 148.0 4.5 3528.0 2009 NaN 18 September 2009 (India)

As you can see, our approach is able to recommend cricket-based films for MS Dhoni: An Untold Story. We pass our results to the top_movies method we wrote earlier with a 0.2 percentile to filter out the worst-reviewed films. I have also added difflib's get_close_matches to find the title closest to the one entered by the user, which is helpful when one doesn't know the exact movie title!

Again, this is a very basic approach and many sophisticated approaches exist! You can also use tags.csv, genome_tags.csv and genome_scores.csv. From TIMDB's documentation:

The genome-scores were available for very few movies (64 in total) from "Full" MovieLens dataset.

Based on my observation, tags.csv has tags for 504 movies! Therefore, if the movie is not very popular, your best bet would be to generate tags from summary, story and tagline.
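As a hedged sketch of that fallback (the tag count and the unigram CountVectorizer are arbitrary choices of mine, and generate_tags is a hypothetical helper, not part of the dataset's tooling):

def generate_tags(row, top_n=5):
    # Build crude tags from summary, story and tagline by keeping the most
    # frequent non-stopword unigrams. Purely illustrative.
    text = " ".join(str(row[col]) for col in ("summary", "story", "tagline") if pd.notnull(row[col]))
    if not text.strip():
        return []
    vec = CountVectorizer(stop_words=stopwords.words("english"))
    try:
        counts = vec.fit_transform([text.lower()])
    except ValueError:  # the text contained only stopwords
        return []
    terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # vocabulary ordered by column index
    freqs = counts.toarray().ravel()
    return [terms[i] for i in freqs.argsort()[::-1][:top_n]]

generate_tags(df.iloc[0])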

Collaborative Filtering

Let us conclude this exploration by performing basic collaborative filtering using ./collaborative/ratings.csv

In [30]:
def collaborative(m_id):
    df_coll = pd.read_csv("./collaborative/ratings.csv")
    df_titles = pd.read_csv("./collaborative/titles.csv")

    reader = Reader(rating_scale=(1, 5))
    sim_options = {"name": "pearson_baseline", "user_based": False}
    data = Dataset.load_from_df(df_coll[["user_id", "movie_id", "rating"]], reader)

    trainset = data.build_full_trainset()
    model = KNNBaseline(sim_options=sim_options)
    model.fit(trainset)

    inn_id = model.trainset.to_inner_iid(m_id)
    inn_id_neigh = model.get_neighbors(inn_id, k=10)

    titles = list()
    for movielens_id in inn_id_neigh:
        titles += [df_titles[df_titles["movie_id"] == model.trainset.to_raw_iid(movielens_id)]["title"].all()]
    
    return titles

# Here 167392 is the movie_id for the movie: Dangal (2016)
collaborative(167392)
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Out[30]:
['3 Idiots (2009)',
 'Chak De India! (2007)',
 'Like Stars on Earth (Taare Zameen Par) (2007)',
 'Paint It Yellow (Rang De Basanti) (2006)',
 'Children of Heaven  The (Bacheha-Ye Aseman) (1997)',
 'Munna bhai M.B.B.S. (2003)',
 'Queen (2014)',
 'Andaz Apna Apna (1994)',
 'Swades: We  the People (Our Country) (2004)',
 'Sholay (1975)']

As you can see above, we are being recommended movies starring Aamir Khan, along with sports films and movies built around the theme of struggle. You can use ./collaborative/links.csv to convert a movie_id to its equivalent imdb_id and look up its statistics in ./1950-2019/bollywood_full.csv. From the documentation:

The leading zeros are removed for imdb_id, which are not removed for the rest of the database(i.e for "1950-1989", "1990-2009", "2010-2019" and "1950-2019").
Example: in links.csv if imdb_id is 123456, it can be tt0123456 in imdb_id col in the datasets in "1950-1989", "1990-2009" and "2010-2019".
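A minimal sketch of that conversion (assuming links.csv exposes movie_id and imdb_id columns; the column names are my assumption, so adjust them if the file differs):

def movielens_to_imdb(movie_id, links_path="./collaborative/links.csv"):
    # Pad the numeric id back to seven digits and prefix it with "tt",
    # matching the imdb_id format used by bollywood_full.csv.
    links = pd.read_csv(links_path)
    raw_id = links.loc[links["movie_id"] == movie_id, "imdb_id"].iloc[0]
    return "tt" + str(int(raw_id)).zfill(7)

# e.g. movielens_to_imdb(167392) should map the Dangal example above back to
# its row in the main dataframe: df[df["imdb_id"] == movielens_to_imdb(167392)]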

So this is it! Hopefully I have given you a good overview of what this dataset looks like and what basic operations you can perform with it. Try caching the models and dataframes to serve quick recommendations. You can also try creating a hybrid approach that combines the results of the content-based and collaborative filtering approaches; a rough sketch follows. If you have any questions, you can contact me.
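A rough sketch of such a hybrid (the rank-based scoring and the 0.5 weight are arbitrary choices of mine; in practice you would key both lists on imdb_id via links.csv, since the two sources spell titles differently):

def blend_rankings(content_ranked, collab_ranked, limit=10, content_weight=0.5):
    # Blend two ranked lists of movie identifiers into a single ranking.
    # Items near the top of either list score higher; items present in both get boosted.
    scores = dict()
    for rank, key in enumerate(content_ranked):
        scores[key] = scores.get(key, 0) + content_weight * (len(content_ranked) - rank)
    for rank, key in enumerate(collab_ranked):
        scores[key] = scores.get(key, 0) + (1 - content_weight) * (len(collab_ranked) - rank)
    return sorted(scores, key=scores.get, reverse=True)[:limit]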

Keep exploring! 😄