The complete database can be found on GitHub.
This notebook was created by the author of this dataset.
The database contains a list of Indian movies and their metadata, with release years ranging from 1950 to 2019.
As the database is big, we will start by analyzing the primary dataset: ./1950-2019/bollywood_full.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from difflib import get_close_matches
from surprise import Reader, Dataset, KNNBaseline
file = "./1950-2019/bollywood_full.csv"
df = pd.read_csv(file)
df.head()
df.describe(include="all")
df.shape
Looking at the above information, a few elementary conclusions can be drawn:
- For is_adult, the value is the same (0.0) for all movies, hence we can drop that attribute.
- actors and genres are separated by |.
- tagline and wins_nominations have many missing values. (Less information pertaining to wins_nominations is obvious, as not every film can win awards or get nominated 😄)
- runtime is of dtype object, as null values are entered in character form ('\N'). We need to convert it to float64 type.

A note from the documentation (for ./1950-2019/bollywood_full.csv):

(This dataset is merged ON "imdb_id", hence if you find a niche movie missing, see the respective year's directory.)

i.e. if a movie released in 1994 is missing from this dataset, check the directory ./1990-2009/.
df = df.drop(['is_adult'], axis=1)
df["actors"] = df["actors"].apply(lambda actors: actors.strip().split("|") if isinstance(actors, str) else [])
df["genres"] = df["genres"].apply(lambda genres: genres.strip().split("|") if isinstance(genres, str) else [])
df["runtime"] = pd.to_numeric(df["runtime"].replace("\\N", np.nan))
df["runtime"].dtype
df[["actors", "genres"]].head()
Before we continue, I've got to know which film scored a 9.4 on IMDB 😁
df[df["imdb_rating"] == 9.4][["original_title", "year_of_release", "imdb_id"]]
How? What...? Apparently this film really does have a rating of 9.4 (as of December 27, 2019)!
Let's start playing with the dataset. First, we will find statistics about the runtime attribute.
(df["runtime"].describe(), "Total hours of content created by Indian cinema: " + str(sum(df["runtime"].dropna().tolist())/60))
As we can see, the average runtime of a Bollywood movie is around 140 minutes. Also, roughly 7973 hours of content has been produced by Indian cinema. That's impressive!
year_counts = df["year_of_release"].dropna().value_counts()
year_mapping = {year: 0 for year in range(1950, 2020, 10)}
for key, count in year_counts.items():
    # Skip the '\N' placeholder and bucket each release year into its decade
    if key != "\\N" and 1950 <= int(key) < 2020:
        year_mapping[(int(key) // 10) * 10] += count
plt.pie(list(year_mapping.values()), labels=list(year_mapping.keys()), autopct="%.2f")
plt.show()
The above pie chart shows the percentage of movies released, categorized by decade. Almost 75% of all movies were released in the last three decades, and the film industry saw roughly 5% growth every decade!
We will now look at the blockbusters produced by Indian cinema. For this, I will be using IMDB's rating algorithm:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
Where:
R = average for the movie (mean) = (rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the Top Rated list (currently 25,000)
C = the mean vote across the whole report
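To make the formula concrete, here is a quick worked example with made-up numbers: a hypothetical film rated 9.0 from 50,000 votes, with m = 25,000 and C = 6.0.

# Hypothetical numbers, purely to illustrate the weighted-rating formula
R, v = 9.0, 50000    # movie's mean rating and vote count
m, C = 25000, 6.0    # vote threshold and report-wide mean rating
WR = (v / (v + m)) * R + (m / (v + m)) * C
print(WR)  # (2/3)*9.0 + (1/3)*6.0 = 8.0 -- scarce votes pull WR toward C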
All the genres in the dataset are:
all_genres = set(genre for genres in df["genres"] for genre in genres)
all_genres.discard("\\N")  # discard() won't raise if the placeholder is absent
all_genres
def imdb_rating(film, C, m):
    vote = film["imdb_votes"]
    rating = film["imdb_rating"]
    return ((vote / (vote + m)) * rating) + ((m / (vote + m)) * C)

def top_movies(df, percentile=0.85, limit=10, offset=0, genre=None):
    if genre is not None:
        top_df = df[df["genres"].apply(lambda genres: genre in genres)]
    else:
        top_df = df
    rating = df[df["imdb_rating"].notnull()]["imdb_rating"]
    votes = df[df["imdb_votes"].notnull()]["imdb_votes"]
    imdb_C = rating.mean()
    # For the dataset, percentile = 0.85 roughly corresponds to 2500 votes
    # To match IMDB's parameter (m = 25,000), percentile should be around 0.98026
    imdb_m = votes.quantile(percentile)
    # Work on a copy so the column assignment below doesn't trigger chained-assignment warnings
    top_df = top_df[top_df["imdb_votes"] >= imdb_m].copy()
    top_df["rating"] = top_df.apply(imdb_rating, args=(imdb_C, imdb_m), axis=1)
    top_df = top_df.sort_values("rating", ascending=False)
    return top_df.iloc[offset: offset + limit]
top_movies(df)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
To find the top movies by genre, simply fill the genre attribute:
top_movies(df, genre="Mystery")[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
You can get the next 10 films of the above list by simply changing the offset value!
top_movies(df, genre="Mystery", offset=10)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
Let us now find the top writers and directors of all time. We will need three new datasets!
crew_movie_path = "./1950-2019/bollywood_crew.csv"
director_path = "./1950-2019/bollywood_crew_data.csv"
writer_path = "./1950-2019/bollywood_writers_data.csv"
crew_df = pd.read_csv(crew_movie_path)
director_df = pd.read_csv(director_path)
writer_df = pd.read_csv(writer_path)
crew_df.head()
director_df.head()
writer_df.head()
As in the main dataset, attributes such as directors, writers, profession and known_for are separated by |. As this is getting repetitive, we will create a method to deal with such situations.
def split_attribute(df, attr, split_by="|"):
    return df[attr].apply(lambda row: row.split(split_by) if isinstance(row, str) else [])

to_clean_attr = [(crew_df, "directors"), (crew_df, "writers"), (director_df, "profession"),
                 (director_df, "known_for"), (writer_df, "profession"), (writer_df, "known_for")]
for frame, attr in to_clean_attr:
    frame[attr] = split_attribute(frame, attr)
def get_crew_freq(crew_type="directors"):
    # As we want a list of top directors and writers,
    # we only consider the top 250 movies returned by top_movies()
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    freq = dict()
    for imdb_id in top_imdb_ids:
        # .all() collapses the single matching row to its cell value (a list of crew ids)
        info = crew_df[crew_df["imdb_id"] == imdb_id][crew_type].all()
        for crew_id in info:
            if crew_type == "directors":
                name = director_df[director_df["crew_id"] == crew_id]["name"].all()
            else:
                name = writer_df[writer_df["crew_id"] == crew_id]["name"].all()
            freq[name] = freq.get(name, 0) + 1
    return freq
LIMIT = 20
director_freq = get_crew_freq(crew_type="directors")
director_freq = {k: v for k,v in sorted(director_freq.items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(director_freq.keys(), director_freq.values())
plt.xticks(rotation='vertical')
plt.show()
Looks like Anurag Kashyap is critically the most popular director (these views correlate strongly with those of the IMDB audience). A similar histogram can be plotted for writers and actors (a sketch for actors follows the writers plot below). Here is the list of top writers:
writer_freq = {k: v for k,v in sorted(get_crew_freq(crew_type="writers").items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(writer_freq.keys(), writer_freq.values())
plt.xticks(rotation='vertical')
plt.show()
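The actors histogram mentioned above is not shown here, but a minimal sketch (new helper code, reusing the actors column of the main dataframe and the same top-250 slice) could look like this:

# Sketch: actor frequency across the top 250 movies, analogous to get_crew_freq()
actor_freq = dict()
for actors in top_movies(df, limit=250)["actors"]:
    for actor in actors:
        if actor:  # skip empty strings
            actor_freq[actor] = actor_freq.get(actor, 0) + 1
actor_freq = {k: v for k, v in sorted(actor_freq.items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(actor_freq.keys(), actor_freq.values())
plt.xticks(rotation="vertical")
plt.show()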
It is surprising that Anurag Kashyap tops the writers list as well! Let's try something else: we will show the correlation between top directors and top writers, the perfect recipe for success! That is, we are finding which top directors and writers have collaborated with each other.
top_directors = list(director_freq.keys())
top_writers = list(writer_freq.keys())
crew_corr = np.zeros(shape=(len(top_directors), len(top_writers)))
def fill_crew_corr(crew_corr, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        for director in directors:
            for writer in writers:
                if writer in top_writers and director in top_directors:
                    crew_corr[top_writers.index(writer)][top_directors.index(director)] += 1
    return crew_corr
crew_corr = fill_crew_corr(crew_corr, top_directors, top_writers)
cmap = sns.light_palette("Blue", as_cmap=True)
sns.heatmap(crew_corr, xticklabels=top_directors, yticklabels=top_writers, linewidth=0.5, cmap=cmap)
The heatmap is mostly empty, but as we can observe, some popular directors and writers collaborate frequently! Let's take this analysis further by creating a 4D model comparing the relationships among top actors, writers and directors. To create the 4D model, we will need the frequency of each (director, writer, actor) combination across the top movies.
relation_dwa = dict() # This dictionary will hold the frequency count of {(director, writer, actor): freq, ..}
def find_dwa_reln(relation_dwa, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()
        all_actors = df[df["imdb_id"] == imdb_id]["actors"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        for director in directors:
            for writer in writers:
                for actor in [actor for actor in all_actors if actor != ""]:
                    relation_dwa[(director, writer, actor)] = relation_dwa.get((director, writer, actor), 0) + 1
    return relation_dwa
relation_dwa = {k: v for k,v in sorted(find_dwa_reln(relation_dwa, top_directors, top_writers).items(), key=lambda x: x[1], reverse=True)[:40]}
Xuniques, X = np.unique([rel[0] for rel in relation_dwa], return_inverse=True)
Yuniques, Y = np.unique([rel[1] for rel in relation_dwa], return_inverse=True)
Zuniques, Z = np.unique([rel[2] for rel in relation_dwa], return_inverse=True)
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection='3d')
img = ax.scatter(X, Y, Z, c=[relation_dwa[rel] for rel in relation_dwa], cmap=cmap)
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
yticks=range(len(Yuniques)), yticklabels=Yuniques,
zticks=range(len(Zuniques)), zticklabels=Zuniques)
ax.set_xlabel("directors"); ax.set_ylabel("writers"); ax.set_zlabel("actors");
ax.zaxis.labelpad = ax.xaxis.labelpad = ax.yaxis.labelpad = 50
plt.rcParams["axes.labelweight"] = "bold"
fig.colorbar(img)
plt.show()
Just like the heatmap, this model appears sparse! In the model above, the color scheme (c) acts as the fourth dimension, showing frequency.
We will now perform content-based filtering using actors, director, writer, title, genres, story, summary and tagline (if any). The strategy we will follow is a trivial one: we compute cosine similarity over the attributes mentioned above, after removing stopwords and doing other minor cleaning. Getting director and writer info is repeated frequently, so we will write a method for it right after the short illustration below.
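First, a minimal self-contained illustration of the vectorize-then-compare mechanic (the three example strings are made up):

# Toy example: bag-of-words vectors compared with cosine similarity
docs = ["cricket biopic sports drama", "cricket sports film", "romantic comedy wedding"]
toy_mat = CountVectorizer().fit_transform(docs)
print(cosine_similarity(toy_mat, toy_mat)[0])  # doc 0 is similar to doc 1, not to doc 2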
def get_director_writer(df, person="director"):
    if person == "director":
        d_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["directors"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        return directors
    elif person == "writer":
        w_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["writers"].all()
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        return writers
def content_based(title, df, limit, offset):
    df["desc"] = (df["original_title"].fillna("").apply(lambda x: x.lower())
                  + df["story"].fillna("").apply(lambda x: x.lower())
                  + df["summary"].fillna("").apply(lambda x: x.lower())
                  + df["tagline"].fillna("").apply(lambda x: x.lower())
                  + df["genres"].apply(lambda x: (" ".join(x)).lower()))
    df["crew"] = df["actors"].apply(lambda actors: (" ".join([actor.replace(" ", "") for actor in actors[:5]])).lower())
    # Let's try guessing the title index!
    try:
        index = df.index[df["original_title"] == title][0]
    except IndexError:
        try:
            title = get_close_matches(title, df["original_title"].tolist())[0]
            index = df.index[df["original_title"] == title][0]
        except IndexError:
            return None
    df["directors"] = df.apply(get_director_writer, args=("director",), axis=1)
    df["writers"] = df.apply(get_director_writer, args=("writer",), axis=1)
    # Repeating names DIRECTOR_WT / WRITER_WT times weights them in the bag-of-words;
    # the bool check skips entries where .all() found no matching name
    df["crew"] += " " + df["directors"].apply(lambda direcs: (" ".join([direc.replace(" ", "") for direc in DIRECTOR_WT * direcs if not isinstance(direc, bool)])).lower())
    df["crew"] += " " + df["writers"].apply(lambda writers: (" ".join([writer.replace(" ", "") for writer in WRITER_WT * writers if not isinstance(writer, bool)])).lower())
    # Bag-of-words counts (unigrams + bigrams) over description + crew text
    count_vec = CountVectorizer(analyzer="word", stop_words=stopwords.words("english"), ngram_range=(1, 2))
    count_mat = count_vec.fit_transform(df["desc"] + df["crew"])
    cosine_sim = cosine_similarity(count_mat, count_mat)
    rec_movie = cosine_sim[index]
    ids = rec_movie.argsort()[::-1][1: limit]
    return top_movies(df.iloc[ids], percentile=0.2)[["original_title", "genres", "runtime", "imdb_rating", "imdb_votes", "year_of_release", "wins_nominations", "release_date"]]
# You can increase the significance of directors and writers by raising their weights
DIRECTOR_WT = 1
WRITER_WT = 1
content_df = content_based("MS Dhoni An untold story", df, 10, 0)
content_df["genres"] = content_df["genres"].apply(lambda x: ", ".join(x))
content_df
As you can see, our approach is able to recommend cricket-based films for MS Dhoni: An Untold Story. We pass our results to the top_movies method we wrote earlier with a 0.2 percentile to remove the worst-reviewed films. I have also added difflib's get_close_matches method to find the movie closest to the title entered by the user. This is helpful in situations where one doesn't know the exact movie title!
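For instance, the rough query used above already goes through this matching step inside content_based; you can also call the matcher directly (the exact stored title string may differ from what you type):

# Resolve a rough, user-typed title to the closest title in the dataset
get_close_matches("MS Dhoni An untold story", df["original_title"].tolist())[0]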
Again, this is a very basic approach and many sophisticated approaches exist! You can also use tags.csv, genome_tags.csv and genome_scores.csv. From TIMDB's documentation:
The genome-scores were available for very few movies (64 in total) from "Full" MovieLens dataset.
Based on my observation, tags.csv has tags for 504 movies! Therefore, if the movie is not very popular, your best bet would be to generate tags from the summary, story and tagline, as sketched below.
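One way to generate such tags (my own sketch, using scikit-learn's TfidfVectorizer, which is not part of this dataset's tooling) is to keep each movie's highest-weighted terms as pseudo-tags:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: derive pseudo-tags from summary + story + tagline via TF-IDF weights
texts = df["summary"].fillna("") + " " + df["story"].fillna("") + " " + df["tagline"].fillna("")
tag_vec = TfidfVectorizer(stop_words=stopwords.words("english"), max_features=20000)
tag_mat = tag_vec.fit_transform(texts)
vocab = np.array(tag_vec.get_feature_names_out())  # requires scikit-learn >= 1.0

def pseudo_tags(index, n=5):
    # The n highest-weighted terms for one movie act as its tags
    row = tag_mat[index].toarray().ravel()
    return vocab[row.argsort()[::-1][:n]].tolist()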
Let us conclude this exploration by performing basic collaborative filtering using ./collaborative/ratings.csv
def collaborative(m_id):
    df_coll = pd.read_csv("./collaborative/ratings.csv")
    df_titles = pd.read_csv("./collaborative/titles.csv")
    reader = Reader(rating_scale=(1, 5))
    sim_options = {"name": "pearson_baseline", "user_based": False}
    data = Dataset.load_from_df(df_coll[["user_id", "movie_id", "rating"]], reader)
    trainset = data.build_full_trainset()
    model = KNNBaseline(sim_options=sim_options)
    model.fit(trainset)
    inn_id = model.trainset.to_inner_iid(m_id)
    inn_id_neigh = model.get_neighbors(inn_id, k=10)
    titles = list()
    for movielens_id in inn_id_neigh:
        titles += [df_titles[df_titles["movie_id"] == model.trainset.to_raw_iid(movielens_id)]["title"].all()]
    return titles
# Here 167392 is the movie_id for the movie: Dangal (2016)
collaborative(167392)
From the above you can see that we are being recommended movies featuring Aamir Khan; sports films and films with a theme of struggle are also popular. You can use ./collaborative/links.csv to convert a movie_id to its equivalent imdb_id and get the statistics for the same from ./1950-2019/bollywood_full.csv. From the documentation:
The leading zeros are removed for imdb_id, which are not removed for the rest of the database(i.e for "1950-1989", "1990-2009", "2010-2019" and "1950-2019").
Example: in links.csv if imdb_id is 123456, it can be tt0123456 in imdb_id col in the datasets in "1950-1989", "1990-2009" and "2010-2019".
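A minimal sketch of that conversion (I am assuming links.csv carries movie_id and imdb_id columns; the zero-padding to seven digits follows the note above):

# Sketch: MovieLens movie_id -> main dataset imdb_id, restoring leading zeros
links_df = pd.read_csv("./collaborative/links.csv")

def to_imdb_id(movie_id):
    raw = links_df.loc[links_df["movie_id"] == movie_id, "imdb_id"].iloc[0]
    return "tt" + str(raw).zfill(7)  # e.g. 123456 -> 'tt0123456'

df[df["imdb_id"] == to_imdb_id(167392)][["original_title", "imdb_rating", "imdb_votes"]]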
So this is it! Hopefully I have given you a good overview of what this dataset looks like and the basic operations you can perform with it. Try caching the models and dataframes to make recommendations quick (a sketch of this idea follows below). You can also try a hybrid approach that combines the results of the content-based and collaborative filtering methods. If you have any questions, you can contact me.
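On the caching point, a minimal sketch: fit the Surprise model once and reuse it across calls (functools.lru_cache is my choice here; any memoization works).

from functools import lru_cache

@lru_cache(maxsize=1)
def get_collab_model():
    # Fit the KNN model a single time; later calls return the cached object
    df_coll = pd.read_csv("./collaborative/ratings.csv")
    data = Dataset.load_from_df(df_coll[["user_id", "movie_id", "rating"]], Reader(rating_scale=(1, 5)))
    model = KNNBaseline(sim_options={"name": "pearson_baseline", "user_based": False})
    model.fit(data.build_full_trainset())
    return model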
Keep exploring! 😄