The complete database can be found on GitHub.
This notebook was created by the author of this dataset.
The database contains a list of Indian movies and their metadata, with release years ranging from 1950 to 2019.
As the database is big, we will start by analyzing the primary dataset: ./1950-2019/bollywood_full.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from difflib import get_close_matches
from surprise import Reader, Dataset, KNNBaseline
file = "./1950-2019/bollywood_full.csv"
df = pd.read_csv(file)
df.head()
df.describe(include="all")
df.shape
Looking at the above information, a few elementary conclusions can be drawn:
- For is_adult, the value is the same (0.0) for all movies, hence we can drop that attribute.
- actors and genres are separated by |.
- tagline and wins_nominations have many missing values. (Less information pertaining to wins_nominations is obvious, as not every film can win awards or get nominated 😄)
- runtime is of dtype object, as null values are entered in character form ('\N'). We need to convert it to float64 type.

A note from the documentation (for ./1950-2019/bollywood_full.csv):

(This dataset is merged ON "imdb_id", hence if you find a niche movie missing, see the respective year's directory.)

i.e. if a movie released in 1994 is missing from this dataset, check the directory ./1990-2009/.
df = df.drop(['is_adult'], axis=1)
df["actors"] = df["actors"].apply(lambda actors: actors.strip().split("|") if isinstance(actors, str) else [])
df["genres"] = df["genres"].apply(lambda genres: genres.strip().split("|") if isinstance(genres, str) else [])
df["runtime"] = pd.to_numeric(df["runtime"].replace("\\N", np.nan))
df["runtime"].dtype
df[["actors", "genres"]].head()
Before we continue, I've got to know which film scored a 9.4 on IMDB 😁
df[df["imdb_rating"] == 9.4][["original_title", "year_of_release", "imdb_id"]]
How? What...? Apparently this film really does have a rating of 9.4 (as of December 27, 2019)!
Let's start playing with the dataset. First, we will find statistics about the runtime attribute.
(df["runtime"].describe(), "Total hours of content created by Indian cinema: " + str(sum(df["runtime"].dropna().tolist())/60))
As we can see, the average runtime of a Bollywood movie is around 140 minutes. Also, roughly 7973 hours of content has been produced by Indian cinema. That's impressive!
year_counts = df["year_of_release"].dropna().value_counts()
year_mapping = {year: 0 for year in range(1950, 2020, 10)}
for key, count in year_counts.items():
    # Skip the '\N' placeholder and bucket each release year into its decade
    if key != "\\N" and 1950 <= int(key) < 2020:
        year_mapping[(int(key) // 10) * 10] += count
plt.pie(list(year_mapping.values()), labels=list(year_mapping.keys()), autopct="%.2f")
plt.show()
The above pie chart shows the percentage of movies released, categorized by decade. Almost 75% of all movies were released in the last three decades, and the film industry saw roughly 5% growth every decade!
We will now look at the blockbusters produced by Indian cinema. For this, I will be using IMDB's rating algorithm:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
Where:
R = average for the movie (mean) = (rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the Top Rated list (currently 25,000)
C = the mean vote across the whole report
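To make the formula concrete, here is a quick worked example with made-up numbers: a hypothetical film rated 9.0 from 50,000 votes, with m = 25,000 and C = 6.0.

# Hypothetical numbers, purely to illustrate the weighted-rating formula
R, v = 9.0, 50000    # movie's mean rating and vote count
m, C = 25000, 6.0    # vote threshold and report-wide mean rating
WR = (v / (v + m)) * R + (m / (v + m)) * C
print(WR)  # (2/3)*9.0 + (1/3)*6.0 = 8.0 -- scarce votes pull WR toward C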
All the genres in the dataset are:
all_genres = set(genre for genres in df["genres"] for genre in genres)
all_genres.discard("\\N")  # discard() won't raise if the placeholder is absent
all_genres
def imdb_rating(film, C, m):
    vote = film["imdb_votes"]
    rating = film["imdb_rating"]
    return ((vote / (vote + m)) * rating) + ((m / (vote + m)) * C)

def top_movies(df, percentile=0.85, limit=10, offset=0, genre=None):
    if genre is not None:
        top_df = df[df["genres"].apply(lambda genres: genre in genres)]
    else:
        top_df = df
    rating = df[df["imdb_rating"].notnull()]["imdb_rating"]
    votes = df[df["imdb_votes"].notnull()]["imdb_votes"]
    imdb_C = rating.mean()
    # For the dataset, percentile = 0.85 roughly corresponds to 2500 votes
    # To match IMDB's parameter (m = 25,000), percentile should be around 0.98026
    imdb_m = votes.quantile(percentile)
    # Work on a copy so the column assignment below doesn't trigger chained-assignment warnings
    top_df = top_df[top_df["imdb_votes"] >= imdb_m].copy()
    top_df["rating"] = top_df.apply(imdb_rating, args=(imdb_C, imdb_m), axis=1)
    top_df = top_df.sort_values("rating", ascending=False)
    return top_df.iloc[offset: offset + limit]
top_movies(df)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
To find the top movies by genre, simply fill the genre attribute:
top_movies(df, genre="Mystery")[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
You can get the next 10 films of the above list by simply changing the offset value!
top_movies(df, genre="Mystery", offset=10)[["original_title", "imdb_rating", "imdb_votes", "rating", "year_of_release", "imdb_id"]]
Let us now find the top writers and directors of all time. We will need three new datasets!
crew_movie_path = "./1950-2019/bollywood_crew.csv"
director_path = "./1950-2019/bollywood_crew_data.csv"
writer_path = "./1950-2019/bollywood_writers_data.csv"
crew_df = pd.read_csv(crew_movie_path)
director_df = pd.read_csv(director_path)
writer_df = pd.read_csv(writer_path)
crew_df.head()
director_df.head()
writer_df.head()
As in the main dataset, attributes such as directors, writers, profession and known_for are separated by |. As this is getting repetitive, we will create a method to deal with such situations.
def split_attribute(df, attr, split_by="|"):
    return df[attr].apply(lambda row: row.split(split_by) if isinstance(row, str) else [])

to_clean_attr = [(crew_df, "directors"), (crew_df, "writers"), (director_df, "profession"),
                 (director_df, "known_for"), (writer_df, "profession"), (writer_df, "known_for")]
for frame, attr in to_clean_attr:
    frame[attr] = split_attribute(frame, attr)
def get_crew_freq(crew_type="directors"):
    # As we want a list of top directors and writers,
    # we only consider the top 250 movies returned by top_movies()
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    freq = dict()
    for imdb_id in top_imdb_ids:
        # .all() collapses the single matching row to its cell value (a list of crew ids)
        info = crew_df[crew_df["imdb_id"] == imdb_id][crew_type].all()
        for crew_id in info:
            if crew_type == "directors":
                name = director_df[director_df["crew_id"] == crew_id]["name"].all()
            else:
                name = writer_df[writer_df["crew_id"] == crew_id]["name"].all()
            freq[name] = freq.get(name, 0) + 1
    return freq
LIMIT = 20
director_freq = get_crew_freq(crew_type="directors")
director_freq = {k: v for k,v in sorted(director_freq.items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(director_freq.keys(), director_freq.values())
plt.xticks(rotation='vertical')
plt.show()
Looks like Anurag Kashyap is critically the most popular director (these views correlate strongly with those of the IMDB audience). A similar histogram can be plotted for writers and actors (a sketch for actors follows the writers plot below). Here is the list of top writers:
writer_freq = {k: v for k,v in sorted(get_crew_freq(crew_type="writers").items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(writer_freq.keys(), writer_freq.values())
plt.xticks(rotation='vertical')
plt.show()
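The actors histogram mentioned above is not shown here, but a minimal sketch (new helper code, reusing the actors column of the main dataframe and the same top-250 slice) could look like this:

# Sketch: actor frequency across the top 250 movies, analogous to get_crew_freq()
actor_freq = dict()
for actors in top_movies(df, limit=250)["actors"]:
    for actor in actors:
        if actor:  # skip empty strings
            actor_freq[actor] = actor_freq.get(actor, 0) + 1
actor_freq = {k: v for k, v in sorted(actor_freq.items(), key=lambda x: x[1], reverse=True)[:LIMIT]}
plt.bar(actor_freq.keys(), actor_freq.values())
plt.xticks(rotation="vertical")
plt.show()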
It is surprising that Anurag Kashyap tops the writers list as well! Let's try something else: we will show the correlation between top directors and top writers, the perfect recipe for success! That is, we are finding which top directors and writers have collaborated with each other.
top_directors = list(director_freq.keys())
top_writers = list(writer_freq.keys())
crew_corr = np.zeros(shape=(len(top_directors), len(top_writers)))
def fill_crew_corr(crew_corr, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        for director in directors:
            for writer in writers:
                if writer in top_writers and director in top_directors:
                    crew_corr[top_writers.index(writer)][top_directors.index(director)] += 1
    return crew_corr
crew_corr = fill_crew_corr(crew_corr, top_directors, top_writers)
cmap = sns.light_palette("Blue", as_cmap=True)
sns.heatmap(crew_corr, xticklabels=top_directors, yticklabels=top_writers, linewidth=0.5, cmap=cmap)
The heatmap is mostly empty, but as we can observe, some popular directors and writers collaborate frequently! Let's take this analysis further by creating a 4D model comparing the relationships among top actors, writers and directors. To create the 4D model, we will need the frequency of each (director, writer, actor) combination across the top movies.
relation_dwa = dict() # This dictionary will hold the frequency count of {(director, writer, actor): freq, ..}
def find_dwa_reln(relation_dwa, top_directors, top_writers):
    top_imdb_ids = top_movies(df, limit=250)["imdb_id"].tolist()
    for imdb_id in top_imdb_ids:
        d_ids = crew_df[crew_df["imdb_id"] == imdb_id]["directors"].all()
        w_ids = crew_df[crew_df["imdb_id"] == imdb_id]["writers"].all()
        all_actors = df[df["imdb_id"] == imdb_id]["actors"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        for director in directors:
            for writer in writers:
                for actor in [actor for actor in all_actors if actor != ""]:
                    relation_dwa[(director, writer, actor)] = relation_dwa.get((director, writer, actor), 0) + 1
    return relation_dwa
relation_dwa = {k: v for k,v in sorted(find_dwa_reln(relation_dwa, top_directors, top_writers).items(), key=lambda x: x[1], reverse=True)[:40]}
Xuniques, X = np.unique([rel[0] for rel in relation_dwa], return_inverse=True)
Yuniques, Y = np.unique([rel[1] for rel in relation_dwa], return_inverse=True)
Zuniques, Z = np.unique([rel[2] for rel in relation_dwa], return_inverse=True)
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(111, projection='3d')
img = ax.scatter(X, Y, Z, c=[relation_dwa[rel] for rel in relation_dwa], cmap=cmap)
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
yticks=range(len(Yuniques)), yticklabels=Yuniques,
zticks=range(len(Zuniques)), zticklabels=Zuniques)
ax.set_xlabel("directors"); ax.set_ylabel("writers"); ax.set_zlabel("actors");
ax.zaxis.labelpad = ax.xaxis.labelpad = ax.yaxis.labelpad = 50
plt.rcParams["axes.labelweight"] = "bold"
fig.colorbar(img)
plt.show()
Just like the heatmap, this model appears sparse! In the model above, the color scheme (c) acts as the fourth dimension, showing frequency.
We will now perform content-based filtering using actors, director, writer, title, genres, story, summary and tagline (if any). The strategy we will follow is a trivial one: we compute cosine similarity over the attributes mentioned above, after removing stopwords and doing other minor cleaning. Getting director and writer info is repeated frequently, so we will write a method for it right after the short illustration below.
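First, a minimal self-contained illustration of the vectorize-then-compare mechanic (the three example strings are made up):

# Toy example: bag-of-words vectors compared with cosine similarity
docs = ["cricket biopic sports drama", "cricket sports film", "romantic comedy wedding"]
toy_mat = CountVectorizer().fit_transform(docs)
print(cosine_similarity(toy_mat, toy_mat)[0])  # doc 0 is similar to doc 1, not to doc 2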
def get_director_writer(df, person="director"):
    if person == "director":
        d_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["directors"].all()
        directors = [director_df[director_df["crew_id"] == d_id]["name"].all() for d_id in d_ids]
        return directors
    elif person == "writer":
        w_ids = crew_df[crew_df["imdb_id"] == df["imdb_id"]]["writers"].all()
        writers = [writer_df[writer_df["crew_id"] == w_id]["name"].all() for w_id in w_ids]
        return writers
def content_based(title, df, limit, offset):
    df["desc"] = (df["original_title"].fillna("").apply(lambda x: x.lower())
                  + df["story"].fillna("").apply(lambda x: x.lower())
                  + df["summary"].fillna("").apply(lambda x: x.lower())
                  + df["tagline"].fillna("").apply(lambda x: x.lower())
                  + df["genres"].apply(lambda x: (" ".join(x)).lower()))
    df["crew"] = df["actors"].apply(lambda actors: (" ".join([actor.replace(" ", "") for actor in actors[:5]])).lower())
    # Let's try guessing the title index!
    try:
        index = df.index[df["original_title"] == title][0]
    except IndexError:
        try:
            title = get_close_matches(title, df["original_title"].tolist())[0]
            index = df.index[df["original_title"] == title][0]
        except IndexError:
            return None
    df["directors"] = df.apply(get_director_writer, args=("director",), axis=1)
    df["writers"] = df.apply(get_director_writer, args=("writer",), axis=1)
    # Repeating names DIRECTOR_WT / WRITER_WT times weights them in the bag-of-words;
    # the bool check skips entries where .all() found no matching name
    df["crew"] += " " + df["directors"].apply(lambda direcs: (" ".join([direc.replace(" ", "") for direc in DIRECTOR_WT * direcs if not isinstance(direc, bool)])).lower())
    df["crew"] += " " + df["writers"].apply(lambda writers: (" ".join([writer.replace(" ", "") for writer in WRITER_WT * writers if not isinstance(writer, bool)])).lower())
    # Bag-of-words counts (unigrams + bigrams) over description + crew text
    count_vec = CountVectorizer(analyzer="word", stop_words=stopwords.words("english"), ngram_range=(1, 2))
    count_mat = count_vec.fit_transform(df["desc"] + df["crew"])
    cosine_sim = cosine_similarity(count_mat, count_mat)
    rec_movie = cosine_sim[index]
    ids = rec_movie.argsort()[::-1][1: limit]
    return top_movies(df.iloc[ids], percentile=0.2)[["original_title", "genres", "runtime", "imdb_rating", "imdb_votes", "year_of_release", "wins_nominations", "release_date"]]
# You can increase the significance of directors and writers by raising their weights
DIRECTOR_WT = 1
WRITER_WT = 1
content_df = content_based("MS Dhoni An untold story", df, 10, 0)
content_df["genres"] = content_df["genres"].apply(lambda x: ", ".join(x))
content_df
As you can see, our approach is able to recommend cricket-based films for MS Dhoni: An Untold Story. We pass our results to the top_movies method we wrote earlier with a 0.2 percentile to remove the worst-reviewed films. I have also added difflib's get_close_matches method to find the movie closest to the title entered by the user. This is helpful in situations where one doesn't know the exact movie title!
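For instance, the rough query used above already goes through this matching step inside content_based; you can also call the matcher directly (the exact stored title string may differ from what you type):

# Resolve a rough, user-typed title to the closest title in the dataset
get_close_matches("MS Dhoni An untold story", df["original_title"].tolist())[0]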
Again, this is a very basic approach and many sophisticated approaches exist! You can also use tags.csv, genome_tags.csv and genome_scores.csv. From TIMDB's documentation:
The genome-scores were available for very few movies (64 in total) from "Full" MovieLens dataset.
Based on my observation, tags.csv has tags for 504 movies! Therefore, if the movie is not very popular, your best bet would be to generate tags from the summary, story and tagline, as sketched below.
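One way to generate such tags (my own sketch, using scikit-learn's TfidfVectorizer, which is not part of this dataset's tooling) is to keep each movie's highest-weighted terms as pseudo-tags:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: derive pseudo-tags from summary + story + tagline via TF-IDF weights
texts = df["summary"].fillna("") + " " + df["story"].fillna("") + " " + df["tagline"].fillna("")
tag_vec = TfidfVectorizer(stop_words=stopwords.words("english"), max_features=20000)
tag_mat = tag_vec.fit_transform(texts)
vocab = np.array(tag_vec.get_feature_names_out())  # requires scikit-learn >= 1.0

def pseudo_tags(index, n=5):
    # The n highest-weighted terms for one movie act as its tags
    row = tag_mat[index].toarray().ravel()
    return vocab[row.argsort()[::-1][:n]].tolist()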
Let us conclude this exploration by performing basic collaborative filtering using ./collaborative/ratings.csv
def collaborative(m_id):
    df_coll = pd.read_csv("./collaborative/ratings.csv")
    df_titles = pd.read_csv("./collaborative/titles.csv")
    reader = Reader(rating_scale=(1, 5))
    sim_options = {"name": "pearson_baseline", "user_based": False}
    data = Dataset.load_from_df(df_coll[["user_id", "movie_id", "rating"]], reader)
    trainset = data.build_full_trainset()
    model = KNNBaseline(sim_options=sim_options)
    model.fit(trainset)
    inn_id = model.trainset.to_inner_iid(m_id)
    inn_id_neigh = model.get_neighbors(inn_id, k=10)
    titles = list()
    for movielens_id in inn_id_neigh:
        titles += [df_titles[df_titles["movie_id"] == model.trainset.to_raw_iid(movielens_id)]["title"].all()]
    return titles
# Here 167392 is the movie_id for the movie: Dangal (2016)
collaborative(167392)
From the above you can see that we are being recommended movies featuring Aamir Khan; sports films and films with a theme of struggle are also popular. You can use ./collaborative/links.csv to convert a movie_id to its equivalent imdb_id and get the statistics for the same from ./1950-2019/bollywood_full.csv. From the documentation:
The leading zeros are removed for imdb_id, which are not removed for the rest of the database(i.e for "1950-1989", "1990-2009", "2010-2019" and "1950-2019").
Example: in links.csv if imdb_id is 123456, it can be tt0123456 in imdb_id col in the datasets in "1950-1989", "1990-2009" and "2010-2019".
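A minimal sketch of that conversion (I am assuming links.csv carries movie_id and imdb_id columns; the zero-padding to seven digits follows the note above):

# Sketch: MovieLens movie_id -> main dataset imdb_id, restoring leading zeros
links_df = pd.read_csv("./collaborative/links.csv")

def to_imdb_id(movie_id):
    raw = links_df.loc[links_df["movie_id"] == movie_id, "imdb_id"].iloc[0]
    return "tt" + str(raw).zfill(7)  # e.g. 123456 -> 'tt0123456'

df[df["imdb_id"] == to_imdb_id(167392)][["original_title", "imdb_rating", "imdb_votes"]]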
So this is it! Hopefully I have given you a good overview of what this dataset looks like and the basic operations you can perform with it. Try caching the models and dataframes to make recommendations quick (a sketch of this idea follows below). You can also try a hybrid approach that combines the results of the content-based and collaborative filtering methods. If you have any questions, you can contact me.
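On the caching point, a minimal sketch: fit the Surprise model once and reuse it across calls (functools.lru_cache is my choice here; any memoization works).

from functools import lru_cache

@lru_cache(maxsize=1)
def get_collab_model():
    # Fit the KNN model a single time; later calls return the cached object
    df_coll = pd.read_csv("./collaborative/ratings.csv")
    data = Dataset.load_from_df(df_coll[["user_id", "movie_id", "rating"]], Reader(rating_scale=(1, 5)))
    model = KNNBaseline(sim_options={"name": "pearson_baseline", "user_based": False})
    model.fit(data.build_full_trainset())
    return model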
Keep exploring! 😄