Data Stats

IMDB provides a subset of its data to anyone interested in analyzing it. The data provided by IMDB are split into multiple files, which can be found here: IMDB Data Set

Only some of the files provided by IMDB were used during this project:

Data set: title.principals.tsv.gz
Usage: This file contains the links between the actors and the movies. It is therefore required in order to establish which actors appeared in which movies and, from that, to create the network of actors. It is the main data file used for the project.

Data set: name.basics.tsv.gz
Usage: This file is used to look up the actors' names from their ID (nconst in the file).

Data set: title.akas.tsv.gz
Usage: This file is used to look up the movie names from their titleId.

Data set: title.basics.tsv.gz
Usage: This file is used to remove all entries in our dataset that are not movies, such as games, TV shows, etc.

Data set: title.ratings.tsv.gz
Usage: This file contains the ratings of the movies and was used to get the rating of each movie in the dataset.

Initially, the title.principals.tsv.gz file is around 1.3 GB with 30,674,812 rows and 6 columns. Each row is a link between a person and a production, together with the person's role in that production.

The first 5 rows in the raw data file:

   tconst     ordering  nconst     category         job                      characters
1  tt0000001  1         nm1588970  self             \N                       ["Herself"]
2  tt0000001  2         nm0005690  director         \N                       \N
3  tt0000001  3         nm0374658  cinematographer  director of photography  \N
4  tt0000002  1         nm0721526  director         \N                       \N
5  tt0000002  2         nm1335271  composer         \N                       \N
Here, tconst is the movie ID and nconst is the actor ID.
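
As a minimal sketch of how rows like these could be loaded, the snippet below parses tab-separated text with Python's csv module and treats the literal \N marker as a missing value. The function name read_principals and the in-memory sample are assumptions made for illustration; the project's actual loading code is not shown in the source.

```python
import csv
import io

# Tiny in-memory stand-in for title.principals.tsv.gz (rows copied from the sample above).
SAMPLE = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt0000001\t1\tnm1588970\tself\t\\N\t[\"Herself\"]\n"
    "tt0000001\t2\tnm0005690\tdirector\t\\N\t\\N\n"
    "tt0000002\t1\tnm0721526\tdirector\t\\N\t\\N\n"
)


def read_principals(handle):
    """Parse tab-separated rows into dicts, mapping the literal \\N marker to None."""
    # QUOTE_NONE keeps the double quotes inside the characters column untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {key: (None if value == r"\N" else value) for key, value in row.items()}
        for row in reader
    ]


rows = read_principals(io.StringIO(SAMPLE))
```

For the real file, the same function could be given a handle from gzip.open(path, "rt", encoding="utf-8", newline=""), although reading all rows into a list at once may be too memory-hungry for a file of this size.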

For this project, it is only relevant to look at actors in movies. Therefore, several steps were taken to remove all data that did not involve an actor or a movie. In the title.principals.tsv.gz file, the column "category" describes the person's role in the production. Every row whose category was not actor, actress or self was removed.
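
The category filter described above can be sketched as follows. The names KEPT_CATEGORIES and keep_acting_rows are hypothetical, and the rows are plain dicts standing in for whatever structure the project actually used.

```python
# Categories kept by the project, as stated in the text above.
KEPT_CATEGORIES = {"actor", "actress", "self"}


def keep_acting_rows(rows):
    """Return only the rows whose category is actor, actress or self."""
    return [row for row in rows if row["category"] in KEPT_CATEGORIES]


# Rows taken from the raw-data sample shown earlier.
sample_rows = [
    {"tconst": "tt0000001", "nconst": "nm1588970", "category": "self"},
    {"tconst": "tt0000001", "nconst": "nm0005690", "category": "director"},
    {"tconst": "tt0000001", "nconst": "nm0374658", "category": "cinematographer"},
    {"tconst": "tt0000002", "nconst": "nm1335271", "category": "composer"},
]

acting_rows = keep_acting_rows(sample_rows)
```

Of the four sample rows, only the one with category self passes the filter.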

In order to remove productions that are not movies, the title.basics file was used. Its second column, titleType, describes what kind of production the given ID refers to, for example documentary, short, videoGame, movie or tvMovie.
Every entry whose titleType in the title.basics file is neither movie nor tvMovie was removed from the data, which reduced the size further.
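
One way to express this step is to first collect the IDs of all titles with an accepted titleType and then keep only the actor links that point at those IDs. This is a sketch under assumptions: the function names are hypothetical, and the IDs and title types in the sample are made up for illustration.

```python
# Title types treated as movies, as stated in the text above.
MOVIE_TYPES = {"movie", "tvMovie"}


def movie_ids(title_basics_rows):
    """Collect the tconst of every title whose titleType is movie or tvMovie."""
    return {
        row["tconst"]
        for row in title_basics_rows
        if row["titleType"] in MOVIE_TYPES
    }


def keep_movie_links(principal_rows, allowed_ids):
    """Drop every actor link whose tconst is not an accepted movie ID."""
    return [row for row in principal_rows if row["tconst"] in allowed_ids]


# Made-up rows, for illustration only.
basics = [
    {"tconst": "tt9000001", "titleType": "movie"},
    {"tconst": "tt9000002", "titleType": "videoGame"},
    {"tconst": "tt9000003", "titleType": "tvMovie"},
]
links = [
    {"tconst": "tt9000001", "nconst": "nm9000001"},
    {"tconst": "tt9000002", "nconst": "nm9000001"},
    {"tconst": "tt9000003", "nconst": "nm9000002"},
]

allowed = movie_ids(basics)
movie_links = keep_movie_links(links, allowed)
```

Building the set of accepted IDs once keeps the per-row membership check cheap, which matters when the links file has tens of millions of rows.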

User reviews

The review data set used for this project was found on Kaggle.com. It contains 100,000 reviews of 14,127 movies.
The data can be found here: Review data set
The reviews from Kaggle are split into two folders, one called test and one called train. Sentiment analysis has already been conducted on most of the reviews, which are sorted into folders accordingly. For this project, however, the existing sentiment labels were ignored, and sentiment analysis was performed anew on all 100,000 reviews, stripped of their original labelling. Each review in the downloaded data has its own file. Besides these review files, the data also contains relational files with the URLs for each review. These URLs were used to link each review to the corresponding movie.
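
The linking step could look roughly like the sketch below, which pulls a title ID out of each review URL and groups the reviews by movie. It assumes that each URL contains a segment of the form /title/tt followed by digits; the exact URL layout, the file names and the function names are assumptions and are not taken from the source.

```python
import re
from collections import defaultdict

# Assumed URL shape: the pattern only relies on a "/title/tt<digits>" segment.
TITLE_ID_PATTERN = re.compile(r"/title/(tt\d+)")


def title_id_from_url(url):
    """Return the IMDB title ID embedded in a review URL, or None if absent."""
    match = TITLE_ID_PATTERN.search(url)
    return match.group(1) if match else None


def group_reviews_by_movie(review_urls):
    """Map each title ID to the list of review file names that point at it."""
    grouped = defaultdict(list)
    for review_file, url in review_urls.items():
        title_id = title_id_from_url(url)
        if title_id is not None:
            grouped[title_id].append(review_file)
    return dict(grouped)


# Made-up review file names and URLs, for illustration only.
example_urls = {
    "review_0.txt": "http://www.imdb.com/title/tt9000001/usercomments",
    "review_1.txt": "http://www.imdb.com/title/tt9000001/usercomments",
    "review_2.txt": "http://www.imdb.com/title/tt9000002/usercomments",
}

reviews_by_movie = group_reviews_by_movie(example_urls)
```

Because the extracted ID has the same form as tconst, the resulting mapping can be joined directly against the IMDB files.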

Since this project is about predicting good movies based on their reviews, any movie in our dataset that did not have at least one review was removed.
To finish cleaning up the data, all links between actors and movies in which either the actor or the movie was not in the dataset were removed.
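
These two final steps can be sketched as set operations on IDs. The function names, the review-count dictionary and all IDs below are hypothetical and only illustrate the logic described above.

```python
def movies_with_reviews(movies, review_counts):
    """Keep only the movies that have at least one review."""
    return {movie for movie in movies if review_counts.get(movie, 0) >= 1}


def prune_links(links, movies, actors):
    """Keep only (actor, movie) links whose actor and movie are both in the dataset."""
    return [
        (actor, movie)
        for actor, movie in links
        if actor in actors and movie in movies
    ]


# Made-up IDs and counts, for illustration only.
all_movies = {"tt9000001", "tt9000002", "tt9000003"}
review_counts = {"tt9000001": 4, "tt9000003": 1}
known_actors = {"nm9000001", "nm9000002"}
all_links = [
    ("nm9000001", "tt9000001"),
    ("nm9000001", "tt9000002"),  # movie has no reviews
    ("nm9000003", "tt9000003"),  # actor is not in the dataset
    ("nm9000002", "tt9000003"),
]

kept_movies = movies_with_reviews(all_movies, review_counts)
kept_links = prune_links(all_links, kept_movies, known_actors)
```

In this example, two of the four links survive: the one to the movie without reviews and the one from the unknown actor are dropped.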

The final dataset consists of 44,000 rows and has a size of 2.1 MB.