For this project it was decided to work with IMDB data. IMDB provides a subset of its data to anyone interested in analyzing it. The data provided by IMDB is split into multiple files, and not all of them are used for this project. The files used are: name.basics.tsv.gz, title.akas.tsv.gz, title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz and title.ratings.tsv.gz. The file "title.principals" is the main data file, which originally consists of 1.3 GB of data with 30,674,812 rows and 6 columns. These data files contain information about movies, the actors in those movies, when the movies were made, what the different people's roles in the movies were, the IDs of movies and actors, the type of title (such as TV show or movie) and the ratings of the movies.
Besides these data files, some files containing movie reviews were also used. These reviews were downloaded from the website Kaggle.com and consist of 100,000 reviews of 14,127 movies. The reviews from Kaggle are divided into two folders, each containing 50,000 reviews. Sentiment analysis has already been conducted on some of the reviews; however, this was ignored for this project. Besides the movie reviews, the data also contains URLs describing which movie each review belongs to; this is what links the movies to the reviews.
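As an illustration of that link, each review URL embeds the IMDB title ID, which can be extracted with a regular expression (the URL below is a made-up example; the pattern is the one used later in the cleaning code):
import re
url = u"http://www.imdb.com/title/tt0111161/usercomments"  # hypothetical example URL
movieId = re.findall(r'http://www.imdb.com/title/(\w*)/usercomments', url)[0]
print movieId  # prints tt0111161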
The datasets chosen for this project provide a way to link actors to each other through the movies they appear in. They also provide reviews for the movies, so that sentiment analysis can be performed on the reviews and the resulting sentiment scores of the movies can be linked back to the actors who appeared in them.
Some of the data files are also used when cleaning the data and making it more suitable for the project. This was especially important since the data set was initially very large.
All of these data files therefore provide everything needed for this project, which is why they were chosen.
The goal of this project is to find communities of actors/actresses who are enjoyable together. The project is therefore not about finding good movies, but about finding out which actors/actresses make good movies when working together.
It is therefore possible for an actor to have bad reviews in general, but still be enjoyable to watch when paired up with certain actors/actresses.
We have two types of data: the IMDB data files and the Kaggle movie reviews.
The databases contain a lot of irrelevant information, such as games and movies with no reviews in the review data set. We therefore first have to clean our databases in order to keep only the relevant information.
In this section we also comment on the different methods and tools used for data cleaning.
startFromCleanData = True  # start from the cleaned files instead of re-importing the raw data
fastExecution = False      # use the stored graph, positions and data frames instead of rebuilding them
savingFigures = True       # whether or not to save the figures produced
# Import libraries
from __future__ import division  # __future__ imports must come before any other code
import io
import math
import pickle
import re
from collections import Counter
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.image as mpimg
import fa2  # ForceAtlas2 graph layout
import community  # python-louvain community detection
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from wordcloud import WordCloud
from scipy.special import zeta
# Rendering parameters
title_font = {'family': 'sans-serif',
              'color': '#000000',
              'weight': 'normal',
              'size': 16,
              }
# Colors
mBlue = "#55638A"  # for actors
fRed = "#9E1030"   # for actresses
# PICKLE
def save_obj(obj, name):
    with open('obj/' + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    with open('obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)
We decided to use pickle as our storage format, because it is native to pandas and because pickle files are more compact than text files and allow for much faster reading.
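For illustration, the helpers are used like this (the object and file name are just examples):
save_obj(movieDict, "movieDict")   # serialises the object to obj/movieDict.pkl
movieDict = load_obj("movieDict")  # reads it back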
###################################
# Initialise a movie dictionary
###################################
# Function to convert a movie or actor ID to its string key
def idToString(iD, base):  # base = "tt" for movies, "nm" for actors
    # IMDB keys are the base prefix followed by the ID zero-padded to 7 digits
    return base + str(iD).zfill(7)
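A quick sanity check of the key format produced by this function:
print idToString(42, "tt")       # prints tt0000042
print idToString(1234567, "nm")  # prints nm1234567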
# Create movie dictionary
movieDict = {}
lastMovie = 9999999  # last movie ID
if not fastExecution:
    for i in range(lastMovie):
        movieDict[idToString(i+1, "tt")] = False
    print "Movie dictionary initialised"
else:
    print "Fast execution mode, movie dictionary will be initialised later"
We decided to use dictionaries when checking which movies and actors to keep in our data. Dictionaries provide an easy way to store each movie ID and actor ID as a key, with the value of that key being either True or False depending on whether it should be kept. Another reason for using dictionaries is that key lookups are fast (constant time on average), which matters when the code runs over millions of rows.
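A small illustration of the speed difference between membership tests on a dictionary and on a list (the size of one million keys is arbitrary):
import timeit
setup = "d = dict.fromkeys(range(1000000), False); l = list(range(1000000))"
print timeit.timeit("999999 in d", setup=setup, number=1000)  # dict lookup: constant time
print timeit.timeit("999999 in l", setup=setup, number=1000)  # list lookup: linear scan, much slower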
###################################
# Get the movies to keep
###################################
# List of the review documents
listReviewsDocuments = ["train/urls_neg.txt", "test/urls_neg.txt", "train/urls_pos.txt", "test/urls_pos.txt", "train/urls_unsup.txt"]
# Fill in the dictionary
for document in listReviewsDocuments:
    files = io.open("aclImdb/" + document, mode="r", encoding="utf-8")
    for row in files:
        w = re.findall(r'http://www.imdb.com/title/(\w*)/usercomments', row)
        if w:  # guard against lines that do not contain a review URL
            movieDict[w[0]] = True
    files.close()
Throughout this project all text is encoded and decoded as Unicode. The reason for this is that the data used for the project is already encoded in Unicode, so it is the obvious choice to keep the same format when handling text throughout the project.
###################################
# Create an actor dictionary
###################################
actorDict = {}
lastActor = 29999999  # last actor ID
for i in range(lastActor):
    actorDict[idToString(i+1, "nm")] = False
print "Actor dictionary initialised"
After this setup the data is ready to be cleaned. The data was cleaned by only saving what is relevant for this project. First, only titles which have reviews and which are actually movies, rather than games, TV shows etc., were kept.
###################################
# Key to movie name file
###################################
if not startFromCleanData:
    path = "DATA/title.basics.txt"
    cleanPath = "DATA/title.basics.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    b = False  # skip the first line
    count = 0
    for row in files:
        if b:
            split = row.split("\t")
            key = split[0]
            if movieDict[key]:
                if split[1] in ['movie', 'tvMovie']:
                    cleanfile.write(row)
                    count += 1
                else:
                    movieDict[key] = False
        else:
            b = True
    files.close()
    cleanfile.close()
    print "There are " + str(count) + " movies considered"
    print "DATA/title.basics.txt cleaned"
After this step, only actors and actresses appearing in the remaining movies should be saved; everyone not in those movies, or with a role other than actor/actress, was therefore removed.
##########################################################
# Film actors links file: clean + get actor dictionary
##########################################################
if not startFromCleanData:
    path = "DATA/title.principals.txt"
    cleanPath = "DATA/title.principals.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    roleCheckList = ["actor", "actress", "self"]  # roles that count as acting
    nLinks = 0
    i = False  # skip first line
    for row in files:
        if i:
            split = row.split("\t")
            key = split[0]
            if movieDict[key]:
                if split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList:
                    cleanfile.write(row)
                    actorDict[split[2]] = True
                    nLinks += 1
        else:
            i = True
    files.close()
    cleanfile.close()
    # Remove erroneous entries found by hand
    actorDict['nm0547707'] = False
    actorDict['nm0809728'] = False
    actorDict['nm2442859'] = False
    actorDict['nm1996613'] = False
    actorDict['nm0600636'] = False
    actorDict['nm1824417'] = False
    actorDict['nm2440192'] = False
    actorDict['nm1754167'] = False
    print "There are " + str(nLinks - 8) + " actors considered"  # 8 erroneous actors removed above
    print "DATA/title.principals.txt cleaned"
###################################
# Key to actor name file
###################################
if not startFromCleanData:
    path = "DATA/name.basics.txt"
    cleanPath = "DATA/name.basics.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    count = 0
    i = False  # skip first line
    for row in files:
        if i:
            split = row.split("\t")
            key = split[0]
            if actorDict[key]:
                cleanfile.write(row)
        else:
            i = True
    files.close()
    cleanfile.close()
    print "DATA/name.basics.txt cleaned"
Once everything not relevant to the project has been removed and only the relevant movies and actors/actresses remain, the data is initialized in order to gather the relevant information about it, such as movie years.
############################################
# Preprocess movie dict and get movie years
############################################
movieAgeDict = {}
path = "DATA/title.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
    split = row.split("\t")
    key = split[0]
    if movieDict[key]:
        if (split[1] in ['movie', 'tvMovie']) and not (split[5] == "\\N"):
            movieAgeDict[key] = int(split[5])
            count += 1
files.close()
# Reset the movie dict and keep only movies with a known year
for i in range(lastMovie):
    movieDict[idToString(i+1, "tt")] = False
for key in movieAgeDict.keys():
    movieDict[key] = True
print "There are " + str(count) + " movies considered"
print "Movie dictionary preprocessed and movie age dictionary built"
##########################################################
# Film actors links file: rebuild the actor dictionary
##########################################################
path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
roleCheckList = ["actor", "actress", "self"]  # roles that count as acting
nLinks = 0
for row in files:
    split = row.split("\t")
    key = split[0]
    if movieDict[key]:
        if split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList:
            actorDict[split[2]] = True
            nLinks += 1
files.close()
# Remove erroneous entries found by hand
actorDict['nm0547707'] = False
actorDict['nm0809728'] = False
actorDict['nm2442859'] = False
actorDict['nm1996613'] = False
actorDict['nm0600636'] = False
actorDict['nm1824417'] = False
actorDict['nm2440192'] = False
actorDict['nm1754167'] = False
print "There are " + str(nLinks - 8) + " actors considered"  # 8 erroneous actors removed above
print "Actor dictionary preprocessed"
###################################
# Create a ratings dict
###################################
ratingDict = {}
path = "DATA/ratings.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
i = False  # skip first line
for row in files:
    if i:
        key = row[:9]  # movie IDs are "tt" followed by 7 digits
        if movieDict[key]:
            split = row.split("\t")
            ratingDict[key] = float(split[1])
    else:
        i = True
files.close()
###################################
# Create a movie name dict
###################################
movieNameDict = {}
moviesList = []
path = "DATA/title.akas.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
    split = row.split("\t")
    # keep the original title of rated movies only, once per movie
    if movieDict[split[0]] and not (split[0] in movieNameDict) and (split[0] in ratingDict) and "original" in row:
        movieNameDict[split[0]] = split[2]
        moviesList.append(split[0])
files.close()
###################################
# Create an actor name dict
###################################
actorNameDict = {}
actorGenderDict = {}
actorsList = []
path = "DATA/name.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
    split = row.split("\t")
    if actorDict[split[0]] and not (split[0] in actorNameDict):
        actorNameDict[split[0]] = split[1]
        # the professions column lists "actor" for men and "actress" for women
        if "actor" in split[4]:
            actorGenderDict[split[0]] = "M"
        else:
            actorGenderDict[split[0]] = "F"
        actorsList.append(split[0])
files.close()
###################################
# Build a movie data frame
###################################
if not fastExecution:
    moviesData = {"iD": movieNameDict.keys(), "Title": pd.Series(np.zeros(len(moviesList))), "Rating": pd.Series(np.zeros(len(moviesList))), "Year": pd.Series(np.zeros(len(moviesList)))}
    moviesDF = pd.DataFrame(moviesData)
    for i in moviesDF.index:
        iD = moviesDF.loc[i].at["iD"]
        moviesDF.loc[i, "Title"] = movieNameDict[iD]
        moviesDF.loc[i, "Rating"] = ratingDict[iD]
        moviesDF.loc[i, "Year"] = movieAgeDict[iD]
    if savingFigures:
        moviesDF.to_pickle("obj/moviesDF.pkl")
else:
    moviesDF = pd.read_pickle("obj/moviesDF.pkl")
moviesDF.sort_values("Rating", ascending=False).head(10)
When the data has been cleaned, the remaining data for each movie is its rating, title, release year and movie ID.
This is everything needed to link the movies to the actors and the different reviews, as well as to categorize them by year and analyze ratings.
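As an example of what this data frame allows, the highest-rated movies of a given year can be looked up directly (the year below is arbitrary):
moviesDF[moviesDF["Year"] == 1994].sort_values("Rating", ascending=False).head()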
###################################
# Build an actor data frame
###################################
if not fastExecution:
    actorsData = {"iD": actorNameDict.keys(), "Name": pd.Series(np.zeros(len(actorsList))), "Gender": pd.Series(np.zeros(len(actorsList)))}
    actorsDF = pd.DataFrame(actorsData)
    for i in actorsDF.index:
        iD = actorsDF.loc[i].at["iD"]
        actorsDF.loc[i, "Name"] = actorNameDict[iD]
        actorsDF.loc[i, "Gender"] = actorGenderDict[iD]
    if savingFigures:
        actorsDF.to_pickle("obj/actorsDF.pkl")
else:
    actorsDF = pd.read_pickle("obj/actorsDF.pkl")
actorsDF.head(10)
For actors and actresses the only relevant information is their gender, name and IMDB ID, which is used when linking them to the movies.
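For instance, the gender balance among the remaining actors and actresses can be read directly from the data frame:
actorsDF["Gender"].value_counts()  # counts of "M" (actors) and "F" (actresses)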
###################################
# Create a links list
###################################
path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
links = np.empty((nLinks, 2), dtype=object)
count = 0
for row in files:
    split = row.split("\t")
    if actorDict[split[2]]:
        links[count, 0] = split[0]
        links[count, 1] = split[2]
        count += 1
files.close()
###################################
# Create an actor links list
###################################
actorsLinks = []
files = io.open("obj/actorsLinksList.txt", mode="w", encoding="utf-8")
# links is sorted by movie ID, so all pairs of actors in the same movie
# appear as consecutive rows sharing the same first column
for i in range(count-1):
    j = i+1
    while (j < count) and (links[i, 0] == links[j, 0]):
        actorsLinks.append([links[i, 1], links[j, 1], links[i, 0]])  # [actor1, actor2, movie]
        files.write(str(links[i, 1]) + "\t" + str(links[j, 1]) + "\t" + links[i, 0] + "\r\n")
        j += 1
files.close()
def cleanLoadData():
    # build the data frames
    mDF = pd.read_pickle("obj/moviesDF.pkl")
    aDF = pd.read_pickle("obj/actorsDF.pkl")
    aLL = []
    files = io.open("obj/actorsLinksList.txt", mode="r", encoding="utf-8")
    for row in files:
        split = row.split("\t")
        aLL.append(split)
    files.close()
    # rebuild the dictionaries
    movieAgeDict = {}
    ratingDict = {}
    actorName = {}
    movieName = {}
    # movies
    for i in mDF.index:
        iD = mDF.loc[i].at["iD"]
        rating = mDF.loc[i].at["Rating"]
        title = mDF.loc[i].at["Title"]
        year = mDF.loc[i].at["Year"]
        movieAgeDict[iD] = year
        ratingDict[iD] = rating
        movieName[iD] = title
    # actors
    for i in aDF.index:
        iD = aDF.loc[i].at["iD"]
        name = aDF.loc[i].at["Name"]
        actorName[iD] = name
    return movieAgeDict, ratingDict, actorName, movieName, mDF, aDF, aLL
Once the data has been cleaned and saved to files, all that is left to do is load the data and use it in the rest of the project.
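A minimal sketch of loading the cleaned data and turning the links into an actor network; the graph construction here is only an illustration of how the pieces fit together, the actual graph is built later in the project:
movieAgeDict, ratingDict, actorName, movieName, moviesDF, actorsDF, actorsLinks = cleanLoadData()
G = nx.Graph()
for link in actorsLinks:
    actor1, actor2, movie = link[0], link[1], link[2].strip()  # strip the trailing newline
    G.add_edge(actor1, actor2, movie=movie)
print "Graph with " + str(G.number_of_nodes()) + " actors and " + str(G.number_of_edges()) + " links"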
As mentioned in the "What is our data set" chapter, the original data consists of over 30 million rows and 1.3 GB of data. The cleaned data ends up being around 44,000 rows with a size of 2.1 MB, which is approximately 0.15% of the original data.