Motivation

What is our data set?

For this project we decided to work with IMDB data. IMDB provides a subset of its data to anyone interested in analyzing it. The data provided by IMDB is split into multiple files, and not all of these files are used for this project. The files used for this project are:

  • name.basics.tsv.gz
  • title.akas.tsv.gz
  • title.basics.tsv.gz
  • title.crew.tsv.gz
  • title.episode.tsv.gz
  • title.principals.tsv.gz
  • title.ratings.tsv.gz

The file "title.principals" is the main data file, which originally consists of 1.3 GB of data with 30,674,812 rows and 6 columns. Together these files contain information about movies, the actors in those movies, when the movies were made, what role each person had in a movie, the IDs of movies and actors, the type of title (such as TV show or movie) and the ratings of the movies.

Besides these data files, some files containing movie reviews were also used. These reviews were downloaded from the website Kaggle.com and consist of 100,000 reviews of 14,127 movies. The reviews from Kaggle are divided into two folders, each containing 50,000 reviews. Sentiment analysis had already been conducted on some of the reviews, but this was ignored for this project. Besides the reviews themselves, the data also contains URLs identifying which movie each review belongs to; this is what links the movies to the reviews.

Why did you choose these particular datasets?

The datasets chosen for this project provide a way to link actors to each other through the movies they appear in. They also provide reviews for the movies, so that sentiment analysis can be performed and the resulting sentiment scores of the movies can be linked back to the actors who appeared in them.

Some of the data files are also used when cleaning the data and making it more suitable for the project. This was especially important since the data set was very large initially.

All of these data files therefore provide everything needed for this project, which is why they were chosen.

What was your goal for the end user's experience?

The goal of this project is to find communities of actors/actresses who are enjoyable together. The project is therefore not about finding good movies, but about finding out which actors/actresses make good movies when working together.

It is therefore possible for an actor to have bad reviews in general, but still be enjoyable to watch when paired up with certain actors/actresses.

Data preparation

We have two types of data:

  • reviews
  • IMDB databases containing actors, movies and ratings

The databases contain a lot of irrelevant information, such as games and movies that have no reviews in the review data set. We therefore first have to clean the databases in order to keep only the relevant information.

In this section we also comment on the different methods and tools used for data cleaning.

Preparation for data cleaning

Execution style

In [1]:
startFromCleanData = True #Start from the cleaned files instead of re-importing the raw data
fastExecution = False     #Use the stored graph, positions and data frames instead of rebuilding them
savingFigures = True      #Whether or not to save the figures produced

Libraries

In [2]:
# Import Libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import fa2
import math
import community
import matplotlib.cm as cm
from __future__ import division
import matplotlib.image as mpimg
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import io
from collections import Counter
from wordcloud import WordCloud
from scipy.special import zeta
import pickle
# Rendering Parameters
title_font = {'family': 'sans-serif',
        'color':  '#000000',
        'weight': 'normal',
        'size': 16,
        }
#COLORS
mBlue = "#55638A"     # For actor
fRed = "#9E1030"    # For actress

Object Storage

In [3]:
#PICKLE
def save_obj(obj, name ):
    with open('obj/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open('obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)
        

We decided to use Pickle for storing our data frames and other objects, because it is natively supported by pandas, and because pickled files are more compact than plain text files and much faster to read.
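
As a quick illustration of these helpers (the object and file name below are made up for this example, and the obj/ folder is assumed to exist), saving and reloading an object looks like this:

# Hypothetical example: save a small dictionary and read it back
exampleRatings = {"tt0050083": 8.9, "tt0108052": 8.9}
save_obj(exampleRatings, "exampleRatings")   # written to obj/exampleRatings.pkl
print load_obj("exampleRatings")["tt0050083"]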

Initialise Actor and Movie Dictionaries

In [4]:
###################################
# Initialise a movie dictionary
###################################

# Function to convert a movie or actor ID to its zero-padded string key
def idToString(iD, base): # base = "tt" for movies or "nm" for actors
    return base + str(iD).zfill(7) # pad the numeric ID with zeros up to 7 digits

# Create movie dictionary
movieDict = {}
lastMovie = 9999999 #last movie ID
if not fastExecution:
    for i in range(lastMovie):
        movieDict[idToString(i+1,"tt")] = False
    print "Movie Dictionary initialised"
else:
    print "Fast execution mode, movie dictionary will be initialised later"

Movie Dictionary initialised

We decided to use dictionaries for keeping track of which movies and actors to save. Dictionaries provide an easy way to store each movie ID and actor ID as a key, with the value of that key being either True or False depending on whether it should be kept or not. Another reason for using dictionaries is that key lookups are very fast, which matters when running over millions of rows.
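
A small sketch of this keep-flag pattern (the IDs here are arbitrary): every ID starts as False, is flipped to True when we decide to keep it, and checking the flag is a single constant-time lookup.

# Illustrative sketch of the keep-flag pattern used below
keepFlags = {"tt0000001": False, "tt0000002": False}  # initialise everything to False
keepFlags["tt0000002"] = True                         # mark one movie as kept
if keepFlags["tt0000002"]:
    print "tt0000002 will be kept"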

In [5]:
###################################
# Get the movies to keep
###################################

# List of the reviews documents
listReviewsDocuments = ["train/urls_neg.txt","test/urls_neg.txt","train/urls_pos.txt","test/urls_pos.txt","train/urls_unsup.txt"]

# Fill in the dictionary: mark every movie that has a review
for document in listReviewsDocuments:
    files = io.open("aclImdb/"+document, mode="r", encoding="utf-8")
    for row in files:
        w = re.findall(r'http://www.imdb.com/title/(\w*)/usercomments',row)
        movieDict[w[0]] = True
    files.close()

Throughout this project all text is encoded and decoded as Unicode. The reason for this is that the data used for this project is already encoded in Unicode. It is therefore the obvious choice to keep the same format when handling text throughout the project.
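
As a small sketch of why this matters (the file name below is made up for the example), reading and writing with io.open and encoding="utf-8" keeps non-ASCII names such as u"Özkan Ugur" intact in a round trip:

# Hypothetical round-trip test: non-ASCII characters survive when utf-8 is used consistently
with io.open("obj/encodingTest.txt", mode="w", encoding="utf-8") as f:
    f.write(u"Özkan Ugur\n")
with io.open("obj/encodingTest.txt", mode="r", encoding="utf-8") as f:
    print f.read().strip()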

In [6]:
###################################
# Create an Actor Dict
###################################
actorDict = {}
lastActor = 29999999 #last actor ID
for i in range(lastActor):
    actorDict[idToString(i+1,"nm")] = False
print "Actor Dictionary initialised"
Actor Dictionary initialised

Data Cleaning

After this setup the data is ready to be cleaned. The data was cleaned by keeping only what is relevant for this project. The first step was to keep only titles that have reviews and that are actually movies, as opposed to games, TV shows etc.

In [7]:
###################################
# key to movie name file
###################################

if not startFromCleanData:
    path = "DATA/title.basics.txt"
    cleanPath = "DATA/title.basics.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    b=False # skip the first line
    count =0
    for row in files:
        if b:
            split=row.split("\t")
            key = split[0]
            if movieDict[key]:
                if (split[1] in ['movie', 'tvMovie']):
                    cleanfile.write(row)
                    count +=1
                else:
                    movieDict[key]=False
        else:
            b=True
    files.close()
    cleanfile.close()


    print "There are "+str(count)+" movies considered"
    print "DATA/title.basics.txt cleaned"

After this step, only actors and actresses appearing in the remaining movies should be saved; everyone not in those movies, or with a role other than actor/actress, was therefore removed.

In [8]:
##########################################################
# film-actor links file: clean + build the actor dictionary
##########################################################

if not startFromCleanData:
    path = "DATA/title.principals.txt"
    cleanPath = "DATA/title.principals.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    roleCheckList = ["actor", "actress", "self"] #check if it is an actor
    nLinks = 0
    i=False # skip first line
    for row in files:
        if i:
            split = row.split("\t") 
            key = split[0]
            if movieDict[key]:
                if (split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList):
                    cleanfile.write(row)
                    actorDict[split[2]]=True
                    nLinks  +=1

        else:
            i=True

    files.close()
    cleanfile.close()

    ##REMOVE ERRORS
    actorDict["nm0547707"]=False
    actorDict['nm0547707']=False
    actorDict['nm0809728']=False
    actorDict['nm2442859']=False
    actorDict['nm1996613']=False
    actorDict['nm0600636']=False
    actorDict['nm1824417']=False
    actorDict['nm2440192']=False
    actorDict['nm1754167']=False

    print "There are "+str(nLinks-9)+" actors considered"
    print "DATA/title.principals.txt cleaned"
In [9]:
###################################
# key to actor name file
###################################

if not startFromCleanData:
    path = "DATA/name.basics.txt"
    cleanPath = "DATA/name.basics.clean.txt"
    files = io.open(path, mode="r", encoding="utf-8")
    cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
    count = 0
    i=False
    for row in files:
        if i:
            split = row.split("\t")
            key = split[0]
            if actorDict[key]:
                cleanfile.write(row)
        else:
            i=True

    files.close()
    cleanfile.close()
    print "DATA/name.basics.txt cleaned"

Clean Data Pre-Processing

Once everything not relevant for the project has been removed and only relevant movies and actors/actresses remain, it is then necessary to initialise all of this data in order to gather relevant information, such as the year each movie was made.

In [10]:
############################################
# Preprocess Movie Dict and get movie years
############################################

movieAgeDict = {}

path = "DATA/title.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count =0
for row in files:
    split=row.split("\t")
    key = split[0]
    if movieDict[key]:
        if (split[1] in ['movie', 'tvMovie']) and not (split[5] == "\\N"):
            movieAgeDict[key] = int(split[5])
            count +=1
files.close()

#Clean Movie dict
for i in range(lastMovie):
    movieDict[idToString(i+1,"tt")] = False

for key in movieAgeDict.keys():
    movieDict[key]=True


print "There are "+str(count)+" movies considered"
print "Movie Dictionnary Preprocessed and Movie Age Dictionnary Built"
    
There are 10735 movies considered
Movie Dictionnary Preprocessed and Movie Age Dictionnary Built
In [11]:
##########################################################
# Rebuild the actor dictionary from the cleaned links file
##########################################################

path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
roleCheckList = ["actor", "actress", "self"] #check if it is an actor
nLinks = 0
for row in files:
    split = row.split("\t") 
    key = split[0]
    if movieDict[key]:
        if (split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList):
            actorDict[split[2]]=True
            nLinks  +=1

files.close()

###REMOVE ERRORS
actorDict["nm0547707"]=False
actorDict['nm0547707']=False
actorDict['nm0809728']=False
actorDict['nm2442859']=False
actorDict['nm1996613']=False
actorDict['nm0600636']=False
actorDict['nm1824417']=False
actorDict['nm2440192']=False
actorDict['nm1754167']=False

print "There are "+str(nLinks-9)+" actors considered"

print "Actor Dictionnary Preprocessed"
    
There are 43553 actors considered
Actor Dictionnary Preprocessed
In [12]:
###################################
# Create a ratings dict
###################################
ratingDict = {}
path = "DATA/ratings.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
i=False # skip first line
for row in files:
    if i:
        key = row[:9]
        if movieDict[key]:
            split = row.split("\t") 
            ratingDict[key] = float(split[1])
    else:
        i=True

files.close()
In [13]:
###################################
# Create a movie name dict
###################################
movieNameDict = {}
moviesList = []
path = "DATA/title.akas.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
    split = row.split("\t") 
    if movieDict[split[0]] and not (split[0] in movieNameDict) and (split[0] in ratingDict) and "original" in row   :
        movieNameDict[split[0]] = split[2]
        moviesList.append(split[0])

files.close()
In [14]:
###################################
# Create an actor name dict
###################################
actorNameDict = {}
actorGenderDict = {}
actorsList = []
path = "DATA/name.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
    split = row.split("\t") 
    if actorDict[split[0]] and not (split[0] in actorNameDict):
        actorNameDict[split[0]] = split[1]
        if "actor" in split[4]:
            actorGenderDict[split[0]] = "M"
        else:
            actorGenderDict[split[0]] = "F"
        actorsList.append(split[0])
files.close()
In [20]:
###################################
# Build a movie data frame
###################################
if not fastExecution:
    moviesData = {"iD" : movieNameDict.keys(), "Title": pd.Series(np.zeros(len(moviesList))), "Rating":pd.Series(np.zeros(len(moviesList))), "Year":pd.Series(np.zeros(len(moviesList)))}
    moviesDF = pd.DataFrame(moviesData)
    for i in moviesDF.index:
        iD =moviesDF.loc[i].at["iD"]
        moviesDF.loc[i, "Title"]= movieNameDict[iD]
        moviesDF.loc[i, "Rating"] = ratingDict[iD]
        moviesDF.loc[i, "Year"]= movieAgeDict[iD]
    if savingFigures:
        moviesDF.to_pickle("obj/moviesDF.pkl")
else:
    moviesDF = pd.read_pickle("obj/moviesDF.pkl")
moviesDF.sort_values("Rating", ascending=False).head(10)
Out[20]:
      Rating  Title                                               Year    iD
8686  9.1     The Regard of Flight                                1983.0  tt0134050
7737  9.0     Notre-Dame de Paris                                 1999.0  tt0285800
8377  8.9     Ko to tamo peva                                     1980.0  tt0076276
4887  8.9     12 Angry Men                                        1957.0  tt0050083
9860  8.9     Schindler's List                                    1993.0  tt0108052
1157  8.9     The Lord of the Rings: The Return of the King       2003.0  tt0167260
8305  8.8     Saban Oglu Saban                                    1977.0  tt0253614
1389  8.8     Sobache serdtse                                     1988.0  tt0096126
2151  8.8     The Art of Amália                                   2000.0  tt0204839
9079  8.8     The Lord of the Rings: The Fellowship of the Ring   2001.0  tt0120737

Once the data has been cleaned, the information kept for each movie is its rating, its title, the year it was made and its IMDB ID.

This is everything needed in order to link the movies to the actors and to the reviews, as well as to categorize them by year and analyze their ratings.
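
As an example of the kind of categorization this enables (a sketch, not part of the cleaning pipeline itself), the average rating per decade can be computed directly from moviesDF:

# Sketch: average rating per decade from the movie data frame built above
decades = (moviesDF["Year"] // 10 * 10).astype(int)
print moviesDF.groupby(decades)["Rating"].mean().round(2)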

In [22]:
###################################
# Build an actor data frame
###################################
if not fastExecution:
    actorsData = {"iD": actorNameDict.keys(), "Name": pd.Series(np.zeros(len(actorsList))),"Gender": pd.Series(np.zeros(len(actorsList)))}
    actorsDF = pd.DataFrame(actorsData)
    for i in actorsDF.index:
        iD = actorsDF.loc[i].at["iD"]
        actorsDF.loc[i, "Name"]= actorNameDict[iD]
        actorsDF.loc[i, "Gender"] = actorGenderDict[iD]
    if savingFigures:
        actorsDF.to_pickle("obj/actorsDF.pkl")
else:
    actorsDF = pd.read_pickle("obj/actorsDF.pkl")
actorsDF.head(10)
Out[22]:
   Gender  Name                      iD
0  F       Bobbie Bresee             nm0107679
1  F       Malgorzata Rozniatowska   nm0747647
2  M       Ahmet Ugurlu              nm0880128
3  F       Laura Nativo              nm1137466
4  F       Jordy Benattar            nm0070237
5  M       Özkan Ugur                nm0880126
6  M       John Foss                 nm1458561
7  M       Panayiotis Hartomatzidis  nm0367186
8  M       Simon Abkarian            nm0008787
9  F       Victoria Snow             nm0795281

For actors and actresses the only relevant information was their gender, their name and their IMDB ID, which is used when linking them to the movies.

In [17]:
###################################
# Create a links list
###################################
path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
links = np.empty((nLinks,2),dtype=object)
count = 0
for row in files:
    split = row.split("\t")
    if actorDict[split[2]]:
        links[count,0]= split[0]
        links[count,1]= split[2]
        count+=1

files.close()
In [23]:
###################################
# Create an actor links list
###################################
actorsLinks = []
files = io.open("obj/actorsLinksList.txt", mode="w", encoding="utf-8")
for i in range(count-1):
    j = i+1
    while (j<count) and (links[i,0]==links[j,0]):
        actorsLinks.append([links[i,1],links[j,1],links[i,0]]) #[actor1, actor2, movie]
        files.write(str(links[i,1])+"\t"+str(links[j,1])+"\t"+links[i,0]+"\r\n")
        j+=1
files.close()

LOAD & CLEAN DATA FUNCTION

In [25]:
def cleanLoadData():
    
    #build the Dataframes
    mDF = pd.read_pickle("obj/moviesDF.pkl")
    aDF = pd.read_pickle("obj/actorsDF.pkl")
    aLL = []
    files = io.open("obj/actorsLinksList.txt", mode="r", encoding="utf-8")
    for row in files:
        split = row.split("\t")
        aLL.append(split)
    files.close()
    
    #rebuild the dictionaries
    movieAgeDict = {}
    ratingDict = {}
    actorName = {}
    movieName = {}
    #movies
    for i in mDF.index:
        iD = mDF.loc[i].at["iD"]
        rating = mDF.loc[i].at["Rating"]
        title = mDF.loc[i].at["Title"]
        year = mDF.loc[i].at["Year"]
        movieAgeDict[iD] = year
        ratingDict[iD] = rating
        movieName[iD] = title
    #actors
    for i in aDF.index:
        iD = aDF.loc[i].at["iD"]
        name = aDF.loc[i].at["Name"]
        actorName[iD]= name
    return movieAgeDict,ratingDict,actorName,movieName,mDF,aDF,aLL
    

Once the data has been cleaned and saved to files, all that is left to do is to load the data and use it in the rest of the project.
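
In the rest of the project the cleaned data can then be reloaded with a single call, for example:

# Reload the dictionaries, data frames and the actor links list from the stored files
movieAgeDict, ratingDict, actorName, movieName, moviesDF, actorsDF, actorsLinks = cleanLoadData()
print str(len(actorsLinks)) + " actor-actor links loaded"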

Cleaned data stats

As mentioned in the "What is our data set" chapter the original data consists of over 30 million rows and 1.3 Gb of data. The cleaned data ends up being around 44.000 rows with a size of 2.1Mb. which is approximately 0,15% of the original data.