For this project it was decided to work with IMDb data. IMDb provides a subset of its data to anyone interested in analyzing it. The data provided by IMDb is split into multiple files, and not all of them are used for this project. The files used are: name.basics.tsv.gz, title.akas.tsv.gz, title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz and title.ratings.tsv.gz. The file "title.principals" is the main data file; it originally consists of 1.3 GB of data with 30,674,812 rows and 6 columns. Together these data files contain information about movies, the actors in those movies, when the movies were made, the roles the different people had in the movies, the IDs of movies and actors, the type of title (such as TV show or movie) and the ratings of the movies.
Besides these data files, some files containing movie reviews were also used. These reviews were downloaded from the website Kaggle.com and consist of 100,000 reviews of 14,127 movies. The reviews from Kaggle are divided into two folders, each containing 50,000 reviews. Sentiment analysis has already been conducted on some of the reviews, but this was ignored for this project. Besides the review texts, the data also contains URLs indicating which movie each review belongs to; this is what links the movies to the reviews.
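To illustrate that link, each review URL follows the pattern http://www.imdb.com/title/.../usercomments, so the IMDb movie ID can be pulled out with a regular expression. A minimal sketch (the ID below is only illustrative; the same pattern is used in the cleaning code later):
import re
# Sketch: extract the IMDb movie ID from a review URL (tt0000001 is an illustrative ID)
url = "http://www.imdb.com/title/tt0000001/usercomments"
movieId = re.findall(r'http://www.imdb.com/title/(\w*)/usercomments', url)[0]
print movieId  # prints: tt0000001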
The datasets chosen for this project provide a way to link actors to each other through their movies. They also provide reviews of the movies, so that sentiment analysis can be performed on the reviews and the resulting sentiment scores of the movies an actor has appeared in can be linked back to that actor.
Some of the data files are also used when cleaning the data and making it more suitable for the project, which was especially important since the data set was initially very large.
All of these data files therefore provide everything needed for this project, which is why they were chosen.
The goal of this project is to find communities of actors/actresses who are enjoyable together. The project is therefore not about finding good movies, but about finding out which actors/actresses make good movies when working together.
It is therefore possible for an actor to have bad reviews in general, but still be enjoyable to watch when paired with certain actors/actresses.
We have two types of data: the IMDb data files describing movies and actors, and the movie reviews from Kaggle.
The databases contain a lot of irrelevant information, such as games, or movies that have no reviews in the review data set. We therefore first have to clean our databases in order to keep only the relevant information.
In this section we also comment on the different methods and tools used for data cleaning.
startFromCleanData = True #Start from the raw data or from the already cleaned files
fastExecution = False #Use the stored graph, positions and DF, or rebuild them
savingFigures = True #Whether or not to save the figures produced
# Import Libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import fa2
import math
import community
import matplotlib.cm as cm
from __future__ import division
import matplotlib.image as mpimg
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import io
from collections import Counter
from wordcloud import WordCloud
from scipy.special import zeta
import pickle
# Rendering Parameters
title_font = {'family': 'sans-serif',
'color': '#000000',
'weight': 'normal',
'size': 16,
}
#COLORS
mBlue = "#55638A" # For actor
fRed = "#9E1030" # For actress
#PICKLE
def save_obj(obj, name ):
with open('obj/'+ name + '.pkl', 'wb') as f:
pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
def load_obj(name ):
with open('obj/' + name + '.pkl', 'rb') as f:
return pickle.load(f)
We decided to use pickle to store our data frames, because it is natively supported by pandas and because pickle files are more compact than text files, which allows for much faster reading.
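As a small illustration of how these helpers are meant to be used (a sketch; it assumes the obj/ folder exists, and "exampleDict" is just an illustrative name):
exampleDict = {"tt0000001": True, "tt0000002": False}
save_obj(exampleDict, "exampleDict")   # writes obj/exampleDict.pkl
loaded = load_obj("exampleDict")       # reads it back
print loaded == exampleDict            # prints: True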
###################################
# Initialise a movie dictionnary
###################################
# Function to convert a movie or actor id to a string key
def idToString(iD, base): # base = "tt" for movies or "nm" for actors
if iD<10:
return base+"000000"+str(iD)
if iD<100:
return base+"00000"+str(iD)
if iD<1000:
return base+"0000"+str(iD)
if iD<10000:
return base+"000"+str(iD)
if iD<100000:
return base+"00"+str(iD)
if iD<1000000:
return base+"0"+str(iD)
else:
return base+str(iD)
# Create movie dictionnary
movieDict = {}
lastMovie = 9999999 #last movie ID
if not fastExecution:
for i in range(lastMovie):
movieDict[idToString(i+1,"tt")] = False
print "Movie Dictionnary initialised"
else:
print "Fast execution mode, movie dictionnary will be initialised later"
We decided to use dictionaries when checking which movies and actors to keep in our data. Dictionaries provide an easy way to store each movie ID and actor ID as a key whose value is either True or False, depending on whether it should be kept. Another reason for choosing dictionaries is that key lookups are fast, which speeds up the cleaning code.
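As a side note, the zero-padding done by idToString above is equivalent to Python's built-in str.zfill, and the dictionary is simply used as a fast boolean flag per ID. A minimal sketch:
print idToString(42, "tt")        # prints: tt0000042
print "tt" + str(42).zfill(7)     # prints: tt0000042, the same key
# The dictionary then acts as a keep/discard flag with O(1) lookups:
flagExample = {"tt0000042": False}
flagExample["tt0000042"] = True   # mark this movie as kept
print flagExample["tt0000042"]    # prints: True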
###################################
# Get the movies to keep
###################################
# List of the reviews documents
listReviewsDocuments = ["train/urls_neg.txt","test/urls_neg.txt","train/urls_pos.txt","test/urls_pos.txt","train/urls_unsup.txt"]
# Fill in the dictionnary
for document in listReviewsDocuments:
files = io.open("aclImdb/"+document, mode="r", encoding="utf-8")
for row in files:
w = re.findall(r'http://www.imdb.com/title/(\w*)/usercomments',row)
movieDict[w[0]] = True
Throughout this project all data is encoded and decoded as Unicode. The reason for this is that the data used for the project is already encoded in Unicode, so it is the obvious choice to keep the same format when handling text throughout the project.
###################################
# Create an Actor Dict
###################################
actorDict = {}
lastActor = 29999999 #last actor ID
for i in range(lastActor):
actorDict[idToString(i+1,"nm")] = False
print "Actor Dictionnary initialised"
After this setup the data is ready to be cleaned. The cleaning consists of keeping only data that is relevant for this project. First, only titles that have reviews and that actually are movies (and not games, TV shows, etc.) are kept.
###################################
# key to movie name file
###################################
if not startFromCleanData:
path = "DATA/title.basics.txt"
cleanPath = "DATA/title.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
b=False # skip the first line
count =0
for row in files:
if b:
split=row.split("\t")
key = split[0]
if movieDict[key]:
if (split[1] in ['movie', 'tvMovie']):
cleanfile.write(row)
count +=1
else:
movieDict[key]=False
else:
b=True
files.close()
cleanfile.close()
print "There are "+str(count)+" movies considered"
print "DATA/title.basics.txt cleaned"
After this step, only actors and actresses appearing in the remaining movies should be kept; everyone not in those movies, or with a role other than actor/actress, was therefore removed.
##########################################################
# film actors links file : Clean + get actor dictionnary
##########################################################
if not startFromCleanData:
path = "DATA/title.principals.txt"
cleanPath = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
roleCheckList = ["actor", "actress", "self"] #check if it is an actor
nLinks = 0
i=False # skip first line
for row in files:
if i:
split = row.split("\t")
key = split[0]
if movieDict[key]:
if (split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList):
cleanfile.write(row)
actorDict[split[2]]=True
nLinks +=1
else:
i=True
files.close()
cleanfile.close()
##REMOVE ERRORS
actorDict["nm0547707"]=False
actorDict['nm0547707']=False
actorDict['nm0809728']=False
actorDict['nm2442859']=False
actorDict['nm1996613']=False
actorDict['nm0600636']=False
actorDict['nm1824417']=False
actorDict['nm2440192']=False
actorDict['nm1754167']=False
print "There are "+str(nLinks-9)+" actors considered"
print "DATA/title.principals.txt cleaned"
###################################
# key to actor name file
###################################
if not startFromCleanData:
path = "DATA/name.basics.txt"
cleanPath = "DATA/name.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
cleanfile = io.open(cleanPath, mode="w", encoding="utf-8")
count = 0
i=False
for row in files:
if i:
split = row.split("\t")
key = split[0]
if actorDict[key]:
cleanfile.write(row)
else:
i=True
files.close()
cleanfile.close()
print "DATA/name.basics.txt cleaned"
Once everything not relevant for the project has been removed and only the relevant movies and actors/actresses remain, the cleaned data is processed in order to gather relevant information about it, such as movie years.
############################################
# Preprocess Movie Dict and get movie years
############################################
movieAgeDict = {}
path = "DATA/title.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count =0
for row in files:
split=row.split("\t")
key = split[0]
if movieDict[key]:
if (split[1] in ['movie', 'tvMovie']) and not (split[5] == "\\N"):
movieAgeDict[key] = int(split[5])
count +=1
files.close()
#Clean Movie dict
for i in range(lastMovie):
movieDict[idToString(i+1,"tt")] = False
for key in movieAgeDict.keys():
movieDict[key]=True
print "There are "+str(count)+" movies considered"
print "Movie Dictionnary Preprocessed and Movie Age Dictionnary Built"
##########################################################
# film actors links file : Clean + get actor dictionnary
##########################################################
path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
roleCheckList = ["actor", "actress", "self"] #check if it is an actor
nLinks = 0
for row in files:
split = row.split("\t")
key = split[0]
if movieDict[key]:
if (split[3] in roleCheckList or split[4] in roleCheckList or split[5] in roleCheckList):
actorDict[split[2]]=True
nLinks +=1
files.close()
###REMOVE ERRORS
actorDict["nm0547707"]=False
actorDict['nm0547707']=False
actorDict['nm0809728']=False
actorDict['nm2442859']=False
actorDict['nm1996613']=False
actorDict['nm0600636']=False
actorDict['nm1824417']=False
actorDict['nm2440192']=False
actorDict['nm1754167']=False
print "There are "+str(nLinks-9)+" actors considered"
print "Actor Dictionnary Preprocessed"
###################################
# Create a ratings dict
###################################
ratingDict = {}
path = "DATA/ratings.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
i=False # skip first line
for row in files:
if i:
key = row[:9]
if movieDict[key]:
split = row.split("\t")
ratingDict[key] = float(split[1])
else:
i=True
files.close()
###################################
# Create a movie name dict
###################################
movieNameDict = {}
moviesList = []
path = "DATA/title.akas.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
split = row.split("\t")
if movieDict[split[0]] and not (split[0] in movieNameDict) and (split[0] in ratingDict) and "original" in row :
movieNameDict[split[0]] = split[2]
moviesList.append(split[0])
files.close()
###################################
# Create an actor name dict
###################################
actorNameDict = {}
actorGenderDict = {}
actorsList = []
path = "DATA/name.basics.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
count = 0
for row in files:
split = row.split("\t")
if actorDict[split[0]] and not (split[0] in actorNameDict):
actorNameDict[split[0]] = split[1]
if "actor" in split[4]:
actorGenderDict[split[0]] = "M"
else:
actorGenderDict[split[0]] = "F"
actorsList.append(split[0])
files.close()
###################################
# Build a movie data frame
###################################
if not fastExecution:
moviesData = {"iD" : movieNameDict.keys(), "Title": pd.Series(np.zeros(len(moviesList))), "Rating":pd.Series(np.zeros(len(moviesList))), "Year":pd.Series(np.zeros(len(moviesList)))}
moviesDF = pd.DataFrame(moviesData)
for i in moviesDF.index:
iD =moviesDF.loc[i].at["iD"]
moviesDF.loc[i, "Title"]= movieNameDict[iD]
moviesDF.loc[i, "Rating"] = ratingDict[iD]
moviesDF.loc[i, "Year"]= movieAgeDict[iD]
if savingFigures:
moviesDF.to_pickle("obj/moviesDF.pkl")
else:
moviesDF = pd.read_pickle("obj/moviesDF.pkl")
moviesDF.sort_values("Rating", ascending=False).head(10)
When the data has been cleaned, the remaining information for each movie is its rating, title, release year and movie ID.
This is everything needed to link the movies to the actors and the different reviews, as well as to group them by year and analyze their ratings.
###################################
# Build an actor data frame
###################################
if not fastExecution:
actorsData = {"iD": actorNameDict.keys(), "Name": pd.Series(np.zeros(len(actorsList))),"Gender": pd.Series(np.zeros(len(actorsList)))}
actorsDF = pd.DataFrame(actorsData)
for i in actorsDF.index:
iD = actorsDF.loc[i].at["iD"]
actorsDF.loc[i, "Name"]= actorNameDict[iD]
actorsDF.loc[i, "Gender"] = actorGenderDict[iD]
if savingFigures:
actorsDF.to_pickle("obj/actorsDF.pkl")
else:
actorsDF = pd.read_pickle("obj/actorsDF.pkl")
actorsDF.head(10)
For actors and actresses the only relevant information is their gender, name and IMDb ID, the latter being used to link them to the movies.
###################################
# Create a links list
###################################
path = "DATA/title.principals.clean.txt"
files = io.open(path, mode="r", encoding="utf-8")
links = np.empty((nLinks,2),dtype=object)
count = 0
for row in files:
split = row.split("\t")
if actorDict[split[2]]:
links[count,0]= split[0]
links[count,1]= split[2]
count+=1
files.close()
###################################
# Create an actor links list
###################################
actorsLinks = []
files = io.open("obj/actorsLinksList.txt", mode="w", encoding="utf-8")
for i in range(count-1):
j = i+1
while (j<count) and (links[i,0]==links[j,0]):
actorsLinks.append([links[i,1],links[j,1],links[i,0]]) #[actor1, actor2, movie]
files.write(str(links[i,1])+"\t"+str(links[j,1])+"\t"+links[i,0]+"\r\n")
j+=1
files.close()
def cleanLoadData():
#build the Dataframes
mDF = pd.read_pickle("obj/moviesDF.pkl")
aDF = pd.read_pickle("obj/actorsDF.pkl")
aLL = []
files = io.open("obj/actorsLinksList.txt", mode="r", encoding="utf-8")
for row in files:
split = row.split("\t")
aLL.append(split)
files.close()
#rebuild the Dictionnary
movieAgeDict = {}
ratingDict = {}
actorName = {}
movieName = {}
#movies
for i in mDF.index:
iD = mDF.loc[i].at["iD"]
rating = mDF.loc[i].at["Rating"]
title = mDF.loc[i].at["Title"]
year = mDF.loc[i].at["Year"]
movieAgeDict[iD] = year
ratingDict[iD] = rating
movieName[iD] = title
#actors
for i in aDF.index:
iD = aDF.loc[i].at["iD"]
name = aDF.loc[i].at["Name"]
actorName[iD]= name
return movieAgeDict,ratingDict,actorName,movieName,mDF,aDF,aLL
Once the data has been cleaned and saved to files, all that is left to do is load it and use it in the rest of the project.
As mentioned in the "What is our data set" chapter, the original data consists of over 30 million rows and 1.3 GB of data. The cleaned data ends up being around 44,000 rows with a size of 2.1 MB, which is approximately 0.15% of the original data.
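As a quick sanity check of that figure, using the row counts and file sizes quoted above:
$\frac{44\,000}{30\,674\,812} \approx 0.14\% \qquad \frac{2.1\ \mathrm{MB}}{1300\ \mathrm{MB}} \approx 0.16\%$
both of which are consistent with the roughly 0.15% quoted.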
fastExecution = True # Whether or not to use pre-built shortcut files to skip long-running blocks of code
savingFigures = False # Whether or not to save the figures produced
savingData = False # Whether or not to build the shortcut files for future fastExecution runs
# Import Libraries
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
import re
import fa2
import math
import community
import matplotlib.cm as cm
import pickle
from __future__ import division
import matplotlib.image as mpimg
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import io
from collections import Counter
from wordcloud import WordCloud
from scipy.special import zeta
# Rendering Parameters
title_font = {'family': 'sans-serif',
'color': '#000000',
'weight': 'bold',
'size': 28,
}
#COLORS
mBlue = "#55638A" # For actor
fRed = "#9E1030" # For actress
The cleaning and preprocessing of the data can be found in Explainer notebooks part 1 and 2.
Here we load the movie and actor data frames (moviesDF, actorsDF), the movie year dictionary (movieAgeDict), the rating dictionary (ratingDict), the actor and movie name dictionaries (actorNameDict, movieNameDict) and a list containing all the collaborations between actors (actorsLinks).
from loadData import cleanLoadData
movieAgeDict,ratingDict,actorNameDict,movieNameDict,moviesDF,actorsDF,actorsLinks = cleanLoadData()
actorsDF.head(10)
moviesDF.sort_values("Rating", ascending=False).head(10)
We chose to build a graph where the nodes are the actors and the links represent the collaborations between two actors: the movies they both participated in, the ratings those movies got and the average year of their collaborations.
##########################
# Create the actors Graph
##########################
G = nx.Graph()
#add nodes
for i in actorsDF.index:
G.add_node(actorsDF.loc[i].at["iD"], Name= actorsDF.loc[i, "Name"], Gender = actorsDF.loc[i, "Gender"])
#add links
for link in actorsLinks:
if link[0] != link[1]:
if G.has_edge(link[0],link[1]): #Update existing edges
G[link[0]][link[1]]["weight"] +=1
G[link[0]][link[1]]["movies"].append(link[2])
#Average Rating
avRating = (G[link[0]][link[1]]["avRating"])*(1-1.0/G[link[0]][link[1]]["weight"]) #Former ratings
avRating += ratingDict[link[2]]/G[link[0]][link[1]]["weight"] # Added Movie
G[link[0]][link[1]]["avRating"] = avRating
#Average Year
avYear = (G[link[0]][link[1]]["avYear"])*(1-1.0/G[link[0]][link[1]]["weight"]) #Former ratings
avYear += movieAgeDict[link[2]]/G[link[0]][link[1]]["weight"] # Added Movie
G[link[0]][link[1]]["avYear"] = avYear
else: #Create new edge
G.add_edge(link[0], link[1],
weight = 1,
movies = [link[2]],
avRating = ratingDict[link[2]],
avYear = movieAgeDict[link[2]])
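The avRating and avYear updates above are incremental running means: with $n$ the updated edge weight and $x_n$ the rating (or year) of the newly added movie, the update applied is
$\bar{x}_n = \bar{x}_{n-1}\left(1-\frac{1}{n}\right) + \frac{x_n}{n}$
so the averages can be maintained without keeping the full list of past ratings and years.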
We took the giant component to have a connected graph to study.
##################
# Clean the Graph
##################
G = max(nx.connected_component_subgraphs(G), key=len)
if savingData:
nx.write_gpickle(G, 'obj/full.gpickle')
print "The graph has "+str(G.number_of_nodes())+" nodes (actors) and "+str(G.number_of_edges())+" edges (collaborations)"
#################################################
# Set the node colors actording to actors gender
#################################################
def getColors(graph):
mBlue = "#55638A" # For actor
fRed = "#9E1030" # For actress
colors = {} # Build the color
k=0
for n in graph.nodes:
if graph.nodes[n]['Gender'] == "F":
colors[n]= fRed
else:
colors[n]= mBlue
return colors
We use separate colors for men and women to see if they have different influence in the graph.
###############################################
# Set the edge colors actording to movie years
###############################################
def getEdgeColors(graph):
c1930 = "#c12424"
c1955 = "#ff6612"
c1980 = "#ffce00"
c1995 = "#e3f018"
cNow = "#bdff00"
edgesColors = {}
for e in graph.edges: # iterate over the edges of the graph passed as argument
edgesColors[e] = c1930 #RED
if graph.get_edge_data(*e)["avYear"]>1930:
edgesColors[e] = c1955 #ORANGE
if graph.get_edge_data(*e)["avYear"]>1955:
edgesColors[e] = c1980 #YELLOW
if graph.get_edge_data(*e)["avYear"]>1980:
edgesColors[e] = c1995 #LIGHT GREEN
if graph.get_edge_data(*e)["avYear"]>1995:
edgesColors[e] = cNow #GREEN
return edgesColors
The edges are colored according to the average year of the two actors' collaborations, from red for old movies to green for the most recent ones. This shows how important the time period is in this network.
###########################################################
# Set the size of the nodes according to outgoing strength
###########################################################
sizes = {}
actorsDF["Collab"] = pd.Series(np.zeros(len(actorsDF.index))) #to store the outgoing strength
for i in actorsDF.index: # go through actors
iD =actorsDF.loc[i].at["iD"]
if iD in G.nodes(): # if actor in the grpah
edges = list(G.edges(iD))
ogStrength = 0
for e in edges: # go though his edges
ogStrength += G.get_edge_data(*e)["weight"] #update outgoing strength
actorsDF.loc[i, "Collab"]= ogStrength
sizes[iD] = ogStrength
else :
actorsDF.loc[i, "Collab"]= 0
actorsDF=actorsDF.sort_values("Collab", ascending=False) #Sort actors DF
In this network, the outgoing strength corresponds to the number of times an actor has collaborated with others. It tends to grow with the number of movies an actor has made and the number of actors in those movies, so it represents the importance of the actor in the industry. <br> We add the names of the 10 actors with the largest outgoing strength to the plot.
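Formally, the outgoing strength of actor $i$ computed below is the sum of the weights of its incident edges,
$s_i = \sum_{j \in \mathcal{N}(i)} w_{ij}$
where $\mathcal{N}(i)$ is the set of co-actors of $i$ and $w_{ij}$ is the number of movies $i$ and $j$ share.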
####################################################################
# Display the name of the actors with the biggest outgoing strength
####################################################################
SortedNames = np.asarray(actorsDF["Name"])
SortedNames = SortedNames[:10]
labels = {}
for n in G.nodes():
name = actorNameDict[n]
if name in SortedNames:
labels[n]="\n\n"+name
else:
labels[n]=""
We used the ForceAtlas2 algorithm to obtain a better rendering of the network.
########################
# Force Atlas Algorithm
########################
forceatlas2 = fa2.ForceAtlas2(
# Behavior alternatives
outboundAttractionDistribution=False, # Dissuade hubs
linLogMode=False, # NOT IMPLEMENTED
adjustSizes=False, # Prevent overlap (NOT IMPLEMENTED)
edgeWeightInfluence=1.0,
# Performance
jitterTolerance=1.0, # Tolerance
barnesHutOptimize=True,
barnesHutTheta=1.2,
multiThreaded=False, # NOT IMPLEMENTED
# Tuning
scalingRatio=0.005,
strongGravityMode=False,
gravity=20,
# Log
verbose=True)
The positions can be obtained with the force atlas algorithm, but since the graph is fairly big this takes quite a while (about 15 minutes). To speed things up we save the position dictionary in a text file, and in "fastExecution" mode we read the positions directly from this file.
####################
# Get the positions
####################
pos={} # Position of the nodes in the force2 algorithm
if fastExecution:
path = "DATA/forceAtlasPositions.txt"
files = io.open(path, mode="r", encoding="utf-8")
for row in files:
split = row.split("\t")
pos[split[0]] = (float(split[1]),float(split[2]))
files.close()
else:
pos = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=3000) #~15 mn
if savingData:
#Store the positions in a text file
path = "DATA/forceAtlasPositions.txt"
files = io.open(path, mode="w", encoding="utf-8")
for key in pos.keys():
row = key +"\t" + str(pos[key][0]) +"\t"+str(pos[key][1])+"\r\n"
files.write(row.replace("u'","").replace("'",""))
files.close()
The sizes are computed from the outgoing strength. To amplify the differences and make the plot more readable we use the square of the outgoing strength: nodes with a large outgoing strength appear bigger and small ones smaller. <br>
$nodeSize = \left(0.2\cdot outgoingStrength\right)^2$
#######
# Draw
#######
#Get colors
colors = getColors(G)
edgesColors = getEdgeColors(G)
#Plot
fig = plt.figure(figsize=(30, 30))
nx.draw_networkx(G, pos,
node_size = [(0.2*sizes[n])**(2) for n in G.nodes()],
node_color = [colors[n] for n in G.nodes()],
with_labels=True,
width = 0.1,
edge_color=[edgesColors[e] for e in G.edges()],
labels = labels,
font_size = 17,
font_weight= "bold")
plt.axis('off')
plt.title("IMDb Actors Graph", fontdict = title_font )
base = 'Figures/actorGraph'
if savingFigures:
plt.savefig(base+'.jpeg', bbox_inches='tight')
plt.savefig(base+'.svg', bbox_inches='tight')
plt.show()
We can distinguish 3 main groups:
</br>
We can see the impact of the age of the collaborations: old American actors are grouped in the same area of the graph. Other old links are harder to visualize. This can be explained by the fact that the movie industry was not very big at the time and not very international, so links between old non-American actors and the bulk of the actors are rare.
Let's build a function to get a distribution histogram from a list or an array of data.
######################
# HISTOGRAM FUNCTION
######################
def histogram(degrees, dens): # degrees (list or array of data), dens (whether it is a density histogram or not)
# Computing Bins
min_bin = np.amin(degrees)
max_bin = np.amax(degrees)
nb_bins = int(math.ceil(max_bin)-math.floor(min_bin))
v = np.empty(nb_bins+1)
v[nb_bins] = int(math.ceil(max_bin))
bins = np.empty(nb_bins)
for i in range(nb_bins):
v[i] = int(min_bin + i)
bins[i] = int(min_bin + i)
#Hist
hist, bin_edges = np.histogram(degrees,bins = v,density=dens)
return hist, bin_edges
Let's look at how the movies and collaborations are spread across the years.
##################################
# HISTOGRAM OF THE MOVIES BY YEAR
##################################
moviesYear = np.asarray(moviesDF["Year"])
linksYear = []
for link in actorsLinks:
if link[0] != link[1]:
year = movieAgeDict[link[2]]
linksYear.append(year)
# Get the histograms
histM, binsM = histogram(moviesYear,False)
histL, binsL = histogram(linksYear,False)
plt.figure(figsize = (15,6))
plt.bar(binsM[:-1], histM, 0.35, color=mBlue, label = "Movies")
plt.bar([b+0.4 for b in binsL[:-1]], histL, 0.35, color=fRed, label = "Links")
plt.xlabel('Year')
plt.ylabel('Occurences')
plt.title('Movies and links distribution', fontdict = title_font)
plt.legend()
base = 'Figures/moviesDist'
if savingFigures:
plt.savefig(base+'.jpeg', bbox_inches='tight')
plt.savefig(base+'.png', bbox_inches='tight')
plt.savefig(base+'.svg', bbox_inches='tight')
plt.show()
The gap in the 1960s corresponds to the worst years for the American industry; very few movies were produced at that time. Since the 1980s we can see a huge rise in the number of collaborations, but a smaller one for the movies. This could indicate that more and more movies are produced every year since this period, and that more and more actors play in those movies.<br>
We therefore divide the data into movies from before 1970, then between 1970 and 1980, 1980 and 1990, 1990 and 2000, and finally after 2000.
############################
# REPARTITION AMONG PERIODS
############################
#Define periods and set counts
periods = [1900,1970,1980,1990,2000]
moviesByPeriods = np.zeros(5,dtype=int)
linksByPeriods = np.zeros(5,dtype=int)
nMovies = 0
nLinks = 0
#Go through movies
for i in moviesDF.index:
age = moviesDF.loc[i, "Year"]
nMovies +=1
if age < 1970:
moviesByPeriods[0]+=1
elif age < 1980:
moviesByPeriods[1]+=1
elif age < 1990:
moviesByPeriods[2]+=1
elif age < 2000:
moviesByPeriods[3]+=1
else:
moviesByPeriods[4]+=1
#Go through links
for e in G.edges():
age = G.get_edge_data(*e)["avYear"]
nLinks +=1
if age < 1970:
linksByPeriods[0]+=1
elif age < 1980:
linksByPeriods[1]+=1
elif age < 1990:
linksByPeriods[2]+=1
elif age < 2000:
linksByPeriods[3]+=1
else:
linksByPeriods[4]+=1
print "Period 1900-1970: "+str(moviesByPeriods[0])+"("+str(round(100*moviesByPeriods[0]/nMovies,2))+"%)"+" movies"+" - "+str(linksByPeriods[0])+"("+str(round(100*linksByPeriods[0]/nLinks,2))+"%)"+" links."
print "Period 1970-1980: "+str(moviesByPeriods[1])+"("+str(round(100*moviesByPeriods[1]/nMovies,2))+"%)"+" movies"+" - "+str(linksByPeriods[1])+"("+str(round(100*linksByPeriods[1]/nLinks,2))+"%)"+" links"
print "Period 1980-1990: "+str(moviesByPeriods[2])+"("+str(round(100*moviesByPeriods[2]/nMovies,2))+"%)"+" movies"+" - "+str(linksByPeriods[2])+"("+str(round(100*linksByPeriods[2]/nLinks,2))+"%)"+" links"
print "Period 1990-2000: "+str(moviesByPeriods[3])+"("+str(round(100*moviesByPeriods[3]/nMovies,2))+"%)"+" movies"+" - "+str(linksByPeriods[3])+"("+str(round(100*linksByPeriods[3]/nLinks,2))+"%)"+" links"
print "Period 2000+: "+str(moviesByPeriods[4])+"("+str(round(100*moviesByPeriods[4]/nMovies,2))+"%)"+" movies"+" - "+str(linksByPeriods[4])+"("+str(round(100*linksByPeriods[4]/nLinks,2))+"%)"+" links"
We build a function to get the graph corresponding to a specific period.
###################################
# Graph per period function
###################################
def graphPeriod(start,end):
G_per = nx.Graph()
#add nodes
for i in actorsDF.index:
G_per.add_node(actorsDF.loc[i].at["iD"], Name= actorsDF.loc[i, "Name"], Gender = actorsDF.loc[i, "Gender"])
#add links
for link in actorsLinks:
if (start < movieAgeDict[link[2]]) and (movieAgeDict[link[2]]<= end):
if link[0] != link[1]:
if G_per.has_edge(link[0],link[1]):
G_per[link[0]][link[1]]["weight"] +=1
G_per[link[0]][link[1]]["movies"].append(link[2])
G_per[link[0]][link[1]]["avRating"] = (G_per[link[0]][link[1]]["avRating"])*(1-1.0/G_per[link[0]][link[1]]["weight"])+ratingDict[link[2]]/G_per[link[0]][link[1]]["weight"]
G_per[link[0]][link[1]]["avYear"] = (G_per[link[0]][link[1]]["avYear"])*(1-1.0/G_per[link[0]][link[1]]["weight"])+movieAgeDict[link[2]]/G_per[link[0]][link[1]]["weight"]
else:
G_per.add_edge(link[0], link[1], weight = 1, movies = [link[2]], avRating = ratingDict[link[2]], avYear = movieAgeDict[link[2]])
#take the giant component
G_per=max(nx.connected_component_subgraphs(G_per), key=len)
print "There are "+str(G_per.number_of_nodes()) +" nodes(actors) and "+ str(G_per.number_of_edges())+ " links(movie collaboration) in "+str(start)+'-'+str(end)+" period."
return G_per
###################################
# Subdivide the network by period
###################################
graphByPeriod = {}
for i in range(len(periods)):
if i <4:
gph = graphPeriod(periods[i],periods[i+1])
if savingData:
nx.write_gpickle(gph, 'obj/graph_'+str(periods[i+1])+'.gpickle') # save the period graph
graphByPeriod[str(periods[i+1])]=gph
else:
gph = graphPeriod(periods[i],2020)
graphByPeriod["now"]=gph
if savingData:
nx.write_gpickle(gph, 'obj/graph_now.gpickle') # save the period graph
# Graph Titles
titles = {}
titles["1970"] = "1900-1970"
titles["1980"] = "1970-1980"
titles["1990"] = "1980-1990"
titles["2000"] = "1990-2000"
titles["now"] = "2000+"
We store the graphs and the corresponding data in dictionaries, which makes them easy to access afterwards.
As before for the full graph, there are two options: either compute the positions directly with the force atlas algorithm, or read them from the shortcut files.
################################
# Graph per period positionning
################################
positionsPeriod = {}
if not fastExecution:
for key in graphByPeriod:
p = forceatlas2.forceatlas2_networkx_layout(graphByPeriod[key], pos=None, iterations=3000)
positionsPeriod[key] = p
if savingData:
#Build a shortcut to speed up and not re-run the algorithm
#Store the positions in a text file
path = "DATA/forceAtlasPositions_"+key+".txt"
files = io.open(path, mode="w", encoding="utf-8")
for key in p.keys():
row = key +"\t" + str(p[key][0]) +"\t"+str(p[key][1])+"\r\n"
files.write(row.replace("u'","").replace("'",""))
files.close()
else:
#Get the Dictionnary
for key in graphByPeriod:
posit={}
path = "DATA/forceAtlasPositions_"+key+".txt"
files = io.open(path, mode="r", encoding="utf-8")
for row in files:
split = row.split("\t")
posit[split[0]] = (float(split[1]),float(split[2]))
files.close()
positionsPeriod[key] = posit
We build a draw function to automate the drawing of the period graphs. As input it receives the graph to consider, the title of the plot and the positions of the nodes given by the force atlas algorithm. <br> Sizes are computed the same way as for the full graph before.
################
# DRAW FUNCTION
################
#Auxiliary function
def getSecond(a):
return a[1]
def draw(graph,ttl,posi):
colors = getColors(graph) # Build the color and size arrays
#SIZE
# Get the actor/actress with the biggest number of collaborations
sizes = {}
os = []
sizeMax =0
for iD in graph.nodes():
edges = list(graph.edges(iD))
ogStrength = 0
for e in edges:
ogStrength += graph.get_edge_data(*e)["weight"]
sizes[iD] = ogStrength
os.append((graph.nodes[iD]["Name"],ogStrength))
if ogStrength > sizeMax:
sizeMax = ogStrength
#LABEL
# Build a label dictionnary with the name of the members to highlight
SortedNames = np.asarray(sorted(os, key=getSecond,reverse = True))[:,0]
SortedNames = SortedNames[:10]
labels = {}
for n in graph.nodes():
name = actorNameDict[n]
if name in SortedNames:
labels[n]="\n\n"+name
else:
labels[n]=""
#POSITIONNING
positions = posi
alpha =25/sizeMax
fig = plt.figure(figsize=(30, 30))
nx.draw_networkx(graph, positions,
node_size = [(alpha*sizes[n])**(2) for n in graph.nodes()], node_color = [colors[n] for n in graph.nodes()],
with_labels=True,
width = 0.1,edge_color='#999999',labels = labels,font_size = 17, font_weight= "bold")
plt.axis('off')
plt.title("Actors Graph Period "+ttl, fontdict = title_font )
base = 'Figures/actorGraph_'+ttl
if savingFigures:
plt.savefig(base+'.jpeg', bbox_inches='tight')
plt.savefig(base+'.png', bbox_inches='tight')
plt.savefig(base+'.svg', bbox_inches='tight')
plt.show()
####################
# Draw the networks
####################
for key in graphByPeriod.keys():
draw(graphByPeriod[key],titles[key], positionsPeriod[key])