Alright! I'm gonna try to get some info about my Spotify playlists.
If you don't want to see all the beautiful plots I made, you can skip straight to the Conclusions section.
%matplotlib inline
import os
import my_spotify_credentials as credentials
import numpy as np
import pandas as pd
import ujson
import spotipy
import spotipy.util
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.charts import Histogram, Scatter, Donut, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh import palettes
Please note that I had to configure my Spotify Dev account credentials (https://spotipy.readthedocs.io/en/latest/#authorization-code-flow) to be able to make some of the following requests.
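For reference, my_spotify_credentials is just a small local module holding the three values from the Spotify developer dashboard; the values below are placeholders, of course, and the file should stay out of version control:
# my_spotify_credentials.py (placeholder values, not my real credentials)
SPOTIPY_CLIENT_ID = 'your-client-id'
SPOTIPY_CLIENT_SECRET = 'your-client-secret'
SPOTIPY_REDIRECT_URI = 'http://localhost:8888/callback'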
Next I set up the scope (https://developer.spotify.com/web-api/using-scopes/) and the username, and then request the songs in my library, keeping only some fields (I'm only gonna work with the following info: song name, artists, song duration, the date I added the song to my library and its popularity).
os.environ["SPOTIPY_CLIENT_ID"] = credentials.SPOTIPY_CLIENT_ID
os.environ["SPOTIPY_CLIENT_SECRET"] = credentials.SPOTIPY_CLIENT_SECRET
os.environ["SPOTIPY_REDIRECT_URI"] = credentials.SPOTIPY_REDIRECT_URI
scope = 'user-library-read'
username = 'jose.vicente'
token = spotipy.util.prompt_for_user_token(username, scope)
if token:
    spotipy_obj = spotipy.Spotify(auth=token)
    saved_tracks_resp = spotipy_obj.current_user_saved_tracks(limit=50)
else:
    print('Couldn\'t get token for that username')
number_of_tracks = saved_tracks_resp['total']
print('%d tracks' % number_of_tracks)
def save_only_some_fields(track_response):
    return {
        'id': str(track_response['track']['id']),
        'name': str(track_response['track']['name']),
        'artists': [artist['name'] for artist in track_response['track']['artists']],
        'duration_ms': track_response['track']['duration_ms'],
        'popularity': track_response['track']['popularity'],
        'added_at': track_response['added_at']
    }
tracks = [save_only_some_fields(track) for track in saved_tracks_resp['items']]
while saved_tracks_resp['next']:
    saved_tracks_resp = spotipy_obj.next(saved_tracks_resp)
    tracks.extend([save_only_some_fields(track) for track in saved_tracks_resp['items']])
Let's reshape the collected data so it's easier to work with.
tracks_df = pd.DataFrame(tracks)
pd.set_option('display.max_rows', len(tracks))
If a track has more than one artist, I only keep the first one. I'm also gonna convert the track length to seconds.
#pd.reset_option('display.max_rows')
tracks_df['artists'] = tracks_df['artists'].apply(lambda artists: artists[0])
tracks_df['duration_ms'] = tracks_df['duration_ms'].apply(lambda duration: duration/1000)
tracks_df = tracks_df.rename(columns = {'duration_ms':'duration_s'})
Let's make some plots, but first, let's explain (copy-pasted from the Spotify API docs) some concepts.
Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
audio_features = {}
for idd in tracks_df['id'].tolist():
    audio_features[idd] = spotipy_obj.audio_features(idd)[0]
tracks_df['acousticness'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['acousticness'])
tracks_df['speechiness'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['speechiness'])
tracks_df['key'] = tracks_df['id'].apply(lambda idd: str(audio_features[idd]['key']))
tracks_df['liveness'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['liveness'])
tracks_df['instrumentalness'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['instrumentalness'])
tracks_df['energy'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['energy'])
tracks_df['tempo'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['tempo'])
tracks_df['time_signature'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['time_signature'])
tracks_df['loudness'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['loudness'])
tracks_df['danceability'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['danceability'])
tracks_df['valence'] = tracks_df['id'].apply(lambda idd: audio_features[idd]['valence'])
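As an aside, fetching features one ID at a time means one HTTP request per track. spotipy's audio_features also accepts a list of IDs (the Web API allows up to 100 per call), so a batched variant of the loop above would be noticeably faster. A rough sketch:
# Batched variant (sketch): one request per 100 tracks instead of one per track
ids = tracks_df['id'].tolist()
audio_features = {}
for i in range(0, len(ids), 100):
    for features in spotipy_obj.audio_features(ids[i:i + 100]):
        if features:  # the API returns None for tracks it has no features for
            audio_features[features['id']] = features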
output_notebook()
show(Histogram(tracks_df['popularity'], title='Tracks popularity', bins=25, density=False, plot_width=800))
show(Histogram(tracks_df[tracks_df['duration_s'] < 700]['duration_s'],
title='Tracks length (Tubular Bells removed)', density=False, plot_width=800))
show(Scatter(tracks_df, x='valence', y='danceability',
title='danceability vs valence', color='navy', plot_width=800))
show(Scatter(tracks_df, x='energy', y='loudness',
title='loudness vs energy', color='navy', plot_width=800))
show(Scatter(tracks_df, x='energy', y='valence',
title='valence vs energy', color='navy', plot_width=800))
# plt.figure(figsize=(15, 10))
# sns.pairplot(tracks_df)
# plt.show()
plt.figure(figsize=(15, 10))
corr = tracks_df.corr()
sns.heatmap(corr, annot=True).set_title('Pearson correlation matrix')
plt.show()
artists_songs_df = tracks_df['artists'].value_counts()[:15]
p = Donut(artists_songs_df, plot_width=850, plot_height=800,
color=palettes.RdBu9, title='Number of tracks by artist')
show(p)
Number of tracks by artist
tracks_df['artists'].value_counts()[:40]
Some stats about my songs
first_describe = tracks_df.describe()
first_describe.loc[['mean','std','50%','min','max'],:]
print('''
The median popularity of my songs is %.2f and their median length is %.2f minutes.
The longest track lasts %.2f minutes and the shortest one lasts %.2f minutes.
''' % (first_describe['popularity']['50%'], first_describe['duration_s']['50%']/60,
       first_describe['duration_s']['max']/60, first_describe['duration_s']['min']/60))
The following cells show the longest song and the shortest song:
tracks_df.iloc[ tracks_df['duration_s'].idxmax() ][['artists','name']]
tracks_df.iloc[ tracks_df['duration_s'].idxmin() ][['artists','name']]
Popularity ranking (songs' popularity varies over time, so this ranking may be different every time this notebook is executed).
tracks_df[['added_at','name', 'artists', 'popularity']].sort_values('popularity', ascending=False)[:40]
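Since the popularity values drift, one way to make this ranking reproducible is to persist a snapshot of the dataframe and reload it later instead of hitting the API again (the filename is just an example):
# Save a snapshot so the ranking can be reproduced later (example filename)
tracks_df.to_csv('saved_tracks_snapshot.csv', index=False)
# ...and on a later run: tracks_df = pd.read_csv('saved_tracks_snapshot.csv')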
Some boxplots to get a better picture of my songs' popularity and duration.
sns.set_context('notebook', font_scale=1.5)
sns.set_style('whitegrid')
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['popularity']).set_title('Popularity boxplot')
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['duration_s']).set_title('Duration boxplot')
plt.show()
def plot_time_series(col_name, title, rolling_window_days):
    # A real datetime index makes the rolling window time-based ('30D' = 30 days)
    # rather than a fixed number of rows
    daily_series = pd.Series(data=np.array(tracks_df[col_name]),
                             name=col_name,
                             index=pd.to_datetime(tracks_df['added_at'])).sort_index()
    (daily_series.rolling(window='%dD' % rolling_window_days)
                 .mean()
                 .plot(figsize=(30, 10))
                 .set(xlabel='date (by day)', ylabel=col_name, title=title))
    plt.show()
plot_time_series('popularity', 'Popularity over time (window = 30 days)', 30)
plot_time_series('duration_s', 'Duration (s) over time (window = 30 days)', 30)
plot_time_series('danceability', 'Danceability over time (window = 30 days)', 30)
plot_time_series('valence', 'Valence over time (window = 30 days)', 30)
plot_time_series('energy', 'Energy over time (window = 30 days)', 30)
plot_time_series('tempo', 'Tempo over time (window = 30 days)', 30)
aux = tracks_df.copy()
# added_at is an ISO 8601 string (e.g. '2017-01-15T10:23:45Z'); let pandas parse it
aux['added_at'] = pd.to_datetime(aux['added_at'])
# Bin the saved tracks into 30-day buckets and count distinct track ids per bucket
songs_added_in_window = (aux.groupby([pd.Grouper(freq='30D', key='added_at'), 'id'])
                            .size().reset_index(name='count')
                            .groupby('added_at').count()['count'])
(songs_added_in_window
    .plot(figsize=(20, 10))
    .set(xlabel='date', ylabel='count', title='New songs added over time (30-day bins)'))
plt.show()
def get_genres(artist, spotipy_obj):
    # Take the genres of the first artist the search returns (if any)
    response = spotipy_obj.search(q='artist:' + artist, type='artist')['artists']['items']
    return response[0]['genres'] if response else []

tracks_df['genres'] = tracks_df['artists'].apply(lambda artist: get_genres(artist, spotipy_obj))
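One search request per track adds up, and many of my tracks share an artist, so a tiny cache makes each distinct artist hit the API only once. A sketch of an alternative to the apply above, reusing the same get_genres:
from functools import lru_cache

@lru_cache(maxsize=None)
def get_genres_cached(artist):
    # lru_cache needs a hashable return value, hence the tuple
    return tuple(get_genres(artist, spotipy_obj))

tracks_df['genres'] = tracks_df['artists'].apply(lambda artist: list(get_genres_cached(artist)))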
genres = []
for gnrs in tracks_df['genres'].tolist():
    genres.extend(gnrs)
genres_df = pd.DataFrame(genres, columns=['genre'])
Genre ranking
genres_df['genre'].value_counts()[:40]
Some histograms of the audio features retrieved above
show(Histogram(tracks_df['valence'], title='Tracks valence (1 = happy, 0 = sad)', bins=50, density=False, plot_width=800))
show(Histogram(tracks_df['danceability'], title='Danceability', bins=50, density=False, plot_width=800))
show(Histogram(tracks_df['loudness'], title='Loudness', bins=50, density=False, plot_width=800))
show(Histogram(tracks_df['tempo'], title='Tempo', bins=50, density=False, plot_width=800))
show(Histogram(tracks_df['energy'], title='Energy', bins=50, density=False, plot_width=800))
pitch_classes = ['C/Do', 'C#/Do sost.', 'D/Re', 'D#/Re sost.', 'E/Mi', 'F/Fa', 'F#/Fa sost.', 'G/Sol', 'G#/Sol sost.', 'A/La', 'A#/La sost.', 'B/Si']
# Spotify encodes key as a pitch-class integer (0 = C, 1 = C#, ..., 11 = B; -1 if undetected)
tracks_df['key'].replace([str(i) for i in list(range(0, 12))], pitch_classes, inplace=True)
sns.set_context('notebook', font_scale=1.5)
sns.set_style('whitegrid')
print('https://en.wikipedia.org/wiki/Pitch_class#Other_ways_to_label_pitch_classes')
plt.figure(figsize=(20, 10))
(sns.countplot(x=tracks_df['key'], order=pitch_classes)
    .set_title('Keys'))
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['valence']).set_title('Tracks valence (1 = happy, 0 = sad)')
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['danceability']).set_title('Danceability')
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['loudness']).set_title('Loudness')
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['tempo']).set_title('Tempo')
plt.show()
plt.figure(figsize=(15, 10))
sns.boxplot(x=tracks_df['energy']).set_title('Energy')
plt.show()
second_describe = tracks_df[['energy', 'tempo', 'loudness', 'danceability', 'valence']].describe()
second_describe.loc[['mean','std','50%','min','max'],:]
Songs with most and least energy:
print(tracks_df.iloc[ tracks_df['energy'].idxmax() ][['artists','name', 'energy']])
print()
print(tracks_df.iloc[ tracks_df['energy'].idxmin() ][['artists','name', 'energy']])
Songs with the highest and lowest valence:
print(tracks_df.iloc[ tracks_df['valence'].idxmax() ][['artists','name', 'valence']])
print()
print(tracks_df.iloc[ tracks_df['valence'].idxmin() ][['artists','name', 'valence']])
The songs with the highest and lowest tempo:
print(tracks_df.iloc[ tracks_df['tempo'].idxmax() ][['artists','name', 'tempo']])
print()
print(tracks_df.iloc[ tracks_df['tempo'].idxmin() ][['artists','name', 'tempo']])
The most and the least danceable songs:
print(tracks_df.iloc[ tracks_df['danceability'].idxmax() ][['artists','name', 'danceability']])
print()
print(tracks_df.iloc[ tracks_df['danceability'].idxmin() ][['artists','name', 'danceability']])
Here are the loudest and the least loud songs:
print(tracks_df.iloc[ tracks_df['loudness'].idxmax() ][['artists','name', 'loudness']])
print()
print(tracks_df.iloc[ tracks_df['loudness'].idxmin() ][['artists','name', 'loudness']])
Songs that match or are close to the energy median value:
def sorted_diffed_values(constant, df, field, n_first):
    # Sort rows by absolute distance from `constant` and keep the n_first closest
    return df.iloc[(df[field] - constant).abs().argsort()][['artists', 'name', field]][:n_first]
sorted_diffed_values(round(second_describe['energy']['50%'], 3), tracks_df, 'energy', 5)
Songs that match or are close to the tempo median value:
sorted_diffed_values(second_describe['tempo']['50%'], tracks_df, 'tempo', 5)
Songs that match or are close to the danceability median value:
sorted_diffed_values(round(second_describe['danceability']['50%'], 2), tracks_df, 'danceability', 5)
Songs that match or are close to the valence median value:
sorted_diffed_values(round(second_describe['valence']['50%'], 2), tracks_df, 'valence', 5)
Songs that match or are close to the loudness median value:
sorted_diffed_values(round(second_describe['loudness']['50%'], 2), tracks_df, 'loudness', 5)
According to the median values, I like energetic and loud music (note the correlation between these two measures earlier in this notebook) with a tempo above 120 BPM.
The music I like is neither happy nor sad, and it's not too danceable.
The median popularity is 56 out of 100, maybe because I listen to both old songs (low popularity) and trending songs (high popularity).
Apparently, my favourite pitch class is C (Do).
Oh, clearly my favourite music genre is rock and I love Oasis ;)
Next steps?