Song2Vec: Music Recommender


The behavior of musicophiles has changed along with the evolvement of the music industry in the past decades. Previously we conservatively bought music on a compact disc, but now music streaming services are more preferable; such as Amazon Music,
Apple Music, Google Play Music, Pandora, Spotify, Youtube Music, to name a few. This is because of the convenience offered by these platforms so that users are able to search their favorite songs right away without having to bother going to the music store physically.

Users may not have enough time to scan through all available songs and manually create a playlist. Instead, a recommender system is constructed which eases them to find relevant songs quickly. One example you might seen before is the “Made For You” feature from Spotify:

These personalized playlists are being recommended by grouping similar songs that go well together. How? In the real case, this process is done by combining several recommender algorithms, simply based on users’ activities such as likes, playlist history, or even listening history. In this article, we will demonstrate how to extract song embeddings using a neural network approach specifically Word2Vec model, use it to generate songs recommendation, and evaluate the performance.

Okay, so you might wonder what is Word2Vec actually? Developed by Tomas Mikolov 1 in 2013 at Google, it is one of the most common technique to do word embeddings in several Natural Language Processing (NLP) cases using shallow neural network. Word embeddings is just a fancy way of saying a numerical representation of words. A good analogy would be how colors are represented with a RGB values. These set of values is then called as a vector. For example, “black” can be associated with (0,0,0) and “white” with (255,255,255) as their pixel intensity values.

In fact, word embeddings method can be generalized into other item embeddings, which associate any product on an e-commerce website, any videos on Youtube, movies on Netflix with a vector. Of course, in this case, songs can also be a vector.

Can you guess what is the property of a sentence that Word2Vec exploits to learn the vector representation of a word? It is the sequential nature of the text. Take a look of the following scrambled sentence:

gives Spotify millions you access music service to digital a that is of songs.

It is difficult for us to understand the text because there is no sequence in the sentence. That’s why the sequence of words is crucial in any natural language. This property can be implemented to other data that has sequential nature as well. One such data that has the property is playlist of songs in music streaming services. The following image is an example of playlists in Spotify, where each playlist contain a sequence of songs:

Since the data cleansing and modelling process will be quite complicated, here I present the visualization for you to understand the overall workflow for this article:

Import Libraries

Before going any further, let’s import necessary libraries such as:

  • pandas for data analysis
  • numpy for scientific computing
  • matplotlib and seaborn for data visualization
  • gensim for topic modelling, in this case Word2vec
  • sklearn and spherecluster for other unsupervised learning algorithm
# Data Analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns'seaborn')

# Modelling
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
from spherecluster import SphericalKMeans
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from scipy import stats

# Additional
import math
import random
import itertools
import multiprocessing
from tqdm import tqdm
from time import time
import logging
import pickle

Data Cleansing

Human-made music playlists collected by Shuo Chen 2 from Cornell University are used to learn the song embeddings. The dataset contains US radio playlists from and songs tag from since December 2010 to May 2011. Each playlist will be treated as a sentence and each song in the playlist will be treated as one word.

The raw data consists of five separate txt files as follow:

  1. song_hash.txt: mapping from integer song_id to song’s title and artist name
  2. tags.txt: social tags, using integer song_id to represent a song
  3. tag_hash.txt: mapping from integer id to tag’s name
  4. train.txt and test.txt: playlists using integer song_id to represent a song


Each song has its own song_id which maps to exactly one title and artist name. There are 75252 unique songs present in song_hash.txt.

songs = pd.read_csv(FOLDER_PATH+"song_hash.txt", sep = '\t', header = None,
                    names = ['song_id', 'title', 'artist'], index_col = 0)
songs['artist - title'] = songs['artist'] + " - " + songs['title']
#>                                                      title  ...                                     artist - title
#> song_id                                                     ...                                                   
#> 0                             Gucci Time (w\/ Swizz Beatz)  ...          Gucci Mane - Gucci Time (w\/ Swizz Beatz)
#> 1        Aston Martin Music (w\/ Drake & Chrisette Mich...  ...  Rick Ross - Aston Martin Music (w\/ Drake & Ch...
#> 2                            Get Back Up (w\/ Chris Brown)  ...               T.I. - Get Back Up (w\/ Chris Brown)
#> 3                       Hot Toddy (w\/ Jay-Z & Ester Dean)  ...         Usher - Hot Toddy (w\/ Jay-Z & Ester Dean)
#> 4                                             Whip My Hair  ...                              Willow - Whip My Hair
#> ...                                                    ...  ...                                                ...
#> 75257                               Dearest (I'm So Sorry)  ...         Picture Me Broken - Dearest (I'm So Sorry)
#> 75258                                            USA Today  ...                           Alan Jackson - USA Today
#> 75259                                            Superstar  ...                              Raul Malo - Superstar
#> 75260                                  Romancin' The Blues  ...                Giacomo Gates - Romancin' The Blues
#> 75261                                         Inner Change  ...                     The Jazzmasters - Inner Change
#> [75262 rows x 3 columns]


Each song has several tags that exist in tags.txt and the mapping is provided in tag_hash.txt.

def readTXT(filename, start_line=0, sep=None):
    with open(FOLDER_PATH+filename) as file:
        return [line.rstrip().split(sep) for line in file.readlines()[start_line:]]
tags = readTXT("tags.txt")
#> [['49', '65', '72', '141', '197'], ['11', '35', '154'], ['#'], ['#'], ['#']]

If a song does not have any tag, it is indicated with just a ‘#’ as seen above. Replace it with the string “unknown” instead.

mapping_tags = dict(readTXT("tag_hash.txt", sep = ', '))
mapping_tags['#'] = "unknown"

The song_tags dataframe is combined and merged with previous songs.

song_tags = pd.DataFrame({'tag_names': [list(map(lambda x: mapping_tags.get(x), t)) for t in tags]}) = 'song_id'
songs = pd.merge(left = songs, right = song_tags, how = 'left',
                 left_index = True, right_index = True)
songs.index = songs.index.astype('str')
#>                                                      title  ...                                          tag_names
#> song_id                                                     ...                                                   
#> 0                             Gucci Time (w\/ Swizz Beatz)  ...                                          [wjlb-fm]
#> 1        Aston Martin Music (w\/ Drake & Chrisette Mich...  ...  [chill, rnb, loved, hip hop, rap, soft, wjlb-f...
#> 2                            Get Back Up (w\/ Chris Brown)  ...                                          [wjlb-fm]
#> 3                       Hot Toddy (w\/ Jay-Z & Ester Dean)  ...                                     [pop, hip-hop]
#> 4                                             Whip My Hair  ...  [pop, american, dance, rnb, hip-hop, hip hop, ...
#> [5 rows x 4 columns]

Unknown Songs

Unknown songs are defined as a song that doesn’t have either artist or title, indicated by dash (-) character. Remove these unknown songs from songs.

unknown_songs = songs[(songs['artist'] == '-') | (songs['title'] == '-')]
songs.drop(unknown_songs.index, inplace = True)


playlist is a list of lists of songs (represented with its song_id) from train.txt and test.txt. There are 15910 playlists that exist in the data.

playlist = readTXT("train.txt", start_line = 2) + readTXT("test.txt", start_line = 2)
print(f"Playlist Count: {len(playlist)}")
#> Playlist Count: 15910

Take a look at how the playlist is represented in a list. Recall that these playlists are treated as sentences and the song_id as a token of words.

for i in range(0, 2):
    print(f"Playlist Idx. {i}: {len(playlist[i])} Songs")
#> -------------------------
#> Playlist Idx. 0: 97 Songs
#> -------------------------
#> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43']
#> -------------------------
#> Playlist Idx. 1: 205 Songs
#> -------------------------
#> ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117', '118', '119', '120', '121', '122', '123', '50', '70', '71', '124', '17', '85', '14', '82', '48', '125', '47', '46', '72', '53', '25', '73', '4', '126', '59', '74', '20', '43', '127', '128', '129', '13', '82', '48', '130', '131', '132', '133', '134', '135', '136', '137', '59', '46', '138', '43', '20', '139', '140', '73', '57', '70', '141', '3', '1', '74', '142', '143', '144', '145', '48', '13', '25', '146', '50', '147', '126', '59', '20', '148', '149', '150', '151', '152', '56', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '60', '176', '51', '177', '178', '179', '180', '181', '182', '183', '184', '185', '57', '186', '187', '188', '189', '190', '191', '46', '192', '193', '194', '195', '196', '197', '198', '25', '199', '200', '49', '201', '100', '202', '203', '204', '205', '206', '207', '32', '208', '209', '210']

Remove unknown songs from the playlist.

playlist_wo_unknown = [[song_id for song_id in p if song_id not in unknown_songs.index]
                       for p in playlist]

Remove playlist with zero or one song, since the model wouldn’t capture any sequence in that list.

clean_playlist = [p for p in playlist_wo_unknown if len(p) > 1]
print(f"Playlist Count After Cleansing: {len(clean_playlist)}")
#> Playlist Count After Cleansing: 15842

Remove song that doesn’t exist in any playlist.

unique_songs = set(itertools.chain.from_iterable(clean_playlist))
song_id_not_exist = set(songs.index) - unique_songs
songs.drop(song_id_not_exist, inplace = True)
print(f"Unique Songs After Cleansing: {songs.shape[0]}")
#> Unique Songs After Cleansing: 73448

Before there are 75262 unique songs and 15910 playlists. Now we are ready with 73448 unique songs and 15842 playlists.


The playlist is splitted into playlist_train and playlist_test with test size of 1000 playlist for further evaluation.

playlist_train, playlist_test = train_test_split(clean_playlist, test_size = 1000,
                                                 shuffle = True, random_state = 123)


As mentioned before, Word2Vec is one of the most popular techniques to learn word embeddings using shallow neural network. A neural network, like other supervised learning algorithms, requires labeled data to be trained. How can we train a neural network if the data is in a form of sequences of words (i.e. words) or sequences of songs (i.e. playlist) without any target or data label? The network will be trained by creating a so-called “fake” task. We won’t be interested in the inputs and outputs of the network, rather just train the weights between input and hidden layer that are extracted as the vectors. To put it in simple terms, the goal of embeddings can be classified as unsupervised learning, but the process of getting the embeddings in Word2Vec is achieved by supervised learning through a neural network.

Here is the illustration of general Word2Vec architecture by Xin Rong 3:


  • The input layer is a one-hot-encoded vector of size \(V\) (vocabulary size).
  • \(W_{V \times N}\) is the weight matrix that projects the input \(x\) to the hidden layer. These values are the embedded vectors.
  • The hidden layer contains \(N\) neurons (hyperparameter), it just copies the weighted sum of inputs to the next layer without any activation function.
  • \(W’_{N \times V}\) is the weight matrix that maps the hidden layer outputs to the final output layer.
  • The output layer is again a \(V\) length vector, with a softmax activation function.

There are two approaches of Word2Vec in which both are using the same architecture:

  • Skip-gram – the fake task would be: given a target word, the model is trying to predict the context words.
  • Continuous Bag-Of-Words (CBOW) – the fake task would be: predict the target word by the context words.

In this article, CBOW is used instead of Skip-gram, because according to Google Code Archive 4, it trains faster and able to capture the frequent songs more.

Target song that are played between context songs is assumed to be similar to each other. If the playlist are designed by users or the services for certain genre, the song embeddings will logically incorporate more information about the genre.

One epoch of CBOW may be breakdown into these steps 5:

  1. Convert the generated training samples into one-hot vectors \(x_1, x_2, …, x_C\) (contexts) for the input layer. So, the size is \(C \times V\)
  2. Multiply all vector \(x\) with \(W_{V \times N}\) and then take the sum or mean of embedded vectors.
  3. The hidden layer is then multiplied with \(W’_{N \times V}\) to get the weighted sum of size \(V\).
  4. Apply softmax function to turn the weighted sum into probabilities, usually denoted by \(\hat{y}\).
  5. Error between output and each context word is calculated as follows: \({(\hat{y} – y)}\)
  6. Backpropagate to re-adjust the weights, by using Gradient Descent optimizer.
    1. All weights in output matrix will be updated.
    2. Only corresponding word vector in the input matrix that will be updated.

Up until this point, you should have understood the general overview of how the Word2Vec works. But, there is an issue with the softmax function — it is computationally very expensive, as it requires scanning through the entire output embeddings matrix to compute the probability distribution of all V words, where V can be millions or more. The softmax function is defined as:

\(softmax(y_i) = \dfrac{e^{y_i}}{\sum \limits_{y=1}^V e^{y_j}}\)

The normalization factor in the denominator also requires \(V\) iterations. When implemented in codes, the normalization factor is computed only once and cached as a Python variable, making the algorithm complexity \(O(V)\).

Due to this computational inefficiency, softmax function is preferably not used in most implementations of Word2Vec. Instead let’s use an alternative called negative sampling with sigmoid function, which rephrases the problem into a set of independent binary logistic classification task of algorithm complexity = \(O(K+1)\), where \(K\) is the number of negative samples and \(1\) is the positive sample. Mikolov suggests using \(K\) in the range \([5, 20]\) for small vocabulary and \([2, 5]\) for a larger vocabulary.

Don’t worry about the code below. We are setting up the logging settings to monitor the training process.

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

class Callback(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.training_loss = []

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.epoch == 1:
            current_loss = loss
            current_loss = loss - self.loss_previous_step
        print(f"Loss after epoch {self.epoch}: {current_loss}")
        self.epoch += 1
        self.loss_previous_step = loss


By using gensim, the training process can be separated into three distinctive steps 6:

First, the instance of Word2Vec() is created to set up the parameters of the model and leave the model uninitialized.

  • size: dimensionality of the song vectors
  • window: maximum distance between context and target
  • min_count: frequency cut-off for a song to be considered in the model
  • sg = 0: using CBOW architecture
  • negative: negative sampling data
  • workers: number of CPU used to train the model
model = Word2Vec(
    size = 256,
    window = 10,
    min_count = 1,
    sg = 0,
    negative = 20,
    workers = multiprocessing.cpu_count()-1)
#> Word2Vec(vocab=0, size=256, alpha=0.025)

Secondly, the method .build_vocab() is called to build the vocabulary from a sequence of playlists and thus initialized the model.

logging.disable(logging.NOTSET) # enable logging
t = time()

#> 2020-06-29 14:35:46,001 : INFO : collecting all words and their counts
#> 2020-06-29 14:35:46,001 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
#> 2020-06-29 14:35:46,211 : INFO : PROGRESS: at sentence #10000, processed 1805894 words, keeping 63337 word types
#> 2020-06-29 14:35:46,338 : INFO : collected 72047 word types from a corpus of 2670082 raw words and 14842 sentences
#> 2020-06-29 14:35:46,338 : INFO : Loading a fresh vocabulary
#> 2020-06-29 14:35:46,562 : INFO : effective_min_count=1 retains 72047 unique words (100% of original 72047, drops 0)
#> 2020-06-29 14:35:46,562 : INFO : effective_min_count=1 leaves 2670082 word corpus (100% of original 2670082, drops 0)
#> 2020-06-29 14:35:46,726 : INFO : deleting the raw counts dictionary of 72047 items
#> 2020-06-29 14:35:46,728 : INFO : sample=0.001 downsamples 3 most-common words
#> 2020-06-29 14:35:46,728 : INFO : downsampling leaves estimated 2667923 word corpus (99.9% of prior 2670082)
#> 2020-06-29 14:35:46,872 : INFO : estimated required memory for 72047 words and 256 dimensions: 183575756 bytes
#> 2020-06-29 14:35:46,873 : INFO : resetting layer weights

Finally, .train() trains the model. The loggings here are mainly useful for monitoring the loss after each epoch.

  • total_examples: count of unique vocabulary (songs)
  • epochs: number of iterations over the dataset (whole playlist)
  • compute_loss: track model loss
logging.disable(logging.INFO) # disable logging
callback = Callback() # instead, print out loss for each epoch
t = time()

            total_examples = model.corpus_count,
            epochs = 100,
            compute_loss = True,
            callbacks = [callback])"song2vec.model")
#> Word2Vec(vocab=72047, size=256, alpha=0.025)

Loss Evaluation

Plot the training loss, making sure it decreases after each epoch. The closer the loss to a zero value, the better the model is in predicting a target song given surrounding context songs. Thus, the produced song vectors are more meaningful.

Vectors Visualization

The song vectors can be visualized using a gradient of colors. The model is trained using 256 dimensions, therefore there will be 256 color bars for each song, representing element values in the vector. The similarity between songs is calculated using cosine similarity:

\(similarity(A,B) = cos(\theta) = \frac{A.B}{\|A\| \|B\|}\)

Mathematically it measures the cosine of the angle between two vectors \(A\) and \(B\) which projected in a multi-dimensional space. Song vectors with similar context occupy close spatial positions; the cosine between such vectors should be close to 1, i.e. angle is closer to 0. The smaller the angle, the cosine similarity will be higher.

The plot above shows five most similar songs to song_id = '4162' (Maroon 5 – She Will Be Loved). Up until now, the model can be used for recommending new songs using cosine similarity, but only based on one main song.


What can we do with the song vectors? One thing is to group them into several clusters using K-Means clustering, but keep in mind that the similarity between vectors is calculated using cosine distance instead of regular (Euclidean) distance. Therefore K-Means with cosine distance should be considered, which often called Spherical K-Means Clustering. The idea is to identify the centroid such that it uniforms and minimizes the angle between each vector in a cluster. The intuition is just like looking at a cluster of stars where each point should have consistent spacing between each other. This spacing is referred to as the cosine similarity.

Spherical K-Means

Take a look at the picture below for the illustration, and here are the steps:

  1. Generate random 2D vectors ranging from (0,0) to (1,1).
  2. Project each vector onto a unit circle, so that the vectors are normalized (length is equal to one).
  3. From the projected vectors, perform basic k-means clustering into k clusters such that the vector within the same cluster are as similar as possible while the vector from different clusters is as dissimilar as possible.
  4. Assign cluster number for each vector.

That being said, let’s perform Spherical K-Means on the song vectors by iterating the number of clusters from 10 to 500 so that the optimal number of clusters k_opt can be chosen by the elbow method.

embedding_matrix = model.wv[model.wv.vocab.keys()]
#> (72047, 256)

How to locate the optimal number of clusters objectively? Here is the idea 7:

  1. Connect the first and last point of the curve with a straight line
  2. Calculate the perpendicular distance from each point to that line
  3. Consider the longest distance as the elbow
k_opt = locateOptimalElbow(skm_df.index, skm_df['WCSS'].values)
skm_opt = skm_df.loc[k_opt, "skm_object"]
#> SphericalKMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=110,
#>         n_init=5, n_jobs=-1, normalize=True, random_state=123, tol=0.0001,
#>         verbose=0)
songs_cluster = songs.copy()
songs_cluster.loc[model.wv.vocab.keys(), 'cluster'] = skm_opt.labels_
songs_cluster['cluster'] = songs_cluster['cluster'].fillna(-1).astype('int').astype('category')
#>                                                      title  ... cluster
#> song_id                                                     ...        
#> 0                             Gucci Time (w\/ Swizz Beatz)  ...      94
#> 1        Aston Martin Music (w\/ Drake & Chrisette Mich...  ...      94
#> 2                            Get Back Up (w\/ Chris Brown)  ...      33
#> 3                       Hot Toddy (w\/ Jay-Z & Ester Dean)  ...      94
#> 4                                             Whip My Hair  ...      94
#> [5 rows x 5 columns]

In the end, the optimal number of clusters is set to be 110. There is a possibility that some songs don’t have the embedded vectors since the playlist is split to train and test. For this case, assign the cluster as -1 instead.

Visualize Clusters

It is always quite helpful to visualize the embeddings that have been created. Over here, we have song vectors with 256 dimensions. These high-dimensional vectors can’t be visualized in our 3D world, so using dimensionality reduction algorithms such as t-Distributed Stochastic Neighbor Embedding (t-SNE) helps us map the vectors to a lower dimension. The mathematical detail of t-SNE will not be presented here, but in practice, it tends to produce a visualization with distinctly isolated clusters.

embedding_tsne = TSNE(n_components = 2, metric = 'cosine',
                      random_state = 123).fit_transform(embedding_matrix)
save2Pickle(embedding_tsne, "tsne_viz")

The cluster might look cluttered since all 110 clusters are being plotted at once. Instead, let’s just perform t-SNE on randomly selected 10 clusters and visualize the result.

Songs that have similar context (by cosine similarity) tend to be plotted next to each other. Thus, creating distinct song clusters. Note that the clusters might look overlap to each other due to the dimensionality reduction, but in the actual dimension, they do not.

Start Recommending

Congratulations! We are finally ready with the embeddings for every song that exists in playlist_train. How these song vectors are then used to suggest similar songs based on a certain playlist? One way is to calculate a playlist vector for each playlist by averaging together all the song vectors in that playlist. These vectors then become the query to find similar songs based on cosine similarity. Here is an illustration using a users’ music streaming playlist 8:

For each playlist in playlist_test, calculate the average vectors using meanVectors() function. If the song hasn’t been embedded before, neglect the song instead.

def meanVectors(playlist):
    vec = []
    for song_id in playlist:
        except KeyError:
    return np.mean(vec, axis=0)
playlist_vec = list(map(meanVectors, playlist_test))

For each playlist vector, we can recommend top \(n\) similar songs based on the cosine similarity. Let’s test the song embeddings to recommend top 10 songs for playlist_test in index 305.

def similarSongsByVector(vec, n = 10, by_name = True):
    # extract most similar songs for the input vector
    similar_songs = model.wv.similar_by_vector(vec, topn = n)
    # extract name and similarity score of the similar products
    if by_name:
        similar_songs = [(songs.loc[song_id, "artist - title"], sim)
                              for song_id, sim in similar_songs]
    return similar_songs
print_recommended_songs(idx = 305, n = 10)
#> ============================
#> ============================
#> Selena - Como La Flor
#> The Texas Tornados - Who Were You Thinkin' Of
#> Selena - Sentimientos
#> ============================
#> ============================
#> [Similarity: 0.835] Selena - Como La Flor
#> [Similarity: 0.779] Selena - Sentimientos
#> [Similarity: 0.763] Little Joe Y La Familia - Borrachera
#> [Similarity: 0.751] Lorenzo Antonio - Con La Misma Espina
#> [Similarity: 0.745] Tierra Tejana - Eres Casado
#> [Similarity: 0.742] Jennifer Y Los Jetz - Me Piden
#> [Similarity: 0.730] The Texas Tornados - (Hey Baby) Que Paso
#> [Similarity: 0.712] The Texas Tornados - Who Were You Thinkin' Of
#> [Similarity: 0.709] Ruben Vela - La Papaya
#> [Similarity: 0.704] Sparx - Lo Dice Mi Corazon
#> ============================

Interestingly, the model is able to capture and recommend new songs based on the “Spanish” genre from playlist_test indexed at 305 without being explicitly stated. Great! The final step is to evaluate how this recommender performs.


One way to evaluate the performance of a recommender system is by computing hit rate as follows:

  1. For each song in a playlist, intentionally Leave-One-Out (LOO) this song.
  2. By using several systems below, try to guess the LOO song.
  3. Ask the recommender for top \(n\) recommended songs.
  4. If the LOO song appears in the top \(n\) recommendation, consider it as a HIT. Otherwise not.
  5. Repeat the LOO process until the end of the playlist. Then, the hit rate of a playlist is calculated by dividing the number of HIT with the length of a playlist.
  6. Repeat step 1-5 for all playlist in playlist_test and calculate the Average Hit Rate at \(n\) ().
top_n_songs = 25

Random Recommender

As a baseline, let’s try to guess the LOO song randomly without any system.

def hitRateRandom(playlist, n_songs):
    hit = 0
    for i, target in enumerate(playlist):
        recommended_songs = random.sample(list(songs.index), n_songs)
        hit += int(target in recommended_songs)
    return hit/len(playlist)
eval_random = pd.Series([hitRateRandom(p, n_songs = top_n_songs)
                         for p in tqdm(playlist_test, position=0, leave=True)])

Output: 0.00030413731380910425

Song Tags Recommender

It is possible to recommend based on song tags provided on the data as follows:

  1. Create a list of song tag_names that surrounds the LOO song. The maximum distance between the LOO and context songs is defined by window.
  2. List all possible songs from the list.
  3. Take \(n\) songs randomly from the possible songs list.
mapping_tag2song = songs.explode('tag_names').reset_index().groupby('tag_names')['song_id'].apply(list)

def hitRateContextSongTag(playlist, window, n_songs):
    hit = 0
    context_target_list = [( for w in range(idx-window, idx+window+1)
                             if not(w < 0 or w == idx or w >= len(playlist))], target)
                           for idx, target in enumerate(playlist)]
    for i, (context, target) in enumerate(context_target_list):
        context_song_tags = set(songs.loc[context, 'tag_names'].explode().values)
        possible_songs_id = set(mapping_tag2song[context_song_tags].explode().values)
        recommended_songs = random.sample(possible_songs_id, n_songs)
        hit += int(target in recommended_songs)
    return hit/len(playlist)
eval_song_tag = pd.Series([hitRateContextSongTag(p, model.window, n_songs = top_n_songs)
                           for p in tqdm(playlist_test, position=0, leave=True)])

Output: 0.0005425495180688559

Cluster-based Recommender

To improve further, let’s utilize the result of clustering in the modelling section:

  1. Identify which cluster number is the most frequent (by majority voting) in surrounding songs. The maximum distance between the LOO and context songs is defined by window.
  2. List out possible songs from that majority cluster.
  3. Take \(n\) songs randomly from the possible songs list.
def hitRateClustering(playlist, window, n_songs):
    hit = 0
    context_target_list = [( for w in range(idx-window, idx+window+1)
                             if not(w < 0 or w == idx or w >= len(playlist))], target)
                           for idx, target in enumerate(playlist)]
    for context, target in context_target_list:
        cluster_numbers = skm_opt.predict([model.wv[c] for c in context if c in model.wv.vocab.keys()])
        majority_voting = stats.mode(cluster_numbers).mode[0]
        possible_songs_id = list(songs_cluster[songs_cluster['cluster'] == majority_voting].index)
        recommended_songs = random.sample(possible_songs_id, n_songs)
        songs_id = list(zip(*recommended_songs))[0]
        hit += int(target in songs_id)
    return hit/len(playlist)
eval_clust = pd.Series([hitRateClustering(p, model.window, n_songs = top_n_songs)
                           for p in tqdm(playlist_test, position=0, leave=True)])

Output: 0.005054657281168753

Song2Vec Recommender

Lastly, evaluate the CBOW Song2Vec model as follows:

  1. Take the average vectors of surrounding context songs using previously defined meanVectors() function. The maximum distance is defined by window.
  2. Find top \(n\) similar songs based on cosine similarity using similarSongsByVector() function.
def hitRateSong2Vec(playlist, window, n_songs):
    hit = 0
    context_target_list = [( for w in range(idx-window, idx+window+1)
                             if not(w < 0 or w == idx or w >= len(playlist))], target)
                           for idx, target in enumerate(playlist)]
    for context, target in context_target_list:
        context_vector = meanVectors(context)
        recommended_songs = similarSongsByVector(context_vector, n = n_songs, by_name = False)
        songs_id = list(zip(*recommended_songs))[0]
        hit += int(target in songs_id)
    return hit/len(playlist)
eval_song2vec = pd.Series([hitRateSong2Vec(p, model.window, n_songs = top_n_songs)
                           for p in tqdm(playlist_test, position=0, leave=True)])

Output: 0.11958469298590102


Finally, we compare the calculated Average Hit Rate at 25 () of the four recommender systems. The higher the AHR, the better is the system. From the bar plot below, Song2Vec outperforms other methods in terms of hit rate, which means that it can recommend a song well based on surrounding context songs. In a real-life scenario, this system may likely to be low quality since the AHR is only around 12%, but still, it is much better than no recommender system at all.


Song2Vec is an implementation of Word2Vec which able to capture the context of a song based on surrounding songs in a playlist. In this article, we successfully exploit the sequential property of a playlist and represent each song with a 256-dimensional vector. This vector representation is then used as a recommender system based on cosine similarity score. The objective of a music recommender is to create accurate personalized recommendations from historical playlist or listening queue. Therefore, metric such as is used to evaluate how many times (on average) a song is listed on the top-\(n\) recommended songs based on surrounding context songs.

Things to be taken carefully when applying Song2Vec on its own is the cold start problem, a condition where it is impossible to recommend any songs to a new user or even recommend a new song to any users. This can be efficiently handled by combining the recommender using a content-based technique, which utilizes explicit features or characteristics of the songs as demonstrated in the “Song Tags Recommender” section.

Maybe you’re wondering what are other implementations of Word2Vec? Here is the list for you:

  • Product recommendations: Using purchase receipts in a transaction to capture an item embeddings to learn the user’s purchase activity.
  • Listing recommendations: The user activity is in the form of click data, which can be represented as a sequence of listings that a user viewed.
  • Matching advertisement to search query: Data consist of sequential search sessions, including entered query, clicked advertisement, and search results.

The full Jupyter Notebook is available on my Github


Scroll to Top