Have you ever wondered how Netflix recommends movies you have never watched before?

If you’re familiar with machine learning, you already know the answer. Yep, that’s right. The answer is a “Recommendation System”.

A recommendation system, or recommender system, is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item. Recommendation systems aim to tell us which movies to watch (Netflix), which products to buy (Amazon), or which songs to listen to (Spotify), based on our historical data.

# Skincare Recommendation System using Collaborative Filtering (Matrix Factorization)

## Introduction

**Collaborative filtering** is one of the basic models for recommendation systems. It is based on the assumption that people like things similar to other things they like, or things that are liked by other people who have similar taste.

An illustration is given below:

From the illustration above, we know that Kiki (the girl with the black cat) likes to buy apples, bananas, and watermelon, while Satsuki (the girl with the yellow shirt) likes to buy apples and bananas. They have similar taste in apples and bananas, so we can recommend that Satsuki buy watermelon.

In the **collaborative filtering** method, there are two approaches that can be implemented:

**1. Memory-based approach:** create the recommendation system by calculating the closest users or items, using cosine similarity or Pearson correlation coefficients.

**2. Model-based approach:** create the recommendation system by predicting the user’s rating value for unrated items.
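As a minimal sketch of the memory-based approach, here is the Kiki/Satsuki example above with made-up rating numbers (not the real dataset), scoring how similar the two users are with cosine similarity:

```python
import numpy as np

# Toy user-item rating matrix: rows are users, columns are
# apple, banana, watermelon. 0 means "not rated yet".
# These numbers are invented purely for illustration.
ratings = np.array([
    [5.0, 4.0, 5.0],   # Kiki
    [5.0, 4.0, 0.0],   # Satsuki
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Kiki and Satsuki rate apples and bananas alike, so their
# similarity is high and Kiki's watermelon rating can be
# used to recommend watermelon to Satsuki.
sim = cosine_similarity(ratings[0], ratings[1])
print(round(sim, 3))
```

Because the similarity is close to 1, a memory-based recommender would weight Kiki’s ratings heavily when filling in Satsuki’s missing watermelon rating.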

In this notebook, I will create a simple recommender system to recommend skincare products to customers who have never bought them before. I’ll predict the unrated items using Singular Value Decomposition (SVD), a **Matrix Factorization** algorithm. The data comes from scraping the Femaledaily website and contains product reviews given by customers. There are several attributes inside; for more details, let’s check it out!

## Data Preparation

#### Import library

```
import pandas as pd
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
```

#### Read the data

`data = pd.read_csv("data_input/Female Daily Skincare Review Final.csv")`

#### Check and drop missing values

`data.isna().sum()`

```
#> Reviewer 2
#> Recommend 0
#> Stars 0
#> Date 0
#> Product 0
#> Category 0
#> Date Scrape 0
#> Url 0
#> dtype: int64
```

`data = data.dropna()`

#### Check and drop duplicated values

`data.duplicated().sum()`

`#> 8105`

`data.drop_duplicates(keep = "first", inplace = True)`

#### Filter the data

In this section I will filter the data to remove customers who rated a product only once, since those rows carry little information for the recommendation system.

`id_count = pd.crosstab(index=data.Reviewer,columns='count').sort_values(by='count',ascending=True)`

```
name_r = id_count[id_count['count']>1]
name_u = name_r.index.to_list()
data = data[data.Reviewer.isin(name_u)]
data.to_csv('femdaily.csv',index=False,header=True)
```

#### Drop unused columns

```
data = pd.read_csv("femdaily.csv")
data.drop_duplicates(keep = "first", inplace = True)
data.drop(['Recommend','Date','Date Scrape','Url','Category'], axis=1, inplace=True)
data.rename(columns={'Reviewer':'reviewer','Product':'product','Stars':'rating'}, inplace=True)
data = (data[~(data['reviewer'] == ' ')])
```

`data`

```
#> reviewer rating product
#> 0 Ayuika 3 Perfect 3D Gel
#> 1 yustinalupit 4 Perfect 3D Gel
#> 2 evikdanny 3 Perfect 3D Gel
#> 3 daniskhoirun 3 Perfect 3D Gel
#> 4 hulahup19 5 Perfect 3D Gel
#> ... ... ... ...
#> 137295 steph91 4 Ultra Rapid Action Pad
#> 137296 farishaalamsyah 1 Ultra Rapid Action Pad
#> 137297 imeldanababan 4 Ultra Rapid Action Pad
#> 137298 princessvie 3 Ultra Rapid Action Pad
#> 137299 nucky 2 Ultra Rapid Action Pad
#>
#> [137292 rows x 3 columns]
```

## Data Exploration

Since in the next step (modelling) we will define and create a User × Product matrix, we need to understand the number of unique products and unique users.

#### Number of unique products

```
uniq_product = data['product'].nunique()
print("Number of uniq product :",uniq_product)
```

`#> Number of uniq product : 3297`

As shown above, there are 3297 unique products; this will be the number of columns of the matrix in the modelling step.

#### Number of unique users

```
uniq_reviewer = data['reviewer'].nunique()
print("Number of uniq reviewer :",uniq_reviewer)
```

`#> Number of uniq reviewer : 22359`

As shown above, there are 22359 unique users; this will be the number of rows of the matrix in the modelling step.

#### Distribution of ratings given by users

`plt.subplots(figsize = (7,6))`

`#> (<Figure size 700x600 with 1 Axes>, <AxesSubplot:>)`

`plt.hist(data['rating'],color="orange")`

```
#> (array([2.0000e+00, 0.0000e+00, 5.1450e+03, 0.0000e+00, 1.1718e+04,
#>         0.0000e+00, 2.6710e+04, 0.0000e+00, 4.2248e+04, 5.1469e+04]),
#>  array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ]),
#>  <BarContainer object of 10 artists>)
```

The bar plot above shows that users most frequently give ratings of 4 or 5, which means they are generally satisfied with the products.

## Build Recommendation System

### Matrix Factorization

If you look at the pivot matrix below, you will find that it has many zero values (missing values). Why does this happen? Not every user rates every product; this condition is called a **sparse matrix**. Sparsity is a limitation of collaborative filtering models because it biases the recommender system: there will be a popularity bias in the recommendations given to the user, with the system recommending the products with the most interactions, without any personalization.

**Matrix Factorization** is one way to handle this issue. Matrix factorization breaks one matrix down into a product of multiple matrices and produces predicted ratings for the sparse entries. The basic idea behind matrix factorization is that the attitudes or preferences of a user can be determined by a small number of hidden factors.

An illustration is given below:

Intuitively, we can understand the hidden factors for items and users from the illustration above. Say that U is a low-dimensional matrix of user features and V is a low-dimensional matrix of product features. Each matrix value represents a different characteristic of the users or the products. For example, the product matrix might have 3 features: *(i) what category is the product? (ii) does the product contain a dangerous addictive substance? (iii) how does the product affect our skin?* Likewise, the user matrix might represent *(i) how sensitive is the user’s skin to the product’s substances? (ii) does the user like products of category “X”?, and so on*. We can get the predicted ratings by calculating the dot product between matrix U and matrix V.
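The dot-product idea can be sketched with tiny made-up factor matrices (3 users, 4 products, k = 2 hidden factors; all numbers are invented for illustration):

```python
import numpy as np

# Hypothetical low-dimensional factors:
# U: 3 users x 2 hidden factors
# V: 2 hidden factors x 4 products
U = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.8, 0.8]])
V = np.array([[4.0, 1.0, 3.0, 2.0],
              [1.0, 4.0, 2.0, 3.0]])

# Predicted ratings: dot product between user and product factors.
R_hat = U @ V
print(R_hat.shape)  # one predicted rating per (user, product) pair
```

Every cell of `R_hat` is filled in, including pairs the user never rated; that is exactly what lets a factorized model score unseen products.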

### Singular Value Decomposition (SVD)

**Singular Value Decomposition** is one type of matrix factorization. The SVD algorithm decomposes a matrix R into the best lower-rank approximation of the original matrix R. Mathematically, SVD produces the factorization

R = U Σ Vᵀ

where U and V hold the left and right singular vectors and Σ is a diagonal matrix of singular values. We can arrange the eigenvectors in different orders to produce U and V.
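A minimal sketch of the decomposition with NumPy (random toy data and the full SVD, rather than the truncated `svds` used in the implementation below):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((6, 4))  # toy "ratings" matrix

# Full SVD: R = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(R, full_matrices=False)
R_rebuilt = U @ np.diag(sigma) @ Vt

# The decomposition reconstructs R up to floating-point error
print(np.allclose(R, R_rebuilt))  # True
```

Keeping only the k largest singular values (as `scipy.sparse.linalg.svds` does with its `k` argument) gives the best rank-k approximation of R, which is what fills in the missing ratings.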

### Implementation Recommender System in Python Code

#### a. Create matrix pivot

Create a pivot matrix where the rows are user names, the columns are product names, and the values inside the matrix are the ratings given by users.

```
matrix_pivot = pd.pivot_table(data,values='rating',index='reviewer',columns='product').fillna(0)
matrix_pivot.head()
```

```
#> product 0,2 mm Therapy Air Mask Sheet ... “B” oil
#> reviewer ...
#> 01lely 0.0 ... 0.0
#> 01putrisalma 0.0 ... 0.0
#> 01sary 0.0 ... 0.0
#> 123hayoapa 0.0 ... 0.0
#> 15ayusafitri 0.0 ... 0.0
#>
#> [5 rows x 3297 columns]
```

#### b. Normalize rating values

Why do we need to normalize the ratings?

Because people often rate on very different scales. Say that Kiki and Satsuki both use product B. Kiki gives it a rating of 5, while Satsuki, who has a higher standard, only gives it a 3. Here, a 5 from Kiki means the same as a 3 from Satsuki. We can make the model better and increase the efficiency of the algorithm by normalizing the ratings: subtract each user’s mean rating from the ratings that user has given.

```
matrix_pivot_ = matrix_pivot.values
user_ratings_mean = np.mean(matrix_pivot_, axis = 1)
user_rating = matrix_pivot_ - user_ratings_mean.reshape(-1,1)
```

#### c. Singular Value Decomposition (SVD)

Create the matrices U, sigma, and Vt using the scipy library.

```
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(user_rating, k = 50)
```

`sigma = np.diag(sigma)`

#### d. Create predictive rating

After we get the values from the matrix decomposition above, we can create product rating predictions for every user.

`all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)`

The matrix below shows the resulting predicted rating for each user on each product.

```
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = matrix_pivot.columns, index=matrix_pivot.index)
preds_df
```

```
#> product 0,2 mm Therapy Air Mask Sheet ... “B” oil
#> reviewer ...
#> 01lely -0.033156 ... -4.931434e-03
#> 01putrisalma 0.010625 ... 7.925640e-04
#> 01sary 0.001551 ... -1.701105e-03
#> 123hayoapa 0.015527 ... 5.204980e-07
#> 15ayusafitri 0.003480 ... 1.148843e-03
#> ... ... ... ...
#> zvnazole -0.005587 ... 3.074178e-03
#> zyshalu -0.013294 ... 2.718208e-04
#> zzarahs -0.003167 ... 4.559316e-03
#> zzfatimah 0.002338 ... 1.161053e-02
#> zzulia -0.010500 ... -1.232599e-03
#>
#> [22359 rows x 3297 columns]
```

#### e. Create recommendation

In this final step we will create the product recommendations. I’ll return the products with the 5 highest predicted ratings that the user hasn’t already rated.

```
def recommend_product(predictions_df, user, data_, num_recommendations):
    # Sort this user's predicted ratings from highest to lowest
    sorted_user_predictions = predictions_df.loc[user].sort_values(ascending=False)

    # Products the user has already rated
    user_full = data_[data_.reviewer == user]
    print('User {0} has already rated {1} product'.format(user, user_full.shape[0]))

    # Keep one row per product, drop the already-rated ones,
    # attach the predictions, and return the top-N products
    a = data_.drop_duplicates(subset='product', keep='last')
    recommendations = (a[~a['product'].isin(user_full['product'])]
                       .merge(pd.DataFrame(sorted_user_predictions).reset_index(),
                              how='left', on='product')
                       .rename(columns={user: 'Predictions'})
                       .sort_values('Predictions', ascending=False)
                       .iloc[:num_recommendations, :-1])
    return user_full, recommendations
```

Input the user ID of the user you want to recommend products to.

```
user = 'zzulia'
already_rated, predictions = recommend_product(preds_df, user, data,5)
```

`#> User zzulia has already rated 3 product`

The result below shows that “zzulia” has already rated 3 products: Facial Mask twice, with a different rating each time, and Pembersih Two In One Bengkoang Whitening once.

`already_rated`

```
#> reviewer rating product
#> 70013 zzulia 3 Facial Mask
#> 88179 zzulia 2 Facial Mask
#> 115840 zzulia 5 Pembersih Two In One Bengkoang Whitening
```

And here below are the 5 products with the highest predicted ratings for user “zzulia”. The recommendation system suggests that “zzulia” buy Prominent Essence Facial Mask, Facial Mask Bedak Dingin, Oil Control Mask, White Aqua Serum Sheet Mask, and Essential Vitamin. The suggestions are dominated by “Mask” products, because in the historical data above “zzulia” has already rated 2 products in the “Mask” category.

`prod_pred = predictions['product']`

`prod_pred`

```
#> 2040 Prominent Essence Facial Mask
#> 2225 Facial Mask Bedak Dingin
#> 1988 Oil Control Mask
#> 2000 White Aqua Serum Sheet Mask
#> 1661 Essential Vitamin
#> Name: product, dtype: object
```

## Conclusion

From the result above, we can conclude that:

1. Based on the recommendation system above, the Femaledaily website can provide product recommendations on the main dashboard when targeted users access the website.

2. The low-dimensional matrices in matrix factorization try to capture the underlying features, or hidden factors, of the users and items.

3. This model is the right choice if you have very sparse data.