Handling Duplicate Data

Reading Data and Basic Preprocessing

Some data that we obtain from the internet comes raw, meaning that no modifications have been made to it other than placing each value in the right column or row. Even if that is a good thing, sometimes we have to treat the data and reshape it into a form that is as friendly to our objective as possible.

Making sure that there is no duplicated data is one aspect of understanding the data itself, because a model built from information full of duplicates can hardly be called relevant enough for a real-world scenario. This time, we will learn how to handle duplicated data, so that we can be sure the data we use to create a model, a visualization, etc. is reliable enough.

One example is a case where we want to find out the number of requests in an online transportation (ride-hailing) service. When the data contains canceled requests or no-driver conditions, there is a probability that several of those rows come from a single consumer, and are therefore irrelevant.
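As a quick illustration of the idea (a minimal sketch with hypothetical column names, not the dataset used below), keeping only one such request per consumer could look like this:

library(dplyr)

# hypothetical ride-hailing requests: several rows may come from the same consumer
requests <- data.frame(
  consumer_id = c(1, 1, 2, 3, 3),
  status      = c("canceled", "canceled", "completed", "no driver", "no driver")
)

# keep only one row per consumer and status, dropping the repeated requests
distinct(requests, consumer_id, status)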

Now, we want to solve another case of understanding our data better: deciding which rows count as duplicates, and removing them.

This case uses the Online Auctions Dataset from Kaggle, which contains auctions held on eBay. Our main objective is to see how many people bid on an item each day.

The Data

Before we process our data, it would be wise to understand each column in it:

  • auctionid : unique identifier of an auction
  • bid : the proxy bid placed by a bidder
  • bidtime : the time in days that the bid was placed, from the start of the auction
  • bidder : eBay username of the bidder
  • bidderrate : eBay feedback rating of the bidder
  • openbid : the opening bid set by the seller
  • price : the closing price that the item sold for (equivalent to the second highest bid + an increment)
  • item : auction item
  • auction_type : type of an auction: 3-days auction, 5-days auction, or 7-days auction.

If we look at the columns provided, we can tell that some of them are not needed. But to make sure, we will now load the libraries we need and read the data, then take a look at its structure to check that every column has the right data type.

library(dplyr)
library(ggplot2)

data <- read.csv("data_input/auction.csv")
glimpse(data)
#> Observations: 10,681
#> Variables: 9
#> $ auctionid    <dbl> 1638893549, 1638893549, 1638893549, 1638893549, 1...
#> $ bid          <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.2...
#> $ bidtime      <dbl> 2.230949, 2.600116, 2.600810, 2.601076, 2.909826,...
#> $ bidder       <fct> schadenfreud, chuik, kiwisstuff, kiwisstuff, eli....
#> $ bidderrate   <int> 0, 0, 2, 2, 4, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 15, ...
#> $ openbid      <dbl> 99, 99, 99, 99, 99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ price        <dbl> 177.5, 177.5, 177.5, 177.5, 177.5, 355.0, 355.0, ...
#> $ item         <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 3 day auction, 3 day auction, 3 day auction, 3 da...

The chunk above shows that the data is ordered by auctionid, but auctionid is still being read as a numeric type instead of a factor. Then, because our objective is to see how many people bid on each item each day, we will round bidtime down to whole days. Also, some columns are not needed for this objective, such as bidderrate, openbid, and price.

data <- data %>% 
  mutate(auctionid = as.factor(auctionid), 
         bidtime = floor(bidtime)) %>% 
  dplyr::select(auctionid, bidtime, bidder, bid, item, auction_type)

glimpse(data)
#> Observations: 10,681
#> Variables: 6
#> $ auctionid    <fct> 1638893549, 1638893549, 1638893549, 1638893549, 1...
#> $ bidtime      <dbl> 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
#> $ bidder       <fct> schadenfreud, chuik, kiwisstuff, kiwisstuff, eli....
#> $ bid          <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.2...
#> $ item         <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 3 day auction, 3 day auction, 3 day auction, 3 da...

Before going further, we will also check if there’s any NA in our data.

colSums(is.na(data))
#>    auctionid      bidtime       bidder          bid         item 
#>            0            0           16            0            0 
#> auction_type 
#>            0

As a rule of thumb, because the number of NAs in bidder is less than 5% of our data, we will simply remove those rows.
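A quick check, not in the original write-up, confirms that figure: 16 missing bidder values out of 10,681 rows is roughly 0.15%, well under the 5% threshold.

# proportion of missing bidder values (about 0.0015, i.e. roughly 0.15%)
mean(is.na(data$bidder))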

data <- data[complete.cases(data),]

Then, we will order the data by three columns, in order: auctionid, bidtime, then bidder. By doing that, we can see each bidder's bids per auction, per day.

data <- data %>% 
  arrange(auctionid, bidtime, bidder)

glimpse(data)
#> Observations: 10,665
#> Variables: 6
#> $ auctionid    <fct> 1638843936, 1638843936, 1638843936, 1638843936, 1...
#> $ bidtime      <dbl> 0, 0, 3, 5, 6, 6, 6, 1, 1, 1, 3, 4, 6, 6, 6, 6, 6...
#> $ bidder       <fct> doc213, kona-java, zmxu, carloss8055, carloss8055...
#> $ bid          <dbl> 800.00, 500.00, 600.00, 1500.00, 1550.00, 1625.00...
#> $ item         <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 7 day auction, 7 day auction, 7 day auction, 7 da...

Lead, Lag, and Complete

Now, we will find out which rows count as duplicates so we can remove them. The rows we will remove are the ones that exist because a bidder bid more than once in a day for an item; we will keep only the highest bid for each person each day.

For that, two functions from the dplyr library will be introduced: lag and lead. lag is used to see the previous value of a vector, and lead is the exact opposite: it gives the next value.
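As a minimal illustration of the two on a plain vector:

lag(c(10, 20, 30))
#> [1] NA 10 20
lead(c(10, 20, 30))
#> [1] 20 30 NA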

But before jumping back to our main objective, it would be wise to learn why arranging the data before using lag and lead is so important. For example, say we have 5 random numbers between 0 and 1, and we want to see the value before and after each of them using those two functions.

set.seed(8)
x <- runif(5)
cbind(x, after = lead(x), before = lag(x))
#>              x     after    before
#> [1,] 0.4662952 0.2078233        NA
#> [2,] 0.2078233 0.7996580 0.4662952
#> [3,] 0.7996580 0.6518713 0.2078233
#> [4,] 0.6518713 0.3215092 0.7996580
#> [5,] 0.3215092        NA 0.6518713

Well, that was easy, considering we only have a single vector and don't have to care what to order the values by. But sometimes mistakes occur because we do not specify the order in which we want to take our lag and lead.

So let's make another example. Say we have a data frame consisting of a year, its quarter (quartal), and a value for each quarter, but the rows are scrambled and some rows are missing. Should we fill in the incomplete rows first? Should we arrange the data first? Or can we directly find the next and previous values?

Let's create the arranged but incomplete data first.

set.seed(8)
y <- data.frame(year = c(2000,2000,2001,2001,2001,2001,2002,2002,2002),
                quartal = c(1,3,1,2,3,4,2,3,4),
                value = runif(9))
head(y)
#>   year quartal     value
#> 1 2000       1 0.4662952
#> 2 2000       3 0.2078233
#> 3 2001       1 0.7996580
#> 4 2001       2 0.6518713
#> 5 2001       3 0.3215092
#> 6 2001       4 0.7189275

We will then complete the missing quarters using complete from the tidyr library.

y %>% 
  tidyr::complete(year, quartal) %>% 
  head()
#> # A tibble: 6 x 3
#>    year quartal  value
#>   <dbl>   <dbl>  <dbl>
#> 1  2000       1  0.466
#> 2  2000       2 NA    
#> 3  2000       3  0.208
#> 4  2000       4 NA    
#> 5  2001       1  0.800
#> 6  2001       2  0.652

We can fill the NAs using ifelse inside dplyr's mutate function, or we can simply pass fill as a parameter to complete as above.
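For illustration, here is a minimal sketch of the ifelse route (assuming, as with fill below, that missing values should become 0):

y %>% 
  tidyr::complete(year, quartal) %>% 
  mutate(value = ifelse(is.na(value), 0, value)) %>% # replace the NAs that complete created
  head()

Below we take the fill route instead, since it does the same thing in a single step.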

y <- y %>% 
  tidyr::complete(year, quartal, fill = list(value = 0))
head(y)
#> # A tibble: 6 x 3
#>    year quartal value
#>   <dbl>   <dbl> <dbl>
#> 1  2000       1 0.466
#> 2  2000       2 0    
#> 3  2000       3 0.208
#> 4  2000       4 0    
#> 5  2001       1 0.800
#> 6  2001       2 0.652

Now we’ll try to scramble them.

set.seed(8)
scrambled <- y[sample(nrow(y)),]
head(scrambled)
#> # A tibble: 6 x 3
#>    year quartal value
#>   <dbl>   <dbl> <dbl>
#> 1  2001       2 0.652
#> 2  2000       3 0.208
#> 3  2001       4 0.719
#> 4  2002       4 0.769
#> 5  2002       3 0.932
#> 6  2002       1 0

To handle that, we should first arrange the data by year and quartal before using the lag or lead function. In this example we will use lag, comparing what happens when we arrange after versus before taking it.

wrong <- scrambled %>% 
  mutate(prev = lag(value)) %>% 
  arrange(year, quartal)
head(wrong)
#> # A tibble: 6 x 4
#>    year quartal value   prev
#>   <dbl>   <dbl> <dbl>  <dbl>
#> 1  2000       1 0.466  0.322
#> 2  2000       2 0      0    
#> 3  2000       3 0.208  0.652
#> 4  2000       4 0      0.800
#> 5  2001       1 0.800  0    
#> 6  2001       2 0.652 NA
right <- scrambled %>% 
  arrange(year,quartal) %>% 
  mutate(prev = lag(value))
head(right)
#> # A tibble: 6 x 4
#>    year quartal value   prev
#>   <dbl>   <dbl> <dbl>  <dbl>
#> 1  2000       1 0.466 NA    
#> 2  2000       2 0      0.466
#> 3  2000       3 0.208  0    
#> 4  2000       4 0      0.208
#> 5  2001       1 0.800  0    
#> 6  2001       2 0.652  0.800

The two tables above show how important ordering is, especially when you want to know the value before or after a row: get the order wrong once, and the whole column is broken.

Now we understand that when we are faced with scrambled data full of missing rows and we want its lag or lead, we can:

  1. Rearrange the rows,
  2. Fill in the incomplete rows, and
  3. Finally take their lead and lag, as sketched in the pipeline below.
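Putting the three steps together on the scrambled example from above (a sketch: complete is a no-op here because scrambled has already been filled, but it shows where that step belongs, and nxt is just an illustrative column name):

scrambled %>% 
  arrange(year, quartal) %>%                                  # 1. rearrange
  tidyr::complete(year, quartal, fill = list(value = 0)) %>%  # 2. fill incomplete rows
  mutate(prev = lag(value), nxt = lead(value)) %>%            # 3. lag and lead
  head()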

Now let's get back to our main quest. We will apply those functions to determine whether a row counts as a duplicate or not. After flagging which rows are duplicates and which are not, we will filter the data to keep only the non-duplicates.

data_mod <- data %>% 
  mutate(
    Final = ifelse(bidder == lead(bidder,1) &
                     auctionid == lead(auctionid,1) &
                     bidtime == lead(bidtime,1), 0, 1))

data_mod <- data_mod %>% 
  filter(Final == 1 | is.na(Final)) # NA will be returned in the last row of data

Fortunately, lead alone is enough to flag the duplicates. But for the sake of completeness, we can use lag as well. In that case, instead of keeping each bidder's highest bid in a day, we keep only the first time they bid. This is useful when the data is prone to accidental double entry (like form or quiz submissions).

data_mod2 <- data %>%
  mutate(
    Final = ifelse(bidder == lag(bidder,1) &
                     auctionid == lag(auctionid,1) &
                     bidtime == lag(bidtime,1), 0, 1))

data_mod2 <- data_mod2 %>% 
  filter(Final == 1 | is.na(Final)) # NA will be returned in the first row of data

The difference between them can be seen below. If we focus on the 5th row, we can see that the bid amounts differ, because carloss8055 bid more than once that day: the highest amount ends up in the 4th column (bid_lead, from the lead version), and the first, lowest one in the 5th (bid_lag, from the lag version).

merge <- cbind(data_mod[,1:4],data_mod2[,4])
colnames(merge)[4] <- "bid_lead"
colnames(merge)[5] <- "bid_lag"

head(merge)
#>    auctionid bidtime      bidder bid_lead bid_lag
#> 1 1638843936       0      doc213      800     800
#> 2 1638843936       0   kona-java      500     500
#> 3 1638843936       3        zmxu      600     600
#> 4 1638843936       5 carloss8055     1500    1500
#> 5 1638843936       6 carloss8055     1625    1550
#> 6 1638843936       6     jdrinaz     1600    1600

Because we want the highest bid of each bidder in each day, we will use data_mod, which keeps only the last bid in each bidder-day group and removes the rows before it as duplicates.
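For comparison, a roughly equivalent way to express this without lead is to group and keep the maximum bid directly. This is only a sketch: it assumes there are no tied bids within a bidder-day, and it can differ from data_mod when a bidder's later bid in the same day happens to be lower than an earlier one.

data_alt <- data %>% 
  group_by(auctionid, bidtime, bidder) %>% # one group per bidder, per auction, per day
  filter(bid == max(bid)) %>%              # keep only the highest bid in each group
  ungroup()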

The Difference between Raw and Edited Data

At last, we will see the difference between the data when we don't filter it and when we do. First we will look at the first 8 rows of both, then make a plot to show it more clearly.

head(data[,1:5], 8)
#>    auctionid bidtime            bidder  bid               item
#> 1 1638843936       0            doc213  800 Cartier wristwatch
#> 2 1638843936       0         kona-java  500 Cartier wristwatch
#> 3 1638843936       3              zmxu  600 Cartier wristwatch
#> 4 1638843936       5       carloss8055 1500 Cartier wristwatch
#> 5 1638843936       6       carloss8055 1550 Cartier wristwatch
#> 6 1638843936       6       carloss8055 1625 Cartier wristwatch
#> 7 1638843936       6           jdrinaz 1600 Cartier wristwatch
#> 8 1638844284       1 dre_313@yahoo.com  225 Cartier wristwatch
head(data_mod[,1:5], 8)
#>    auctionid bidtime            bidder  bid               item
#> 1 1638843936       0            doc213  800 Cartier wristwatch
#> 2 1638843936       0         kona-java  500 Cartier wristwatch
#> 3 1638843936       3              zmxu  600 Cartier wristwatch
#> 4 1638843936       5       carloss8055 1500 Cartier wristwatch
#> 5 1638843936       6       carloss8055 1625 Cartier wristwatch
#> 6 1638843936       6           jdrinaz 1600 Cartier wristwatch
#> 7 1638844284       1 dre_313@yahoo.com  225 Cartier wristwatch
#> 8 1638844284       1         njbirdmom  500 Cartier wristwatch

Some rows have been deleted: for example, we no longer have the row where carloss8055 bid 1550 on an item (the Cartier wristwatch with auctionid 1638843936), because carloss8055 bid again at a higher price later the same day. Our modified data is now much smaller than the original (around half the size).
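We can check the reduction by comparing row counts (exact numbers not shown here; the second should come out to roughly half of the first):

nrow(data)     # rows in the raw (NA-cleaned) data
nrow(data_mod) # rows after removing duplicate bids per bidder per day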

data_agg <- data %>% 
  group_by(auctionid, bidtime) %>% 
  summarise(tot_bidder = n()) %>% 
  mutate(Type = "Raw") %>% 
  as.data.frame()

data_mod_agg <- data_mod %>% 
  group_by(auctionid, bidtime) %>% 
  summarise(tot_bidder = n()) %>% 
  mutate(Type = "Edited") %>% 
  as.data.frame()

data_combined <- rbind(data_agg, data_mod_agg) %>% 
  mutate(Type = as.factor(Type))

ggplot(data_combined, aes(x = bidtime, y = tot_bidder, group = Type)) + 
  geom_bin2d(position = "dodge", aes(colour = Type)) +
  labs(x = "bid time", y = "total bidder", title = "Original and Edited Data Comparison")

Using only lag and lead, we can see that the impact on the data is massive. The edited data reflects the real-life scenario better than the raw one, so we can say that we have reached our objective.
