Reading Data and Basic Preprocessing
Some of the data we obtain from the internet comes raw, meaning that nothing has been done to it beyond placing each value in the right column or row. Even though that is often a good thing, sometimes we have to reshape the data into a form that better serves our objective.
Making sure there is no duplicated data is one aspect of understanding the data itself: a model built from data full of duplicates can hardly be called reliable in a real-case scenario. This time, we will learn how to handle duplicated data, so we can be confident that the data we use to create a model, a visual interpretation, etc. is reliable.
One example is a case where we want to find out the number of requests in an online transportation service. When the data includes canceled requests or no-driver conditions, it is likely that several of those rows come from a single consumer, and they are therefore irrelevant.
Now we want to tackle another case of understanding our data better: deciding which rows count as duplicates, and removing them.
This case uses the Online Auctions Dataset from Kaggle, which contains data from auctions held on eBay. Our main objective is to see how many people bid on each item each day.
Before we process our data, it would be wise to understand each column in it:
- auctionid : unique identifier of an auction
- bid : the proxy bid placed by a bidder
- bidtime : the time in days that the bid was placed, from the start of the auction
- bidder : eBay username of the bidder
- bidderrate : eBay feedback rating of the bidder
- openbid : the opening bid set by the seller
- price : the closing price that the item sold for (equivalent to the second highest bid + an increment)
- item : auction item
- auction_type : type of an auction: 3-days auction, 5-days auction, or 7-days auction.
If we look at the columns provided, we can guess that some of them are not needed. But to make sure, we will read the data, load the libraries we need, and take a look at the structure to confirm that every column has the right data type.
library(dplyr)
library(ggplot2)
data <- read.csv("data_input/auction.csv")
glimpse(data)
#> Observations: 10,681
#> Variables: 9
#> $ auctionid <dbl> 1638893549, 1638893549, 1638893549, 1638893549, 1...
#> $ bid <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.2...
#> $ bidtime <dbl> 2.230949, 2.600116, 2.600810, 2.601076, 2.909826,...
#> $ bidder <fct> schadenfreud, chuik, kiwisstuff, kiwisstuff, eli....
#> $ bidderrate <int> 0, 0, 2, 2, 4, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 15, ...
#> $ openbid <dbl> 99, 99, 99, 99, 99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ price <dbl> 177.5, 177.5, 177.5, 177.5, 177.5, 355.0, 355.0, ...
#> $ item <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 3 day auction, 3 day auction, 3 day auction, 3 da...
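A quick note on the output: bidder, item, and auction_type are read as factors, which was the default behavior in R versions before 4.0. Since R 4.0, read.csv() defaults to stringsAsFactors = FALSE, so to reproduce the output above you may need to set it explicitly:
# Read strings as factors (the pre-4.0 default).
data <- read.csv("data_input/auction.csv", stringsAsFactors = TRUE)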
The chunk above shows that the data we have is ordered by auctionid. But auctionid is still being read as a numeric data type instead of a factor. Then, because our objective is to see how many people bid on each item each day, we will round bidtime down to whole days. Also, some columns are of no use for our objective, namely bidderrate, openbid, and price.
data <- data %>%
  mutate(auctionid = as.factor(auctionid), # IDs are categories, not numbers
         bidtime = floor(bidtime)) %>%     # round bid times down to whole days
  dplyr::select(auctionid, bidtime, bidder, bid, item, auction_type)
glimpse(data)
#> Observations: 10,681
#> Variables: 6
#> $ auctionid <fct> 1638893549, 1638893549, 1638893549, 1638893549, 1...
#> $ bidtime <dbl> 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
#> $ bidder <fct> schadenfreud, chuik, kiwisstuff, kiwisstuff, eli....
#> $ bid <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.2...
#> $ item <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 3 day auction, 3 day auction, 3 day auction, 3 da...
Before going further, we will also check whether there are any NAs in our data.
colSums(is.na(data))
#> auctionid bidtime bidder bid item
#> 0 0 16 0 0
#> auction_type
#> 0
As a rule of thumb, because the number of NAs in bidder is less than 5% of our data, we will simply remove those rows.
data <- data[complete.cases(data),]
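Equivalently, since only bidder contains NAs, we could use drop_na from the tidyr package, which removes rows that have NA in the given columns. A minimal sketch:
# Drop rows where bidder is NA; same result as complete.cases() here.
data <- tidyr::drop_na(data, bidder)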
Then we will sort the data by three columns, in this order: auctionid, bidtime, then bidder. By doing that, we can see each bidder's bids for each auction, day by day.
data <- data %>%
arrange(auctionid, bidtime, bidder)
glimpse(data)
#> Observations: 10,665
#> Variables: 6
#> $ auctionid <fct> 1638843936, 1638843936, 1638843936, 1638843936, 1...
#> $ bidtime <dbl> 0, 0, 3, 5, 6, 6, 6, 1, 1, 1, 3, 4, 6, 6, 6, 6, 6...
#> $ bidder <fct> doc213, kona-java, zmxu, carloss8055, carloss8055...
#> $ bid <dbl> 800.00, 500.00, 600.00, 1500.00, 1550.00, 1625.00...
#> $ item <fct> Cartier wristwatch, Cartier wristwatch, Cartier w...
#> $ auction_type <fct> 7 day auction, 7 day auction, 7 day auction, 7 da...
Lead, Lag, and Complete
Now we will find out which rows count as duplicates so we can remove them. The rows we will remove are those that exist because a bidder bid more than once a day on the same item; we will keep only the highest bid each person placed each day.
For that, two functions from the dplyr package will be introduced: lag and lead. lag is used to see the previous value of a vector, and lead is its exact opposite: it shows the next value.
But before jumping into our main objective, it is wise to learn why arranging the data before using lag and lead is so important. For example, say we have 5 random numbers between 0 and 1, and we want to see the number before and after each of them using those two functions.
set.seed(8)
x <- runif(5)
cbind(x, after = lead(x), before = lag(x))
#> x after before
#> [1,] 0.4662952 0.2078233 NA
#> [2,] 0.2078233 0.7996580 0.4662952
#> [3,] 0.7996580 0.6518713 0.2078233
#> [4,] 0.6518713 0.3215092 0.7996580
#> [5,] 0.3215092 NA 0.6518713
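As an aside, both functions also accept a default argument, which replaces the NA at the edges with a value of our choosing. A small sketch with the same x:
# Use 0 instead of NA for the missing neighbours at both ends.
cbind(x, after = lead(x, default = 0), before = lag(x, default = 0))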
Well, that was easy, considering we only have a single column and we don't have to care what the values are ordered by. But sometimes we get wrong results simply because we didn't specify the order in which we want our lag and lead computed.
So let's make another example. Say we have a data frame consisting of a year, its quartal, and a value for each quartal, but the rows are scrambled and some rows are missing. Should we fill the incomplete rows first? Should we arrange the rows first? Or can we directly look up the next and previous values?
Let’s read the arranged but incomplete data first.
set.seed(8)
y <- data.frame(year = c(2000,2000,2001,2001,2001,2001,2002,2002,2002),
quartal = c(1,3,1,2,3,4,2,3,4),
value = runif(9))
head(y)
#> year quartal value
#> 1 2000 1 0.4662952
#> 2 2000 3 0.2078233
#> 3 2001 1 0.7996580
#> 4 2001 2 0.6518713
#> 5 2001 3 0.3215092
#> 6 2001 4 0.7189275
We will then complete the missing quartals, using complete from the tidyr package.
y %>%
tidyr::complete(year, quartal) %>%
head()
#> # A tibble: 6 x 3
#> year quartal value
#> <dbl> <dbl> <dbl>
#> 1 2000 1 0.466
#> 2 2000 2 NA
#> 3 2000 3 0.208
#> 4 2000 4 NA
#> 5 2001 1 0.800
#> 6 2001 2 0.652
We can fill the NAs using ifelse inside dplyr's mutate function, or we can simply pass fill as a parameter to complete, as above.
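As a minimal sketch of the first option (using the same y as above, before filling):
# Complete the missing quartals, then replace the resulting NAs with 0.
y %>%
  tidyr::complete(year, quartal) %>%
  mutate(value = ifelse(is.na(value), 0, value))
Here we take the second, shorter route: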
y <- y %>%
tidyr::complete(year, quartal, fill = list(value = 0))
head(y)
#> # A tibble: 6 x 3
#> year quartal value
#> <dbl> <dbl> <dbl>
#> 1 2000 1 0.466
#> 2 2000 2 0
#> 3 2000 3 0.208
#> 4 2000 4 0
#> 5 2001 1 0.800
#> 6 2001 2 0.652
Now we’ll try to scramble them.
set.seed(8)
scrambled <- y[sample(nrow(y)),]
head(scrambled)
#> # A tibble: 6 x 3
#> year quartal value
#> <dbl> <dbl> <dbl>
#> 1 2001 2 0.652
#> 2 2000 3 0.208
#> 3 2001 4 0.719
#> 4 2002 4 0.769
#> 5 2002 3 0.932
#> 6 2002 1 0
To get the right results, we should arrange our data by year and quartal before calling the lag or lead function. In this example we will use lag: first the wrong way, then the right way.
wrong <- scrambled %>%
  mutate(prev = lag(value)) %>% # lag is computed on the scrambled order
  arrange(year, quartal)
head(wrong)
#> # A tibble: 6 x 4
#> year quartal value prev
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 1 0.466 0.322
#> 2 2000 2 0 0
#> 3 2000 3 0.208 0.652
#> 4 2000 4 0 0.800
#> 5 2001 1 0.800 0
#> 6 2001 2 0.652 NA
right <- scrambled %>%
  arrange(year, quartal) %>%    # sort first...
  mutate(prev = lag(value))     # ...so lag sees the right order
head(right)
#> # A tibble: 6 x 4
#> year quartal value prev
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 1 0.466 NA
#> 2 2000 2 0 0.466
#> 3 2000 3 0.208 0
#> 4 2000 4 0 0.208
#> 5 2001 1 0.800 0
#> 6 2001 2 0.652 0.800
The two tables above show how important ordering is, especially when you want to know the value before and after each row: get the order wrong once and your whole column is broken.
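As another aside, lag and lead also take an order_by argument, which computes the shift in the given order without physically rearranging the rows. A minimal sketch, combining year and quartal into a single sort key since order_by expects one vector:
# Previous value in (year, quartal) order; the rows themselves stay scrambled.
scrambled %>%
  mutate(prev = lag(value, order_by = year * 10 + quartal))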
Now we understand that when we're faced with scrambled data full of missing rows and we want its lag or lead, we can:
- Rearrange the rows,
- Fill the incomplete rows, and
- Finally take their lead and lag.
Now let's get back to our main quest. We will apply those functions to determine whether each row is a duplicate or not. After marking which rows are duplicates and which are not, we will filter the data to keep only the non-duplicates.
data_mod <- data %>%
  mutate(
    # Final is 0 when the next row has the same bidder, auction, and day,
    # i.e. this row is superseded by that bidder's later (higher) bid.
    Final = ifelse(bidder == lead(bidder, 1) &
                   auctionid == lead(auctionid, 1) &
                   bidtime == lead(bidtime, 1), 0, 1))
data_mod <- data_mod %>%
filter(Final == 1 | is.na(Final)) # NA will be returned in the last row of data
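As an aside, if all we need is the highest bid per bidder per day, a group-based alternative is slice_max from dplyr (version 1.0 or later); data_alt is just an illustrative name. A minimal sketch:
# Keep the single highest bid per bidder per auction per day.
data_alt <- data %>%
  group_by(auctionid, bidtime, bidder) %>%
  slice_max(bid, n = 1, with_ties = FALSE) %>%
  ungroup()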
Fortunately, lead alone is enough to find the duplicates here: since a bidder's repeated bids within a day appear in chronological order, keeping only the last row of each run keeps their highest bid of the day. But for knowledge's sake, we can use lag as well. In that condition we keep not each bidder's highest bid of the day but the first bid they placed. This can be used when the data is prone to accidental repeated input (like filling in forms or quizzes).
data_mod2 <- data %>%
  mutate(
    # Final is 0 when the previous row has the same bidder, auction, and day,
    # i.e. this row repeats an earlier bid that we already kept.
    Final = ifelse(bidder == lag(bidder, 1) &
                   auctionid == lag(auctionid, 1) &
                   bidtime == lag(bidtime, 1), 0, 1))
data_mod2 <- data_mod2 %>%
filter(Final == 1 | is.na(Final)) # NA will be returned in the first row of data
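Keeping only the first bid per bidder per day, as the lag variant does, is also what distinct from dplyr offers with .keep_all = TRUE; data_mod2_alt is an illustrative name. A minimal sketch:
# Keep the first row of each (auctionid, bidtime, bidder) combination.
data_mod2_alt <- data %>%
  distinct(auctionid, bidtime, bidder, .keep_all = TRUE)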
The difference between the two can be seen below. If we focus on the 5th row, the bid amount differs, because carloss8055 happened to bid more than once that day: the highest of that day's bids sits in the 4th column (bid_lead) and the first, lowest one in the 5th (bid_lag).
# Put the lead-based and lag-based results side by side.
compared <- cbind(data_mod[, 1:4], data_mod2[, 4])
colnames(compared)[4:5] <- c("bid_lead", "bid_lag")
head(compared)
#> auctionid bidtime bidder bid_lead bid_lag
#> 1 1638843936 0 doc213 800 800
#> 2 1638843936 0 kona-java 500 500
#> 3 1638843936 3 zmxu 600 600
#> 4 1638843936 5 carloss8055 1500 1500
#> 5 1638843936 6 carloss8055 1625 1550
#> 6 1638843936 6 jdrinaz 1600 1600
Because we want the highest bid from each bidder on each day, we will use data_mod, which kept only the last bid of each run and removed the rows before it as duplicates.
The Difference Between Raw and Edited Data
At last, we will see the difference between the unfiltered and the filtered data. First we will look at the first 8 rows of both, then make a plot to show the difference more clearly.
head(data[,1:5], 8)
#> auctionid bidtime bidder bid item
#> 1 1638843936 0 doc213 800 Cartier wristwatch
#> 2 1638843936 0 kona-java 500 Cartier wristwatch
#> 3 1638843936 3 zmxu 600 Cartier wristwatch
#> 4 1638843936 5 carloss8055 1500 Cartier wristwatch
#> 5 1638843936 6 carloss8055 1550 Cartier wristwatch
#> 6 1638843936 6 carloss8055 1625 Cartier wristwatch
#> 7 1638843936 6 jdrinaz 1600 Cartier wristwatch
#> 8 1638844284 1 dre_313@yahoo.com 225 Cartier wristwatch
head(data_mod[,1:5], 8)
#> auctionid bidtime bidder bid item
#> 1 1638843936 0 doc213 800 Cartier wristwatch
#> 2 1638843936 0 kona-java 500 Cartier wristwatch
#> 3 1638843936 3 zmxu 600 Cartier wristwatch
#> 4 1638843936 5 carloss8055 1500 Cartier wristwatch
#> 5 1638843936 6 carloss8055 1625 Cartier wristwatch
#> 6 1638843936 6 jdrinaz 1600 Cartier wristwatch
#> 7 1638844284 1 dre_313@yahoo.com 225 Cartier wristwatch
#> 8 1638844284 1 njbirdmom 500 Cartier wristwatch
Some rows have been deleted: for example, we no longer have the row where carloss8055 bid 1550 for the Cartier wristwatch with auctionid 1638843936, because carloss8055 bid again at a higher price on the same day. As a result, our modified data is much smaller than the original (around half its size).
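We can verify the size difference with a quick row count (output omitted here):
# Compare the number of rows before and after removing duplicates.
nrow(data)     # raw data
nrow(data_mod) # deduplicated data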
data_agg <- data %>%
group_by(auctionid, bidtime) %>%
summarise(tot_bidder = n()) %>%
mutate(Type = "Raw") %>%
as.data.frame()
data_mod_agg <- data_mod %>%
group_by(auctionid, bidtime) %>%
summarise(tot_bidder = n()) %>%
mutate(Type = "Edited") %>%
as.data.frame()
data_combined <- rbind(data_agg, data_mod_agg) %>%
mutate(Type = as.factor(Type))
ggplot(data_combined, aes(x = bidtime, y = tot_bidder, group = Type)) +
geom_bin2d(position = "dodge", aes(colour = Type)) +
labs(x = "bid time", y = "total bidder", title = "Original and Edited Data Comparison")
Using only lag and lead, we can see that the impact they have on the data is massive, and the edited data reflects the real-life scenario far better than the raw one. We can safely say that we have reached our objective.