2018 Yearly Analysis

Preparing Data for Analysis:

In order to prepare our data for analysis, we are going to load all our data in the form of lists. We are going to do the same by providing the paths to all the data and binding them together using rbind() function. The data for the financial year is now stored in the the variable Trips.

library("readxl")
library("ggplot2")
library("dygraphs")
setwd("D:/Case Study")
paths2018 <- vector()
paths2018 <- c(paths2018, paste(getwd(),"/Yearly_Data/2018/Quarterly/Divvy_Trips_2018_Q1.xlsx", sep = ""))
paths2018 <- c(paths2018, paste(getwd(),"/Yearly_Data/2018/Quarterly/Divvy_Trips_2018_Q2.xlsx", sep = ""))
paths2018 <- c(paths2018, paste(getwd(),"/Yearly_Data/2018/Quarterly/Divvy_Trips_2018_Q3.xlsx", sep = ""))
paths2018 <- c(paths2018, paste(getwd(),"/Yearly_Data/2018/Quarterly/Divvy_Trips_2018_Q4.xlsx", sep = ""))
AllPaths <- c(paths2018)
Trips <- list()
for(i in AllPaths)
{
  for(Mypath in i)
  {
    Trips <- rbind(Trips,read_excel(Mypath, col_types = c("guess","guess","guess","guess","guess","guess","guess","guess","guess","guess","guess","numeric")))
  }
}

Summary of Data:

summary(Trips)
##     trip_id           start_time                     end_time                  
##  Min.   :17536702   Min.   :2018-01-01 00:12:00   Min.   :2018-01-01 00:17:23  
##  1st Qu.:18468202   1st Qu.:2018-05-17 10:41:59   1st Qu.:2018-05-17 11:02:33  
##  Median :19392132   Median :2018-07-08 15:51:59   Median :2018-07-08 16:27:38  
##  Mean   :19489448   Mean   :2018-07-08 10:20:20   Mean   :2018-07-08 10:43:39  
##  3rd Qu.:20289018   3rd Qu.:2018-08-21 17:12:03   3rd Qu.:2018-08-21 17:29:25  
##  Max.   :21742442   Max.   :2018-12-31 23:59:18   Max.   :2019-01-06 02:36:16  
##                                                                                
##      bikeid      tripduration      from_station_id from_station_name 
##  Min.   :   1   Min.   :      61   Min.   :  2.0   Length:3126981    
##  1st Qu.:1785   1st Qu.:     392   1st Qu.: 76.0   Class :character  
##  Median :3604   Median :     666   Median :165.0   Mode  :character  
##  Mean   :3489   Mean   :    1399   Mean   :189.2                     
##  3rd Qu.:5230   3rd Qu.:    1197   3rd Qu.:283.0                     
##  Max.   :6471   Max.   :14336400   Max.   :664.0                     
##                                                                      
##  to_station_id   to_station_name      usertype            gender         
##  Min.   :  2.0   Length:3126981     Length:3126981     Length:3126981    
##  1st Qu.: 76.0   Class :character   Class :character   Class :character  
##  Median :165.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :189.8                                                           
##  3rd Qu.:284.0                                                           
##  Max.   :664.0                                                           
##                                                                          
##    birthyear     
##  Min.   :1895    
##  1st Qu.:1977    
##  Median :1986    
##  Mean   :1983    
##  3rd Qu.:1991    
##  Max.   :2005    
##  NA's   :477663

Analysis Metrics:

  • Customer-Sub Ratio
  • Gender Demographic
  • Month by Month

Customer-Sub Ratio:

The Customer-Sub ratio seems to start plateauing around the 80 - 20 % mark. Which looking at the industry standard is a healthy Customer-Sub ratio.

UserTypeCol <- Trips$usertype
Subs <- 0
Customers <- 0

for(i in UserTypeCol)
  if(i == "Subscriber")
    Subs <- Subs + 1 else
      Customers <- Customers + 1
pie(c(Subs,Customers),label = c(paste("Subscribers = ", round(Subs*100/(Subs + Customers), 2), "%"), paste("Customers = ", round(Customers*100/(Subs + Customers), 2), "%")))

Gender Demographic

The Gender Demographic seems to be a recurring problem every year. It would really be in the bike rental service’s best interest to advertise extensively to increase the sex ratio and further increase profits.

ggplot(data = Trips) +
  geom_bar(mapping = aes(x = gender, fill = usertype), stat = "count")

Month by Month:

Before Looking at each month of the year, we need to change our date formats to enable us to use those metrics. Using the as.Date() method, we create a new column named month recording the month the trip started in.

Trips$month <- format(as.Date(Trips$start_time), "%m")

We can observe, the same pattern appears in this year’s analysis as well. People tend to make more trips using the bike rental service during the summer season which is consistent with the findings shown in the previous years discussing the possibility of advertising for summer season in order to drive profits.

xy2 <- as.data.frame(table(Trips$month))
xy2$Var1 <- factor(xy2$Var1, levels = xy2$Var1[order(xy2$Freq)])

ggplot(xy2,aes(x=Var1, y = Freq, fill = Freq)) +
    scale_fill_gradient(low = "yellow", high = "red") +
      geom_bar(width=0.7, stat = "identity") +
        coord_flip()