2016 Yearly Analysis

Preparing Data for Analysis:

In order to prepare our data, we are going to load all the data present in excel sheets in the form of lists. We are going to implement the same by using the rbind function.

Initially, we are going to add the paths to all of our data for the year 2016 load it into a variable named Trips

library("readxl")
library("ggplot2")
library("dygraphs")
setwd("D:/Case Study")
paths2016 <- vector()
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_04.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_05.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_06.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q1.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q3.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q4.xlsx", sep = ""))
AllPaths <- c(paths2016)
Trips <- list()
for(i in AllPaths)
{
  for(Mypath in i)
  { 
    Trips <- rbind(Trips,read_excel(Mypath, col_types = c("guess","guess","guess","guess","guess","guess","guess","guess","guess","guess","guess","numeric")))
  }
}

Summary of Data:

summary(Trips)

##     trip_id          start_time          end_time             bikeid    
##  Min.   : 8547211   Length:3202147     Length:3202147     Min.   :   1  
##  1st Qu.: 9604498   Class :character   Class :character   1st Qu.:1496  
##  Median :11058250   Mode  :character   Mode  :character   Median :2960  
##  Mean   :10820326                                         Mean   :2931  
##  3rd Qu.:12011692                                         3rd Qu.:4329  
##  Max.   :12979228                                         Max.   :5920  
##                                                                         
##   tripduration     from_station_id from_station_name  to_station_id  
##  Min.   :   60.0   Min.   :  2.0   Length:3202147     Min.   :  2.0  
##  1st Qu.:  400.0   1st Qu.: 74.0   Class :character   1st Qu.: 74.0  
##  Median :  688.0   Median :157.0   Mode  :character   Median :157.0  
##  Mean   :  972.6   Mean   :177.9                      Mean   :178.3  
##  3rd Qu.: 1151.0   3rd Qu.:268.0                      3rd Qu.:268.0  
##  Max.   :86365.0   Max.   :620.0                      Max.   :620.0  
##                                                                      
##  to_station_name      usertype            gender            birthyear     
##  Length:3202147     Length:3202147     Length:3202147     Min.   :1899    
##  Class :character   Class :character   Class :character   1st Qu.:1974    
##  Mode  :character   Mode  :character   Mode  :character   Median :1984    
##                                                           Mean   :1980    
##                                                           3rd Qu.:1988    
##                                                           Max.   :2000    
##                                                           NA's   :722944

Analysis Metrics:

Customer-Sub Ratio
Gender Demographic
Trip Duration

Customer-Sub Ratio:

Here we observe a significant improvement in the Customer-Subscriber Ratio.

2015: 59.14-40.86 %

2016: 77.41-22.59 %

This shows us that the average user is highly likely to subscribe after they try out the service. Therefore, it might in the company’s best interest to host events to bring in more customers such as handing out free trials, hosting public awareness events, etc.

UserTypeCol <- Trips$usertype
Subs <- 0
Customers <- 0

for(i in UserTypeCol)
  if(i == "Subscriber")
    Subs <- Subs + 1 else
      Customers <- Customers + 1
pie(c(Subs,Customers),label = c(paste("Subscribers = ", round(Subs*100/(Subs + Customers), 2), "%"), paste("Customers = ", round(Customers*100/(Subs + Customers), 2), "%")))

Gender Demographic:

The ratio of male to female users seems to continue to widen. This indicates a lack of awareness of the service for female customers. It might be in the company’s best interest to advertise aggressively to female audience in order to improve the sex ratio.

ggplot(data = Trips) +
  geom_bar(mapping = aes(x = gender, fill = usertype), stat = "count")

Trip Duration:

Looking at the trip durations corresponding to the trip ids we’re able to witness a significant peek in trip durations around 1 x 10 ^ 7 trip_id.

Upon further investigation, the same is observed every year during the 7th month. This is perhaps due to the summer season and people being relatively more active and willing to use a bike for their daily commute. Furthermore, Students generally are on summer break during this period further explaining the high amount of trip durations during this period.

It would be in the company’s best interest to advertise for the summer season and ensure sufficient supply of bikes at each station during these periods due to high demand.

ggplot(data = Trips, aes(trip_id, tripduration)) +
        geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'