2016 Yearly Analysis
Preparing Data for Analysis:
To prepare the data, we load every Excel workbook for 2016 and stack the sheets into a single data frame using the rbind function.
First, we collect the paths to all of the 2016 files; we then read each file in turn and append its rows to a variable named Trips.
library("readxl")
library("ggplot2")
library("dygraphs")
setwd("D:/Case Study")
# Collect the paths to the 2016 workbooks (monthly files for April to June, quarterly files for Q1, Q3 and Q4).
paths2016 <- vector()
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_04.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_05.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_06.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q1.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q3.xlsx", sep = ""))
paths2016 <- c(paths2016, paste(getwd(),"/Yearly_Data/2016/Quarterly/Divvy_Trips_2016_Q4.xlsx", sep = ""))
AllPaths <- c(paths2016)
# Start with an empty data frame and append the rows of each workbook.
# The first 11 column types are guessed; the 12th (birthyear) is forced to numeric.
Trips <- data.frame()
for(i in AllPaths)
{for(Mypath in i)
{Trips <- rbind(Trips,read_excel(Mypath, col_types = c(rep("guess", 11),"numeric")))
} }
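The loading step can also be written without growing Trips inside a loop. The following is a minimal sketch, not the method used for the results in this report: `load_trips` and `fake_reader` are hypothetical names introduced here for illustration, and the sketch assumes that every workbook has the same columns.

```r
# Hypothetical helper: read every workbook in `paths` and stack the rows.
# `reader` defaults to readxl::read_excel; any function(path) that returns
# a data frame can be substituted. Extra arguments are passed to `reader`.
load_trips <- function(paths, reader = readxl::read_excel, ...) {
  frames <- lapply(paths, reader, ...)  # one data frame per file
  do.call(rbind, frames)                # a single rbind over all files
}

# Usage with the files from this section (not run here):
# Trips <- load_trips(paths2016, col_types = c(rep("guess", 11), "numeric"))

# Small self-check with an in-memory stand-in for the Excel reader.
fake_reader <- function(path) data.frame(trip_id = 1:2, source = path)
demo <- load_trips(c("a.xlsx", "b.xlsx"), reader = fake_reader)
nrow(demo)
```

Binding once at the end avoids copying the accumulated rows on every iteration, which may matter with roughly three million rows.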
Summary of Data:
summary(Trips)
##  trip_id            start_time         end_time           bikeid
##  Min.   : 8547211   Length:3202147     Length:3202147     Min.   :   1
##  1st Qu.: 9604498   Class :character   Class :character   1st Qu.:1496
##  Median :11058250   Mode  :character   Mode  :character   Median :2960
##  Mean   :10820326                                         Mean   :2931
##  3rd Qu.:12011692                                         3rd Qu.:4329
##  Max.   :12979228                                         Max.   :5920
##
##  tripduration      from_station_id   from_station_name   to_station_id
##  Min.   :   60.0   Min.   :  2.0     Length:3202147      Min.   :  2.0
##  1st Qu.:  400.0   1st Qu.: 74.0     Class :character    1st Qu.: 74.0
##  Median :  688.0   Median :157.0     Mode  :character    Median :157.0
##  Mean   :  972.6   Mean   :177.9                         Mean   :178.3
##  3rd Qu.: 1151.0   3rd Qu.:268.0                         3rd Qu.:268.0
##  Max.   :86365.0   Max.   :620.0                         Max.   :620.0
##
##  to_station_name    usertype           gender             birthyear
##  Length:3202147     Length:3202147     Length:3202147     Min.   :1899
##  Class :character   Class :character   Class :character   1st Qu.:1974
##  Mode  :character   Mode  :character   Mode  :character   Median :1984
##                                                           Mean   :1980
##                                                           3rd Qu.:1988
##                                                           Max.   :2000
##                                                           NA's   :722944
Analysis Metrics:
- Customer-Sub Ratio
- Gender Demographic
- Trip Duration
Customer-Sub Ratio:
Here we observe a significant improvement in the Customer-Subscriber Ratio.
2015: 59.14-40.86 %
2016: 77.41-22.59 %
This suggests that users tend to subscribe after they try out the service. Therefore, it might be in the company’s best interest to bring in more customers, for example by handing out free trials or hosting public awareness events.
# Count trips by user type; every value other than "Subscriber" is counted as a customer.
UserTypeCol <- Trips$usertype
Subs <- 0
Customers <- 0
for(i in UserTypeCol)
if(i == "Subscriber")
Subs <- Subs + 1 else
Customers <- Customers + 1
# Pie chart labelled with each group's percentage share.
pie(c(Subs,Customers),labels = c(paste("Subscribers = ", round(Subs*100/(Subs + Customers), 2), "%"), paste("Customers = ", round(Customers*100/(Subs + Customers), 2), "%")))
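The same percentages can be obtained without an explicit loop. This is a minimal sketch on toy data, not the Divvy figures; `usertype_share` is a hypothetical helper name introduced here.

```r
# Hypothetical helper: percentage share of each user type, rounded to 2 places.
usertype_share <- function(usertype) {
  counts <- table(usertype)            # one count per distinct value
  round(100 * prop.table(counts), 2)   # convert counts to percentages
}

# Toy data for illustration only.
toy <- c("Subscriber", "Subscriber", "Subscriber", "Customer")
shares <- usertype_share(toy)
shares
```

Note one difference from the loop in this section: the loop counts every value other than "Subscriber" as a customer, whereas table() reports each distinct usertype value separately.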
Gender Demographic:
The gap between male and female users appears to keep widening. This may indicate a lack of awareness of the service among female customers. It might be in the company’s best interest to advertise more actively to a female audience in order to balance the gender ratio.
ggplot(data = Trips) +
geom_bar(mapping = aes(x = gender, fill = usertype), stat = "count")
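To put a number on the gap shown in the bar chart, the male-to-female ratio can be computed directly. This is a minimal sketch on toy data; `gender_ratio` is a hypothetical helper, and the labels "Male" and "Female" are an assumption about how the gender column is coded.

```r
# Hypothetical helper: male-to-female trip ratio.
# ASSUMPTION: the gender column uses the labels "Male" and "Female".
gender_ratio <- function(gender) {
  counts <- table(gender)   # table() drops NA values by default
  as.numeric(counts["Male"] / counts["Female"])
}

# Toy data for illustration only.
toy_gender <- c("Male", "Male", "Male", "Female", NA)
gender_ratio(toy_gender)
```

Computing this ratio for each year would make the year-over-year comparison explicit.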
Trip Duration:
Plotting trip duration against trip id, we see a significant peak in trip durations around trip_id 1 x 10^7.
Upon further investigation, the same peak is observed every year during the 7th month. This is perhaps due to the summer season, when people are relatively more active and more willing to use a bike for their daily commute. Furthermore, students are generally on summer break during this period, which may further explain the longer trip durations.
It would be in the company’s best interest to advertise for the summer season and to ensure a sufficient supply of bikes at each station during these high-demand periods.
ggplot(data = Trips, aes(trip_id, tripduration)) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
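The month-level pattern described above can be checked by averaging trip duration per calendar month instead of reading it off the trip ids. This is a minimal sketch on toy data; `mean_duration_by_month` is a hypothetical helper, and the timestamp format is an assumption that should be adjusted to match the actual start_time strings.

```r
# Hypothetical helper: mean trip duration per calendar month.
# ASSUMPTION: start_time looks like "7/4/2016 10:15"; change `fmt` if it does not.
mean_duration_by_month <- function(trips, fmt = "%m/%d/%Y %H:%M") {
  started <- as.POSIXct(trips$start_time, format = fmt, tz = "UTC")
  trips$month <- as.integer(format(started, "%m"))   # 1 = January, 7 = July
  aggregate(tripduration ~ month, data = trips, FUN = mean)
}

# Toy data for illustration only.
toy_trips <- data.frame(
  start_time   = c("1/5/2016 08:00", "7/4/2016 10:15", "7/9/2016 18:30"),
  tripduration = c(300, 900, 1500)
)
monthly <- mean_duration_by_month(toy_trips)
monthly
```

Grouping by month ties the peak to the calendar directly, rather than relying on trip ids being issued in chronological order.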