Tidytuesday Philly Parking Tickets
Tidytuesday Philly Parking Tickets
By: Patrick Ayers
12/3/2019
Basic Over Data
This week I will be looking at the Philadelphia parking violations data set. This data has all of the parking violations for Philadelphia for the year 2017. The whole data set was over 1GB in size, but it was cut down to less than 100Mb due to Git Hub restrictions. If you would like to see the full data set or who uploaded it to Git hub see https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-12-03.
Let's Take a look at what the structure of the data is. We can see that there are a total of seven variables and a bit over 1.2 million observations.
As always, we can look at the number of missing values. Looking at the table below, we can see that almost all of the data is coming from zip_code variables. Given that we have both latitude and longitudes values, we can use them to find the zip code they are connected too.
variables | missing_values |
---|---|
violation_desc | 0 |
issue_datetime | 0 |
fine | 0 |
issuing_agency | 1 |
lat | 0 |
lon | 0 |
zip_code | 173588 |
Let's take a more in-depth look at the variable of violations_desc and see what violations include, or if we need to do many more data cleaning. There are 96 unique values for these variables.
unique(tickets$violation_desc)
## [1] "BUS ONLY ZONE" "STOPPING PROHIBITED" "OVER TIME LIMIT"
## [4] "STOP PROHIBITED CC" "DOUBLE PARKED" "OVER TIME LIMIT CC"
## [7] "FIRE HYDRANT" "METER EXPIRED" "HP RESERVED SPACE"
## [10] "METER EXPIRED CC" "LOADING ZONE CC" "STREET CLEANING"
## [13] "PASSENGR LOADNG ZONE" "EXPIRED INSPECTION" "SIDEWALK"
## [16] "CORNER CLEARANCE" "PARKING PROHBITED CC" "PARKING PROHBITED"
## [19] "HP RAMP BLOCKED" "CROSSWALK" "IMPROPER ON 2WAY HWY"
## [22] "TAXI STAND" "BLOCKING DRIVEWAY" "SIDEWALK CC"
## [25] "PRIVATE PROPERTY" "SCHOOL ZONE" "BUS ONLY ZONE CC"
## [28] "UNREG/ABANDONED VEH" "EXPIRED TAG" "DOUBLE PARKED CC"
## [31] "LOADING ZONE" "CAR SHARE VEHS ONLY" "VALET ZONE VIOLATION"
## [34] "IMPROPER ON 1WAY HWY" "SNOW EMERGENCY ROUTE" "STOP MEDIAL STRIP"
## [37] "STOP/BLOCK HIGHWAY" "SNOW RELOCATION TOW" "CORNER CLEARANCE CC"
## [40] "TAXI NOT AT STAND" "BUS NOT IN BUS STAND" "STOP IN INTERSECTION"
## [43] "DISPLAY VEH FOR SALE" "STOP ON R/R TRACK 44" "WASH/REPAIR VEH"
## [46] "PARKING PROHIBITED" "BLOCKING DRIVEWAY CC" "BLOCKNG MASS TRANSIT"
## [49] "CROSSWALK CC" "COUNTERFEIT HP PERM" "COMMRCL VEH RES AREA"
## [52] "ON LIMITD ACCESS HW" "ILLEGAL PLACED TICKT" "PK NEAR FIRE STAT 51"
## [55] "UNMETERED SPACE" "SUSPENDED LICENSE" "VALET VIOLATION"
## [58] "PKG W/LIC SUSPD 75" "PARK MOBILE HOME 28" "EXCESSIVE IDLING"
## [61] "BUS-IMP PSNG DISCHRG" "DISOBEY SIGN-PRHB 21" "PARKED ON BRIDGE 14"
## [64] "STOP/BLOCK HIWY CC" "PARK METER IMPROPER" "EXCESSIVE NOISE"
## [67] "UNATTENDED VEH 29" "PARENTAL LIABILITY" "PARK PROHIBITED 04"
## [70] "IMPROPER-2WAY HWY 74" "PARKED ON GRASS" "STOPPED IN SAFE ZONE"
## [73] "OBSTRUCT TROLLEY 52" "UNLAWFUL ALARM" "FRAUD PARK HP SPACE"
## [76] "PARK PROHIB PLACE" "INVALID CONT PERMIT" "INVALID KIOSK RECPT"
## [79] "DAMAGE TO METER" "PK NEAR TRAFF SIGNAL" "INVALID RPP PERMIT"
## [82] "DISOBEY SN-STANDING" "BUS-IMPROPER LOADING" "ILLEGAL PLACD TKT 73"
## [85] "HIWAY CONSTRCTION 43" "PARK-TROLLEY STOP 23" "STOP PROHIBITED CC"
## [88] "OWNER ID ON COM VEH" "STOP/BLOCK HWY 33" "IMPROPER MOTORCYCLE"
## [91] "PK OVER TIME LIM 17" "+4HR IN LOADING ZONE" "PRK MTR IMPROPER 11"
## [94] "INTERSECTION CC" "STOPPED SAFE ZONE CC"
Adding a new Variable
One thing I am seeing is how some of the values have “CC” after them, such as SIDEWALK CC vs. SIDEWALK or LOADING ZONE CC vs. LOADING ZONE. If you go and look at the http://www.philapark.org/faq/ we can see that there are different ticket values for non Center City and Center City. We can look at some of the violations that have the letters CC and see if there is a difference in the fines.
tickets %>% filter(violation_desc %in% c("METER EXPIRED CC","METER EXPIRED","DOUBLE PARKED CC","DOUBLE PARKED","SIDEWALK CC","SIDEWALK","OVER TIME LIMIT CC","OVER TIME LIMIT" )) %>%
group_by(violation_desc) %>% summarise(mean_value = dollar(mean(fine))) %>% kable()
violation_desc | mean_value |
---|---|
DOUBLE PARKED | $51 |
DOUBLE PARKED CC | $76 |
METER EXPIRED | $26 |
METER EXPIRED CC | $36 |
OVER TIME LIMIT | $26 |
OVER TIME LIMIT CC | $36.00 |
SIDEWALK | $51 |
SIDEWALK CC | $76 |
I am seeing that the difference between the fines is about $25.00, which is matching up with what is said on the Philadelphia website. With this information, we can now and add a new variable called City_center to the data. I will also be removing “CC” form violation_desc because that information is now in the variable.
Tickets <- tickets %>%
mutate('City_Center' = if_else(str_detect(violation_desc,"CC"),"City_center","Not_city_center"))
#glimpse(Tickets)
Tickets$violation_desc <- trimws(str_remove(Tickets$violation_desc,"CC"))
kable(head(Tickets[,c(1,2,3,4,8)]))
violation_desc | issue_datetime | fine | issuing_agency | City_Center |
---|---|---|---|---|
BUS ONLY ZONE | 2017-12-06 12:29:00 | 51 | PPA | Not_city_center |
STOPPING PROHIBITED | 2017-10-16 18:03:00 | 51 | PPA | Not_city_center |
OVER TIME LIMIT | 2017-11-02 22:09:00 | 26 | PPA | Not_city_center |
OVER TIME LIMIT | 2017-11-05 20:19:00 | 26 | PPA | Not_city_center |
STOP PROHIBITED | 2017-10-17 06:58:00 | 76 | PPA | City_center |
DOUBLE PARKED | 2017-10-02 10:40:00 | 51 | POLICE | Not_city_center |
Diving into the Data
Now that we have cleaned the data let's take a look into what we can learn from this data set. First, let's see just the general information about the data set and see what the top 10 violation is.
Tickets %>% group_by(City_Center) %>% summarise(total_fine = dollar(sum(fine)),
avg_fine = dollar(mean(fine)),
total_fins_given = n()) %>% kable()
City_Center | total_fine | avg_fine | total_fins_given |
---|---|---|---|
City_center | $24,759,362 | $48.96 | 505697 |
Not_city_center | $32,498,768 | $43.03 | 755194 |
Tickets %>% group_by(violation_desc) %>% summarise("Violation_Count" = n()) %>%
arrange(desc(Violation_Count)) %>% top_n(10) %>% kable()
violation_desc | Violation_Count |
---|---|
METER EXPIRED | 462389 |
OVER TIME LIMIT | 181444 |
EXPIRED INSPECTION | 138575 |
STOP PROHIBITED | 115899 |
PARKING PROHBITED | 92314 |
STOPPING PROHIBITED | 47395 |
BUS ONLY ZONE | 30161 |
SIDEWALK | 24412 |
PASSENGR LOADNG ZONE | 24359 |
FIRE HYDRANT | 20945 |
Meter expired is the most common ticket that was given out in 2017, let's add to this and see how much money did the fines generate. We can see that Meter Expired makes almost 15 million dollars, and the next highest violation amount is Stop Prohibited at 8.8 million dollars.
Tickets %>% group_by(violation_desc) %>% summarise("Violation_Count" = n(),
"Total_fines" = dollar(sum(fine))) %>%
arrange(desc(Violation_Count)) %>% top_n(10,Violation_Count) %>% kable()
violation_desc | Violation_Count | Total_fines |
---|---|---|
METER EXPIRED | 462389 | $14,832,714 |
OVER TIME LIMIT | 181444 | $4,963,384 |
EXPIRED INSPECTION | 138575 | $5,681,835 |
STOP PROHIBITED | 115899 | $8,808,249 |
PARKING PROHBITED | 92314 | $4,235,694 |
STOPPING PROHIBITED | 47395 | $2,417,135 |
BUS ONLY ZONE | 30161 | $1,795,536 |
SIDEWALK | 24412 | $1,387,412 |
PASSENGR LOADNG ZONE | 24359 | $755,129 |
FIRE HYDRANT | 20945 | $1,591,820 |
This data was at the total level; let's take advantage of the new variable of City center and see how different the parking violations differences between city centers and non-city centers.
Tickets %>% filter(City_Center == "City_center") %>% group_by(violation_desc) %>%
summarise("Violation_Count" = n(),
"Total_fines" = dollar(sum(fine))) %>%
arrange(desc(Violation_Count)) %>% top_n(10,Violation_Count) %>% kable()
violation_desc | Violation_Count | Total_fines |
---|---|---|
METER EXPIRED | 281060 | $10,118,160 |
STOP PROHIBITED | 115899 | $8,808,249 |
PARKING PROHBITED | 45082 | $2,299,182 |
OVER TIME LIMIT | 24585 | $885,050 |
LOADING ZONE | 12826 | $654,126 |
BUS ONLY ZONE | 10294 | $782,319 |
DOUBLE PARKED | 7841 | $595,916 |
SIDEWALK | 5696 | $432,896 |
BLOCKING DRIVEWAY | 958 | $72,808 |
CORNER CLEARANCE | 812 | $61,712 |
Tickets %>% filter(City_Center == "Not_city_center") %>% group_by(violation_desc) %>%
summarise("Violation_Count" = n(),
"Total_fines" = dollar(sum(fine))) %>%
arrange(desc(Violation_Count)) %>% top_n(10,Violation_Count) %>% kable()
violation_desc | Violation_Count | Total_fines |
---|---|---|
METER EXPIRED | 181329 | $4,714,554 |
OVER TIME LIMIT | 156859 | $4,078,334 |
EXPIRED INSPECTION | 138575 | $5,681,835 |
STOPPING PROHIBITED | 47395 | $2,417,135 |
PARKING PROHBITED | 47232 | $1,936,512 |
PASSENGR LOADNG ZONE | 24359 | $755,129 |
FIRE HYDRANT | 20945 | $1,591,820 |
BUS ONLY ZONE | 19867 | $1,013,217 |
SIDEWALK | 18716 | $954,516 |
HP RESERVED SPACE | 14554 | $4,380,754 |
We can see that the top 3 violations are almost entirely different. The top 3 parking violations for city centers are Meter expired, stop prohibited, and parking prohibited. It is comparing that to non-city centers that have Meter expired, overtime limit, and expired inspection. I do find it interesting that that double-parked is in the top 10 for city centers but not for non-city centers.
The biggest take away that I am seeing from comparing the two tables is that the City centers violations generate almost 60% of all Meter expired violations.Lastly, we can create a bar graph of the show the differences between the city center and non-city centers. This plot is a great way to see the different distribution between the two different location types.
p1 <- Tickets %>%
select(violation_desc,City_Center) %>%
group_by(City_Center) %>%
count(violation_desc) %>%
arrange(desc(n)) %>%
top_n(5) %>%
filter(City_Center == 'City_center') %>%
ggplot(aes(violation_desc,n)) + geom_bar(stat = 'identity') + ggtitle("Top 5 Violations for City Centers") + ylim(0,300000) +
scale_y_continuous(breaks = c(50000,100000,150000,200000,250000,300000,350000)) + ylab("Number of violations") + xlab("Violation")
p2 <- Tickets %>%
select(violation_desc,City_Center) %>%
group_by(City_Center) %>%
count(violation_desc) %>%
arrange(desc(n)) %>%
top_n(5) %>%
filter(City_Center != 'City_center') %>%
ggplot(aes(violation_desc,n)) + geom_bar(stat = 'identity') + ggtitle("Top 5 Violations for non-City Centers") +
scale_y_continuous(breaks = c(50000,100000,150000,200000,250000)) + ylab("Number of violations") + xlab("Violation")
ggarrange(p1,p2,nrow = 2)
Final Thoughts
This was a great data set to dig into — a lot of great information and many different ways to analyze the data. I did not even go into the variable of issuing_agency or latitude and longitudes. There is still more analysis that can be done on this, such as a heat map to see the area of Philadelphia has the highest density of violations. There can even be an analysis of the date-times that are included with the data set. A lot of fun working on this data, and lastly, if you are in Philadelphia, be shear to read the parking signs to avoid a ticket.