Visualizing 18.7 million GPS points
This is the latest in a series of posts on Austin BCycle’s bike rental data. I’ve worked with Austin BCycle on previous articles covering bike stations, bike rentals, and using machine learning to predict hourly rentals.
It’s been a while since I worked on this project, but Nick from BCycle provided me with a new dataset, and I couldn’t resist taking a look. The new data is a set of GPS points recorded from bikes in the BCycle system. I hadn’t realized the bikes had GPS trackers in them, but it makes sense to equip them with trackers, for monitoring and theft prevention purposes.
The GPS data ranges from the launch of BCycle on December 21st, 2013 to July 30th, 2016. The data is stored in a set of CSV files totalling 7.3 GB in size. For each trip, the data contains:
- Bike ID of the bike used for the trip.
- User ID of the person renting the bike.
- Rental and return kiosk: Where the trip began and ended.
- Rental and return date and time: when the bike was rented, and when it was dropped off at the destination station.
- Membership type: Austin BCycle offers a range of memberships, from 24 hour one-time passes to yearly subscriptions.
- GPS points recorded during each trip. These are spaced approximately 5 seconds apart, giving a fine resolution.
I also have a dataset from an earlier article which includes every trip from BCycle’s launch to the end of 2016. This dataset doesn’t include the User ID or GPS points, but it’s useful to compare which bikes have GPS data and which don’t.
As in previous articles, all the code and Jupyter notebooks can be found on my GitHub. For this article I’m not going to delve into the coding, and will instead focus on the visualizations and analysis.
GPS-enabled bikes
The first thing I was curious about is how many of the BCycle bikes had GPS trackers. I made the assumption that any bike in the GPS trip dataset has a GPS sensor built in, and that bikes don’t have GPS added or removed over time.
There are 408 bikes in total in the data, of which 316 have at least one GPS-enabled trip. This works out to 77% of BCycle bikes being equipped with GPS.
I then checked the proportion of trips that were made by bikes with GPS tracking data. Of 548,158 total trips, 435,125 have GPS tracker data. This works out to 79% of the trips.
BCycle users don’t know which bikes have GPS and which don’t, so the proportion of trips with GPS data should be approximately the same as the proportion of bikes with GPS, assuming the GPS bikes are spread around Austin randomly. This is the case in the data: the 77% of bikes that are GPS-enabled made 79% of the trips.
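As a rough illustration of the two proportions above, here is a minimal sketch in plain Python. It is not the code from my notebooks: the `gps_coverage` function, the `(trip_id, bike_id)` record layout, and the toy data are all assumptions made for this example. Only the four totals at the bottom come from the article.

```python
def gps_coverage(all_trips, gps_trip_ids):
    """Return (share of bikes with GPS, share of trips with GPS).

    all_trips: iterable of (trip_id, bike_id) pairs (hypothetical layout).
    gps_trip_ids: set of trip ids that have GPS points recorded.
    A bike counts as GPS-enabled if it has at least one GPS trip.
    """
    all_bikes, gps_bikes = set(), set()
    total_trips = gps_trips = 0
    for trip_id, bike_id in all_trips:
        all_bikes.add(bike_id)
        total_trips += 1
        if trip_id in gps_trip_ids:
            gps_bikes.add(bike_id)
            gps_trips += 1
    return len(gps_bikes) / len(all_bikes), gps_trips / total_trips


# Toy data: bikes A and C have GPS trips, bike B has none.
toy_trips = [(1, "A"), (2, "A"), (3, "B"), (4, "C"), (5, "C")]
toy_bike_share, toy_trip_share = gps_coverage(toy_trips, {1, 2, 4})

# The same ratios computed from the totals quoted in the article.
article_bike_share = 100 * 316 / 408
article_trip_share = 100 * 435125 / 548158
print(f"Bikes with GPS: {article_bike_share:.0f}%")
print(f"Trips with GPS: {article_trip_share:.0f}%")
```

If the two percentages were far apart, that would suggest GPS bikes were being rented more or less often than the others.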
In the non-GPS trip data I had the start and end stations for each trip, but no information on the route the user took in between. I could have used an API such as Google’s Distance Matrix API to calculate the shortest route between the two stations. For commuting trips this is a reasonable assumption, but for recreational trips it may not hold, and Austin BCycle usage on weekends is higher than during the week (see this article).
The “shortest distance” assumption also can’t be applied to “out-and-back” trips where the rental and return stations are the same. These trips account for 15% of overall trips in the non-GPS dataset. Maybe the users parked their car near the Town Lake cycling path and went for a spin around the trails there. We can’t calculate the distance for these trips without knowing the actual course the users took.
Fortunately for us, the GPS trips data includes positions reported at approximately 5-second intervals. We can trace the route a user took by linking these points together with straight lines, then adding up the lengths of those lines. The plot below shows the kernel density plot of trip distances, by membership type.
The plots show that recurring members take shorter trips, with their density concentrated more heavily at the short-distance end than that of the one-time members. One-time members take longer trips, with a heavier tail extending to the right. This may be because one-time members are using the bikes for recreational trips, whose distances are longer.
We now have accurate trip distances for every trip in the GPS dataset. We can use the trip durations to calculate the average speed of each trip, then split the trips by one-time and recurring memberships to see how they vary.
The plots show that the one-time members on average have slower-speed trips. The median one-time member speed is 5.2 mph, compared with 7.4 mph for recurring members. One-time members also have more slower-speed trips.
While it’s interesting to look at the statistics and visualizations of the trips, it’s also fun to see them animated. I used Carto to load up the GPS tracks from the first week of May 2016. The bar graph on the bottom shows how many trips there were on each of the days, with a peak in trips on Friday, Saturday, and Sunday. Each GPS position is linked to the time, and shown as a blue dot on the map.
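The speed comparison above comes down to dividing each trip’s distance by its duration and taking the median per membership group. Here is a minimal sketch, assuming each trip is a `(group, distance_miles, duration_seconds)` tuple. That layout, the function names, and the toy numbers are assumptions for illustration, not the real data.

```python
from statistics import median


def average_speed_mph(distance_miles, duration_seconds):
    """Average speed over the whole rental, or None for a zero-length duration."""
    if duration_seconds <= 0:
        return None
    return distance_miles / (duration_seconds / 3600.0)


def median_speed_by_group(trips):
    """Median of per-trip average speeds, keyed by membership group."""
    speeds = {}
    for group, distance_miles, duration_seconds in trips:
        speed = average_speed_mph(distance_miles, duration_seconds)
        if speed is not None:
            speeds.setdefault(group, []).append(speed)
    return {group: median(values) for group, values in speeds.items()}


# Toy trips: (membership group, miles, seconds). Not real BCycle data.
toy_trips = [
    ("one-time", 2.0, 1800),
    ("one-time", 3.0, 1800),
    ("recurring", 1.0, 600),
    ("recurring", 2.0, 900),
    ("recurring", 1.5, 0),  # skipped: no usable duration
]
medians = median_speed_by_group(toy_trips)
print(medians)
```

Because the duration runs from rental to return, any time spent stopped counts against the average. That may be part of why recreational trips come out slower.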
There are a couple of ways to interact with the map:
- To start the animation, click on the ‘play’ button on the bottom left. The bike trails will animate, showing the bike trips as they cycle around Austin.
- To view the aggregated results over a period of time, drag a horizontal line over the bar graph below the map. This shows all the GPS points that were tracked in that period of time.
Visualizing all 18.7 million GPS points
Visualizing millions of points on a map effectively is a challenge, both in the compute time needed to produce the map and in how well the result shows the details of the dataset.
One approach is to start with a map that has a low brightness. Each GPS position can be plotted in white with a low alpha value. As the GPS position density increases, the area becomes progressively brighter as the points overlay each other. Drawing points on top of each other like this is known as overplotting.
The plot below shows how this gives an X-ray appearance to the plot. The downside is that there’s no way to control how the brightness scales with the amount of overplotting. The alpha values add up linearly, but in some cases a log transform might help preserve the low-density details without washing out the high-density areas.
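To make the linear-versus-log trade-off concrete, here is a small sketch comparing the two mappings from “number of overlapping points” to brightness. It uses a simplified additive model based on the description above, where each point adds a fixed alpha and the result is clipped at full white. A real plotting library may composite alpha differently. The function names and the counts are invented for the example.

```python
import math


def linear_brightness(count, alpha):
    """Each overlapping point adds `alpha`; brightness is clipped at 1.0 (white)."""
    return min(1.0, count * alpha)


def log_brightness(count, max_count):
    """Brightness proportional to log(1 + count), scaled so max_count maps to 1.0."""
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


# Hypothetical point counts per pixel: a quiet trail up to a busy downtown street.
counts = [1, 10, 100, 1000]
linear = [linear_brightness(c, alpha=0.01) for c in counts]
logged = [log_brightness(c, max_count=max(counts)) for c in counts]

for c, lin, lg in zip(counts, linear, logged):
    print(f"{c:>5} points  linear={lin:.3f}  log={lg:.3f}")
```

With linear accumulation the two busiest pixels both clip to white and the quietest is nearly invisible. The log mapping keeps all four levels distinguishable.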
After searching online I found an alternative library called Datashader. This uses a different approach to the overplotting+alpha shown above, and is much faster. There’s also a great tutorial on plotting the NYC Taxi dataset which helped in making the plot below. The histogram equalization feature Datashader supports brings out the low-density areas on the map, while avoiding the wash-out of the roads in the downtown area.
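The idea behind this approach is to count the points falling in each pixel first, and only then decide how to map counts to brightness. The sketch below illustrates that with plain Python. It is not Datashader’s implementation, and `rasterize`, `equalize`, and the sample points are hypothetical names and data for this example. In Datashader itself, as far as I’m aware, the equivalent shading option is selected with `how='eq_hist'`.

```python
from bisect import bisect_right


def rasterize(points, lat_range, lon_range, rows, cols):
    """Count (lat, lon) points per grid cell; points outside the ranges are dropped."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    counts = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        r = min(rows - 1, int((lat - lat_min) / (lat_max - lat_min) * rows))
        c = min(cols - 1, int((lon - lon_min) / (lon_max - lon_min) * cols))
        counts[r][c] += 1
    return counts


def equalize(counts):
    """Map each non-zero count to its rank fraction among all non-zero counts.

    Empty cells stay at 0.0, so the background remains dark.
    """
    nonzero = sorted(c for row in counts for c in row if c > 0)
    if not nonzero:
        return [[0.0 for _ in row] for row in counts]
    n = len(nonzero)
    return [
        [bisect_right(nonzero, c) / n if c > 0 else 0.0 for c in row]
        for row in counts
    ]


# Toy points on a unit square, binned to a 2x2 grid.
toy_points = [(0.1, 0.1), (0.1, 0.1), (0.9, 0.9), (2.0, 2.0)]
grid = rasterize(toy_points, (0.0, 1.0), (0.0, 1.0), rows=2, cols=2)

# A made-up count grid with a huge dynamic range, like trails versus downtown.
skewed = [[0, 1, 2], [5, 50, 5000]]
shaded = equalize(skewed)
print(grid)
print(shaded)
```

Scaling the skewed grid by its maximum would leave every cell except the busiest one almost black. Ranking the counts instead spreads them evenly across the brightness range.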
I hope you’ve enjoyed this look at the GPS routes Austin BCycle users have taken over the past few years. As always, the code to generate the plots and maps can be found on my GitHub. If you liked the article, please recommend it below (the heart icon), and follow me for updates on new articles on data analysis and visualization.