Introduction
Many cities around the world have introduced bike-share schemes in the past decade. City governments see these schemes as a means to promote several policy goals, including improved connectivity, better public health, reduced pollution, and enjoyment for residents and tourists \cite{Fishman_2013}. New York joined this trend in May 2013 when it launched Citibike with an initial 6,000 cycles at more than 100 stations in Manhattan.
The scheme subsequently expanded to 10,000 cycles. The distance travelled on Citibikes has now surpassed 3,300,000 miles, equating to just over 13 times the distance between the earth and the moon (www.citibikeblog.tumblr.com).
A key consideration for city government is to ensure equitable provision of services such as Citibike to all citizens. The degree to which Citibike is used by different demographic groups can be influenced by behavioral, physical and economic factors, but these have received little independent study \cite{Faghih_Imani_2016}. We therefore examined usage of Citibike by gender to determine whether take-up of this service differs systematically between men and women. This question is of policy relevance to New York City government and civic groups, and of operational relevance to the Citibike franchise operator, given the shared interest of these actors in ensuring Citibike is a service for all New Yorkers and visitors (not just a subset of them).
Data
The dataset comprises trip duration, start and stop times and locations, and information on the customers making the journeys, including gender, year of birth and subscription type.
A key data issue for our analysis is the nature of the gender field, which records values of '1' for male, '2' for female, and '0' for unknown. The data was provided in CSV form.
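As an illustration of the gender coding described above, the following minimal sketch reads a CSV and translates the numeric codes into labels. The column names `tripduration` and `gender`, the helper `load_trips`, and the inline sample rows are assumptions made for the example; they are not taken from the dataset documentation and this is not necessarily how the analysis was implemented.

```python
import csv
import io

# Coding of the gender field as described in the text.
GENDER_LABELS = {"0": "unknown", "1": "male", "2": "female"}

# Tiny inline sample standing in for the real CSV file. The column names
# "tripduration" and "gender" are assumptions, and the rows are made up.
SAMPLE_CSV = """tripduration,gender,birth year
634,1,1983
1547,2,1990
412,0,
"""


def load_trips(handle):
    """Return a list of (duration_seconds, gender_label) tuples."""
    trips = []
    for row in csv.DictReader(handle):
        # Any code outside 0/1/2 is treated as unknown (an assumption).
        label = GENDER_LABELS.get(row["gender"].strip(), "unknown")
        trips.append((int(row["tripduration"]), label))
    return trips


trips = load_trips(io.StringIO(SAMPLE_CSV))
print(trips)
```

Keeping the '0' (unknown) rows as a separate label, rather than dropping them at load time, leaves the decision about how to treat them to the later analysis.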
Several data cleaning tasks were carried out. Fields other than gender and trip duration were dropped. The dataset initially comprised 3,293,678 rows. Descriptive summary statistics were generated (Figure 1) and suggested the presence of significant outliers, indicated by extremely high maximum values exceeding 3,000,000 seconds (more than 800 hours). On further manual inspection of the data, it was determined that these outlying values likely derive from errors such as docking stations not working properly, or from lost cycles.
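The summary and outlier check described above can be sketched as follows. The duration values are illustrative only, and the 24-hour cutoff `MAX_PLAUSIBLE_SECONDS` is a hypothetical threshold chosen for the example; the text does not state which cutoff, if any, was applied.

```python
import statistics


def summarize(durations):
    """Return count, mean, median, min and max of trip durations in seconds."""
    return {
        "count": len(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Illustrative values only; the last one mimics an undocked or lost cycle.
durations = [420, 615, 980, 1310, 3_100_000]

# Hypothetical cutoff (24 hours); the text does not state the threshold used.
MAX_PLAUSIBLE_SECONDS = 24 * 60 * 60

summary = summarize(durations)
cleaned = [d for d in durations if d <= MAX_PLAUSIBLE_SECONDS]

print(summary)
print(f"max in hours: {summary['max'] / 3600:.0f}")
print(f"dropped {len(durations) - len(cleaned)} of {len(durations)} rows")
```

A single extreme value pulls the mean far above the median, which is the kind of signal in a descriptive summary that points to outliers of this sort.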