Data Science
A Python data science project
Objective - Finding the best place to stay in North Carolina: The Battle of Neighborhoods - Wake vs Mecklenburg County
Purpose of the project - I integrated and practiced the Python programming I learned through the IBM Cognitive Class for Data Science
My role - Since this is a solo project, I performed all project roles, from research through findings to the final solution
Tools and Technology - Applying Data Science methodology - working with Jupyter notebooks - to create Python apps - access relational databases using SQL & Python - use Python libraries ( Matplotlib and Seaborn ) to generate data visualizations - perform data analysis using Pandas - construct & evaluate Machine Learning (ML) models using Scikit-learn & SciPy and apply data science & ML techniques to real location data sets.
1. INTRODUCTION
1.1 Scenario and Background
I am a BI Developer and currently live within walking distance of Union Station in downtown Denver, Colorado, with an easy commute to work and access to good public transportation. I also enjoy the many amenities in the neighborhood, including sports bars, restaurants, basketball courts, food and drink shops and places of entertainment.
I have been thinking of moving to North Carolina, but I am a bit stressed about the process of securing a comparable place to live in Raleigh or Charlotte. I therefore decided to apply the skills I learned in the IBM Cognitive Class for Data Science to explore whether my decision to move is a sensible one. Of course, there are other ways to get an answer using Google and social media tools, but digging into the raw data for the state is more rewarding because it gives a clearer picture of what one might face with such a decision.
When looking for the best place to live, many factors are weighed in choosing between urban areas, towns or neighborhoods. Some of these considerations include, but are not limited to:
Overall Comparison:
A comparison of the same factors for each city, giving a general overview of the two urban areas. Common factors include population, cost of living, average rent, crime rate, tax rates and air quality.
Crime Rates:
Here the crime rates of the two cities are compared with each other and then measured against the national statistics.
Cost of Living and Salary Comparison:
This looks at salaries and the typical cost of basic items in each urban area so that a choice can be made. School comparisons, in turn, mostly take into consideration test scores, teacher-to-student ratios and the teachers' experience at the schools in the city of choice.
Neighborhood Comparison:
This compares neighborhoods and helps one choose the best place to live within any given city. Neighborhood comparison sites let you see some interesting facts about the various communities.
1.2 Problem and Purpose of this Project
The dataset includes the coordinates of the cities and neighborhoods in the USA but does not include the venues within these locations. With venue information, it would be easy to learn more about the neighborhoods. For example, how many sports bars and restaurants are there, and are there any basketball courts or playgrounds? What about banks and food and drink shops? Having this data would help us make a better-informed choice about where to move.
The purpose of this project is therefore to find an algorithmic way to use the location coordinates and tag each data point to a neighborhood in two counties in North Carolina: Wake County and Mecklenburg County. The algorithm used is k-means clustering. The main idea is to identify neighborhoods whose venues are clustered around each other, so that one can choose the right neighborhood based on the proximity of amenities and venues.
A. Clustering the Neighborhoods
The k-means clustering algorithm is an unsupervised clustering technique that searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like.
The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
Each point is closer to its own cluster center as compared to other cluster centers in the dataset.
The two assumptions above are the basis of the k-means model.
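To make the two assumptions concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the code used in this project, which relied on Scikit-learn; the function name `kmeans` and the toy `points` are made up for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest cluster center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the arithmetic mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # nothing moved, so the algorithm has converged
            break
        centers = new_centers
    return centers, labels


# Illustrative toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Once the loop converges, every center is the mean of its members and every point is assigned to its nearest center, which is exactly the pair of properties listed above.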
1.3 Interested Audience
Information discovered in this project is useful to any person or entity considering a move to a major city in the United States or anywhere else in the world, since the approach and methodologies used are applicable in all cases. The use of Foursquare data and mapping techniques combined with data analysis will help resolve the key questions.
NB: While all of these analyses are useful for comparing the neighborhoods, there is nothing like visiting the actual city, seeing the neighborhoods and talking to the residents. If it's possible, an in-person visit is highly recommended before making a big move or relocating decision.
2. DATA
A description of the data and its sources that will be used to solve the problem.
The dataset for this project consists of information regarding the cities in the USA obtained from https://simplemaps.com/data/us-cities. Specifically, the data contains: City Name, County Code, County Name, Density, Id, Latitude, Longitude, Source, State Id, State Name, and Timezone. Business intelligence tools were used for geocoding the data to obtain the correct coordinates. The data was then exported and converted into a .json file, read into a pandas dataframe and sliced into Wake and Mecklenburg data for use in the project.
In addition to this data, the Foursquare API will be used to collect venues near the neighborhoods for cluster analysis to be performed on the data.
2.1. US_cities data
Since the dataframe contains information on the whole of the United States, North Carolina (NC), the state of interest, was segmented from it and some of the column names were renamed.
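The segmentation described above can be sketched with pandas as follows. This is a rough illustration rather than the project's actual code: the toy records stand in for the exported .json file, the non-source coordinates are approximate placeholders, and renaming `City Name` to `Neighborhood` is an assumption about which columns were renamed.

```python
import pandas as pd

# Toy records standing in for the exported .json file.
# Coordinates other than Raleigh's and Charlotte's are illustrative placeholders.
records = [
    {"City Name": "Raleigh", "County Name": "Wake", "State Id": "NC",
     "Latitude": 35.8324, "Longitude": -78.6438},
    {"City Name": "Cary", "County Name": "Wake", "State Id": "NC",
     "Latitude": 35.78, "Longitude": -78.81},
    {"City Name": "Charlotte", "County Name": "Mecklenburg", "State Id": "NC",
     "Latitude": 35.2356385, "Longitude": -80.8139485},
    {"City Name": "Denver", "County Name": "Denver", "State Id": "CO",
     "Latitude": 39.74, "Longitude": -104.99},
]
us_cities = pd.DataFrame(records)  # with a real export: pd.read_json(<path to file>)

# Keep North Carolina only and rename a column (assumed rename).
nc_data = us_cities[us_cities["State Id"] == "NC"].rename(
    columns={"City Name": "Neighborhood"}
)

# Slice the state data into the two counties of interest.
wake_data = nc_data[nc_data["County Name"] == "Wake"].reset_index(drop=True)
mecklenburg_data = nc_data[nc_data["County Name"] == "Mecklenburg"].reset_index(drop=True)

print(wake_data)
print(mecklenburg_data)
```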
Figure.0 US_cities data
3. METHODOLOGY
3.1. Exploratory Analysis
Exploratory analysis was performed by examining tables and plots of the downloaded data. This was used to:
Segment the data into Cities in Wake and Mecklenburg Counties in North Carolina,
Identify missing values, verify the quality of the data,
Determine likely approaches to modelling which might best lead to good clustering.
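A quality check of the kind listed above might look like the sketch below. The dataframe is toy data with one deliberately missing latitude, and the exact checks performed in the project are not documented, so treat the column choices and the coordinate-range test as assumptions.

```python
import pandas as pd

# Toy data with one missing latitude (illustrative values only).
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [35.8, None, 35.2],
    "Longitude": [-78.6, -78.8, -80.8],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Coordinates must be present and inside valid ranges before calling a venue API.
# Series.between returns False for missing values, so NaN rows are dropped too.
valid_coords = df["Latitude"].between(-90, 90) & df["Longitude"].between(-180, 180)
clean = df[valid_coords].reset_index(drop=True)

print(missing_per_column)
print(clean)
```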
3.2. Segmenting, Slicing and Visualizing the data
An important part of cluster modelling is the careful selection of variables from the available data. A prerequisite of the study was to use the Foursquare API to collect the venue information, so it is essential that the dataset includes the coordinates of the cities to be studied. The fields included in the data for analysis are: Neighborhood Name, County Name, Density, Latitude, Longitude and State Name.
Figure.1
Figure.2
Figure.3
Folium was used to view the sliced data for both counties.
What is Folium❓
Folium is a powerful Python library that builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Put simply, data is manipulated in Python and then visualized on a Leaflet map via Folium.
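As a rough sketch of that workflow, the snippet below prepares marker data in Python and then draws it with Folium. The helper names `marker_rows` and `build_map` are hypothetical, and the marker styling is an arbitrary choice; only the Raleigh coordinates come from this report.

```python
def marker_rows(records):
    """Turn neighborhood records into (lat, lon, label) tuples for plotting."""
    return [
        (r["Latitude"], r["Longitude"], f'{r["Neighborhood"]}, {r["County Name"]}')
        for r in records
    ]


def build_map(records, center, zoom_start=10):
    """Draw one circle marker per neighborhood on a Leaflet map."""
    import folium  # imported here so the data preparation above runs without it

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    for lat, lon, label in marker_rows(records):
        folium.CircleMarker(
            location=[lat, lon],
            radius=5,
            popup=label,
            color="blue",
            fill=True,
            fill_opacity=0.7,
        ).add_to(fmap)
    return fmap


records = [
    {"Neighborhood": "Raleigh", "County Name": "Wake",
     "Latitude": 35.8324, "Longitude": -78.6438},
]
print(marker_rows(records))
```

With Folium installed, `build_map(records, center=(35.8324, -78.6438)).save("wake_map.html")` would write an interactive map to an HTML file; in a Jupyter notebook the returned map object renders inline.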
Figure.4 Raleigh, North Carolina
The latitude and longitude of Raleigh are 35.8324, -78.6438.
Figure.5 Charlotte, North Carolina
The geographical coordinates of Charlotte, NC are 35.2356385, -80.8139485.
3.3. Neighborhood Exploration and Clustering
For neighborhood exploration, the Foursquare API was used. A GET request was sent to the Foursquare URL to retrieve the category types of venues, limiting the results to 100 venues within a 500-meter radius. Because the aim of the project is to determine the cluster of venues in the neighborhoods, one-hot encoding was performed on the venue categories to get dummy variables for each category. That is to say, the categories were coded into 0s and 1s. The result was then grouped by neighborhood, taking the mean of the frequency of occurrence of each category.
Using the Foursquare API to collect the venues within the neighborhoods in both Wake and Mecklenburg counties returned a large number of results. With the radius defined above, the request returned a venue table of 81 rows and 7 columns for Wake County and 58 rows and 7 columns for Mecklenburg County. The one-hot encoding produced tables of shape (106, 63) for Wake County and (71, 53) for Mecklenburg County, given as (rows, columns).
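The request described above can be sketched as follows. This assumes the Foursquare v2 `venues/explore` endpoint, which this report does not name; the credentials and the version date are placeholders, and `build_explore_url` is a hypothetical helper. The snippet only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumed endpoint: the report does not say which Foursquare URL was called.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lon, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a venue-search URL around one coordinate (radius in meters)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date, placeholder value
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


url = build_explore_url(35.8324, -78.6438, "YOUR_ID", "YOUR_SECRET")
print(url)
```

The resulting URL would then be fetched with an HTTP client and the JSON response flattened into one row per venue, with its neighborhood and category.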
There are (81, 7) rows and columns of venues and neighborhoods in Wake County.
There are (58, 7) rows and columns of venues and neighborhoods in Mecklenburg County
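The one-hot encoding and grouping steps described above can be illustrated on a toy venue table. The neighborhood names and categories below are made up, and the column names are assumptions about how the venue table was laid out.

```python
import pandas as pd

# Toy venue table: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Brewery", "Brewery", "Gym"],
})

# One-hot encode the categories: one 0/1 column per category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Group by neighborhood and take the mean, which gives the
# frequency of occurrence of each category in that neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()

print(onehot.shape)
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so neighborhoods with different venue counts become directly comparable.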
3.4. Clustering of Neighborhoods in Wake County.
To cluster the venue categories in the neighborhoods, the k-means algorithm was employed to group the neighborhoods into four clusters. K-means is an unsupervised machine learning technique that searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like.
The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
Each point is closer to its own cluster center than to other cluster centers in the dataset.
The two assumptions above are the basis of the k-means model. To produce the clusters and visualize them on a map, the sliced Wake and Mecklenburg county data were merged with the grouped venue data. This was done so that the coordinates from the sliced data could be used to plot the clusters on a map.
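A minimal sketch of the clustering and merge steps, using Scikit-learn's `KMeans` with four clusters as described above. The frequency table and coordinates are toy values, and the `Cluster Labels` column name, `n_init` and `random_state` settings are assumptions rather than the project's documented choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy grouped frequency table (illustrative values, one row per neighborhood).
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Park": [0.8, 0.8, 0.1, 0.0, 0.2, 0.0],
    "Brewery": [0.1, 0.1, 0.8, 0.1, 0.2, 0.9],
    "Restaurant": [0.1, 0.1, 0.1, 0.9, 0.6, 0.1],
})

# Toy coordinates standing in for the sliced county data.
coords = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Latitude": [35.80, 35.82, 35.75, 35.90, 35.70, 35.85],
    "Longitude": [-78.60, -78.65, -78.70, -78.55, -78.80, -78.50],
})

# Cluster on the category frequencies only, not on the neighborhood name.
features = grouped.drop(columns="Neighborhood")
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)

# Attach the labels, then merge with the coordinates for mapping.
labelled = grouped[["Neighborhood"]].assign(**{"Cluster Labels": model.labels_})
merged = coords.merge(labelled, on="Neighborhood")

print(merged)
```

The merged table has one row per neighborhood with its coordinates and cluster label, which is the shape needed to color markers by cluster on a map.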
4. RESULT
Figure.7 Wake County data (wake_data)
Figure.8 Mecklenburg County data (mecklenburg_data)
Since the aim of the project is to cluster the neighborhoods, the k-means algorithm is applied to the one-hot encoded venue dataset, assuming there are four different clusters. The tables below show each neighborhood and the cluster label assigned to it after the k-means algorithm was applied. Cluster label ‘0’ represents the 1st cluster and ‘3’ the 4th cluster. The maps then show the neighborhoods with a different marker for each cluster, as in Figure 11.
Figure.9 Neighborhood Clusters for Wake County
Figure.11
Figure.10 Neighborhood Clusters for Mecklenburg County
Examining each cluster of neighborhoods in the analyzed counties showed that each cluster is distinguished by certain venue categories. Names were assigned to each cluster based on these defining categories. Although the 10 most common venues were determined for each neighborhood in this work, the assigned names were based only on the 1st most common venue for ease of name assignment.
Through exploratory analysis, many of the neighborhoods in both Wake and Mecklenburg County appear to fall into the same cluster. Looking at the clusters for Wake County, it becomes clear that the two most common venues in the neighborhoods cover a wide mix of amenities: Scenic Lookouts, Baseball Fields, Mexican Restaurants, Parks, Beer Stores, Basketball Courts, Italian Restaurants, Mobile Phone Shops, Electronic Stores, Yoga Studios and Smoothie Shops. For Mecklenburg County, we have Chinese Restaurants, Beer Gardens, Gyms, Ice Cream Shops, Pools, Bakeries, Wine Shops, Women's Stores and Breweries.
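Ranking the most common venues per neighborhood can be sketched as below. The helper `most_common_venues` and the toy frequencies are illustrative; the project ranked 10 venues, while this example uses 2 to keep the table small.

```python
import pandas as pd


def most_common_venues(row, n):
    """Return the n category names with the highest mean frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()


# Toy grouped frequencies, indexed by neighborhood (illustrative values).
grouped = pd.DataFrame(
    {"Park": [0.6, 0.1], "Brewery": [0.3, 0.5], "Gym": [0.1, 0.4]},
    index=pd.Index(["A", "B"], name="Neighborhood"),
)

top_n = 2  # the report used 10
ranked = pd.DataFrame(
    [most_common_venues(row, top_n) for _, row in grouped.iterrows()],
    index=grouped.index,
    columns=[f"Most Common Venue {i + 1}" for i in range(top_n)],
)

# Naming by the single most common venue, as described above.
first_venue = ranked["Most Common Venue 1"]
print(ranked)
```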
Figure.12 Wake County most common venues
Figure.13 Mecklenburg County most common venues
Figure.14 Mecklenburg County Venue count Bar Graph
Figure.15 Mecklenburg County Venue count clusters
Figure.16 Venues in Mecklenburg County
Figure.17 Wake County Venue Count
Figure.18 Wake County Venues
Other factors which may determine the best place to stay may include the population density in an area in relation to the available venues. One may need to compare the ratio to choose their own preferred place to live.
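One way to compare venues against density is a simple ratio, sketched below on made-up numbers. The column names and the scaling by 1,000 are assumptions, and the units of `Density` depend on the source dataset.

```python
import pandas as pd

# Toy venue counts and densities per neighborhood (illustrative values only).
venue_counts = pd.Series({"A": 12, "B": 4}, name="Venue Count")
density = pd.Series({"A": 1500.0, "B": 250.0}, name="Density")

summary = pd.concat([venue_counts, density], axis=1)

# Venues relative to density, scaled so the numbers are easy to read.
summary["Venues per 1000 Density"] = 1000 * summary["Venue Count"] / summary["Density"]

print(summary.sort_values("Venues per 1000 Density", ascending=False))
```

In this toy case neighborhood B has fewer venues in total but more venues relative to its density, which is the kind of trade-off a reader might want to weigh.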
Figure.19 Wake County density bar graph and dataframe
5. DISCUSSION
So the question is: where should someone considering relocating to North Carolina move, given the choice between Wake and Mecklenburg County? Looking at the two neighborhood maps, a foodie would choose to live in Wake County, since there are a lot of restaurants and food outlets in its neighborhoods. Also, if you enjoy time outdoors, such as scenic viewing and outdoor sports, choosing Wake County as your place of relocation would be a good idea. A couple of breweries found in Mecklenburg County would also appeal to a bibulous person.
However, the decision is left to the individual looking to relocate. In general, though all these analyses are useful, there is nothing like visiting the actual city, seeing the neighborhoods, and speaking with residents. If it's possible, an in-person visit is highly recommended before making a big move.
6. CONCLUSION
The aim of this work is to provide the necessary information on amenities to help people decide on the best place to live or relocate to, should they need to make that decision. Using public datasets obtained from the web made it possible to address a few factors by analyzing the neighborhoods within two major counties in North Carolina, Wake and Mecklenburg, based on the spatial distribution of venues in the chosen neighborhoods. The analysis has shown that, using Folium (a Python library for building quick interactive data visualizations) and the Foursquare API for neighborhood data collection, it is feasible to cluster neighborhood data with known and accepted machine learning techniques.
These results must be considered bounded in scope to the dataset used, since there is no information available as to its provenance. Such results will be of interest to people whose aim is to compare different neighborhoods when thinking about relocating to, or vacationing in, a different environment, considering the ease of accessing numerous venues within a clustered setting. There is certainly a lot of room for improvement, for example obtaining more than the current neighborhood locations in order to analyze and cluster a wider geographical area. We may also analyze crime data, which is publicly available for these two counties, to better inform the choice of a location to relocate to. This information may be extremely useful because no one would want to live in a crime-infested neighborhood. Though the approach used here may not be rigorous enough, it nevertheless showcases the usefulness of data analysis for decision making.