Tuesday, June 18, 2019

Coursera Capstone Project









Capstone Project
 The Battle of Neighborhoods
Segmenting and Clustering Neighborhoods of San Francisco and Los Angeles
By,
 Charles C Gomes






Contents


Introduction
Objective
Data
Results
Discussion
Conclusion

Introduction

San Francisco and Los Angeles are two major cities in California.
Brief information about both cities:
  • San Francisco: officially the City and County of San Francisco, is a city in, and the cultural, commercial, and financial center of, Northern California. San Francisco is the 13th-most populous city in the United States, and the fourth-most populous in California, with 883,305 residents as of 2018.

  • Los Angeles: officially the City of Los Angeles and often known by its initials L.A., is the most populous city in California, the second-most populous city in the United States after New York City, and the third-most populous city in North America. With an estimated population of nearly four million, Los Angeles is the cultural, financial, and commercial center of Southern California.




 

 

Objective

In this project, we will study area classification in detail using Foursquare data and machine-learning segmentation and clustering. The aim of this project is to segment the areas or neighborhoods of San Francisco and Los Angeles based on the most common venues captured from Foursquare.
Using segmentation and clustering, we hope to determine:
  1. The similarity or dissimilarity of the two cities
  2. The classification of areas within each city as residential, tourist, or other (a sketch of the clustering step is shown after this list)
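As a rough illustration of the clustering step (a sketch under assumed inputs, not the exact code used in this project), the snippet below groups neighborhoods by the frequency of their venue categories with k-means; the venues DataFrame and its column names are hypothetical placeholders.

    # Minimal sketch: cluster neighborhoods by venue-category frequency (assumed schema).
    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical input: one row per venue returned by Foursquare.
    venues = pd.DataFrame({
        "Neighborhood": ["Mission", "Mission", "Nob Hill", "Hollywood", "Venice", "Venice"],
        "Venue Category": ["Park", "Coffee Shop", "Hotel", "Theater", "Beach", "Yoga Studio"],
    })

    # One-hot encode venue categories and average them per neighborhood.
    onehot = pd.get_dummies(venues["Venue Category"])
    onehot["Neighborhood"] = venues["Neighborhood"]
    grouped = onehot.groupby("Neighborhood").mean()

    # Cluster the neighborhoods into three groups, mirroring the three clusters in the results.
    kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(grouped)
    print(dict(zip(grouped.index, kmeans.labels_)))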






Data

The neighborhood data was acquired from the following sources.
Additionally, the Foursquare API was used to retrieve the different kinds of venues used for segmentation and clustering.
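As an illustration of how the venue data can be pulled, a query against the Foursquare "explore" endpoint might look like the sketch below; the credentials and coordinates are placeholders, and the parsing follows the typical structure of the v2 API response.

    # Minimal sketch: fetch nearby venues for one location via the Foursquare v2 API.
    import requests

    CLIENT_ID = "YOUR_CLIENT_ID"          # placeholder credential
    CLIENT_SECRET = "YOUR_CLIENT_SECRET"  # placeholder credential
    VERSION = "20190618"                  # API version date
    lat, lng = 37.7599, -122.4148         # example coordinates

    url = ("https://api.foursquare.com/v2/venues/explore"
           "?client_id={}&client_secret={}&v={}&ll={},{}&radius=500&limit=100"
           .format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng))
    response = requests.get(url).json()

    # Each item in the first group describes one venue and its category.
    for item in response["response"]["groups"][0]["items"]:
        venue = item["venue"]
        print(venue["name"], venue["categories"][0]["name"])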










Results

Cluster 1: San Francisco: Tourism
Cluster 2: San Francisco: Residential and Tourism
Cluster 3: San Francisco: Tourism
Cluster 1: Los Angeles: Tourism
Cluster 2: Los Angeles: Residential, based on the Park, Convenience Store and Yoga Studio venues.
Cluster 3: Los Angeles: Mixed.















Discussion
Based on the clusters for each city above, we believe the classification of each cluster could be
improved by calculating the most common venue categories in each city.
Referring to each cluster, we cannot clearly determine what each cluster represents using only the
Foursquare most-common-venue data. What is lacking at this point is a systematic, quantitative
way to identify and distinguish different districts and to describe the correlations between the most
common venues recorded in Foursquare.
The reality is, however, more complex: similar cities might or might not have similar common venues.
A further step in this classification would be to find a method to extract these common venues and
integrate the spatial correlations between different areas or districts. We believe that the classification we propose is an encouraging step towards a quantitative and systematic comparison of different cities.
Further studies are needed to relate the data acquired and to draw more meaningful and objective conclusions from it.



 

 



Conclusion

With the help of the Foursquare API, we were able to capture venue information and use it
to assess the similarities and dissimilarities of San Francisco and Los Angeles.
We classified neighborhoods as residential, tourist, or mixed.
In conclusion, both San Francisco and Los Angeles are similar in having tourism-oriented neighborhoods, as well as some residential areas.
It is also apparent that in San Francisco the residential and tourist neighborhoods are more intermixed than in Los Angeles.



Tuesday, August 9, 2016

Data Workflow Management in Big Data Analytics

Workflow Management in Big Data Analytics

So now you have this big, powerful analytics cluster of 500+ nodes, and suddenly there are lots of teams around your organization ready to attack your cluster with heavy jobs.

You need a way to schedule and manage these jobs in the data pipeline, and that is where data workflow management tools like Airflow and Node-RED come into the picture.

Airflow
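As a rough sketch of what scheduling a pipeline in Airflow looks like (the DAG id, task ids, and commands are illustrative placeholders, not a recommended production setup):

    # Minimal sketch of an Airflow DAG with two dependent tasks.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="example_pipeline",          # hypothetical pipeline name
        start_date=datetime(2016, 8, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull data into the cluster'",
        dag=dag,
    )

    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run the heavy analytics job'",
        dag=dag,
    )

    # transform runs only after extract succeeds
    extract.set_downstream(transform)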

Node-RED


Friday, August 5, 2016

Messaging Queue Systems - Kafka, Mesos, RabbitMQ, ZeroMQ, Apache ActiveMQ, OpenMP


  1. Kafka (a minimal producer sketch follows this list)
    • Getting started - http://blog.antlypls.com/blog/2015/10/05/getting-started-with-spark-streaming-using-docker/
  2. Mesos
  3. RabbitMQ
  4. ZeroMQ
  5. Apache ActiveMQ
  6. OpenMP
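For item 1, a minimal producer sketch with the kafka-python client (the broker address and topic name are placeholders) looks roughly like this:

    # Minimal sketch: publish one message to Kafka with the kafka-python client.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker
    producer.send("events", b"hello from the pipeline")           # "events" is a placeholder topic
    producer.flush()   # block until the message has actually been sent
    producer.close()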

Thursday, July 28, 2016

Connect to Cloudant database from SparkR

How to Connect to Cloudant Database from SparkR kernel



Below I will show how to do it from Bluemix, but it applies to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Juypter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
  3. Now create a notebook with SparkR as the language.
    • The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment the driver is loaded by default.
    • https://github.com/cloudant-labs/spark-cloudant
    • If you are in a different environment, you can use the binary:
    • https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
    • For example, use %AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark kernel.
  4. Once you have the spark-cloudant connector available in your Spark environment, you can configure the connection.
  5. You need to set three configuration parameters on your Spark context:
      "cloudant.host"     = "ACCOUNT.cloudant.com"
      "cloudant.username" = "USERNAME"
      "cloudant.password" = "PASSWORD"

      So in SparkR, you need to use the sparkEnv argument
      to pass these variables to all the executors:

      sc <- sparkR.init(sparkEnv = list("cloudant.host"="c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com","cloudant.username"="c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix","cloudant.password"="XXXXXXXXXXXXXXXXXXXX"))

      Once you execute the above, your Spark context is ready to use the Cloudant connector.
      All you need to do is specify that you are reading with the com.cloudant.spark source:

      # 'database' holds the name of the Cloudant database you want to read
      people <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark", inferSchema='true')
      
      
I have the complete notebook published on this GitHub repo. Feel free to use it.

Friday, June 17, 2016

Difference Between Spark and Hadoop Map-reduce

Difference Between Spark and Hadoop


Difference: Spark vs. Hadoop Map-Reduce
1. Performance
   Spark: Iterative computations are performed in-memory; the mapper functions just transform one RDD into another RDD, saving disk I/O and network I/O and improving performance.
   Hadoop Map-Reduce: The Map and Reduce phases cause every mapper/reducer to write data to disk after mapping and each successive mapper/reducer to read it back, resulting in disk I/O and network I/O and causing latency.
2. Programming Languages
   Spark: Scala, Java, Python, R
   Hadoop Map-Reduce: Java
3. Basic Unit of Data
   Spark: RDD (Resilient Distributed Dataset)
   Hadoop Map-Reduce: Tuples
4. Lines of Code for WordCount
   Spark: as few as 6 lines of Python code (refer here)
   Hadoop Map-Reduce: as many as 73 lines of Java code (refer here)
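For reference, the kind of short Spark word count that row 4 alludes to might look like the sketch below (the input/output paths are placeholders, not the code behind the "refer here" links):

    # Minimal PySpark word count sketch (paths are placeholders).
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("hdfs:///data/input.txt")       # read the input lines
                .flatMap(lambda line: line.split())       # split lines into words
                .map(lambda word: (word, 1))              # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))         # sum the counts per word
    counts.saveAsTextFile("hdfs:///data/wordcount-output")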

Monday, June 13, 2016

Spark 2.0 is out


Spark Summit East Keynote: Apache Spark 2.0


How do you get your hands on Spark 2.0 :-
1. Databricks Community Edition
2. Download and set it up


Major features:-
  1.  Tungsten Phase 2 speedups of 5-10x
  2. Structured Streaming real-time engine on SQL/DataFrames
  3. Unifying Datasets and DataFrames
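As a quick taste of Structured Streaming on DataFrames, the standard socket word-count example looks roughly like the sketch below (host and port are placeholders, e.g. fed by nc -lk 9999):

    # Minimal Structured Streaming sketch (Spark 2.0+): word count over a socket stream.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

    # Read lines from a socket source (placeholder host/port).
    lines = spark.readStream.format("socket") \
        .option("host", "localhost").option("port", 9999).load()

    # Split lines into words and count them as an unbounded, continuously updated table.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the running counts to the console after every micro-batch.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()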