Thursday, March 17, 2016

Connecting to MongoDB from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account on Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Add the UnityJDBC jar, which contains the MongoDB JDBC driver.
      %Addjar https://github.com/charles2588/SparkNotebooksJars/raw/master/unityjdbc.jar
      
      
      
    2. Add the Mongo Java Driver jar, which UnityJDBC needs.
      %Addjar https://github.com/charles2588/SparkNotebooksJars/raw/master/mongo-java-driver-2.13.3.jar
      
      
      
    3. Test the import below:
      import mongodb.jdbc.MongoDriver
       
      
    4. Import the DataFrame and SQLContext classes:
      import org.apache.spark.sql.{DataFrame, SQLContext} 
    5. Replace url with your MongoDB URL,
      dbtable with the name of the collection (table) for which you want to create a DataFrame,
      and user and password with the credentials for your MongoDB server.
      val url = "jdbc:mongo://ds045252.mlab.com:45252/samplemongodb"
      val dbtable = "Photos"
      val user = "charles2588"
      val password = "*****"
      val options = scala.collection.Map("url" -> url,"driver" -> "mongodb.jdbc.MongoDriver","dbtable" ->dbtable,"user"->user,"password"->password)
      

    6. Now create a new SQLContext from your Spark context, which now has the MongoDB driver available:
      val sqlContext = new SQLContext(sc)

    7. Create a DataFrameReader from your SQLContext for your table:
      val dataFrameReader = sqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table:
      val tableDataFrame = dataFrameReader.load()

    9. Call the show() method to display the table contents in the notebook:
      tableDataFrame.show()

  4. You have successfully created a DataFrame from MongoDB; now you can do further processing according to your needs (see the sketch below).
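  As a minimal sketch of such further processing (using the Spark 1.x API available in these notebooks), you can register the DataFrame as a temporary table and query it with Spark SQL; the query below is only illustrative, so adjust the column names to your own collection.

      // register the DataFrame so it can be queried with Spark SQL
      tableDataFrame.registerTempTable("photos")
      // run an illustrative query against the registered table
      val firstTen = sqlContext.sql("SELECT * FROM photos LIMIT 10")
      firstTen.show()
      // inspect the inferred schema and the row count
      tableDataFrame.printSchema()
      println(tableDataFrame.count())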

Tuesday, February 2, 2016

Use Python to upload files to swift storage.


NYPD Motor Vehicle Accidents
From the Bluemix example "NYPD Motor Vehicle Accidents", use the code up to the point where we create the Spark DataFrame.

# add the pyspark-csv module to the SparkContext
sc.addPyFile("https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py")
import pyspark_csv as pycsv

collisions = sc.textFile("swift://hivecontainer." + credentials['name'] + "/NYPD_Motor_Vehicle_Collisions.csv")

# create Spark DataFrame using pyspark-csv
# (collisions_header_list and collisions_body come from the earlier steps of the referenced
#  example, where the CSV header row is split off from the body)
collisions_df = pycsv.csvToDataFrame(sqlContext, collisions_body, sep=",", columns=collisions_header_list)

# save the Spark DataFrame to local storage as a CSV file
collisions_df.toPandas().to_csv('mycsv.csv')

# Now there are two ways to upload the resulting CSV file back to Swift object storage

#1. You can install the Swift CLI and run swift commands with the ! (shell) magic
#    (https://www.ng.bluemix.net/docs/services/ObjectStorage/index.html#using-swift-cli), but that
#    errors out here, since the python-dev package and other modules are required before you can run
#    "pip install --user python-swiftclient", "pip install python-keystoneclient" and "pip install urllib3 certifi pyopenssl"

#2. You can use the ! magic with curl and the Swift object storage REST API (tested and working fine)

!curl -i -H "Content-Type: application/json" -d '{"auth": {"identity": {"methods": ["password"],"password": {"user": {"id": "cc8b1374d0de412fa1c7e201a4e90bce","password": "jo..F04#N(bDR1OZ"}}},"scope": {"project": {"id": "e4321c16ed084c06a9dc62ba810a61bf"}}}}'  https://identity.open.softlayer.com/v3/auth/tokens

HTTP/1.1 201 Created
Date: Wed, 03 Feb 2016 01:43:18 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_wsgi/3.4 Python/2.7.5
X-Subject-Token: gAAAAABWsVs3GT2stF………
Vary: X-Auth-Token
x-openstack-request-id: req-fb9adbc9-425d-4551-817b-9c19995f3107
Content-Length: 17448
Content-Type: application/json

Copy the X-Subject-Token value and save it in a token variable for reuse.

token = "gAAAAABWsVs3GT2stFDdr…"

Now, to get the size of the exported file, run the following:
!ls -l mycsv.csv
-rw-r--r-- 1 s027-20bcfe6e4297e8-2c631c8ff999 users 145549225 Feb  1 15:08 mycsv.csv


Now replace the Content-Length value with the size of the exported file, and fill in the other parameters to form the object storage URL,
as explained in the link here:
https://www.ng.bluemix.net/docs/services/ObjectStorage/index.html#using-swift-restapi

!curl -X PUT -H "X-Auth-Token:$token" -H "Content-Length: 145549225" https://dal.objectstorage.open.softlayer.com/v3/AUTH_e4321c16ed084c06a9dc62ba810a61bf/hivecontainer/mycsv.csv -T mycsv.csv


You can now verify that mycsv.csv has been uploaded to Swift storage by going to the Swift service.


Thanks,
Charles.

Monday, February 1, 2016

Connecting to any DB2 (or SQLDB powered by DB2 on Bluemix) from IBM Bluemix - Jupyter Notebooks on Spark

  1. Create an account on Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Import the DataFrame and SQLContext classes:
      import org.apache.spark.sql.{DataFrame, SQLContext} 
    2. Replace url with your DB2 URL,
      dbtable with the name of the table for which you want to create a DataFrame,
      and user and password with the credentials for your DB2 database server.
      val url = "jdbc:db2://75.126.155.153:50000/SQLDB"
      val dbtable = "USER13878.MYTABLE"
      val user = "USER13878"
      val password = "<enter your password>"
      val options = scala.collection.Map("url" -> url, "driver" -> "com.ibm.db2.jcc.DB2Driver", "dbtable" ->dbtable,"user"->user,"password"->password)

    3. Now create a new SQLContext from your Spark context, which already has the DB2 driver loaded:
      val sqlContext = new SQLContext(sc)

    4. Create a DataFrameReader from your SQLContext for your table:
      val dataFrameReader = sqlContext.read.format("jdbc").options(options)

    5. Call the load method to create a DataFrame for your table:
      val tableDataFrame = dataFrameReader.load()

    6. Call the show() method to display the table contents in the notebook:
      tableDataFrame.show()


  4. You have successfully created a DataFrame for your DB2 table; now you can do further processing according to your needs (see the sketch below).
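  As a minimal sketch of such further processing with the DataFrame API: the column name SALARY below is only an assumption for illustration, so substitute a column from your own table.

      // filter and project with the DataFrame API; SALARY is a hypothetical column name
      val filtered = tableDataFrame.filter(tableDataFrame("SALARY") > 50000)
      filtered.select("SALARY").show()
      // inspect the schema that was read from DB2
      tableDataFrame.printSchema()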

Sunday, January 31, 2016

Connecting to Postgres from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account on Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Download the Postgres JDBC jar using the %Addjar method for adding a third-party jar:
      %Addjar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar

    2. Import the two classes SparkConf and SparkContext
      import org.apache.spark.{SparkConf, SparkContext}

    3. The first statement simply creates a SparkConf configuration object from Spark's initial context "sc".
      Then conf.setJars specifies all the jars to be added to the new SparkContext we are going to create; since we downloaded the Postgres driver jar above, it gets added to the new Spark context. (Simply copy and paste the statement, as it is too complex to modify.)
      val conf = sc.getConf
      conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++ kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)
      conf.set("spark.driver.allowMultipleContexts", "true")
      conf.set("spark.master", "local[*]")
      val scPostgres = new SparkContext(conf)

    4. Import the SQLContext class for creating DataFrames:
      import org.apache.spark.sql.{SQLContext}

    5. Replace url with your Postgres URL,
      dbtable with the name of the table for which you want to create a DataFrame,
      and user and password with the credentials for your Postgres database.
      Note on the url: you can remove the sslmode argument, depending on the configuration of the Postgres server.
      val url = "jdbc:postgresql://ec2-75-101-163-171.compute-1.amazonaws.com:5432/d7vad26hel3q5l?sslmode=require"
      val dbtable = "public.test"
      val user = ""
      val password = ""
      val options = scala.collection.Map("url" -> url, "driver" -> "org.postgresql.Driver", "dbtable" -> dbtable, "user" -> user, "password" -> password)

    6. Now create a new SQLContext from your new Spark context, which has the Postgres driver loaded:
      val ncsqlContext = new SQLContext(scPostgres)

    7. Create a DataFrameReader from your SQLContext for your table:
      val dataFrameReader = ncsqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table:
      val tableDataFrame = dataFrameReader.load()

    9. Call the show() method to display the table contents in the notebook:
      tableDataFrame.show()


  4. You have successfully created a DataFrame from Postgres; see the sketch below if you also need to write results back.
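  If you also need to write results back to Postgres, here is a sketch using the DataFrameWriter (available since Spark 1.4). The target table name public.test_copy is hypothetical; url, user and password are the values defined above.

      import java.util.Properties
      // connection properties for the JDBC writer (values defined earlier)
      val props = new Properties()
      props.setProperty("user", user)
      props.setProperty("password", password)
      props.setProperty("driver", "org.postgresql.Driver")
      // write the DataFrame out to a (hypothetical) new table in Postgres
      tableDataFrame.write.jdbc(url, "public.test_copy", props)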