Big Data Authority: bluemix

Thursday, July 28, 2016

Connect to Cloudant database from SparkR

How to Connect to Cloudant Database from SparkR kernel

Below I show how to do it from Bluemix, but the same steps apply to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Jupyter Notebooks on Spark

Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
Now create a notebook with SparkR as the language.

The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment, the spark-cloudant driver is loaded by default.
https://github.com/cloudant-labs/spark-cloudant
If you are in a different environment, use the binary release:
https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
For example, use %Addjar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark environment.

Once the spark-cloudant connector is available in your Spark environment, you need to set three configuration parameters for your Spark context:

"cloudant.host", "ACCOUNT.cloudant.com"
"cloudant.username", "USERNAME"
"cloudant.password", "PASSWORD"

In SparkR, pass these settings through the sparkEnv argument of sparkR.init (R's partial argument matching resolves it to sparkEnvir) so that they reach the Spark context and all the executors.

sc <- sparkR.init(sparkEnv = list("cloudant.host"="c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com","cloudant.username"="c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix","cloudant.password"="XXXXXXXXXXXXXXXXXXXX"))

Once you execute the statement above, your Spark context is ready to use the Cloudant connector.
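
The three settings follow a fixed pattern, so they can be collected in one place before they are handed to the Spark context. The sketch below is plain Python rather than SparkR, and the helper name cloudant_settings is hypothetical; it only assumes that the host follows the ACCOUNT.cloudant.com pattern shown above.

```python
def cloudant_settings(account, username, password):
    """Return the three spark-cloudant settings as a dict.

    Hypothetical helper for illustration only; the host is assumed to
    follow the ACCOUNT.cloudant.com pattern.
    """
    return {
        "cloudant.host": account + ".cloudant.com",
        "cloudant.username": username,
        "cloudant.password": password,
    }


# Placeholder values, not real credentials.
settings = cloudant_settings("ACCOUNT", "USERNAME", "PASSWORD")
for key in sorted(settings):
    print(key)
```

Keeping the settings in a single structure makes it harder to forget one of the three keys when moving between environments.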

All you need to do is specify com.cloudant.spark as the source when reading. Here database is a variable holding the name of the Cloudant database to read:

people <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')

I have published the complete notebook on this GitHub repo. Feel free to use it.

Thursday, June 9, 2016

Running your first R notebook on IBM Bluemix Apache Spark Service

The IBM Bluemix Apache Spark service has introduced a tech preview of R that lets users run R programs on a Spark cluster.
https://developer.ibm.com/clouddataservices/docs/spark/technical-previews/r-in-jupyter-notebooks/
So how do you get started with an R notebook on Spark?

You need to create a new instance of the service, since the tech preview was only introduced in May 2016. Please check it out.
I have a simple Pi calculator example here, if you just want to import it and give the service a try: https://github.com/charles2588/bluemixsparknotebooks/raw/master/R/Pi_Bluemix.ipynb

Tuesday, February 2, 2016

Use Python to upload files to swift storage.

NYPD Motor Vehicle Accidents
From the Bluemix example NYPD Motor Vehicle Accidents, use the code up to the point where the Spark DataFrame is created.

# add the pyspark-csv module to the SparkContext
sc.addPyFile("https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py")
import pyspark_csv as pycsv

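collisions = sc.textFile("swift://hivecontainer." + credentials['name'] + "/NYPD_Motor_Vehicle_Collisions.csv")

# split off the header row; these names come from the Bluemix example and are
# reconstructed here as an assumption, since the post does not show them
collisions_header = collisions.first()
collisions_header_list = collisions_header.split(",")
collisions_body = collisions.filter(lambda line: line != collisions_header)

# create Spark DataFrame using pyspark-csv
collisions_df = pycsv.csvToDataFrame(sqlContext, collisions_body, sep=",", columns=collisions_header_list)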

# save the Spark DataFrame to local storage
collisions_df.toPandas().to_csv('mycsv.csv')
# There are two ways to upload the resulting CSV file back to Swift object storage.

# 1. Install the swift CLI and run swift commands with the ! magic (https://www.ng.bluemix.net/docs/services/ObjectStorage/index.html#using-swift-cli), but that
#    seems to error out because of the python-dev module and other modules required for "!pip install --user python-swiftclient", "pip install python-keystoneclient" and
#    "pip install urllib3 certifi pyopenssl".

# 2. Use the ! magic with curl and the Swift storage REST API (tested and working fine).

!curl -i -H "Content-Type: application/json" -d '{"auth": {"identity": {"methods": ["password"],"password": {"user": {"id": "cc8b1374d0de412fa1c7e201a4e90bce","password": "jo..F04#N(bDR1OZ"}}},"scope": {"project": {"id": "e4321c16ed084c06a9dc62ba810a61bf"}}}}' https://identity.open.softlayer.com/v3/auth/tokens

HTTP/1.1 201 Created
Date: Wed, 03 Feb 2016 01:43:18 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_wsgi/3.4 Python/2.7.5
X-Subject-Token: gAAAAABWsVs3GT2stF………
Vary: X-Auth-Token
x-openstack-request-id: req-fb9adbc9-425d-4551-817b-9c19995f3107
Content-Length: 17448
Content-Type: application/json
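
The JSON body passed to curl -d above is easy to mistype by hand. As a minimal sketch, the same structure can be built with Python's json module; the function name and the USER_ID, PASSWORD and PROJECT_ID placeholders are hypothetical and stand in for your own credentials.

```python
import json


def keystone_v3_password_auth(user_id, password, project_id):
    """Build the password-auth request body used in the curl call above.

    Hypothetical helper; it only mirrors the JSON structure shown in
    the post and does not contact any service.
    """
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {"user": {"id": user_id, "password": password}},
            },
            "scope": {"project": {"id": project_id}},
        }
    }


# Placeholder values, not real credentials.
auth_body = json.dumps(keystone_v3_password_auth("USER_ID", "PASSWORD", "PROJECT_ID"))
print(auth_body)
```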

Copy the X-Subject-Token value and save it to a token variable so that it can be reused.

token = "gAAAAABWsVs3GT2stFDdr…"
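
Instead of copying the token by hand, the header can be picked out of the response text. The sketch below assumes the raw headers are available as a string (for example, captured from the curl -i output); extract_subject_token is a hypothetical helper and the sample headers use a made-up token value.

```python
def extract_subject_token(raw_headers):
    """Return the X-Subject-Token value from raw HTTP headers, or None."""
    for line in raw_headers.splitlines():
        # Split on the first colon only, so values containing colons survive.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-subject-token":
            return value.strip()
    return None


# Made-up headers shaped like the response shown above.
sample_headers = (
    "HTTP/1.1 201 Created\n"
    "Date: Wed, 03 Feb 2016 01:43:18 GMT\n"
    "X-Subject-Token: EXAMPLE_TOKEN\n"
    "Vary: X-Auth-Token\n"
)
sample_token = extract_subject_token(sample_headers)
print(sample_token)
```

The header name is compared case-insensitively because HTTP header names are not case-sensitive.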

Now, to get the size of the exported file, run the following:
!ls -l mycsv.csv
-rw-r--r-- 1 s027-20bcfe6e4297e8-2c631c8ff999 users 145549225 Feb 1 15:08 mycsv.csv

Now replace the Content-Length value with the size of the exported file, and replace the other parameters to form the object storage URL, as explained here:
https://www.ng.bluemix.net/docs/services/ObjectStorage/index.html#using-swift-restapi

!curl -X PUT -H "X-Auth-Token:$token" -H "Content-Length: 145549225" https://dal.objectstorage.open.softlayer.com/v3/AUTH_e4321c16ed084c06a9dc62ba810a61bf/hivecontainer/mycsv.csv -T mycsv.csv
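
The file size and the object URL used in the curl command above can also be computed rather than copied by hand. The sketch below shows one way to do that in Python; object_url is a hypothetical helper, AUTH_PROJECT_ID is a placeholder for your own project id, and the temporary file only stands in for mycsv.csv.

```python
import os
import tempfile


def object_url(endpoint, account, container, object_name):
    """Join the pieces of an object storage URL (hypothetical helper)."""
    return "/".join([endpoint.rstrip("/"), account, container, object_name])


# Write a small stand-in file and measure it, as "ls -l" does for mycsv.csv.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as handle:
    handle.write(b"a,b\n1,2\n")
    demo_path = handle.name

demo_size = os.path.getsize(demo_path)  # size in bytes, for Content-Length
os.remove(demo_path)

# Endpoint copied from the post; the account part is a placeholder.
demo_url = object_url(
    "https://dal.objectstorage.open.softlayer.com/v3",
    "AUTH_PROJECT_ID",
    "hivecontainer",
    "mycsv.csv",
)
print(demo_size, demo_url)
```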

You can now verify that mycsv.csv is uploaded to swift storage by going to swift service.

Thanks,
Charles.

Sunday, January 31, 2016

Connecting to Postgres from IBM Bluemix - Jupyter Notebooks on Spark

Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

Now create a notebook with Scala as the language.
1. Download the Postgres JDBC driver jar, using the %Addjar magic to add a third-party jar.
  %Addjar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar
2. Import the two classes SparkConf and SparkContext
  import org.apache.spark.{SparkConf, SparkContext}
3. The first statement gets the SparkConf configuration object from Spark's initial context "sc".
  conf.setJars then specifies all the jars to be added to the new SparkContext we are going to create. In this case, since we have downloaded the Postgres driver jar, it is added to the new Spark context. (Simply copy and paste the statements as they are, since they are complex to modify. :))
  val conf = sc.getConf
  conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++ kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)
  conf.set("spark.driver.allowMultipleContexts", "true")
  conf.set("spark.master","local[*]")
  val scPostgres = new SparkContext(conf)
4. Import the SQLContext class, which is needed later to create the DataFrame
  import org.apache.spark.sql.{SQLContext}
5. Replace url with your Postgres URL, dbtable with the name of the table for which you want to create a DataFrame, and user and password with the credentials for your Postgres database.
  Note on the url: you can remove the sslmode argument, depending on the configuration of the Postgres server.
  val url = "jdbc:postgresql://ec2-75-101-163-171.compute-1.amazonaws.com:5432/d7vad26hel3q5l?sslmode=require"
  val dbtable = "public.test"
  val user = ""
  val password = ""
  val options = scala.collection.Map("url" -> url, "driver" -> "org.postgresql.Driver", "dbtable" -> dbtable, "user" -> user, "password" -> password)
6. Now create a new SQLContext from your new Spark context, which has the Postgres driver loaded
  val ncsqlContext = new SQLContext(scPostgres)
7. Create a DataFrameReader from your SQLContext for your table
  val dataFrameReader = ncsqlContext.read.format("jdbc").options(options)
8. Call the load method to create a DataFrame for your table.
  val tableDataFrame = dataFrameReader.load()
9. Call the show() method to display the table contents in the notebook
  tableDataFrame.show()

You have successfully created a dataframe.
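
The option map in step 5 is the part most likely to need editing, so it can help to see how it is assembled. The sketch below is plain Python, shown only to illustrate the structure; postgres_jdbc_options is a hypothetical helper, and HOST, DBNAME, USER and PASSWORD are placeholders for your own database details.

```python
def postgres_jdbc_options(host, port, database, dbtable, user, password, sslmode=None):
    """Assemble the JDBC options used in step 5 (hypothetical helper).

    sslmode is optional because, as noted in step 5, it depends on how
    the Postgres server is configured.
    """
    url = "jdbc:postgresql://{0}:{1}/{2}".format(host, port, database)
    if sslmode:
        url += "?sslmode=" + sslmode
    return {
        "url": url,
        "driver": "org.postgresql.Driver",
        "dbtable": dbtable,
        "user": user,
        "password": password,
    }


# Placeholder values, not a real database.
jdbc_options = postgres_jdbc_options(
    "HOST", 5432, "DBNAME", "public.test", "USER", "PASSWORD", sslmode="require"
)
print(jdbc_options["url"])
```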
