
Thursday, July 28, 2016

Connect to a Cloudant Database from SparkR




Below I will show how to do it from Bluemix, but the same steps apply to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
  3. Now create a notebook with SparkR as the language.
    • The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment, the spark-cloudant driver (https://github.com/cloudant-labs/spark-cloudant) is loaded by default.
    • If you are running in a different environment, you can use the released binary instead: https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
    • For example, run %AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark environment.
  4. Once you have the spark-cloudant connector in your Spark environment, you need to set three configuration parameters on your Spark context:
    • "cloudant.host" = "ACCOUNT.cloudant.com"
    • "cloudant.username" = "USERNAME"
    • "cloudant.password" = "PASSWORD"

      In SparkR, pass these through the sparkEnvir argument of sparkR.init so that they are available to all the executors:
      sc <- sparkR.init(sparkEnvir = list(
        "cloudant.host" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com",
        "cloudant.username" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix",
        "cloudant.password" = "XXXXXXXXXXXXXXXXXXXX"))
       
       
      Once you execute the above, your Spark context is ready to use the cloudant connector.
      All you need to do is create an SQLContext from the new Spark context and specify com.cloudant.spark as the source when reading:

      sqlContext <- sparkRSQL.init(sc)
      database <- "YOUR_DATABASE_NAME"  # the name of the Cloudant database to read
      people <- read.df(sqlContext, database, source = "com.cloudant.spark", header = 'true', inferSchema = 'true')
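
      Once the data frame is created, you can work with it like any other SparkR DataFrame. A minimal sketch of what that might look like (the temporary table name "people" is just an illustration):

      printSchema(people)
      head(people)
      registerTempTable(people, "people")
      head(sql(sqlContext, "SELECT * FROM people LIMIT 10"))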
      
      
      I have the complete notebook published on this GitHub repo. Feel free to use it.

Sunday, January 31, 2016

Connecting to Postgres from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Download the Postgres JDBC jar using the %AddJar magic, which adds a third-party jar to the kernel:
      %AddJar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar

    2. Import the SparkConf and SparkContext classes:
      import org.apache.spark.{SparkConf, SparkContext}

    3. The first statement creates a SparkConf configuration object from Spark's initial context "sc".
      The conf.setJars call is the magic statement that specifies all the jars to be added to the new SparkContext we are about to create; because we downloaded the Postgres driver jar with %AddJar, it is on the kernel's class loader and gets included. (Simply copy and paste the statement, as it is too complex to modify :))
      val conf = sc.getConf
      conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++
        kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)
      conf.set("spark.driver.allowMultipleContexts", "true")
      conf.set("spark.master", "local[*]")
      val scPostgres = new SparkContext(conf)

    4. Import the SQLContext class for the DataFrame work that follows:
      import org.apache.spark.sql.{SQLContext}

    5. Replace url with your Postgres URL, dbtable with the name of the table you want a DataFrame for, and user and password with the credentials for your Postgres database.
      Note on the url: you can remove the sslmode argument, depending on the configuration of your Postgres server.
      val url = "jdbc:postgresql://ec2-75-101-163-171.compute-1.amazonaws.com:5432/d7vad26hel3q5l?sslmode=require"
      val dbtable = "public.test"
      val user = ""
      val password = ""
      val options = scala.collection.Map("url" -> url, "driver" -> "org.postgresql.Driver", "dbtable" -> dbtable, "user" -> user, "password" -> password)

    6. Now create a new SQLContext from your new SparkContext, which has the Postgres driver loaded:
      val ncsqlContext = new SQLContext(scPostgres)

    7. Create a DataFrameReader from your SQLContext for your table:
      val dataFrameReader = ncsqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table:
      val tableDataFrame = dataFrameReader.load()

    9. Call show() to display the table contents in the notebook:
      tableDataFrame.show()


  4. You have successfully created a DataFrame; see the short usage sketch below.
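
Once the DataFrame exists, you can query it like any other Spark DataFrame. A minimal sketch (the temporary table name "test" is just an illustration):

  tableDataFrame.registerTempTable("test")
  val results = ncsqlContext.sql("SELECT * FROM test LIMIT 10")
  results.show()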