Wednesday, October 5, 2016

Java JDBC Application to connect to IBM BigSQL or IBM DB2 in a Kerberos Cluster

This blog provides the complete code to connect to an IBM BigSQL database in a Kerberos-enabled cluster. The same code also works for IBM DB2.

In the code below, we do not pass a database user/password to connect to the database; instead, we use Kerberos authentication.

1) Java - JDBC Code


public class KerberosTest {

       public static void main(String[] args) {
            
             if (args == null || args.length != 2) {
                    throw new IllegalArgumentException(
                                 "Invalid arguments. " +
                                 "Specify headnode hostname/ip " +
                                 "& database port");
             }

             String ServerName = args[0];
             int PortNumber = Integer.parseInt(args[1]);
             String DatabaseName = "bigsql";

             java.util.Properties properties = new java.util.Properties();
            
             // 11 is the integer value for Kerberos security
             // (com.ibm.db2.jcc.DB2BaseDataSource.KERBEROS_SECURITY)
             properties.put("securityMechanism", "11");
            
             // Provide the Kerberos Principal
             properties.put("KerberosServerPrincipal",
                           "biadmin/btestserver.ibm.com@IBM.COM");

             String url = String.format("jdbc:db2://%s:%d/%s", ServerName,
                           PortNumber, DatabaseName);

             java.sql.Connection con = null;
             try {
                    Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance();
             } catch (Exception e) {
                    System.out.println("Error: " +
                                 "failed to load Db2 jcc driver.");
             }

             try {
                   
                    System.out.println("url: " + url);
                   
                    con = java.sql.DriverManager.
                                 getConnection(url, properties);
                   
                    java.sql.Statement s2 = con.createStatement();

                    try {
                           s2.executeUpdate("drop table t1");
                           s2.executeUpdate("drop table tbint");

                           System.out.println("Dropped Hadoop & DB2 " +
                                        "tables successfully!!!");
                          
                    } catch (Exception e) {
                           // Expected on the first run, when the tables do not exist yet
                           System.out.println("drop failed: " + e.getMessage());
                    }

                    try {
                           // Create DB2 Table
                           s2.executeUpdate("create table t1 (c1 int)");
                          
                           // Create BigSQL Table
                           s2.executeUpdate(
                                        "create hadoop table " +
                                        "if not exists "+
                                        "tbint " +
                                        "(col1 INT, col2 INT, col3 INT)");

                           System.out.println(
                                        "Created Hadoop & DB2 tables " +
                                        "successfully!!!");
                          
                    } catch (Exception e) {
                           System.out.println("create failed: " + e.getMessage());
                    }

                    // Insert to DB2 Table
                    String str = "insert into t1 values (100)";
                    s2.executeUpdate(str);

                    // Query the DB2 Table
                    java.sql.PreparedStatement ps = con
                                 .prepareStatement("select * from t1");
                    java.sql.ResultSet rs = ps.executeQuery();

                    while (rs.next()) {
                           System.out.println(rs.getString(1));
                    }

                    // Insert to BIGSQL Table
                    str = "insert into tbint values(1,2,3),(1,2,3),(1,2,3)";
                    s2.executeUpdate(str);

                    // Query the BIGSQL Table
                    ps = con.prepareStatement("select * from tbint");
                    rs = ps.executeQuery();

                    while (rs.next()) {
                           System.out.printf("%s,%s,%s", rs.getString(1),
                                        rs.getString(2), rs.getString(3));
                           System.out.println();
                    }

                    con.close();
             } catch (Exception e) {
                    e.printStackTrace();
             }
       }
}



To compile the above code, ensure the DB2 JARs (db2jcc_license_cu.jar, db2jcc.jar, and db2jcc4.jar) are in the classpath.
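
For example, compiling and running look roughly like this (a sketch - the JAR locations shown here are assumptions and depend on where the DB2 driver is installed in your environment):

javac -cp "/opt/ibm/db2/java/db2jcc.jar:/opt/ibm/db2/java/db2jcc4.jar:/opt/ibm/db2/java/db2jcc_license_cu.jar" KerberosTest.java

java -cp ".:/opt/ibm/db2/java/db2jcc.jar:/opt/ibm/db2/java/db2jcc4.jar:/opt/ibm/db2/java/db2jcc_license_cu.jar" KerberosTest btestserver.ibm.com 51000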

2) Testing the JDBC Application in a Kerberos Cluster


Do a kinit and get a valid Kerberos ticket.

[biadmin@btestserver ~]$
[biadmin@btestserver ~]$ date
Wed Oct  5 16:50:18 PDT 2016
[biadmin@btestserver ~]$
[biadmin@btestserver ~]$
[biadmin@btestserver ~]$ /opt/ibm/biginsights/jdk/jre/bin/kinit -k -t /opt/ibm/biginsights/conf/security/keytabs/biadmin.keytab biadmin/btestserver.ibm.com
Done!
New ticket is stored in cache file /home/biadmin/krb5cc_biadmin
[biadmin@btestserver ~]$
[biadmin@btestserver ~]$
[biadmin@btestserver ~]$ klist -c krb5cc_biadmin
Ticket cache: FILE:krb5cc_biadmin
Default principal: biadmin/btestserver.ibm.com@IBM.COM

Valid starting     Expires            Service principal
10/05/16 16:50:22  10/06/16 16:50:22  krbtgt/IBM.COM@IBM.COM
[biadmin@btestserver ~]$
[biadmin@btestserver ~]$


Run the Java code.

[biadmin@btestserver ~]$
[biadmin@btestserver ~]$ java -jar /opt/nisan/BigSQL_jdbc_kerberos.jar "btestserver.ibm.com" 51000
url: jdbc:db2://btestserver.ibm.com:51000/bigsql
Dropped Hadoop & DB2 tables successfully!!!
Created Hadoop & DB2 tables successfully!!!
100
1,2,3
1,2,3
1,2,3
[biadmin@btestserver ~]$


3) Verifying the records inserted into the tables


[biadmin@btestserver ~]$
[biadmin@btestserver ~]$ su bigsql
Password:
[bigsql@btestserver biadmin]$
[bigsql@btestserver biadmin]$ db2 "connect to bigsql"

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.6.2
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

[bigsql@btestserver biadmin]$
[bigsql@btestserver biadmin]$ db2 "select * from biadmin.t1"
C1
-----------
        100

  1 record(s) selected.

[bigsql@btestserver biadmin]$
[bigsql@btestserver biadmin]$ db2 "select * from biadmin.tbint"
COL1        COL2        COL3
----------- ----------- -----------
          1           2           3
          1           2           3
          1           2           3

  3 record(s) selected.

[bigsql@btestserver biadmin]$

I will be covering the JDBC Application to connect to Apache Hive in my next blog.

Wednesday, July 13, 2016

Identifying Emotions from tweets for a Social Event using R in IBM Data Scientist Workbench


This blog describes the steps to identify emotions from tweets for an event. As an example, I have taken the event "Britain Exit" (Brexit). I am using IBM Data Scientist Workbench, an online platform where you can execute R statements directly using RStudio. I recommend Data Scientist Workbench because it provides RStudio and Apache Spark on a single platform without installing anything on your machine.

The blog covers

1) Setting up the infrastructure - IBM Data Scientist Workbench
2) Setting up Twitter App for OAuth
3) Connecting to Twitter using R
4) Cleansing the tweets
5) Identifying the emotions
6) Plotting the graph with emotions & tweets


1) Setting up the infrastructure - IBM Data Scientist Workbench

Create an online account at https://my.datascientistworkbench.com/login

Log in to Data Scientist Workbench and click on RStudio IDE.

2) Setting up Twitter App for OAuth

Log in to https://apps.twitter.com/app/new and create an application.

Open the "Keys and Access Tokens" tab and get the Consumer Key, Consumer Secret, Access Token, and Access Token Secret.

3) Connecting to Twitter using R

We use existing R libraries to connect to Twitter. Install the required libraries:

install.packages("twitteR")
install.packages("ROAuth")


The statements below load the libraries and connect to Twitter. We then search for some popular hashtags.

> library("twitteR")
>
> library("ROAuth") 
>
> dir.create(file.path("/resources", "BRExit"), mode = "0777")
>
> setwd("/resources/BRExit")
>
> getwd()
[1] "/resources/BRExit"
>
> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
trying URL 'http://curl.haxx.se/ca/cacert.pem'
Content type 'unknown' length 250607 bytes (244 KB)
==================================================
downloaded 244 KB

> ## Provide the proper keys & tokens obtained in step 2.
> consumer_key <- 'xxxxxxxxxxx'
>
> consumer_secret <- 'xxxxxxxxxxxxxxxxxxxxxxx'
>
> access_token <- 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>
> access_secret <- 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>
> twiter_handle <- setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret)
[1] "Using direct authentication"
Use a local file ('.httr-oauth'), to cache OAuth access credentials between R sessions?

1: Yes
2: No

Selection: 1
Adding .httr-oauth to .gitignore




Now, we query Twitter to get the tweets related to the hashtags.

> search_term <- "#brexit,#strongerin,#yes2eu"
> 
> search_term <- unlist(strsplit(search_term,","))
> 
> tweets = list()
>  
> ## You may get a warning because of the Twitter rate limit.
> ## If there are many hashtags, you may need to stop after some time, since
> ## Twitter will impose the rate limit and you will get an exception while
> ## fetching the tweets.
> for(i in 1:length(search_term)){
     result<-searchTwitter(search_term[i],n=1500,lang="en")
     tweets <- c(tweets,result)
     tweets <- unique(tweets)
 }
Warning message:
In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit,  :
  1500 tweets were requested but the API can only return 107
>
> ## Collected 3107 tweets  
> length(tweets)
[1] 3107
>  
> ## Display 5 tweets
> head(tweets)
[[1]]
[1] "AAPLRoom: AAPL Trading Room Newsletter is out! https://t.co/cXzBstbEI1 #trading #brexit"

[[2]]
[1] "DVATW: If #TheresaMay becomes PM then the hope and optimism of a fast clean #Brexit is over."

[[3]]
[1] "melvine: RT @RoubiniGlobal: #Brexit Circus: Which branch of the #EU line will the UK take? #MindTheGap https://t.co/mvFceRuJpy"

Save the tweets to a file so that other types of analytics can be run on the data later.


> file<-NULL
> 
> if (file.exists("tweetsBRExit.csv")){file<- read.csv("tweetsBRExit.csv")}
> 
> df <- do.call("rbind", lapply(tweets, as.data.frame))
> 
> df<-rbind(df,file)
> 
> df <- df[!duplicated(df[c("id")]),]
> 
> write.csv(df,file="tweetsBRExit.csv",row.names=FALSE)
> 

You can see that the tweets are exported to /resources/BRExit/tweetsBRExit.csv. The file has various fields that can be helpful for building other insights, like influencers, geographic influence, etc.

I am sharing the column names for your reference.

"text","favorited","favoriteCount","replyToSN","created","truncated","replyToSID","id",
"replyToUID","statusSource","screenName","retweetCount","isRetweet",
"retweeted","longitude","latitude"

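As a quick sketch (not part of the original run), the saved file can be loaded back for further analysis; the column names used below are from the list above:

saved_tweets <- read.csv("/resources/BRExit/tweetsBRExit.csv", stringsAsFactors = FALSE)

## For example, count the saved tweets and look at the most retweeted ones
nrow(saved_tweets)
head(saved_tweets[order(-saved_tweets$retweetCount), c("screenName", "retweetCount")])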


4) Cleansing the tweets 

We do some cleansing of the tweets, like removing extra whitespace, numbers, punctuation, etc.

> library(tm)
Loading required package: NLP
> 
> twitter_brexit_df = twListToDF(tweets)
> 
> r_text_corpus <- Corpus(VectorSource(twitter_brexit_df$text))
> 
> r_text_cleansing <- tm_map(r_text_corpus, stripWhitespace)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, removeNumbers)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, removePunctuation)
> 
> r_text_cleansing <- tm_map(r_text_cleansing, content_transformer(tolower))
> 

5) Identifying the emotions

We use the syuzhet library to identify the emotions in the tweets. The get_nrc_sentiment function implements Saif Mohammad's NRC Emotion Lexicon.
Refer: http://www.purl.org/net/NRCemotionlexicon
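
For orientation, once syuzhet is installed and loaded (as shown below), get_nrc_sentiment can be called on a single piece of text; the sample sentence here is just illustrative:

sample_scores <- get_nrc_sentiment("I am happy about the result but worried about the economy")
## sample_scores is a one-row data frame with the columns:
## anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive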

> install.packages("syuzhet")
>
> library(syuzhet)
> 
> isNull <- function(data) {
     if(is.null(data))
         return(0)
     else
         return(data)
 }
> 
> text_vec = c()
> anger = c() ; anticipation=c() ; disgust=c() ; fear=c() ; joy=c() ;
> sadness=c() ; surprise=c() ; trust=c() ; nrc_negative=c() ; nrc_positive=c();
> 
> for(i in 1:length(r_text_cleansing)){
     text <- lapply(r_text_cleansing[i], as.character)
     text <- gsub("http\\w+", "", text)
     nrc_emotions <- get_nrc_sentiment(text)
     
     text_vec[i] <- text
     anger[i] <- isNull(nrc_emotions$anger)
     anticipation[i] <- isNull(nrc_emotions$anticipation)
     disgust[i] <- isNull(nrc_emotions$disgust)
     fear[i] <- isNull(nrc_emotions$fear)
     joy[i] <- isNull(nrc_emotions$joy)
     sadness[i] <- isNull(nrc_emotions$sadness)
     surprise[i] <- isNull(nrc_emotions$surprise)
     trust[i] <- isNull(nrc_emotions$trust)
     nrc_negative[i] <- isNull(nrc_emotions$negative)
     nrc_positive[i] <- isNull(nrc_emotions$positive)
 }
> 
> nrc_df <- data.frame(text_vec,anger,anticipation,disgust,fear,joy,sadness,surprise,
                      trust,nrc_negative,nrc_positive)
> 
>
> nrc_df[1:2,2:11]
  anger anticipation disgust fear joy sadness surprise trust nrc_negative nrc_positive
1     0            0       0    0   0       0        0     0            0            0
2     0            2       0    0   3       0        2     0            0            3
>
 
6) Plotting the graph with emotions & tweets

Plot the graph in R.

> par(mar=c(5.1,5,4.1,2.1))
> 
> barplot(
     sort(colSums(prop.table(nrc_df[, 2:9]))), 
     horiz = TRUE, 
     cex.names = 0.7,
     las = 1, 
     main = "Emotions for Britain Exit", 
     xlab="Percentage",
     col="lightblue"
 )
>

I was tracking the tweets for the last couple of weeks, and I could see the emotions change as time progressed.
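
A rough sketch of that comparison, assuming the accumulated tweetsBRExit.csv is used (this aggregation is an assumption for illustration, not the exact code from my runs):

## Uses the read.csv and syuzhet steps shown earlier
all_tweets <- read.csv("/resources/BRExit/tweetsBRExit.csv", stringsAsFactors = FALSE)
all_tweets$day <- as.Date(all_tweets$created)

## Score every tweet and sum the eight emotion columns per day
emotions <- get_nrc_sentiment(all_tweets$text)
daily_emotions <- aggregate(emotions[, 1:8], by = list(day = all_tweets$day), FUN = sum)
daily_emotions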

Sunday, June 26, 2016

Reinstalling the Services from Apache Ambari

This blog discusses how to reinstall a component from Ambari. As an example, I have taken the reinstallation of Knox.

IT IS RECOMMENDED TO GET THE PRODUCT TEAM INVOLVED BEFORE RUNNING THE BELOW PROCEDURE. PLEASE DO NOT TRY THIS PROCEDURE DIRECTLY ON A PRODUCTION CLUSTER BEFORE TRYING IT IN YOUR DEVELOPMENT CLUSTER.

1) Stop Knox


Syntax: curl -u admin:admin -H 'X-Requested-By: ambari' -i -X PUT -d '{"RequestInfo": {"context": "Stop KNOX REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' http://<AMBARI_IP>:8080/api/v1/clusters/<Cluster_Name>/services/KNOX


[root@bdavm621 nisan]# curl -u admin:admin -H 'X-Requested-By: ambari' -i -X PUT -d '{"RequestInfo": {"context": "Stop KNOX REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX
HTTP/1.1 202 Accepted
User: admin
Set-Cookie: AMBARISESSIONID=1th1qyghm8v7vu39twhsrtpli;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Vary: Accept-Encoding, User-Agent
Content-Length: 148
Server: Jetty(8.1.17.v20150415)

{
  "href" : "http://bdavm621.com:8080/api/v1/clusters/BI_41/requests/179",
  "Requests" : {
    "id" : 179,
    "status" : "Accepted"
  }
}[root@bdavm621 nisan]#


2) Check that KNOX is stopped

Syntax: curl -u admin:admin -H "X-Requested-by:ambari" -i -k -X GET http://<AMBARI_IP>:8080/api/v1/clusters/<Cluster_Name>/services/KNOX

If the component is stopped, the state will be INSTALLED ("state" : "INSTALLED") in the output.


[root@bdavm621 nisan]# curl -u admin:admin -H "X-Requested-by:ambari" -i -k -X GET http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX
HTTP/1.1 200 OK
User: admin
Set-Cookie: AMBARISESSIONID=1x0zxckld3asbozro4jmw6c55;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Vary: Accept-Encoding, User-Agent
Content-Length: 1039
Server: Jetty(8.1.17.v20150415)

{
  "href" : "http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX",
  "ServiceInfo" : {
    "cluster_name" : "BI_41",
    "maintenance_state" : "OFF",
    "service_name" : "KNOX",
    "state" : "INSTALLED"
  },
  "alerts_summary" : {
    "CRITICAL" : 1,
    "MAINTENANCE" : 0,
    "OK" : 0,
    "UNKNOWN" : 0,
    "WARNING" : 0
  },
  "alerts" : [
    {
      "href" : "http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX/alerts/251",
      "Alert" : {
        "cluster_name" : "BI_41",
        "definition_id" : 101,
        "definition_name" : "knox_gateway_process",
        "host_name" : "bdavm621.com",
        "id" : 251,
        "service_name" : "KNOX"
      }
    }
  ],
  "components" : [
    {
      "href" : "http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX/components/KNOX_GATEWAY",
      "ServiceComponentInfo" : {
        "cluster_name" : "BI_41",
        "component_name" : "KNOX_GATEWAY",
        "service_name" : "KNOX"
      }
    }
  ],
  "artifacts" : [ ]
}[root@bdavm621 nisan]#
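
If you only want the state field, a quick filter over the same REST call (a convenience sketch) is enough:

curl -s -u admin:admin -H "X-Requested-by:ambari" -X GET http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX | grep '"state"'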

3) Remove KNOX

Take a backup of the Knox config files under /etc/knox/conf.

Ensure all the repo files under /etc/yum.repos.d are able to access their corresponding repositories, and check that a repository provides the KNOX packages. If you are using a local mirror repository, you need to start the HTTP server (with apachectl start). This ensures that after you remove KNOX, you are able to install it again.
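
For example (a sketch - the backup location is arbitrary):

# Back up the Knox configuration
cp -rp /etc/knox/conf /tmp/knox_conf_backup_$(date +%Y%m%d)

# Confirm the repositories are reachable and still provide the Knox package
yum repolist
yum list all | grep -i knox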

If all the above prerequisites are met, remove KNOX using the below command.

Syntax: curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://<AMBARI_IP>:8080/api/v1/clusters/<Cluster_Name>/services/KNOX

[root@bdavm621 nisan]# curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX
[root@bdavm621 nisan]#

Log in to the Ambari UI to confirm KNOX is removed from the UI.

4) Remove the KNOX RPMs

Log in to the node where KNOX is installed and remove the Knox RPMs.

You can use any of the below commands to search for the KNOX RPMs.

Syntax for searching RPMs:
yum list installed | grep knox
rpm -qa 'knox*'
yum list installed | grep @<repoName>

[root@bdavm621 nisan]# yum list installed | grep knox
knox_4_1_0_0.noarch                        0.6.0_TEST_23.4.1.0.0-3.el6  @TEST-4.1



Syntax to remove RPMs: yum remove <rpm_name>

[root@bdavm621 nisan]# yum remove knox_4_1_0_0.noarch

Run the clean all command:

[root@bdavm621 nisan]# yum clean all

5) Install KNOX from the Ambari UI

Log back in to the Ambari UI --> Actions --> Add Service Wizard, select KNOX, and install the service again.
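
After the wizard completes, the same REST check from step 2 can be reused to confirm the service is back (once it is running, the state should show STARTED):

curl -u admin:admin -H "X-Requested-by:ambari" -i -k -X GET http://bdavm621.com:8080/api/v1/clusters/BI_41/services/KNOX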

Thus, we are able to reinstall KNOX. You can try similar steps to reinstall other services in Ambari.