Wednesday, November 18, 2015

Creating Ambari Hive Views in IBM Open Platform

This blog describes how to create an Ambari Hive view in IBM Open Platform.

Step 1: Log in to the Ambari UI, then go to HDFS --> Configs --> Advanced --> Custom core-site

Add or update the following properties:
hadoop.proxyuser.hive.groups=*
hadoop.proxyuser.root.groups=*
hadoop.proxyuser.root.hosts=*

Save the configuration and restart HDFS.
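
Once HDFS is back up, you can optionally confirm that the properties landed in the Ambari-managed core-site.xml. This is just a quick check; the config path below is the usual location and may differ on your install:

[root@rvm ~]# grep -A1 "hadoop.proxyuser" /etc/hadoop/conf/core-site.xml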


Step 2: Go to Hive --> Configs --> Settings --> Security and set "Run as end user instead of Hive user" to false.


 
This internally sets the property hive.server2.enable.doAs to false. Save the configuration and restart Hive.
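
To confirm the setting on the running HiveServer2, you can query it through Beeline; the host and user below are placeholders for your environment:

[root@rvm ~]# beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n <user> -e "set hive.server2.enable.doAs;"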

Step 3: Create the /user/admin directory in HDFS and give it the proper ownership.

[root@rvm ~]# su - hdfs
[hdfs@rvm ~]$ hadoop fs -mkdir /user/admin
[hdfs@rvm ~]$ hadoop fs -chown admin:hadoop /user/admin
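
Optionally, verify the ownership before moving on:

[hdfs@rvm ~]$ hadoop fs -ls /user | grep admin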
 


Step 4:  Click on Admin --> Manage Ambari --> Views --> Hive --> Create Instance

Create the instance with the connection details for your cluster and save.
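
The exact fields depend on the Ambari version, but in a typical IOP setup the instance details look roughly like this (the instance name and hosts below are placeholders):

Instance Name          : HIVE1
Display Name           : Hive View
Scripts HDFS Directory : /user/${username}/hive/scripts
Jobs HDFS Directory    : /user/${username}/hive/jobs
WebHDFS FileSystem URI : webhdfs://<namenode-host>:50070
HiveServer2 Thrift host: <hiveserver2-host>
HiveServer2 Thrift port: 10000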


Step 5: Go to the new Hive view instance and run your queries.
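
Any simple statement works as a first test; for example (the table name here is just an illustration):

show databases;
create table hive_view_test (id int, name string);
show tables;
select * from hive_view_test;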





You can now create and use Hive views from Ambari.

Wednesday, November 11, 2015

Building BigSQL Ad Hoc Application for running Ad Hoc Queries using IBM BigInsights

This blog describes how to build a BigInsights application that runs ad hoc Big SQL queries. You can schedule the application so that it runs at fixed intervals.

Step 1: Create a BigInsights project in Eclipse

Refer to http://big-analytics.blogspot.com.au/2015/07/building-web-crawler-in-ibm-biginsights.html for the steps to install the Text Analytics plugin in Eclipse.






Step 2:  Update the Oozie workflow.

Here, I create a temp folder on the Linux filesystem and write the query into a SQL script inside it. The script is then passed to the JSQSH client, which runs the query. A hand-run sketch of these steps follows the workflow definition below.

Update the /BigSQLAdhocQuery/BIApp/workflow/workflow.xml

<workflow-app name="wfapp" xmlns="uri:oozie:workflow:0.4">
    <start to="create-folder"/>
    <action name="create-folder">
        <shell xmlns="uri:oozie:shell-action:0.2" >
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>bash</exec>
            <argument>-c</argument>
               <argument>/bin/mkdir -m 777 /tmp/bigSQLAdhocQuery_${wf:id()}</argument>
            <capture-output/>
        </shell>
         <ok to="create-query-file"/>
        <error to="failed-job-cleanup"/>
    </action>
    <action name="create-query-file">
        <shell xmlns="uri:oozie:shell-action:0.2" >
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>bash</exec>
            <argument>-c</argument>
               <argument>/bin/echo "${query}" >> /tmp/bigSQLAdhocQuery_${wf:id()}/test.sql</argument>
            <capture-output/>
        </shell>
        <ok to="run-jsqsh"/>
        <error to="failed-job-cleanup"/>
    </action>
    <action name="run-jsqsh">
        <shell xmlns="uri:oozie:shell-action:0.2" >
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>bash</exec>
            <argument>-c</argument>
               <argument>$JSQSH_HOME/bin/jsqsh -U ${user} -P ${password} -i /tmp/bigSQLAdhocQuery_${wf:id()}/test.sql -n -e ${connectionName}</argument>
            <capture-output/>
        </shell>
        <ok to="cleanup"/>
        <error to="failed-job-cleanup"/>
    </action>
   
    <action name="failed-job-cleanup">
        <shell xmlns="uri:oozie:shell-action:0.2" >
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>bash</exec>
            <argument>-c</argument>
               <argument>rm -rf /tmp/bigSQLAdhocQuery_${wf:id()}</argument>
            <capture-output/>
        </shell>
        <ok to="fail"/>
        <error to="fail"/>
    </action>
   
    <action name="cleanup">
        <shell xmlns="uri:oozie:shell-action:0.2" >
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>bash</exec>
            <argument>-c</argument>
               <argument>rm -rf /tmp/bigSQLAdhocQuery_${wf:id()}</argument>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="end"/>
    </action>
   
    <kill name="fail">
        <message>error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
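
Outside Oozie, the three shell actions above boil down to something like the following. The workflow ID, query, credentials, and connection name are placeholders here; at run time Oozie substitutes ${wf:id()} and the submitted parameters, and $JSQSH_HOME must point at the JSqsh install on the compute node:

WF_ID=test_run_001
/bin/mkdir -m 777 /tmp/bigSQLAdhocQuery_${WF_ID}
/bin/echo "select * from myschema.mytable" >> /tmp/bigSQLAdhocQuery_${WF_ID}/test.sql
$JSQSH_HOME/bin/jsqsh -U bigsql -P mypassword -i /tmp/bigSQLAdhocQuery_${WF_ID}/test.sql -n -e bigsql
rm -rf /tmp/bigSQLAdhocQuery_${WF_ID}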


Step 3:  Update the /BigSQLAdhocQuery/BIApp/workflow/config-default.xml

<configuration>
    <property>
        <name>connectionName</name>
        <value>null</value>
    </property>
    <property>
        <name>user</name>
        <value>null</value>
    </property>
    <property>
        <name>password</name>
        <value>null</value>
    </property>
    <property>
        <name>query</name>
        <value>null</value>
    </property>
</configuration>
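
The null values above are just placeholders; the BigInsights console supplies the real values when the application is run. If you want to exercise the workflow directly with the Oozie CLI instead, a job.properties along these lines would do (hosts, ports, paths, and credentials are illustrative):

nameNode=hdfs://rvm.test.com:8020
jobTracker=rvm.test.com:8050
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/biadmin/apps/BigSQLAdhocQuery/workflow
connectionName=bigsql
user=bigsql
password=changeme
query=select count(*) from myschema.mytable

oozie job -oozie http://rvm.test.com:11000/oozie -config job.properties -run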



Step 4:  Update the /BigSQLAdhocQuery/BIApp/application/application.xml

<application-template xmlns="http://biginsights.ibm.com/application" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <name>BigSQLAdhocQuery</name>
    <properties>
        <property defaultValue="bigsql" isInputPath="false" isOutputPath="false" isRequired="true" label="JSQSHConnectionName" name="connectionName" paramtype="STRING" uitype="textfield"/>
        <property isInputPath="false" isOutputPath="false" isRequired="true" label="User" name="user" paramtype="STRING" uitype="textfield"/>
        <property isInputPath="false" isOutputPath="false" isRequired="true" label="Password" name="password" paramtype="PASSWORD" uitype="textfield"/>
        <property isInputPath="false" isOutputPath="false" isRequired="true" label="Query" name="query" paramtype="TEXTAREA" uitype="textfield"/>
    </properties>
    <assets>
        <asset id="BigSQLAdhocQuery" type="WORKFLOW"/>
    </assets>
    <imagePath>defaultApp_L.png</imagePath>
</application-template>


Step 5: Publish the BigInsights Application.

Refer to http://big-analytics.blogspot.com.au/2015/07/building-web-crawler-in-ibm-biginsights.html for the steps to publish the application.

Step 6: Deploy the application from the BigInsights web console and run it from the console.

You can run Big SQL v1 or Big SQL queries by creating a JSQSH connection and passing its name as the connectionName input.

You can provide ad hoc Big SQL queries as the Query input and schedule the application to run at fixed intervals.
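
For example, a query along these lines (the table name is just an illustration) can be pasted into the Query field and scheduled:

SELECT jobrole, COUNT(*) AS headcount
FROM myschema.employees
GROUP BY jobrole;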




Tuesday, October 27, 2015

Inspecting Apache Parquet file

We had a customer issue where queries against a Parquet file from Hive were failing. We later found that problems with the Parquet file itself were causing the error. This blog describes how to examine a Parquet file.

We use the Apache parquet-tools utility to inspect the Parquet file. You can download parquet-tools-1.6.0.jar from the Apache Parquet project (the jar is published on Maven Central).




1) Getting the metadata from the Parquet file

Command: hadoop jar <parquet-tools-x.jar> meta <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar meta /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet

WARNING: Use "yarn jar" to launch YARN applications.
file: hdfs://rvm.test.com:8020/user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
creator: parquet-mr (build 27f71a18579ebac6db2b0e9ac758d64288b6dbff)
extra: avro.schema = {"type":"record","name":"employeet","namespace":"bigsql","doc":"bigsql.employeet","fields":[{"name":"ID","type":["null","int"],"default":null,"columnName":"ID","sqlType":"4"},{"name":"NAME","type":["null","string"],"default":null,"columnName":"NAME","sqlType":"12"},{"name":"JOBROLE","type":["null","string"],"default":null,"columnName":"JOBROLE","sqlType":"12"}],"tableName":"bigsql.employeet"}

file schema: bigsql.employeet
--------------------------------------------------------------------------------
ID: OPTIONAL INT32 R:0 D:1
NAME: OPTIONAL BINARY O:UTF8 R:0 D:1
JOBROLE: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:4 TS:163 OFFSET:4
--------------------------------------------------------------------------------
ID: INT32 SNAPPY DO:0 FPO:4 SZ:41/39/0.95 VC:4 ENC:PLAIN,BIT_PACKED,RLE
NAME: BINARY SNAPPY DO:0 FPO:45 SZ:56/70/1.25 VC:4 ENC:PLAIN,BIT_PACKED,RLE
JOBROLE: BINARY SNAPPY DO:0 FPO:101 SZ:48/54/1.13 VC:4 ENC:PLAIN,BIT_PACKED,RLE
[hdfs@rvm sqoop]$

Here:
RC = row count and VC = value count.
SZ:{x}/{y}/{z} - x = compressed size, y = uncompressed size, z = y:x ratio. For example, the NAME column's SZ:56/70/1.25 means 56 bytes compressed and 70 bytes uncompressed, for a ratio of 70/56 = 1.25.
The metadata also shows the compression codec (SNAPPY here) and the encodings used for each column.

You can read about the Parquet file format at https://parquet.apache.org/documentation/latest/



Parquet file layout graphic sourced from http://tinyurl.com/o22gtck

2) Getting the schema from the Parquet file

Command: hadoop jar <parquet-tools-x.jar> schema <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar schema /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
message bigsql.employeet {
optional int32 ID;
optional binary NAME (UTF8);
optional binary JOBROLE (UTF8);
}

[hdfs@rvm sqoop]$



3) Displaying the contents of the Parquet file

Command: hadoop jar <parquet-tools-x.jar> cat <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar cat /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
15/10/27 18:49:43 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2

ID = 1
NAME = nisanth
JOBROLE = dev

ID = 4
NAME = nisanth3
JOBROLE = dev3

ID = 2
NAME = nisanth1
JOBROLE = dev1

[hdfs@rvm sqoop]$



4) Getting the first few records

Command: hadoop jar <parquet-tools-x.jar> head -n <noOfRecords> <parquetFile>

[hdfs@rvm sqoop]$ hadoop jar /opt/nish/parquet-tools-1.6.0.jar head -n 2 /user/biadmin/Par/15ebcea5-50d3-441a-a79f-7314d691585f.parquet
WARNING: Use "yarn jar" to launch YARN applications.
15/10/27 18:50:54 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
ID = 3
NAME = nisanth2
JOBROLE = dev2

ID = 1
NAME = nisanth
JOBROLE = dev

[hdfs@rvm sqoop]$


These inspections help you understand the number of records, the compression and encodings used, and the column metadata when debugging Parquet-related issues.