Tuesday, May 12, 2015

Trouble Connecting Apache Spark with HBase due to missing classes

When you try to connect Apache Spark to HBase, in most cases it throws exceptions like ImmutableBytesWritableToStringConverter not found or Google Guava classes not found, along with various other errors at runtime.

Almost all of these belong to the same family of problems: missing jars on the classpath.


The straightforward fix:


Just go to spark-defaults.conf

and update spark.driver.extraClassPath with the required libraries, adding jars as new missing classes turn up.

For example, for the missing ImmutableBytesWritableToStringConverter, add spark-examples-1.3.1-hadoop2.4.0.jar.


spark.driver.extraClassPath /Users/abhishekchoudhary/anaconda/anaconda/lib/python2.7/site-packages/graphlab/graphlab-create-spark-integration.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-server-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-protocol-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-hadoop2-compat-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-client-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-common-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/htrace-core-2.04.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-examples-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-assembly-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/guava-12.0.1.jar
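Maintaining that long colon-separated line by hand is tedious and error-prone. A small Python helper can assemble it for you — this is just a sketch, and the directory paths below are placeholders for wherever your own HBase and Spark jars live:

```python
import glob
import os

def build_extra_classpath(jar_dirs):
    """Collect every .jar under the given directories into a
    colon-separated string, as spark.driver.extraClassPath expects."""
    jars = []
    for d in jar_dirs:
        jars.extend(sorted(glob.glob(os.path.join(d, "*.jar"))))
    return ":".join(jars)

# Placeholder directories -- substitute your own hbase/lib and spark/lib paths.
classpath = build_extra_classpath(["/path/to/hbase/lib", "/path/to/spark/lib"])
print("spark.driver.extraClassPath " + classpath)
```

Paste the printed line into spark-defaults.conf. Note that the config above hand-picks specific jars; pulling in every jar from a directory is convenient but can introduce version conflicts, so prune as needed.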




And one more thing: accessing HBase from Spark is actually ultra fast, which makes real-time updates practical.


Sunday, May 10, 2015

HBase ignores commas while bulk loading with ImportTsv

HBase silently mangles text while importing a CSV file, and the worst part is that it doesn't even tell you.
The entire job passes, but your HBase table ends up with no data, or only partial data. For example, if a column has a value like

"this text can be uploaded , but it has more", then everything up to "uploaded" lands in the HBase table cell, and the rest of the contents are gone.
This happened because I was importing with the separator set to comma (,), which led the import engine to treat commas inside a CSV cell as column delimiters.
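One way around this (a sketch, not part of the original workflow) is to rewrite the CSV as a genuine tab-separated file with Python's csv module first, so ImportTsv can use its default tab separator and commas inside quoted fields survive intact:

```python
import csv

def csv_to_tsv(src_path, dst_path):
    """Re-serialize a CSV file as TSV. The csv module honors quoting,
    so a quoted field like "this text can be uploaded, but it has more"
    stays in one column instead of being split at the comma."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            writer.writerow(row)
```

The converted file can then be bulk loaded without the -Dimporttsv.separator="," flag. One caveat: ImportTsv does no quote handling of its own, so fields containing literal tabs or newlines would still need to be escaped or stripped before loading.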



It took 32 YARN jobs to figure out the actual issue.

Import CSV commands -


create 'bigdatatwitter','main','detail','other'


hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,other:contributors,main:truncated,main:text,main:in_reply_to_status_id,main:id,main:favorite_count,main:source,detail:retweeted,detail:coordinates,detail:entities,detail:in_reply_to_screen_name,detail:in_reply_to_user_id,detail:retweet_count,detail:id_str,detail:favorited,detail:retweeted_status,other:user,other:geo,other:in_reply_to_user_id_str,other:possibly_sensitive,other:lang,detail:created_at,other:in_reply_to_status_id_str,detail:place,detail:metadata" -Dimporttsv.separator="," bigdatatwitter file:///Users/abhishekchoudhary/PycharmProjects/DeepLearning/AllMix/bigdata3.csv