Thread-safe access to HBase tables using HTablePool and Spring HBase

Accessing an HBase table through an HTable instance is not thread-safe.

To access HTable instances from multiple threads, either HTablePool or HbaseTemplate should be used. The latter is the DAO-style support provided by Spring; HTablePool is the pooling mechanism suggested by the Apache HBase client library.

Let's first discuss the HTablePool approach.

HTablePool is not just a simple pool of HTable objects; it can also manage thread-local instances. The pool can be initialized in either Reusable or ThreadLocal mode. By supporting ThreadLocal mode, it takes away the pain of managing ThreadLocal objects in application code.

To initialize the HTablePool with a thread-local pool, use the constructor that accepts a PoolType. When PoolType is set to ThreadLocal, the pool binds the table resource to the thread from which it is requested.
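A minimal sketch of that initialization (assuming an HBase 0.9x-era client, where HTablePool and PoolMap.PoolType are available; the pool size of 10 is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.util.PoolMap.PoolType;

Configuration conf = HBaseConfiguration.create();
// Each calling thread gets its own HTable reference for a given table name.
HTablePool pool = new HTablePool(conf, 10, PoolType.ThreadLocal);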

This approach has been suggested as the de facto way to access HTable, but it forces us to write boilerplate code around the pool: getting the HTable, closing the resources, and handling checked exceptions. In short, we miss the support of Spring DAO.
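That boilerplate looks roughly like this (a sketch; the table name, column family, and qualifier are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

HTableInterface table = null;
try {
    table = pool.getTable("myTable");                  // placeholder table name
    Put put = new Put(Bytes.toBytes("rowKey"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    table.put(put);
} catch (IOException e) {
    // checked exception must be handled or wrapped by the caller
    throw new RuntimeException(e);
} finally {
    if (table != null) {
        table.close();                                 // returns the table to the pool
    }
}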

The same thing can be achieved using the Spring Data Hadoop HBase module. The Spring Data module provides the HbaseTemplate class, which is thread-safe in nature. It encapsulates all the boilerplate and provides Spring's well-known exception translation. Example usage can be found at https://github.com/SpringSource/spring-data-book/tree/master/hadoop/hbase. That example has one issue in application-context.xml: the ZooKeeper properties need to be set on the HBaseConfiguration object, and this is missing from the example.
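A sketch of configuring HbaseTemplate programmatically with the ZooKeeper properties set (the quorum host is a placeholder; the same two properties can equally be set on the configuration bean in application-context.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.springframework.data.hadoop.hbase.HbaseTemplate;

Configuration config = HBaseConfiguration.create();
// These are the properties missing from the sample application-context.xml.
config.set("hbase.zookeeper.quorum", "zk-host");                  // placeholder host
config.set("hbase.zookeeper.property.clientPort", "2181");        // default ZooKeeper port

HbaseTemplate hbaseTemplate = new HbaseTemplate();
hbaseTemplate.setConfiguration(config);
// hbaseTemplate.get(...)/execute(...) calls can now reach the cluster.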

The only issue with this approach is that it keeps creating and destroying HTable objects with every method call, which negates the benefit of pooling. To avoid recreating HTable objects, one should use Spring's HbaseSynchronizationManager. It binds the HTable to the calling thread, introducing the concept of thread-local objects. Each subsequent call made through HbaseTemplate is aware of the bound table and will use it instead of retrieving a new instance. The binding can be set manually or through interceptors (AOP) using HbaseInterceptor. A manual-binding example can be found in TestTemplate. Using the interceptor may affect performance.
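A rough sketch of the manual binding, modeled on the TestTemplate example; the HbaseSynchronizationManager method names below are assumptions, so verify them against your spring-data-hadoop version:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.springframework.data.hadoop.hbase.HbaseSynchronizationManager;

// Bind a table to the current thread before a batch of HbaseTemplate calls.
// NOTE: bindResource/unbindResource are assumed from the TestTemplate example.
HTableInterface table = new HTable(config, "myTable");        // "myTable" is a placeholder
HbaseSynchronizationManager.bindResource("myTable", table);
try {
    // ... calls through hbaseTemplate reuse the bound table instead of creating new ones ...
} finally {
    HbaseSynchronizationManager.unbindResource("myTable");
    table.close();
}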

The Spring way provides the same benefits we get from using Spring DAO support for an RDBMS.


Matrix Multiplication on Hadoop MapReduce

Matrix multiplication is a problem that does not naturally fit the MapReduce programming model, as it is not easily divided and conquered.

Matrix multiplication is an important step in many machine learning algorithms. The Mahout library provides an implementation of matrix multiplication over Hadoop. The problem with that implementation is that it starts only a single mapper task, because it uses CompositeInputFormat.

In order to calculate document similarity, we had to multiply matrices of order [6000, 300] and [300, 25000]. When this was done with Mahout, it took a lot of time.

So we implemented our own logic for it.

Here are the steps to perform the matrix multiplication:

Input :

1. Path 1: a sequence file where the key is of type IntWritable and the value is of type VectorWritable [please check the Mahout library for reference], representing the first matrix.

2. Path 2: a sequence file where the key is of type IntWritable and the value is of type VectorWritable, representing the second matrix.
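For reference, such an input file can be produced roughly like this (a sketch using Hadoop's SequenceFile writer and Mahout's DenseVector; the path and values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("matrix1.seq");                      // placeholder path

SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, VectorWritable.class);
try {
    // Each record is one row of the matrix: row index -> row vector.
    writer.append(new IntWritable(0), new VectorWritable(new DenseVector(new double[] {1, 2})));
    writer.append(new IntWritable(1), new VectorWritable(new DenseVector(new double[] {3, 4})));
    writer.append(new IntWritable(2), new VectorWritable(new DenseVector(new double[] {5, 6})));
} finally {
    writer.close();
}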

Logic :

If we transpose the second matrix, the problem essentially becomes a Cartesian product between the two files. For example, consider M1 = [{1,2},{3,4},{5,6}] and M2 = [{A,B,C},{D,E,F}]; then M1M2 = [{1A+2D,1B+2E,1C+2F},{3A+4D,3B+4E,3C+4F},{5A+6D,5B+6E,5C+6F}].

Now M2′ = [{A,D},{B,E},{C,F}].

One can perform a Cartesian product between the rows of M1 and the rows of M2′ to arrive at the same result: entry (i, j) of M1M2 is simply the dot product of row i of M1 with row j of M2′.
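As a plain-Java illustration of that observation (not part of the MapReduce job itself):

// Plain-Java illustration: product[i][j] = dot(row i of M1, row j of M2 transposed).
double[][] m1  = { {1, 2}, {3, 4}, {5, 6} };          // 3 x 2
double[][] m2t = { {10, 40}, {20, 50}, {30, 60} };    // transpose of a 2 x 3 matrix

double[][] product = new double[m1.length][m2t.length];
for (int i = 0; i < m1.length; i++) {
    for (int j = 0; j < m2t.length; j++) {
        double dot = 0;
        for (int k = 0; k < m1[i].length; k++) {
            dot += m1[i][k] * m2t[j][k];
        }
        product[i][j] = dot;
    }
}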

Steps :

1. Perform a transpose of the second file. A reference implementation can be found at:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.4/org/apache/mahout/math/hadoop/TransposeJob.java

2. Use CartesianInputFormat and CartesianRecordReader to calculate the input splits in order to parallelize the Cartesian product. The reference can be found at https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5 [from MapReduce Design Patterns].

It picks the input splits from the two input files and creates a list mapping each left-side input split to each right-side split. So if the first file has 3 splits and the second has 4, we will have 3*4 = 12 splits, and thus 12 mappers.

3. Write a mapper which takes the two vectors, multiplies them element by element, and sums the products (a dot product). Emit the left-side file's key as the output key, and a pair of the right-side key and the computed cell value as the output value. A sketch of the mapper and reducer follows after these steps.

4. Write a reducer which collects the pairs for each row and converts them into a VectorWritable object.
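A rough sketch of steps 3 and 4 is below. It assumes the Cartesian record reader hands each mapper the paired records as Text in the form "rowIndex<TAB>comma-separated-values"; the actual implementation (to be published, as noted at the end) works directly with IntWritable/VectorWritable:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// CartesianMultiplicationMapper.java
// Each call receives one row of M1 (key) and one row of M2' (value),
// both serialized here as "rowIndex\tv1,v2,..." for the sake of the sketch.
public class CartesianMultiplicationMapper extends MapReduceBase
        implements Mapper<Text, Text, IntWritable, Text> {

    public void map(Text leftRecord, Text rightRecord,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        String[] left = leftRecord.toString().split("\t");
        String[] right = rightRecord.toString().split("\t");
        String[] leftValues = left[1].split(",");
        String[] rightValues = right[1].split(",");

        // Dot product of the M1 row with the M2' row gives one cell of the result.
        double dot = 0;
        for (int k = 0; k < leftValues.length; k++) {
            dot += Double.parseDouble(leftValues[k]) * Double.parseDouble(rightValues[k]);
        }
        // key = row index in M1, value = "columnIndex,cellValue"
        output.collect(new IntWritable(Integer.parseInt(left[0])),
                       new Text(right[0] + "," + dot));
    }
}

// CartesianMultiplicationReducer.java (separate source file)
// Assembles all (columnIndex, cellValue) pairs of a row into a VectorWritable.
public class CartesianMultiplicationReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, VectorWritable> {

    private final int numColumns = 25000;   // assumed width of the result matrix

    public void reduce(IntWritable row, Iterator<Text> values,
                       OutputCollector<IntWritable, VectorWritable> output, Reporter reporter)
            throws IOException {
        Vector rowVector = new DenseVector(numColumns);
        while (values.hasNext()) {
            String[] parts = values.next().toString().split(",");
            rowVector.set(Integer.parseInt(parts[0]), Double.parseDouble(parts[1]));
        }
        output.collect(row, new VectorWritable(rowVector));
    }
}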

The job configuration will look like this:

job.setMapperClass(CartesianMultiplicationMapper.class);

job.setInputFormat(CartesianInputFormat.class);

// Left and right inputs are the sequence files holding M1 and the transposed M2.
CartesianInputFormat.setLeftInputInfo(job, SequenceFileInputFormat.class, "path1");
CartesianInputFormat.setRightInputInfo(job, SequenceFileInputFormat.class, "path2");

SequenceFileOutputFormat.setOutputPath(job, new Path("cartOutput"));

job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(VectorWritable.class);

We will provide the actual implementation on GitHub.