Matrix Multiplication on Hadoop MapReduce

Matrix multiplication is a problem which inherently doesn’t fit to mapReduce programming model as it can’t be divided and conquered.

Matrix multiplication is an important step in many m/c learning algorithms. Mahout library provides an implementation of matrix multiplication over hadoop. The problem with that implementation is that it starts only single mapper task as it uses CompositeInputFormat.

In order to calculate document similarity we had to perform matrix multiplication of order [6000,300] and [300,25000]. When this was done over mahout it took lot of time.

Thus we implemented over own logic for the same.

Here’s the steps to perform matrix multiplication :

Input :

1.  Path 1, of sequential file where key is of type IntWritable and and value is of type VectorWritable[please check mahout library for reference] representing first matrix.

2.  Path 2, of sequential file where key is of type IntWritable and and value is of type VectorWritable[please check mahout library for reference] representing second matrix.

Logic :

If we transpose the second matrix then it is essentially a cartesian product between two files. For example consider M1 = [{1,2,},{3,4},{,5,6}] and M2=[{A,B,C},{D,E,F}] then M1M2 = [{1A+2D,1B+2E,1C+2F},{3A+4D,3B+4E,3C+4F},{5A+6D,5B+6E,5C+6F}]

now M2′ = [{A,D},{B,E},{C,F}]

One can perform Cartesian Product between M1and M2′ to achieve at the same result.

Steps :

1. Perform transpose of second file. Reference implementation can be found at  :

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.4/org/apache/mahout/math/hadoop/TransposeJob.java

2. Use CartesianInputFormat and CartesianRecordReader to calculate the input splits in order to parallelize cartesian product. The reference can be found at https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5 [From : MapReduce Design Patterns]

It actually picks the inputSplits from two input files and create a list mapping each left side input split with right side. So if first file has 3 splits and second has 4 then we will have 3*4=12 splits. Thus we will have 12 mappers.

3. Write a mapper which takes the two vectors and multiply each list index item and add them up. Emit the key as left side file’s key and value as Pair of right side key and actual cell value.

4. Write a reducer which now converts the pair object into VectorWritable object.

Job Configuration will be like :

job.setMapperClass(CartesianMultiplicationMapper.class);

job.setInputFormat(CartesianInputFormat.class);

CartesianInputFormat.setLeftInputInfo(job, SequenceFileInputFormat.class,
                 “path1”);
        CartesianInputFormat.setRightInputInfo(job, SequenceFileInputFormat.class,
                 “path2”);

        SequenceFileOutputFormat.setOutputPath(job, new Path(“cartOutput));

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(VectorWritable.class);

Will provide actual implementation on github.

Advertisements