pyspark - Spark Matrix multiplication with python

Question

Welcome To Ask or Share your Answers For Others

pyspark - Spark Matrix multiplication with python

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

pyspark - Spark Matrix multiplication with python

I am trying to do matrix multiplication using Apache Spark and Python.

Here is my data

from pyspark.mllib.linalg.distributed import RowMatrix

My RDD of vectors

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My maxtrix

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to process the matrix multiplication but I'm afraid to have a long processing time. Here is my function:

def matrix_multiply(df1, df2):
    nb_row = df1.count()    
    mat=[]
    for i in range(0, nb_row):
        row=list(df1.filter(df1['index']==i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)): 
                r_value += row[c] * col[c]
            row_out.append(r_value)            
        mat.append(row_out)
    return mat

My function make a lot of spark actions (take, collect, etc.). Does the function will take a lot of processing time? If someone have another idea it will be helpful for me.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:53:50+0000

You cannot. Since RowMatrix has no meaningful row indices it cannot be used for multiplications. Even ignoring that the only distributed matrix which supports multiplication with another distributed structure is BlockMatrix.

from pyspark.mllib.linalg.distributed import *

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))

Categories

pyspark - Spark Matrix multiplication with python

pyspark - Spark Matrix multiplication with python

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags