apache spark - VectorAssembler output only to DenseVector?

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

There is something very annoying about how VectorAssembler works. I am transforming a set of columns into a single column of vectors and then applying StandardScaler to scale the included features. However, it seems that Spark, for memory reasons, decides row by row whether to use a DenseVector or a SparseVector to represent the features. But when I use StandardScaler, SparseVector input is invalid; only DenseVectors are allowed. Does anybody know a solution to that?

Edit: I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:30:37+0000

You're right that VectorAssembler chooses between dense and sparse output based on whichever representation uses less memory.

You don't need a UDF to convert a SparseVector to a DenseVector; just use the toArray() method:

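from pyspark.ml.linalg import SparseVector, DenseVector

# Sparse vector of size 4 with non-zero entries at indices 1 and 3
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() returns a NumPy array; wrapping it gives a DenseVector
b = DenseVector(a.toArray())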

Also, StandardScaler accepts SparseVector input unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the sparse vector won't be sparse any more.
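
To make the de-meaning point concrete, here is a small pure-Python sketch (no Spark needed). It reuses the size, indices, and values of the vector from the snippet above; the `center` helper and the per-feature means are hypothetical, picked only for illustration, and this is not how Spark implements centering internally.

```python
# Illustration only: plain Python, not Spark's implementation.
def center(size, indices, values, mean):
    """Subtract a per-feature mean from a sparse vector given as (size, indices, values)."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    # Every position gets a mean subtracted, including the implicit zeros.
    return [x - m for x, m in zip(dense, mean)]

# Same vector as above: size 4, non-zeros at positions 1 and 3.
size, indices, values = 4, [1, 3], [3.0, 4.0]
# Hypothetical per-feature means (assumed values, not computed from real data).
mean = [0.5, 1.0, 0.25, 2.0]

centered = center(size, indices, values, mean)
nonzero_before = len(indices)
nonzero_after = sum(1 for x in centered if x != 0.0)

print(centered)                       # [-0.5, 2.0, -0.25, 2.0]
print(nonzero_before, nonzero_after)  # 2 4
```

After centering, all four positions hold a non-zero value, so a sparse representation no longer saves anything.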
