Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


RDD vs Pandas Dataframe vs Direct Read to create Spark DataFrame

To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.

I experimented with all three methods:

Spark: Standalone Mode
using pyspark.sql module

Method1: Reading a text/csv file into pandas and passing the pandas DataFrame to create a Spark DataFrame.

 df3=spark.createDataFrame(pandas_df)

Method2: I created an RDD by passing the text file to sc.textFile, then used this RDD to create a Spark DataFrame.

df3=spark.createDataFrame(RDD_list, StringType())

Method3: Reading directly from raw data to create a Spark DataFrame.

df3=spark.read.text("Data/bookpage.txt")

What I have observed:

  1. The number of default partitions differs across the three cases:
     Method1 (pandas)          - 8 (I have 8 cores)
     Method2 (RDD)             - 2
     Method3 (direct raw read) - 1

  2. Conversion path:
     Method1: Raw Data => Pandas DF => Spark DataFrame
     Method2: Raw Data => RDD => Spark DataFrame
     Method3: Raw Data => Spark DataFrame

Questions:

  1. Which method is more efficient?
  2. Since everything in Spark is implemented at the RDD level, does creating the RDD explicitly in Method2 make it more efficient?
  3. Why are the default partition counts different for the same data?


1 Answer

Waiting for an expert to reply.
