Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


RDD vs Pandas Dataframe vs Direct Read to create Spark DataFrame

To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a pandas DataFrame.

I experimented with all three methods:

Spark: Standalone Mode
using pyspark.sql module

Method1: Reading a text/csv file into pandas and passing the pandas DataFrame to create a Spark DataFrame.

 df3=spark.createDataFrame(pandas_df)

Method2: I created an RDD by passing the text file to sc.textFile, then used this RDD to create a Spark DataFrame.

df3=spark.createDataFrame(RDD_list, StringType())

Method3: Reading directly from raw data to create a Spark DataFrame.

df3=spark.read.text("Data/bookpage.txt")

What I have observed:

  1. The number of default partitions differs across the three cases:
     Method1 (pandas)          - 8 (I have 8 cores)
     Method2 (RDD)             - 2
     Method3 (direct raw read) - 1

  2. Conversion path:
     Method1: Raw Data => Pandas DF => Spark DataFrame
     Method2: Raw Data => RDD => Spark DataFrame
     Method3: Raw Data => Spark DataFrame

Questions:

  1. Which method is more efficient?
  2. Since everything in Spark is implemented at the RDD level, does creating the RDD explicitly in Method2 make it more efficient?
  3. Why are the default partition counts different for the same data?


1 Answer

Waiting for an expert to reply.
