
arrays - python spark alternative to explode for very large data

I have a dataframe like this:

df = spark.createDataFrame([(0, ["B","C","D","E"]),(1,["E","A","C"]),(2, ["F","A","E","B"]),(3,["E","G","A"]),(4,["A","C","E","B","D"])], ["id","items"])

which creates a data frame df like this:

+---+---------------+
| id|          items|
+---+---------------+
|  0|   [B, C, D, E]|
|  1|      [E, A, C]|
|  2|   [F, A, E, B]|
|  3|      [E, G, A]|
|  4|[A, C, E, B, D]|
+---+---------------+

I would like to get a result like this:

+---+-----+
|all|count|
+---+-----+
|  F|    1|
|  E|    5|
|  B|    3|
|  D|    2|
|  C|    3|
|  A|    4|
|  G|    1|
+---+-----+

This essentially just finds all distinct elements in df["items"] and counts their frequency. If my data were of a more manageable size, I would just do this:

from pyspark.sql.functions import explode

all_items = df.select(explode("items").alias("all"))
result = all_items.groupBy("all").count()
result.show()

But because my data has millions of rows and thousands of elements in each list, this is not an option. I was thinking of doing this row by row, so that I only work with two lists at a time. Because most elements are repeated across many rows (though the list in each row is a set), this approach should solve my problem. The problem is that I don't really know how to do this in Spark, as I've only just started learning it. Could anyone help, please?

1 Answer

What you need to do is reduce the size of your partitions going into the explode. There are two options for doing this. First, if your input data is splittable, you can decrease spark.sql.files.maxPartitionBytes so Spark reads smaller splits. The other option is to repartition before the explode.

The default value of spark.sql.files.maxPartitionBytes is 128MB, so Spark will attempt to read your data in 128MB chunks. If the data is not splittable, it will read the full file into a single partition, in which case you'll need to do a repartition instead.
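For example, a minimal sketch of the first option, assuming the data is read from a splittable source; the 16MB value and the input path are placeholders, not recommendations:

# Lower the split size so each partition is small *before* the explode.
# Only effective if the source format is splittable (e.g. Parquet, uncompressed CSV).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))  # 16MB, illustrative

df = spark.read.parquet("/path/to/items")  # hypothetical input path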

In your case, since you're doing an explode, say the explode is a 100x increase in size: with 128MB per partition going in, you end up with 12GB+ per partition coming out!
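If the input isn't splittable, or you'd rather size the partitions explicitly, here is a rough sketch of the repartition route; the input size, blow-up factor, and target size are made-up numbers you'd estimate for your own data:

from pyspark.sql.functions import explode

# Back-of-envelope: input size x explode blow-up / desired post-explode partition size.
input_mb = 50 * 1024      # ~50GB of input, hypothetical
blowup = 100              # estimated size increase from the explode
target_mb = 128           # desired post-explode partition size
num_partitions = (input_mb * blowup) // target_mb  # -> 40000

all_items = (
    df.repartition(num_partitions)            # shrink partitions before exploding
      .select(explode("items").alias("all"))
)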

The other thing to consider is your shuffle partitions, since you're doing an aggregation. You may need to increase the partitioning for the aggregation after the explode by setting spark.sql.shuffle.partitions to a higher value than the default of 200. Use the Spark UI to look at your shuffle stage, see how much data each task is reading, and adjust accordingly.
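A sketch of bumping the shuffle partitions before the aggregation; 2000 is only a placeholder value to be tuned against the shuffle read size per task shown in the Spark UI:

# Raise the number of partitions used for the groupBy/count shuffle (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "2000")  # illustrative value

result = all_items.groupBy("all").count()
result.show()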

I discuss this and other tuning suggestions in the talk I just gave at Spark Summit Europe.

