Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


apache spark - PySpark: count null values when a column is entirely null for a client

My input Spark dataframe is:

    Date        Client  Current
    2020-10-26  1       NULL   
    2020-10-27  1       NULL   
    2020-10-28  1       NULL   
    2020-10-29  1       NULL   
    2020-10-30  1       NULL   
    2020-10-31  1       NULL   
    2020-11-01  1       NULL   
    2020-11-02  1       NULL    
    2020-11-03  1       NULL    
    2020-11-04  1       NULL    
    2020-11-05  1       NULL    
    2020-11-06  1       NULL    
    2020-11-07  1       NULL    
    2020-11-08  1       NULL    
    2020-11-09  1       NULL    
    2020-10-26  2       NULL    
    2020-10-27  2       NULL    
    2020-10-28  2       NULL    
    2020-10-29  2       10      
    2020-10-30  2       23      
    2020-10-31  2       NULL    
    2020-11-01  2       NULL    
    2020-11-02  2       1       
    2020-11-03  2       NULL    
    2020-11-04  2       NULL    
    2020-11-05  2       3       
    2020-10-27  3       NULL    
    2020-10-28  3       NULL    
    2020-10-29  3       10      
    2020-10-30  3       NULL    
    2020-10-31  3       NULL    
    2020-11-01  3       NULL    
    2020-11-02  3       NULL    
    2020-11-03  3       32      
    2020-11-04  3       NULL    
    2020-11-05  3       3       
    2020-11-03  4       NULL    
    2020-11-04  4       NULL    
    2020-11-05  4       NULL  

The dataframe is ordered by Client and Date. If the Current column is entirely null for a client, the Full_NULL_Count column should hold that client's total null count on the client's first row (and stay null everywhere else). The desired output for the data above is:

    Date        Client  Current Full_NULL_Count
    2020-10-26  1       NULL    15   -> All "Current" values are null for client 1,
                                        so the first row holds the total null count.
    2020-10-27  1       NULL    NULL
    2020-10-28  1       NULL    NULL
    2020-10-29  1       NULL    NULL
    2020-10-30  1       NULL    NULL
    2020-10-31  1       NULL    NULL
    2020-11-01  1       NULL    NULL
    2020-11-02  1       NULL    NULL
    2020-11-03  1       NULL    NULL
    2020-11-04  1       NULL    NULL
    2020-11-05  1       NULL    NULL
    2020-11-06  1       NULL    NULL
    2020-11-07  1       NULL    NULL
    2020-11-08  1       NULL    NULL
    2020-11-09  1       NULL    NULL
    2020-10-26  2       NULL    NULL -> Client 2 has non-null Current values, so it's null.
    2020-10-27  2       NULL    NULL
    2020-10-28  2       NULL    NULL
    2020-10-29  2       10      NULL
    2020-10-30  2       23      NULL
    2020-10-31  2       NULL    NULL
    2020-11-01  2       NULL    NULL
    2020-11-02  2       1       NULL
    2020-11-03  2       NULL    NULL
    2020-11-04  2       NULL    NULL
    2020-11-05  2       3       NULL
    2020-10-27  3       NULL    NULL -> Client 3 has non-null Current values, so it's null.
    2020-10-28  3       NULL    NULL
    2020-10-29  3       10      NULL
    2020-10-30  3       NULL    NULL
    2020-10-31  3       NULL    NULL
    2020-11-01  3       NULL    NULL
    2020-11-02  3       NULL    NULL
    2020-11-03  3       32      NULL
    2020-11-04  3       NULL    NULL
    2020-11-05  3       3       NULL
    2020-11-03  4       NULL    3    -> All "Current" values are null for client 4,
                                        so the first row holds the total null count.
    2020-11-04  4       NULL    NULL
    2020-11-05  4       NULL    NULL

Could you please help me with this?



1 Answer


You can check the number of nulls for a client and compare it with the number of rows for that client.

from pyspark.sql import functions as F, Window

# Window over all rows of a client (no ordering needed for the counts)
w = Window.partitionBy('Client')

result = df.withColumn(
    'Full_NULL_count',
    # Fill the count only when every Current value in the partition is null,
    # i.e. the number of nulls equals the number of rows
    F.when(
        F.sum(F.col('Current').isNull().cast('int')).over(w) == F.count('*').over(w),
        F.count('*').over(w)
    )
).withColumn(
    'rn',
    # Row number within each client, ordered by Date, to identify the first row
    F.row_number().over(w.orderBy('Date'))
).withColumn(
    'Full_NULL_count',
    # Keep the count only on the client's first row; it stays null elsewhere
    F.when(
        F.col('rn') == 1,
        F.col('Full_NULL_count')
    )
).drop('rn').orderBy('Client', 'Date')

result.show(99)
+----------+------+-------+---------------+
|      Date|Client|Current|Full_NULL_count|
+----------+------+-------+---------------+
|2020-10-26|     1|   null|             15|
|2020-10-27|     1|   null|           null|
|2020-10-28|     1|   null|           null|
|2020-10-29|     1|   null|           null|
|2020-10-30|     1|   null|           null|
|2020-10-31|     1|   null|           null|
|2020-11-01|     1|   null|           null|
|2020-11-02|     1|   null|           null|
|2020-11-03|     1|   null|           null|
|2020-11-04|     1|   null|           null|
|2020-11-05|     1|   null|           null|
|2020-11-06|     1|   null|           null|
|2020-11-07|     1|   null|           null|
|2020-11-08|     1|   null|           null|
|2020-11-09|     1|   null|           null|
|2020-10-26|     2|   null|           null|
|2020-10-27|     2|   null|           null|
|2020-10-28|     2|   null|           null|
|2020-10-29|     2|     10|           null|
|2020-10-30|     2|     23|           null|
|2020-10-31|     2|   null|           null|
|2020-11-01|     2|   null|           null|
|2020-11-02|     2|      1|           null|
|2020-11-03|     2|   null|           null|
|2020-11-04|     2|   null|           null|
|2020-11-05|     2|      3|           null|
|2020-10-27|     3|   null|           null|
|2020-10-28|     3|   null|           null|
|2020-10-29|     3|     10|           null|
|2020-10-30|     3|   null|           null|
|2020-10-31|     3|   null|           null|
|2020-11-01|     3|   null|           null|
|2020-11-02|     3|   null|           null|
|2020-11-03|     3|     32|           null|
|2020-11-04|     3|   null|           null|
|2020-11-05|     3|      3|           null|
|2020-11-03|     4|   null|              3|
|2020-11-04|     4|   null|           null|
|2020-11-05|     4|   null|           null|
+----------+------+-------+---------------+
