PySpark - Sum a column in dataframe and return results as int

I have a PySpark DataFrame with a column of numbers. I need to sum that column and get the result back as an int in a Python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])

I do the following to sum the column.

df.groupBy().sum()

But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int and stored in a variable, to be used elsewhere in the program.

result = 130

9 Answers

I think the simplest way:

df.groupBy().sum().collect()

This returns a list containing a single Row, so index into it to get the value. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130

The simplest way really is:

df.groupBy().sum().collect()

But this can be a very slow operation. Avoid groupByKey; use the RDD API and reduceByKey instead:

df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]

I tried this on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

groupByKey: 30.5 s
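Timings like these depend a lot on data size and cluster setup, so it may be worth measuring on your own data. Below is a minimal sketch of how that could be done. The `timed` helper is a made-up name for illustration, and the Spark calls are shown only as commented usage, assuming the `df` from the question.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Hypothetical usage with the DataFrame `df` from the question:
#   total_df, t_df = timed(lambda: df.groupBy().sum().collect()[0][0])
#   total_rdd, t_rdd = timed(
#       lambda: df.rdd.map(lambda x: (1, x[1]))
#                     .reduceByKey(lambda a, b: a + b)
#                     .collect()[0][1]
#   )
```

A single run includes warm-up effects, so repeating each measurement a few times gives a fairer comparison.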

If you want to sum a specific column:

import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
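If you need this in more than one place, the one-liner above can be wrapped in a small helper. This is only a sketch: `to_int` and `sum_column` are names invented here, and mapping `None` to 0 is an assumption about what you want when the DataFrame is empty or the column is all null (Spark's sum returns null in that case).

```python
def to_int(value):
    """Convert an aggregated value to a plain Python int (None becomes 0)."""
    return 0 if value is None else int(value)


def sum_column(df, column):
    """Sum one column of a PySpark DataFrame and return a Python int.

    `df` is assumed to be a pyspark.sql.DataFrame. The import is done
    lazily so that to_int() stays usable without Spark installed.
    """
    from pyspark.sql import functions as F

    row = df.agg(F.sum(column)).first()  # a global aggregate yields one Row
    return to_int(row[0])
```

With the DataFrame from the question, `result = sum_column(df, "Number")` should give 130.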

Here is another way to do this, using agg and collect:

sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]

Similar to the other answers, but without using groupBy or agg. I just select the column in question, sum it, collect it, and then index twice ([0][0]) to get an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the one below would be easier for me to follow.

import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]

You can also try the first() function. It returns the first row of the DataFrame, and you can access the values of its columns by index.

df.groupBy().sum().first()[0]

In your case, the result is a DataFrame with a single row and a single column, so the above snippet works.

Select column as RDD, abuse keys() to get value in Row (or use .map(lambda x: x[0])), then use RDD sum:

df.select("Number").rdd.keys().sum()

SQL sum using selectExpr:

df.selectExpr("sum(Number)").first()[0]

The following should work:

df.groupBy().sum().rdd.map(lambda x: x[0]).collect()[0]

Sometimes, when you read a CSV file into a PySpark DataFrame, a numeric column ends up as a string type (for example '23'). In that case, use pyspark.sql.functions.sum rather than the DataFrame's sum() to get a numeric result:

import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
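Rather than relying on how Spark coerces a string column inside the aggregate, you could cast it explicitly and then convert the result. The sketch below is one way to do that under some assumptions: every value parses as a number, the column is cast to double (so both '23' and '23.0' are accepted), and the helper names `as_exact_int` and `sum_string_column` are invented for illustration.

```python
def as_exact_int(total):
    """Turn a numeric aggregate (e.g. 130.0) into an int, refusing to truncate."""
    if total is None:
        raise ValueError("no rows to sum")
    if float(total) != int(total):
        raise ValueError(f"sum {total!r} is not a whole number")
    return int(total)


def sum_string_column(df, column):
    """Cast a string column to double, sum it, and return a Python int."""
    from pyspark.sql import functions as F  # lazy import; needs PySpark

    total = df.agg(F.sum(F.col(column).cast("double"))).first()[0]
    return as_exact_int(total)
```

If the values are always whole numbers, casting to "long" instead of "double" makes the sum come back as an integer directly.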
