Collapse rows and sum the values in the column

I have the following dataframe (df1):

ID someText PSM OtherValues
ABC c 2 qwe
CCC v 3 wer
DDD b 56 ert
EEE m 78 yu
FFF sw 1 io
GGG e 90 gv
CCC r 34 scf
CCC t 21 fvb
KOO y 45 hffd
EEE u 2 asd
LLL i 4 dlm
ZZZ i 8 zzas

I would like to collapse the first column and add the corresponding PSM values and I would like to get the following output:

ID Sum PSM
ABC 2
CCC 58
DDD 56
EEE 80
FFF 1
GGG 90
KOO 45
LLL 4
ZZZ 8

It seems doable with aggregate function but don't know the syntax.

5 Answers

In base:

aggregate(PSM ~ ID, data=x, FUN=sum)
## ID PSM
## 1 ABC 2
## 2 CCC 58
## 3 DDD 56
## 4 EEE 80
## 5 FFF 1
## 6 GGG 90
## 7 KOO 45
## 8 LLL 4
## 9 ZZZ 8

Example using dplyr, the next iteration of plyr:

df2 <- df1 %>% group_by(ID) %>% summarize(Sum_PSM = sum(PSM))

When you put the characters %>%, you are "piping." This means you're inputting what is on the left side of that pipe operator and performing the function on the right.

This is super easy using the plyr package:

library(plyr)
ddply(df1, .(ID), summarize, Sum=sum(PSM))

Using data.table

setDT(df1)[, lapply(.SD, sum) , by = ID, .SDcols = "PSM" ]

Using aggregate function seems to be better than dplyr if you want to just keep the original column names and operate inside one column at a time. Avoiding the use of summarize function,

Note from summarize function documentation

Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.

For instance

## modified example from aggregate documentation with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9), v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
aggregate(x = testDF, by = list(by1), FUN = "sum")
Group.1 v1 v2
1 1 15 165
2 12 9 99
3 2 NA NA
4 big 3 33
5 blue 3 33
6 red 5 55

You get what you want, but when you use summarise and ddply you need to specify names. So if you have many columns aggregate seems to be convenient.

testDF$ID=by1
ddply(testDF, .(ID), summarize, v1=sum(v1), v2=sum(v2) )
ID v1 v2
1 1 15 165
2 12 9 99
3 2 NA NA
4 big 3 33
5 blue 3 33
6 red 5 55
7 <NA> 15 165

To see the effect of the immediate update of the columns with summarize you can check the following examples,

ddply(testDF, .(ID), summarize, v1=max(v1,v2), v2=min(v1,v2) )
ID v1 v2
1 1 55 55
2 12 99 99
3 2 NA NA
4 big 33 33
5 blue 33 33
6 red 44 11
7 <NA> 88 77
ddply(testDF, .(ID), summarize, v1=min(v1,v2), v2=min(v1,v2) )
ID v1 v2
1 1 5 5
2 12 9 9
3 2 NA NA
4 big 3 3
5 blue 3 3
6 red 1 1
7 <NA> 7 7

Note that when V1 uses max, the col is already update when calculating v2, so for instance in the case of ID=1 we can't get the number 5 when using min in v2.

Velvet Star Monitor

Collapse rows and sum the values in the column

5 Answers

Your Answer

Sign up or log in

Post as a guest

Similar Journal

How do subsequent playthroughs work?

When should I use high gear?

How do Steam Trading cards and the trade system work?

Goblin Battle Standard doesn't summon the goblin army