Velvet Star Monitor

Standout celebrity highlights with iconic style.

updates

Collapse rows and sum the values in the column

Writer Andrew Mclaughlin

I have the following dataframe (df1):

ID someText PSM OtherValues
ABC c 2 qwe
CCC v 3 wer
DDD b 56 ert
EEE m 78 yu
FFF sw 1 io
GGG e 90 gv
CCC r 34 scf
CCC t 21 fvb
KOO y 45 hffd
EEE u 2 asd
LLL i 4 dlm
ZZZ i 8 zzas

I would like to collapse the first column and add the corresponding PSM values and I would like to get the following output:

ID Sum PSM
ABC 2
CCC 58
DDD 56
EEE 80
FFF 1
GGG 90
KOO 45
LLL 4
ZZZ 8

It seems doable with aggregate function but don't know the syntax.

5 Answers

In base:

aggregate(PSM ~ ID, data=x, FUN=sum)
## ID PSM
## 1 ABC 2
## 2 CCC 58
## 3 DDD 56
## 4 EEE 80
## 5 FFF 1
## 6 GGG 90
## 7 KOO 45
## 8 LLL 4
## 9 ZZZ 8

Example using dplyr, the next iteration of plyr:

df2 <- df1 %>% group_by(ID) %>% summarize(Sum_PSM = sum(PSM))

When you put the characters %>%, you are "piping." This means you're inputting what is on the left side of that pipe operator and performing the function on the right.

This is super easy using the plyr package:

library(plyr)
ddply(df1, .(ID), summarize, Sum=sum(PSM))

Using data.table

setDT(df1)[, lapply(.SD, sum) , by = ID, .SDcols = "PSM" ]

Using aggregate function seems to be better than dplyr if you want to just keep the original column names and operate inside one column at a time. Avoiding the use of summarize function,

Note from summarize function documentation

Be careful when using existing variable names; the corresponding columns will be immediately updated with the new data and this can affect subsequent operations referring to those variables.

For instance

## modified example from aggregate documentation with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9), v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
aggregate(x = testDF, by = list(by1), FUN = "sum")
Group.1 v1 v2
1 1 15 165
2 12 9 99
3 2 NA NA
4 big 3 33
5 blue 3 33
6 red 5 55

You get what you want, but when you use summarise and ddply you need to specify names. So if you have many columns aggregate seems to be convenient.

testDF$ID=by1
ddply(testDF, .(ID), summarize, v1=sum(v1), v2=sum(v2) )
ID v1 v2
1 1 15 165
2 12 9 99
3 2 NA NA
4 big 3 33
5 blue 3 33
6 red 5 55
7 <NA> 15 165

To see the effect of the immediate update of the columns with summarize you can check the following examples,

ddply(testDF, .(ID), summarize, v1=max(v1,v2), v2=min(v1,v2) )
ID v1 v2
1 1 55 55
2 12 99 99
3 2 NA NA
4 big 33 33
5 blue 33 33
6 red 44 11
7 <NA> 88 77
ddply(testDF, .(ID), summarize, v1=min(v1,v2), v2=min(v1,v2) )
ID v1 v2
1 1 5 5
2 12 9 9
3 2 NA NA
4 big 3 3
5 blue 3 3
6 red 1 1
7 <NA> 7 7

Note that when V1 uses max, the col is already update when calculating v2, so for instance in the case of ID=1 we can't get the number 5 when using min in v2.

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge that you have read and understand our privacy policy and code of conduct.