Memory issue with raster processing in R
Matthew Barrera
I've written a script to automate the processing of my raster data, but it fails partway through. The loop looks in the directory unzipped for subdirectories that contain .tif files and stacks those with some basemaps in /base_layers. It's been failing with:
Error in matrix(unlist(ini), ncol = 2, byrow = TRUE) : 'data' must be of a vector type, was 'NULL'
and
In writeBin(v, x@file@con, size = x@file@dsize) : problem writing to connection
and
Error in .local(.Object, ...) : GDAL Error 3: Free disk space available is 875151360 bytes, whereas 2473962400 are at least necessary. You can disable this check by defining the CHECK_DISK_FREE_SPACE configuration option to FALSE.
FIXED (I think), see answer below.
Ran in a Kubernetes pod with Ubuntu, 24 GB of RAM, 50 GB of storage, and 4 vCPUs, and also in a VM instance running Ubuntu 18.04 with 2 vCPUs, 8 GB of RAM, and 50 GB of storage. The pods were failing partway through because of memory issues, and the VM didn't complete either. The code typically fails when it reaches the mask raster stage, but only after running through a few iterations of the loop (so it all works at the start).
If anyone can point out whether memory fragmentation is likely to occur in this script, or ways to clear the OS memory, I would be forever grateful!
Script:
library(raster)
library(rgdal)
input = "/mnt/nfs/data/unzipped"
output = "/mnt/nfs/data/training" #path to where the data should go
#get paths to basemaps
DEM = raster("/mnt/nfs/data/base_layers/Clipped_filled_dem.tif")
flow_accum = raster("/mnt/nfs/data/base_layers/accumulation.tif")
slope = raster("/mnt/nfs/data/base_layers/Clipped_slope.tif")
aspect = raster("/mnt/nfs/data/base_layers/Clipped_filled_aspect.tif")
ruggedness = raster("/mnt/nfs/data/base_layers/Clipped_TRI.tif")
#get directories that have data that needs to be processed
directory = list.dirs(path = input, recursive = FALSE)
for(direct in directory) {
  subdirect = list.dirs(path = direct, recursive = FALSE)
  for(sub in subdirect) {
    files_for_raster <- list.files(path = sub, pattern = "*.tif$", full.names = TRUE)
    rasterstack = stack(files_for_raster)
    # name of datapull
    name = gsub(paste(direct, "/", sep = ""), "", sub)
    print(c("working in", name))
    # crop DEM to the extent of the satellite image
    DEMcrop = crop(DEM, rasterstack) # extent can be a raster
    flow_accumcrop = crop(flow_accum, rasterstack)
    slopecrop = crop(slope, rasterstack)
    aspectcrop = crop(aspect, rasterstack)
    ruggednesscrop = crop(ruggedness, rasterstack)
    print(c("cropped"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))
    # resample rasters, this will take a bit
    DEMcrop = resample(DEMcrop, rasterstack)
    flow_accumcrop = resample(flow_accumcrop, rasterstack)
    slopecrop = resample(slopecrop, rasterstack)
    aspectcrop = resample(aspectcrop, rasterstack)
    ruggednesscrop = resample(ruggednesscrop, rasterstack)
    print(c("resampled"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))
    # mask layers
    DEMcrop = mask(DEMcrop, raster::subset(rasterstack, 1))
    flow_accumcrop = mask(flow_accumcrop, raster::subset(rasterstack, 1))
    slopecrop = mask(slopecrop, raster::subset(rasterstack, 1))
    aspectcrop = mask(aspectcrop, raster::subset(rasterstack, 1))
    ruggednesscrop = mask(ruggednesscrop, raster::subset(rasterstack, 1))
    print(c("masked"))
    print(object.size(DEMcrop))
    print(object.size(rasterstack))
    # add baselayers to the raster stack
    finalstack = addLayer(rasterstack, DEMcrop, flow_accumcrop, slopecrop,
                          aspectcrop, ruggednesscrop)
    print(names(finalstack))
    print(nlayers(finalstack))
    bands <- c("band1","band2","band3","band4","band5","band6","band7")
    type <- c("quality","sr_ndvi","DEM","flow_accum","slope","aspect","TRI")
    band_info <- data.frame(bands, type)
    print("finalstack")
    # create new output directory and save raster there
    output_subdirect = gsub(paste(input, "/", sep = ""), "", sub)
    dir.create(file.path(output, output_subdirect), recursive = TRUE)
    Sys.chmod(file.path(output, output_subdirect), mode = "777", use_umask = FALSE)
    print("created directory")
    write = file.path(output, output_subdirect)
    writeRaster(finalstack, format = "GTiff",
                filename = file.path(write, name, fsep = "/"),
                options = c("INTERLEAVE=BAND", "COMPRESS=NONE"),
                overwrite = TRUE)
    write.csv(band_info, file = paste(file.path(write, name, fsep = "/"), ".csv", sep = ""))
    print("done processing")
    rm(rasterstack, DEMcrop, flow_accumcrop, slopecrop, aspectcrop, ruggednesscrop)
    gc()
    print(gc())
    system("sudo sysctl -w vm.drop_caches=3")
  }
}
# useful functions
#mystack = stack("path/to/multilayer.tif") # multilayer.tif is an existing raster stack
#band1 = subset(mystack,subset=1) # subsets bands from raster stack
#removeTmpFiles(h=0) # removes temp files, can be used after writing raster stacks to delete all temp raster files
#rasterbrick<-brick(rasterstack) # can make a raster brick from a raster stack
I've included the print(object.size()) calls to try to see if my objects are growing in size throughout the code's execution.
1 Answer
The script was writing large temporary rasters to disk in the /tmp folder, over 20 GB each time it executed, which filled up the disk pretty quickly. I'm still not sure why it kept writing new temp rasters rather than clearing the old ones. Clearing the tmp directory during the script avoids the problem.
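A minimal sketch of the cleanup, using the raster package's own temp-file helpers (rasterOptions() and removeTmpFiles() are real raster functions; the tmpdir path here is an example, not from the original post):

```r
library(raster)

# Point raster's temp files at a volume with enough free space
# (example path; substitute a directory on your large disk)
rasterOptions(tmpdir = "/mnt/nfs/data/raster_tmp")

for (direct in directory) {
  # ... crop / resample / mask / writeRaster as in the script above ...

  # After writeRaster() has flushed the final stack to disk, delete all
  # raster temp files older than 0 hours (i.e. everything)
  removeTmpFiles(h = 0)
}
```

removeTmpFiles(h = 0) is already listed in the question's "useful functions" comments; the key is calling it at the end of each loop iteration, so temp rasters from earlier iterations can't accumulate and fill the disk.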
This was a pretty good explanation of memory issues with the raster package.