I have been attempting to use random forests to classify high resolution aerial imagery. Part one of this post series was my first attempt. The aerial imagery dataset that I am working on is made up of many ortho tiles that I need to classify into vegetation categories. The first attempt was to classify vegetation on one tile. This note documents classifying vegetation across tiles.

See the r script below.

Notes

Set Up

The training polygon shapefile consists of 8 classes (it should be 7 but I accidentally used two names for bare ground: “BG_Soil” and “BG_Rock”).

The Results

The out of bag (OOB) estimate of error rate was 1.85% which is pretty good. However, the training dataset has very imbalanced classes.

Class Percent Cover
BG_Rock 25.6
BG_Soil 24.7
Black_Sage 1.7
Grass 4.1
Other_Shrub 1.5
Other_Veg 15.2
PJ 24.2
Sage 2.9

Classes were very imbalanced in the training dataset. Both “BG_Soil” and “BG_Rock” should be combined making soil account for over 50% of the area.

You can see the class imbalance in the confusion matrix as well.

  BG_Rock BG_Soil Black_Sage Grass Other_Shrub Other_Veg PJ Sage class.error
BG_Rock 148953 145 2 24 12 3 1 7 0.0013007301521318348
BG_Soil 148 141975 206 472 768 40 51 289 0.01371319008815619
Black_Sage 8 244 8283 254 145 18 67 863 0.1618093503339405
Grass 25 174 155 23514 11 14 1 32 0.0172197609295327
Other_Shrub 3 531 61 5 7993 4 7 1 0.07112144102266127
Other_Veg 1 137 34 87 24 86021 2195 9 0.028099154878654997
PJ 0 78 203 8 24 688 139218 708 0.012126845813790088
Sage 0 247 1091 108 9 4 318 15315 0.10396676807863325

The most difficult divisions for the random forests was between “Black_Sage” and “Sage”. Which makes a lot of sense. About 6.3% of the time Sage was classified as Black_Sage, out of a total error of 10%, and about 8.7% of the time “Black_Sage” was classified as Sage, out of a total error of 16%.

Making Improvements (Iterate or Die)

Class Size: Obviously, I need to improve the class balance. I think the best way to do this is to both draw more training polygons in those classes with less (black_sage, sage and other_shrub) coverage and two add a balancing step in the script. Which brings me to sampsize().

Sample Size I also need to take advantage of randomForest’s ability to manage sample size with sampsize(). I have had a hard time automating the sample size to be used in the function so that I don’t have to find the smallest class and then manually put in that number as many times as there are classes. For instance if I had five classes and the minimum class size was 500 pixels I would need to set sampsize to be sampsize(500,500,500,500,500). Or I could sample the combined training set by randomly sampling pixels before I run random forests.

Segmentation - I still think segmentation has a role to play in this classification process, but my first attempt at segmentation was a complete failure. I need to find a tutorial on classification before I can make any improvement.

Adding Data - Using only five variables, red, blue, green, alpha, and ndvi, seems like very little data to classify imagery. There are a few datasets that I am considering adding:

  • Soils data
  • Elevation type data (slope, flow direction, terrain roughness, etc. )
  • Landfire data, which is essentially landsat data that has been classified.

But if this process is going to scale to the rest of the state, the datasources that I use need to be available to everyone.

Scripts


## Load libraries
require(dplyr)
require(stringr)
require(raster)
require(rgdal)
require(rgeos)
require(randomForest)



## Function to work subset training dataset of one tile
# - inputs (all are automatically created by build_model):
# -- tile_name: is the name of the tile from a single iteration in the list
# -- folder_of_tiles: is the directory that all of the tiles are stored.
# -- file_path_to_shape: the path to the shapefile that will classify.
# - steps:
# -- RASTER STEPS
# -- read raster brick
# -- change names to b1 to b4 for each band
# -- create an NDVI layer
# -- combine NDVI to four Bands
# -- renmae each band
# -- SHAPEFILE STEPS
# -- split file path name
# -- read shapefile with split filepath
# -- Don't know what the layer paste is all about, I think I can take it out.
# -- convert Class, the supervised classes for training, to numeric
# -- clip the shapefile to the extent of the loaded tile.
# -- rasterized the clipped shapefile with values being class.
# -- rename heading to class
# -- remove temporary variables
# -- MORE RASTER STEPS
# -- clip the raster to the shapefile extent
# -- add training raster (converted shapefile) to the ortho raster data.
# -- convert raster to values (table)
# -- convert values to data.frame, the above step may be redundant.
# -- filter na's, because when you clip a raster to a geometry, the pixels outside of the geometry are still there but they have values of NA.  This step is the most memory intensive step in the whole process.
# -- save the RDS to be modeled in random forests. There is definitely a better way to do this.....
# -- rm data_to_be_modeled otherwize each iteration will fail on the remove NA step.


subset_training_data_from_tile<- function( tile_name, folder_of_tiles, file_path_to_shape){


  ## read raster make ndvi layer and add it to the brick.
  raster<-brick(paste0(paste(folder_of_tiles, tile_name ,sep="/"), ".tif" ))
  names(raster)<-c("b1", "b2", "b3", "b4") ## Rename Bands for ease of use
  ndvi<-overlay(raster$b4, raster$b3, fun=function(x,y){(x-y)/(x+y)})
  cov<-addLayer(raster,ndvi) ## Merge NDVI and the tile imagery
  names(cov)<-c("b1", "b2", "b3", "b4", "ndvi")

  ## parse and read shapefile
  temp<-stringr::str_split(file_path_to_shape, "/")
  temp2<-readOGR(dsn=paste(head(temp[[1]], -1) , collapse = "/"), layer=paste(tail(temp[[1]], 1))) ## Don't remember what this does?
  temp2@data$code<-as.numeric(temp2@data$Class)
  shape<-crop(temp2, extent(cov))%>%
    rasterize(ndvi, field="code")  
  names(shape)<-"class"

  rm(ndvi, temp, temp2, raster)

  data_to_be_modeled<-crop(cov, extent(shape))%>%
    mask(shape)%>%
    addLayer(shape)%>%
    getValues()%>%
    as.data.frame()%>%
    filter(!is.na(class))

  saveRDS(data_to_be_modeled, paste0("training_datasets/", tile_name, "training_data.rds"))

  rm(data_to_be_modeled)
}


## Run random forest on combined training dataset
# - reads the folder with tiles
# -- takes all file names and stores them in a list to be transfered to the `subset_training_data_from_tile` function which
# - reads the shapefile
# -- adds the shapefile to `subset_training_data_from_tile`
# - tile names are looped over and sent to `subset_training_data_from_tile` individually with the shapefile.
# - all individually returned tile training datasets are combined together with rbind.fill
# - combined dataset, is run through random forests.
build_model<-function(folder_of_tiles, file_path_to_shape){
  t<-base::list.files(folder_of_tiles)%>%
    tools::file_path_sans_ext()%>%
    tools::file_path_sans_ext()%>%
    base::unique() ## May want to change this to just a vector and use this for just apply

  t2<-folder_of_tiles
  t3<-file_path_to_shape

  lapply(t, subset_training_data_from_tile, t2, t3)

  t4<-base::list.files("training_datasets")

  t5<-lapply(t4, function(x)readRDS(paste0("training_datasets/", x)))
  t6 <- do.call(plyr::rbind.fill, t5)

  head(t6)

  rm(t5)

  fit<-randomForest(as.factor(class)~.,
                    data=t6,
                    importance=TRUE,
                    ntree=500, norm.votes = FALSE
  )
  rm(t6)
  return(fit)
}


## Run the function `build_model`
set.seed(420)
start_time <- Sys.time()

test_model<-build_model("test_tiles", "test_shape/training_polygons12N_12102018")

end_time <- Sys.time()

end_time - start_time