An Introduction to Spatial Analytics With PostGIS, PL/R, and R
An Introduction to Spatial Analytics With PostGIS, PL/R, and R
If you're a PostgreSQL user who's unfamiliar with the capabilities of R and PL/R for writing user-defined SQL functions in the R language, you've come to the right post!
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
This is the first in a series of posts intended to introduce PostgreSQL users to PL/R, a loadable procedural language that enables a user to write user-defined SQL functions in the R programming language. When further combined with PostGIS, the geospatial extender for PostgreSQL, users can perform powerful spatial analytics within the PostgreSQL database.
This initial post introduces PL/R and R, provides setup instructions for following the spatial analytics example to be used in this series of posts, and provides introductory instruction on Geocoding with PostGIS, R, and PL/R.
While PostgreSQL is known for and widely popular as a transactional database due to its SQL compliance, reliability, data integrity, and ease of use, it is less commonly associated with analytic applications.
The combination of PostgreSQL and R provide users with the ability to leverage the power and efficiency of PostgreSQL and the rich analytic functionality of R. When further combined with PostGIS, the geospatial extender for PostgreSQL, users can perform powerful spatial analytics within the PostgreSQL database.
This series of blog posts will provide users with information about:
- Introduction to R and PL/R
- Geocoding With PostGIS, R and PL/R
- Example of Spatial Analytics: NDVI Processing
- Example of Spatial Analytics: Sample Analytics Functions
It is our hope that these posts will cause those building analytic applications to give PostgreSQL a second look.
Introduction to R and PL/R
The R Project for Statistical Computing provides the R programming language, which is a language and environment for statistical computing and graphics. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. In particular, R includes:
- An effective data handling and storage facility.
- A suite of operators for calculations on vectors, matrices, and arrays.
- A large, coherent, integrated collection of intermediate tools for data analysis.
- Graphical facilities for data analysis and display either on-screen or on hard copy.
- A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form.
PL/R is a loadable procedural language that enables a user to write user-defined SQL functions in the R programming language. PL/R offers most (if not all) of the capabilities a function writer has in the R language. Commands are available to access the database via the PostgreSQL Server Programming Interface (SPI) and to raise messages via
Spatial Analytics With PostgreSQL, PostGIS, R, and PL/R
As a demonstration of the value of using PL/R to conduct spatial analytics, this series of blog posts will use the combination of PostgreSQL, PostGIS, R, and PL/R to conduct analysis on publicly available geospatial data, including:
- Ingest and utilize spatial shape (vector) data.
- Use online services for geocoding.
- ngest, utilize, and analyze spatial “raster” data.
As an example of the types of analysis made possible, this blog post will perform analytic operations against the normalized difference vegetation index (NDVI) data set in order to:
- Determine average NDVI for a particular location over time.
- Visualize average NDVI for a particular location over time with ggplot.
- Visualize average NDVI for a particular location by month.
- Visualize same month average NDVI for a particular location by year.
While this example is admittedly arbitrary, it is not difficult to apply the analytic techniques and methods demonstrated through this example to a variety of Spatial Analytic requirements.
The data used in this series of blog posts is NDVI data from the moderate-resolution imaging spectroradiometer (MODIS) data made available by the United States Geologic Survey (USGS). Specifically, this blog post makes use of MOD13A1 global, spatial raster data provided every few days at 500-meter spatial resolution (250 m and 1 km resolution is also available).
Vegetation indices are derived from typical spectral reflectance signatures of leaves. Reflected energy in the visible spectrum is very low while near-infrared radiation (NIR) is scattered with very little absorption. As a result, the contrast between visible and NIR is a sensitive measure of vegetation density. NDVI calculates the normalized transform of NIR to red reflectance ratio as follows:
NDVI = (rNIR - rRed) / (rNIR + rRed)
Analysis of vegetation conditions has applications that may include:
- Global biogeochemical and hydrologic modeling
- Agricultural monitoring and forecasting
- Land-use planning
- Land cover characterization
- Land cover change detection
This paper uses data for San Diego Country as described in more detail below. For those interested, the SQL functions have the state/county of interest passed in as an argument to make it reasonably easy to repeat for a different county somewhere else.
Let's look at the setup for PostgreSQL, ingest administrative boundaries data, create a USA counties layer table, and extract and plot San Diego County.
As an initial step, it is necessary to deploy PostgreSQL with PostGIS, R, and PL/R. This example uses PostgreSQL 9.6.1, PostGIS 2.3.2, R-3.3.2, and PL/R 188.8.131.52.
First, create the PostgreSQL database and install PostgreSQL extensions, which includes PostGIS:
createdb gis psql gis psql (9.6.1) Type "help" for help. gis=# create extension postgis; gis=# create extension plr;
Next, it is necessary to install a variety of required/interesting R packages as follows:
R CMD javareconf R R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) [...] Type 'q()' to quit R. x <- c("stringi", "raster", "sp", "spatial", "rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap", "ggmap", "OpenStreetMap", "cairoDevice", "RGtk2", "wkb", "rasterVis") install.packages(x)
When installing the R packages, it is necessary to ensure that these packages are being installed as root or postgres.
It is worth noting that while there is no way to access internals of the database backend with PL/R, the user is able to gain operating system-level access under the permissions of the PostgreSQL user ID, as with a C function. Thus, unprivileged database users should not be permitted to use this language to create new functions. It must be installed as an untrusted procedural language so that only database superusers can create functions with it. The writer of a PL/R function must take care that the function cannot be used to do anything unwanted since it will be able to do anything that could be done by a user logged in as the database administrator. The resulting functions may then be safely executed by unprivileged users.
Ingesting Administrative Boundaries Data
Once our PostgreSQL, PostGIS, R, and PL/R environments are deployed, the next step is to ingest the Administrative Boundaries data associated with our example. For this example, we will be using the GADM database of global administrative boundaries found here.
Create USA Counties Layer Table
As a starting point for the example used for this blog series, it is necessary to create the USA Counties Layer Table. This step can be completed by first using the
getData() function from the
rgdal package in order to create the spatial shape layer for administrative areas of the country and level of detail requested. Once the spatial shape layer is created, it is possible to use the
writeOGR() function to create a new PostGIS table and store the gathered shape data in the newly created PostGIS table. Once the data is inserted into the newly created PostGIS table, it is possible to create a secondary index on the columns corresponding to the first and second level shape names.
The following commands implement steps described above through the creation of a PL/R function as follows:
CREATE OR REPLACE FUNCTION create_admin_layer(country text, level int, tablename text) RETURNS text AS $$ library(raster); library(rgdal) # set up database connections dsn="PG: " conn <- dbConnect(dbDriver("PostgreSQL"), user="postgres", dbname="gis", host="localhost") # download the requested administrative shapes and create the Postgres table shapes = getData('GADM', country = country, level = level) writeOGR(shapes, dsn, tablename, "PostgreSQL") sql.str <- sprintf("CREATE INDEX %s_idx1 ON %s(name_1,name_2)", t ablename, tablename) dbGetQuery(conn, sql.str) return("OK") $$ LANGUAGE plr; DROP TABLE IF EXISTS counties; SELECT create_admin_layer(country := 'USA', level := 2, tablename := 'counties');
Note that this function could easily be used to create spatial layer tables for other countries at the same, or different, levels of detail.
Extract and Plot San Diego County
Next, create a PL/R function to extract and plot a specific administrative shape of interest using the following commands:
-- Note: might need to run "Xvfb :1 -screen 0 1024x768x24" CREATE OR REPLACE FUNCTION plot_admin_layer(layername text, name1 text, name2 text) RETURNS bytea AS $$ library(rgeos); library(cairoDevice); library(RGtk2) pixmap <- gdkPixmapNew(, , ); asCairoDevice(pixmap) proj_str <- "+ + +no_defs + +,0,0" conn <- dbConnect(dbDriver("PostgreSQL"), , , ) sql.str <- sprintf("SELECT st_astext(wkb_geometry) AS geom FROM %s WHERE name_1 = '%s' AND name_2 = '%s'", layername, name1, name2) plot(readWKT(dbGetQuery(conn, sql.str), )) plot_pixbuf <- gdkPixbufGetFromDrawable(NULL, pixma p , pixmap$getColormap(), 0, 0, 0, 0, 500, 500) buffer <- gdkPixbufSaveToBufferv(plot_pixbuf, "png", character(0), character(0))$buffer return(buffer) $$ LANGUAGE 'plr' STRICT;
It is then possible to obtain a png image of the San Diego County boundaries using the following command:
SELECT plr_get_raw(plot_admin_layer('counties', 'California', 'San Diego'));
The resulting visualized output of the San Diego County plot is provided in Figure 1 below.
Figure 1: Visualization of San Diego County Administrative Boundary
Geocoding of Example Data
Geocoding is the computational process of transforming a postal address description to a location on the Earth’s surface (spatial representation in numerical coordinates). Geocoding relies on a computer representation of the street network. Geocoding is sometimes used for conversion from ZIP codes or postal codes to coordinates and occasionally for the conversion of parcel identifiers to centroid coordinates.
For purposes of our example, we will use the R package
ggmap to convert addresses to coordinates. The process for this conversion is as follows:
- Create the list of addresses for points of interest (or “PoI”).
- Use the R package
ggmapto convert addresses to coordinates. (Note:
ggmapuses either the Google Maps API or Data Science Tool Kit to perform the conversion.)
- Add PoI names.
- Set the layer Coordinate Reference System (CRS).
- Create the PostGIS layer table using OGR.
The PL/R function that implements the steps described above is as follows:
CREATE OR REPLACE FUNCTION create_poi_layer (addresses text, poinames text, layername text) RETURNS text AS $$ library(ggmap); library(rgdal) # geocode using the Google Maps API poilayer <- geocode(addresses) # add the airport names poilayer$name <- poinames # convert to SpatialDataFrame coordinates(poilayer) <- ~ lon + lat # specify the CRS proj4string(poilayer) <- CRS("+ +") # create the Postgres table for this layer dsn="PG: " writeOGR(poilayer, dsn, layername, "PostgreSQL") # add indexes conn <- dbConnect(dbDriver("PostgreSQL"), user="postgres", dbname="gis", host="localhost") sql.str <- sprintf("CREATE INDEX %s_name ON %s(name)", layername, layername) dbGetQuery(conn, sql.str) return("OK") $$ LANGUAGE plr;
The PL/R function is then used to create a spatial layer containing coordinates of three airports located within San Diego County:
DROP TABLE IF EXISTS airports; SELECT create_poi_layer ( addresses := ARRAY['1960 Joe Crosson Dr, El Cajon, CA 92020', '1424 Continental St, San Diego, CA 92154', '3225 N Harbor Dr, San Diego, CA 92101'], poinames := ARRAY['Gillespie Field', 'Brown Field', 'San Diego Intl'], layername := 'airports' );
This is the first in a series of posts intended to introduce PostgreSQL users to PL/R, a loadable procedural language that enables a user to write user-defined SQL functions in the R programming language. In particular, this post was designed to introduce R and PL/R to PostgreSQL users who may not be familiar with those capabilities.
Coming posts will build on the example introduced in this post in order to demonstrate how the combination of PostgreSQL, PostGIS, R and PL/R can be used to conduct spatial analytics, including NDVI Processing and the use of sample analytics functions.
Published at DZone with permission of Joe Conway . See the original article here.
Opinions expressed by DZone contributors are their own.