Wednesday, December 18, 2013

A Simple Function for Parallel Queries in Postgres

A few months back, I developed a prototype function for performing a simple, somewhat kludgey parallel query in Postgres. At the time, the function could only handle very simple queries (no WITH statements, etc.), so it wasn't really worth sharing.

I finally had some time to update it a bit to handle more complex queries -- it's still fairly basic and probably can't handle all use cases, but it's effective for a lot of the slower PostGIS operations I run. You can find the SQL code for the parsel function (think "parallel select" as well as "parcel out") in the gist linked here.

I have two goals in posting this:
1) To help other people perform some quick and dirty parallelization in Postgres
2) To share the code in the hopes that someone else might be able to take it a bit farther than I can right now.

If and when I update and refine it some more, I'll update the gist.

What you need to know:

  • This function is in beta and it may behave unpredictably in some cases. Use it at your own risk; I take no responsibility for unexpected effects it may have. That said, if you do discover bugs, please let me know here or on the gist and I will try to work on them.
  • The function is written in plpgsql and is dependent on the dblink extension. 
  • It can only be used in cases where the operation/query you are trying to run can be broken up into separate pieces based on the primary key of one of the input tables.
  • Steps to Use:
    • Create an empty output table to which results will be written
    • Run the function. Each parallel process will asynchronously insert data into the output table.
  • The usage is as follows:
text parsel(text database_name, text table_to_chunk, text primary_key, text query, text output_table, text table_to_chunk_alias, integer number_of_chunks)

database_name = name of the database
table_to_chunk = the name of the table that can be split up based on its primary key
primary_key = the name of the primary key in table_to_chunk
query = the query to execute in parallel
output_table = the table into which results will be inserted
table_to_chunk_alias = (optional) if you use an alias for table_to_chunk in query, provide it here
number_of_chunks = approximate number of processes to split the operation into
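To give a sense of the mechanics: parsel splits the primary key range of table_to_chunk into pieces, wraps your query in an INSERT into the output table for each piece, and sends each statement asynchronously over its own dblink connection (presumably via dblink_send_query). Here is an illustrative, hand-written sketch of one chunk -- the real statements are generated by the function itself, and my_output_table, my_alias, my_pkey, and the key bounds below are just placeholders:

--illustrative sketch of one chunk's statement (parsel builds these automatically)
INSERT INTO my_output_table
SELECT ... --your query here, restricted to one slice of the key range
WHERE my_alias.my_pkey > 0 AND my_alias.my_pkey <= 25;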

A worked example:
Let's say you have a raster table of population data (my_schema.pop) and a polygon table representing watersheds (my_schema.watersheds). You want to find the total population within each watershed. In traditional (i.e., ArcGIS) speak, this is known as a "Zonal Statistics" operation; in PostGIS, you use ST_SummaryStats() to do it. To aggregate populations across multiple raster tiles intersecting the same polygon, your query is:

WITH tile_stats as (
--run summary stats for each tile
SELECT a.gid, ST_SummaryStats(ST_Clip(b.rast, 1, a.geom_4326, true)) as stats
FROM my_schema.watersheds as a
INNER JOIN my_schema.population b
ON ST_Intersects(a.geom_4326,b.rast)
)
--aggregate the results from each tile for each polygon (identified by gid)
SELECT gid, sum((stats).sum) as pop_sum
FROM tile_stats
GROUP by gid;

For those who've tried this operation, it can be very slow (at least in PostGIS 2.0 -- I haven't updated to 2.1 yet). However, because the population for each polygon can be calculated independently, it's easy to run this in separate chunks based on the primary key of the polygon table (gid). If there were 100 polygons in the polygon table and you wanted to process the first 25 separately, you could simply do:

WITH tile_stats as (
--run summary stats for each tile
SELECT a.gid, ST_SummaryStats(ST_Clip(b.rast, 1, a.geom_4326, true)) as stats
FROM my_schema.watersheds as a
INNER JOIN my_schema.population b
ON ST_Intersects(a.geom_4326,b.rast)
WHERE a.gid <= 25
)
--aggregate the results from each tile for each polygon (identified by gid)
SELECT gid, sum((stats).sum) as pop_sum
FROM tile_stats
GROUP by gid;

This gets a bit tedious if, for example, you want to split the data into 8 chunks, or 16, or more. That's where parsel comes in handy:

First, create an empty output table to store all of the results:

CREATE TABLE my_schema.watershed_pop_sums (
gid integer,
pop_sum numeric);

Then, run your query in parsel:

SELECT parsel('my_db','my_schema.watersheds','gid',
              'WITH tile_stats as (
                --run summary stats for each tile
                SELECT a.gid,ST_SummaryStats(ST_Clip(b.rast, 1, a.geom_4326, true)) as stats
                FROM my_schema.watersheds as a
                INNER JOIN my_schema.population b
                ON ST_Intersects(a.geom_4326,b.rast)
                )
                --aggregate the results from each tile for each polygon (identified by gid)
                SELECT gid, sum((stats).sum) as pop_sum
                FROM tile_stats
                GROUP by gid',
              'my_schema.watershed_pop_sums','a',8);

The function will issue a bunch of notices indicating the subqueries it has created. When it completes, it will return a text value of "Success". Review your output table to make sure the results make sense.
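For example, a quick sanity check on the output above -- the row counts should be close, since only watersheds that intersect the raster get a sum:

SELECT count(*) FROM my_schema.watershed_pop_sums;
SELECT count(*) FROM my_schema.watersheds;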

Friday, December 13, 2013

Dealing with arcpy memory leaks

Memory leaks appear to be fairly common in the ArcGIS arcpy module for Python. Search the user forums and you'll see plenty of references.

Aside from slowing down your script, memory leaks can eventually cause Python to crash, particularly with 32-bit Python. In my case, I was getting a Python.exe APPCRASH referencing the geocoding.dll, which made very little sense since I wasn't using anything related to geocoding. Fairly cryptic and strange behavior, but ultimately I tracked it to the arcpy.conversion.ExportMetadata() function, which seemed to be driving a slow but steady rise in memory.

In situations where arcpy is causing memory leaks, I've found that the Python garbage collection module (gc) isn't very helpful. Instead, try using the multiprocessing module to spawn off a process specifically for the offending function:

from multiprocessing import Process
import arcpy

# set various filepaths
filepath = 'C:\\my_shapefile.shp'
METADATA_TRANSLATOR = 'C:\\Program Files (x86)\\ArcGIS\\Desktop10.0\\Metadata\\Translator\\ARCGIS2FGDC.xml'
out_xml_name = 'C:\\my_metadata.xml'

# ***************************

if __name__ == '__main__':
    # the __main__ guard is required on Windows, where multiprocessing
    # spawns child processes by re-importing this module
    # run the ExportMetadata function in a separate process
    proc = Process(target=arcpy.conversion.ExportMetadata,
                   args=(filepath, METADATA_TRANSLATOR, out_xml_name))
    proc.start()  # start the process
    proc.join()   # wait for the process to finish; exiting frees its memory
# ***************************

The function will run in a separate instance of Python.exe and, once it completes, all of the memory it used should be released.

Thursday, October 31, 2013

PostGIS ST_Intersection() Returns Multiple Geometry Types

An annoying little aspect of running ST_Intersection() in PostGIS is that it can return multiple geometry types, regardless of the input geometry types (see this thread on the postgis-users list: https://groups.google.com/forum/#!msg/postgis-users/iCkVnTxzBBU/L62pz-PiN44J). For example, if you try to find the intersection of two Polygon geometries, the returned geometry could be of type (multi)point, (multi)line, (multi)polygon, or even collection, depending on the "dimension" of intersection between the two input geometries. In most applications, you would only want (and expect) the output geometry to be of type polygon.
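A toy example makes this concrete: two unit squares that share only an edge intersect in a line, so even with two polygon inputs you get a LINESTRING back:

SELECT GeometryType(ST_Intersection(
    ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'),
    ST_GeomFromText('POLYGON((1 0, 2 0, 2 1, 1 1, 1 0))')
));
--returns LINESTRING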

The equivalent tool in good old reliable ArcGIS has an "Output Type" parameter that lets you control what type of geometries you want returned. I figured we should have that option in PostGIS too. Below is the PL/pgSQL function I put together.

Please test it out and let me know if you find bugs or unexpected behavior.

Usage of the new function is:
ST_Intersection(geom1, geom2, geom_type)
where geom_type must be one of 'POLYGON', 'LINESTRING', or 'POINT'.

Here's an example of how to use it. Note that you need to use a WITH statement because the function will return NULL geometries where the intersection is NOT of the specified type.

WITH intersected as (
      SELECT a.gid,
            ST_Intersection(a.the_geom_4326, b.the_geom_4326,'POLYGON') AS the_geom_4326
      FROM table1 a,
            table2 b
      WHERE ST_Intersects(a.the_geom_4326, b.the_geom_4326)
)
SELECT *
FROM intersected
WHERE the_geom_4326 is not null;

CREATE OR REPLACE FUNCTION public.ST_Intersection(geom1 geometry, geom2 geometry, geom_type text)
  RETURNS geometry AS
$BODY$
DECLARE
  intersected geometry;
  intersected_type text;
  geom_code integer;
  alternate_type text;
BEGIN
  if geom_type = 'POINT' THEN
    geom_code := 1;
    alternate_type := 'MULTIPOINT';
  elsif geom_type = 'LINESTRING' THEN
    geom_code := 2;
    alternate_type := 'MULTILINESTRING';
  elsif geom_type = 'POLYGON' THEN
    geom_code := 3;
    alternate_type := 'MULTIPOLYGON';
  else
    raise exception 'geom_type must be one of POLYGON, LINESTRING, or POINT';
  end if;

  SELECT ST_Intersection(geom1,geom2) INTO intersected;

  SELECT geometrytype(intersected) INTO intersected_type;

  --if the intersection is a collection, extract only the requested geometry type
  if intersected_type = 'GEOMETRYCOLLECTION' then
    select ST_CollectionExtract(intersected,geom_code) INTO intersected;
    SELECT geometrytype(intersected) INTO intersected_type;
  end if;

  --return the result only if it matches the requested type (or its multi- variant)
  IF intersected_type in (geom_type,alternate_type) then
    return intersected;
  else
    return null;
  end if;
END
$BODY$
  LANGUAGE plpgsql
  COST 100;

Friday, September 13, 2013

Postgres Function to Search for Substring in String

I haven't yet taken the time to learn regex, so sometimes I find the string functions in Postgres to be a little bit challenging or unintuitive. I find it particularly frustrating when I want to simply test whether one string is (any) part of another. Sure, I could use the position() function and check for any result not equal to zero, but that seems overly complicated. I'm sure I could also use the regexp_matches() function, if I had a better grasp on regex.

If you have these same frustrations, and want another option, here is a plpython function that will return a simple true or false. This is another case where the Python solution couldn't be simpler:

CREATE OR REPLACE FUNCTION public.substring_in_string(sub_string text, main_string text)
  RETURNS boolean AS
$BODY$
    if sub_string is not None and main_string is not None:
        return sub_string in main_string
    else:
        return False
$BODY$
  LANGUAGE plpythonu STABLE
  COST 100;
SELECT substring_in_string(<substring to search for>,<string to search in>);
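For example:

SELECT substring_in_string('cat', 'concatenate');  --returns true
SELECT substring_in_string('dog', 'concatenate');  --returns false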

Thursday, September 12, 2013

Sum Array in Postgres

Have you ever needed to sum an array of numbers in Postgres? The built-in sum() function is an aggregate function that doesn't work on arrays. If you try it on an array, you'll get an error message like:

ERROR:  function sum(integer[]) does not exist
ERROR:  function sum(numeric[]) does not exist

Instead, try this plpython function:

CREATE OR REPLACE FUNCTION public.array_sum(num_array numeric[])
  RETURNS numeric AS
$BODY$
    return sum(num_array)
$BODY$
  LANGUAGE plpythonu STABLE
  COST 100;
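A quick test:

SELECT array_sum(ARRAY[1.5, 2.5, 3.0]);  --returns 7.0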

Some things are just MUCH simpler when you have access to Python data models and methods. I REALLY wish I'd discovered plpython sooner...

Wednesday, August 28, 2013

Split Shapefile or Feature Class into Quads

A pervasive theme on this blog is my travails with big geographic data, in ArcGIS, in PostGIS, and in other software/programming settings as well. This post continues that theme, presenting a simple Python-based tool I put together to work around some nondescript (i.e., 99999) errors I'd been receiving in ArcGIS 10.0 and 10.1 while attempting to use simple tools (e.g., Update, Erase, etc.) on some monster polygon data.

The tool splits any given shapefile (points, lines, or polys) into four separate shapefiles by dividing the original spatial extent of the features into four equal-sized quads. It uses the Clip tool, which seems to be able to handle bigger data than some of the more sophisticated spatial overlay tools in Arc. Alternatively, you could separate big files based on attribute information, but the advantage here is keeping the data in spatially contiguous chunks to enable subsequent processing.

The tool can accept either shapefiles or feature classes as input, but outputs will be in shapefile format (this could easily be changed, but for my purposes, shapefiles were sufficient). 

I'll try to get the script and the tbx up on ArcScripts at some point, but for now, here is the code:

import arcpy
import os
import traceback

class bbox():
    '''Object representing a geographic bounding box.'''
    def __init__(self,xmin,ymin,xmax,ymax):        
        self.xMin = xmin
        self.yMin = ymin
        self.xMax = xmax
        self.yMax = ymax   
        self.Extent = arcpy.Extent(xmin,ymin,xmax,ymax)

    def quadSplit(self):
        '''Split the bbox into 4 quadrants, returned as a list of bbox objects'''
        yBounds = [self.yMin,self.yMax]
        xBounds = [self.xMin,self.xMax]
        xMid = round(sum(xBounds)/2,4)
        yMid = round(sum(yBounds)/2,4)
        xSubs = [list(sorted([x,xMid])) for x in xBounds]
        ySubs = [list(sorted([y,yMid])) for y in yBounds]
        quads = []
        for xSub in xSubs:
            for ySub in ySubs:
                # each (xSub, ySub) pair defines one quadrant's bounds
                quads.append(bbox(xSub[0],ySub[0],xSub[1],ySub[1]))
        return quads
def main(inFC,outFolder,keepQuadBoundaries):
    '''inFC = input feature class or shapefile (full path)
       outFolder = output folder in which to save the quad-split shapefiles
       keepQuadBoundaries = if True, shapefiles storing the quad boundaries will be saved to outFolder as well
       Returns list of names of the output shapefiles'''
    try:
        basename = os.path.basename(inFC).split('.')[0]
        arcpy.env.workspace = outFolder
        arcpy.env.scratchWorkspace = outFolder
        d = arcpy.Describe(inFC)
        ext = d.extent
        b = bbox(ext.xmin,ext.ymin,ext.xmax,ext.ymax)
        outFCs = []
        outquads = []
        for i,q in enumerate(b.quadSplit()):
            try:
                arcpy.AddMessage('Clipping Quad %s' % i)
                outname = '%s_%s.shp' % (basename,i)
                # polygon tracing the quad boundary, used as the clip feature
                p = arcpy.Polygon(arcpy.Array([q.Extent.lowerLeft, q.Extent.upperLeft, q.Extent.upperRight, q.Extent.lowerRight, q.Extent.lowerLeft]))
                arcpy.Clip_analysis(inFC, p, outname)
                outFCs.append(outname)
                if keepQuadBoundaries == True:
                    quad_outname = '%s_quad_bndry_%s.shp' % (basename,i)
                    arcpy.CopyFeatures_management(p, quad_outname)
                    outquads.append(quad_outname)
            except arcpy.ExecuteError, err:
                arcpy.AddWarning('Error occurred: %s' % err)
        return outFCs
    except Exception, err:
        py_err = traceback.format_exc()  # print_exc() returns None; format_exc() returns the text
        if py_err:
            arcpy.AddError(py_err)

if __name__ == "__main__":
    IN_FC = arcpy.GetParameterAsText(0)
    OUT_FOLDER= arcpy.GetParameterAsText(1)
    KEEP_QUAD_BOUNDARIES = arcpy.GetParameterAsText(2)
    if KEEP_QUAD_BOUNDARIES == 'true':
        KQB = True
    else:
        KQB = False
    main(IN_FC, OUT_FOLDER, KQB)

Median in Postgres

I was surprised to learn today that Postgres doesn't have a built-in aggregate function for calculating the median of a set of numbers. Apparently it's not trivial to calculate using SQL, and there are lots of creative solutions out there (using sorting, row number, etc.).

On the other hand, calculating a median IS trivial in both R and Python. There are some good solutions using PL/R, but as a Python guy, I wanted a solution in Python. Here it is, with thanks to a couple of posts that got me there.

CREATE OR REPLACE FUNCTION public._py_median(numarr numeric[])
  RETURNS numeric AS
$BODY$
    import numpy
    return numpy.median(numarr)
$BODY$
  LANGUAGE plpythonu STABLE
  COST 100;

CREATE AGGREGATE median (
  sfunc = array_append,
  basetype = numeric,
  stype = numeric[],
  finalfunc = _py_median
);

SELECT median(val) FROM schema.table;

In order to use these functions, you will need to enable the plpythonu extension in Postgres:
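CREATE EXTENSION plpythonu;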