Geographically weighted regression using Stata

Geographically weighted regression: A method for exploring spatial nonstationarity

Mark S. Pearce, Department of Child Health, University of Newcastle upon Tyne, UK.
Fax (00)-44-191-2023060
EMAIL m.s.pearce@ncl.ac.uk

The syntax for the gwr commands is

gwr [varlist] [if exp] [in range], east(varname) north(varname) options

gwrgrid [varlist] [if exp] [in range], east(varname) north(varname) options

where the allowed options are

saving(filename) dots reps(#) double eform family(familyname) link(linkname) [ln]offset(varname) test bandwidth(#) replace noconstant nolog scale(x2|dev|#) disp(#) iterate(#) init(varname) outfile(filename) comma wide mcsave(filename) sample(#)

with an additional option for gwrgrid

square(#)

Description

The gwr command applies geographically weighted regression to a dataset containing geographical reference points. These points must be identified in the gwr command using the east() and north() options, which are therefore required.
The gwrgrid command is a slight variation that fits a grid across the area defined by east() and north(). This method is especially useful for large datasets, where the gwr command can become quite time-consuming.

The essence of geographically weighted regression is that it allows different relationships between the dependent and independent variables to exist at different points, (x,y), in space. For a full discussion of this method see Brunsdon et al. (1996).

The gwr command produces a set of parameter estimates for a regression at each point, i, at which there is an observation. The gwrgrid command produces a set of parameter estimates at each grid square centroid. Any grid square in which there are no observations is ignored. Using the outfile() and saving() options, these estimates can be explored in more detail, for example by mapping risk estimates across space in a geographical information system. In each regression, the observations are weighted by a function of their distance from the point at which the regression is being carried out. The weighting function depends on a bandwidth, which is estimated using a cross-validation approach; see, for example, Cleveland (1979) and Bowman (1984).

The command is designed to test two hypotheses, both using a Monte Carlo simulation in which the spatial points are randomly distributed amongst the data.

1. Does the geographically weighted regression model describe the data significantly better than a global regression model, the results of which are also shown in the output?

2. Does the set of parameter estimates exhibit significant spatial variation? This test compares the standard deviations of the observed parameter estimates (Si) with those from the Monte Carlo simulation.

Options

test requests that the first hypothesis be tested, i.e. that the model produced by gwr describes the data significantly better than a global model.
Not testing this hypothesis reduces the need to calibrate the bandwidth for each run of the Monte Carlo simulation and so reduces the time the command will take to run. The second hypothesis, concerning spatial variation of the parameter estimates, is always carried out, using either the user-defined bandwidth or the bandwidth estimated from the observed data.
If test is specified and convergence of the bandwidth is not achieved during the Monte Carlo simulation, a note is made in the results of how many times convergence was not achieved. The significance level is adjusted to ignore those runs where convergence was not achieved.

bandwidth(#) allows the bandwidth to be declared, eliminating the need to calibrate the bandwidth and so saving time.

sample(#) specifies the percentage of observations to be used in the bandwidth calibration process, the default being 100%. If this option is specified, #% of the observations will be randomly sampled and used in the calibration process. The same number of observations will be used when recalibrating the bandwidth for the simulation test of the geographically weighted regression model.

square(#) is only used with gwrgrid and allows the size of the grid squares to be defined. The default is to set the width of the grid squares to be half the bandwidth.
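
As a hedged illustration of how bandwidth() and square() might be combined, the lines below reuse a bandwidth from an earlier calibration so that no recalibration is needed. They assume that the gwr and gwrgrid ado-files are installed and that a dataset with the variables cars, sclass, unemp, east and north is in memory. The grid width of 2000 is an arbitrary value chosen for illustration, and it is assumed to be expressed in the same units as east and north.

```stata
* Assumption: gwr and gwrgrid are installed and the example data are in memory.

* Supply a previously calibrated bandwidth, skipping the calibration step
gwr cars sclass unemp, east(east) north(north) bandwidth(4829.2874) reps(100) dots

* Grid version with the same bandwidth; square(2000) is an arbitrary
* illustrative grid width, not a recommended value
gwrgrid cars sclass unemp, east(east) north(north) bandwidth(4829.2874) square(2000) reps(100) dots
```

If square() were omitted in the second command, the grid width would default to half the bandwidth.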

saving(filename) specifies the name of the file to contain the parameter estimates and grid reference for each point at which the regression is carried out.

outfile(filename) creates a text file filename.raw containing the parameter estimates from each point at which gwr is calculated. The comma and wide options for outfile() are also allowed. Each line of the file is set out as east, north, the estimates for the independent variables, and then the constant.

outfile() and saving() can both be specified simultaneously.
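
A short sketch of saving the estimates in both formats and then inspecting the Stata file is given below. It assumes that gwr is installed and the example data are in memory. The file name gwrout is only an example. Because the variable names inside the saved file are not listed here, the sketch uses describe and summarize without a varlist rather than guessing them.

```stata
* Assumption: gwr is installed and the example data are in memory.
* Write the estimates to gwrout.dta and gwrout.raw in one run
gwr cars sclass unemp, east(east) north(north) bandwidth(4829.2874) reps(100) saving(gwrout) outfile(gwrout) comma replace

* Load the saved estimates (this replaces the data in memory)
use gwrout, clear
describe
summarize
```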

mcsave(filename) requests that the results of the Monte Carlo simulation be saved as filename.dta rather than using a temporary file. This file will contain the standard errors of the parameter estimates for each run, as well as the simulated bandwidths if the test option is specified. A simulated bandwidth of -99.99 indicates that the bandwidth calibration failed to converge.

replace indicates that the filenames specified by saving(), outfile() or mcsave() may be overwritten.

double specifies that the results stored in the file specified by saving(), outfile() or mcsave() are stored as doubles (8-byte reals). By default they are stored as floats (4-byte reals).

nolog suppresses the display of the bandwidth calibration process.

iterate(#) specifies the maximum number of iterations allowed in estimating the bandwidth. The default is 50.

reps(#) specifies the number of Monte Carlo simulations to be performed. The default is 1000.

dots requests a dot be placed on the screen at the beginning of each run of the Monte Carlo simulation, showing how far the simulation has gone.

gwr uses iweights. For this reason, while the default model (i.e. a linear regression model) actually uses regress (for speed), all other models use the glm command.
Where gwr uses the glm command, many of the options used with glm are also available for gwr, enabling the user to define the form of model:

eform family(familyname) link(linkname) [ln]offset(varname) noconstant scale(x2|dev|#) disp(#) init(varname)

If none of family(), link() or [ln]offset() is specified, the default linear regression model is used.
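
For instance, a geographically weighted Poisson model might be requested along the following lines. This is only a sketch: the variables deaths (a count), deprived (a covariate) and pop (the population at risk) are hypothetical names that do not come from the example data, and gwr is assumed to be installed.

```stata
* Hypothetical variables: deaths, deprived, pop, east, north
* family(), link() and lnoffset() are passed through to glm;
* eform reports exponentiated coefficients
gwr deaths deprived, east(east) north(north) family(poisson) link(log) lnoffset(pop) eform reps(100) dots
```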

Example (a replication of Brunsdon et al., 1996)

This example uses the ward-level 1991 census data for the county of Tyne and Wear in the United Kingdom. It explores the relationship between the number of cars per household (cars) and two independent variables: male unemployment (unemp), the proportion of the economically active male population that is unemployed, and social class (sclass), the proportion of households whose head is in social class I.
The spatial points are given by the variables east and north, which in this case are the eastings and northings of the ward centroids in kilometres.

Using this data, both hypotheses can be tested, in this example using a Monte Carlo simulation of 1000 repetitions.

. gwr cars sclass unemp, east(east) north(north) reps(1000) test saving(gwrout) replace dots nolog

Global Model

  Source |       SS       df       MS              Number of obs =     120
---------+------------------------------           F(  2,   117) =  287.17
   Model |  45196.5848     2  22598.2924           Prob > F      =  0.0000
Residual |  9207.00732   117  78.6923702           R-squared     =  0.8308
---------+------------------------------           Adj R-squared =  0.8279
   Total |  54403.5921   119  457.173043           Root MSE      =  8.8709

------------------------------------------------------------------------------
   cars1 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   class |   1.880726   .3344889      5.623   0.000       1.218289    2.543164
   unemp |  -1.827983   .1123766    -16.267   0.000      -2.050539   -1.605427
   _cons |   88.47704   2.885689     30.661   0.000       82.76208      94.192
------------------------------------------------------------------------------

Convergence : Bandwidth = 4829.2874

Running Monte Carlo simulation

Geographically Weighted Regression

Significance Test for Bandwidth

----------------------------------------
    Observed        P-Value
----------------------------------------
   4829.2874          0.010
----------------------------------------

Significance Tests for Non-Stationarity

-----------------------------------------
Variable              Si          P-Value
-----------------------------------------
Constant          6.0421            0.568
class             1.6079            0.023
unemp             0.1573            0.930
-----------------------------------------

The global model shows that the number of cars per household is significantly related to social class and male unemployment in the study region.
The test of the bandwidth suggests that the geographically weighted regression model is a significantly better model for this data than the global linear regression model.
The significance tests for nonstationarity of the parameter estimates show that the relationship between the number of cars per household and social class varies significantly over the study area. Mapping the estimates in the file gwrout.dta (in ARC-VIEW) suggested that the relationship between car ownership and social class was stronger in areas not particularly well served by the region’s light railway mass transit system, which has its focus on the city of Newcastle upon Tyne. Mapping the male unemployment parameter suggested that the relationship was less marked in certain areas. The cause of this was not immediately obvious, which suggests an area for further investigation.

Methods and formulae

The basic idea of geographically weighted regression is that a regression model is fitted at each point, i, weighting all observations, j, by a function of distance from that point. Hence observations sampled near to the observation where the regression is centred have more influence on the resulting regression parameters at that point than observations further away. This then produces a set of parameter estimates for a regression at each point in space.

The weighting function used by gwr takes the form

W_j = exp(-d_j / b^2)

where d_j is the distance of observation j from the point i at which the regression model is being fitted, and b is the bandwidth. This was the weighting function used by Brunsdon et al. (1996). If a different weighting function is desired, the ado-file can easily be altered.
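
As a minimal sketch of this weighting (not the gwr ado-file itself), the following Stata lines build the weights for a single regression point and fit the weighted model with regress and iweights. The simulated data, the variable names x, y, d and w, the choice of observation 1 as the regression point and the bandwidth of 3 are all assumptions made for illustration. The weight formula is taken exactly as written above.

```stata
* Simulated data: 100 points scattered over a 10 x 10 area (assumption)
clear
set seed 2024
set obs 100
generate double east  = 10*runiform()
generate double north = 10*runiform()
generate double x = rnormal()
generate double y = 1 + 2*x + rnormal()

local i = 1     // regression point: observation 1 (illustrative choice)
local b = 3     // bandwidth (illustrative value, not a calibrated one)

* Distance of every observation j from the regression point i
generate double d = sqrt((east - east[`i'])^2 + (north - north[`i'])^2)

* Weight as given in the text: W_j = exp(-d_j / b^2)
generate double w = exp(-d/(`b'^2))

* Local regression at point i, weighting observations by w
regress y x [iweight = w]
```

The regression point itself has distance 0 and so receives weight 1, the largest possible weight. Repeating these steps for every value of i would give one set of parameter estimates per point.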

The bandwidth is calibrated by a cross-validation technique which aims to minimise the score

sum_i [y_i - yhat_i(b)]^2

where the sum is taken over all observations i, and yhat_i(b) is the fitted value of y_i using the bandwidth b and the weighted regression model centred at the point i, with the observation for point i excluded from the calibration process.
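
The cross-validation score can be sketched in Stata as a leave-one-out loop, evaluated here for a few candidate bandwidths. This is an illustration under stated assumptions rather than the calibration routine used by gwr: the data are simulated, the candidate bandwidths 2, 3 and 5 are arbitrary, and no search for the minimising bandwidth is attempted.

```stata
* Simulated data: 60 points scattered over a 10 x 10 area (assumption)
clear
set seed 2024
set obs 60
generate double east  = 10*runiform()
generate double north = 10*runiform()
generate double x = rnormal()
generate double y = 1 + 2*x + rnormal()
generate double d = .
generate double w = .

* Arbitrary candidate bandwidths, for illustration only
foreach b of numlist 2 3 5 {
    scalar cv = 0
    forvalues i = 1/`=_N' {
        * Weights centred on point i
        quietly replace d = sqrt((east - east[`i'])^2 + (north - north[`i'])^2)
        quietly replace w = exp(-d/(`b'^2))
        * Fit without observation i, then predict y at point i
        quietly regress y x [iweight = w] if _n != `i'
        scalar cv = scalar(cv) + (y[`i'] - (_b[_cons] + _b[x]*x[`i']))^2
    }
    display "bandwidth = `b'   CV score = " scalar(cv)
}
```

The bandwidth with the smallest displayed score would be preferred among the candidates tried.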

Once the bandwidth calibration process is complete, gwr uses the optimal bandwidth to fit a weighted regression model at each point. The parameter estimates are output to either a temporary or a permanent data file, depending on the options used. Outputting the parameter estimates to a permanent data file (in Stata and/or text format) allows further investigation, for example using a geographical information system.

The two hypotheses previously described are tested by a Monte Carlo simulation, although the significance of the gwr approach as compared to a global regression model is only tested if specifically requested by the test option.
Each run of the Monte Carlo simulation randomly distributes the spatial points across the observations, and the gwr process is repeated.
If test is specified, the simulated bandwidths are compared with that calibrated using the observed data.
The second hypothesis is tested by comparing the standard deviation of the parameter estimates (Si) from the observed data with those from each run of the Monte Carlo simulation.
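
One way to picture a single run of the simulation is as a random reassignment of the coordinate pairs to the observations. The Stata sketch below shows only that shuffling step, on a toy dataset of 8 observations. The variable names id, u, pos, east_sim and north_sim are assumptions introduced for illustration, and the sketch is not taken from the gwr ado-file.

```stata
* Toy data: north is always east + 10, so intact pairs are easy to check
clear
set seed 2024
set obs 8
generate double east  = _n
generate double north = 10 + _n
generate double y = rnormal()

generate long   id = _n
generate double u  = runiform()
sort u                          // put the observations in random order
generate long   pos = _n        // position of each observation in that order
sort id                         // restore the original order

* pos is now a random permutation of 1..8; use it to reassign the
* coordinates, keeping each (east, north) pair together
generate double east_sim  = east[pos]
generate double north_sim = north[pos]

list id east north east_sim north_sim
```

In a full simulation, the weighted regressions would then be refitted using east_sim and north_sim in place of east and north, and the whole procedure repeated reps() times.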

Acknowledgements

I thank Brunsdon et al. for the use of their census data and original Fortran program, which can be obtained from their website.
I also thank Dr Heather Dickinson, Dr Chris Brunsdon, Mr Martin Charlton and Mr Trevor Dummer, University of Newcastle for their useful comments in the formulation of these ado-files.
This work was also helped by comments made at the 4th Stata UK User Group Meeting, May 1998.

References

Bowman A.W. 1984. An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71: 353-60.
Brunsdon C., A.S. Fotheringham and M.E. Charlton. 1996. Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical Analysis 28: 281-98.
Cleveland W.S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829-36.