Mark S. Pearce, Department of Child Health, University of Newcastle
upon Tyne, UK.
Fax (00)-44-191-2023060
EMAIL m.s.pearce@ncl.ac.uk
The syntax for the gwr commands is
gwr [varlist] [if exp] [in range], east(varname) north(varname) options
gwrgrid [varlist] [if exp] [in range], east(varname) north(varname) options
where the allowed options are
saving(filename) dots reps(#) double eform family(familyname)
link(linkname) [ln]offset(varname) test bandwidth(#)
replace noconstant nolog scale(x2|dev|#) disp(#)
iterate(#) init(varname) outfile(filename) comma wide
mcsave(filename) sample(#)
with an additional option for gwrgrid
square(#)
Description
The gwr command applies geographically weighted regression
to a dataset containing geographical reference points. These points must
be defined in the gwr command using the east() and north()
‘options’.
The gwrgrid command is a slight variation which fits a grid
across the area defined by east() and north(). This method
is especially useful for large datasets where the gwr command can become
quite time consuming.
The essence of geographically weighted regression is that it allows different relationships between the dependent and independent variables to exist at different points, (x,y), in space. For a full discussion of this method see Brunsdon et al. (1996).
The gwr command produces a set of parameter estimates for a regression at each point, i, at which there is an observation. The gwrgrid command produces a set of parameter estimates at each grid square centroid. Any grid square in which there are no observations is ignored. Using the outfile() and saving() options, these estimates can be explored in more detail, for example by mapping risk estimates across space in a geographical information system. In each regression the observations are weighted by a function of their distance from the point at which the regression is being carried out. The actual function used depends on a bandwidth, which is estimated using a cross-validation approach; see, for example, Cleveland (1979) and Bowman (1984).
The command is designed to test 2 hypotheses, both using a Monte Carlo simulation where the spatial points are randomly distributed amongst the data.
1. Does the geographically weighted regression model describe the data significantly better than a global regression model, the results of which are also shown in the output?
2. Does the set of parameter estimates exhibit significant spatial variation? This compares the standard deviations of the observed parameter estimates (Si) with those from the Monte Carlo simulation.
Options
test requests that the first hypothesis be tested, i.e. that
the model produced by gwr describes the data significantly better than
a global model.
Not testing this hypothesis reduces the need to calibrate the bandwidth
for each run of the Monte Carlo simulation and so reduces the time the
command will take to run. The second hypothesis, concerning spatial variation
of the parameter estimates, is always carried out, using either the user-defined
bandwidth or the bandwidth estimated from the observed data.
If test is specified and convergence of the bandwidth is not achieved
during the Monte Carlo simulation, the results note how many times
convergence was not achieved. The significance level is adjusted
to ignore those runs where convergence was not achieved.
bandwidth(#) allows the bandwidth to be declared, eliminating the need to calibrate the bandwidth and so saving time.
sample(#) specifies the percentage of observations to be used in the bandwidth calibration process, the default being 100%. If this option is specified, #% of the observations will be randomly sampled and used in the calibration process. The same number of observations will be used when recalibrating the bandwidth for the simulation test of the geographically weighted regression model.
square(#) is only used with gwrgrid and allows the size of the grid squares to be defined. The default is to set the width of the grid squares to be half the bandwidth.
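The grid construction can be sketched in a few lines of Python. This is only an illustration, not the code used by gwrgrid: the function name grid_centroids, the choice of the minimum easting and northing as the grid origin, and the sample coordinates are all assumptions made for the example. It shows the default square width (half the bandwidth) and the rule that empty squares are ignored.

```python
import math

def grid_centroids(east, north, width):
    """Return the centres of the grid squares holding at least one observation."""
    # Assumption of this sketch: the grid is anchored at the minimum
    # easting and northing. gwrgrid itself may anchor the grid differently.
    e0, n0 = min(east), min(north)
    occupied = set()
    for e, n in zip(east, north):
        col = math.floor((e - e0) / width)
        row = math.floor((n - n0) / width)
        occupied.add((col, row))
    # Squares with no observations never enter the set, so they are ignored.
    return sorted((e0 + (c + 0.5) * width, n0 + (r + 0.5) * width)
                  for c, r in occupied)

bandwidth = 4.0
width = bandwidth / 2          # default: half the bandwidth
east = [0.0, 0.5, 5.0]         # made-up coordinates
north = [0.0, 1.0, 5.0]
centroids = grid_centroids(east, north, width)
print(centroids)
```

Here the first two observations share one square, so only two regressions would be fitted instead of three, which is where the time saving for large datasets comes from.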
saving(filename) specifies the name of the file to contain the parameter estimates and grid reference for each point at which the regression is carried out.
outfile(filename) creates a text file filename.raw containing the parameter estimates from each point at which gwr is calculated. The comma and wide options for outfile() are also allowed. The file is set out as east north dep_vars constant
outfile() and saving() can both be specified simultaneously.
mcsave(filename) requests that the results of the Monte Carlo simulation be saved as filename.dta rather than using a temporary file. This file will contain the standard errors of the parameter estimates for each run, as well as the simulated bandwidths if the test option is specified. A simulated bandwidth of -99.99 indicates that the bandwidth calibration failed to converge.
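As a rough illustration of how the saved bandwidths might be post-processed, the Python sketch below drops runs flagged with -99.99 before computing a significance level. The function name, the simulated values, and in particular the direction of the comparison (simulated bandwidth less than or equal to the observed one) are assumptions of this example; the comparison gwr actually makes is not stated here.

```python
def bandwidth_p_value(observed, simulated):
    """Return (p-value, number of non-converged runs)."""
    # Bandwidths are positive, so any negative value is treated as the
    # -99.99 non-convergence flag. This avoids an exact comparison, which
    # could fail if the value was stored as a 4-byte float.
    converged = [b for b in simulated if b >= 0]
    failures = len(simulated) - len(converged)
    # Assumed comparison: proportion of converged runs whose bandwidth is
    # no larger than the observed bandwidth.
    p = sum(b <= observed for b in converged) / len(converged)
    return p, failures

simulated = [5000.0, 4000.0, -99.99, 6000.0, 4800.0]   # made-up values
p, failures = bandwidth_p_value(4829.2874, simulated)
print(p, failures)
```

The denominator is the number of converged runs rather than reps(#), which is one way of adjusting the significance level to ignore the failed runs.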
replace indicates that the filenames specified by saving(), outfile() or mcsave() may be overwritten.
double specifies that the results stored in the file specified by saving(), outfile() or mcsave() are stored as doubles (8-byte reals). By default they are stored as floats (4-byte reals).
nolog suppresses the display of the bandwidth calibration process.
iterate(#) specifies the maximum number of iterations allowed in estimating the bandwidth. The default is 50.
reps(#) specifies the number of Monte Carlo simulations to be performed. The default is 1000.
dots requests a dot be placed on the screen at the beginning of each run of the Monte Carlo simulation, showing how far the simulation has gone.
gwr uses iweights. For this reason, while the default model (i.e.
a linear regression model) actually uses regress (for speed), all
other models use the glm command.
Where gwr uses the glm command, many of the options used
with glm are also available for gwr, enabling the user to
define the form of model:
eform family(familyname) link(linkname) [ln]offset(varname) noconstant scale(x2|dev|#) disp(#) init(varname)
Not specifying family(), link() or [ln]offset() automatically results in the default linear regression model being used.
Example (a replication of Brunsdon et al.)
This example uses the ward-level 1991 census data for the county of
Tyne and Wear in the United Kingdom, and explores the relationship between
the number of cars per household (cars) and 2 independent variables, the
proportion of unemployed males (out of the total economically active male
population) (unemp) and social class (the proportion of households with
the head of the household in social class I) (sclass).
The spatial points are given by the variables east and north,
which in this case are the eastings and northings of the ward centroids
in kilometres.
Using this data, both hypotheses can be tested, in this example using a Monte Carlo simulation of 1000 repetitions.
. gwr cars sclass unemp, east(east) north(north) reps(1000) dots test saving(gwrout) replace nolog
Global Model

  Source |       SS       df       MS                  Number of obs =     120
---------+------------------------------               F(  2,   117) =  287.17
   Model |  45196.5848     2  22598.2924               Prob > F      =  0.0000
Residual |  9207.00732   117  78.6923702               R-squared     =  0.8308
---------+------------------------------               Adj R-squared =  0.8279
   Total |  54403.5921   119  457.173043               Root MSE      =  8.8709

------------------------------------------------------------------------------
   cars1 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   class |   1.880726   .3344889      5.623   0.000       1.218289    2.543164
   unemp |  -1.827983   .1123766    -16.267   0.000      -2.050539   -1.605427
   _cons |   88.47704   2.885689     30.661   0.000       82.76208      94.192
------------------------------------------------------------------------------
Convergence : Bandwidth = 4829.2874
Running Monte Carlo simulation
<dots omitted>
Geographically Weighted Regression

Significance Test for Bandwidth
----------------------------------------
    Observed         P-Value
----------------------------------------
   4829.2874           0.010
----------------------------------------

Significance Tests for Non-Stationarity
----------------------------------------
Variable            Si       P-Value
----------------------------------------
Constant        6.0421         0.568
class           1.6079         0.023
unemp           0.1573         0.930
----------------------------------------
The global model shows that the number of cars per household is significantly
related to social class and male unemployment in the study region.
The test of the bandwidth suggests that the geographically weighted
regression model is a significantly better model for this data than the
global linear regression model.
The significance tests for nonstationarity of the parameter estimates
show that the relationship between the number of cars per household and
social class varies significantly over the study area. Mapping the estimates
in the file gwrout.dta (in ARC-VIEW) suggested that the relationship
between car ownership and social class was stronger in areas not particularly
well served by the region’s light railway mass transit system which has
its focus on the city of Newcastle upon Tyne. Mapping the male unemployment
parameter suggested that the relationship was less marked in certain areas.
The cause of this was not immediately obvious, which suggests an area
for further investigation.
Methods and formulae
The basic idea of geographically weighted regression is that a regression model is fitted at each point, i, weighting all observations, j, by a function of distance from that point. Hence observations sampled near to the observation where the regression is centred have more influence on the resulting regression parameters at that point than observations further away. This then produces a set of parameter estimates for a regression at each point in space.
The weighting function used by gwr takes the form
W_j = exp(-d_j / b^2)
where d_j is the distance of observation j from the point i at which the regression model is being fitted, and b is the bandwidth. This was the weighting function used by Brunsdon et al. (1996). If a different weighting function is desired, the ado-file can easily be altered.
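To make the weighting concrete, here is a minimal Python sketch that evaluates exp(-d / b^2) for every observation relative to one regression point. It is an illustration only: the function name gwr_weights and the sample coordinates are hypothetical, and the distance is assumed to be Euclidean distance between the east/north coordinates.

```python
import math

def gwr_weights(east, north, i, bandwidth):
    """Weights of all observations for the regression centred on point i."""
    weights = []
    for ej, nj in zip(east, north):
        # Assumed: d_j is the Euclidean distance from point i to observation j.
        d = math.hypot(ej - east[i], nj - north[i])
        weights.append(math.exp(-d / bandwidth ** 2))
    return weights

east = [0.0, 3.0, 6.0]      # made-up coordinates
north = [0.0, 4.0, 8.0]
w = gwr_weights(east, north, 0, bandwidth=2.0)
print(w)
```

The observation at the regression point itself gets weight 1, and the weights fall towards 0 with distance. A larger bandwidth flattens this decay, so the local fits move closer to the global model.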
The bandwidth is calibrated by a cross-validation technique which aims to minimise the score
sum_i [y_i - yhat_i(b)]^2
where yhat_i(b) is the fitted value of y_i using the bandwidth b and
the weighted regression model centred at the point i, with the observation
for point i excluded from the calibration process.
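The leave-one-out score can be sketched in Python as follows. This is a simplified illustration, not the gwr implementation: it handles a single independent variable plus a constant, assumes Euclidean distances, and picks the bandwidth by a plain search over a list of candidates rather than by the iterative calibration that gwr performs. All function names and data are hypothetical.

```python
import math

def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def cv_score(x, y, east, north, bandwidth):
    """Sum over i of [y_i - yhat_i(b)]^2, with observation i left out."""
    n = len(y)
    score = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        w = [math.exp(-math.hypot(east[j] - east[i], north[j] - north[i])
                      / bandwidth ** 2) for j in others]
        a, b = weighted_fit([x[j] for j in others], [y[j] for j in others], w)
        score += (y[i] - (a + b * x[i])) ** 2
    return score

def calibrate(x, y, east, north, candidates):
    """Assumed stand-in for the calibration: smallest score among candidates."""
    return min(candidates, key=lambda bw: cv_score(x, y, east, north, bw))

# Made-up data in which the slope of y on x grows from west to east.
east = [float(i) for i in range(10)]
north = [0.0] * 10
x = [float((i * 7) % 5 + 1) for i in range(10)]
y = [(1.0 + 0.5 * e) * xi for e, xi in zip(east, x)]
candidates = [1.0, 2.0, 4.0, 8.0]
best = calibrate(x, y, east, north, candidates)
print(best, cv_score(x, y, east, north, best))
```

Leaving observation i out matters: if it were included, a very small bandwidth would reproduce each y_i almost exactly and the score would always favour the smallest bandwidth.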
Once the bandwidth calibration process is complete, gwr uses the optimal bandwidth to fit a weighted regression model at each point, the parameter estimates being output to either a temporary or permanent datafile depending on the options used. Outputting the parameter estimates to a permanent data file (in Stata and/or text format) allows further investigation, for example using a geographical information system.
The 2 hypotheses previously described are tested by a Monte Carlo simulation,
although the significance of the gwr approach as compared to a global regression
model is only tested if specifically requested by the test option.
Each run of the Monte Carlo simulation randomly distributes the spatial
points across the observations, and the gwr process is repeated.
If test is specified, the simulated bandwidths are compared with that
calibrated using the observed data.
The second hypothesis is tested by comparing the standard error of
the parameter estimates from the observed data with those from each run
of the Monte Carlo simulation.
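The simulation for the second hypothesis can be sketched in Python as below. Several things are assumptions of this sketch rather than statements about the ado-file: it uses one independent variable, Euclidean distances and a fixed bandwidth; it summarises each run by the standard deviation of the local slope estimates; and it takes the p-value to be the proportion of runs whose statistic is at least as large as the observed one. Function names and data are hypothetical.

```python
import math
import random
import statistics

def local_slopes(x, y, east, north, bandwidth):
    """Slope of a weighted fit y = a + b*x centred on each observation."""
    slopes = []
    for ei, ni in zip(east, north):
        w = [math.exp(-math.hypot(ej - ei, nj - ni) / bandwidth ** 2)
             for ej, nj in zip(east, north)]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xbar) * (yi - ybar)
                  for wi, xi, yi in zip(w, x, y))
        slopes.append(sxy / sxx)
    return slopes

def nonstationarity_test(x, y, east, north, bandwidth, reps=1000, seed=1):
    """Return (observed statistic, Monte Carlo p-value)."""
    rng = random.Random(seed)
    observed = statistics.pstdev(local_slopes(x, y, east, north, bandwidth))
    points = list(zip(east, north))
    exceed = 0
    for _ in range(reps):
        # Randomly redistribute the spatial points across the observations.
        rng.shuffle(points)
        e, n = zip(*points)
        simulated = statistics.pstdev(local_slopes(x, y, e, n, bandwidth))
        if simulated >= observed:
            exceed += 1
    return observed, exceed / reps

# Made-up data in which the slope of y on x grows from west to east.
east = [float(i) for i in range(20)]
north = [0.0] * 20
x = [float((i * 7) % 5 + 1) for i in range(20)]
y = [(1.0 + 0.5 * e) * xi for e, xi in zip(east, x)]
si, p = nonstationarity_test(x, y, east, north, bandwidth=2.0, reps=200)
print(si, p)
```

Shuffling the coordinates keeps each observation's y and x values together but breaks any link between the relationship and location, which is what the hypothesis of no spatial variation implies.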
Acknowledgements
I thank Brunsdon et al for the use of their census data and original
Fortran program which can be obtained from their website
I also thank Dr Heather Dickinson, Dr Chris Brunsdon, Mr Martin Charlton
and Mr Trevor Dummer, University of Newcastle for their useful comments
in the formulation of these ado-files.
This work was also helped by comments made at the 4th Stata UK User
Group Meeting, May 1998.
References
Bowman A.W. 1984. An alternative method of cross-validation for the
smoothing of density estimates. Biometrika 71: 353-60.
Brunsdon C., A.S. Fotheringham and M.E. Charlton. 1996. Geographically
weighted regression: A method for exploring spatial nonstationarity. Geographical
Analysis 28: 281-98.
Cleveland W.S. 1979. Robust locally weighted regression and smoothing
scatterplots. Journal of the American Statistical Association 74: 829-36.