Mark S. Pearce, Department of Child Health, University of Newcastle
upon Tyne, UK.

Fax (00)-44-191-2023060

EMAIL m.s.pearce@ncl.ac.uk

The syntax for the **gwr** commands is

**gwr** [varlist] [if exp] [in range], **east**(varname) **north**(varname) options

**gwrgrid** [varlist] [if exp] [in range], **east**(varname) **north**(varname) options

where the allowed options are

**saving**(filename) **dots** **reps**(#) **double** **eform** **family**(familyname)
**link**(linkname) **[ln]offset**(varname) **test** **bandwidth**(#)
**replace** **noconstant** **nolog** **scale**(x2|dev|#) **disp**(#)
**iterate**(#) **init**(varname) **outfile**(filename) **comma** **wide**
**mcsave**(filename) **sample**(#)

with an additional option for **gwrgrid**

**square**(#)

__Description__

The **gwr** command applies geographically weighted regression
to a dataset containing geographical reference points. These points must
be defined in the gwr command using the **east**() and **north**()
‘options’.

The **gwrgrid** command is a slight variation which fits a grid
across the area defined by **east**() and **north**(). This method
is especially useful for large datasets where the gwr command can become
quite time consuming.

The essence of geographically weighted regression is that it allows different relationships between the dependent and independent variables to exist at different points, (x,y), in space. For a full discussion of this method see Brunsdon et al. (1996).

The **gwr** command produces a set of parameter estimates for a regression
at each point, i, at which there is an observation. The **gwrgrid**
command produces a set of parameter estimates at each grid square centroid.
Any grid square in which there are no observations is ignored. Using the
outfile() and saving() options these estimates can be explored in more
detail, for example using a geographical information system to map risk
estimates across space. In each regression the observations are weighted
by a function related to their distance from the point at which the regression
is being carried out. The actual function used depends upon a bandwidth,
which is estimated using a cross-validation approach; see, for example,
Cleveland (1979) and Bowman (1984).

The command is designed to test 2 hypotheses, both using a Monte Carlo simulation where the spatial points are randomly distributed amongst the data.

1. Does the geographically weighted regression model describe the data significantly better than a global regression model, the results of which are also shown in the output?

2. Does the set of parameter estimates exhibit significant spatial variation? This compares the standard deviations of the observed parameter estimates (Si) with those from the Monte Carlo simulation.

__Options__

**test** requests that the first hypothesis be tested, i.e. that
the model produced by gwr describes the data significantly better than
a global model.

Not testing this hypothesis reduces the need to calibrate the bandwidth
for each run of the Monte Carlo simulation and so reduces the time the
command will take to run. The second hypothesis, concerning spatial variation
of the parameter estimates, is always carried out, using either the user-defined
bandwidth or the bandwidth estimated from the observed data.

If test is specified and convergence of the bandwidth is not achieved
during the Monte Carlo simulation, the results note how many times
convergence failed. The significance level is adjusted to ignore those
runs.

**bandwidth**(#) allows the bandwidth to be declared, eliminating
the need to calibrate the bandwidth and so saving time.

**sample**(#) specifies the percentage of observations to be used
in the bandwidth calibration process, the default being 100%. If this option
is specified, #% of the observations will be randomly sampled and used
in the calibration process. The same number of observations will be used
when recalibrating the bandwidth for the simulation test of the geographically
weighted regression model.

**square**(#) is only used with gwrgrid and allows the size of the
grid squares to be defined. The default is to set the width of the grid
squares to be half the bandwidth.

**saving**(filename) specifies the name of the file to contain the
parameter estimates and grid reference for each point at which the regression
is carried out.

**outfile**(filename) creates a text file filename.raw containing
the parameter estimates from each point at which gwr is calculated. The
comma and wide options for outfile() are also allowed. The file is set
out as: east north dep_vars constant.

**outfile**() and **saving**() can be specified together.

**mcsave**(filename) requests that the results of the Monte Carlo
simulation be saved as filename.dta rather than using a temporary file.
This file will contain the standard errors of the parameter estimates for
each run, as well as the simulated bandwidths if the test option is specified.
A simulated bandwidth of -99.99 indicates that the bandwidth calibration
failed to converge.

**replace** indicates that the filenames specified by saving(), outfile()
or mcsave() may be overwritten.

**double** specifies that the results stored in the file specified
by saving(), outfile() or mcsave() are stored as doubles (8-byte reals).
By default they are stored as floats (4-byte reals).

**nolog** suppresses the display of the bandwidth calibration process.

**iterate**(#) specifies the maximum number of iterations allowed
in estimating the bandwidth. The default is 50.

**reps**(#) specifies the number of Monte Carlo simulations to be
performed. The default is 1000.

**dots** requests a dot be placed on the screen at the beginning
of each run of the Monte Carlo simulation, showing how far the simulation
has gone.

**gwr** uses iweights. For this reason, while the default model (i.e.
a linear regression model) actually uses **regress** (for speed), all
other models use the **glm** command.

Where **gwr** uses the **glm** command, many of the options used
with **glm** are also available for **gwr**, enabling the user to
define the form of model:

**eform** **family**(familyname) **link**(linkname) **[ln]offset**(varname)
**noconstant** **scale**(x2|dev|#) **disp**(#) **init**(varname)

If none of **family**(), **link**() or **[ln]offset**() is specified, the
default linear regression model is used.
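The options described in this section might be combined as in the following sketches. They assume that gwr and gwrgrid are installed and that a suitable dataset is in memory. The numeric values for bandwidth() and square(), the file name gridout, and the variables deaths and pop are hypothetical and serve only as illustration.

```stata
* Sketches only: variable names, file names and numeric values are illustrative.

* 1. Supply a bandwidth so that no calibration is needed (value is made up).
gwr cars sclass unemp, east(east) north(north) bandwidth(5000) reps(100) dots

* 2. Grid version with a user-chosen square width (value is made up),
*    saving the estimates for each grid square centroid.
gwrgrid cars sclass unemp, east(east) north(north) square(2500) saving(gridout) replace

* 3. A model fitted through glm: Poisson family, log link and a log offset.
*    deaths and pop are hypothetical variables.
gwr deaths unemp, east(east) north(north) family(poisson) link(log) lnoffset(pop) eform
```

Because the third call specifies family(), link() and lnoffset(), it is fitted with glm rather than regress, as described above.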

__Example (a replication of Brunsdon et al.)__

This example uses the ward-level 1991 census data for the county of
Tyne and Wear in the United Kingdom, and explores the relationship between
the number of cars per household (cars) and 2 independent variables, the
proportion of unemployed males (out of the total economically active male
population) (unemp) and social class (the proportion of households with
the head of the household in social class I) (sclass).

The spatial points are given by the variables east and north. In this
case they are the eastings and northings of the ward centroids in
kilometres.

Using this data, both hypotheses can be tested, in this example using a Monte Carlo simulation of 1000 repetitions.

. **gwr cars sclass unemp, east(east) north(north) reps(1000) test saving(gwrout) replace dots nolog**

Global Model

      Source |       SS       df       MS              Number of obs =     120
    ---------+------------------------------           F(  2,   117) =  287.17
       Model |  45196.5848     2  22598.2924           Prob > F      =  0.0000
    Residual |  9207.00732   117  78.6923702           R-squared     =  0.8308
    ---------+------------------------------           Adj R-squared =  0.8279
       Total |  54403.5921   119  457.173043           Root MSE      =  8.8709

    ------------------------------------------------------------------------------
       cars1 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       class |   1.880726   .3344889      5.623   0.000       1.218289    2.543164
       unemp |  -1.827983   .1123766    -16.267   0.000      -2.050539   -1.605427
       _cons |   88.47704   2.885689     30.661   0.000       82.76208      94.192
    ------------------------------------------------------------------------------

    Convergence : Bandwidth = 4829.2874

    Running Monte Carlo simulation
    <dots omitted>

    Geographically Weighted Regression

    Significance Test for Bandwidth
    ----------------------------------------
        Observed        P-Value
    ----------------------------------------
       4829.2874          0.010
    ----------------------------------------

    Significance Tests for Non-Stationarity
    ----------------------------------------
    Variable          Si        P-Value
    ----------------------------------------
    Constant      6.0421          0.568
    class         1.6079          0.023
    unemp         0.1573          0.930
    ----------------------------------------

The global model shows that the number of cars per household is significantly
related to social class and male unemployment in the study region.

The test of the bandwidth suggests that the geographically weighted
regression model is a significantly better model for this data than the
global linear regression model.

The significance tests for nonstationarity of the parameter estimates
show that the relationship between the number of cars per household and
social class varies significantly over the study area. Mapping the estimates
in the file gwrout.dta (in ARC-VIEW) suggested that the relationship
between car ownership and social class was stronger in areas not particularly
well served by the region’s light railway mass transit system which has
its focus on the city of Newcastle upon Tyne. Mapping the male unemployment
parameter suggested that the relationship was less marked in certain areas,
the cause of which wasn’t immediately obvious, thus suggesting an area
of further investigation.
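Before mapping, the saved estimates can be inspected in Stata. This is a minimal sketch that assumes the example command above has been run, so that gwrout.dta exists in the working directory. The names of the variables stored in that file are not given in this text, so the sketch lists them rather than assuming them.

```stata
* Sketch: look at the local parameter estimates written by saving(gwrout).
use gwrout, clear
describe      // lists the stored variables (grid references and estimates)
summarize     // range of each local parameter estimate across the points
```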

__Methods and formulae__

The basic idea of geographically weighted regression is that a regression model is fitted at each point, i, weighting all observations, j, by a function of distance from that point. Hence observations sampled near to the observation where the regression is centred have more influence on the resulting regression parameters at that point than observations further away. This then produces a set of parameter estimates for a regression at each point in space.

The weighting function used by gwr takes the form

$$W_j = \exp\left(-d_j / b^2\right)$$

where $d_j$ is the distance of observation $j$ from the point $i$ at which the regression model is being fitted, and $b$ is the bandwidth. This was the weighting function used by Brunsdon et al. If a different weighting function is desired, the ado-file can easily be altered.
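To make the weighting concrete, the following sketch computes the weights for a single regression point and fits one weighted regression. It is not taken from the ado-file. It uses simulated data, an arbitrary bandwidth of 3 and made-up variable names, and it assumes a Stata version that provides runiform() and rnormal(). The weights are passed as iweights, which is the weight type gwr is described as using.

```stata
* Sketch only: kernel weights and one local regression centred on observation 1.
* The data are simulated; names and the bandwidth value are illustrative.
clear
set seed 12345
set obs 100
generate double east  = 10 * runiform()
generate double north = 10 * runiform()
generate double x     = rnormal()
generate double y     = 1 + (0.5 + 0.1 * east) * x + rnormal(0, 0.5)

local b = 3                                   // assumed bandwidth

* distance of every observation j from the regression point i = 1
generate double d = sqrt((east - east[1])^2 + (north - north[1])^2)

* weighting function as given in the text: W_j = exp(-d_j / b^2)
generate double w = exp(-d / `b'^2)

* weighted regression centred on point 1
regress y x [iweight = w]
display "local slope at point 1 = " _b[x]
```

Repeating the last three steps for every observation gives one set of parameter estimates per point.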

The bandwidth is calibrated by a cross-validation technique which aims to minimise the score

$$\sum_i \left[y_i - \hat{y}_i(b)\right]^2$$

where $\hat{y}_i(b)$ is the fitted value of $y_i$ using the bandwidth $b$ and
the weighted regression model centred at the point $i$, with the observation
for point $i$ excluded from the calibration process.
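The cross-validation score can be illustrated with a short sketch that evaluates it for a few candidate bandwidths. This is a simplification: gwr calibrates the bandwidth iteratively, whereas the sketch only compares a fixed, arbitrary list of values. The data are simulated, all names are made up, and a Stata version with runiform() and rnormal() is assumed.

```stata
* Sketch only: leave-one-out cross-validation score for candidate bandwidths.
* Simulated data; names and values are illustrative, not taken from gwr.ado.
clear
set seed 2024
set obs 60
generate double east  = 10 * runiform()
generate double north = 10 * runiform()
generate double x     = rnormal()
generate double y     = 1 + (0.5 + 0.1 * east) * x + rnormal(0, 0.5)
generate double d     = .
generate double w     = .

foreach b of numlist 2 3 5 8 {
    scalar cv = 0
    forvalues i = 1/`=_N' {
        quietly {
            replace d = sqrt((east - east[`i'])^2 + (north - north[`i'])^2)
            replace w = exp(-d / `b'^2)
            * fit the model centred on point i with observation i left out
            regress y x [iweight = w] if _n != `i'
            * add the squared prediction error for the omitted observation
            scalar cv = cv + (y[`i'] - (_b[_cons] + _b[x] * x[`i']))^2
        }
    }
    display "bandwidth = `b'    CV score = " cv
}
```

The candidate with the smallest score would be preferred. A proper calibration searches over a continuous range rather than a short list.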

Once the bandwidth calibration process is complete, gwr uses the optimal bandwidth to fit a weighted regression model at each point, the parameter estimates being output to either a temporary or permanent datafile depending on the options used. Outputting the parameter estimates to a permanent data file (in Stata and/or text format) allows further investigation, for example using a geographical information system.

The 2 hypotheses previously described are tested by a Monte Carlo simulation,
although the significance of the gwr approach as compared to a global regression
model is only tested if specifically requested by the test option.

Each run of the Monte Carlo simulation randomly distributes the spatial
points across the observations, and the gwr process is repeated.

If test is specified, the simulated bandwidths are compared with that
calibrated using the observed data.

The second hypothesis is tested by comparing the standard error of
the parameter estimates from the observed data with those from each run
of the Monte Carlo simulation.
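The test for spatial variation can be sketched as follows, under several assumptions. The helper program localsd is hypothetical. The bandwidth is fixed at an arbitrary value. The statistic is the standard deviation of the local slope estimates across points. The p-value uses a rank-based convention that may differ from the one gwr implements. The data are simulated and a Stata version with runiform() and rnormal() is assumed.

```stata
* Sketch only: Monte Carlo test of spatial variation in one local coefficient.
clear all
set seed 4321
set obs 40
generate double east  = 10 * runiform()
generate double north = 10 * runiform()
generate double x     = rnormal()
generate double y     = 1 + (0.2 * east) * x + rnormal(0, 0.5)
generate long   obsid = _n

* hypothetical helper: SD across points of the local slope on x
program define localsd, rclass
    args e n b
    tempvar d w slope
    quietly {
        generate double `d' = .
        generate double `w' = .
        generate double `slope' = .
        forvalues i = 1/`=_N' {
            replace `d' = sqrt((`e' - `e'[`i'])^2 + (`n' - `n'[`i'])^2)
            replace `w' = exp(-`d' / `b'^2)
            regress y x [iweight = `w']
            replace `slope' = _b[x] in `i'
        }
        summarize `slope'
    }
    return scalar sd = r(sd)
end

* statistic for the observed locations (bandwidth 3 is arbitrary)
localsd east north 3
scalar s_obs = r(sd)

local reps 19          // kept small for speed; the gwr default is 1000
local extreme 0
forvalues r = 1/`reps' {
    quietly {
        * shuffle the coordinate pairs across observations
        capture drop u perm east_p north_p
        generate double u = runiform()
        sort u
        generate long perm = _n
        sort obsid
        generate double east_p  = east[perm]
        generate double north_p = north[perm]
    }
    localsd east_p north_p 3
    if r(sd) >= s_obs {
        local ++extreme
    }
}
display "observed statistic  = " s_obs
display "Monte Carlo p-value = " (`extreme' + 1) / (`reps' + 1)
```

Only the coordinates are shuffled. Each observation keeps its own y and x, so any spatial pattern in the relationship is broken while the data values are unchanged.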

__Acknowledgements__

I thank Brunsdon et al. for the use of their census data and original
Fortran program, which can be obtained from their website.

I also thank Dr Heather Dickinson, Dr Chris Brunsdon, Mr Martin Charlton
and Mr Trevor Dummer, University of Newcastle for their useful comments
in the formulation of these ado-files.

This work was also helped by comments made at the 4th Stata UK User
Group Meeting, May 1998.

__References__

Bowman A.W. 1984. An alternative method of cross-validation for the
smoothing of density estimates. *Biometrika* **71**: 353-60.

Brunsdon C., A.S. Fotheringham and M.E. Charlton. 1996. Geographically
weighted regression: A method for exploring spatial nonstationarity. *Geographical
Analysis* **28**: 281-98.

Cleveland W.S. 1979. Robust locally weighted regression and smoothing
scatterplots. *Journal of the American Statistical Association* **74**: 829-36.