Nick Stokes points out some fundamental problems with determining trends in surface temperatures, caused by the changing distribution of stations within a grid cell over time. Consider a 5×5 degree grid cell which contains a two-level plateau above a flat plain at sea level, as shown below. Temperature falls by about 6.5C per 1000m of altitude, so the real temperatures at the different locations will be as shown. The correct average surface temperature for that grid cell would therefore be something like (3×20 + 2×14 + 7)/6, or about 16C. What you actually measure depends on where your stations happen to be located. Since the number of stations and their locations are constantly changing with time, there is little hope of measuring any underlying trend of average temperature in that cell. You might even argue that an average surface temperature, in this context, is a meaningless concept.
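The sampling bias can be sketched in a few lines of Python. The altitudes and station counts below are invented to match the plateau example; only the lapse rate is physical.

```python
# Hypothetical grid cell: a two-level plateau above a sea-level plain.
# All station positions and counts are illustrative, not real data.

LAPSE_RATE = -6.5 / 1000.0  # degC per metre, standard lapse rate

def temp_at(altitude_m, sea_level_temp=20.0):
    """True temperature at a given altitude under a constant lapse rate."""
    return sea_level_temp + LAPSE_RATE * altitude_m

# Six sample points as in the text: three on the plain (0 m), two on the
# lower plateau level (~923 m, about 14 C) and one on the upper level
# (2000 m, about 7 C).
plain = [temp_at(0)] * 3
level1 = [temp_at(923)] * 2
level2 = [temp_at(2000)]

full = plain + level1 + level2
print(sum(full) / len(full))        # ~15.8 C, the "correct" cell average

# If only plain stations report in one year and only plateau stations in
# another, the measured cell average swings by several degrees with no
# real change in climate:
print(sum(plain) / len(plain))      # 20.0 C
print(sum(level1 + level2) / 3)     # ~11.7 C
```

The cell average jumps between roughly 20C and 12C purely from the station mix, dwarfing any real trend of a few tenths of a degree per decade.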

The mainstream answer to this problem is to use temperature anomalies instead. To do this we must define a monthly ‘normal’ temperature for each station over a 30-year period, e.g. 1961-1990. In a second step we subtract these ‘normals’ from the measured temperatures to get ΔT, the ‘anomaly’ for that month. We then average those values over the grid to get the average anomaly for that measurement month relative to 1961-1990. Finally we can average over all months and all grid cells to derive the global annual temperature anomaly. The sampling bias has not really disappeared but has been partly subtracted out. There is still an assumption that all stations within a cell react to warming (or cooling) in synchrony. The procedure also introduces a new problem for stations with insufficient data within the selected 30-year period, and this can invalidate some of the most valuable older stations. Are there other ways to approach this problem?
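The standard anomaly method can be sketched as follows. The station names and readings are made up, and for brevity each station carries values for a single calendar month keyed by year.

```python
# Minimal sketch of the anomaly method with invented data.
# records[station][year] = temperature for one fixed calendar month.
records = {
    "plain_A":   {1961: 20.1, 1975: 20.3, 1990: 20.6, 2000: 20.9},
    "plateau_B": {1961: 14.0, 1975: 14.2, 1990: 14.5, 2000: 14.8},
}

def normal(station, base=(1961, 1990)):
    """Station 'normal': mean over years falling inside the base period."""
    vals = [t for y, t in records[station].items() if base[0] <= y <= base[1]]
    return sum(vals) / len(vals)

def cell_anomaly(year):
    """Average anomaly over all stations reporting in that year."""
    anoms = [records[s][year] - normal(s)
             for s in records if year in records[s]]
    return sum(anoms) / len(anoms)

print(round(cell_anomaly(2000), 2))  # prints 0.57
```

Although the plain and plateau stations differ by 6C in absolute temperature, their anomalies agree, which is why averaging anomalies is far less sensitive to the station mix than averaging temperatures. A station with no readings inside the base period, however, has no defined `normal` and drops out entirely.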

For the GHCN V1 and GHCN V3 (uncorrected) datasets I wanted to use all stations, so I took a naive approach: I simply used monthly normals defined per grid cell, rather than per station, calculated over the entire period.
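A minimal sketch of that per-cell alternative, with invented readings for a single cell, might look like this:

```python
# Sketch of the 'naive' approach: one monthly normal per grid cell,
# pooled over all stations and all years, rather than per station over
# a fixed 30-year base period. The readings below are invented.
from collections import defaultdict

# readings: (station, year, month, temp) for one 5x5 degree cell
readings = [
    ("plain_A",   1950, 7, 20.0), ("plain_A",   2000, 7, 20.9),
    ("plateau_B", 1950, 7, 13.9), ("plateau_B", 2000, 7, 14.8),
]

# One normal per calendar month for the whole cell
month_sum = defaultdict(float)
month_n = defaultdict(int)
for _, _, month, temp in readings:
    month_sum[month] += temp
    month_n[month] += 1
cell_normal = {m: month_sum[m] / month_n[m] for m in month_sum}

# Anomaly of each reading relative to the cell-month normal
anomalies = [(s, y, temp - cell_normal[m]) for s, y, m, temp in readings]
```

No station is discarded for lacking 1961-1990 coverage, but the altitude bias is not removed per station, so a cell whose station mix drifts between plain and plateau over time will still show a spurious trend.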

A novel approach to this problem was first proposed by Tamino and then refined by RomanM and Nick Stokes. I will try to simplify their ideas without too much linear algebra; corrections are welcome.

Each station i is characterised by a fixed offset μ_i from the grid average. This remains constant in time because it is due, for example, to the station’s altitude. We can estimate μ_i by first calculating all the monthly average temperatures T̄(m) for the particular grid cell in which the station appears. Then by definition, for any of the monthly averages,

T_i(m) ≈ T̄(m) + μ_i

so now, in a second step, by averaging the individual ‘offsets’ T_i(m) − T̄(m) over all months for a given station, we can estimate μ_i.
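These two steps can be sketched directly on toy data. The series below are invented, with `None` marking months where a station did not report.

```python
# Sketch of the two-step offset estimate, on invented data.
# temps[station][m] = reading for month m; None where the station is absent.
temps = {
    "plain_A":   [20.0, 20.2, None, 20.9],
    "plateau_B": [14.0, None, 14.5, 14.8],
}
months = range(4)

# Step 1: plain monthly cell averages over whichever stations report
cell_avg = []
for m in months:
    vals = [series[m] for series in temps.values() if series[m] is not None]
    cell_avg.append(sum(vals) / len(vals))

# Step 2: each station's offset is the mean of T_i(m) - Tbar(m)
# over the months where it has data
offsets = {}
for s, series in temps.items():
    diffs = [series[m] - cell_avg[m] for m in months if series[m] is not None]
    offsets[s] = sum(diffs) / len(diffs)
```

As I understand it, the RomanM and Nick Stokes versions solve for the offsets and the cell averages simultaneously by least squares; the one-pass estimate above can be viewed as a first iteration of that fit, biased slightly in months where only one station reports.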

So, having found the set of all station ‘offsets’ in the database, we can calculate temperature anomalies using all available stations in any month. I still think the anomalies need to be normalised to some standard period, but at least the bias due to a changing set of stations will be reduced, especially in the important early years.

P.S. I will try this out when time permits.