EPA-600/1-76-024
June 1976
Environmental Health Effects Research Series
U.S.
-------
RESEARCH REPORTING SERIES
Research reports of the Office of Research and Development, U.S. Environ-
mental Protection Agency, have been grouped into five series. These five broad
categories were established to facilitate further development and application
of environmental technology. Elimination of traditional grouping was con-
sciously planned to foster technology transfer and a maximum interface in
related fields. The five series are:
1. Environmental Health Effects Research
2. Environmental Protection Technology
3. Ecological Research
4. Environmental Monitoring
5. Socioeconomic Environmental Studies
This report has been assigned to the ENVIRONMENTAL HEALTH EFFECTS
RESEARCH series. This series describes projects and studies relating to the
tolerances of man for unhealthful substances or conditions. This work is gener-
ally assessed from a medical viewpoint, including physiological or psycho-
logical studies. In addition to toxicology and other medical specialities, study
areas include biomedical instrumentation and health research techniques
utilizing animals, but always with intended application to human health measures.
This document is available to the public through the National Technical Informa-
tion Service, Springfield, Virginia 22161.
-------
REGRESSION USING "HOCKEY STICK" FUNCTIONS
By
Victor Hasselblad
John P. Creason
William C. Nelson
Statistics and Data Management Office
Health Effects Research Laboratory
U.S. Environmental Protection Agency
Research Triangle Park, N.C. 27711
U.S. ENVIRONMENTAL PROTECTION AGENCY
OFFICE OF RESEARCH AND DEVELOPMENT
HEALTH EFFECTS RESEARCH LABORATORY
RESEARCH TRIANGLE PARK, N.C. 27711
-------
DISCLAIMER
This report has been reviewed by the Health Effects Research
Laboratory, U.S. Environmental Protection Agency, and approved for
publication. Mention of trade names or commercial products does
not constitute endorsement or recommendation for use.
-------
ABSTRACT
The establishment of criteria for air pollutants requires that a
threshold level be established below which no adverse health effects
are observed. Since standard dose response curves, such as the logit
or probit, assume an effect at all levels, a segmented function was
developed. This function has zero slope up to a point and then
increases monotonically from that point, hence the name "hockey stick"
function. The increasing portion need not be linear; any function that
can be fitted by "least squares" techniques will work. A method for
computing confidence intervals is also given.
Since the curve can be used as a dose response curve, some comparisons
are made with the more conventional probit and logit curves. In general,
the fit of the "hockey stick" curve is as good as either the logit or
probit curve, even when the data originate from a logit or probit
distribution.
-------
1. Introduction
In dose-response type studies, the logit or probit functions are usually
applied to observed data to estimate the appropriate parameters of the response
curve. Only the probit and logit functions have been seriously considered, as
no theoretical basis has been suggested for other alternative functions. These
two functions are practically indistinguishable except for very small or very
large response regions. The rate of increase in response per unit increase in
dose is frequently very small in these regions, and considerable difficulties
are encountered in the determination of the function endpoints.
A major problem associated with the establishment of criteria for air
pollutants is that of determining acceptable levels (thresholds) below which
no effects are discernible. There is considerable support for the hypothesis
that these thresholds do exist for chemical toxins [1, 2, 3], the demonstration
of adaptive mechanisms being the most important. In this view,
effects due to pollutants will appear only when changes induced in a person
are beyond the capacity of that person's homeostatic mechanisms to overcome them.
There will therefore be a threshold dose level, $x_0$, below which there is no
response other than the existing background response, $y_0$. The dose-response
relationship may then be described by regression functions of the form

$$
y = \begin{cases} y_0, & x \le x_0 \\ y_0 + f(\theta, x - x_0), & x > x_0, \end{cases} \qquad f(\theta, 0) = 0, \tag{1}
$$

where $y_0$, $x_0$, and $\theta$ are to be estimated from the available data. As mentioned
above, considerable difficulties are encountered in trying to estimate $x_0$ using
the logit or probit functions for $f(\theta, x - x_0)$, so that alternative functions were
considered. The particular case of equation (1) examined in this paper is what
might be called a "hockey stick" function:
-------
$$
y = \begin{cases} y_0, & x \le x_0 \\ y_0 + b(x - x_0), & x \ge x_0. \end{cases} \tag{2}
$$
This function is a special case of two linear functions, with
different slopes, that has been considered by Quandt [4]. His more general
problem, which allowed for different variances as well, was solved by
calculating the likelihood function for varying $x_0$. The resulting tests
for significance are based on the likelihood ratio criterion. Quandt's linear
functions were not restricted to being continuous at the join point.
The more specific problem of obtaining least squares estimates of a
segmented function which is continuous at the join points has been
considered by Hudson [5]. The case of one join point is considered specifically,
and Hudson's general method can easily be applied to the specific
case of the "hockey stick" function.
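
As a minimal sketch, equation (2) can be written as a single Python function. The function name `hockey_stick` and its argument names are illustrative choices, not part of the report; the parameter values used below are those of the noise-free generating curve of the first example in Section 4.

```python
def hockey_stick(x, x0, y0, b):
    """Equation (2): constant at y0 up to the join point x0, then linear with slope b."""
    return y0 + b * max(0.0, x - x0)


# Noise-free curve of the first example in Section 4:
# y = 2 up to x = 4.5 and y = x - 2.5 beyond it.
xs = list(range(1, 11))
curve = [hockey_stick(x, x0=4.5, y0=2.0, b=1.0) for x in xs]
print(curve)
```

Writing the sloped part as `b * max(0, x - x0)` keeps the function continuous at the join point, which is the restriction that separates this case from Quandt's more general problem.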
2. Solutions of the Least Squares Equations
Suppose we have $n$ pairs of observations $(x_i, y_i)$, and without loss
of generality, assume that the $x_i$'s are ordered. Then the residual sum
of squares, $S$, is

$$
S(x_0) = \sum_{x_i \le x_0} (y_i - y_0)^2 + \sum_{x_i > x_0} \left[ y_i - y_0 - b(x_i - x_0) \right]^2. \tag{3}
$$

The problem can be solved easily for fixed $x_0$, giving a familiar looking
set of normal equations:
-------
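$$
\begin{bmatrix} n & \sum_{x_i > x_0} (x_i - x_0) \\ \sum_{x_i > x_0} (x_i - x_0) & \sum_{x_i > x_0} (x_i - x_0)^2 \end{bmatrix}
\begin{bmatrix} y_0 \\ b \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{x_i > x_0} (x_i - x_0)\, y_i \end{bmatrix}. \tag{4}
$$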
This could be done for each $x_0$ while $x_0$ is stepped in small increments from
$x_1$ to $x_n$. The values of $x_0$, $y_0$, and $b$ would be those values (not
necessarily unique) giving the smallest sum of squares.
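
The stepping procedure just described might be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the function names, the grid of 900 steps, and the noise-free test data (taken from the generating curve of the first example in Section 4) are all choices made here.

```python
def fit_fixed_join(xs, ys, x0):
    """Solve the 2x2 normal equations for y0 and b at a fixed join point x0."""
    n = len(xs)
    u = [max(0.0, x - x0) for x in xs]  # (x_i - x0) where x_i > x0, otherwise 0
    su = sum(u)
    suu = sum(v * v for v in u)
    sy = sum(ys)
    suy = sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det <= 0.0:  # no point lies beyond x0, so only a flat line can be fitted
        y0, b = sy / n, 0.0
    else:
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    rss = sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))
    return y0, b, rss


def grid_search(xs, ys, steps=900):
    """Step x0 from the smallest to the largest x and keep the smallest sum of squares."""
    lo, hi = min(xs), max(xs)
    best = None
    for k in range(steps + 1):
        x0 = lo + (hi - lo) * k / steps
        y0, b, rss = fit_fixed_join(xs, ys, x0)
        if best is None or rss < best[3]:
            best = (x0, y0, b, rss)
    return best


# Noise-free points from y = 2 (x <= 4.5), y = x - 2.5 (x >= 4.5).
xs = list(range(1, 11))
ys = [2.0, 2.0, 2.0, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
x0, y0, b, rss = grid_search(xs, ys)
print(x0, y0, b, rss)
```

On these noise-free points the search should recover the generating values (join point 4.5, level 2, slope 1). With real data the answer is only as fine as the step size, which is the limitation that Hudson's method avoids.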
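The method of Hudson, on the other hand, gives the exact minimum(s) in a
finite number of steps. $x_0$ must either lie in an interval between $x_j$ and
$x_{j+1}$, or at one of the $x_j$'s, and so equations need be derived for these two
cases only. If $x_0$ lies in the interval $(x_j, x_{j+1})$, then the values of $x_0$,
$y_0$, and $b$ are given by

$$
x_0 = \frac{1}{m} \left[ \sum_{i=j+1}^{n} x_i - \frac{1}{b} \left( \sum_{i=j+1}^{n} y_i - m y_0 \right) \right], \tag{5}
$$

$$
b = \frac{\left( \sum_{i=j+1}^{n} x_i \right) \left( \sum_{i=j+1}^{n} y_i \right) - m \sum_{i=j+1}^{n} x_i y_i}{\left( \sum_{i=j+1}^{n} x_i \right)^2 - m \sum_{i=j+1}^{n} x_i^2}, \tag{6}
$$

$$
y_0 = \frac{1}{j} \sum_{i=1}^{j} y_i, \tag{7}
$$

where $m = n - j$.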
-------
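If $x_0$ lies at $x_j$, the values of $y_0$ and $b$ are given by

$$
y_0 = \frac{\sum_{i=1}^{n} y_i \sum_{i=j+1}^{n} (x_i - x_j)^2 - \sum_{i=j+1}^{n} (x_i - x_j) \sum_{i=j+1}^{n} (x_i - x_j)\, y_i}{n \sum_{i=j+1}^{n} (x_i - x_j)^2 - \left( \sum_{i=j+1}^{n} (x_i - x_j) \right)^2}, \tag{8}
$$

$$
b = \left( \sum_{i=1}^{n} y_i - n y_0 \right) \Big/ \sum_{i=j+1}^{n} (x_i - x_j).
$$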
The residual sum of squares can be computed for the $n-1$ intervals and
$n$ $x_j$'s. The absolute minimum sum of squares can always be determined in at
most $2n-1$ steps.
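
The two cases can be combined into a short exact search. The sketch below is an interpretation of the procedure, not Hudson's or the authors' code. It assumes distinct $x$ values and keeps an interior solution only when the computed join point actually falls inside its interval. It skips the last interval because, with a single point on the sloped part, that case gives the same sum of squares as a join at $x_{n-1}$. All function names are hypothetical.

```python
def fit_at_join(xs, ys, x0):
    """Least squares y0 and b when the join point is fixed at x0."""
    n = len(xs)
    u = [max(0.0, x - x0) for x in xs]
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det <= 0.0:  # no point beyond x0: flat line
        return sy / n, 0.0
    return (sy * suu - su * suy) / det, (n * suy - su * sy) / det


def rss(xs, ys, x0, y0, b):
    """Residual sum of squares of a given hockey stick."""
    return sum((y - y0 - b * max(0.0, x - x0)) ** 2 for x, y in zip(xs, ys))


def fit_hudson(xs, ys):
    """Exact least squares hockey stick: try every interval and every data point."""
    pts = sorted(zip(xs, ys))
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    n = len(xs)
    candidates = []
    # Case 1: join strictly inside (x_j, x_{j+1}); j points flat, m = n - j sloped.
    for j in range(1, n - 1):
        m = n - j
        y0 = sum(ys[:j]) / j
        sx, sy = sum(xs[j:]), sum(ys[j:])
        sxy = sum(x * y for x, y in zip(xs[j:], ys[j:]))
        sxx = sum(x * x for x in xs[j:])
        den = sx * sx - m * sxx
        if den == 0.0:
            continue
        b = (sx * sy - m * sxy) / den
        if b == 0.0:
            continue
        x0 = (sx - (sy - m * y0) / b) / m  # where the fitted line meets the level y0
        if xs[j - 1] < x0 < xs[j]:
            candidates.append((rss(xs, ys, x0, y0, b), x0, y0, b))
    # Case 2: join exactly at one of the data points.
    for xj in xs:
        y0, b = fit_at_join(xs, ys, xj)
        candidates.append((rss(xs, ys, xj, y0, b), xj, y0, b))
    best_rss, x0, y0, b = min(candidates)
    return x0, y0, b, best_rss


# Noise-free points from y = 2 (x <= 4.5), y = x - 2.5 (x >= 4.5).
xs = list(range(1, 11))
ys = [2.0, 2.0, 2.0, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
print(fit_hudson(xs, ys))
```

Unlike a stepped search, the number of candidate fits here depends only on $n$, and the join point is not restricted to a grid.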
It is clear that the problem is symmetric in that the function

$$
y = \begin{cases} y_0, & x \ge x_0 \\ y_0 + b(x_0 - x), & x \le x_0 \end{cases}
$$

can be fitted by making the transformation

$$
z = -x.
$$
3. Confidence Intervals for $x_0$
Hudson also gives methods for computing confidence intervals. This
involves looking at "likelihood regions", which in this case are intervals
such that

$$
S(x_0) \le (1 + \delta)\, S(\hat{x}_0),
$$

where $S(x_0)$ is the residual sum of squares as a function of the join point
and $\hat{x}_0$ is the least squares estimate of the join point.
The value of $\delta$ can be approximately determined from F tables with 1 and $n-3$
degrees of freedom. As pointed out by Hudson and Quandt, this interval for
-------
$x_0$ need not be connected, although the function $S(x_0)$ is continuous. It is
also possible for $x_1$ and/or $x_n$ to be contained in the interval. If $x_1$ is
in the interval, then the "hockey stick" is not a significantly better fit
than a straight line. If $x_n$ is in the interval, then no meaningful relationship
between $x$ and $y$ has been discovered.
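
A rough sketch of such a likelihood region is given below. It rests on assumptions that go beyond the text. It takes $\delta = F/(n-3)$, with $F$ the tabled critical value for 1 and $n-3$ degrees of freedom, which is the usual likelihood-ratio form; the report only says that $\delta$ comes from F tables. It evaluates $S(x_0)$ on a grid rather than exactly. The data are hypothetical values invented for illustration, not the report's. The critical value 5.59 is the standard tabled 5 percent point of F with 1 and 7 degrees of freedom.

```python
def profile_rss(xs, ys, x0):
    """S(x0): residual sum of squares minimized over y0 and b at a fixed join point."""
    n = len(xs)
    u = [max(0.0, x - x0) for x in xs]
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det <= 0.0:  # no point beyond x0: flat line
        y0, b = sy / n, 0.0
    else:
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    return sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))


def likelihood_region(xs, ys, f_crit, steps=900):
    """Grid estimate of the join point and the sub-intervals where S <= (1 + delta) * S_min."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    s = [profile_rss(xs, ys, g) for g in grid]
    s_min = min(s)
    delta = f_crit / (n - 3)  # assumed form of delta, see the lead-in
    limit = (1.0 + delta) * s_min
    intervals, start, last = [], None, None
    for g, v in zip(grid, s):
        if v <= limit:
            if start is None:
                start = g
            last = g
        elif start is not None:
            intervals.append((start, last))
            start = None
    if start is not None:
        intervals.append((start, last))
    return grid[s.index(s_min)], intervals


# Hypothetical data: the generating curve of the first example plus small made-up errors.
xs = list(range(1, 11))
ys = [2.1, 1.9, 2.2, 1.8, 2.6, 3.4, 4.6, 5.4, 6.6, 7.4]
estimate, region = likelihood_region(xs, ys, f_crit=5.59)
print(estimate, region)
```

The region is returned as a list of intervals because it need not be connected. Checking whether the first interval starts at $x_1$ or the last one ends at $x_n$ gives the two diagnostic cases described above.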
4. Examples
The first example consists of data artificially generated from the
equation

$$
y = \begin{cases} 2, & x \le 4.5 \\ x - 2.5, & x \ge 4.5. \end{cases}
$$

The variance, $\sigma^2$, is 0.25.
The points are

x    1      2        3     4      5      6      7      8      9      10
y    1.058  2.18[?]  [?]   1.473  2.419  3.432  5.016  6.045  6.184  7.372
Figure 1 shows the points and the least squares fit, which is

$$
y = \begin{cases} 1.82\text{[?]}, & x \le 4.158 \\ .973\,x - 2.218, & x \ge 4.158. \end{cases}
$$

The confidence interval ($\alpha = .05$) for $x_0$ is (3.202, 5.155). [illegible] = .936.
A graph of the sum of squares, $S$, as a function of the join point, $x_0$,
is given in Figure 2.
The second example is taken from a 3 year study of 112 student
nurses in Los Angeles [6]. Each day the nurses filled in a health
questionnaire that included questions on eye discomfort, headache, fever, cough,
etc. The dependent variable was the percent of nurses who reported
-------
eye discomfort without experiencing fever. The independent variable
was the maximum hourly oxidant, expressed in pphm.
A "hockey stick" curve was fitted to the data. The least squares fit
gave
$$
y = \begin{cases} 5.77, & x < 14.67 \\ 5.77 + .617\,(x - 14.67), & x \ge 14.67. \end{cases} \tag{7}
$$

The 95 percent confidence limits for $x_0$ (= 14.67) were (13.25, 16.37). The
mean square error was 13.59. The relatively narrow confidence interval
resulted from the large number of points (867) and the consistency of
the data.
5. Comparison with Probit and Logit Curves
Although the use of a "hockey stick" curve provides a convenient
method for hypothesis testing, this does not mean that the curve is
a good fit to dose-response data. For this reason the "hockey stick"
was compared with the probit and logit curves for data which was simu-
lated from probit and logit curves. The results of these simulations
are given in Tables 1 and 2. The measure of goodness of fit is the
standard $R^2$ used in regression analysis.
There are several factors which make the comparison of the various
curves difficult. The data was simulated by starting with a known probit
(or logit) curve. Then 10, 20, 50 or 100 equally spaced points were selected
from the 0th to 50th percentile points. At each point, sample sizes of
-------
10, 20, 50 or 100 were generated using pseudo-random numbers. The
parameters of the probit (or logit) curve and the hockey stick were
estimated. Thus the generation of the data favors the probit (or logit)
curve.
On the other hand, the criterion of fit is the total sum of squares
(or standard $R^2$), which favors the hockey stick, since the probit (or
logit) curves can be thought of as weighted least squares fits. In addition,
the hockey stick is a three parameter curve, whereas the probit and
logit curves both have only two parameters. In spite of these difficulties,
Tables 1 and 2 give a crude comparison of the fits of the hockey
stick with the probit and logit curves.
The simulations were done for dose values ranging up to the 50th
percentile (LD50) of the distribution, since air pollution health data
rarely goes beyond the 50th percentile. Under these restrictions,
there was little or no difference between the fit of the "hockey stick"
vs. either the probit or logit curve. This was true for 10 to 100
doses (K) with 10 to 100 subjects (N) per dose. All curves had better
$R^2$ for increasing K and N, which is to be expected. Additional
simulations were made using a maximum dose at the 25th percentile and
a maximum dose at the 75th percentile. These simulations all showed a
similar pattern.
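
One replicate of this kind of simulation might look like the sketch below, which covers only the "hockey stick" side of the comparison (the probit or logit fit is omitted). Several details are assumptions, since the report does not give them: a standard normal tolerance distribution, a lowest dose at the 1st percentile (the 0th percentile of a normal curve is unbounded), a fixed random seed, and a grid search in place of the exact fit.

```python
import random
from statistics import NormalDist


def best_hockey_stick_rss(xs, ys, steps=400):
    """Smallest residual sum of squares over a grid of join points."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    best = float("inf")
    for k in range(steps + 1):
        x0 = lo + (hi - lo) * k / steps
        u = [max(0.0, x - x0) for x in xs]
        su, suu = sum(u), sum(v * v for v in u)
        sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
        det = n * suu - su * su
        if det <= 0.0:  # no point beyond x0: flat line
            y0, b = sy / n, 0.0
        else:
            y0 = (sy * suu - su * suy) / det
            b = (n * suy - su * sy) / det
        best = min(best, sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys)))
    return best


def simulate_r2(k_doses, n_subjects, seed=0, low_pct=0.01, high_pct=0.50):
    """Simulate K doses with N subjects each from a probit curve; return hockey stick R^2."""
    rng = random.Random(seed)
    probit = NormalDist()  # assumed parameters: mean 0, standard deviation 1
    x_lo, x_hi = probit.inv_cdf(low_pct), probit.inv_cdf(high_pct)
    xs = [x_lo + (x_hi - x_lo) * i / (k_doses - 1) for i in range(k_doses)]
    ys = []
    for x in xs:
        p = probit.cdf(x)  # true response probability at this dose
        responders = sum(rng.random() < p for _ in range(n_subjects))
        ys.append(responders / n_subjects)
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    if sst == 0.0:
        return 0.0
    return 1.0 - best_hockey_stick_rss(xs, ys) / sst


r2 = simulate_r2(k_doses=20, n_subjects=50)
print(r2)
```

Repeating the call over several seeds and over the K and N values used in the report would give a table comparable in layout to Tables 1 and 2, though not their actual entries.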
-------
6. Discussion
Although the fitting of a "hockey stick" function to data is not a
particularly difficult problem, there are at least two items which should
be noted. First, this simple departure from linear regression gives a sum
of squares function that may not have a first derivative with respect to $x_0$
at several points, a fact amply demonstrated in Figure 2 from our earlier
example. This alone is enough to create problems for many non-linear
least-squares fitting computer programs. Secondly, the implication of the
comparison to the probit and logit curves is of interest. Currently there is
a great deal of discussion about the extrapolation of dose-response curves to
very low doses. If 100 observations at each of 100 points cannot distinguish
between a "hockey stick" and a probit or logit curve, it is clear that the
resolution of this problem strictly through large scale sampling is not
feasible.
In summary, the "hockey stick" curve provides a convenient method for
estimation and hypothesis testing in low-dose and/or high-dose regions of dose
response curves. The estimation procedures are simple and straightforward,
and its fit to the data appears to be indistinguishable from that of the
standard logit or probit curves. The use of the "hockey stick" function is
definitely not a major breakthrough in curve fitting. It does, however, offer
a means of testing for a threshold level that is not available using standard
dose-response curves.
-------
REFERENCES
[1] Stokinger, H. E.: Concepts of Thresholds in Standards Setting. Arch.
Environ. Health 25:153-157, 1972.
[2] Dinman, B. D.: "Non-concept" of "no-threshold": Chemicals in the
Environment. Science 175:495-497, 1972.
[3] Waldron, H. A.: The Blood Lead Threshold. Arch. Environ. Health
29:271-273, 1974.
[4] Quandt, R. E.: "The Estimation of the Parameters of a Linear Regression
System Obeying Two Separate Regimes", Journal of the American Statistical
Assoc., Vol. 53 (1958), pp. 873-880.
[5] Hudson, D. J.: "Fitting Segmented Curves Whose Join Points Have to be
Estimated", Journal of the American Statistical Association, Vol. 61
(1966), pp. 1097-1129.
[6] Hammer, D. I., et al.: "The Los Angeles Student Nurse Study I.
Relationship of Daily Symptom Reporting and Photochemical Oxidants",
submitted to Archives of Environmental Health.
-------
Figure 1. Fitted hockey stick curve to artificial data.
Figure 2. Residual sum of squares as a function of the break
point, $x_0$.
-------
[Table 1. Comparison of $R^2$ for the simulations described in Section 5. The table entries are not recoverable from the source scan.]
-------
[Table 2. Comparison of $R^2$ for the simulations described in Section 5. The table entries are not recoverable from the source scan.]
------- |