Regression Using Hockey Stick Functions


EPA-600/1-76-024
June 1976
Environmental Health Effects Research Series

-------
                 RESEARCH REPORTING SERIES

Research reports  of the Office of  Research and Development, U.S. Environ-
mental Protection Agency, have been grouped into five series.  These five broad
categories were established to facilitate further development and application
of environmental  technology.   Elimination  of traditional  grouping  was  con-
sciously planned  to foster technology transfer  and a  maximum interface in
related fields. The five series are:
     1.    Environmental Health Effects Research
     2.    Environmental Protection Technology
     3.    Ecological Research
     4.    Environmental Monitoring
     5.    Socioeconomic Environmental Studies
This report has been  assigned to  the ENVIRONMENTAL HEALTH EFFECTS
RESEARCH series. This series describes  projects  and studies relating to the
tolerances of man for unhealthful substances or conditions.  This work is gener-
ally  assessed from a  medical viewpoint,  including physiological or  psycho-
logical studies.  In addition to toxicology and other medical specialities, study
areas include biomedical instrumentation and health research techniques uti-
lizing animals—but always  with intended application to human health measures.
 This document is available to the public through the National Technical Informa-
 tion Service, Springfield, Virginia 22161.

-------
REGRESSION USING "HOCKEY STICK" FUNCTIONS
                  By

         Victor Hasselblad
          John P. Creason
         William C. Nelson
 Statistics and Data Management Office
  Health Effects Research Laboratory
 U.S. Environmental Protection Agency
  Research Triangle Park, N.C. 27711
 U.S. ENVIRONMENTAL PROTECTION AGENCY
  OFFICE OF RESEARCH AND DEVELOPMENT
  HEALTH EFFECTS RESEARCH LABORATORY
  RESEARCH TRIANGLE PARK, N.C. 27711

-------
                          DISCLAIMER

     This report has been reviewed by the Health Effects Research
Laboratory,  U.S. Environmental  Protection Agency, and approved for
publication.   Mention of trade  names or commercial  products does
not constitute endorsement or recommendation for use.

-------
                                ABSTRACT

     The establishment of criteria  for air pollutants  requires  that  a
threshold level  be established below which no  adverse  health  effects
are observed.   Since standard dose  response curves,  such  as the logit
or probit, assume an effect at all  levels, a segmented function was
developed.  This function has zero  slope  up to a  point, and then
increases monotonically from that point.   Thus the name "hockey stick"
function.  The increasing portion need not be  linear;  any function that
can be fitted by "least squares" techniques will  work. A method for
computing confidence intervals is also given.
     Since the curve can be used as a dose response curve, some comparisons
are made with the more conventional probit and logit curves.   In general,
the fit of the "hockey stick" curve is as good as either  the  logit or
probit curve, even when the data originate from a logit or probit
distribution.

-------
                              1.  Introduction

     In dose-response type studies, the logit or probit functions are usually
applied to observed data to estimate the appropriate parameters of the response
curve.  Only the probit and logit functions have been seriously considered, as
no theoretical basis has been suggested for other alternative functions.  These
two functions are practically indistinguishable except for very small or very
large response regions.  The rate of increase in response per unit increase in
dose is frequently very small in these regions, and considerable difficulties
are encountered in the determination of the function endpoints.
     A major problem associated with the establishment of criteria for air
pollutants is that of determining acceptable levels (thresholds) below which
no effects are discernible.  There is considerable support for the hypothesis
that these thresholds do exist for chemical toxins (1,2,3), the most important
being the demonstration of adaptive mechanisms.  In this view,
effects due to pollutants will appear only when changes induced in a person
are beyond the capacity of that person's homeostatic mechanisms to overcome them.
There will therefore be a threshold dose level, x_0, below which there is no
response other than the existing background response, y_0.  The dose-response
relationship may then be described by regression functions of the form

     y = y_0,                          x \le x_0
     y = y_0 + f(\theta, x - x_0),     x > x_0,     f(\theta, 0) = 0,        (1)

where y_0, x_0, and \theta are to be estimated from the available data.  As mentioned
above, considerable difficulties are encountered in trying to estimate x_0 using
the logit or probit functions for f(\theta, x - x_0), so that alternative functions were
considered.  The particular case of equation (1) examined in this paper is what
might be called a "hockey stick" function:

-------
     y = y_0,                  x \le x_0
     y = y_0 + b(x - x_0),     x \ge x_0                              (2)
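
As a small illustration of equation (2), the function can be evaluated directly. This is only a sketch; the name `hockey_stick` and the parameter values in the usage lines are hypothetical choices made for the illustration.

```python
def hockey_stick(x, y0, b, x0):
    """Equation (2): constant y0 up to the join point x0, then linear with slope b."""
    if x <= x0:
        return y0
    return y0 + b * (x - x0)


# Illustrative parameters only: background 2.0, slope 1.0, join point 4.5.
print(hockey_stick(3.0, 2.0, 1.0, 4.5))   # flat segment
print(hockey_stick(6.5, 2.0, 1.0, 4.5))   # rising segment
```

The two branches agree at x = x0, so the curve is continuous at the join point.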
     This function is a special case of two linear functions, with
different slopes, that has been considered by Quandt (4).  His more general
problem, which allowed for different variances as well, was solved by
calculating the likelihood function for varying x_0.  The resulting tests
for significance are based on the likelihood ratio criterion.  Quandt's linear
functions were not restricted to being continuous at the join point.
     The more specific problem of obtaining least squares estimates of a
segmented function which is continuous at the join points has been
considered by Hudson (5).  The case of one join point is considered specifically,
and Hudson's general method can easily be applied to the specific
case of the "hockey stick" function.

             2.  Solutions of the Least  Squares Equations

     Suppose we have n pairs of observations (x_i, y_i), and without loss
of generality, assume that the x_i's are ordered.  Then the residual sum
of squares, S, is

     S(x_0) = \sum_{x_i \le x_0} (y_i - y_0)^2 + \sum_{x_i > x_0} [y_i - y_0 - b(x_i - x_0)]^2      (3)

The problem can be solved easily for fixed x_0, giving a familiar looking
set of normal equations:

-------
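     \begin{pmatrix}
       n                              & \sum_{x_i > x_0} (x_i - x_0)   \\
       \sum_{x_i > x_0} (x_i - x_0)   & \sum_{x_i > x_0} (x_i - x_0)^2
     \end{pmatrix}
     \begin{pmatrix} y_0 \\ b \end{pmatrix}
     =
     \begin{pmatrix}
       \sum_{i=1}^{n} y_i \\
       \sum_{x_i > x_0} (x_i - x_0) y_i
     \end{pmatrix}                                                      (4)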
This could be done for each x_0 while x_0 is stepped in small increments from
x_1 to x_n.  The values of x_0, y_0 and b would be those values (not
necessarily unique) giving the smallest sum of squares.
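
The stepping procedure just described can be sketched as follows. This is a minimal illustration rather than the authors' program: the function names, the step count, and the noise-free test data are all assumptions made for the example. For each trial join point the 2x2 normal equations (4) are solved by Cramer's rule.

```python
def fit_fixed_join(xs, ys, x0):
    """Solve the normal equations (4) for y0 and b at a fixed join point x0."""
    n = len(xs)
    u = [max(x - x0, 0.0) for x in xs]      # (x_i - x0) where x_i > x0, else 0
    su = sum(u)
    suu = sum(v * v for v in u)
    sy = sum(ys)
    suy = sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det == 0.0:                           # no point lies beyond x0: flat fit
        y0, b = sy / n, 0.0
    else:                                    # Cramer's rule on the 2x2 system
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    rss = sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))
    return y0, b, rss


def grid_search(xs, ys, steps=900):
    """Step x0 from min(xs) to max(xs); keep the smallest residual sum of squares."""
    lo, hi = min(xs), max(xs)
    best = None
    for k in range(steps + 1):
        x0 = lo + (hi - lo) * k / steps
        y0, b, rss = fit_fixed_join(xs, ys, x0)
        if best is None or rss < best[3]:
            best = (x0, y0, b, rss)
    return best


# Hypothetical noise-free data: y = 2 for x <= 4.5, y = x - 2.5 for x >= 4.5.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 if x <= 4.5 else x - 2.5 for x in xs]
x0, y0, b, rss = grid_search(xs, ys)
print(x0, y0, b, rss)
```

The answer is only as fine as the step size, which is the motivation for an exact method.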



     The method of Hudson, on the other hand, gives the exact minimum(s) in a
finite number of steps.  x_0 must either lie in an interval between x_j and
x_{j+1}, or at one of the x_j's, and so equations need be derived for these two
cases only.  If x_0 lies in the interval (x_j, x_{j+1}) then the values of x_0,
y_0, and b are given by

     y_0 = \frac{1}{j} \sum_{i=1}^{j} y_i                                                    (5)

     b = \frac{ m \sum_{i=j+1}^{n} x_i y_i - ( \sum_{i=j+1}^{n} x_i )( \sum_{i=j+1}^{n} y_i ) }
              { m \sum_{i=j+1}^{n} x_i^2 - ( \sum_{i=j+1}^{n} x_i )^2 }                     (6)

     x_0 = \frac{ b \sum_{i=j+1}^{n} x_i - \sum_{i=j+1}^{n} y_i + m y_0 }{ m b }             (7)

  where m = n-j.

-------
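     If x_0 lies at x_j, the values of y_0 and b are given by

     y_0 = \frac{ \sum_{i=1}^{n} y_i \sum_{i=j+1}^{n} (x_i - x_j)^2 - \sum_{i=j+1}^{n} (x_i - x_j) \sum_{i=j+1}^{n} (x_i - x_j) y_i }
                { n \sum_{i=j+1}^{n} (x_i - x_j)^2 - ( \sum_{i=j+1}^{n} (x_i - x_j) )^2 }      (8)

     b = ( \sum_{i=1}^{n} y_i - n y_0 ) / \sum_{i=j+1}^{n} (x_i - x_j)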
     The residual sum of squares can be computed for the n-1 intervals and
n x_j's.  The absolute minimum sum of squares can always be determined in at
most 2n-1 steps.
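
The finite search over intervals and observed points can be sketched as below. This is an illustrative reading of the two cases, not Hudson's own code. It assumes sorted, distinct x values, and all function names and the test data are hypothetical. For a join inside an interval, the flat segment is the mean of the first j points and the rising segment is the ordinary least squares line through the remaining m = n-j points; the candidate is kept only if the two meet inside that interval.

```python
def fit_fixed_join(xs, ys, x0):
    """Least squares y0 and b for a join fixed at x0 (normal equations (4))."""
    n = len(xs)
    u = [max(x - x0, 0.0) for x in xs]
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det == 0.0:                                # no point beyond x0: flat fit
        y0, b = sy / n, 0.0
    else:
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    rss = sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))
    return y0, b, rss


def hudson_fit(xs, ys):
    """Exact hockey stick fit; returns (rss, x0, y0, b). xs sorted and distinct."""
    n = len(xs)
    candidates = []
    # Case 1: join at an observed x_j (n candidates).
    for xj in xs:
        y0, b, rss = fit_fixed_join(xs, ys, xj)
        candidates.append((rss, xj, y0, b))
    # Case 2: join strictly inside (x_j, x_{j+1}).  The last interval (m = 1)
    # is skipped here: it gives the same sum of squares as a join at x_{n-1}.
    for j in range(1, n - 1):
        m = n - j
        y0 = sum(ys[:j]) / j                      # mean of the flat segment
        tx, ty = xs[j:], ys[j:]
        sx, sy = sum(tx), sum(ty)
        sxx = sum(x * x for x in tx)
        sxy = sum(x * y for x, y in zip(tx, ty))
        b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
        if b == 0.0:
            continue                              # flat line: no join point
        x0 = (b * sx - sy + m * y0) / (m * b)     # where the line meets y0
        if not (xs[j - 1] < x0 < xs[j]):
            continue                              # join falls outside the interval
        rss = sum((y - y0) ** 2 for y in ys[:j])
        rss += sum((y - y0 - b * (x - x0)) ** 2 for x, y in zip(tx, ty))
        candidates.append((rss, x0, y0, b))
    return min(candidates)


# Hypothetical noise-free data: y = 2 for x <= 4.5, y = x - 2.5 for x >= 4.5.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 if x <= 4.5 else x - 2.5 for x in xs]
rss, x0, y0, b = hudson_fit(xs, ys)
print(x0, y0, b, rss)
```

Unlike stepping x_0 through small increments, the number of candidates depends only on n, not on a chosen step size.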
     It is clear that the problem is symmetric in that the function
     y = y_0,                  x \ge x_0
     y = y_0 + b(x_0 - x),     x \le x_0
 can be fitted by making the transformation
     z = -x.
                      3.   Confidence Intervals for x_0
      Hudson also gives methods for computing confidence intervals.  This
 involves looking at "likelihood regions", which in this case are intervals
 such that
      S(x_0) \le (1 + \delta) S(\hat{x}_0)
 where S(x_0) is the residual sum of squares as a function of the join point
 and \hat{x}_0 is the least squares estimate of the join point.
 The value of \delta can be approximately determined from F tables with 1 and n-3
 degrees of freedom.  As pointed out by Hudson and Quandt, this interval for

-------
x_0 need not be connected, although the function S(x_0) is continuous.  It is
also possible for x_1 and/or x_n to be contained in the interval.  If x_1 is
in the interval, then the "hockey stick" is not a significantly better fit
than a straight line.  If x_n is in the interval then no meaningful relationship
between x and y has been discovered.
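
A likelihood region of this kind can be approximated on a grid, as in the rough sketch below. The function names, the grid size, the data, and the two values of `delta` are all hypothetical. The text says only that the constant is approximately determined from F tables with 1 and n-3 degrees of freedom, so it is left as an argument supplied by the caller. The region is returned as a list of intervals because it need not be connected.

```python
def fit_fixed_join(xs, ys, x0):
    """Least squares y0, b and residual sum of squares for a join fixed at x0."""
    n = len(xs)
    u = [max(x - x0, 0.0) for x in xs]
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det == 0.0:                               # no point beyond x0: flat fit
        y0, b = sy / n, 0.0
    else:
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    rss = sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))
    return y0, b, rss


def likelihood_region(xs, ys, delta, steps=900):
    """Grid intervals of x0 whose S(x0) is within a factor (1 + delta) of the minimum."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    rss = [fit_fixed_join(xs, ys, g)[2] for g in grid]
    cutoff = (1.0 + delta) * min(rss)            # minimum taken over the grid only
    intervals, start, prev = [], None, None
    for g, s in zip(grid, rss):
        if s <= cutoff and start is None:
            start = g                            # entering the region
        elif s > cutoff and start is not None:
            intervals.append((start, prev))      # leaving the region
            start = None
        prev = g
    if start is not None:
        intervals.append((start, grid[-1]))
    return intervals


# Hypothetical noisy data; the delta values are illustrative, not from F tables.
xs = [float(i) for i in range(1, 11)]
ys = [2.1, 1.9, 2.2, 1.8, 2.4, 3.6, 4.4, 5.6, 6.4, 7.6]
narrow = likelihood_region(xs, ys, delta=0.8)
wide = likelihood_region(xs, ys, delta=2.0)
print(narrow)
print(wide)
```

Checking whether the smallest or largest x falls inside a returned interval corresponds to the two diagnostic cases described above.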
                               4.  Examples

      The first example consists of data artificially generated from the
 equation
      y = 2          for x \le 4.5
      y = x - 2.5    for x \ge 4.5
 The variance, \sigma^2, is 0.25.
      The points are
      x      1      2       3      4      5      6      7      8      9      10
      y  1.058  2.18t>  d.WC  1.473  2.419  3.432  5.016  6.045  6.184  7.372
 Figure 1 shows the points and the least squares fit, which is
      y = 1.82"              for x \le 4.158
      y = .973 x - 2.218     for x \ge 4.158
 The confidence interval (\alpha = .05) for x_0 is (3.202, 5.155).   .-'. -  .^.936.
      A graph of the sum of squares, S, as a function of the join point, x_0,
 is given in Figure 2.
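
The rising segment of this fit can be checked from the legible data. The fitted join point lies between x = 4 and x = 5, so under the interval case of Hudson's method the rising segment is the ordinary least squares line through the six points with x = 5, ..., 10. The sketch below assumes that reading; the variable names are illustrative.

```python
# Points of the first example lying beyond the fitted join point (x = 5, ..., 10).
xs = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.419, 3.432, 5.016, 6.045, 6.184, 7.372]

m = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Ordinary least squares slope and intercept for these six points.
slope = (m * sxy - sx * sy) / (m * sxx - sx * sx)
intercept = (sy - slope * sx) / m
print(round(slope, 3), round(intercept, 3))   # 0.973 -2.218
```

The slope and intercept agree with the reported rising segment to three decimal places.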
      The second example is taken from a 3 year study of 112 student
 nurses in Los Angeles (6).  Each day the nurses filled in a health
 questionnaire that included questions on eye discomfort, headache, fever, cough,
 etc.  The dependent variable was the percent of nurses who reported

-------
eye discomfort without experiencing fever.   The independent variable
was the maximum hourly oxidant,  expressed in pphm.
     A "hockey stick" curve was  fitted to the data.   The least squares fit
gave
     y = 5.77                    for x < 14.67
     y = 5.77 + .617*(x - 14.67) for x \ge 14.67                  (7)

The 95 percent confidence limits for x_0 (= 14.67) were (13.25, 16.37).  The
mean square error was 13.59.  The relatively narrow confidence interval
resulted from the large number of points (867) and the consistency of
the data.
                5.   Comparison with  Probit  and  Logit  Curves

       Although  the  use  of  a  "hockey stick"  curve  provides a  convenient
  method  for  hypothesis  testing,  this does  not  mean that  the  curve  is
  a  good  fit  to  dose-response data.   For  this reason  the  "hockey  stick"
  was  compared with  the  probit and  logit  curves for data  which  was  simu-
  lated from  probit  and  logit curves.   The  results of these simulations
  are  given in Tables  1  and 2.  The  measure of  goodness of fit  is the
  standard R^2 used in regression analysis.
       There are several factors which make the comparison of the various
  curves difficult.  The data was simulated by starting with a known probit
  (or logit) curve.  10, 20, 50 or 100 equally spaced points were selected
  from the 0th to 50th percentile points.  At each point, sample sizes of

-------
 10, 20, 50 or 100 were generated using pseudo-random numbers.  The
 parameters of the probit (or logit) curve and the hockey stick were
 estimated.  Thus the generation of the data favors the probit (or logit)
 curve.
      On the other hand, the criterion of fit is the total sum of squares
 (or standard R^2), which favors the hockey stick, since the probit (or
 logit) curves can be thought of as weighted least squares fits.  In addition,
 the hockey stick is a three parameter curve, whereas the probit and
 logit curves both have only two parameters.  In spite of these difficulties,
 Tables 1 and 2 give a crude comparison of the fits of the hockey
 stick with the probit and logit curves.
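
The data generation and the R^2 criterion can be sketched as follows. This is a rough illustration of one simulation cell only, covering the hockey stick side and not the probit or logit fits. The standard normal curve, the lower dose limit of -3 (standing in for the 0th percentile, which is not finite), the seed, the grid size, and all function names are assumptions made for the example.

```python
import math
import random


def probit_curve(x):
    """Standard normal CDF, used here as the 'known' probit dose-response curve."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simulate(K, N, lo=-3.0, hi=0.0, seed=0):
    """K equally spaced doses on [lo, hi]; N simulated subjects at each dose."""
    rng = random.Random(seed)
    xs = [lo + (hi - lo) * k / (K - 1) for k in range(K)]
    ys = [sum(rng.random() < probit_curve(x) for _ in range(N)) / N for x in xs]
    return xs, ys                                # ys are observed response proportions


def rss_fixed_join(xs, ys, x0):
    """Residual sum of squares of the least squares hockey stick with join at x0."""
    n = len(xs)
    u = [max(x - x0, 0.0) for x in xs]
    su, suu = sum(u), sum(v * v for v in u)
    sy, suy = sum(ys), sum(v * y for v, y in zip(u, ys))
    det = n * suu - su * su
    if det == 0.0:                               # no dose beyond x0: flat fit
        y0, b = sy / n, 0.0
    else:
        y0 = (sy * suu - su * suy) / det
        b = (n * suy - su * sy) / det
    return sum((y - y0 - b * v) ** 2 for v, y in zip(u, ys))


def hockey_r2(xs, ys, steps=600):
    """R^2 of the best hockey stick found by stepping x0 across the dose range."""
    lo, hi = min(xs), max(xs)
    rss = min(rss_fixed_join(xs, ys, lo + (hi - lo) * k / steps)
              for k in range(steps + 1))
    mean = sum(ys) / len(ys)
    tss = sum((y - mean) ** 2 for y in ys)
    return 1.0 - rss / tss


xs, ys = simulate(K=20, N=50)                    # 20 doses, 50 subjects per dose
r2 = hockey_r2(xs, ys)
print(round(r2, 3))
```

Repeating this over the K and N combinations, and fitting the probit or logit curve to the same data, would give one cell of a comparison of the kind described.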



     The simulations were done for dose values ranging up to the 50th
percentile (LD50) of the distribution, since air pollution health data
rarely goes beyond the 50th percentile.  Under these restrictions,
there was little or no difference between the fit of the "hockey stick"
vs. either the probit or logit curve.  This was true for 10 to 100
doses (K) with 10 to 100 subjects (N) per dose.  All curves had better
R^2 for increasing K and N, which is to be expected.  Additional
simulations were made using a maximum dose at the 25th percentile and
a maximum dose at the 75th percentile.  These simulations all showed a
similar pattern.

-------
                           6.   Discussion





     Although the fitting of a "hockey stick" function to data is not a
particularly difficult problem, there are at least two items which should
be noted.  First, this simple departure from linear regression gives a sum
of squares function that may not have a first derivative with respect to x_0
at several points, a fact amply demonstrated in Figure 2 from our earlier
example.  This alone is enough to create problems for many non-linear
least-squares fitting computer programs.  Secondly, the implication of the
comparison to the probit and logit curves is of interest.  Currently there is
a great deal of discussion about the extrapolation of dose-response curves to
very low doses.  If 100 observations at each of 100 points cannot distinguish
between a "hockey stick" and a probit or logit curve, it is clear that the
resolution of this problem strictly through large scale sampling is not
feasible.



     In summary, the "hockey stick" curve provides a convenient method for
estimation and hypothesis testing in low-dose and/or high-dose regions of
dose-response curves; the estimation procedures are simple and straightforward;
and its fit to the data appears to be indistinguishable from that of the
standard logit or probit curves.  The use of the "hockey stick" function is
definitely not a major breakthrough in curve fitting.  It does, however, offer
a means of testing for a threshold level that is not available using standard
dose-response curves.

-------
                               REFERENCES


[1]  Stokinger, H. E.:  Concepts of Thresholds in Standards Setting.  Arch.
     Environ. Health 25:153-157, 1972.

[2]  Dinman, B. D.:  "Non-concept" of "no-threshold":  Chemicals in the
     Environment.  Science 175:495-497, 1972.

[3]  Waldron, H. A.:  The Blood Lead Threshold.  Arch. Environ. Health
     29:271-273, 1974.

[4]  Quandt, R. E.:  "The Estimation of the Parameters of a Linear Regression
     System Obeying Two Separate Regimes", Journal of the American Statistical
     Association, Vol. 53 (1958), pp. 873-880.

[5]  Hudson, D. J.:  "Fitting Segmented Curves Whose Join Points Have to be
     Estimated", Journal of the American Statistical Association, Vol. 61
     (1966), pp. 1097-1129.

[6]  Hammer, D. I., et al.:  "The Los Angeles Student Nurse Study I.
     Relationship of Daily Symptom Reporting and Photochemical Oxidants",
     submitted to Archives of Environmental Health.

-------
                         Figure 1.  Fitted hockey stick curve to artificial data.
                         Figure 2.  Residual sum of squares as a function of the break
                         point, x_0.

-------
[Tables 1 and 2.  R^2 for the "hockey stick" fit compared with the probit and
logit fits to simulated data, by number of doses (K) and number of subjects
per dose (N).  The tabulated values are not legible in the source.]
-------