United States Environmental Protection Agency
Office of Water (4503F)
EPA-841-B-97-007
June 1997
Linear Regression for Nonpoint Source Pollution Analyses
   INTRODUCTION
   The purpose of this fact sheet is to demonstrate an
   approach for describing the relationship between variables
   using regression.  The fact sheet is targeted toward
   persons in state water quality monitoring agencies who are
   responsible for nonpoint source assessments and
   implementation of watershed management.

   Regression can be used to model or predict the behavior
   of one or more variables. The general regression model,
   where ε is an error term, is given as

      y = β0 + β1x1 + β2x2 + ... + βnxn + ε                  (1)

   In this equation, the behavior of a single dependent
   variable (y) is modeled with one or more independent
   variables (x1, ..., xn). The x's may be linear or nonlinear
   (e.g., x_j can represent x², x³, 1/x, etc.). β0, ..., βn are
   numerical constants that are computed using equations
   described later. Nonlinear models are commonly applied
   to physical systems, but they are somewhat more difficult
   to analyze because iterative techniques are involved when
   the model cannot be transformed to a linear model.  The
   use of two or more independent variables (x) in a linear
   function to describe the behavior of y is referred to as
   multiple linear regression.  In either case, regression
   techniques attempt to explain as much of the variation in
   the dependent variable as possible.
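   As a rough illustration of the general model in Equation 1, the short Python
   sketch below fits a regression with two independent variables by ordinary
   least squares. The variable names and data values are hypothetical and are
   not taken from this fact sheet; the sketch only shows the mechanics.

       # Illustrative sketch of the general linear model (Equation 1),
       # fitted by ordinary least squares with NumPy. All data are hypothetical.
       import numpy as np

       x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # e.g., rainfall (hypothetical)
       x2 = np.array([0.2, 0.4, 0.1, 0.5, 0.3])    # e.g., fraction of land treated (hypothetical)
       y  = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # e.g., pollutant load (hypothetical)

       # Design matrix with a column of ones for the intercept (beta_0).
       X = np.column_stack([np.ones_like(x1), x1, x2])

       # Least-squares estimates of beta_0, beta_1, beta_2.
       betas, *_ = np.linalg.lstsq(X, y, rcond=None)
       print(betas)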

   In nonpoint source analyses, linear regression is often
   used to determine the extent to which the value of a water
   quality variable (y) is influenced by land use or hydrologic
   factors (x) such as crop type, soil type, percentage of land
   treatment, rainfall, or stream flow, or by another water
   quality variable. Practical applications of these regression
   results include the ability to predict the water quality
   impacts due to changes in the independent variables.

   SIMPLE LINEAR REGRESSION
   The simplest form of regression is to consider one
   dependent and one independent variable using

      y = β0 + β1x + ε                                       (2)

   where y is the dependent variable, x is the independent
   variable, and β0 and β1 are numerical constants
   representing the y-intercept and slope, respectively.
                              Helsel and Hirsch (1995) summarize the key assumptions
                              regarding application of linear regression (Table 1). The
                              uses of a regression analysis should not be extended
                              beyond those supported by the assumptions met. Note
                              that the normality assumption (assumption 5) can be
                              relaxed when testing hypotheses and estimating confidence
                              intervals if the sample size is relatively large.

                              The first step in applying linear regression (assumption 1
                              in Table 1) is to examine the data to see if linear
                              regression makes sense—that is, to use a bivariate scatter
                              plot to see if the points approximate a straight line. If
                              they fall in a straight line, linear regression makes sense;
                              if they do not, a data transformation might be needed, or
                              perhaps a nonlinear relationship should be used.

                              To illustrate the use of linear regression, the fraction of
                              water (split) collected by a water and sediment sampler
                              designed for plot-scale runoff studies is used (Dressing et al.,
                              1987). In this data set the sampling percentage (split) was
                              measured for a range of flow rates. The scatter plot
                              (Figure 1) shows that linear regression can be applied.

                              Presuming that the data are representative (assumption 2
                              in Table 1), the next step is to develop the regression line
                              using the method of least squares (Freund, 1973). To
                              determine the values of β0 and β1 in Equation 2, the
                              following equations can be used (Helsel and Hirsch,
                              1995):

                                 β1 = S_xy / SS_x = (Σ x_i·y_i - n·x̄·ȳ) / (Σ x_i² - n·x̄²)      (3)

                                 β0 = ȳ - β1·x̄                                                 (4)
                              where n, x̄, and ȳ are the number of observations, the
                              mean of the independent variable (e.g., flow rate), and the
                              mean of the dependent variable (e.g., split), respectively.
                              S_xy is the sum of the xy cross products and SS_x is the sum
                              of the squares of x.
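
The computation in Equations 3 and 4 is easily scripted. The following minimal
Python sketch (using the NumPy library) implements the two equations; the
function name and the short data arrays are placeholders for illustration and
are not part of the original analysis.

    # Minimal sketch of Equations 3 and 4 (ordinary least squares for one x).
    # The data arrays are placeholders, not the split data from Table 2.
    import numpy as np

    def fit_simple_regression(x, y):
        """Return (beta0, beta1) for y = beta0 + beta1 * x."""
        n = len(x)
        x_bar, y_bar = np.mean(x), np.mean(y)
        s_xy = np.sum(x * y) - n * x_bar * y_bar      # sum of xy cross products
        ss_x = np.sum(x ** 2) - n * x_bar ** 2        # sum of squares of x
        beta1 = s_xy / ss_x                           # Equation 3
        beta0 = y_bar - beta1 * x_bar                 # Equation 4
        return beta0, beta1

    # Example with placeholder flow-rate and split values:
    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    y = np.array([3.1, 2.9, 2.8, 2.6, 2.5])
    beta0, beta1 = fit_simple_regression(x, y)
    print(beta0, beta1)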

Table 1. Assumptions necessary for the purposes of linear regression.

                                                     Purpose
                                                 Predict y and     Obtain best        Test hypotheses,
                                     Predict y   a variance for    linear unbiased    estimate confidence or
  Assumption                         given x     the prediction    estimator of y     prediction intervals
  (1) The model form is correct:        ✓              ✓                 ✓                    ✓
      y is linearly related to x
  (2) The data used to fit the          ✓              ✓                 ✓                    ✓
      model are representative
      of data of interest
  (3) The variance of the                              ✓                 ✓                    ✓
      residuals is constant and
      does not depend on x or
      anything else
  (4) The residuals are                                                  ✓                    ✓
      independent
  (5) The residuals are normally                                                              ✓
      distributed

  ✓ Indicates that assumption is required.

              Reprinted from Helsel and Hirsch, Statistical Methods in Water Resources, 1995, page 225, with kind permission from Elsevier Science - NL, Sara
              Burgerhartstraat 25, 1055 KV Amsterdam, The Netherlands.
For the data in the first two columns of Table 2 (the same data displayed in
Figure 1), Equations 3 and 4 were used to compute a slope (β1) of -0.0119 and
an intercept (β0) of 3.1317 from the intermediate quantities x̄ (28.89),
ȳ (2.79), S_xy, and SS_x. Thus, the linear model for predicting split versus
flow rate is

   ŷ = 3.1317 - 0.0119x                                          (5)
   ASSUMPTION EVALUATION

   The analyst must make sure that β0 and β1 make sense. In
   this case, perhaps the best approach is to plot the
   regression line with the raw data, as shown in Figure 1.
   The third column in Table 2 contains the predicted split, ŷ_i,
   computed using Equation 5 for each flow rate.
Table 2. Observed split, predicted split, and residuals for each flow rate.

   Flow Rate, x (gpm)    Split, y (%)    Predicted, ŷ    Residual, e
         19.2                3.12            2.9028          0.2172
          4.9                2.86            3.0733         -0.2133
         44.4                2.70            2.6024          0.0976
         25.8                2.83            2.8241          0.0059
         37.6                2.60            2.6835         -0.0835
         40.1                2.58            2.6536         -0.0736
         47.4                2.49            2.5666         -0.0766
         35.7                2.60            2.7061         -0.1061
         13.9                3.19            2.9660          0.2240
            Figure 1. Split versus flow rate.

 The predicted split, ŷ_i, is plotted as the regression line in
 Figure 1. By visual inspection, β0 and β1 seem reasonable.
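
 Applying Equation 5 is a one-line calculation in a spreadsheet or script. The
 Python sketch below recomputes the predicted-split and residual columns of
 Table 2 from the published coefficients; values may differ from Table 2 in the
 last digit because the published slope and intercept are rounded.

     # Sketch of Equation 5 applied to the Table 2 flow rates.
     import numpy as np

     flow  = np.array([19.2, 4.9, 44.4, 25.8, 37.6, 40.1, 47.4, 35.7, 13.9])
     split = np.array([3.12, 2.86, 2.70, 2.83, 2.60, 2.58, 2.49, 2.60, 3.19])

     predicted = 3.1317 - 0.0119 * flow     # Equation 5
     residual  = split - predicted          # observed minus predicted
     print(np.round(predicted, 4))
     print(np.round(residual, 4))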

 Residuals plotted as a function of predicted values of y,
 residuals plotted as a function of time, and normal
 probability plots of residuals are the most effective
 approaches to evaluate the last three assumptions listed in
 Table 1, respectively. The fourth column of Table 2
 presents the residuals, e_i, which are computed as the
 observed split minus the predicted split (y_i - ŷ_i).

 The plot of residuals should appear to be a uniform band
 of points around 0, as shown in Case A of Figure 2
 (Ponce, 1980).  In Figure 2, residuals are plotted as a
 function of predicted values of y.  The analyst should look
 for two types of patterns when evaluating assumption 3
 from Table 1 (i.e., constant variance). The first is a
 pattern of increasing or decreasing variance with predicted
 values of y, as depicted in Case B of Figure 2. The
 second is a pattern (e.g., a trend, a curved line) in the
 residuals with predicted values of y. Both characteristics
 are usually assessed based on a review of the residual
 plots and professional judgment alone. The analyst may
 also need to examine other variables besides predicted
 values of y to fully evaluate assumption 3.

 Independence of residuals (assumption 4 from Table 1)
 can be evaluated by examining residuals plotted as a
 function of time. The analyst should look for the same
 patterns as before. As an alternative for evaluating
 independence, the analyst can also plot the ith residual, e_i,
 as a function of the (i-1)th residual, e_(i-1).  One word of
 caution is in order when reviewing any residual plot: If
 there are more points in a certain section of the residual
 plot, the residuals might not appear to be a uniform band
 of points around 0 (as suggested in Case A of Figure 2);
 instead, that section might have a somewhat wider band
 (Helsel and Hirsch, 1995).  This is an expected result.

 The normality of residuals can be assessed by examining a
 probability plot. Two problems with nonnormal residuals
 are the loss of power in subsequent hypothesis tests and
 increased prediction intervals that convey a false
 impression of symmetry (Helsel and Hirsch, 1995).
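
 The three residual plots described above can be produced with most statistical
 packages. The following Python sketch (using NumPy, Matplotlib, and SciPy) is
 one possible way to generate them; the short arrays are simply the first five
 rows of Table 2, included only to make the example self-contained.

     # Sketch of the three diagnostic plots for assumptions 3, 4, and 5.
     import numpy as np
     import matplotlib.pyplot as plt
     from scipy import stats

     y     = np.array([3.12, 2.86, 2.70, 2.83, 2.60])         # observed split
     y_hat = np.array([2.9028, 3.0733, 2.6024, 2.8241, 2.6835])  # predicted split
     order = np.arange(len(y))                                 # time or sample order
     resid = y - y_hat                                         # residuals, e_i

     fig, axes = plt.subplots(1, 3, figsize=(12, 4))

     axes[0].scatter(y_hat, resid)        # residuals vs. predicted values (assumption 3)
     axes[0].axhline(0.0, linestyle="--")
     axes[0].set_xlabel("Predicted split"); axes[0].set_ylabel("Residual")

     axes[1].scatter(order, resid)        # residuals vs. time/order (assumption 4)
     axes[1].axhline(0.0, linestyle="--")
     axes[1].set_xlabel("Time or sample order")

     stats.probplot(resid, dist="norm", plot=axes[2])  # probability plot (assumption 5)

     plt.tight_layout()
     plt.show()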

 Figure 3 displays all three of these plots for the split data
 analyzed from Table 2. From Figure 3, A and B, the
 split residuals appear to be independent of predicted
 values of y and time, as well as having constant variance.
 Thus, the regression meets assumptions 3 and 4 listed in
 Table 1. In this analysis, testing for residual
 independence is important since the testing apparatus was
 calibrated initially.
Figure 2.  Plot of residuals versus predicted values.
(Source: Ponce, 1980)
Figure 3. Plot of split residuals.
A) Split residuals as a function of predicted values of split.
B) Split residuals as a function of time.
C) Probability plot of split residuals.

 The pumps or other equipment could have differed in performance over time,
 which in turn could affect the results. In Figure 3C, the probability plot
 suggests that the data might not rigorously follow the normality assumption.
 However, upon inspection, any normality violations appear to be relatively
 minor. The residuals in Figure 3C would fall along the straight line if they
 were normally distributed. The probability plot correlation coefficient could
 also be computed to evaluate normality.

 Had this analysis violated any of these assumptions, using a different
 regression technique, transforming the data, or adding variables to the
 regression would have to be considered. Alternatively, the uses of the
 regression results could be limited to those identified in Table 1 as
 supported by the assumptions met.

 The remaining steps in a typical regression analysis are to:

    *  Evaluate the proportion of variation in y explained by the model.
    *  Determine whether β0 is zero.
    *  Determine whether β1 is zero.
    *  Compute the confidence interval for β0.
    *  Compute the confidence interval for β1.

 As one might imagine, many of these evaluations have been incorporated into
 standard spreadsheet and statistical software packages. Tables 3 and 4 present
 a typical format that these programs use to present the results from a
 regression analysis. The top portion of Table 3 also presents the equations
 used in computing the analysis of variance (ANOVA) summary. Note that S_xy and
 SS_x, the sum of the xy cross products and the sum of the squares of x, are
 defined in Equation 3.

 The coefficient of determination, R², can be used to evaluate what proportion
 of the variation can be explained by the model (Gaugush, 1986), and therefore
 how well the regression line fits the data. R² can be computed as

    R² = [SS_y - s²(n - 2)] / SS_y = 1 - SSE/SS_y                (6)

 where

    SSE = Σ e_i²  (summed over i = 1 to n)                       (7)

 and

    SS_y = Σ (y_i - ȳ)²  (summed over i = 1 to n)                (8)

 The residual, e_i, is defined as y_i - ŷ_i. Values for R² range from 0 to 1,
 with 1 representing the case where all observed y values are on the regression
 line. The correlation coefficient, r, measures the strength of linear
 relationships (Freund, 1973) and is computed as the square root of R². The
 sign of r should be the same as the sign of the slope, β1. Values of r range
 from -1 to 1, with the extreme values representing the strongest association
 and 0 representing no correlation.

 Table 4. Regression coefficient results.

                     Coefficients    Standard Error
    Intercept (β0)      3.1317          0.072914
    Flow Rate (β1)     -0.0119          0.002237

 Using the split data from above, the sum of residuals squared (SSE) is equal
 to 0.2227 and the sum of the
squares y (SS_y) is 0.7093; thus, R² is equal to
1 - (0.2227/0.7093) = 0.686, or 68.6 percent of the variance
is explained by the model. The correlation coefficient, r,
is then equal to -0.828. The overall model can also be
evaluated with the F statistic (28.41), which is computed
in Table 3.  The F statistic is a measure of the variability
in the data set that is explained by the regression equation
in comparison to the variability that is not explained by
the regression equation.  Since the p value of 0.0001366 is
less than 0.05, the overall model  is significant at the 95
percent confidence level.
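
The quantities in Equations 6 through 8, together with the overall F test, can
be computed directly from the observed values and residuals. The Python sketch
below is a minimal illustration; the data arrays and the slope value are
placeholders, not the full split data set.

    # Sketch of Equations 6-8, the correlation coefficient r, and the overall
    # F test for a simple linear regression. All data shown are placeholders.
    import numpy as np
    from scipy import stats

    def goodness_of_fit(y, resid, slope):
        n = len(y)
        sse  = np.sum(resid ** 2)                 # Equation 7
        ss_y = np.sum((y - np.mean(y)) ** 2)      # Equation 8
        r2   = 1.0 - sse / ss_y                   # Equation 6
        r    = np.sign(slope) * np.sqrt(r2)       # r carries the sign of the slope
        # Overall F test for a model with one independent variable:
        f_stat = (ss_y - sse) / (sse / (n - 2))
        p_val  = stats.f.sf(f_stat, 1, n - 2)
        return r2, r, f_stat, p_val

    # Placeholder example values:
    y     = np.array([2.1, 2.5, 2.9, 3.4, 3.8])
    resid = np.array([0.10, -0.20, 0.05, 0.15, -0.10])
    print(goodness_of_fit(y, resid, slope=0.4))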

Are β0 and β1 significantly different from zero? The
standard errors for β0 and β1 in Table 4 can be calculated
as (Helsel and Hirsch, 1995)

   SE(β0) = s·√( Σ x_i² / (n·SS_x) )                             (9)

   SE(β1) = s / √SS_x                                            (10)

where

   s = √( Σ e_i² / (n - 2) )                                     (11)

The value s is equal to the standard error of the regression
(which is the same as the standard deviation of the
residuals). The corresponding t statistics (with n - 2
degrees of freedom) for β0 and β1 are then equal to β0 and
β1 divided by their respective standard errors. The t
statistics may then be compared to values from the t
distribution to determine whether β0 or β1 are significantly
different from zero. In this case, β0 and β1 are both
significantly different from zero based on inspection of
their associated p values in Table 4.
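
Equations 9 through 11 and the associated t tests can be scripted as follows.
This Python sketch uses placeholder values for the data and coefficients; it
illustrates the calculations and is not a reproduction of Table 4.

    # Sketch of Equations 9-11 and the t tests for beta_0 and beta_1.
    import numpy as np
    from scipy import stats

    x     = np.array([10.0, 20.0, 30.0, 40.0, 50.0])    # placeholder flow rates
    resid = np.array([0.05, -0.08, 0.02, 0.06, -0.05])  # placeholder residuals
    beta0, beta1 = 3.1, -0.012                           # placeholder coefficients

    n     = len(x)
    ss_x  = np.sum((x - np.mean(x)) ** 2)
    s     = np.sqrt(np.sum(resid ** 2) / (n - 2))        # Equation 11
    se_b0 = s * np.sqrt(np.sum(x ** 2) / (n * ss_x))     # Equation 9
    se_b1 = s / np.sqrt(ss_x)                            # Equation 10

    t_b0, t_b1 = beta0 / se_b0, beta1 / se_b1            # t statistics (n - 2 df)
    p_b0 = 2.0 * stats.t.sf(abs(t_b0), n - 2)            # two-sided p values
    p_b1 = 2.0 * stats.t.sf(abs(t_b1), n - 2)
    print(se_b0, se_b1, t_b0, t_b1, p_b0, p_b1)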

The confidence intervals for β0 and β1 can be computed
using the following formulas (Helsel and Hirsch, 1995):

   β0 ± t(1-α/2, n-2) · SE(β0)                                   (12)

   β1 ± t(1-α/2, n-2) · SE(β1)                                   (13)

where t(1-α/2, n-2) is the value of the t distribution with
n - 2 degrees of freedom that leaves 100(1-α/2) percent of the
distribution below it (Table 5).

Table 5. Percentiles of the t_df distribution (values of t
such that 100(1-α)% of the distribution is less than t).
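
A short Python sketch of Equations 12 and 13 is given below; the coefficient
values, standard errors, sample size, and confidence level are placeholders.

    # Sketch of Equations 12 and 13: confidence intervals for the coefficients.
    from scipy import stats

    beta0, se_b0 = 3.1, 0.07      # placeholder intercept and its standard error
    beta1, se_b1 = -0.012, 0.002  # placeholder slope and its standard error
    n, alpha = 20, 0.05           # placeholder sample size; 95 percent level

    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)   # from the t distribution (Table 5)
    ci_b0 = (beta0 - t_crit * se_b0, beta0 + t_crit * se_b0)   # Equation 12
    ci_b1 = (beta1 - t_crit * se_b1, beta1 + t_crit * se_b1)   # Equation 13
    print(ci_b0, ci_b1)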
 For a given value x0 of the independent variable, the
 confidence interval for the mean response ŷ is (Helsel and
 Hirsch, 1995)

    ŷ ± t(1-α/2, n-2) · s · √( 1/n + (x0 - x̄)²/SS_x )            (14)
 In this example, ŷ is equal to the predicted split using
 Equation 5 and the flow rate equal to x0. SS_x and s can be
 estimated by using Equations 3 and 11, respectively. This
 interval is most narrow at x̄ and widens as x0 moves
 farther from x̄. By calculating the interval at each point
 along the regression line, a curve like the dashed line in
 Figure 4 can be plotted for the example data. The
 equation for the prediction interval for individual values of
 y at x = x0 is (Helsel and Hirsch, 1995)

    ŷ ± t(1-α/2, n-2) · s · √( 1 + 1/n + (x0 - x̄)²/SS_x )        (15)
            Figure 4 also shows this interval for the example data.
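
The curves shown in Figure 4 can be generated by evaluating Equations 14 and
15 over a range of flow rates. The following Python sketch is one way to do
so; all of the fitted quantities are placeholder values.

    # Sketch of Equations 14 and 15: confidence limits for the mean response
    # and prediction limits for individual values along the regression line.
    import numpy as np
    from scipy import stats

    beta0, beta1 = 3.13, -0.012                  # placeholder fitted coefficients
    x_bar, ss_x, s, n = 30.0, 1500.0, 0.13, 20   # placeholder summary statistics
    alpha = 0.05

    x0 = np.linspace(0.0, 60.0, 61)              # flow rates at which to evaluate
    y_hat = beta0 + beta1 * x0
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, n - 2)

    half_mean  = t_crit * s * np.sqrt(1.0 / n + (x0 - x_bar) ** 2 / ss_x)        # Eq. 14
    half_indiv = t_crit * s * np.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / ss_x)  # Eq. 15

    lower_mean, upper_mean = y_hat - half_mean, y_hat + half_mean
    lower_ind,  upper_ind  = y_hat - half_indiv, y_hat + half_indiv
    print(lower_mean[0], upper_mean[0], lower_ind[0], upper_ind[0])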
 One of the simplest (in theory) nonpoint source control
 applications of linear regression is the regression of a
 water quality indicator against an implementation
 indicator. For example, flow-adjusted total suspended
 solids (TSS) concentration could be regressed against the
 percentage of all cropland, for which delivery to the
 stream is likely to be 50 percent or greater, that is under
 sediment control practices. A significant negative slope
 would suggest (but not prove) that water quality improved
 because of the implementation of sediment control practices.

 Another use of simple linear regression is to model a
 water quality parameter versus time. In this application, a
 significant slope would indicate change over time, and the
 sign of the slope would indicate improving or degrading
 conditions, depending on the parameter. However, a regression
 of water quality versus time will most likely be confounded by the
 variability in precipitation and flows. Thus, considerable
 data manipulation (transformation, stratification, etc.)
 might be required before regression analysis can be
 successfully applied. In these cases, it might be more
 appropriate to apply one of the alternatives to regression
 described by Helsel and Hirsch (1995).

 In many cases water quality parameters are regressed
 against flow. This approach is particularly relevant in
 nonpoint source studies. In analysis of covariance,
 regressions against flow are often performed prior to an
 ANOVA. One of the implicit goals of nonpoint source
 control is to change the relationship between flow and
 pollutant concentration or load. In paired watershed
 studies, measured parameters from paired samples are
 often regressed against each other to compare the
 watersheds (USEPA, 1993). These regression lines can
 be compared over time to test for the impact of nonpoint
 source control efforts (Spooner et al., 1985). The reader
 is referred to Paired Watershed Study Design (USEPA,
 1993), which demonstrates this technique.
 NONLINEAR REGRESSION AND TRANSFORMATIONS

 Nonlinear regression (as discussed here) involves
 transformation to linear equations, followed by simple
 linear regression. Helsel and Hirsch (1995) provide a
 detailed discussion on transformations using the "bulging
 rule" described by Mosteller and Tukey (1977), which can
 be used to select appropriate transformations. Crawford et
 al. (1983) list the numerous regression models most often
 applied by the U.S. Geological Survey for flow-adjusting
 concentrations. The selection of which transformation to use
 is ultimately based on an inspection of the residuals and
 whether the assumptions described earlier are met.
 Typical transformations include x², x³, ln x, 1/x, x^0.5, etc.
 Figure 4. Plot of split versus flow rate with confidence limits for mean
 response and individual estimates of split.
 When the residuals do not exhibit constant variance
 (heteroscedasticity), one of several common
 transformations should be used. Logarithmic
 transformations are used when the standard deviation in
 the original scale is proportional to the mean of y.
 Square root transformations are used when the variance is
 proportional to the mean of y. In many instances, the right
 transformation will "fix" the nonlinear and heteroscedastic
 problem. With data that are percentages or proportions
 (between the values of 0 and 1), the variances at 0 and 1
 are small. The arcsin of the square root of the individual
 values is a common transformation that helps spread out
 the values near 0 and 1 to increase their variance
 (Snedecor and Cochran, 1980).
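
 A minimal Python illustration of the arcsine square-root transformation is
 shown below; the proportion values are placeholders.

     # Sketch of the arcsine square-root transformation for proportions.
     import numpy as np

     p = np.array([0.02, 0.10, 0.45, 0.80, 0.97])   # placeholder proportions
     p_transformed = np.arcsin(np.sqrt(p))          # spreads out values near 0 and 1
     print(p_transformed)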

 There are several disadvantages when applying
 transformations to regression applications. The most
 important issue is that the regression line and confidence
 intervals are symmetric in the transformed form of the
 variables.  When these lines are transformed back to their
 normal units, the lines will no longer be symmetrical.
 The most notable time in hydrology when this creates a
 problem is when estimating mass loading.  To estimate the
 mass, the means for  short time  periods are regressed and
 summed to estimate the total mass over a longer period.
 This approach is acceptable if no transformations are
 used—the analyst is summing the means.  However, if a
 log transformation was used, summing the mass over the
 back-transformed values results in summing the medians,
 which will result in an estimate that is biased low for the
 total mass (Helsel and Hirsch, 1995).

 As an example of nonlinear regression, consider a
 common relationship that is used to describe load (L) as a
 function of discharge (Q):

    L = a·Q^b                                                    (16)

 Taking the logarithms of both sides yields

    ln(L) = ln(a) + b·ln(Q)                                      (17)
which has the same form as Equation 2, introduced at the
beginning of this document, where ln(L) corresponds to y,
ln(a) corresponds to β0, b corresponds to β1, and ln(Q)
corresponds to x. By taking the logarithms of both sides,
the nonlinear problem has been reduced to a simple linear
model. The only additional step that the analyst must
perform is to convert L and Q to ln(L) and ln(Q) before
using standard software. The analyst should be aware that
all of the confidence limits are in transformed units; when
they are plotted in normal units, the confidence intervals
will not be symmetric.
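
The log-transformation approach of Equations 16 and 17 can be carried out with
a few lines of code. The Python sketch below uses placeholder load and
discharge values and reuses the least-squares formulas of Equations 3 and 4 on
the transformed data.

    # Sketch of fitting Equation 16 by way of Equation 17: log-transform the
    # load and discharge, fit a straight line, and back-transform the intercept.
    import numpy as np

    Q = np.array([5.0, 12.0, 30.0, 55.0, 120.0])   # placeholder discharge
    L = np.array([2.0, 6.5, 20.0, 45.0, 110.0])    # placeholder load

    x, y = np.log(Q), np.log(L)                    # ln(Q) and ln(L)

    # Least squares on the transformed variables (Equations 3 and 4):
    n = len(x)
    s_xy = np.sum(x * y) - n * np.mean(x) * np.mean(y)
    ss_x = np.sum(x ** 2) - n * np.mean(x) ** 2
    b = s_xy / ss_x                                # exponent b in Equation 16
    ln_a = np.mean(y) - b * np.mean(x)             # intercept ln(a) in Equation 17
    a = np.exp(ln_a)                               # back-transformed coefficient a
    print(a, b)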

 Figure 5 demonstrates how transforming the data may
 improve the regression analysis. In Figure 5A, sulfate
 concentrations (in milligrams per liter) are plotted as a
 function of stream flow (in cubic feet per second). The
 apparent downward trend is typical of a stream dilution
 effect; however, the trend is clearly nonlinear. The trend
 line plotted in this figure, as well as the residuals plotted
 in Figure 5C, demonstrate that a linear model would tend
 to over- and underestimate sulfate concentrations
depending on the flow. Figure 5B displays the same data
 after computing the logarithms (base 10) of the sulfate and
 flow data. A trend line fitted to these data and the
 residual plot (Figure 5D) clearly demonstrate that
 applying linear regression after log-transformation would
 be appropriate for these data.

 CONCLUSION

 When properly used, regression analysis can be an
 important tool for evaluating nonpoint source data.
 However, the analyst should take care that the
 application of regression does not extend beyond the uses
 supported in Table 1 by the assumptions that are met. In
 some instances it might be necessary
 to select distribution-free approaches that tend to  be more
 robust.  The reader is referred to Statistical Methods in
 Water Resources (Helsel and Hirsch, 1995)  for a  more
 complete discussion regarding distribution-free
 approaches.

 REFERENCES

 Crawford, C.G., J.R. Slack, and R.M. Hirsch. 1983.
 Nonparametric tests for trends in water-quality data using
 the statistical analysis system. USGS Open File Report
 83-550. U.S. Geological Survey, Reston, Virginia.

 Dressing, S., J.  Spooner, J.M. Kreglow, E.O. Beasley,
 and P.W. Westerman.  1987. Water and sediment
 sampler for plot and field studies. J. Environ. Qual.
 16(1):59-64.

 Freund, J.E. 1973. Modern elementary statistics.
 Prentice-Hall, Englewood Cliffs, New Jersey.

 Gaugush, R.F., ed. 1986. Statistical methods for  reservoir
 water quality investigations. Instruction Report E-86-2.
 U.S. Army Engineer Waterways Experiment Station,
 Vicksburg, Mississippi.

 Helsel, D.R., and R.M. Hirsch. 1995. Statistical  methods
 in water resources. Elsevier, Amsterdam.

 Mosteller, F., and J.W. Tukey. 1977. Data  analysis and
 regression. Addison-Wesley Publishers, Menlo Park,
 California.

 Ponce, S.L. 1980. Statistical methods commonly used in
water quality data. WSDG Technical Paper WSDG-TP-
00001. U.S. Department of Agriculture, Forest Service.

Remington, R.D., and M.A. Schork. 1970. Statistics
with applications to the biological and health sciences.
Prentice-Hall, Englewood Cliffs, New Jersey.

Figure 5. Comparison of regression analyses using raw and log-transformed data.

Snedecor, G.W., and W.G. Cochran. 1980. Statistical methods. Iowa State
University Press, Ames, Iowa.

Spooner, J., R.P. Maas, S.A. Dressing, M.D. Smolen, and F.J. Humenik. 1985.
Appropriate designs for documenting water quality improvements from
agricultural nonpoint source control programs. In Perspectives on nonpoint
source pollution: Proceedings of a national conference, Kansas City, Missouri.
U.S. Environmental Protection Agency, Washington, DC.

USEPA. 1993. Paired watershed study design. U.S. Environmental Protection
Agency, Office of Water, Washington, DC.