Categorical Regression (CatReg)
User Guide
Developed for: United States Environmental Protection Agency
December 2, 2015
-------
Lockheed Martin Information Systems & Global Solutions - Civil
Categorical Regression (CatReg) User Guide (Draft)
TABLE OF CONTENTS
1.0 INTRODUCTION 5
1.1 About CatReg 5
1.2 CatReg's Two Models 6
1.3 CatReg and Toxicology Data 6
2.0 INSTALLING AND USING CATREG 8
2.1 Step 1: Install R 8
2.2 Step 2: Install CatReg 8
2.3 Overview of a CatReg Session 8
3.0 DATASET AND VARIABLES TAB 9
3.1 Mapping Variables 11
3.2 Filtering Data Values by Variable 11
3.3 Before Running an Analysis 11
3.4 Data Requirements and Error Messages 11
3.5 Data Types 13
3.6 The Data Input File 13
3.7 Input File: chemx.csv 15
3.8 Input File: chemy.csv 16
3.9 Input File: chemz.csv 19
4.0 MODEL AND BMD TAB 22
4.1 Setting BMD Specifications for an Analysis 23
4.2 Setting Model Specifications for an Analysis 23
4.3 Running an Analysis 24
4.4 About the CatReg Output File 24
4.5 Changing the Output File's Name and Location 24
4.6 Stratifying 24
4.7 Clustering 25
4.8 Link Function 26
4.9 Model Form 27
4.9.1 Understanding the Model Equations 27
4.10 Zero Background Response 28
4.11 Worst Case Analysis 31
5.0 PLOTS TAB 32
5.1 Copying and Printing Plots and Output Files 33
5.2 Plot Functions and Options 33
5.3 Concentrations and Durations for Designated ERC and Severity (catplot) 34
5.4 Concentrations and Durations for Designated Probability and Severity of Strata (stratplot) 35
5.5 Extra Risk Concentration at Desired Duration (confplot) 36
5.6 Probability Versus One Explanatory Variable (prplot) 37
Doc. No.: N/A Page 2 of 65 Effective Date: December 2, 2015
5.7 Data Plotted by Stratum (dataplot) 38
5.8 Contribution to Deviance for Individual Datum (devplot) 39
6.0 HYPOTHESES TAB 42
6.1 Testing Parameters 43
7.0 ASSESSING MODEL FIT 45
8.0 WORKING WITH THE MENUS, TOOL BARS, & STATUS BARS 46
8.1 File Menu 46
8.2 Help Menu 46
8.3 Data Grid menus 46
8.4 Text Window (Results) Menu 47
8.5 Status Bar 47
9.0 WORKING WITH THE DATA GRID AND DATASETS 48
9.1 Opening Existing Dataset Files 48
9.2 Creating a New Dataset 49
9.3 Entering or Importing Data 49
9.3.1 Entering and editing data 49
9.3.2 Importing data 49
9.4 Copying, Cutting, and Pasting Data 50
9.4.1 Selecting multiple sequential cells 50
9.4.2 Copying and pasting multiple cells of data 50
9.5 Renaming Columns 50
9.6 Adding and Deleting Data Grid Columns and Rows 50
9.7 Sorting data 51
9.8 Exporting Data 51
9.9 Saving a Dataset 51
9.9.1 Saving changes to the current dataset 51
9.9.2 Saving a dataset to a different name and directory location 51
9.10 The Proceed Button 52
9.11 Renaming Required Input Variables 52
9.12 Converting Data Files to Comma-Separated Files 52
9.13 Combining Severity Categories 52
9.14 Recoding Missing Values 52
10.0 REFERENCES 53
11.0 DEFINITIONS, ACRONYMS, AND ABBREVIATIONS 54
APPENDIX A: DISTRIBUTION OF CONTINUOUS RESPONSE DATA OVER SEVERITY LEVELS 55
APPENDIX B: TECHNICAL DISCUSSION 58
B.1 Link Functions 58
B.2 Interval Censoring 59
B.3 Parameter Estimation 59
B.3.1 Maximum Likelihood Estimation 59
B.3.2 Generalized Likelihood Estimation 60
B.4 Confidence Limit Calculations 61
APPENDIX C: TECHNICAL BACKGROUND: MODELS AND EXTRA RISK 63
C.1 Exposure-response Models 63
C.2 Extra Risk Concentration (ERC) 64
1.0 INTRODUCTION
The U.S. Environmental Protection Agency (EPA) developed the Categorical Regression
(CatReg) application as a tool to facilitate exposure-response analyses.
In general, CatReg should be viewed as a statistical tool for developing an exposure-response
curve and addressing related questions. A thorough analysis may require numerous executions
of CatReg, ideally guided by both toxicological and statistical considerations.
A critical feature of CatReg is its capability to support data analysis needed for exposure-
response modeling, including:
• Assessing and comparing how well models fit the data
• Testing for differences across studies and the significance of covariates within single or
  pooled studies
• Detecting outliers
The program's options facilitate sensitivity analysis and produce numerous plots in addition to test
results.
This documentation provides instruction on how to use CatReg. However, the documentation
does not address in detail CatReg concepts or guidance on CatReg methods. While the EPA
CatReg methods guidance has not been finalized at this time, every attempt has been made to
make this software consistent with the most recent working draft guidance and discussions of the
EPA Benchmark Dose Workgroup.
Until formal CatReg methods guidance is available, users of this software are strongly
encouraged to review existing background material such as the CatReg Software User Manual:
R-Version (EPA, 2006) before using this software.
The U.S. Environmental Protection Agency's (EPA's) National Center for Environmental
Assessment encourages the broad application of this software. In this document, however, EPA
has chosen to focus on the application of the software to the assessment of adverse effects
associated with acute inhalation exposure. The description of dataset files for the software reflects
this application. The user is free to modify the input fields to support other applications.
Appendix B provides additional technical description of the statistical methods used by the
program.
1.1 About CatReg
CatReg is a computer program developed to support toxicologists and health scientists in
conducting exposure-response analyses, most often for controlled animal experiments.
"Exposure" has two components:
• Exposure level, indicated by a concentration or dose of the agent of interest
• Exposure duration, when the concentration or dose varies within the data
"Response" refers to occurrence of a detrimental health effect of a user-defined level of severity.
More specifically, effects observed in toxicological studies are assigned to ordinal severity
categories and associated with the exposure conditions (e.g., concentration and duration) under
which the effects occurred. "Ordinal" here means that the categories have a natural ordering in
terms of severity or strength of response, but the spacing between ordinal scores is not subject to
direct interpretation.
For example, response might have four levels of severity coded as:
0 = "no adverse effect"
1 = "mild adverse effect"
2 = "moderate/severe effect"
3 = "lethal effect"
An ordinal response of 2 is higher than a response of 1, but the difference is not necessarily the
same as the difference between 3 and 2. The simplest case is dichotomous response data, with
just two severity levels, such as: 0 = "no adverse effect", 1 = "adverse effect".
If data are reported on a continuous scale, such as mean and standard error of respiratory rate
depression, the user can distribute the total number of experimental subjects over the severity
levels using a method discussed in Section 3.9 and Appendix A.
1.2 CatReg's Two Models
CatReg provides two basic models, with variations to be explained, to relate the probabilities of
the different severity categories to exposure level and exposure duration, taking user-defined
covariates into account (e.g., species, gender, target organ, etc.).
The parameters in the models are an intercept term and coefficients of concentration and
duration, either of which may be log-transformed (to the base 10, denoted as "log," "log10," or
"log 10").
Model 1, the cumulative odds model, allows the intercept term to vary with severity level, but
not the coefficients of concentration and duration.
Model 2, the unrestricted cumulative model, allows any of the parameters to vary with
severity level.
The probability that a specified severity level or worse will occur increases as concentration or
duration increases. The user can choose for either Model 1 or Model 2 to conform to the logistic,
normal, or Gumbel cumulative probability distribution (see Appendix B and Appendix C).
There is a function (called the link function) in each case that transforms the probability for each
severity level to a linear function of the unknown parameters, the format of a linear statistical
model. The link functions are the logit, probit, and cloglog (complementary log-log) functions for
the logistic, normal, and Gumbel cumulative probability distributions, respectively.
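The relationship between the link functions and the model probabilities can be sketched in a few lines of Python. This is only an illustration, not CatReg's own code (CatReg performs its calculations in R). It assumes the cumulative odds form in which the probability of severity level s or worse is the inverse link applied to an intercept for level s plus coefficients times log10 concentration and log10 duration. The function names and all parameter values are hypothetical.

```python
import math
from statistics import NormalDist

# Inverse link functions: each maps a linear predictor (eta) to a probability.
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    return NormalDist().cdf(eta)

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

INVERSE_LINKS = {"logit": inv_logit, "probit": inv_probit, "cloglog": inv_cloglog}

def prob_severity_at_least(sev, conc, hours, intercepts, b_conc, b_time, link="logit"):
    """P(severity >= sev) under an assumed cumulative odds form (Model 1):
    only the intercept depends on the severity level."""
    eta = intercepts[sev] + b_conc * math.log10(conc) + b_time * math.log10(hours)
    return INVERSE_LINKS[link](eta)

# Hypothetical parameter values, chosen only for illustration.
intercepts = {1: -8.0, 2: -10.0}   # one intercept per severity level above 0
b_conc, b_time = 3.0, 1.0          # shared across severity levels in Model 1

p1 = prob_severity_at_least(1, 1000.0, 1.0, intercepts, b_conc, b_time)
p2 = prob_severity_at_least(2, 1000.0, 1.0, intercepts, b_conc, b_time)
print(p1, p2)  # p1 exceeds p2: "level 1 or worse" is more likely than "level 2 or worse"
```

In this sketch, switching `link` to "probit" or "cloglog" changes only the curve shape; the linear predictor is the same. Model 2 would correspond to letting `b_conc` and `b_time` also depend on `sev`.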
The parameter estimates and their statistical characteristics, including standard errors and
significance levels, are routinely output by CatReg, along with an analysis of deviance table to
assess model fit and a table of estimates of extra risk concentrations (concentrations at which
extra risk is a user-specified value) for 1, 4, 8, and 24 hour exposure durations.
1.3 CatReg and Toxicology Data
CatReg was developed for, but is not limited to, meta-analysis of toxicology data. Meta-analysis
refers to the analysis of data or results from multiple studies simultaneously. Meta-analysis
becomes valuable when individual experiments are too narrow to address broad concerns.
For example, in acute inhalation risk assessment, it is important to investigate the combined
effects of concentration and duration of exposure, but few published experiments vary both the
concentration and the duration of exposure (Guth et al., 1997). By combining information from
multiple studies, the contribution of both concentration and duration to toxicity can be estimated.
Moreover, the combined analysis allows the analyst to investigate variation among experiments,
an important benchmark for the level of model uncertainty.
Different exposure-response experiments may consider the same or different toxicological
endpoints, and toxicological judgment is required to determine if, and when, two different
endpoints, or gradations of the same endpoint, are of comparable severity.
A relatively simple example is analysis of mortality studies, with two severity levels: 0 = "not
lethal", 1 = "lethal". The same endpoint is used for all studies and no intermediate degrees of
health gradation are addressed.
A slightly more complicated example might involve a single health effect, or mode of action, but with
more than one severity level corresponding to manifestations of progressive "stages" of
development.
Where studies address dissimilar endpoints that may be the consequence of different modes of
action, particular care needs to be exercised to decide if comparable severity levels can be
assigned across studies. It may not be reasonable to include all studies in the same analysis.
For example, two toxicology experiments might report stages of anesthesia while another reports
suppression of the shock-avoidance response. It might be the case that a toxicologist can
confidently assign endpoints of the first two studies to comparable severity levels, but not be able
to include the third study. In that case, one analysis could address the first two studies and a
second analysis the third study, since the studies cannot all be put on a "toxicologically
equivalent" severity scale for analysis together.
2.0 INSTALLING AND USING CATREG
2.1 Step 1: Install R
CatReg relies on the open source package R for Windows to perform all statistical calculations.
The 32-bit edition of version 3.1 or later of the R statistical software must be installed on your
computer before you can use CatReg.
For more information on installing R, refer to the R Project's web site at http://www.r-project.org/.
2.2 Step 2: Install CatReg
1. Locate the CatReg.zip file you downloaded. Typical locations for file downloads are the
Windows Desktop or the Download directory in My Documents.
2. In Windows Explorer, double-click the downloaded .zip file. Drag and drop the CATREG
folder from the .zip file to the directory of your choice on your computer.
It is recommended that you place the CATREG folder (and its subfolders) in the simplest,
shortest directory, without special characters or spaces, for which you have
administrative rights (for most EPA users, this will be C:\Users\[EPA user's LAN ID]; for
non-EPA users, this could be as simple as C:\).
3. Double-click CatReg.exe to start the program.
2.3 Overview of a CatReg Session
The following steps illustrate one path through a CatReg session.
1. Open CatReg.
2. If a data file does not exist in CSV format, or it needs to be edited or cleaned, open the
file in the Data Grid. Edit the file as needed.
3. In the Analysis Screen, select the Click here to load data button.
4. Select the CSV data file of interest.
5. In the Dataset and Variables tab, map the dataset variables to match CatReg's variables.
If you want to filter out specific values, select them here. Note that the summary fields at
the bottom of the screen record the selections you make.
6. In the Model and BMD tab, specify the options that CatReg will use to calculate an
exposure-response curve. All you need to do is select or deselect option boxes or
buttons, but having an understanding of clustering and stratification will help inform your
selections.
7. Click the Run Analysis button. CatReg opens a separate window to display the text-
based results.
8. On the Plots tab (which is enabled after an analysis is run), select the plots you want
generated for the analysis. Click the Run Plots button. CatReg opens a separate window
to display the plots. You can copy and paste the plots into other graphics programs, such
as PowerPoint or GnuPlot, to print or edit them.
9. On the Hypotheses tab (which is enabled after an analysis is run), you can test exposure-
response hypotheses for Intercept, Dose, and Time parameters. Click Run Tests.
CatReg opens a separate window to display the text-based results.
3.0 DATASET AND VARIABLES TAB
[Screenshot: the Analysis Screen of Categorical Regression Version 3.0.1.0 Beta, with the Dataset and Variables, Model and BMD, Plots, and Hypotheses tabs. The Dataset and Variables tab shows the "Click here to load data" button, the Model Variable Mapping and Filter Data Values by Variable areas, and the Summary of Run Options (Output File, Filtered Out, Clustered, Stratified, Censoring).]
Figure 1. Initial CatReg window.
When you first open CatReg, you will see an empty window with the button "Click here to load
data." Click the button to display a dialog box where you can select a CSV dataset.
CatReg displays the loaded dataset and enables the dropdown lists.
[Screenshot: the Analysis Screen with the dataset c:\usepa\catreg\data\chemx.csv loaded. The data grid shows the columns Exp., Group, Species, Target, mg/m3, Hours, SevLo, Nsub, and Incid. The Model Variable Mapping dropdown lists (Dose, Time, Incidence, N, Sev Lo, Sev Hi) and the Filter 1 to Filter 4 dropdown lists are populated, a filter popup with OK and Cancel buttons is open, and the Summary of Run Options shows the output file C:\usepa\CatReg\Data\Chemx.otx.]
Figure 2. CatReg Analysis Screen with loaded dataset and mapped variables.
The following table summarizes choices you can make on this tab.
| Specifications | Description | Notes |
|---|---|---|
| Variables | Variables as defined in the dataset's column headers. | |
| Dose | Specify the variable for exposure dose. | Required. |
| Time | Specify the variable for duration. | Exposure duration in hours. Required unless all durations are equal. (Default value = 1.) |
| Incidence | Specify the variable for incidence of severity level or severity range. | Required. |
| N | Specify the variable for number of subjects in a treatment group. | Required. |
| SevLo | Specify the variable for the lowest severity level. | Required. |
| SevHi | Specify the variable for the highest severity level. | Required if response spans more than one severity level. SevHi is required if the severity level for one or more records is entered as a range, but not otherwise. For example, if the severity level for a record is a range such as level 1 to level 2, then SevLo is entered as 1 and SevHi is entered as 2. Data records for which SevLo ≠ SevHi are referred to as censored data. If only one severity level applies, e.g., level 1, then SevLo = SevHi = 1. |
| Filter | Check to display a popup window showing values for the selected variable. Check the option boxes beside a value to remove (filter) it from the analysis. The summary fields at the bottom of the screen display the filtered values in the Filtered Out text box. | Use this option to fit the exposure-response curve to a subset of the data. One reason to filter data is to investigate how the fit changes when certain observations are excluded. For instance, a particular study may be suspect, and it may be desirable to compare the parameter estimates with and without the suspect study. |
3.1 Mapping Variables
For the Model Variable Mapping, select the dataset variables that correspond to the CatReg
variables of Dose, Time, Incidence, etc.
3.2 Filtering Data Values by Variable
From the dropdown lists, select the dataset variables whose records you want to be removed
(filtered) from the analysis, without removing them from the input file.
This option is used to fit the exposure-response curve to a subset of the data. One reason to filter
data is to investigate how the fit changes when certain observations are excluded. For instance,
a particular study may be suspect, and it may be desirable to compare the parameter estimates
with and without the suspect study.
To filter data, select the variable of interest; CatReg will display all the values for that variable.
Tick the box for the value you want to be removed, and click OK.
This option is also useful for scanning a particular field to see what values have been observed.
Unexpected values may indicate a problem with the input data file (e.g., errors in data entry).
3.3 Before Running an Analysis
Although you can select Run Analysis after specifying the model variables and any filtering, it is
recommended that you review the Model and BMD tab settings before running an analysis.
3.4 Data Requirements and Error Messages
Certain minimal data requirements need to be satisfied in order to estimate the categorical
regression model. There needs to be at least one response in each severity category. If some
categories are completely absent, then categories will need to be combined.
If both concentration and duration effects are to be modeled, then both concentration (C) and
duration (T) need to vary in the data. If T is constant, then a reduced model is fit in which T is
dropped.
There are additional technical limitations on the complexity of the model relative to the data. The
most obvious is that the model cannot include more parameters than the number of independent
observations. Less obvious problems sometimes occur if certain variables are redundant. These
problems are revealed by R error messages indicating that there are redundant variables in the
model. This means that at least one variable in the model can be expressed in terms of the
others. Reducing the number of stratification variables usually will solve the problem.
Sometimes, R will return a "failure to converge" message. This is an indication that there are too
many variables in the model, and that the model needs to be simplified. This problem relates to
the number of variables needed to completely isolate the different severity categories. The
solution is to remove one or more variables from the model.
CatReg attempts to display any R error messages in plain language. In this section, the error
messages are printed in this font (Courier New) for easy recognition.
Sometimes "NaN" (not a number) is given for some of the model deviance iterations. It is
displayed when CatReg is searching for the solutions to the parameter estimates in early
iterations of the program. CatReg tries to compute the deviances for those solutions and finds
they are undefined (NaN). Once a solution is determined for runs like this, the following warning
message may appear: "NAs produced in: log(likes)" ("NA" means "not available").
The message "Warning: Gamma hit its maximum bound!!!" may occur when the
parameter γ (gamma) is being estimated. The smallest positive value of concentration in an input
file is used as a practical boundary on gamma. The estimates shown for gamma and the other
parameters in the summary table of estimates are not maximum likelihood estimates in that case,
and the user is advised to consider the setup option that assumes the background risk is zero.
An error message occurs if a coefficient of concentration or time is negative, e.g., "Error: time
is negative! Estimates of coefficient parameters do not satisfy non-negativity constraint on the
parameters. A negative estimate is evidence of no effect. This run will terminate." The user
needs to modify the run. The data may be indicating there is no effect or there may just be too
many parameters in the model.
Similarly, the estimates of severity intercepts may violate the order constraint, resulting in a
message such as Sev 1
3.5 Data Types
When the source document for an experiment does not report the outcome for individual subjects,
or otherwise report the incidence of different health effects, the data may not be suitable for
CatReg. For example, a report of "mild" pathology for a treatment group might mean that a few or
many in the group manifested that response or that the "mild" response was the most common,
with both lesser and more severe effects also present in the group. In either case, there is not
sufficient information to divide a treatment group into incidence of severity categories.
It is sometimes reasonable to represent a health outcome measured on a continuous scale as
categorical data. Continuous data from acute studies, such as enzyme activities, tidal volume,
respiratory rate, blood pressure, etc., often are reported as a mean value, with a measure of
dispersion, such as the standard error or standard deviation, for each treatment group. To
convert these data to severity levels for CatReg, each severity level needs to be equated to an
interval of values on the continuous scale.
For example, if the full range of responses is 0 to 100, the user might decide to classify outcomes
0 to 20 as "no effect", 21 to 40 as a "mild adverse effect", 41 to 65 as a "moderate adverse effect",
and 66 to 100 as a "severe effect". The mean for a treatment group falls into a single severity level,
but some of the individual responses of subjects in the group may have been dispersed over
adjacent severity levels.
Knowing the mean and standard deviation (or standard error that can be converted to a standard
deviation by multiplying by the square root of the size of the treatment group) and assuming a
distribution for the continuous data (e.g., normal), an estimate can be made of the incidence at
each severity level (see Appendix A for details). The estimated incidence figures need not be
whole numbers, but must still sum to the total group size. Incidence estimation is not possible if
the mean is reported without a measure of dispersion.
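The idea can be sketched as follows, assuming a normal distribution for the continuous response. This is an illustrative sketch only; the method actually recommended is described in Appendix A and may differ in detail. The function names, the example mean and standard error, and the cut points placed halfway between the intervals above (20.5, 40.5, 65.5) are all assumptions made for the illustration.

```python
import math
from statistics import NormalDist

def sd_from_se(se, n):
    """Convert a standard error to a standard deviation for a group of size n."""
    return se * math.sqrt(n)

def distribute_over_severity(mean, sd, n, upper_cutpoints):
    """Estimate the incidence at each severity level for one treatment group.

    upper_cutpoints lists, in ascending order, the upper boundary of every
    severity level except the highest. Probability mass in the normal tails
    is assigned to the lowest and highest levels.
    """
    dist = NormalDist(mean, sd)
    cumulative = [dist.cdf(c) for c in upper_cutpoints] + [1.0]
    incidences = []
    previous = 0.0
    for value in cumulative:
        incidences.append(n * (value - previous))
        previous = value
    return incidences

# Hypothetical treatment group: mean response 35, standard error 4, 10 subjects.
sd = sd_from_se(4.0, 10)
counts = distribute_over_severity(35.0, sd, 10, [20.5, 40.5, 65.5])
print([round(c, 2) for c in counts])  # four estimated incidences, not whole numbers
print(sum(counts))                    # sums to the group size of 10, up to rounding
```

Each element of `counts` would become the incidence for one record of the treatment group, with the severity level given by its position in the list.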
3.6 The Data Input File
The same category system of severity levels must be used for all data in an input file.
Considerable toxicological judgment may be required for classification of various health effects
into severity levels and for achieving comparability across experiments. When that cannot be
done for all the studies of interest, it may be necessary to group the studies into more than one
input file. Classification judgments must be made systematically according to documented
criteria.
The minimum number of severity levels is two (severity levels coded as 0 and 1, corresponding to
absence or presence of an effect) and the maximum number is four.
Suggested severity categories for a three-category classification are "no adverse effect", "adverse
effect", and "lethal effect", coded as severity levels 0, 1, 2, respectively.
A four-category scheme might be "no adverse effect", "mild adverse effect", "moderate/severe
effect", and "lethal effect", coded as 0, 1, 2, 3, respectively.
In some toxicology studies, it may not be possible to score all response data completely.
Consider a four-category scoring system in which 0 = "no observable effect," 1 = "mild effect," 2 =
"moderate effect," and 3 = "severe effect." Published data from an animal mortality study may not
include nonlethal outcomes; therefore, the response score for an animal that survives is
uncertain, or "censored." That score is known to be less than 3, but it is not known whether the
score should be 0, 1, or 2. Such an observation is said to be "interval censored." Another
situation where the response score may be interval censored is in combining data from
experiments with different endpoints. For some endpoints, it may not be clear from the toxicology
whether a specific response should be considered "mild" or "moderate." An interval censored
analysis simply could report that the response is either 1 or 2, but the specific score is not known.
The ability to include partial information about the ordinal scores is one of the important features
of CatReg. CatReg incorporates this type of partial information in an interval-censored analysis.
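To make the idea concrete, the probability attached to an interval-censored observation can be written as the probability that the severity score lies anywhere in the reported range. The sketch below assumes that the model supplies P(severity >= s) for each level s above 0; the function name and the probability values are hypothetical, and the sketch is not taken from CatReg's code.

```python
def interval_probability(sev_lo, sev_hi, p_at_least):
    """P(sev_lo <= severity <= sev_hi).

    p_at_least[s - 1] holds P(severity >= s) for s = 1 .. highest level.
    By definition P(severity >= 0) is 1 and P(severity > highest level) is 0.
    """
    highest = len(p_at_least)

    def at_least(s):
        if s <= 0:
            return 1.0
        if s > highest:
            return 0.0
        return p_at_least[s - 1]

    return at_least(sev_lo) - at_least(sev_hi + 1)

# Hypothetical probabilities of "level 1 or worse", "level 2 or worse", "level 3".
p_at_least = [0.7, 0.4, 0.1]

survivor = interval_probability(0, 2, p_at_least)          # known only to be below 3
mild_or_moderate = interval_probability(1, 2, p_at_least)  # known only to be 1 or 2
exactly_two = interval_probability(2, 2, p_at_least)       # uncensored score of 2
print(survivor, mild_or_moderate, exactly_two)
```

An uncensored record is the special case in which the lower and upper limits coincide, so the same expression covers both kinds of data.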
In general, interval censoring occurs if the response is known only to lie in an interval of potential
values. Such intervals are specified in CatReg by supplying the lower and upper limits of the
known range for each observation. Suggested codes to indicate species and sex in an input file
are provided in Table 1.
Table 1: Recommended Codes for Species and Sex

| Species | Code | Sex | Code |
|---|---|---|---|
| Human | HU | Female | F |
| Rat | RT | Male | M |
| Mouse | MU | Both sexes | B |
| Rabbit | RB | | |
| Guinea pig | GP | | |
Each column of the user input file is referred to as a data field, with the first record (row) containing
variable names and all subsequent rows containing data for the variables. CatReg requires
information for four to six data fields, depending on the data. The names of these data fields and
the corresponding default variable names that CatReg looks for in the user input file are shown in
Table 2. For example, "Conc" refers to a data field for exposure concentration and CatReg looks
for the variable name "mg/m3" to identify that field, unless the variable name has been changed
from the default. The default variable name for "Conc" might be changed, for example, if the
concentration used in experiments is not expressed in milligrams per cubic meter.
Using the default variable names in Table 2 as an example, the user input file must include data
in each record (beyond the first that contains variable names) for variables "mg/m3", "Nsub",
"Incid", and "SevLo". Data are also required for the variable "Hours" unless all exposure duration
times are equal, in which case it can be omitted (CatReg uses 1 as the default value in that case).
The variable SevHi is required if the severity level for one or more records is entered as a range,
but not otherwise. For example, if the severity level for a record is a range such as level 1 to level
2, then SevLo is entered as 1 and SevHi is entered as 2. Data records for which SevLo ≠ SevHi
are referred to as censored data. If only one severity level applies, e.g., level 1, then SevLo = SevHi
= 1.
Table 2: Data Fields That May Be Required

| Data Field | Variable Name | Description |
|---|---|---|
| Conc | mg/m3 | Exposure concentration. Use of the human equivalent concentration is recommended. Always required. |
| Time | Hours | Exposure duration in hours. Required unless all durations are equal. (Default value = 1) |
| Nsub | Nsub | Number of subjects in a treatment group. Always required. |
| Incid | Incid | Incidence of severity level (or severity range) for the record. Always required. |
| Loscore | SevLo | Lowest severity level for the record. Always required. |
| Hiscore | SevHi | Highest severity level for the record. Required if response spans more than one severity level. |
A separate record is entered for each severity level (or range of severity levels in the case of
censored data) observed in a treatment group. For example, if the user determines three severity
classifications for health effects, denoted as 0, 1, and 2, then the outcome for a treatment group
is represented as an incidence for each severity level that is observed. To illustrate, a treatment
group of size 10 might result in 3 subjects being classified at severity level 0, 4 at severity level 1,
and 3 at severity level 2, which would require three consecutive records (consecutive rows of
data) in the input file. The records for the treatment group must not only be consecutive but their
incidence (values of Incid) must sum to the treatment group size (Nsub).
A severity level with no observations need not be entered as a separate record. For example, if a
treatment group of size 10 had 6 subjects classified at severity level 0 and 4 at severity level 1,
then only two records would be required to enter the data. The value of Nsub would be 10 in both
records; the value of Incid would be 6 in one record and 4 in the other.
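A quick pre-check of these rules can be scripted before loading a file into CatReg. The sketch below is an illustration under stated assumptions, not a CatReg feature: it uses the default variable names from Table 2 plus a hypothetical "Group" column that identifies each treatment group, and the sample data are invented. It reports every group whose consecutive records disagree on Nsub or whose Incid values do not sum to Nsub.

```python
import csv
import io
from itertools import groupby

# Invented sample data; "Group" is a hypothetical column identifying treatment groups.
SAMPLE = """Group,mg/m3,Hours,SevLo,Nsub,Incid
1,1259,1.25,0,10,6
1,1259,1.25,1,10,4
2,1585,1.25,0,10,3
2,1585,1.25,1,10,4
2,1585,1.25,2,10,2
"""

def check_group_totals(csv_text, group_col="Group"):
    """Return (group, Nsub, total Incid) for each group that fails the check."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    # groupby only merges adjacent rows, which matches the requirement that
    # the records for a treatment group be consecutive.
    for key, group_rows in groupby(rows, key=lambda r: r[group_col]):
        group_rows = list(group_rows)
        nsub = float(group_rows[0]["Nsub"])
        total = sum(float(r["Incid"]) for r in group_rows)
        same_nsub = all(float(r["Nsub"]) == nsub for r in group_rows)
        if not same_nsub or abs(total - nsub) > 1e-9:
            problems.append((key, nsub, total))
    return problems

print(check_group_totals(SAMPLE))  # group 2 is reported: its incidences sum to 9, not 10
```

Incidence values are read as floating-point numbers because estimated incidences derived from continuous data need not be whole numbers.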
Microsoft Excel spreadsheets can be used to construct a data file, but the file must be saved as a
comma-delimited file with a "csv" extension, rather than as an Excel file with an "xls" extension.
Because CatReg assumes that data are separated by commas rather than by blank spaces,
spaces are interpreted as characters and should be avoided unless intended to be part of the
data. For instance, ", MU," is distinct from ",MU,". Because R is case-sensitive, "mu," is different
from "Mu,".
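The effect of stray spaces and letter case can be checked outside CatReg. The following Python sketch is not part of CatReg, and its sample rows are invented for illustration. It parses comma-delimited text with the standard csv module and shows that each spelling of a label is kept exactly as typed, so the three spellings below count as three different values.

```python
import csv
import io

# Hypothetical comma-delimited rows; the second data row has a leading space.
text = "Species,Nsub\nMU,10\n MU,10\nMu,10\n"

rows = list(csv.DictReader(io.StringIO(text)))
species = [row["Species"] for row in rows]

# Spaces and case are preserved, so these are three distinct labels.
distinct = set(species)
print(len(distinct), "distinct labels:", sorted(distinct))
```

A check of this kind before loading a file can reveal labels that were meant to be identical but would be treated as separate categories.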
Variables in addition to those in Table 2 may be added to the input file at the user's discretion,
either for use in execution of CatReg or to facilitate organizing and keeping track of the data. The
user can refer to them in the same manner as variables in the required fields when using CatReg
options. For example, one might want to add "strain" as a variable to distinguish between two
strains of mice and have CatReg test whether their exposure-response curves are significantly
different in some respect, or add "Ref.id" to record the source of the data, even if it is not used
during the execution of CatReg.
Three varied examples of input files follow, described as experimental results for hypothetical
chemicals named chemx, chemy, and chemz. The data were generated by simulation, except the
data for what is being called chemy that were constructed from a few experiments on exposure of
rodents to hydrogen sulfide. The input file for chemx is an example of four experiments, one each
on the four combinations of species (RT and MU) and target organs (C and L). The input file for
chemy has a more complicated structure and illustrates how a toxicologist might determine the
severity levels. The input file for chemz is an illustration of converting a continuous response to
severity categories for use in CatReg.
3.7 Input File: chemx.csv.
Table 3 displays the first part of the input file for chemx.csv. Four experiments were conducted
under identical exposure conditions, each consisting of 10 observations at each combination of
four concentrations (mg/m3) and four durations (Hours), for a total of 64 treatment groups. The
concentrations are 1259, 1585, 2000, and 2512 mg/m3; the durations are 1.25, 1.6, 2.0, and 2.5
hours.
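The count of 64 treatment groups follows from the design: four experiments, each with four concentrations crossed with four durations. The short Python sketch below is an illustration only and is not part of CatReg. It enumerates the combinations and also shows that the listed concentrations are evenly spaced on a log10 scale.

```python
import itertools
import math

concentrations = [1259, 1585, 2000, 2512]   # mg/m3
durations = [1.25, 1.6, 2.0, 2.5]           # hours
# One experiment per combination of species and target organ.
experiments = list(itertools.product(["MU", "RT"], ["C", "L"]))

groups = list(itertools.product(experiments, concentrations, durations))
n_groups = len(groups)

# Rounded log10 values show the even spacing of the design points.
log_conc = [round(math.log10(c), 2) for c in concentrations]
log_dur = [round(math.log10(t), 2) for t in durations]
print(n_groups, log_conc, log_dur)
```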
There are three severity levels: no adverse effect (SevLo = 0), mild adverse effect (SevLo = 1),
moderate/severe effect (SevLo = 2). Two experiments are on mice (Species = MU) and two are
on rats (Species = RT), with one of the two experiments on each species reporting effects on the
central nervous system (Target = C) and the other reporting effects on the liver (Target = L).
"Exp" denotes an experiment number, "Group" the treatment group within the experiment, "Nsub"
the number of subjects in the treatment group, and "Incid" the incidence in the treatment group of
the severity level (SevLo) shown in the record (row of data).
The variable names mg/m3, SevLo, Nsub, and Incid are required and an error message is printed
if any of them is missing. In this example, the exposure durations vary so the variable Hours is
included. SevHi would have been included as a variable if the severity level had been censored
for one or more records (spanned more than one severity level). For example, to make the 10
subjects in the first record (row of data) classified as severity level 0-1, the variable SevHi would
be added to the input file and the first record would remain unchanged except for SevHi = 1. In
that case, subsequent records that are not censored would be given the same value for SevLo
and SevHi. For example, the second record indicates that 9 subjects were classified at severity
level 0. If SevHi were included as a variable, then SevHi would be set to 0 for that record, making
SevLo = SevHi = 0.
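The rule for filling in SevHi can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CatReg code: the function name add_sevhi and the dictionary layout are hypothetical, and the three records are the first three rows of Table 3. A censored record receives its own upper severity level; every other record receives SevHi equal to SevLo.

```python
# First three records of Table 3 (only the fields needed here).
records = [
    {"SevLo": 0, "Nsub": 10, "Incid": 10},
    {"SevLo": 0, "Nsub": 10, "Incid": 9},
    {"SevLo": 1, "Nsub": 10, "Incid": 1},
]

def add_sevhi(records, censored_upper):
    """Return copies of the records with a SevHi field added.

    censored_upper maps a record index to the upper severity level of a
    censored record; all other records get SevHi equal to SevLo.
    """
    out = []
    for i, rec in enumerate(records):
        new = dict(rec)
        new["SevHi"] = censored_upper.get(i, rec["SevLo"])
        if new["SevHi"] < new["SevLo"]:
            raise ValueError("SevHi must not be lower than SevLo")
        out.append(new)
    return out

# Censor the first record as severity 0-1; the others stay uncensored.
with_sevhi = add_sevhi(records, {0: 1})
print(with_sevhi)
```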
It may be noted that variables Exp., Group, Species, and Target are names created by the user.
Species and Target were included in this case to be able to distinguish between species and
target organ in the data analysis, but other names could be used in their place. Exp. and Group
were added by the user to facilitate record keeping. The variable Group is not required by
CatReg but records for the same treatment group must be together in the input file, all with the
common value of Nsub and values of Incid that sum to Nsub. Adding a variable such as Group
provides a convenient check of the data for the user. CatReg determines treatment groups by
reading records until the values of Incid sum to Nsub, then starting over with the next record.
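The grouping rule just described can be mimicked to check a file before it is loaded. The Python sketch below is a hypothetical pre-check, not CatReg's own implementation; the function name and the small tolerance (included so that fractional incidences can be handled) are assumptions. It reads records in order, closes a group when the running sum of Incid reaches Nsub, and reports records that cannot form a valid group.

```python
def split_treatment_groups(records, tol=1e-9):
    """Group consecutive records until their Incid values sum to Nsub."""
    groups, current, running = [], [], 0.0
    for rec in records:
        if current and rec["Nsub"] != current[0]["Nsub"]:
            raise ValueError("Nsub changed before the group was complete")
        current.append(rec)
        running += rec["Incid"]
        if running > rec["Nsub"] + tol:
            raise ValueError("Incid values exceed Nsub for a treatment group")
        if abs(running - rec["Nsub"]) <= tol:
            groups.append(current)          # group complete; start over
            current, running = [], 0.0
    if current:
        raise ValueError("Incid values of the last group do not reach Nsub")
    return groups

# First five records of Table 3 (Nsub, SevLo, and Incid only).
records = [
    {"Nsub": 10, "SevLo": 0, "Incid": 10},
    {"Nsub": 10, "SevLo": 0, "Incid": 9},
    {"Nsub": 10, "SevLo": 1, "Incid": 1},
    {"Nsub": 10, "SevLo": 0, "Incid": 4},
    {"Nsub": 10, "SevLo": 1, "Incid": 6},
]
groups = split_treatment_groups(records)
group_sizes = [len(g) for g in groups]
print(group_sizes)
```

Comparing the groups found this way with a user-supplied Group column is one way to carry out the convenience check mentioned above.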
Table 3: Part of the Input File chemx.csv
Exp.  Group  Species  Target  mg/m3  Hours  SevLo  Nsub  Incid
1     1      MU       C       1259   1.25   0      10    10
1     2      MU       C       1259   1.6    0      10    9
1     2      MU       C       1259   1.6    1      10    1
1     3      MU       C       1259   2      0      10    4
1     3      MU       C       1259   2      1      10    6
1     4      MU       C       1259   2.5    0      10    1
1     4      MU       C       1259   2.5    1      10    9
1     5      MU       C       1585   1.25   0      10    7
1     5      MU       C       1585   1.25   1      10    3
1     6      MU       C       1585   1.6    0      10    3
1     6      MU       C       1585   1.6    1      10    7
1     7      MU       C       1585   2      0      10    1
1     7      MU       C       1585   2      1      10    7
1     7      MU       C       1585   2      2      10    2
1     8      MU       C       1585   2.5    1      10    4
1     8      MU       C       1585   2.5    2      10    6
1     9      MU       C       2000   1.25   0      10    7
1     9      MU       C       2000   1.25   1      10    3
1     10     MU       C       2000   1.6    0      10    2
1     10     MU       C       2000   1.6    1      10    5
1     10     MU       C       2000   1.6    2      10    3
1     11     MU       C       2000   2      1      10    6
1     11     MU       C       2000   2      2      10    4
1     12     MU       C       2000   2.5    1      10    4
1     12     MU       C       2000   2.5    2      10    6
1     13     MU       C       2512   1.25   1      10    9
1     13     MU       C       2512   1.25   2      10    1
1     14     MU       C       2512   1.6    1      10    5
1     14     MU       C       2512   1.6    2      10    5
1     15     MU       C       2512   2      1      10    2
1     15     MU       C       2512   2      2      10    8
1     16     MU       C       2512   2.5    2      10    10
2     1      RT       C       1259   1.25   0      10    10
2     2      RT       C       1259   1.6    0      10    9
2     2      RT       C       1259   1.6    1      10    1
2     3      RT       C       1259   2      0      10    9
2     3      RT       C       1259   2      1      10    1
3.8 Input File: chemy.csv.
Table 4 is part of a larger input file that was constructed for exposure of rodents to hydrogen
sulfide. Only part of the available experimental data are used here for illustration, so it is referred
to as "chemy" instead of hydrogen sulfide. It illustrates a more elaborate coding system and
some other features not included in the preceding example, e.g. censoring, and provides a
realistic example for discussion of toxicological judgment in severity classification. The available
studies varied on the organ sites and endpoints examined and a four-category system of severity
levels was implemented: no adverse effect (0), mild adverse effect (1), moderate/severe effect
(2), and lethal effect (3). Again, CatReg expects to find variables mg/m3, Nsub, Incid, and SevLo,
at a minimum, and Hours as well, if exposure duration varies, and SevHi if there are any
censored data. Also (again) notice that a separate record is required for each different severity
(or range of severity in the case of censored data) in a treatment group, and that Incid sums to
Nsub for records in the same treatment group.
"Ref.id" (reference identification) is a number assigned by the user to the source of the
information in the record. In the example, Ref.id = 20938 is the source of the material used to
construct the first 10 data records shown. "Exp." identifies experiments within each Ref.id,
numbered sequentially from 1; "Group" numbers treatment groups within each experiment (i.e.,
subjects alike with respect to all methods and materials variables); and "Nsub" is the number of
subjects in each group. A separate record (row) is entered for each severity level (or range of
severity levels) in a treatment group. "Marker" just numbers the data records. The severity levels
are entered under "SevLo", the lowest possible severity level for the record, and "SevHi", the
highest severity level for the record.
When SevLo < SevHi, "y" is entered under "Censored"; when SevLo = SevHi, "n" is entered. In Table 4, the
variable Censored has been added by the user to readily distinguish between records with a
single severity level and those with a range of severity levels; it is not required by CatReg. The
variable BestNum also has been added by the user to indicate the most likely severity level when
SevLo < SevHi; otherwise, the common value of SevLo and SevHi is entered for BestNum. To
use the scores in the BestNum column, the default variable names SevLo and SevHi must both
be changed to BestNum by editing the names in the Data Grid.
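Because Censored and BestNum are user-maintained columns, it can help to verify them against SevLo and SevHi. The Python sketch below is a hypothetical consistency check, not a CatReg feature; the function name is an assumption, and the rows are Markers 1, 3, and 6 of Table 4. It derives the Censored flag from the two severity columns and confirms that BestNum lies within the recorded range.

```python
# (Marker, SevLo, SevHi, BestNum) for Markers 1, 3, and 6 of Table 4.
rows = [
    (1, 0, 2, 1),
    (3, 3, 3, 3),
    (6, 1, 2, 2),
]

def censored_flag(sev_lo, sev_hi):
    """Return "y" for a range of severity levels, "n" for a single level."""
    if sev_lo > sev_hi:
        raise ValueError("SevLo must not exceed SevHi")
    return "y" if sev_lo < sev_hi else "n"

flags = {marker: censored_flag(lo, hi) for marker, lo, hi, _ in rows}

# BestNum should always fall between SevLo and SevHi (inclusive).
best_ok = all(lo <= best <= hi for _, lo, hi, best in rows)
print(flags, best_ok)
```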
The first reference (Ref.id = 20938) reported two experiments, one with mice (Species = MU) and
one with rats (Species = RT), with both sexes in both experiments (Sex = B). User-defined
variables indicate the target organ (e.g., Target = Resp) and the primary endpoint
(Endpoint = Lethality), both of which were the same for both experiments. The first group (Group
= 1) under the first experiment (Exp = 1) is for rats (Species = RT) of both sexes (Sex = B)
exposed at a concentration of 330 mg/m3 (mg/m3 = 330) for 6 h (Hours = 6). There were 26 rats
at risk (Nsub = 26). The user-defined severity classification was the same for all 26: severity
level 0 to 2 (SevLo = 0, SevHi = 2), with severity level 1 the best guess for a single severity level
(BestNum = 1). Because the results variables were the same for all 26 subjects, only one record
is needed to record the data for the whole treatment group.
The sixth and seventh records (Markers 6 and 7) are for the first treatment group (Group = 1) of
the second experiment (Exp = 2) in Ref.id = 20938. Mice (Species = MU) of both sexes (Sex = B)
were exposed to 360 mg/m3 (mg/m3 = 360) for 6 h (Hours = 6). Two records are required
because there are two distinct severity classifications: one for 23 subjects (Incid = 23) with
severity level 1 to 2 (SevLo = 1, SevHi = 2) and severity level 2 the best guess (BestNum = 2)
and the other with three subjects at severity level 3 (SevLo = 3, SevHi = 3) and BestNum = 3. Six
records in the example have different values for SevLo and SevHi, as indicated by Censored = y.
In Ref.id = 20938 of Table 4, the effect severity for concentrations at which no subjects died was
censored 0 to 2 (e.g., Marker 1), spanning no adverse effect to severe adverse effect, because the
effects were unknown. Survivors from groups in which some subjects died (e.g., Marker 2) were
assumed to have suffered adverse effects and were censored 1 to 2 because the effects could
have been mild to severe. Survivors from groups in which most of the subjects died (e.g., Marker
5) were assumed to have suffered severe effects. For the 360-, 390-, 420-, and 460-mg/m3 H2S
exposures, there is one record for the number of subjects exhibiting lethal effects, and another for
the number of subjects exhibiting nonlethal effects. Only one record was made for exposure to
330 mg/m3, because all subjects were assumed to exhibit effects of severity 0 to 2, and one
record was made for exposure to 500 mg/m3, because all subjects died (severity 3).
Table 4: Part of Input File chemy.csv
Marker  Ref.id  Exp  Group  Species  Sex  mg/m3  Hours  Target  Endpoint   Nsub  Incid  BestNum  SevLo  SevHi  Censored
1       20938   1    1      RT       B    330    6      Resp    Lethality  26    26     1        0      2      y
2       20938   1    2      RT       B    390    6      Resp    Lethality  26    20     2        1      2      y
3       20938   1    2      RT       B    390    6      Resp    Lethality  26    6      3        3      3      n
4       20938   1    3      RT       B    460    6      Resp    Lethality  26    23     3        3      3      n
5       20938   1    3      RT       B    460    6      Resp    Lethality  26    3      2        2      2      n
6       20938   2    1      MU       B    360    6      Resp    Lethality  26    23     2        1      2      y
7       20938   2    1      MU       B    360    6      Resp    Lethality  26    3      3        3      3      n
8       20938   2    2      MU       B    420    6      Resp    Lethality  26    13     3        3      3      n
9       20938   2    2      MU       B    420    6      Resp    Lethality  26    13     2        1      2      y
10      20938   2    3      MU       B    500    6      Resp    Lethality  26    26     3        3      3      n
11      61831   1    1      RT       M    14     4      Resp    N lavage   12    12     0        0      0      n
12      61831   1    2      RT       M    278    4      Resp    N lavage   12    12     1        0      1      y
13      61831   1    3      RT       M    556    4      Resp    N lavage   12    12     2        1      2      y
Notes:
Marker - Record number.
Ref.id - Source identifier.
Exp - Experiment number within a source.
Group - Treatment group number (within an experiment).
Species - Species.
Sex - Sex.
mg/m3 - Exposure concentration.
Hours - Exposure duration.
Target - Target organ.
Endpoint - Toxic endpoint.
Nsub - Number of subjects in treatment group.
Incid - Number of animals responding.
BestNum - Analyst's best estimate of severity category for censored data; same as SevLo
and SevHi for noncensored data.
SevLo - Lowest applicable severity level.
SevHi - Highest applicable severity level.
Censored - Severity level reported as a range if "y".
Ref.id = 61831 of the example illustrates a case where health effects are described for each
treatment group, for which the severity levels can be decided, but incidence of the severity levels
is unknown. These data are not suitable for CatReg because correct values for Incid are
unknown. Nevertheless, one might want to test the sensitivity of the results of CatReg to different
assumptions regarding the incidence values and it provides a useful example for discussion of
deciding severity levels. It is included here primarily as an example of deciding severity levels
from toxicological data. The data are from a study by Lopez et al. (1987), in which groups of 12
rats were exposed to 0, 14, 278, and 556 mg/m3 H2S for 4 h. Incid = 12 is used in the example
for illustration, which is equivalent to assuming that all 12 rats in each treatment group have the
same severity classification shown for the group as a whole.
Data from the Lopez et al. study were assigned to severity levels as follows. Groups of four rats
were killed at 1, 20, and 44 h after 4-h exposure for the examination of biochemical indicators of
injury and inflammatory response in the respiratory tract. Nasal lavage fluid was examined for
lactate dehydrogenase (LDH), alkaline phosphatase (ALP), protein, and number of nucleated
cells. Bronchoalveolar lavage (BAL) fluid was examined for activities of LDH, ALP, and
γ-glutamyl transpeptidase. All measurements were reported as means, plus or minus standard
deviations. No changes in any parameters were noted among rats exposed to 14 mg/m3 H2S.
The only parameter significantly different from controls in nasal lavage fluid at all post-exposure
time periods was increased cellularity, which was significant at the 556-mg/m3 exposure. In BAL
fluid, LDH activity was elevated at 44 h post-exposure, and ALP activity was significantly
decreased at 20 and 44 h after the 278-mg/m3 exposure. At all post-exposure durations for the
556-mg/m3 exposure, protein concentration and LDH activity were elevated, but ALP was
decreased.
Because a number of biochemical and cellular parameters were measured at several post-
exposure periods, significant changes were considered to be adverse only if the changes were
still significant at the last post-exposure measurement. In other words, reversible changes were
classified as no-observed-adverse effects. Table 5 shows how these effects were categorized
using the four-category severity scheme. Because no changes were noted after exposure to
14 mg/m3 H2S, effects were coded 0 for no adverse effect. Effects at the 278-mg/m3 H2S
exposure were estimated to range from severity 0 to 1 (no adverse effect to mild adverse effect)
because significant changes in two biochemical parameters were observed. The adversity of
those changes was uncertain but assumed to be less than severe. Effects caused by exposure
to 556 mg/m3 H2S were estimated to range from severity 1 to 2 (mild to severe adverse effect)
because of changes in the activities of several enzymes and nasal cytopathology. No deaths
occurred during this study, so category 3 (lethality) was not used.
Table 5: Example of Severity Categorization for Nonlethal Effects
Exposure Concentration  Statistically Significant Effects Reported                  Severity Score  Censored
(mg/m3 H2S)
14                      None                                                        0               No
278                     ↑LDH, ↓ALP                                                  0-1             Yes
556                     ↑Protein, ↑LDH, ↓ALP, ↑cellularity, nasal cytopathology     1-2             Yes
3.9 Input File: chemz.csv
An artificial example of how continuous data might be coded is displayed in Table 6. It is
assumed that the standard error was included along with the mean for each treatment group. In
this case, the incidence of each severity in a group was estimated from the mean and standard
error, which produces fractional subjects. The method to estimate incidence at each severity from
group means and a measure of variability is described in Appendix A.
Male rats in treatment groups of 10 each were exposed to a toxicant at various concentrations
and durations. Adverse effects occurred in the respiratory tract, with severity indicated by lung
weight. For illustration, a four-category severity classification is used, with lung weight 0 to 20
classified as "no effect", 21 to 40 as "mild adverse effect", 41 to 65 as "moderate adverse effect",
and 66 to 100 as "severe effect". The mean for a treatment group falls into a single severity level,
but some of the individual responses of subjects in the group may have been dispersed over
adjacent severity levels.
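The method actually used to build the example is the one in Appendix A. Purely as a generic illustration of the idea, the Python sketch below assumes that individual lung weights in a group are normally distributed and computes the expected number of subjects falling in each severity category. The function name, the cutpoints of 20, 40, and 65 (taken as the upper edges of the first three categories), and the example mean and standard deviation are all assumptions made for this sketch.

```python
from statistics import NormalDist

def expected_incidence(mean, sd, n_subjects, cutpoints):
    """Expected number of subjects per severity category.

    Assumes individual responses are normal with the given mean and
    standard deviation. If only a standard error of the mean is reported,
    the standard deviation is the standard error times sqrt(n_subjects).
    """
    dist = NormalDist(mean, sd)
    # Cumulative probability at each category boundary, padded with 0 and 1
    # so that the lowest and highest categories are open-ended.
    cdf = [0.0] + [dist.cdf(c) for c in cutpoints] + [1.0]
    return [n_subjects * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

cutpoints = [20, 40, 65]   # assumed upper edges of severity levels 0, 1, 2
counts = expected_incidence(mean=18.0, sd=5.0, n_subjects=10, cutpoints=cutpoints)
print([round(c, 1) for c in counts])
```

The resulting counts are generally fractional and sum to the group size, which is why fractional values of Incid can appear in an input file built this way.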
Table 6: Part of Input File chemz.csv
Marker  Ref.id  Exp  Group  Species  Sex  mg/m3  Hours  Target  Endpoint   Nsub  Incid  BestNum  SevLo  SevHi  Censored
1       1       1    1      RT       M    330    2      Resp    Lung wt    10    1      0        0      0      n
2       1       1    2      RT       M    360    2      Resp    Lung wt    10    1      1        1      1      n
3       1       1    3      RT       M    390    2      Resp    Lung wt    10    1      1        1      2      y
4       1       1    4      RT       M    410    2      Resp    Lung wt    10    1      2        2      2      n
5       1       1    5      RT       M    460    2      Resp    Lung wt    10    1      2        2      2      n
6       1       2    1      RT       M    460    1      Resp    Lung wt    10    1      1        1      1      n
7       1       2    2      RT       M    510    1      Resp    Lung wt    10    1      1        1      1      n
8       1       2    3      RT       M    560    1      Resp    Lung wt    10    1      2        2      2      n
9       1       2    4      RT       M    610    1      Resp    Lung wt    10    1      2        2      2      n
10      1       3    1      RT       M    560    0.5    Resp    Lung wt    10    1      1        1      1      n
11      1       3    2      RT       M    610    0.5    Resp    Lung wt    10    1      1        1      2      y
12      1       3    3      RT       M    660    0.5    Resp    Lung wt    10    1      2        2      2      n
13      1       3    4      RT       M    710    0.5    Resp    Lung wt    10    1      2        2      2      n
14      2       1    1      RT       M    330    2      Resp    Lung wt    10    7.1    0        0      0      n
15      2       1    1      RT       M    330    2      Resp    Lung wt    10    2.9    0        1      1      n
16      2       1    2      RT       M    360    2      Resp    Lung wt    10    2.9    1        0      0      n
17      2       1    2      RT       M    360    2      Resp    Lung wt    10    7.1    1        1      1      n
18      2       1    3      RT       M    390    2      Resp    Lung wt    10    4.9    1        1      1      n
19      2       1    3      RT       M    390    2      Resp    Lung wt    10    5.1    1        2      2      n
20      2       1    4      RT       M    410    2      Resp    Lung wt    10    1.2    1        1      1      n
21      2       1    4      RT       M    410    2      Resp    Lung wt    10    8.8    2        2      2      n
4.0 MODEL AND BMD TAB
[Screenshot: Categorical Regression Version 3.0.1.0 Beta, Analysis Screen, with the Model and BMD tab selected. The tab contains the BMD settings (Risk, BMR, Confidence Level (%), Time), the Model Specifications (Link Function, Model Form, Zero Background Response, Dose, Time, Worst Case Analysis), the Stratification and clustering selectors, and the Summary of Run Options.]
Figure 3. Model and BMD tab.
The Model and BMD tab provides further opportunity to specify the options that CatReg will use
to calculate an exposure-response curve. All you need to do is select or deselect option boxes or
buttons, but having an understanding of clustering and stratification will help inform your
selections.
See also Appendix C, for technical background discussions on exposure-response models and
extra risk concentration (ERC).
4.1 Setting BMD Specifications for an Analysis
Specification         Description
Risk                  Options are: Unadjusted, Added, Extra (default).
                      Added risk is the additional proportion of total animals that respond in the
                      presence of the dose, that is, the predicted probability of response at dose d,
                      P(d), minus the predicted probability of response in the absence of exposure, P(0).
                      Extra risk is the added risk divided by the predicted proportion of animals that
                      will not respond in the absence of exposure, 1 - P(0).
BMR                   The response, generally expressed as in excess of background, at which a
                      benchmark dose or concentration is desired. User input value (default 0.1000).
                      BMR must be a number >0 and <1.
Confidence Level (%)  The confidence level (default 0.95) associated with the calculation of the
                      statistical lower bound of the BMD (BMDL). Confidence level must be a number
                      >0 and <1.
Time                  Exposure duration, in hours.
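The two risk definitions above translate directly into arithmetic. The following Python sketch is a minimal illustration, not CatReg code; the function names and the example probabilities are hypothetical.

```python
def added_risk(p_dose, p_zero):
    """Added risk: P(d) - P(0)."""
    return p_dose - p_zero

def extra_risk(p_dose, p_zero):
    """Extra risk: [P(d) - P(0)] / [1 - P(0)]."""
    if not 0.0 <= p_zero < 1.0:
        raise ValueError("P(0) must be at least 0 and less than 1")
    return (p_dose - p_zero) / (1.0 - p_zero)

# Hypothetical predicted probabilities of response.
p_d, p_0 = 0.28, 0.10
print(added_risk(p_d, p_0), extra_risk(p_d, p_0))
```

When the background response P(0) is zero, the denominator is 1 and the two measures coincide.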
4.2 Setting Model Specifications for an Analysis
Specification             Available Options                               Notes
Link Function             Logit (default), Probit, Cloglog                See Section 4.8.
Model Form                Cumulative Odds (default), Unrestricted Odds    See Section 4.9.
Zero Background Response  Yes (default), No                               See Section 4.10.
Dose                      Log10 (default), Linear
Time                      Log10 (default), Linear
Worst Case Analysis       Yes, No (default)                               See Section 4.11.
4.3 Running an Analysis
After specifying the parameters and options, click the Run Analysis button. CatReg will execute
the analysis according to your selections.
CatReg then displays the output file specified in the Output File box at the bottom of the Analysis
Screen.
4.4 About the CatReg Output File
CatReg's output file has the extension *.otx.
The output file provides summary information such as the name of the input file, the setup options
used, the table of coefficient estimates, an analysis of deviance table, and extra risk
concentrations (ERCs) with upper and lower confidence bounds at exposure durations of 1, 4, 8,
and 24 hours.
The .OTX output file is a simple text file that CatReg automatically saves to its Data subdirectory.
From the text file window, you can save, print, edit, and otherwise manipulate the file's text. You
can also set the file's font and line wrapping preferences.
CatReg by default suggests a name for the output file based on the loaded dataset. For example,
if the dataset is named "ChemZ.csv," then CatReg will save the output file as "ChemZ.otx".
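The naming convention can be reproduced in a script that needs to locate the output file afterwards. This Python sketch only mirrors the convention described above; it is not CatReg's own code, and the function name is hypothetical.

```python
from pathlib import Path

def default_output_name(dataset_path):
    """Replace the dataset's extension with .otx, keeping the base name."""
    return Path(dataset_path).with_suffix(".otx").name

print(default_output_name("ChemZ.csv"))
```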
4.5 Changing the Output File's Name and Location
To change the name and location of the output file, click the ... button beside the Output File field.
CatReg displays a Save As dialog box. Use this dialog to define a new location and filename.
By default, CatReg saves output files to its Data directory.
4.6 Stratifying
On the Model and BMD tab, select the variable(s) from the picklists to stratify by Intercept, Dose,
or Time.
Stratification is a way of allowing one or more of the regression parameters (intercept, coefficient
for concentration ("mg/m3"), and coefficient of time ("Hours")) to change when a specified variable
changes value. For example, instead of assuming a common intercept parameter for three
different species, stratification of intercept on the variable Species adds two more intercept
parameters so there is one for each species. Stratification of the intercept on three species
defines three subgroups (strata) of data, one for each species. The same parameter can be
stratified on more than one variable. For example, the intercept might be stratified on both
Species and Target. If there are two target organs for each of three species, then there are six
strata, each corresponding to a distinct combination of species and target. CatReg will provide
six intercept estimates, one for each species-target combination. To stratify the intercept on
Species and Target, enter those two variable names in response to CatReg's query on whether to
stratify the intercept (enter both on the same line, separated by a space, or on separate lines). In
the same way, CatReg queries for variables on which to stratify the coefficients of concentration
and time.
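The number of strata is simply the number of distinct combinations of the stratifying variables that occur in the data. The Python sketch below illustrates this with invented records; the third species code "GP" is hypothetical and is included only so that there are three species, as in the example above.

```python
import itertools

# Invented records: three species crossed with two target organs,
# each combination appearing twice.
species = ["MU", "RT", "GP"]          # "GP" is a hypothetical third species
targets = ["C", "L"]
records = [
    {"Species": s, "Target": t}
    for s, t in itertools.product(species, targets)
    for _ in range(2)
]

# One stratum per distinct (Species, Target) combination.
strata = sorted({(r["Species"], r["Target"]) for r in records})
print(len(strata), strata)
```

Stratifying the intercept on both variables would then give one intercept per element of this list.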
Stratification is often conducted to test if the value of a regression parameter (e.g., the intercept)
is the same for two or more values of a variable (e.g., Species). The user typically wants to
produce an exposure-response curve that is suitably "accurate" by taking account of different
parameter values that may occur between species, endpoints, etc., but that achieves that
objective as simply as possible (i.e., with the minimal number of parameters). Stratification can
be a way of implementing toxicological considerations, as the following example illustrates.
Suppose the data contain mortality results from experiments using rats and mice. The basic
explanatory variables are atmospheric toxicant concentration and duration of exposure. The
response score equals 0 for surviving animals and 1 for animals that died. In this case, the
maximum severity score is S = 1. Assuming C and T enter the model logarithmically, the basic
model has the form
\[ L[\Pr(Y = 1 \mid C, T)] = \alpha_1 + \beta_1 \log_{10}(C) + \beta_2 \log_{10}(T), \]
where L is the link function (the inverse function of H in Appendix C, Eq. 1a). Logarithmic scaling
is typical when the explanatory variables range over two or more factors of 10. Because of
different rates of respiration and metabolism among rats and mice, it may be reasonable to
assume the internal dose for rats should be rescaled compared to that of mice. One possibility is
to assume that a concentration of C for rats is equivalent to a concentration of kC for mice, where
k is common to all mice in the study. Then, for the mice,
\[
\begin{aligned}
L[\Pr(Y = 1 \mid C, T)] &= \alpha_1 + \beta_1 \log_{10}(kC) + \beta_2 \log_{10}(T) \\
&= [\alpha_1 + \beta_1 \log_{10}(k)] + \beta_1 \log_{10}(C) + \beta_2 \log_{10}(T).
\end{aligned}
\]
This shows that the mice, in effect, have a different intercept than do the rats, namely
\(\alpha_{1,\mathrm{MU}} = \alpha_{1,\mathrm{RT}} + \beta_1 \log_{10}(k)\), where MU and RT refer to mouse and rat parameters, respectively. By stratifying
the intercept parameter, the data are allowed to determine the estimate of the conversion factor k.
Whether k is significant would be determined by testing \(\alpha_{1,\mathrm{MU}} = \alpha_{1,\mathrm{RT}}\).
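Given fitted intercepts for the two species and the common concentration coefficient, the relation between the intercepts can be inverted to recover the implied conversion factor k. The Python sketch below uses hypothetical parameter values chosen only for illustration. It also confirms numerically that rescaling the concentration by k is equivalent to shifting the intercept.

```python
import math

# Hypothetical estimates: rat and mouse intercepts, common slope for log10(C).
alpha_rt, alpha_mu, beta1 = -12.0, -11.0, 4.0

# Intercept difference equals beta1 * log10(k), so solve for k.
k = 10 ** ((alpha_mu - alpha_rt) / beta1)

# Check: the rat intercept with concentration k*C gives the same value
# as the mouse intercept with concentration C (time term omitted).
conc = 1500.0
via_rescaling = alpha_rt + beta1 * math.log10(k * conc)
via_intercept = alpha_mu + beta1 * math.log10(conc)
print(round(k, 3), via_rescaling, via_intercept)
```

Equal intercepts correspond to k = 1, which is why a test of equal intercepts is a test of whether any rescaling is needed.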
4.7 Clustering
On the Model and BMD tab, select the variable(s) from the list boxes that are part of a cluster.
The list boxes display the variables for selection.
An input file may contain subsamples of data from a common source, which makes the observations
within a subsample more similar to each other than to observations from another source.
Example 1: Suppose there are reports from three "identical" experiments conducted at
three different laboratories. The data from each laboratory may be considered a cluster for
the following reasons:
- There are likely some differences among laboratories in the way subjects were fed, their
  animal suppliers, the age of the subjects, the conditions under which the subjects have been
  taken off the study, the protocol for histopathology (or just different histopathologists), etc.
- The differences can be viewed as random effects by thinking of the specific laboratories as a
  random sample among a population of laboratories.
- The magnitude of the differences among the specific laboratories may vary.
- The laboratories are reasonably homogeneous (i.e., there is not one or more of them that is
  unreliable or consistently different in some way from the others).
Example 2: Suppose that an experiment is conducted wherein pregnant female rats are exposed
to a toxic substance. Each rat gives birth to a litter and the pups are examined for specific health
effects. Each litter could be considered a cluster sample.
Clustering is necessary whenever there is reason to suspect that batches of data are correlated
(i.e., when the design of the study involves cluster sampling). The cluster variable should
uniquely identify each batch of correlated data.
Cluster labels might be text identifiers, identification numbers, combinations of variables, etc.
The only requirements are that observations from the same cluster have the same cluster label,
and those from different clusters have different cluster labels. If no cluster variables are
specified, the program treats all data as being independent.
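When no single column identifies the clusters, a label can be built from a combination of variables. The Python sketch below is a hypothetical preprocessing step, not a CatReg feature. It joins Ref.id and Exp (values taken from Table 4) into one label, so that each experiment within each source forms a cluster; whether that is the appropriate clustering for a given dataset is a judgment for the analyst.

```python
records = [
    {"Ref.id": 20938, "Exp": 1},
    {"Ref.id": 20938, "Exp": 2},
    {"Ref.id": 61831, "Exp": 1},
    {"Ref.id": 20938, "Exp": 1},
]

def cluster_label(record, keys=("Ref.id", "Exp")):
    """Join the chosen fields into one label, e.g. "20938-1"."""
    return "-".join(str(record[key]) for key in keys)

labels = [cluster_label(r) for r in records]
print(labels)
```

Records from the same source and experiment share a label, while Exp = 1 in two different sources yields two different labels, which satisfies the requirements stated above.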
CatReg assumes that responses from the same cluster are correlated, whereas observations
from different clusters are independent. It adjusts for the cluster sampling effect using the
method of generalized estimating equations (GEE). The cluster adjustment affects standard
errors, confidence limits and hypothesis tests (p-values), but it does not affect parameter
estimates or the deviance (a statistic used to measure the fit of exposure-response curve to the
data). For technical background on GEE, see Simpson et al. (1996b) and Diggle et al. (1994). It
also may be noted that cluster sampling invalidates the large sample F distribution of the
generalized F-statistic. However, it is common practice to compute F as a rough guideline (see
Venables and Ripley, 1994, p. 187). In any case, the R² statistic gives an idea of how much
variation in the response is accounted for by the explanatory variables (see Section 7 for
information on how F and R² are computed). Ignoring clusters of observations typically leads to
underestimation of variability in estimates and confidence bounds that are inappropriately narrow.
4.8 Link Function
A link function is a function applied to the exposure-response curve to transform it to a simple
linear relationship in concentration and duration. By also transforming the observed responses,
the link function reduces the mathematical complexity of estimating the parameters. The
parameter estimates then are substituted into the (untransformed) exposure-response curve.
CatReg provides three different link functions for the exposure-response curves.
Table 7: Link Functions
Link Function                    Corresponding Probability Function
Logit                            Logistic
Probit                           Normal
Cloglog (complementary log-log)  Gumbel
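The correspondence in Table 7 can be made concrete with the standard textbook forms of the three links and their probability functions. The Python sketch below uses those standard definitions and is not taken from CatReg's source code; all function names are chosen for this illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Link functions: map a probability p in (0, 1) to the linear scale.
def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return _STD_NORMAL.inv_cdf(p)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

# Corresponding probability functions: map the linear scale back to (0, 1).
def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    return _STD_NORMAL.cdf(x)

def gumbel(x):
    return 1.0 - math.exp(-math.exp(x))

PAIRS = {"logit": (logit, logistic),
         "probit": (probit, normal_cdf),
         "cloglog": (cloglog, gumbel)}

# Each probability function undoes its link.
for name, (link, prob) in PAIRS.items():
    print(name, round(prob(link(0.3)), 6))
```

The logistic and normal functions are symmetric about a probability of 0.5, whereas the Gumbel function is not, which is one reason the fits can differ.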
How well the different link functions fit the data may be compared using the AIC
(Akaike Information Criterion; Akaike, 1974). A link with a smaller AIC provides a closer fit to the
data. The AIC for different link functions may be compared only if there are no changes in the
data or in use of the data filtering and stratification options. Otherwise, the differences in AIC may
result from the changes.
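As a reminder of how such a comparison works, the Python sketch below applies the textbook definition of AIC, twice the number of parameters minus twice the maximized log-likelihood. The log-likelihood values are invented for illustration, and the sketch assumes the same data, filtering, and stratification for all three fits. The exact quantity CatReg reports should be taken from its output file.

```python
def aic(log_likelihood, n_params):
    """Textbook AIC: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log-likelihood, number of parameters) for three links.
fits = {
    "logit": (-101.3, 4),
    "probit": (-100.8, 4),
    "cloglog": (-103.9, 4),
}
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)
print(scores, best)
```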
The parameter estimates and their statistical characteristics, including standard errors and
significance levels, are routinely output by CatReg, along with an analysis of deviance table to
assess model fit and a table of estimates of extra risk concentrations (concentrations at which
extra risk is a user-specified value) for 1-, 4-, 8-, and 24-hour exposure durations.
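To give a sense of how an extra risk concentration relates to the fitted parameters, the Python sketch below solves for the concentration in a simplified setting. It assumes a logit link, log10 scaling of both concentration and time, a positive concentration coefficient, and zero background response, so that extra risk equals the probability of response itself. The parameter values are hypothetical, and confidence bounds are not computed.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def erc(bmr, hours, alpha_s, beta1, beta2):
    """Concentration at which the response probability equals bmr.

    Solves logit(bmr) = alpha_s + beta1*log10(C) + beta2*log10(hours) for C.
    """
    if not 0.0 < bmr < 1.0:
        raise ValueError("bmr must be between 0 and 1")
    if beta1 <= 0.0:
        raise ValueError("this sketch assumes beta1 > 0")
    log_c = (logit(bmr) - alpha_s - beta2 * math.log10(hours)) / beta1
    return 10 ** log_c

# Hypothetical intercept and coefficients for one severity level.
alpha_1, beta1, beta2 = -20.0, 6.0, 3.0
ercs = {t: erc(0.10, t, alpha_1, beta1, beta2) for t in (1, 4, 8, 24)}
print({t: round(c, 1) for t, c in ercs.items()})
```

With a positive time coefficient, the concentration needed to reach the same risk decreases as the exposure duration increases.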
4.9 Model Form
CatReg provides a choice of two models: Model 1 (the cumulative odds model) and Model 2 (the
unrestricted cumulative model).
The Cumulative Odds Model is described by the following equation:
\[ \Pr(Y \geq s \mid C, T) = H[\alpha_s + \beta_1 f_1(C) + \beta_2 f_2(T)] \qquad \text{(Model 1)} \]
The Unrestricted Cumulative Model is described by the following equation:
\[ \Pr(Y \geq s \mid C, T) = H[\alpha_s + \beta_{1s} f_1(C) + \beta_{2s} f_2(T)] \qquad \text{(Model 2)} \]
CatReg refers to any model of the form of Model 1 as a cumulative odds model because the
model is expressed in terms of the cumulative probabilities, or odds, for Y > s.
Note that Model 1 is a special case of Model 2 wherein parameters pi s and P2s do not depend
on s (which denotes severity level). That is, Model 1 is a simplification of Model 2 in which only
the intercept term can vary across severity levels, not the coefficients of concentration or time (a
restriction called parallelism).
In other words, the cumulative odds model (Model 1) states that:
the probability that a severity level s or higher will occur at a given concentration (C) and time
(T) is given by the exposure-response curve (logistic, normal, or Gumbel, determined by the
choice of link function), and
the intercept parameters may differ by severity level (i.e., a different intercept for each
severity level), but the coefficients for C and T do not differ by severity.
A primary use of fitting Model 2 is to test whether the simpler Model 1 is adequate. Model 2,
although more general than Model 1, has the undesirable feature that the regression lines for
different severity levels may cross. Often the crossing is well outside the range of values of
interest, so the model can still be used to make empirical risk estimates. In some circumstances,
the user has the option to add an additional parameter, $\gamma$ (gamma), which represents a
hypothetical background concentration.
4.9.1 Understanding the Model Equations
The left-hand side of the Model 1 and Model 2 equations above is read as follows: the probability
that a response of severity level s or greater occurs, given that concentration is C and time is T
(time refers to exposure duration). No expression for s = 0 is included because this is the minimal
category, and Y is always greater than or equal to 0 (i.e., $\Pr(Y \ge 0 \mid C, T) = 1$).
The right-hand side is described as follows:
H is a probability function taking values between 0 and 1, for which the user has three
choices: logistic, normal, and Gumbel.
The parameter $\alpha_s$ is the intercept for severity level s, s = 1,...,S (to be called the
intercept or severity parameters). The severity parameters are ordered as
$\alpha_1 \ge \alpha_2 \ge \dots \ge \alpha_S$. This constraint
is a consequence of the requirement that the probability of exceeding a lower score is larger
than the probability of exceeding a higher score for any fixed levels of C and T.
In Model 1, the parameter $\beta_1$ determines the dependence of the response on concentration
(C), whereas $\beta_2$ determines the dependence on time (T). In Model 2, the parameters are also
indexed by s because they may change values with severity level s.
Current choices for $f_1$ and $f_2$ are "untransformed" and "base-10 logarithm." Other
transformations of C and T may be obtained by transforming the input data.
Parameters are $\alpha_s$, $\beta_{1s}$ (to be called the coefficient of concentration), and
$\beta_{2s}$ (to be called the coefficient of time or duration), for severity levels s running
from 1 to S. All parameters may be stratified on variables.
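The way the model equations turn parameters into probabilities can be sketched as follows. This is a minimal illustration, assuming the logit link and base-10 logarithm scales for both concentration and duration; the function names and the parameter values are hypothetical and are not CatReg estimates.

```python
import math


def logistic(x: float) -> float:
    """Logistic probability function H (logit link)."""
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_probs(conc, hours, alpha, beta_conc, beta_time):
    """Return Pr(Y >= s | C, T) for s = 1..S.

    alpha, beta_conc, and beta_time are lists indexed by severity level
    (Model 2). For Model 1, repeat the same slope for every severity.
    """
    log_c = math.log10(conc)
    log_t = math.log10(hours)
    return [
        logistic(a + b1 * log_c + b2 * log_t)
        for a, b1, b2 in zip(alpha, beta_conc, beta_time)
    ]


def category_probs(cum):
    """Return Pr(Y = s) for s = 0..S by differencing cumulative probabilities."""
    bounds = [1.0] + list(cum) + [0.0]  # Pr(Y >= 0) = 1, Pr(Y >= S + 1) = 0
    return [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]


# Hypothetical Model 1 parameters: ordered intercepts, common slopes.
alpha = [-7.5, -9.5]
cum = cumulative_probs(100.0, 10.0, alpha, [2.5, 2.5], [2.0, 2.0])
print(cum)                  # Pr(Y >= 1), Pr(Y >= 2)
print(category_probs(cum))  # Pr(Y = 0), Pr(Y = 1), Pr(Y = 2)
```

With common slopes (Model 1) and ordered intercepts, the cumulative probabilities are automatically ordered at every exposure. With severity-specific slopes (Model 2) that ordering can fail at some exposures, which is the line-crossing issue noted in Section 4.9.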
4.10 Zero Background Response
You can choose Yes or No for the Zero Background Response option.
If you choose Yes, then the implied probability of an adverse response at zero concentration
is zero and observations at zero concentration are uninformative (treated the same as if they
were filtered out).
If you choose No, then CatReg adds a hypothetical background concentration to the
administered concentrations, denoted as the parameter $\gamma$ (gamma), which is estimated by
maximum likelihood and displayed in the summary table of parameter estimates.
For example, an experimental concentration of 50 mg/m3 is treated as an observation at
concentration (50 + $\gamma$) mg/m3, where $\gamma$ is estimated from the data simultaneously
with the other model parameters. If the logarithmic option is chosen for time (Hours), the implied
probability of an adverse response at zero time is zero and observations at zero time are
uninformative (treated the same as if they were filtered out).
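The effect of the background concentration on the log scale can be sketched as below. This is an illustration only, assuming the base-10 logarithm scale for concentration; the function name is hypothetical, and the gamma value of 20 is simply an example value, not an estimate.

```python
import math


def log10_effective_conc(conc: float, gamma: float = 0.0) -> float:
    """Return log10(C + gamma), or -infinity when C + gamma is zero.

    gamma = 0.0 corresponds to assuming zero background response.
    """
    total = conc + gamma
    return math.log10(total) if total > 0 else float("-inf")


# Zero background: a zero concentration has no finite log value, so the
# record carries no information about the curve.
print(log10_effective_conc(0.0))

# With a background concentration, 50 mg/m3 is treated as (50 + gamma) mg/m3
# and zero-concentration records become usable.
print(log10_effective_conc(50.0, gamma=20.0))
print(log10_effective_conc(0.0, gamma=20.0))
```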
When C (concentration) is log-transformed, the probability of an adverse effect of level 1 or
higher approaches zero as concentration approaches zero. If there is a positive probability of an
adverse effect even when concentration is zero (i.e., a so-called background response not
attributable to exposure), then the user can modify the probability function in Appendix C, Eq.
1a,b. When the user chooses the log scale for concentration, an option appears on the screen:
Assume zero background risk (i.e., response cannot occur at zero concentration)?(y). The
exposure-response curve in Eq. 1a,b is modified by adding a hypothetical background
concentration level, $\gamma$ (gamma), to the administered concentration C given in the input
file (as variable mg/m3). The parameter $\gamma$ is estimated by maximum likelihood simultaneously
with the other parameters and the result is added to the summary table of parameter estimates. If
you set Zero Background Response to Yes, then the data records where concentration is zero are
non-informative and CatReg ignores them (i.e., effectively filters those data). That reduces the
total degrees of freedom compared to a response of "n", which uses the data where concentration is
zero.
Table 8 lists an input file that was generated by simulation using log-transformed C and
hypothetical background concentration $\gamma$ = 20 mg/m3. Evidence of non-zero background risk is
apparent in the occurrence of both severity levels 1 and 2 when mg/m3 = 0. The other parameter
values used for simulation are $\alpha_1$ = -7.5, $\alpha_2$ = -9.5, $\beta_{11}$ = 2.5,
$\beta_{12}$ = 2.0, $\beta_{21}$ = 2.0, $\beta_{22}$ = 1.8. The data were simulated for the
unrestricted cumulative model (Model 2) with the logit link and log-transformed T (exposure
duration). Setting Zero Background Response to No tells CatReg to add the parameter $\gamma$. The
output file is displayed in Table 9. The evidence of background risk is not significant in this
data set: the estimate of gamma is 16.0 with standard error 28.1, which is not significantly
different from zero.
This example suggests that one might need a substantial background effect for it to be significant,
at least for small sample sizes. The current example consists of treatment groups of size 10 each
at 16 exposures (concentrations of 0, 80, 180, and 480; durations of 20, 100, 200, and 500
hours). To examine the effect of sample size further, the same example was repeated but with
treatment group sizes of 50, 100, and 5,000. The estimate, standard error, and significance level
of gamma, for sample sizes of 50, 100, and 5,000, respectively, were: (31.9, 20.7, 0.12), (12.6,
7.7, 0.10), and (21.5, 1.6, <10^-5). For this example, estimates of gamma appear to become
statistically significant and to converge to the neighborhood of the correct value, 20, very slowly.
When 1000 datasets were simulated with treatment groups of size 10, the median estimate of
gamma was 19.7 (20.0 was the value of gamma used for simulation), but the standard error of
the estimates was high (69.3).
The smallest positive value of concentration in an input file is used as a practical upper bound
on gamma. If the estimate of gamma is set to the boundary value, a message appears in the output:
Warning: Gamma hit its maximum bound!!! The estimate shown for gamma, and the
other parameters in the summary table of estimates, are not maximum likelihood estimates in that
case, and the user is advised to consider the setup option that assumes the background risk is zero.
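The bound described above can be illustrated with a short sketch. The function name is hypothetical and this is not CatReg's internal code; the concentrations are those of the example dataset described in this section.

```python
def gamma_upper_bound(concentrations):
    """Return the smallest positive concentration, used as the bound on gamma."""
    positives = [c for c in concentrations if c > 0]
    if not positives:
        raise ValueError("no positive concentrations in the input data")
    return min(positives)


# Concentrations in the example dataset: 0, 80, 180, and 480 mg/m3.
print(gamma_upper_bound([0, 80, 180, 480]))
```

For the example dataset the bound is the lowest non-zero exposure level, so an estimate of gamma at that value would trigger the warning.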
Table 8: Input file backgd10.csv
mg/m3   Hours   SevLo   Nsub   Incid
0       20      0       10     8
0       20      1       10     2
0       20      2       10     0
0       100     0       10     6
0       100     1       10     4
0       100     2       10     0
0       200     0       10     5
0       200     1       10     4
0       200     2       10     1
0       500     0       10     3
0       500     1       10     7
0       500     2       10     0
80      20      0       10     4
80      20      1       10     6
80      20      2       10     0
80      100     0       10     2
80      100     1       10     8
80      100     2       10     0
80      200     0       10     1
80      200     1       10     8
80      200     2       10     1
80      500     0       10     0
80      500     1       10     4
80      500     2       10     6
180     20      0       10     3
180     20      1       10     6
180     20      2       10     1
180     100     0       10     3
180     100     1       10     7
180     100     2       10     0
180     200     0       10     0
180     200     1       10     5
180     200     2       10     5
180     500     0       10     1
180     500     1       10     4
180     500     2       10     5
480     20      0       10     0
480     20      1       10     8
480     20      2       10     2
480     100     0       10     1
480     100     1       10     8
480     100     2       10     1
480     200     0       10     0
480     200     1       10     2
480     200     2       10     8
480     500     0       10     0
480     500     1       10     4
480     500     2       10     8
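A quick consistency check on an input file of this shape can be sketched as follows. It assumes a comma-separated layout with the five column names shown in Table 8, uses only a few rows copied from the table as sample data, and relies on the observation that the incidences within each treatment group of Table 8 sum to Nsub. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from Table 8 (assumed comma-separated layout).
SAMPLE = """mg/m3,Hours,SevLo,Nsub,Incid
0,20,0,10,8
0,20,1,10,2
0,20,2,10,0
480,200,0,10,0
480,200,1,10,2
480,200,2,10,8
"""


def incidence_by_group(text):
    """Group incidence counts by (concentration, duration) and check totals."""
    counts = defaultdict(dict)
    group_size = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (float(row["mg/m3"]), float(row["Hours"]))
        counts[key][int(row["SevLo"])] = int(row["Incid"])
        group_size[key] = int(row["Nsub"])
    for key, by_severity in counts.items():
        if sum(by_severity.values()) != group_size[key]:
            raise ValueError(f"incidences do not sum to Nsub for group {key}")
    return dict(counts)


groups = incidence_by_group(SAMPLE)
for key, by_severity in groups.items():
    print(key, by_severity)
```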
Table 9: CatReg summary from output file of CatReg. Input file: backgd10.csv. Model 2. Logit.
Scales: log10. Estimated background risk.
Input file     : backgd10.csv
Filtered data  : none
Model          : unrestricted cumulative model
Link           : logit
Clustering     : none
Message        :
Iterations     : 26 10
Deviance       : 239.8087
Residual DF    : 25
AIC            : 253.8087
Scale:
Concentration: log10( mg/m3 )
Duration     : log10( Hours )
Stratification:
No Stratification on Intercept, Concentration and Duration.
Coefficients:
                 Estimate     Std. Error   Z-Test=0     p-value
SEV1             -6.279480    3.5472283    -1.7702497   0.07669
SEV2             -13.394898   33.7617732   -0.3967475   0.69155
LG10CONC:SEV1    2.305462     1.4596636    1.5794473    0.11423
LG10TIME:SEV1    1.567190     0.4607761    3.4011975    0.00067
LG10CONC:SEV2    2.747634     0.6149344    4.4681741    0.00001
LG10TIME:SEV2    2.777898     0.6190206    4.4875700    0.00001
Gamma            15.956845    28.1522306   0.5668057    0.57085
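The AIC reported in Table 9 is numerically consistent with the deviance plus twice the number of estimated parameters (seven in this fit). The sketch below reproduces that arithmetic; it is an observation about these reported numbers rather than a statement of how CatReg computes AIC internally, and the function name is hypothetical.

```python
def aic_from_deviance(deviance: float, n_params: int) -> float:
    """AIC taken as deviance + 2 * (number of estimated parameters)."""
    return deviance + 2 * n_params


# Table 9: deviance 239.8087 with 7 coefficients (SEV1, SEV2, four slopes, Gamma).
print(aic_from_deviance(239.8087, 7))
```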
4.11 Worst Case Analysis
CatReg provides an option to do a worst-case analysis when there is at least one record in the
input file that contains censored data.
In a worst-case analysis, censored responses are treated as occurring at their highest possible
(worst) severities.
Although the graphical presentation of censored points will not change in a worst-case analysis,
higher estimates of risk will be produced than those for the corresponding censored analysis.
Comparison of risk estimates from the two methods provides an indication of the sensitivity of the
results to the severity scoring.
When the worst-case option is selected, the output file contains the line: Type of analysis:
Worst-case.
When the worst-case option is not selected, the output line is: Type of analysis: Censored.
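The worst-case rule can be sketched as follows, assuming a censored response is recorded as a range of possible severities from a lowest to a highest value (for example, [0,1]). The function and argument names are hypothetical.

```python
def worst_case_severity(sev_lo: int, sev_hi: int) -> int:
    """Return the severity used in a worst-case analysis.

    A censored response spans sev_lo..sev_hi; the worst case takes the
    highest possible severity. An uncensored response has sev_lo == sev_hi.
    """
    if sev_lo > sev_hi:
        raise ValueError("sev_lo must not exceed sev_hi")
    return sev_hi


print(worst_case_severity(0, 1))  # censored as [0,1]
print(worst_case_severity(2, 2))  # uncensored
```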
5.0 PLOTS TAB
CatReg includes six functions for making plots after an exposure-response curve has been fit to
the data.
Figure 5. Plots generated by CatReg. [Screenshot: Analysis Screen with the Plots tab selected.]
5.1 Copying and Printing Plots and Output Files
Right-click on any plot in the window and select Copy to Clipboard.
You can then paste the plot into PowerPoint, GnuPlot, or another graphics application. Within that
application, you can print or manipulate the components of the plot.
5.2 Plot Functions and Options
The following tables describe the plot functions and specific options available within CatReg.
Table 10: CatReg Plot Functions
Function              Description
prplot(gp=x)          Probability of exceeding a specified severity level on y-axis, concentration
                      or duration on x-axis. Plots the probability curve as a function of dose,
                      keeping duration fixed, or as a function of duration with concentration
                      fixed. Use the dropdown list to set the duration or concentration level to
                      be used.
catplot()             Concentration on y-axis, duration on x-axis. Plot of ERC line with
                      confidence interval for a single severity level and stratum, and response
                      data for all severity levels.
stratplot()           Concentration on y-axis, duration on x-axis. Plot of ERC lines and response
                      data for all strata for a single severity level.
confplot(Duration)    Strata for a single severity level displayed on y-axis, concentration on
                      x-axis. Plot of ERC with confidence interval for the unstratified model, and
                      for individual strata of the model, for a single exposure duration.
dataplot()            Concentration on y-axis, duration on x-axis. Plots response data for all
                      severity levels combined by stratum.
devplot()             Generalized deviance residuals on y-axis and data observation number, log
                      dose, or log-duration on x-axis.
5.3 Concentrations and Durations for Designated ERC and Severity (catplot)
The command catplot() plots the response data and graphs the extra risk concentration (ERC) with
confidence interval, for a single severity level and stratum.
Exposure concentration is on the y-axis and exposure duration on the x-axis.
The current ERC settings are used for ERC percentile and severity level.
The confidence interval percentile is 90% (two-sided) and cannot be overridden.
By default, concentration and duration are graphed on a log-linear scale. This type of graph is
useful for showing how extra risk changes with concentration or duration. Be aware that the
weighting of individual points is not shown.
The number of points in each severity category will be displayed in the R command window,
along with the number of hidden points.
Each call to catplot generates a new graphics window. Thus, repeated use of catplot allows
comparison of results across strata.
Figure 6. Illustration of catplot(). [Plot title: ERC10 Line (SEV1:MU:C) with 90% Two-sided
Confidence Bounds, Link = logit. X-axis: Duration (Hours). Legend: No effect, Adverse, Severe,
Censored. Plot footnote: These 90% two-sided confidence bounds are equivalent to 95% one-sided
confidence bounds in each direction.]
The x-y locations of symbols on the graph indicate the exposure concentrations and durations of
observations on mice with the central nervous system as the target organ. The symbol itself
indicates the severity category, as shown in the legend. The lines on the graph are the estimated
ERC10 (solid line) and upper and lower 95% one-sided confidence bounds (dashed lines)
(equivalently, two sided 90% confidence bounds). Slicing the graph vertically at a specific
duration gives the confidence interval that confplot would graph for that duration. As expected,
longer durations require lower concentrations to achieve the estimated 10% level of extra risk.
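The downward slope of the ERC line with duration can be illustrated with a small calculation. The sketch assumes Model 1 with the logit link, base-10 logarithm scales for concentration and duration, a positive concentration coefficient, and zero background response, in which case the response probability at zero concentration is zero and the extra risk equals the response probability itself. The parameter values are hypothetical, not CatReg estimates, and the function names are illustrative.

```python
import math


def logit(p: float) -> float:
    """Inverse of the logistic probability function."""
    return math.log(p / (1.0 - p))


def erc_zero_background(extra_risk, hours, alpha, beta_conc, beta_time):
    """Concentration at which Pr(Y >= s | C, T) equals extra_risk.

    Solves alpha + beta_conc*log10(C) + beta_time*log10(T) = logit(extra_risk)
    for C, assuming beta_conc > 0 and zero background response.
    """
    log10_c = (logit(extra_risk) - alpha - beta_time * math.log10(hours)) / beta_conc
    return 10.0 ** log10_c


# Hypothetical parameters for severity level 1.
ALPHA, BETA_CONC, BETA_TIME = -7.5, 2.5, 2.0

for hours in (1.0, 4.0, 8.0, 24.0):
    erc10 = erc_zero_background(0.10, hours, ALPHA, BETA_CONC, BETA_TIME)
    print(hours, erc10)
```

Because the duration coefficient is positive, the solved concentration falls as duration increases, which is the pattern the ERC10 line shows.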
5.4 Concentrations and Durations for Designated Probability and Severity of Strata
(stratplot)
The command stratplot() plots the response data by stratum for all severity levels and graphs the
extra risk concentration (ERC) (without confidence interval) by stratum for a single severity level.
Exposure concentration is on the y-axis and exposure duration on the x-axis.
The current ERC settings are used (for the ERC percentile and ERC severity level).
Response to a pop-up menu determines whether to include a legend.
By default, concentration and duration are graphed on a log-linear scale.
Be aware that the weighting of individual points is not shown. The number of points in each
severity category will be displayed in the R command window, along with the number of hidden
points.
This graph provides comparison of the ERC curves for different strata by plotting them on the
same graph.
Figure 7. Illustration of stratplot(). [Plot title: All Strata: ERC10 lines at SEV=1, Link =
logit. X-axis: Duration (Hours).]
5.5 Extra Risk Concentration at Desired Duration (confplot)
The command confplot(Duration) displays the extra risk concentration (ERC) with confidence
interval for the unstratified model and for individual strata of the stratified model, for a single
severity level at the exposure duration set in the "Duration" field. The default value is 10, for
10 hours exposure.
The time argument is required unless duration is not included in the data as an explanatory
variable.
The current ERC settings are used for the ERC percentile and severity level.
The confidence interval percentile is 90% (two-sided).
This graph is useful for comparing ERC estimates and confidence intervals among strata, and
comparing individual strata with the unstratified model.
Figure 8. Illustration of confplot(). [Plot title: ERC10 (SEV1) with 90% Two-sided Confidence
Bounds. X-axis: Concentration (Duration(Hours) = 2, Link = logit). Plot footnote: These 90%
two-sided confidence bounds are equivalent to 95% one-sided confidence bounds in each direction.]
The above figure was produced with Duration=2. Concentration is on the x-axis. The central dot
for each stratum is the estimated ERC10 for severity level 1.
The vertical solid line is the estimated ERC10 for the unstratified exposure-response curve, and
the vertical dashed lines are the associated confidence intervals (90% for two-sided bounds,
determined from the ERC settings).
To display the confidence interval for another duration, repeat the confplot command with the
Duration field reset to the new duration.
5.6 Probability Versus One Explanatory Variable (prplot)
The command prplot(gp=x) displays the probability curve (i.e., the exposure-response curve) as a
function of concentration, keeping duration fixed, or as a function of duration with concentration
fixed.
If either of these variables is constant in the data, then prplot graphs the probability against the
nonconstant variable.
If both concentration and duration vary in the dataset, then you need to tell prplot which
variable to hold constant. You can specify either Time or Dose from the "gp=" dropdown list. For
example, to plot the response probability versus duration at a fixed concentration, select
gp=Dose. To plot the response probability versus concentration at a fixed duration, select
gp=Time.
The severity level is determined by the current ERC settings. The choice of stratum and whether
to include a legend are determined by response to pop-up menus.
The function prplot can help assess whether the fitted probability curve is consistent with the
data, and can represent the risk over a range of exposure levels.
Figure 9 was produced with gp=Time and x=1.25. The curve plots the probability of the
occurrence of a severity level 1 response or greater for the liver target organ in mice (species
MU) as a function of concentration, with duration fixed at 1.25 h. The value used for time need not
be an exposure time in the dataset. When that occurs, the probability curve is displayed but there
are no data to display.
In these graphs, the vertical location of a symbol represents whether the response at a particular
concentration was equal to or greater than the severity of interest (adverse, severity 1, for
Figure 9). No effect responses are plotted at Pr = 0. Adverse effects and severe effects are
both plotted at Pr = 1. There are no censored observations. If the dataset had contained some
observations censored as [0,1], they would have been graphed at Y = 0.5 on the severity ≥ 1
graph, and at Y = 0 on the severity ≥ 2 graph. This is because, in the first case, it is not
known whether the censored observation meets the severity threshold (i.e., whether severity ≥ 1),
whereas in the second case the severity is known to be less than 2.
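The plotting rule just described can be written as a small function. This is a sketch of the rule as stated in the text, not prplot's actual code; it assumes each observation is stored as a lowest and highest possible severity, and the names are hypothetical.

```python
def plot_height(sev_lo: int, sev_hi: int, threshold: int) -> float:
    """Vertical position of an observation on a 'severity >= threshold' graph.

    1.0 : response is known to meet the threshold
    0.0 : response is known to fall below the threshold
    0.5 : censored response that may or may not meet the threshold
    """
    if sev_lo >= threshold:
        return 1.0
    if sev_hi < threshold:
        return 0.0
    return 0.5


print(plot_height(0, 1, threshold=1))  # censored as [0,1] on the severity >= 1 graph
print(plot_height(0, 1, threshold=2))  # same observation on the severity >= 2 graph
print(plot_height(2, 2, threshold=1))  # severe effect on the severity >= 1 graph
```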
Figure 9. Illustration of prplot(). [Plot title: Duration(Hours) = 1.25, Stratum = SEV1:MU:L.
X-axis: Concentration (mg/m3).]
5.7 Data Plotted by Stratum (dataplot)
The command dataplot() plots the response data for all severity levels by stratum, without a
response probability curve, as shown in the following figure.
Exposure concentration is on the y-axis and exposure duration on the x-axis. Notice that this
figure is very similar to the stratplot() figure, except that the dataplot() figure does not show
the ERC10 lines.
[Figure: dataplot() output. Plot title: Data plot (all SEV points) with strata.]
5.8 Contribution to Deviance for Individual Datum (devplot)
Generally, plotting deviances versus observation number (Obs.Number) is a good choice. It
provides a representation of the relative effectiveness of the model in fitting the different
observations or strata. Devplot does not require jittering because each observation is represented
uniquely under the Obs.Number option. The fit is suspect if one or a few observations have much
larger deviance residuals than the remaining observations because the fit may be unduly
influenced by these observations. If one stratum has large deviances, the model may be
inadequate for this stratum. Rerunning the model without this stratum would allow one to
determine whether the results for the other groups are heavily influenced by the poorly fit subset.
Plots of deviance versus dose (or log-dose) or time (or log-time) are useful in studying the
adequacy of the functional form of the regression relationship. Trends in the deviances would
suggest a problem with the functional form. One should be aware that differences in the density
of the data at different concentrations or durations will affect the perception. Regions of the plot
with more data will tend to have more spread in the deviances because of random variation, even
if the model is adequate for the data.
The following figure was produced by selecting Observation # from the pop-up menu.
[Figure: devplot() output. Plot title: Deviance plot.]
relatively poorly described by the curve (i.e., residual deviances are relatively high) in comparison
to most of the other data. These data may or may not be influential on the fitted exposure-
response curve. To examine whether they are, the curve could be refit after filtering out liver data
for mice.
Figure 12. Illustration of devplot(), Example 2. [Plot title: Deviance plot. X-axis: Obs. Number.]
6.0 HYPOTHESES TAB
Figure 13. Hypotheses tab. [Screenshot: Categorical Regression Version 3.0.1.0 Beta, Analysis
Screen, Hypotheses tab. The Automatic Tests pane reports an equality test of SEV1 and SEV2
(Chisquare 130.4604, df 1, p-value 1e-05). The User-defined Tests pane reports removal tests for
LG10TIME (Chisquare 111.691, df 1, p-value 1e-05) and LG10CONC (df 1, p-value 1e-05).]
From the Hypotheses tab, you can test exposure-response hypotheses. The Hypotheses tab is
disabled until you have run an analysis. After running an analysis, the controls are enabled and
eligible parameters from the analysis are loaded into the Intercept, Dose, and Time parameter
lists.
CatReg automatically runs parameter equality tests, based on the dataset contents. Those
results are displayed in the Automatic Tests section.
To set up a test, click and drag parameters from the Available Parameters box to the appropriate
Test for Equality boxes. For example, under Intercept, you can drag SEV1 to the left Test for
Equality box and SEV2 to the right box.
Click Clear Tests to clear the Test for Equality boxes and start over.
Click Run Tests to run the specified hypothesis tests.
After running the tests, CatReg displays a new window showing the text-based results. The
window's title bar indicates the file name and location (in CatReg's Data\OptionFiles
directory).
The hypotheses results are saved as a text file with the extension *.ANX.
Figure 14. Hypotheses results. [Screenshot: Hypotheses Result window for
C:\usepa\CatReg\Data\OptionFiles\chemx.anx, reporting an equality test of SEV1 and SEV2:
Chisquare 137.13, df 1, p-value 1e-05. The window states that the p-value of the equality test is
<= 0.05, which is generally considered significant, indicating that all the tested parameters
should be retained in the model.]
6.1 Testing Parameters
The output file includes a simple test of the hypothesis that a parameter is zero for each
parameter in the table. Dividing the estimated coefficient by the standard error provides a Z
statistic for the corresponding parameter. This statistic provides a one-degree-of-freedom test of
the null hypothesis that the parameter equals zero.
Under the null hypothesis, the Z statistic has an approximate standard normal distribution. The
larger the sample, the better the approximation is.
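The Z statistic and its p-value can be reproduced by hand. The sketch below uses the Gamma row of Table 9 as input and assumes a two-sided test against the standard normal distribution, which matches the p-value reported in that table; the function name is hypothetical.

```python
import math


def z_test(estimate: float, std_error: float):
    """Return the Z statistic and two-sided p-value for H0: parameter = 0."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Gamma row of Table 9: estimate 15.956845, standard error 28.1522306.
z, p_value = z_test(15.956845, 28.1522306)
print(round(z, 7), round(p_value, 5))
```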
The p-value in the table gives the significance level of the test. A parameter that is not
significantly different from zero may be considered a candidate for removal and simplification of
the exposure-response curve. The p-values apply to individual parameters considered singly,
however, and further testing is needed to test the joint hypothesis that more than one parameter
is zero, that two or more parameters are equal, or a combination of the two.
CatReg lists all the parameters in the Parameters list box. You can click on a parameter to select
it, and then click the appropriate buttons to test the parameter(s) for removal or to select a group
of parameters to test for equality (the default is none in both cases).
The idea is to express the hypothesis to be tested as a set of constraints on the model
coefficients.
A test is then conducted of the hypothesis or joint hypotheses, if more than one was entered. The
test is a (generalized Wald-type) chi-square test of the null hypothesis that all of the specified
constraints hold. The distribution of the test statistic is derived from the sampling distribution of
the estimated model coefficients, and it takes into account any cluster sampling.
One may want to test for a gender difference, or whether there are interspecies differences in the
exposure-response.
For such a test, the null hypothesis is that the specified parameters are equal. A p-value less than
0.05 is usually taken as evidence that the hypothesis should be rejected (i.e., the specified
parameters are not equal).
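For a single equality constraint, a Wald-type chi-square test has a simple closed form, sketched below. This is the textbook one-constraint case only, with hypothetical estimates, variances, and covariance supplied by hand; CatReg's generalized test handles multiple constraints and cluster sampling, which this sketch does not attempt.

```python
import math


def wald_equality_test(b1, b2, var1, var2, cov12=0.0):
    """Wald chi-square test (1 df) of H0: b1 = b2.

    var1 and var2 are the sampling variances of the two estimates and
    cov12 is their covariance.
    """
    diff_var = var1 + var2 - 2.0 * cov12
    chi_square = (b1 - b2) ** 2 / diff_var
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical concentration coefficients for two species.
chi_square, p_value = wald_equality_test(3.0, 1.0, 0.5, 0.5)
print(chi_square, p_value)
```

A p-value below 0.05 in this sketch would be read the same way as in the text: the two coefficients are taken to differ.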
The parameters to be tested as equal must, of course, be included in the exposure-response
curve; this is accomplished by stratifying. For example, to test that there is no difference between
species in the coefficients of concentration, the user would stratify the Dose parameter on
species on the Dataset and Variables tab.
When CatReg queries which coefficients to test for equality, the parameters to be tested (if any)
are entered from the Parameters list on the Hypotheses tab.
To test for no difference between species in the coefficients of concentration, you would enter the
coefficient for each species. Some care needs to be used when tests involve the intercept
parameters because of the way they are represented as increments relative to a reference.
In conducting tests, it is necessary to be aware of which estimates are incremental changes from
others, as in the case of some intercept parameters, and which are "stand alone", as this can
affect how a test is constructed.
7.0 ASSESSING MODEL FIT
This section focuses on the fit of the exposure-response curve rather than on its parameters. In
linear regression analysis, it is common practice to consider the proportion of variation accounted
for by the model, the so-called R2 statistic, as a measure of the model's explanatory power. This
statistic, which ranges in value between 0 and 1, is the ratio of the model and total sums of
squares. These sums of squares, along with degrees of freedom, F-tests, and so on usually are
reported in the form of an analysis of variance table. Standard texts such as Weisberg (1985)
describe the use and interpretation of these statistics.
CatReg provides generalized analysis of variance and R2 statistics for assessing the explanatory
capacity of the exposure-response curve. These are derived from deviance statistics for
hierarchical models. Following McCullagh and Nelder (1989) and Venables and Ripley (1994),
this type of analysis is called the analysis of deviance. The analysis of deviance statistics and R2
statistic are in a table in the output file. The command deviance.fit() will also cause them to
be calculated and written to the output file.
After running CatReg, the fitted curve is available for further analysis. As an example of the
analysis of deviance, consider the output file in Table 5-5. The summary output in the table,
which shows the coefficient estimates, indicates that seven parameters have been estimated.
There are 64 treatment groups (2 species × 2 targets × 4 concentrations × 4 durations), so total
degrees of freedom (df) is 64 × (3 severity levels - 1) = 128 (unadjusted for SEV1 and SEV2). The
estimates of the SEV1 and SEV2 intercepts provide no information about how well the curve fits
the data, so it is customary to adjust the total degrees of freedom for them in the analysis of
deviance table, leaving 128 - 2 = 126 degrees of freedom. The model has 5 parameters, aside
from SEV1 and SEV2, which are MU:L:INTERCEPT, RT:C:INTERCEPT, RT:L:INTERCEPT,
LG10CONC, and LG10TIME. This leaves 126 - 5 = 121 degrees of freedom for the residual
deviance.
The analysis of deviance table for the example shown in Table 5-5 partitions the total deviance
into the sum of two components, the "model" deviance and the "residual" deviance. The total
deviance is the deviance when the only parameters in the exposure-response curve (or "model"
in the terminology being used here to show the comparison with the analysis of variance) are the
intercepts, SEV1 and SEV2. This is referred to as the null model because it contains no
explanatory variables. The residual deviance is simply the deviance of the fitted model, which
includes the 5 parameters, aside from SEV1 and SEV2, i.e., MU:L:INTERCEPT,
RT:C:INTERCEPT, RT:L:INTERCEPT, LG10CONC, and LG10TIME, as explanatory variables.
The model deviance is that part of the total deviance that is explained by the model (the
proportion is the generalized R²). In this example, 46.4% of the total deviance is explained by the
model. R² is a general measure of the proportion of the variation in the response that is
accounted for by the explanatory variables.
The mean deviance entries, labeled as "Mean.Dev," are computed as deviance divided by
degrees of freedom. An approximate F-test of the model is obtained as the ratio of the model
mean deviance to the residual mean deviance. This is an approximate F-statistic in large samples
under the ordinal regression model with independent responses. This statistic tests the null
hypothesis that all explanatory variables can be dropped from the model. The same hypothesis
may be tested using partest. The two results will be similar if the responses are independent and
the residual degrees of freedom are reasonably large, say larger than 15.
Generally, the F-test will reject the null hypothesis unless the sample size is very small or the
model fits poorly. It merely verifies that there is some relationship between the response and the
explanatory variables. The generalized R² statistic often will be of more direct interest as a
measure of the explanatory value of the variables in the model.
Cluster sampling invalidates the large sample F distribution of the generalized F-statistic.
However, it is common practice to compute F as a rough guideline; see Venables and Ripley
(1994, p. 187). In any case, the R² statistic gives an idea of how much variation in the response
is accounted for by the explanatory variables.
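The degrees-of-freedom bookkeeping and the summary statistics described in this section can be sketched in a few lines of Python. This is an illustration only, not CatReg code: the function name is an assumption, and the two deviance values are hypothetical placeholders (they are not taken from Table 5-5) chosen so that the generalized R² comes out to 46.4%.

```python
def analysis_of_deviance(total_dev, resid_dev, n_groups, n_severity, n_model_params):
    """Sketch of an analysis of deviance table.

    n_model_params counts the explanatory parameters only; the severity
    intercepts (SEV1, SEV2, ...) are excluded, as in the text above.
    """
    n_intercepts = n_severity - 1
    # Total df, adjusted for the severity intercepts (128 - 2 = 126 in the example).
    total_df = n_groups * (n_severity - 1) - n_intercepts
    model_df = n_model_params
    resid_df = total_df - model_df
    model_dev = total_dev - resid_dev
    mean_model = model_dev / model_df
    mean_resid = resid_dev / resid_df
    return {
        "total_df": total_df,
        "model_df": model_df,
        "resid_df": resid_df,
        "model_dev": model_dev,
        "r2": model_dev / total_dev,          # generalized R-squared
        "approx_f": mean_model / mean_resid,  # ratio of mean deviances
    }


# 64 treatment groups, 3 severity levels, 5 explanatory parameters.
# The deviances 250.0 and 134.0 are hypothetical values for illustration.
table = analysis_of_deviance(250.0, 134.0, n_groups=64, n_severity=3, n_model_params=5)
print(table["total_df"], table["resid_df"], round(table["r2"], 3))
```

The approximate F value returned here is only a rough guideline when the data are cluster sampled, as noted above.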
8.0 WORKING WITH THE MENUS, TOOL BARS, & STATUS BARS
At all times, CatReg displays a menu bar at the top of the CatReg application window, and a
status bar at the bottom of the window.
CatReg presents different menu options depending on the type of window displayed, such as the
Data Grid and result windows.
This section describes the different options provided by the CatReg menus and status bars.
8.1 File Menu
| Command | Function |
|---|---|
| New Analysis | Opens a new Analysis Screen window. |
| Open Analysis | Opens a previously saved analysis file (*.anx). |
| New Dataset | Opens a new Data Grid window. |
| Open Dataset | Opens a previously saved dataset (*.csv). |
| Exit | Closes the CatReg application. CatReg will prompt you to save any unsaved changes. |

8.2 Help Menu

| Command | Function |
|---|---|
| CatReg Help | Displays the contents of the Help documentation in a new window. |
| About... | Displays a window describing the CatReg Sponsors and Credits, CatReg program version, and a disclaimer. |
8.3 Data Grid menus
These menu options are available only from an open Data Grid window.
| Menu | Command | Function |
|---|---|---|
| File | Save Dataset | Saves changes to the selected dataset. |
| File | Save Dataset As... | Displays the Windows Save As dialog box so the current dataset can be saved under a new filename. |
| File | Import Data From | Available options are: tab-delimited text file; space-delimited text file; Excel 2003 or older file (*.xls); BMDS 1.xx dataset (*.set). |
| File | Export Data To | Available options are a tab-delimited or space-delimited text file. |
| File | Close | Closes the selected window. CatReg will prompt you to save any unsaved changes. |
| Edit | Cut (Ctrl+X) | Selected data is cut from an active output file. |
| Edit | Copy (Ctrl+C) | Selected data is copied to the clipboard. |
| Edit | Paste (Ctrl+V) | Cut/copied data is pasted into output file at cursor location. |
| Edit | Delete (Del) | Deletes selected data. |
| Edit | Select All (Ctrl+A) | Selects all text in current active window. |
| Data Grid | Add Column(s) | From the dropdown list, select a predefined number of columns to add to the data grid. CatReg creates the columns only after Add Column(s) is clicked. |
| Data Grid | Add Row(s) | From the dropdown list, select a predefined number of rows to add to the data grid. CatReg creates the rows only after Add Row(s) is clicked. |
| Data Grid | Insert Row | Inserts a new row above the currently selected row in the data grid. The currently selected row is indicated by the black arrow to the left of the row number. |
| Data Grid | Insert Column | Inserts a new column to the left of the current column. The column that has a selected cell is the current column. |
| Data Grid | Delete Row | Deletes the currently selected row in the data grid. |
| Data Grid | Delete Column | Deletes the currently selected column in the data grid. The column that has a selected cell is the current column. |
8.4 Text Window (Results) Menu
| Menu | Command | Function |
|---|---|---|
| File | Save | Saves changes to the selected file. |
| File | Save As... | Displays the Windows Save As dialog box so the file can be saved under a new filename. |
| File | Print | Prints the contents of the selected window. |
| File | Print Setup | Displays the Windows Print Setup dialog box, where you can select such options as orientation, margins, paper size, and so on. |
| File | Print Preview | Displays a window showing how the file will look when printed. |
| File | Close | Closes the selected window. CatReg will prompt you to save any unsaved changes. |
| Edit | Undo (Ctrl+Z) | Undoes the most recent change. |
| Edit | Cut (Ctrl+X) | Selected data is cut from an active output file. |
| Edit | Copy (Ctrl+C) | Selected data is copied to the clipboard. |
| Edit | Paste (Ctrl+V) | Cut/copied data is pasted into output file at cursor location. |
| Edit | Select All (Ctrl+A) | Selects all text in current active window. |
| Preferences | Word Wrap | Toggles word wrap. |
| Preferences | Font | Displays the font selection dialog. |
8.5 Status Bar
Each window in CatReg has its own status bar and communicates different information.
| Status Bar Location | Functions |
|---|---|
| CatReg application window | Displays the results of actions executed within session and data windows, such as when rows or columns are inserted, a session is saved, or when a parameter options file is opened for editing. |
| Analysis Screen | Displays such messages as "Right-click on Dataset control for additional option(s)." or "Done processing!" |
9.0 WORKING WITH THE DATA GRID AND DATASETS
CatReg can open and save data files stored as comma-separated values text files with the *.CSV
extension.
Use the Data Grid window to enter and edit data. After the data are entered/modified as desired,
you can save and close the Data Grid window or Proceed to the analysis. CatReg displays the
data file's name in the Data Grid window's title bar.
A new Data Grid window displays 50 rows and 50 columns.
Note: Decimal separators: BMDS supports regional settings for the decimal separator in the
user interface and in spreadsheets created by the Export to Excel function.
Thousands separators: No thousands separator (regardless of regional setting) can
be used in the data; that is, one thousand can only be written as 1000, not as
1,000.
[Screenshot: CatReg Data Grid window with C:\usepa\CatReg\Data\Chemx.csv loaded. The visible columns are Exp., Group, Species, Target, mg/m3, and Hours.]
Figure 15. Data Grid screen.
Components of the Data Grid window are, from top to bottom:
The title bar, which includes the path and file name of the currently loaded dataset.
The menu bar, with its own File and Edit menu options that apply specifically to the Data Grid
window. There is also a Data Grid menu that enables you to add or remove rows and
columns.
The Data Grid spreadsheet. From here, you can enter, edit, and sort data; rename columns;
and perform mathematical transformations on columns.
9.1 Opening Existing Dataset Files
You can open any .CSV file from within CatReg and use the Data Grid window to edit the data
values.
1. From the CatReg application window, select File>Open Dataset.
2. CatReg displays the Open dialog box. By default, CatReg displays any .CSV files in the
CatReg program directory's Data folder.
3. Double-click the dataset to load, or select the file in the Open dialog box and press Open.
Note: If a .CSV file is open in Excel, then CatReg cannot load that file for analysis. Close the
file in Excel before attempting to load it into CatReg.
9.2 Creating a New Dataset
In a Session Grid window, select File>New Dataset and select one of the following options. A
blank Data Grid window opens, with "UntitledData.csv" displayed in the title bar.
From here, you can manually enter data or import an existing dataset.
9.3 Entering or Importing Data
If you opened an existing dataset file, then data will already be present in the Data Grid.
However, if you selected the New Dataset option, then a blank Data Grid window appears with
the filename "UntitledData.csv" in the title bar. From the Data Grid window, you can create a new
dataset in a variety of ways.
When the Proceed button on the Data Grid is enabled, you can then either save the entered data
to a .csv file or view the imported data in a new Analysis Screen.
9.3.1 Entering and editing data
You can manually type the data into individual cells of the Data Grid. Click on a cell
and then type in the value.
You can copy and paste data from an Excel spreadsheet or other open Data Grid
window using the standard Windows Cut-Copy-Paste commands.
To change any entered data, simply click on the cell and type in a new value.
However, you must save the dataset; CatReg does not automatically save changes
made in the Data Grid window.
Because you can manually edit the data, you can use this facility to recode variables
or missing values before proceeding with an analysis.
9.3.2 Importing data
CatReg can import delimited DOS text files, files in Excel 2003 format (*.xls), or BMDS 1.xx
dataset files (*.set), which are tab-delimited.
By default, CatReg dataset files are stored as DOS text files delimited with commas; each line
represents a row in a spreadsheet and each comma signifies the start of a new value within the
row. However, CatReg can import text whose values are delimited by tabs or spaces.
If you have already entered values into the Data Grid, importing a new file into the Data Grid will
overwrite those values.
Note: The first row of the imported file is reserved for column headers (variable names). Once
imported, you can rename the column header in the Data Grid. Column headers must be
one word with no spaces.
From the Data Grid's File menu, select Import Data From and then the type of file you want to
import. Valid file formats include:
Tab-delimited text file (*.txt)
Space-delimited text file (*.txt)
Excel 2003 (or earlier) spreadsheet (*.xls)
BMDS 1.xx dataset (*.set)
9.4 Copying, Cutting, and Pasting Data
You can select one or multiple cells of data within the Data Grid window and then use the
standard Windows Cut (Ctrl+X), Copy (Ctrl+C), Paste (Ctrl+V), Delete (Ctrl+Del), and Select All
(Ctrl+A) commands to move the data wherever you like. These commands are also accessible
from the Data Grid's Edit menu.
You can use these commands to transfer data to and from an Excel spreadsheet or from other
open Data Grid windows.
9.4.1 Selecting multiple sequential cells
Multiple cells can be copied or pasted at the same time.
Using the mouse: Click and drag to highlight the selected cells you want to copy or cut.
Using the mouse and keyboard: Click on the first cell to select it. Press the Shift key and then
click on the last cell to select all the cells.
9.4.2 Copying and pasting multiple cells of data
This technique enables you to copy or paste the same data cell values into multiple cells much
more quickly than you could by entering them one at a time. You could, for example, highlight five
cells containing the values you want, click on the first blank cell in a series, and then paste to
insert the contents of the five copied cells.
Note: The amount of information you can paste is constrained by the number of empty cells. If
only 3 cells in the Data Grid are empty, and you have copied 10 cells, then only the three
empty cells receive the pasted information.
1. Select a group of cells whose contents you want to copy, and press Ctrl+C or select
Edit>Copy from the Data Grid menu bar.
2. Click on a blank cell that will contain the first value in the series.
3. Press Ctrl+V or select Edit>Paste to paste the copied values into the cells.
9.5 Renaming Columns
Data Grids are initially created with names in the form of Col1, Col2, Col3, and so on. You can
rename columns in the Data Grid window to provide more meaningful names describing the data.
Column headers must be one word with no spaces.
1. Right-click on a column title to display a popup menu with a single item: Rename
Column.
2. Select Rename Column. The Rename dialog box appears.
3. Enter a one-word title with no spaces. Valid column titles can include InterCapitaliZation,
Punc.Tua.Tion, or Hy-Phens.
4. Click the Save and Exit button.
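The one-word rule for column headers can be checked before a file is loaded. The sketch below is a hypothetical helper, not part of CatReg: it treats a header as valid when it is non-empty and contains no whitespace, and it offers one possible way (joining the words with periods) to repair a header that has spaces.

```python
def is_valid_column_header(name):
    """Return True if the header is a single word with no whitespace."""
    return bool(name) and not any(ch.isspace() for ch in name)


def make_one_word(name):
    """Join whitespace-separated words with periods, e.g. 'Lung Weight' -> 'Lung.Weight'."""
    return ".".join(name.split())


for header in ["InterCapitaliZation", "Punc.Tua.Tion", "Hy-Phens", "Lung Weight"]:
    if not is_valid_column_header(header):
        print(header, "->", make_one_word(header))
```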
9.6 Adding and Deleting Data Grid Columns and Rows
Note: When all of the cells in a column/row are empty the column/row is removed when the
dataset is saved. However, if a column/row contains any data at all, the column/row is
retained when the dataset is saved.
To insert a row above the currently selected row, select Data Grid>Insert Row.
To delete the currently selected row, select Data Grid>Delete Row.
To add multiple rows above the currently selected row, select Data Grid, pick a predefined
number of rows to add from the picklist (1, 2, 5, 10, 50, or 100), and click Add Row(s).
To insert a column to the left of the currently selected column, select Data Grid>Insert
Column. In the Set Name dialog box, enter a name for the new column and click Save and
Exit.
To delete the currently selected column, select Data Grid>Delete Column.
To insert multiple columns, select Data Grid, pick a predefined number of columns from the
picklist (1, 2, 5, or 10), click Add Column(s), enter titles for the new columns in the Set Name
dialog box, and click Save and Exit. The new columns are added to the rightmost side of the
Data Grid window.
9.7 Sorting data
In the Data Grid window, left-click on a column header to sort the dataset by that column's values.
Click again to toggle between ascending order and descending order for the sort.
An up-pointing triangle in a column indicates that its values are sorted ascending; a down-
pointing triangle indicates a descending sort order.
You must save the dataset to retain the sort order.
Note: Sorts only work with numeric data and non-empty cells. Enter a zero in an empty cell to
ensure proper sorting.
9.8 Exporting Data
You can export a dataset to delimited text files that can be imported into other applications for
further analysis.
1. With a dataset loaded, from the Data Grid menubar select File>Export Data To and
select one of the following options:
Tab Delimited Text File (*.txt)
Space Delimited Text File (*.txt)
2. In the Save As dialog box, specify a filename and location for the export file and click
Save.
9.9 Saving a Dataset
CatReg displays the dataset's filename in the Data Grid's title bar.
Newly created datasets are initially assigned a default name of "UntitledData.csv."
If an analysis is run on a dataset before it is saved to another name, the analysis results are
saved to the CatReg root directory.
It is recommended that you save datasets to a unique directory. The Data subdirectory within the
CatReg program directory is usually a good location for consolidating your CatReg data files.
9.9.1 Saving changes to the current dataset
To save your work, from the Data Grid menubar select File>Save Dataset. CatReg will save any
changes to the dataset filename that is displayed in the Data Grid's title bar.
9.9.2 Saving a dataset to a different name and directory location
1. From the Data Grid menubar, select File>Save Dataset As...
2. The Save As dialog box displays. Enter a new file name in the File name field.
3. The Save As dialog box defaults to the CatReg program directory's Data subdirectory. If
you want to save the file to a different location, navigate to that location and click Save.
4. CatReg saves the new file as a comma-separated values text file with a .csv extension.
9.10 The Proceed Button
CatReg assumes that, when you click Proceed, you want to load the dataset into the Analysis
Screen, which is the next stage of the CatReg analysis process. However, CatReg disables the
Proceed button until certain conditions are met.
When you import an existing dataset, the Proceed button is enabled. Clicking Proceed loads the
dataset into a new Analysis Screen.
Clicking on the Proceed button opens the Save As dialog box so that you can save the entered
data. Enter a name for the CSV file that will be created and click Save. After the file is saved,
CatReg loads the data into a new Analysis Screen. However, you can choose to cancel the Save
As dialog box; CatReg will continue to load the data into a new Analysis Screen.
9.11 Renaming Required Input Variables
Open the dataset in the Data Grid window to manually rename variables.
9.12 Converting Data Files to Comma-Separated Files
CatReg requires the input data to be comma-separated.
If the input data file is created using Excel, it can be saved as a csv file.
If the file is delimited by spaces or tabs, then select the specific option from the File>Import
Data From submenu.
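A delimited text file can also be converted to a comma-separated file outside CatReg before it is opened. The following is a minimal sketch using Python's standard csv module; the function name and the file names in the usage lines are assumptions made for illustration.

```python
import csv


def delimited_to_csv(src_path, dst_path, delimiter="\t"):
    """Rewrite a tab- or space-delimited text file as a comma-separated file."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        if delimiter == " ":
            # Split on runs of spaces so aligned columns do not create empty fields.
            rows = (line.split() for line in src)
        else:
            rows = csv.reader(src, delimiter=delimiter)
        csv.writer(dst).writerows(rows)


if __name__ == "__main__":
    # Hypothetical file names: build a tiny tab-delimited file, then convert it.
    with open("example_tab.txt", "w", newline="") as f:
        f.write("Species\tmg/m3\tHours\nMU\t1259\t1\n")
    delimited_to_csv("example_tab.txt", "example.csv")
```

The first row is copied through unchanged, so it should already hold one-word column headers.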
9.13 Combining Severity Categories
Sometimes it is necessary to combine severity categories. For example, if some categories
contain insufficient data, then adjacent severity categories may be combined to form broader
categories. In some cases, the results may suggest that two or more of the severity parameters
are not significantly different.
In these situations, you may want to redefine the lower and upper endpoints of an interval of
categories. You do this by manually assigning all observations with severity categories that you
want to join to the same severity level (and possibly re-assigning other severity values if they are
affected by that re-assignment).
This can be done in the Data Grid, Excel, or other application you use to prepare your dataset.
For example, if the data have severity levels 0, 1, 2, and 3, and you determine that sev 1 and sev 2
should be merged, then keep the sev=0 observations the same, assign sev=1 to all of the previous
sev 1 and sev 2 observations, and set the previous sev 3 observations to sev=2.
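The merge in this example amounts to a lookup from old severity levels to new ones. A minimal sketch follows, assuming the severity column has already been read into a list of integers; the function and variable names are hypothetical.

```python
def combine_severities(values, mapping):
    """Recode each severity value through an old-level -> new-level mapping."""
    return [mapping[v] for v in values]


# Merge sev 1 and sev 2; the old sev 3 becomes sev 2 so levels stay consecutive.
merge_1_and_2 = {0: 0, 1: 1, 2: 1, 3: 2}

old_severity = [0, 1, 2, 3, 2, 0]
new_severity = combine_severities(old_severity, merge_1_and_2)
print(new_severity)
```

A value that is missing from the mapping raises a KeyError, which helps catch unexpected severity codes before the analysis is run.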
9.14 Recoding Missing Values
Different systems often use different codes for missing values. CatReg assumes that a missing
value is coded as a blank value.
If your dataset codes missing data as "-9999", you will need to change them to blank values.
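Recoding can be done by hand in the Data Grid, or in a preprocessing step such as the sketch below. It assumes the dataset has been read as rows of text cells and that "-9999" is the only missing-value code; both the function name and the sample rows are illustrative.

```python
def blank_out_missing(rows, missing_codes=("-9999",)):
    """Replace cells that match a missing-value code with an empty string."""
    return [
        ["" if cell.strip() in missing_codes else cell for cell in row]
        for row in rows
    ]


rows = [
    ["Species", "mg/m3", "Hours"],
    ["MU", "1259", "-9999"],
    ["RT", "-9999", "4"],
]
cleaned = blank_out_missing(rows)
print(cleaned)
```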
10.0 REFERENCES
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic
Control AC-19: 716-723.
Diggle, P. J.; Liang, K.-Y.; Zeger, S. L. (1994) Analysis of longitudinal data. New York, NY:
Clarendon Press.
Huber, P. J. (1967) The behavior of maximum likelihood estimates under nonstandard conditions.
In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability.
Volume 1. Berkeley, CA: University of California Press.
Liang, K.-Y.; Zeger, S. L. (1986) Longitudinal data analysis using generalized linear models.
Biometrika 73: 13-22.
Lopez, A.; Prior, M.; Yong, S.; Albassam, M.; Lillie, L. E. (1987) Biochemical and cytologic
alterations in the respiratory tract of rats exposed for 4 hours to hydrogen sulfide. Fundam. Appl.
Toxicol. 9: 753-762.
McCullagh, P.; Nelder, J. A. (1989) Generalized linear models. 2nd ed. London, United Kingdom:
Chapman and Hall.
Morgan, B. J. T. (1992) Analysis of quantal response data. London, United Kingdom: Chapman
& Hall.
Simpson, D. G.; Carroll, R. J.; Xie, M.; Guth, D. J. (1996a) Weighted logistic regression and
robust analysis of diverse toxicology data. Commun. Stat. 25: 2615-2632.
Simpson, D. G.; Carroll, R. J.; Zhou, H.; Guth, D. J. (1996b) Interval censoring and marginal
analysis in ordinal regression. J. Agric. Biol. Environ. Stat. 1: 354-376.
U.S. Environmental Protection Agency. (2000) CatReg software documentation. Research
Triangle Park, NC: Office of Research and Development, National Center for Environmental
Assessment; report no. EPA/600/R-98/053.
Venables, W. N.; Ripley, B. D. (1994) Modern applied statistics with S-Plus. New York, NY:
Springer-Verlag.
Weisberg, S. (1985) Applied linear regression. 2nd ed. New York, NY: Wiley.
11.0 DEFINITIONS, ACRONYMS, AND ABBREVIATIONS
Akaike Information Criterion (AIC) - The deviance (-2 times the log of the maximized value of the
likelihood function) + 2 times the number of parameters in the model.
Categorical regression - A model expressing the probabilities of different response categories as
functions of explanatory variables.
Cluster sample - A data set comprised of subsamples of data from common sources. For
example, a data set may contain several data records per laboratory from several different
laboratories. The subgroup of data records from each individual laboratory would represent a
cluster sample.
Cumulative odds regression - Ordinal regression model for directly modeling the probabilities of
exceeding different severity levels.
Deviance - The minimized value of twice the negative logarithm of the likelihood function.
Extra risk concentration (ERC) - Concentration at which extra risk is a user-specified value.
ERC settings - Three numbers: ERC percentile, ERC severity level, and ERC percentile for
confidence intervals.
Filtering - Exclusion of selected data records from the analysis. This capability "filters out"
selected data without altering the data input file.
Generalized estimating equation - An equation depending on the data and the parameter values,
such that solving for the parameter values yields consistent estimates.
Hierarchical models - An ordered series of models, such that each model is a special case of the
next one in the series.
Inf - Infinite value
Interval censored data - Data for which the response is known only to lie in an interval of values.
Likelihood function - For categorical response data, a model for the joint probability of the
observed data values, expressed as a function of the model parameters.
Link function - A function applied to the categorical response probability to transform the
categorical regression model to linear units.
Meta-analysis - The analysis of data from multiple studies to determine overall trends and
increase power.
NaN - Not a number
Ordinal data - Data reported as ordered categories. The order is meaningful, but the numerical
difference between ordered categories is not.
Parallelism - The coefficients of concentration and time in the exposure-response model (i.e., the
probability function) do not change with severity level (parallelism applies to Model 1, the
cumulative odds model, but not to Model 2, the unrestricted cumulative model).
Probability function - The function of the explanatory variables that gives the probability of
exceeding a given severity level.
Proportional odds regression - Ordinal regression in which the log-odds of exceeding different
severity levels are parallel across severity categories. It is a special case of cumulative odds
regression with the logistic link function.
Stratification - To create subsets of data by allowing the model parameters to vary by subset.
Covariate information such as species, sex, and target organ may be used as a basis for creating
the subsets.
APPENDIX A: DISTRIBUTION OF CONTINUOUS RESPONSE DATA OVER
SEVERITY LEVELS
Response data are sometimes measured on a continuous scale, with the mean and standard
error reported. Section 3.9 contains a hypothetical example for lung weight in rats in which a
mean and standard error were reported for each treatment group. The lung weight data were
distributed over the severity levels shown for "Ref.id" = 2 in Table 6 for analysis by CatReg. The
following description explains how that was done and provides an example calculation for
reference. A technical explanation is provided at the end of the appendix.
The distribution of lung weights in healthy, unexposed rats is needed, either estimated from
control animals in the experiment or "known" from other sources. It is assumed here that the
distribution of lung weights is normally distributed with mean 1.0 g and standard deviation 0.05 g.
(Note: The normal distribution is assumed simply for illustration. The same idea could be applied
to other distributions). The user then determines weight intervals for severity levels to be used.
For this purpose, it may be helpful to estimate first the highest weight that might be considered in
the "normal" range for unexposed animals. The weight 1.15 g, which is three standard deviations
above the mean, is an upper bound on virtually all lung weights in unexposed animals (i.e., a
weight above 1.15 g is above the normal range of lung weights). The following correspondence
was made between severity levels and lung weights for the example: Sev0 (<1.15 g), Sev1
(1.15 to 1.50 g), and Sev2 (>1.50 g).
Suppose that the average lung weights and standard errors shown in Table 11 were reported for
treatment groups in "Ref.id" = 2, Table 6. (Note: If SE is the standard error from a treatment
group of size n, the estimate of the standard deviation is $SE\sqrt{n}$.) There is a separate mean and
standard error reported for each treatment group, denoted by $\bar{X}_i$ and $SE_i$, respectively, for index i,
with i = 1, ..., 13. The ith treatment group is assumed to be a sample from a normal distribution
with unknown mean and standard deviation, denoted by $\mu_i$ and $\sigma_i$, respectively. The treatment
groups are rather small ($n_i$ = 10 for all i), so it was assumed that the standard deviation was the
same for treatment groups with similar estimates of the standard deviation (i.e., $SE_i\sqrt{n_i}$). The
standard deviation was assumed equal for indices 1, 2, 6, and 10 (to be called Group A), 3, 7,
and 11 (Group B), 4, 8, and 12 (Group C), and 5, 9, and 13 (Group D) in Table 11. An estimate
of the common standard deviation in Group A, $\sigma_A$, is calculated from $SE_1$, $SE_2$, $SE_6$, and $SE_{10}$ as
follows (estimation of the standard deviations for the other three groups is similar).
Let $\sigma_A^2$ be the common variance for Group A. The variance estimate from Sample 1 is
$s_1^2 = n_1 SE_1^2 = 10(0.03)^2$. The estimate of $\sigma_A^2$, denoted by $s_A^2$, is the sum of the estimates of $\sigma_A^2$ from
Samples 1, 2, 6, and 10, weighted by their degrees of freedom (df) (9 for each sample). This
gives $s_A^2 = \frac{1}{36}[9(10)(0.03)^2 + 9(10)(0.025)^2 + 9(10)(0.025)^2 + 9(10)(0.025)^2] = 0.00694$. The
proportion of Sample 1 with lung weights less than 1.15 g is estimated as follows. If X is a new
observation for Sample 1, then
$$T = \frac{X - \bar{X}_1}{s_A\sqrt{(n_1 + 1)/n_1}}$$
has a t-distribution with $n_A - 4$ df, where $n_1$ is the number of observations in Sample 1 (i.e., $n_1$ =
10), and $n_A$ is the number of observations in Group A (i.e., $n_A$ = 40). Pr(X < 1.15 g) is estimated
by $\Pr\{T < (1.15 - 1.1)/[0.00694\,(11/10)]^{0.5}\} = 0.716$, shown in Table 12 under Sev0 for "Index" =
1. To estimate Pr(1.15 < X < 1.5), first estimate Pr(X < 1.5) as above, except with 1.15 g
replaced by 1.5 g, and then subtract the estimate of Pr(X < 1.15). Similarly, the relationship
Pr(X > 1.5) = 1 - Pr(X < 1.5) is used to estimate Pr(X > 1.5). The estimated proportions of each sample
with lung weights in the intervals <1.15, 1.15 to 1.5, and >1.5 are displayed in Table 12.
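The Group A calculation for Index 1 can be reproduced with a short Python sketch. This is an illustration under stated assumptions rather than the procedure actually used to build the tables: the function names are hypothetical, and the t-distribution CDF is obtained by simple numerical integration (Simpson's rule) so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF by Simpson's rule on [0, |x|], using symmetry about zero (steps must be even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def pooled_variance(standard_errors, n):
    """Pool the sample variances n * SE^2, each weighted by its n - 1 df."""
    df_each = n - 1
    total_df = df_each * len(standard_errors)
    var = sum(df_each * n * se ** 2 for se in standard_errors) / total_df
    return var, total_df


def severity_proportions(mean, pooled_var, n, df, cutoffs=(1.15, 1.5)):
    """Estimated proportions below, between, and above the two weight cutoffs."""
    scale = math.sqrt(pooled_var * (n + 1) / n)
    p_low = t_cdf((cutoffs[0] - mean) / scale, df)
    p_high = t_cdf((cutoffs[1] - mean) / scale, df)
    return p_low, p_high - p_low, 1 - p_high


# Group A: indices 1, 2, 6, and 10, with n = 10 animals per treatment group.
var_a, df_a = pooled_variance([0.03, 0.025, 0.025, 0.025], n=10)
sev0, sev1, sev2 = severity_proportions(1.1, var_a, 10, df_a)
print(round(var_a, 5), df_a)
print(round(sev0, 3), round(sev1, 3), round(sev2, 3))
```

The printed proportions should be close to the Index 1 row of the table of estimated proportions (roughly 0.715, 0.285, and 0).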
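Table 11: Summary Data for "Ref.Id" = 2 in Table 6

| Index | Exp. | Group | mg/m3 | Hours | Average Lung Weight ($\bar{X}$) | Standard Error |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 330 | 2 | 1.1 | 0.03 |
| 2 | 1 | 2 | 360 | 2 | 1.2 | 0.025 |
| 3 | 1 | 3 | 390 | 2 | 1.5 | 0.04 |
| 4 | 1 | 4 | 410 | 2 | 1.8 | 0.08 |
| 5 | 1 | 5 | 460 | 2 | 1.7 | 0.1 |
| 6 | 2 | 1 | 460 | 1 | 1.2 | 0.025 |
| 7 | 2 | 2 | 510 | 1 | 1.3 | 0.04 |
| 8 | 2 | 3 | 560 | 1 | 1.6 | 0.08 |
| 9 | 2 | 4 | 610 | 1 | 1.8 | 0.1 |
| 10 | 3 | 1 | 560 | 0.5 | 1.3 | 0.025 |
| 11 | 3 | 2 | 610 | 0.5 | 1.5 | 0.04 |
| 12 | 3 | 3 | 660 | 0.5 | 1.6 | 0.08 |
| 13 | 3 | 4 | 710 | 0.5 | 1.8 | 0.1 |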
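Table 12: Estimated proportions in severity levels

| Group | Index | $s\sqrt{(n+1)/n}$ | $\bar{X}$ | $(1.15 - \bar{X})/(s\sqrt{(n+1)/n})$ | df | Sev0 Pr(X < 1.15) | $(1.5 - \bar{X})/(s\sqrt{(n+1)/n})$ | Sev1 Pr(1.15 < X < 1.5) | Sev2 Pr(X > 1.5) |
|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 0.0874 | 1.1 | 0.572 | 36 | 0.715 | 4.577 | 0.285 | 0 |
| A | 2 | 0.0874 | 1.2 | -0.572 | 36 | 0.285 | 3.432 | 0.714 | 0 |
| B | 3 | 0.1327 | 1.5 | -2.638 | 27 | 0.007 | 0 | 0.493 | 0.500 |
| C | 4 | 0.2653 | 1.8 | -2.450 | 27 | 0.011 | -1.131 | 0.123 | 0.866 |
| D | 5 | 0.3317 | 1.7 | -1.1658 | 27 | 0.127 | -0.603 | 0.149 | 0.724 |
| A | 6 | 0.0874 | 1.2 | -0.572 | 36 | 0.285 | 3.432 | 0.714 | 0.001 |
| B | 7 | 0.1327 | 1.3 | -1.130 | 27 | 0.134 | 1.507 | 0.794 | 0.072 |
| C | 8 | 0.2653 | 1.6 | -4.146 | 27 | 0 | -0.377 | 0.354 | 0.645 |
| D | 9 | 0.3317 | 1.8 | -1.960 | 27 | 0.030 | -0.904 | 0.157 | 0.813 |
| A | 10 | 0.0874 | 1.3 | -1.716 | 36 | 0.047 | 2.288 | 0.939 | 0.014 |
| B | 11 | 0.1327 | 1.5 | -2.638 | 27 | 0.007 | 0 | 0.493 | 0.5 |
| C | 12 | 0.2653 | 1.6 | -4.146 | 27 | 0 | -0.377 | 0.354 | 0.645 |
| D | 13 | 0.3317 | 1.8 | -1.960 | 27 | 0.030 | -0.904 | 0.157 | 0.813 |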
APPENDIX B: TECHNICAL DISCUSSION
B.1 Link Functions
Without loss of generality, link functions will be discussed in the context of Model 1, described by
Eq. 1a in Appendix C, specifically,
$$\Pr(Y \ge s \mid C, T) = H[\alpha_s + \beta_1 f_1(C) + \beta_2 f_2(T)]$$
CatReg currently supports three forms for H:
Logistic $H(x) = e^x / (1 + e^x)$,
Normal $H(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2}\, dz$, and
Gumbel $H(x) = 1 - \exp(-e^x)$.
The inverse of H, which is denoted by L, is called the link function in the statistical literature. The
link functions corresponding to the probability functions given above are
Logit L(p) = log[p/(1 - p)],
Probit L(p) = the 100p-th percentile of the normal(0,1) distribution, and
Cloglog L(p) = log[-log(1 - p)],
where p is any number between 0 and 1. The link function and probability function are inverse to
each other in the sense that H[L(p)] = p and L[H(x)] = x. Applying the link function to both sides of
Model 1 gives
$$L[\Pr(Y \ge s \mid C, T)] = \alpha_s + \beta_1 f_1(C) + \beta_2 f_2(T), \qquad s = 1, 2, \ldots, S.$$
This expression shows that the link-transformed probability follows a linear model.
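The relationship between the probability functions and their link functions can be checked numerically. The following Python sketch is illustrative only and is not part of CatReg: the function names are hypothetical, and the normal case relies on the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist(0.0, 1.0)

# Probability functions H(x)
def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))      # same value as e^x / (1 + e^x)

def h_normal(x):
    return _STD_NORMAL.cdf(x)

def h_gumbel(x):
    return 1.0 - math.exp(-math.exp(x))

# Link functions L(p), the inverses of the functions above
def logit(p):
    return math.log(p / (1.0 - p))

def probit(p):
    return _STD_NORMAL.inv_cdf(p)           # 100p-th percentile of normal(0,1)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

PAIRS = {
    "logit": (h_logistic, logit),
    "probit": (h_normal, probit),
    "cloglog": (h_gumbel, cloglog),
}

def round_trip_error(name, p):
    """Absolute difference between H[L(p)] and p for the named link."""
    h, link = PAIRS[name]
    return abs(h(link(p)) - p)

for name in PAIRS:
    print(name, [round_trip_error(name, p) for p in (0.05, 0.5, 0.9)])
```

The printed errors should all be at the level of floating-point rounding, reflecting H[L(p)] = p.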
The use of a link function is essential here. Without it, the linear model becomes unbounded, and
one is led to absurd estimates of probabilities for extreme values of C and T, namely negative
probabilities or probabilities greater than 1. Moreover, link functions may be derived from a basic
assumption that the ordinal severity score corresponds to exceeding an underlying toxic response
threshold (see discussion below). Under this approach, the ordinal response score is called a
quantal response because it is a quantization of the underlying response (see, e.g., Morgan,
1992). Any ordinal regression model of the type given in Model 1 may be interpreted as a quantal
response model. When $\beta_1$ and $\beta_2$ are constant across severity categories, data from one severity
category add information to the modeling of another category.
A toxicological interpretation of the link function follows from quantal response analysis. In
particular, let Z denote a particular measure of health for a randomly selected subject given
exposure to concentration C for duration T. Larger values of Z correspond to a healthier
individual. Suppose that the health variable Z is distributed in the population as
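$$\Pr(Z < z) = H[z/\sigma + \beta_1 f_1(C) + \beta_2 f_2(T)],$$
and that a toxic response of severity $Y \ge s$ occurs if the health measure Z is below a
threshold $\sigma\alpha_s$, where $\sigma$ is specific to the health measure. Then, under exposure (C, T), the
probability of toxic severity of category s or higher is
$$\Pr(Y \ge s) = \Pr(Z < \sigma\alpha_s) = H[\alpha_s + \beta_1 f_1(C) + \beta_2 f_2(T)].$$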
This is precisely the ordinal regression model given in Model 1. It is apparent that the link
function is a reflection of the underlying distribution of Z, which cannot be measured directly, but
its distribution, in particular the dependence on C and T, can be estimated indirectly from the
toxicological response data.
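The threshold interpretation can be illustrated by simulation. The sketch below is a hypothetical example, not CatReg code. It assumes a logistic H, a latent health variable constructed so that Pr(Z < z) = H(z/sigma + eta), and arbitrary values for the severity parameter, the scale sigma, and the exposure term eta (which stands in for the combined concentration and duration terms).

```python
import math
import random

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_exceedance(alpha_s, eta, sigma, n, seed=0):
    """Fraction of simulated subjects whose health measure Z falls below
    the threshold sigma * alpha_s (i.e., severity s or higher)."""
    rng = random.Random(seed)
    threshold = sigma * alpha_s
    hits = 0
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                     # avoid log(0)
            u = rng.random()
        e = math.log(u / (1.0 - u))         # draw from the standard logistic
        z = sigma * (e - eta)               # gives Pr(Z < z) = H(z/sigma + eta)
        if z < threshold:
            hits += 1
    return hits / n

# Hypothetical values chosen only for illustration.
ALPHA_S, ETA, SIGMA = -1.0, 0.6, 2.5
empirical = simulate_exceedance(ALPHA_S, ETA, SIGMA, n=20000)
model = h_logistic(ALPHA_S + ETA)
print(empirical, model)
```

The simulated frequency should agree with H[alpha_s + eta] up to sampling error, and the result does not depend on the value chosen for sigma.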
B.2 Interval Censoring
Although censored responses do not provide as much information as fully scored responses, they
do provide some information about the model. This information is used in the maximum
likelihood estimation and the generalized maximum likelihood estimation described in Section
B.3. In fitting the model by maximum likelihood, it is necessary to compute the probability of the
observed response as a function of the model parameters. Table 13 shows how these
probabilities are computed in a three-category scoring system with interval censoring, using
Model 1.
Table 13: Interval Probabilities for Model 1 with Three Severity Categories
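Interval   Probability
(0,0)      $1 - H[\alpha_1 + \beta_1 f_1(C) + \beta_2 f_2(T)]$
(0,1)      $1 - H[\alpha_2 + \beta_1 f_1(C) + \beta_2 f_2(T)]$
(1,1)      $H[\alpha_1 + \beta_1 f_1(C) + \beta_2 f_2(T)] - H[\alpha_2 + \beta_1 f_1(C) + \beta_2 f_2(T)]$
(1,2)      $H[\alpha_1 + \beta_1 f_1(C) + \beta_2 f_2(T)]$
(2,2)      $H[\alpha_2 + \beta_1 f_1(C) + \beta_2 f_2(T)]$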
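The entries of Table 13 all follow one rule: the probability of an interval (lower, upper) is the probability of severity lower or greater minus the probability of severity upper + 1 or greater. A minimal Python sketch of that rule follows. It assumes a logistic H; the function names and the numeric values are hypothetical, and `eta` stands in for the combined concentration and duration terms.

```python
import math

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def exceedance(s, alphas, eta):
    """Pr(Y >= s): 1 for s = 0, H[alpha_s + eta] for s = 1..S, 0 for s > S."""
    if s <= 0:
        return 1.0
    if s > len(alphas):
        return 0.0
    return h_logistic(alphas[s - 1] + eta)

def interval_probability(lower, upper, alphas, eta):
    """Probability that the response lies in the censoring interval."""
    return exceedance(lower, alphas, eta) - exceedance(upper + 1, alphas, eta)

# Hypothetical parameter values for a three-category system (S = 2).
ALPHAS = (0.5, -1.0)     # alpha_1 >= alpha_2
ETA = 0.3

for interval in [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]:
    print(interval, interval_probability(*interval, ALPHAS, ETA))
```

The three fully scored intervals (0,0), (1,1), and (2,2) have probabilities that sum to 1, and the uninformative interval (0,2) has probability 1, which is why it does not appear in the table.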
B.3 Parameter Estimation
B.3.1 Maximum Likelihood Estimation
The likelihood function is defined to be the joint probability density of the data, viewed as a
function of the parameters. In categorical regression, the response variables are discrete, so the
likelihood may be interpreted as the probability that an investigation would result in the particular
values that were observed. This probability depends on the unknown parameters. Maximum
likelihood estimates the unknown parameters by the values that maximize the likelihood of the
observed data. It is often more convenient to work with the logarithm of the likelihood. For this
purpose, it is common to define the deviance function:
Deviance = -2 * log(likelihood).
The deviance is a nonnegative measure of model fit. Maximizing the likelihood is mathematically
equivalent to minimizing the deviance. The factor 2 is included because it is the correct multiplier
for certain likelihood-based, goodness-of-fit tests. Smaller deviances (larger likelihoods)
correspond to a closer fit of the model to the data. A deviance of 0 would indicate a perfect fit,
that is, a "saturated" model. Generally, a deviance of zero would indicate a model that is too
complicated. The deviance value shown in CatReg summary statistics such as that in Table 9 is
the "residual deviance" discussed in Section 7.0.
If all data are independent, the likelihood function for interval-censored ordinal regression has a
simple form. For $i = 1, 2, \ldots, n$, let $Y_i$ denote the ordinal response, and let $C_i$ and $T_i$ denote the
concentration and duration of exposure for the ith experimental subject, respectively. $Y_i$ may be
known only to lie in an interval. To account for this, let $L_i$ and $U_i$ denote the lower and upper
endpoints of the known interval for $Y_i$, respectively. If it is known that $Y_i = k$, then $L_i = U_i = k$. For
convenient reference, denote the probability of severity s or greater by
$$P_i(s) = \begin{cases} 1, & \text{if } s = 0; \\ H[\alpha_s + \beta_1 f_1(C_i) + \beta_2 f_2(T_i)], & \text{if } s = 1, \ldots, S; \\ 0, & \text{if } s > S. \end{cases}$$
Then the deviance for interval-censored ordinal regression is given by
$$\text{Deviance} = -2 \sum_{i=1}^{n} \log[P_i(L_i) - P_i(U_i + 1)].$$
Parameter estimates are computed by iteratively minimizing the deviance. CatReg uses the R
function optim() to perform this optimization.
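The deviance calculation itself can be sketched in a few lines. The Python example below is illustrative and does not reproduce CatReg's internal code or its call to optim(). It assumes Model 1 with a logistic H and base-10 logarithm transforms, and the data rows and parameter values are hypothetical.

```python
import math

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def exceedance(s, conc, dur, alphas, beta1, beta2):
    """Probability of severity s or greater for one subject."""
    if s <= 0:
        return 1.0
    if s > len(alphas):
        return 0.0
    return h_logistic(alphas[s - 1]
                      + beta1 * math.log10(conc)
                      + beta2 * math.log10(dur))

def deviance(data, alphas, beta1, beta2):
    """-2 times the log-likelihood for interval-censored ordinal responses.
    Each row is (concentration, duration, lower, upper)."""
    total = 0.0
    for conc, dur, lower, upper in data:
        prob = (exceedance(lower, conc, dur, alphas, beta1, beta2)
                - exceedance(upper + 1, conc, dur, alphas, beta1, beta2))
        total += -2.0 * math.log(prob)
    return total

# Hypothetical rows; the last one is fully censored, (0, S), and so
# contributes nothing to the deviance.
TOY_DATA = [
    (100.0, 1.0, 0, 0),
    (100.0, 2.0, 0, 1),
    (300.0, 1.0, 1, 1),
    (300.0, 2.0, 1, 2),
    (1000.0, 1.0, 2, 2),
    (1000.0, 2.0, 2, 2),
    (300.0, 4.0, 0, 2),
]
ALPHAS = (-10.0, -12.0)   # hypothetical, with alpha_1 >= alpha_2

print(deviance(TOY_DATA, ALPHAS, beta1=4.0, beta2=1.0))
```

An optimizer would call a function like `deviance` repeatedly with trial parameter values and keep the set that gives the smallest result.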
B.3.2 Generalized Likelihood Estimation
Weighted ordinal regression analysis corresponds to a modified likelihood in which the probability
associated with the ith observation is raised to a positive power $w_i$. This results in a modified
likelihood with a weighted deviance:
$$\text{Deviance} = -2 \sum_{i=1}^{n} w_i \log[P_i(L_i) - P_i(U_i + 1)].$$
If the weights do not correspond to incidences, then this likelihood corresponds to a nonstandard
ordinal regression model. In this situation, it is more common to interpret the deviance as a
generalized criterion and to assume that the usual ordinal regression model holds. Under this
assumption, the generalized deviance still leads to consistent estimates of the parameters, but it
does not correspond to the likelihood of the data. Instead, the estimator is defined by a
generalized estimating equation, which provides the basis for computing valid large-sample
confidence intervals and test statistics.
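When the weights are incidences (counts of identical observations), the weighted deviance equals the ordinary deviance of the expanded data set. The short Python sketch below illustrates this with hypothetical interval probabilities, which are assumed to have already been computed from the model.

```python
import math

def weighted_deviance(interval_probs, weights):
    """-2 * sum of w_i * log(p_i), where p_i is the model probability of the
    observed interval for observation i."""
    return -2.0 * sum(w * math.log(p) for p, w in zip(interval_probs, weights))

# Hypothetical interval probabilities and integer incidence weights.
probs = [0.62, 0.25, 0.13]
counts = [3, 1, 2]

weighted = weighted_deviance(probs, counts)

# The same data written out one row per subject, each with weight 1.
expanded = [p for p, k in zip(probs, counts) for _ in range(k)]
unweighted = weighted_deviance(expanded, [1] * len(expanded))

print(weighted, unweighted)
```

With non-integer or non-incidence weights the same function still defines an estimating criterion, but, as described above, it no longer corresponds to the likelihood of the data.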
A further modification arises when the data are cluster sampled. The likelihood for cluster-
sampled data does not have the simple form given above. Rather, it involves a product of
multiple integrals of conditional likelihoods. Such likelihoods are computationally challenging and
the results may be sensitive to the specification of the correlation structure. An alternative
approach is to assume the ordinal regression model holds in a population-average sense.
Consistent estimates then may be obtained quite generally, without making extensive
distributional assumptions about the correlation structure. To achieve this, CatReg takes the
expression derived above as a "working deviance" criterion. Minimizing it leads to consistent
estimates under the population-average model. The main impact on the analysis compared to a
standard likelihood analysis is the use of generalized estimating equation methods for making
statistical inferences. In particular, rather than reporting the inverse information matrix as the
estimated parameter covariance, the well-known sandwich formula is used.
Most applications of CatReg will involve cluster-sampled data and possible weighting of
observations. As noted above, CatReg uses the weighted independence criterion as an
estimating criterion, but computes confidence intervals and hypothesis tests without assuming the
criterion is the likelihood. This approach has a long history in the statistical literature. Huber
(1967) derived the large-sample theory of "maximum likelihood" estimators when the working
likelihood is different from the actual likelihood of the data. In the literature on robust statistics,
this type of estimator is called an "M-estimator" because it generalizes maximum likelihood.
Liang and Zeger (1986) extended the method to the analysis of correlated data, based on a
"working" correlation structure, without assuming the working correlation structure was correct.
This general approach is widely used in biometry, econometrics, and survey sampling.
B.4 Confidence Limit Calculations
CatReg uses the method of generalized estimating equations, which is well accepted in the
literature (see Diggle et al., 1994), for the calculation of confidence limits for cluster-sampled
data. The classical likelihood ratio inferences do not apply to cluster-correlated data because the
likelihood ratio test assumes independent responses. The application of generalized likelihood
ratio tests for correlated data is, however, an area of active research that may produce usable
results in the future.
Confidence intervals and hypothesis tests about the parameters rely on a large-sample normal
approximation to the joint distribution of the parameter estimates. The main steps in the
derivation of this approximation are as follows. First, assuming the data are cluster sampled,
write the generalized deviance as
$$GD = -2 \sum_{i=1}^{N} \sum_{j=1}^{n_i} w_{ij} \log[P_{ij}(L_{ij}) - P_{ij}(U_{ij} + 1)],$$
where $-2 w_{ij} \log[P_{ij}(L_{ij}) - P_{ij}(U_{ij} + 1)]$ is the contribution of the jth individual from cluster i,
$j = 1, \ldots, n_i$, $i = 1, \ldots, N$. Let B denote the vector of all model parameters. The GD estimate of B
solves a generalized estimating equation
$$0 = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \Psi_{ij},$$
where the summand $\Psi_{ij}$ is given by
$$\Psi_{ij} = w_{ij}\, \frac{dP_{ij}(L_{ij}) - dP_{ij}(U_{ij} + 1)}{P_{ij}(L_{ij}) - P_{ij}(U_{ij} + 1)},$$
and $dP_{ij}(t)$ denotes the vector of derivatives of $P_{ij}(t)$, with respect to the components of B,
evaluated at the estimated parameters. Expanding the estimating equation in a Taylor series
leads to a large-sample normal approximation. The estimated parameter vector, $\hat{B}$, is
approximately multivariate normal. The mean of the approximating normal distribution equals the
true value of B, and the covariance matrix is given by the sandwich formula
$$\text{Est.Cov}(\hat{B}) = J^{-1} C J^{-1},$$
where J is given by
and C is a covariance estimate for the total score, given by
c-
If the working likelihood were the actual likelihood of the data, then J and C would estimate the
same matrix, and the usual inverse information, $J^{-1}$, would estimate the covariance of $\hat{B}$.
Further details are given in Simpson et al. (1996a).
Standard errors of individual parameter estimates are obtained as the square roots of the
diagonal elements of the estimated covariance. Confidence intervals and hypothesis tests derive
from the normal approximation for $\hat{B}$.
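The effect of the sandwich formula is easiest to see with a single parameter, where the matrices reduce to numbers and the estimated variance is C / J^2. The Python sketch below is a simplified illustration, not CatReg's implementation. It assumes a binary response with Pr(Y = 1 | x) = H(beta * x) and logistic H, and it assumes the standard forms for this case: J is the sum of the per-observation information terms and C is the sum over clusters of the squared cluster score. The data are hypothetical.

```python
import math

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_and_variances(clusters, iterations=50):
    """clusters: list of clusters, each a list of (x, y) pairs with y in {0, 1}.
    Returns (beta_hat, naive_variance, sandwich_variance)."""
    beta = 0.0
    for _ in range(iterations):             # Newton steps on the estimating equation
        score = 0.0
        info = 0.0
        for cluster in clusters:
            for x, y in cluster:
                p = h_logistic(beta * x)
                score += (y - p) * x
                info += p * (1.0 - p) * x * x
        beta += score / info
    j_total = 0.0                           # "bread": summed information
    c_total = 0.0                           # "meat": squared cluster scores
    for cluster in clusters:
        cluster_score = 0.0
        for x, y in cluster:
            p = h_logistic(beta * x)
            cluster_score += (y - p) * x
            j_total += p * (1.0 - p) * x * x
        c_total += cluster_score ** 2
    return beta, 1.0 / j_total, c_total / (j_total * j_total)

# Hypothetical observations (x, y).
ROWS = [(-2.0, 0), (-1.0, 0), (-1.0, 1), (0.5, 0),
        (1.0, 1), (1.0, 0), (2.0, 1), (2.0, 1)]

singles = [[row] for row in ROWS]           # every subject is its own cluster
tripled = [[row] * 3 for row in ROWS]       # each cluster holds 3 identical copies

print(fit_and_variances(singles))
print(fit_and_variances(tripled))
```

Copying every observation three times within its cluster adds no independent information. The naive inverse-information variance nevertheless shrinks by a factor of three, whereas the sandwich variance is unchanged, which is the behavior wanted for cluster-sampled data.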
APPENDIX C: Technical Background: Models and Extra Risk
C.1 Exposure-response Models
Let Y denote a dependent variable that represents the severity or intensity of the response.
Assume Y is an ordinal score taking one of the values (0, 1, ..., S). A score of 0 corresponds to the
lowest severity (e.g., no adverse effect), and a score of S corresponds to the highest severity
(e.g., lethal, in a toxicological context). Categorical regression is a method for modeling the
probability distribution of Y as a function of the explanatory variables, concentration (C) and
duration (T). It employs a generalized linear model (McCullagh and Nelder, 1989) for the
dependence of the probabilities of different severity categories on the explanatory variables.
CatReg provides a choice of two models, Model 1 (the cumulative odds model) described by Eq.
1a and Model 2 (the unrestricted cumulative model) described by Eq. 1b, with variations on both
as described below. CatReg refers to any model of the form of Model 1 as a cumulative odds
model because the model is expressed in terms of the cumulative probabilities, or odds, for $Y \ge s$.
Note that Model 1 is a special case of Model 2 wherein parameters $\beta_{1s}$ and $\beta_{2s}$ do not depend on
s (which denotes severity level, as discussed below). A primary use of fitting Model 2 is to test
whether the simpler Model 1 is adequate. Model 2, although more general than Model 1, has the
undesirable feature that the regression lines for different severity levels may cross. Often the
crossing is well outside the range of values of interest, so the model can be used to make
empirical risk estimates. The user has the option to add an additional parameter, $\gamma$, which
represents a hypothetical background concentration, in some circumstances (see page 28). For
s = 1,2,...,S,
Model 1
$$\Pr(Y \ge s \mid C, T) = H[\alpha_s + \beta_1 f_1(C) + \beta_2 f_2(T)]$$    Eq. 1a
Model 2
$$\Pr(Y \ge s \mid C, T) = H[\alpha_s + \beta_{1s} f_1(C) + \beta_{2s} f_2(T)]$$    Eq. 1b
The left-hand side is read as follows: the probability that a response of severity level s or greater
occurs, given that concentration is C and time is T (time refers to exposure duration). No
expression for s = 0 is included because this is the minimal category, and Y is always greater
than or equal to 0 (i.e., $\Pr(Y \ge 0 \mid C, T) = 1$). The right-hand side is described as follows:
H is a probability function taking values between 0 and 1, for which the user has three
choices: (1) logistic, (2) normal, and (3) Gumbel (described further in Appendix B).
The parameter $\alpha_s$ is the intercept for severity level s, s = 1,...,S (to be called the intercept or
severity parameters). The severity parameters are ordered as $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_S$. This
constraint is a consequence of the requirement that the probability of exceeding a lower
score is larger than the probability of exceeding a higher score for any fixed levels of C and T.
In Model 1, the parameter $\beta_1$ determines the dependence of the response on concentration
(C), whereas $\beta_2$ determines the dependence on time (T). In Model 2, the parameters are also
indexed by s because they may change values with severity level s.
Current choices for $f_1$ and $f_2$ are "untransformed" and "base-10 logarithm". Other
transformations of C and T may be obtained by transforming the input data.
Parameters are $\alpha_s$, $\beta_{1s}$ (to be called the coefficient of concentration), $\beta_{2s}$ (to be called the
coefficient of time or duration), for severity levels s running from 1 to S. All parameters may
be stratified on variables, as discussed under "Stratifying" in Section 4.6.
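The two models, and the crossing problem that can arise under Model 2, can be illustrated with a short calculation. The Python sketch below is a hypothetical example rather than CatReg code. It assumes a logistic H with base-10 logarithm transforms, and all parameter values are invented for illustration.

```python
import math

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probabilities(alphas, betas1, betas2, conc, dur):
    """Return [Pr(Y = 0), ..., Pr(Y = S)] from the cumulative probabilities.
    betas1[s-1] and betas2[s-1] are the coefficients for severity level s
    (Model 2); Model 1 is the special case in which all entries are equal."""
    cumulative = [1.0]                                   # Pr(Y >= 0)
    for a, b1, b2 in zip(alphas, betas1, betas2):
        cumulative.append(
            h_logistic(a + b1 * math.log10(conc) + b2 * math.log10(dur)))
    cumulative.append(0.0)                               # Pr(Y >= S + 1)
    return [cumulative[s] - cumulative[s + 1] for s in range(len(alphas) + 1)]

# Model 1: common slopes across severity levels (hypothetical values).
model1 = category_probabilities((-10.0, -12.0), (4.0, 4.0), (1.0, 1.0),
                                conc=1000.0, dur=1.0)

# Model 2: slopes differ by severity level, so the two lines cross at
# log10(C) = 2, i.e., C = 100 with these hypothetical values.
model2_low = category_probabilities((-10.0, -14.0), (4.0, 6.0), (1.0, 1.0),
                                    conc=10.0, dur=1.0)
model2_high = category_probabilities((-10.0, -14.0), (4.0, 6.0), (1.0, 1.0),
                                     conc=1000.0, dur=1.0)

print(model1)
print(model2_low)
print(model2_high)   # Pr(Y = 1) is negative beyond the crossing point
```

Below the crossing point the Model 2 probabilities are all valid. Above it the implied probability of severity level 1 becomes negative, which is why Model 2 should be used for risk estimates only within the range where the lines do not cross.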
The normal and logistic distributions are symmetric, each having median equal to zero. The
Gumbel distribution is skewed, with a lower tail similar to that of the logistic distribution and a
lighter upper tail. Figure 16 displays these three probability distributions. To compare the
shapes effectively, the distributions have been rescaled to have medians = 0 and equal 25th
percentiles (labeled as "EC25" on the horizontal axis). The scaled logistic and normal
distributions are very close for much of the range and differ substantively only in the extreme tails.
The Gumbel distribution is skewed, with a heavier tail on the left and a lighter tail on the right.
For each of the three choices of the probability function H there is an inverse function of H, called
the link function, that transforms it to a simple linear function in concentration and duration.
CatReg requests the name of the link function instead of the name of the probability function.
The corresponding link functions (in parentheses) are: logistic (logit), normal (probit), and
Gumbel (cloglog). Link functions are discussed further in Section 4.8 and in Appendix B, which
includes an example of how link functions may be derived from a basic assumption that the
ordinal severity score corresponds to exceeding an underlying toxic response threshold.
[Figure 16: the three probability functions plotted against x; the horizontal axis marks EC25 and 0. Legend: Normal (m=0, s=1); Logistic (m=0, s=.614); Gumbel (m=.281, s=.767).]
Figure 16. Normal, logistic, and Gumbel probability functions shifted and scaled to have equal medians
and 25th percentiles.
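The shift and scale values in the Figure 16 legend can be reproduced from the quantile (link) functions. The Python sketch below assumes that "m" and "s" in the legend denote a shift and a scale applied to the standard form of each distribution; the variable names are hypothetical.

```python
import math
from statistics import NormalDist

# Target: 25th percentile of the standard normal; its median is already 0.
target_q25 = NormalDist().inv_cdf(0.25)

# Logistic: the standard quantile function is logit(p), with median 0,
# so only a scale is needed.
logistic_scale = target_q25 / math.log(0.25 / 0.75)

# Gumbel form H(x) = 1 - exp(-e^x): the standard quantile function is
# cloglog(p) = log(-log(1 - p)); both a shift and a scale are needed.
g50 = math.log(-math.log(1.0 - 0.50))
g25 = math.log(-math.log(1.0 - 0.25))
gumbel_scale = target_q25 / (g25 - g50)
gumbel_shift = -gumbel_scale * g50

print(round(logistic_scale, 3), round(gumbel_shift, 3), round(gumbel_scale, 3))
```

The computed values agree with the legend to three decimal places.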
C.2 Extra Risk Concentration (ERC)
Extra risk at concentration C = c and time T = t, at severity level s, is defined as
$$\frac{\Pr(Y \ge s \mid C = c, T = t) - \Pr(Y \ge s \mid C = 0, T = t)}{1 - \Pr(Y \ge s \mid C = 0, T = t)}$$    Eq. 2
For q between 1 and 100, inclusive, ERCq, at time T = t, is the concentration c for which Eq. 2
equals q/100. For example, ERC10 at T = 2 (exposure duration of 2 hours) for severity level
1 is the value of c that satisfies
$$\frac{\Pr(Y \ge 1 \mid C = c, T = 2) - \Pr(Y \ge 1 \mid C = 0, T = 2)}{1 - \Pr(Y \ge 1 \mid C = 0, T = 2)} = 0.1$$    Eq. 3
In words, ERC10 at T = 2 for severity level s is the exposure concentration at which the
probability is 0.10 of an adverse effect of level s or higher due to exposure of two hours, i.e.,
given the adverse effect would not have occurred from other causes ("background causes")
during that time.
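For Model 1 the ERC can be found in closed form by applying the link function to the target probability. The Python sketch below is a hypothetical illustration, not CatReg's algorithm. It assumes a logistic H, untransformed concentration and duration (so that C = 0 is defined), a positive concentration coefficient, and invented parameter values.

```python
import math

def h_logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def exceed_prob(conc, dur, alpha_s, beta1, beta2):
    """Pr(Y >= s | C, T) under Model 1 with untransformed C and T."""
    return h_logistic(alpha_s + beta1 * conc + beta2 * dur)

def extra_risk(conc, dur, alpha_s, beta1, beta2):
    """Extra risk relative to zero concentration at the same duration."""
    p0 = exceed_prob(0.0, dur, alpha_s, beta1, beta2)
    return (exceed_prob(conc, dur, alpha_s, beta1, beta2) - p0) / (1.0 - p0)

def erc(q, dur, alpha_s, beta1, beta2):
    """Concentration at which the extra risk equals q/100.
    This closed form needs q < 100, because the logistic function reaches 1
    only in the limit."""
    p0 = exceed_prob(0.0, dur, alpha_s, beta1, beta2)
    target = p0 + (q / 100.0) * (1.0 - p0)
    return (logit(target) - alpha_s - beta2 * dur) / beta1

# Hypothetical parameters for severity level 1.
ALPHA_1, BETA_1, BETA_2 = -4.0, 0.01, 0.5

erc10 = erc(10, dur=2.0, alpha_s=ALPHA_1, beta1=BETA_1, beta2=BETA_2)
print(erc10, extra_risk(erc10, 2.0, ALPHA_1, BETA_1, BETA_2))
```

Substituting the computed concentration back into the extra risk definition returns 0.1, confirming that it satisfies the ERC10 condition for a 2-hour exposure.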