The ToxCast Analysis Pipeline: An R Package for Processing and Modeling Chemical Screening Data


Version 1.0
Dayne L. Filer, Parth Kothiya, Woodrow R. Setzer,
Richard S. Judson, Matthew T. Martin
December 8, 2015
Contents
Introduction
    Overview
    Package Settings
    Assay Structure
Register and Upload New Data
Data Processing and the tcplRun Function
Data Normalization
Single-concentration Screening
    Level 1
    Level 2
Multiple-concentration Screening
    Level 1
    Level 2
    Level 3
    Level 4
    Level 5
    Level 6
Appendix A: Field Explanation/Database Structure
    Single-concentration tables
    Multiple-concentration tables
    Auxiliary annotation tables
Appendix B: Level 0 Pre-processing
Appendix C: Burst Z-Score Calculation
Introduction
Overview
The tcpl package was developed to process high-throughput and high-content
screening data generated by the U.S. Environmental Protection Agency (EPA)
ToxCast™ program.1 ToxCast is screening thousands of chemicals with hun-
dreds of assays coming from numerous and diverse biochemical and cell-based
technology platforms. The diverse data, received in heterogeneous formats from
numerous vendors, are transformed to a standard computable format and loaded
into the tcpl database by vendor-specific R scripts. Once data is loaded into
the database, ToxCast utilizes the generalized processing functions provided in
this package to process, normalize, model, qualify, flag, inspect, and visualize
the data. While developed primarily for ToxCast, we have attempted to make
the tcpl package generally applicable to the chemical-screening community.
The tcpl package includes processing functionality for two screening paradigms:
(1) single-concentration screening and (2) multiple-concentration screening. Single-
concentration screening consists of testing chemicals at one concentration, often
for the purpose of identifying potentially active chemicals to test in the multiple-
concentration format. Multiple-concentration screening consists of testing chem-
icals across a concentration range, such that the modeled activity can give an
estimate of potency, efficacy, etc.
Prior to the pipeline processing provided in this package, all the data must go
through pre-processing (level 0). Level 0 pre-processing utilizes dataset-specific
R scripts to process the heterogeneous data into a uniform format and to load
the uniform data into the tcpl database. Level 0 pre-processing is outside the
scope of this package, but can be done for virtually any high-throughput or
high-content chemical screening effort, provided the resulting data includes the
minimum required information.
In addition to storing the data, the tcpl database stores every process-
ing/analysis decision at the assay component or assay endpoint level to facilitate
transparency and reproducibility. For the illustrative purposes of this vignette
we have included a SQLite version of the tcpl database containing a small subset
of data from the ToxCast program. Because of differences in database capabil-
ities, not all functionality of the package will work with the SQLite version.
To best utilize the package the user should work with a MySQL database and
the RMySQL package. The package includes a SQL file to initialize the MySQL
database on the user's server of choice. Additionally, the MySQL version of the
ToxCast database containing all the publicly available ToxCast data is available
for download at: .
Package Settings
First, it is highly recommended that users use the data.table package. The
tcpl package utilizes the data.table package for all data frame-like objects.
>	library(data.table)
>	library(tcpl)
>	## Store the path to the tcpl directory for loading data
>	pkg_dir <- system.file(package = "tcpl")
Every time the package is loaded in a new R session, a message similar to
the following will print showing the default package settings:
tcpl (v1.0) loaded with the following settings:
  TCPL_DB:    /usr/local/lib64/R/library/tcpl/sql/xmpl.sqlite
  TCPL_USER:  NA
  TCPL_HOST:  NA
  TCPL_DRVR:  SQLite
  TCPL_INT:   FALSE
Default settings stored in TCPL.conf. See ?tcplListOpts
or ?tcplSetOpts for more information.
The package consists of six settings: (1) $TCPL_DB points to the tcpl database
(either the SQLite file, as in the given example above, or the name of the MySQL
database), (2) $TCPL_USER stores the username for accessing the database, (3)
$TCPL_PASS stores the password for accessing the database, (4) $TCPL_HOST
points to the MySQL server host, (5) $TCPL_DRVR indicates which database
driver to use (either "MySQL" or "SQLite"), and (6) $TCPL_INT controls how
chemical information is accessed, and should always be FALSE. When TRUE, the
$TCPL_INT setting points to different queries to work with the internal EPA
database structure for fetching chemical information.
Refer to ?tcplSetOpts for more information. At any time users can check
the settings using tcplListOpts(). An example of database settings would be
as follows:
> tcplSetOpts(drvr = "MySQL",
              user = "root",
              pass = "",
              host = "localhost",
              db = "toxcastdb")
Notice in the tcplSetOpts example, the int parameter was not changed.
tcplSetOpts will only make changes to the parameters given.
The package always loads with the settings stored in the TCPL.conf file
located within the package directory. The user can edit the file so that the
package loads with the desired settings, rather than calling the tcplSetOpts
function every time. The TCPL.conf file must be edited again whenever the
package is updated or re-installed.
Assay Structure
The definition of an "assay" is, for the purposes of this package, broken into:
assay_source - the vendor/origination of the data
assay - the procedure to generate the component data
assay_component - the raw data readout(s)
assay_component_endpoint - the normalized component data
Each assay element is represented by a separate table in the tcpl database.
In general, we refer to an "assay_component_endpoint" as an "assay endpoint."
As we move down the hierarchy, each additional layer has a one-to-many rela-
tionship with the previous layer. For example, an assay component can have
multiple assay endpoints, but an assay endpoint can derive only from a single
assay component.
All processing occurs by assay component or assay endpoint, depending on
the processing type (single-concentration or multiple-concentration) and level.
No data are stored at the assay or assay source level. The assay and assay_source
tables store annotations to help in the processing and down-stream understand-
ing/analysis of the data. For more information about the assay annotations and
the ToxCast assays please refer to .
Throughout the package the levels of assay hierarchy are defined and refer-
enced by their primary keys (IDs) in the tcpl database: asid (assay source ID),
aid (assay ID), acid (assay component ID), and aeid (assay endpoint ID). In
addition, the package abbreviates the fields for the assay hierarchy names. The
abbreviations mirror the abbreviations for the IDs with "nm" in place of "id" in
the abbreviations, e.g. assay_component_name is abbreviated acnm.
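The abbreviation rule can be checked with a few lines of base R. This is only an illustration of the naming convention described above, not code from the package; the variable names ids and nms are made up for the example.

```r
## Illustrative only: derive the name-field abbreviations from the ID
## abbreviations by swapping the trailing "id" for "nm".
ids <- c("asid", "aid", "acid", "aeid")
nms <- sub("id$", "nm", ids)
names(nms) <- ids
print(nms)
```

The result pairs asid with asnm, aid with anm, acid with acnm, and aeid with aenm, which are the abbreviations used in the registration examples.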
Register and Upload New Data
This section explains how to register and upload new data into the tcpl database
using a small subset of ToxCast data showing changes in intracellular cortisol
hormone. The subset of data comes from an assay measuring steroidogenesis
through cellular levels of multiple steroid hormones.
The tcpl package provides three functions for adding new data: (1)
tcplRegister to register a new assay or chemical ID, (2) tcplUpdate to change
or add additional information for existing assay or chemical IDs, and (3)
tcplWriteLvl0 for loading data. Before writing any data to the tcpl database,
the user has to register the assay and chemical information.
The first step in registering new assays is to register the assay source. As
discussed in the previous section, the package refers to the levels of the assay
hierarchy by their ID names, e.g. asid for assay source. The following code shows
how to register an assay source, then confirm the assay source was properly
registered.
>	## Add a new assay source, call it CTox,
>	## that produced the data
>	tcplRegister(what = "asid", flds = list(asnm = "CTox"))
[1] TRUE
>	tcplLoadAsid()
   asid asnm
1:    1 CTox
The tcplRegister function takes the abbreviation for assay_source_name,
but the function will also take the unabbreviated form. The same is true of
the tcplLoadA- functions, which load the information for the assay annotations
stored in the database. The next steps show how to register, in order, an assay,
assay component, and assay endpoints.
> tcplRegister("aid",
               list(asid = 1,
                    anm = "Steroidogenesis",
                    assay_footprint = "96 well"))
[1] TRUE
When registering an assay (aid), the user must give an asid to map the assay
to the correct assay source. Registering an assay, in addition to an assay_name
(anm) and asid, requires assay_footprint. The assay_footprint field is used in
the assay plate visualization functions (discussed later) to define the appropriate
plate size. The assay_footprint field can take most string values, but only the
numeric value will be extracted, e.g. the text string "hello 384" would indicate
to draw a 384-well microtiter plate. Values containing multiple numeric values
in assay_footprint may cause errors in plotting plate diagrams.
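The behavior described for assay_footprint can be pictured with a small base R sketch. The helper footprint_wells is hypothetical and is not part of tcpl; how the package itself extracts the number is not shown in this document, so treat this as one plausible reading rather than the package's implementation.

```r
## Hypothetical helper (not part of tcpl): keep only the digits in an
## assay_footprint string and convert them to a number.
footprint_wells <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

footprint_wells("96 well")    # 96
footprint_wells("hello 384")  # 384

## Under this reading, two numbers in one string run together,
## which shows why such values are risky.
footprint_wells("96 or 384")  # 96384
```

Keeping a single number in the field, as in "96 well", avoids the ambiguity.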
With the assay registered, the next step is to register an assay component.
The example data presented here only contains data for one of the many steroids
measured and only requires one assay component, but at this step the user could
add multiple assay components to the "Steroidogenesis" assay.
> tcplRegister("acid", list(aid = 1, acnm = "CTox_CORT"))
[1] TRUE
> tcplRegister("aeid",
               list(acid = c(1, 1),
                    aenm = c("CTox_CORT_up", "CTox_CORT_dn"),
                    normalized_data_type =
                      rep("log2_fold_induction", 2),
                    export_ready = c(1, 1),
                    burst_assay = c(0, 0),
                    fit_all = c(0, 0)))
[1] TRUE
In the example above two assay endpoints were assigned to the assay
component. Multiple endpoints allow for different normalization approaches of the
data, in this case to detect activity in both the positive and negative
directions (up and down). Notice registering an assay endpoint also requires the
normalized_data_type field. The normalized_data_type field gives some default
values for plotting. Currently the package supports three normalized_data_type
values: (1) "percent_activity," (2) "log2_fold_induction," and (3) "log10_fold_induction."
Any other values will be treated as "percent_activity."
The three additional fields given when registering an assay endpoint do not
have to be explicitly defined when working in the MySQL environment and
will default to the values given above. All three fields represent Boolean values
(1 or 0, 1 being TRUE). The export_ready field indicates (1) the data is done
and ready for export or (0) still in progress. The burst_assay field is specific
to multiple-concentration processing and indicates (1) the assay endpoint is
included in the burst distribution calculation or (0) not (Appendix C). The
fit_all field is specific to multiple-concentration processing and indicates (1) the
package should try to fit every concentration series, or (0) only attempt to fit
concentration series that show evidence of activity (see Level 4 of the
Multiple-concentration Screening section).
The final piece of assay information needed is the assay component source
name (abbreviated acsn), stored in the "assay_component_map" table. The
assay component source name is intended to simplify level 0 pre-processing by
defining unique character strings (concatenating information if necessary) from
the source files that identify the specific assay components. The unique character
strings (acsn) get mapped to acid. An example of how to register a new acsn
will be given later in this section.
With the minimal assay information registered, the next step is to register
the necessary chemical and sample information. The "chdat.csv" file included in
the package contains the sample and chemical information for the data that will
be loaded. The following shows an example of how to load chemical information.
Similar to the order in registering assay information, the user must first register
chemicals, then register samples that map to the corresponding chemical.
> ch <- fread(file.path(pkg_dir, "sql", "chdat.csv"))
> head(ch)
        spid       casn                                  chnm
1: 01140000A 26172-55-4 5-Chloro-2-methyl-3(2H)-isothiazolone
2: 01140002A   109-43-3                  Dibutyl decanedioate
3: 01140004A   486-56-6                              Cotinine
4: 01140006A  2058-94-8              Perfluoroundecanoic acid
5: 01140008A   732-11-6                               Phosmet
6: 01140010A    89-83-8                                Thymol
> ## Register the unique chemicals
> tcplRegister("chid",
               ch[ , unique(.SD), .SDcols = c("casn", "chnm")])
[1] TRUE
The "chdat.csv" file contains a map of sample to chemical information, but
chemical and sample information have to be registered separately because a
chemical could potentially have multiple samples. Registering chemicals only
takes a chemical CAS registry number (casn) and name (chnm). In the above
example only the unique chemicals were loaded. The casn and chnm fields have
unique constraints; trying to register multiple chemicals with the same name or
CAS registry number is not possible and will result in an error. With the chem-
icals loaded the samples can be registered by mapping the sample ID (spid) to
chemical ID. Note, the user needs to load the chemical information to get the
chemical IDs then merge the new chemical IDs with the sample IDs from the
original file by chemical name or CASRN.
> cmap <- tcplLoadChem()
> tcplRegister("spid",
               merge(ch[ , list(spid, casn)],
                     cmap[ , list(casn, chid)],
                     by = "casn")[ , list(spid, chid)])
[1] TRUE
After registering the chemical and assay information the data can be loaded
into the tcpl database. The package includes two files from the ToxCast
program, "scdat.csv" and "mcdat.csv," with a subset of single- and multiple-
concentration data, respectively. The single- and multiple-concentration
processing require the same level 0 fields; more information about level 0
pre-processing is available in Appendix B.
> scdat <- fread(file.path(pkg_dir, "sql", "scdat.csv"))
> mcdat <- fread(file.path(pkg_dir, "sql", "mcdat.csv"))
> c(unique(scdat$acsn), unique(mcdat$acsn))
[1] "cort" "cort"
As discussed above, the final step before loading data is mapping the assay
component source name (acsn) to the correct acid. An assay component can
have multiple acsn values, but an acsn must be unique to one assay component.
Assay components can have multiple acsn values to minimize the amount of
data manipulation required (and therefore potential errors) during the level 0
pre-processing if assay source files change or are inconsistent. The example data
presented here only has one acsn value, "cort."
> tcplRegister("acsn", list(acid = 1, acsn = "cort"))
[1] TRUE
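The mapping rule described above (several acsn values may point to one acid, but each acsn belongs to exactly one acid) can be sketched in base R. The tables below are toy data: the alternate label "cortisol" and the object names acsn_map and dat are invented for illustration, and this is not how the package performs the lookup internally.

```r
## Hypothetical acsn-to-acid map; "cortisol" is an invented alternate label
acsn_map <- data.frame(acsn = c("cort", "cortisol"),
                       acid = c(1, 1),
                       stringsAsFactors = FALSE)

## Several acsn values may share an acid, but no acsn may appear twice
stopifnot(!anyDuplicated(acsn_map$acsn))

## Replace acsn with acid in a toy level 0 table
dat <- data.frame(acsn = c("cort", "cortisol", "cort"),
                  rval = c(16.13, 17.27, 25.87),
                  stringsAsFactors = FALSE)
dat$acid <- acsn_map$acid[match(dat$acsn, acsn_map$acsn)]
dat$acsn <- NULL
```

Both source labels resolve to acid 1, so inconsistent labels in the source files need no manual renaming during level 0 pre-processing.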
The data are now ready to be loaded with the tcplWriteLvl0 function.
> tcplWriteLvl0(dat = scdat, type = "sc")
> tcplWriteLvl0(dat = mcdat, type = "mc")
The type argument is used throughout the package to distinguish the type
of data/processing: "sc" indicates single-concentration; "mc" indicates multiple-
concentration. The tcplLoadData function can be used to load the data from
the database.
> tcplLoadData(lvl = 0, fld = "acid", val = 1, type = "sc")
      s0id      spid acid               apid rowi coli wllt
   1:    1 01140000A    1  TP0001059.Plate.8    4    6    t
   2:    2 01140000A    1  TP0001059.Plate.8    5    6    t
   3:    3 01140002A    1 TP0001061.Plate.14    6    9    t
   4:    4 01140002A    1 TP0001061.Plate.14    7    9    t
   5:    5 01140004A    1  TP0001059.Plate.5    4    3    t
  ---
4892: 4892  TX209150    1 TP0001061.Plate.13    5    6    t
4893: 4893  TX210870    1 TP0001061.Plate.14    4    6    t
4894: 4894  TX210870    1 TP0001061.Plate.14    5    6    t
4895: 4895  TX212325    1  TP0000885.Plate.1    2    6    t
4896: 4896  TX212325    1  TP0000885.Plate.1    3    6    t
      wllq  conc      rval                          srcf
   1:    1 100.1 16.130000  TP0001059 Plate 8_CeeTox.csv
   2:    1 100.1 17.270000  TP0001059 Plate 8_CeeTox.csv
   3:    1 100.1 25.870000 TP0001061 Plate 14_CeeTox.csv
   4:    1 100.1 24.160000 TP0001061 Plate 14_CeeTox.csv
   5:    1 100.0  7.670000  TP0001059 Plate 5_CeeTox.csv
  ---
4892:    1  10.0 18.620000 TP0001061 Plate 13_CeeTox.csv
4893:    1 100.0 28.370000 TP0001061 Plate 14_CeeTox.csv
4894:    1 100.0 28.440000 TP0001061 Plate 14_CeeTox.csv
4895:    1  10.0  7.961641       TP0000885 Plate 1A.xlsx
4896:    1  10.0  8.753819       TP0000885 Plate 1A.xlsx
Notice in the loaded data the acsn is replaced by the correct acid and
the s0id field is added. The "s#" fields, and corresponding "m#" fields in the
multiple-concentration data, are the primary keys for each level of data. These
primary keys link the various levels of data. All of the keys are auto-generated
and will change anytime data are reloaded or processed. Note, the primary keys
only change for the levels affected, e.g. if the user reprocesses level 1, the level
0 keys will remain the same.
Data Processing and the tcplRun Function
This section is intended to help the user understand the general aspects of how the
data is processed before diving into the specifics of each processing level for both
screening paradigms. The details of the two screening paradigms are provided
in later sections.
All processing in the tcpl package occurs at the assay component or assay
endpoint level. There is no capability within either screening paradigm to do
any processing which combines data from multiple assay components or assay
endpoints. Any combining of data must occur before or after the pipeline pro-
cessing. For example, a ratio of two values could be processed through the
pipeline if the user calculated the ratio during the level 0 pre-processing and
uploaded a single "component."
Once data are uploaded in the database, data processing occurs through the
tcplRun function for both single- and multiple-concentration screening. The
tcplRun function can either take a single ID (acid or aeid, depending on the
processing type and level) or an asid. If given an asid the tcplRun function will
attempt to process all corresponding components/endpoints. When processing
by acid or aeid, the user must know which ID to give for each level (Table 1).
The processing is sequential, and every level of processing requires successful
processing at the antecedent level. Any processing changes will cause a "delete
cascade," removing any subsequent data affected by the processing change to
ensure complete data fidelity at any given time. For example, processing level
3 data will cause the data from levels 4 through 6 to be deleted for the cor-
responding IDs. Changing any method assignments will also trigger a delete
cascade for any corresponding data (more on method assignments below).
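The scope of a delete cascade can be written down as a small function. This is a sketch of the rule as stated above, assuming six levels for multiple-concentration data; the helper cascade_levels is hypothetical and does not exist in the package.

```r
## Hypothetical helper: the levels cleared by a delete cascade after
## reprocessing level `lvl`, assuming `max_lvl` levels in the pipeline.
cascade_levels <- function(lvl, max_lvl = 6) {
  if (lvl >= max_lvl) integer(0) else seq(lvl + 1, max_lvl)
}

cascade_levels(3)               # 4 5 6
cascade_levels(6)               # nothing downstream to clear
cascade_levels(1, max_lvl = 2)  # single-concentration: level 2 only
```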
The user must give a start and end level when using the tcplRun function.
If processing more than one assay component or endpoint, the function will
not stop if one component or endpoint fails. If a component or endpoint fails
while processing multiple levels, the function will not attempt to process the
failed component/endpoint in subsequent levels. When finished processing, the
tcplRun function returns a list indicating the processing success of each id. For
each level processed the list will contain two elements: (1) "l#", a named Boolean
vector where TRUE indicates successful processing, and (2) "l#_failed", containing
the names of any ids that failed processing, where "#" is the processing level.
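A returned list of this shape can be inspected as follows. The object res is a hand-built mock that follows the description above, not real tcplRun output, and the element names "ACID1" to "ACID3" are assumptions made for the example.

```r
## Mock of the list shape described above; not real tcplRun output
res <- list(l1 = c(ACID1 = TRUE, ACID2 = FALSE, ACID3 = TRUE),
            l1_failed = "ACID2")

## IDs that passed level 1
passed <- names(res$l1)[res$l1]

## Cross-check the failed names against the Boolean vector
stopifnot(identical(names(res$l1)[!res$l1], res$l1_failed))
```

Checking the "l#_failed" element after each run is a quick way to find components or endpoints that need attention before moving to the next level.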
The processing functions print messages to the console indicating the four
steps of the processing. First, data for the given assay component ID are loaded,
the data are processed, data for the same ID in subsequent levels are deleted,
then the processed data is written to the database. The "outfile" parameter in
the tcplRun function gives the user the option of printing all of the output text
to a file.
The tcplRun function will attempt to use multiple processors on Unix-based
systems (does not include Windows). Depending on the system environment,
or if the user is running into memory constraints, the user may wish to use
less processing power and can do so by setting the "mc.cores" parameter in the
tcplRun function.
Table 1: Processing checklist
Type  Level  Input ID  Method ID
SC    Lvl 1  acid      aeid
SC    Lvl 2  aeid      aeid
MC    Lvl 1  acid      N/A
MC    Lvl 2  acid      acid
MC    Lvl 3  acid      aeid
MC    Lvl 4  aeid      N/A
MC    Lvl 5  aeid      aeid
MC    Lvl 6  aeid      aeid
The Input ID column indicates the ID used for each pro-
cessing step; Method ID indicates the ID used for assigning
methods for data processing, when necessary. SC = single-
concentration; MC = multiple-concentration.
The processing requirements vary by screening paradigm and level. Later
sections will cover the details, but in general, many of the processing steps
require specific methods to accommodate different experimental designs or data
processing approaches.
Notice from Table 1 that level 1 single-concentration processing (SC1)
requires an acid input, but the methods are assigned by aeid. The
same is true for MC3 processing. SC1 and MC3 are the normalization steps
and convert acid to aeid. (Only MC2 has methods assigned by acid.) The
normalization process is discussed in the following section.
To promote reproducibility, all method assignments must occur through the
database. Methods cannot be passed to either the tcplRun function or the
low-level processing functions called by tcplRun.
In general, method data are stored in the "_methods" and "_id" tables that
correspond to the data-storing tables. For example, the "sc1" table is
accompanied by the "sc1_methods" table which stores the available methods for SC1, and
the "sc1_aeid" table which stores the method assignments and execution order.
The tcpl package provides three functions for easily modifying and load-
ing the method assignments for the given assay components or endpoints: (1)
tcplAssignMthd allows the user to assign methods, (2) tcplClearMthd clears
method assignments, and (3) tcplLoadMthd queries the tcpl database and re-
turns the method assignments. The package also includes the tcplListMthd
function that queries the tcpl database and returns the list of available meth-
ods.
The following code blocks give some examples of how to use the method-related
functions.
> ## For illustrative purposes, assign level 2 MC methods to
> ## ACIDs 98, 99. First check for available methods.
> mthds <- tcplListMthd(lvl = 2, type = "mc")
> mthds[1:2]
   mc2_mthd_id mc2_mthd                    desc
1:           1     none apply no level 2 method
2:           2     log2       log2 all raw data
> ## Assign some methods to ACIDs 98 and 99
> tcplAssignMthd(lvl = 2,
                 id = 98:99,
                 mthd_id = c(3, 4, 2),
                 ordr = 1:3,
                 type = "mc")
Completed delete cascade for 0 ids (0.09 secs)
> tcplLoadMthd(lvl = 2, id = 98:99, type = "mc")
   acid   mthd mthd_id ordr
1:   98  rmneg       3    1
2:   98 rmzero       4    2
3:   98   log2       2    3
4:   99  rmneg       3    1
5:   99 rmzero       4    2
6:   99   log2       2    3
> ## Methods can be cleared one at a time for the given id(s)
> tcplClearMthd(lvl = 2, id = 98, mthd_id = 2, type = "mc")
Completed delete cascade for 0 ids (0.09 secs)
> tcplLoadMthd(lvl = 2, id = 98, type = "mc")
   acid   mthd mthd_id ordr
1:   98  rmneg       3    1
2:   98 rmzero       4    2
> ## Or all methods can be cleared for the given id(s)
> tcplClearMthd(lvl = 2, id = 98:99, type = "mc")
Completed delete cascade for 0 ids (0.09 secs)
> tcplLoadMthd(lvl = 2, id = 98:99, type = "mc")
Empty data.table (0 rows) of 4 cols: acid,mthd,mthd_id,ordr
Data Normalization
Data normalization occurs in both single- and multiple-concentration processing
at levels 1 and 3, respectively. While the two paradigms use different meth-
ods, the normalization approach is the same for both single- and multiple-
concentration processing. Data normalization does not have to occur within
the package, and normalized data can be loaded into the database at level 0.
However, data must be zero-centered and will only be fit in the positive
direction.
The tcpl package supports fold-change and percent-of-control approaches
to normalization. All data must be zero-centered so all fold-change data must be
log-transformed. Normalizing to a control requires three normalization methods:
(1) one to define the baseline value, (2) one to define the control value, and (3)
one to calculate percent of control ("resp.pc"). Normalizing to fold-change also
requires three methods: (1) one to define the baseline value, (2) one to calculate
the fold-change, and (3) one to log-transform the fold-change values. Methods
defining a baseline value (bval) have the "bval" prefix, methods defining the
control value (pval) have the "pval" prefix, and methods that calculate or modify
the final response value have the "resp" prefix. For example, "resp.log2" does a
log-transformation of the response value using a base value of 2. The formulae
for calculating the percent of control and fold-change response values are listed
in equations 1 and 2, respectively.
The percent of control and fold-change values, respectively:
\[
\mathit{resp} = \frac{\mathit{aval} - \mathit{bval}}{\mathit{pval} - \mathit{bval}} \cdot 100 \tag{1}
\]
\[
\mathit{resp} = \frac{\mathit{aval}}{\mathit{bval}} \tag{2}
\]
Order matters when assigning normalization methods. The bval, and pval
if normalizing as a percent of control, need to be calculated prior to calculating
the response value. Table 2 shows some possible normalization schemes.
Table 2: Example normalization method assignments.
Fold-change
Order  Example 1              Example 2              Example 3
1.     bval.apid.nwlls.med    bval.apid.lowconc.med  none
2.     resp.fc                resp.fc                resp.log10
3.     resp.log2              resp.log2              resp.blineshift.50.spid
4.     resp.multneg1
Percent of control
Order  Example 1              Example 2              Example 3
1.     bval.apid.lowconc.med  bval.spid.lowconc.med  none
2.     pval.apid.pwlls.med    pval.apid.mwlls.med    resp.multneg1
3.     resp.pc                resp.pc
4.     resp.multneg1
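The arithmetic behind these schemes can be traced on a toy plate in base R. All values below are invented, the raw value (rval) is treated as aval in equations 1 and 2, and pval is fixed by hand rather than computed from control wells. This follows the formulas in the text and does not call the package's normalization methods.

```r
## Toy plate: wells with wllt "n" define the baseline, "t" are test wells
plate <- data.frame(wllt = c("n", "n", "n", "t", "t"),
                    rval = c(10, 12, 11, 22, 5.5))

## Fold-change scheme: baseline, fold-change, then log2 transform
bval <- median(plate$rval[plate$wllt == "n"])   # 11
resp_fc <- log2(plate$rval / bval)

## Flip the sign so decreases are modeled in the positive direction
resp_dn <- -1 * resp_fc

## Percent-of-control scheme with an assumed control value (pval)
pval <- 33
resp_pc <- (plate$rval - bval) / (pval - bval) * 100
```

The well with a raw value of 22 has a log2 fold-change of 1 and sits at 50 percent of control; the well at 5.5 has a log2 fold-change of -1, which becomes 1 after the sign flip.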
If the data does not require any normalization the "none" method must be
assigned for normalization. The "none" method simply copies the input data
to the response field. Without assigning "none" the response field will not get
generated and the processing will not complete.
To reiterate, the package only models response in the positive direction.
Therefore, signal in the negative direction must be transformed to the positive
direction during normalization. Negative direction data are inverted by
multiplying the final response values by -1 (see the "resp.multneg1" methods in
Table 2).
In addition to the required normalization methods, the user can add addi-
tional methods to transform the normalized values. For example, the third fold-
change example in Table 2 includes "resp.blineshift.50.spid," which corrects for
baseline deviations by spid. A complete list of available methods, by processing
type and level, can be retrieved with tcplListMthd. More information is available
in the package documentation, and can be found by running ??tcpl::Methods.
As discussed in the Assay Structure section, an assay component
can have more than one assay endpoint. Creating multiple endpoints for one
component enables multiple normalization approaches. Multiple normalization
approaches may become necessary when the assay component detects signal in
both positive and negative directions.
Single-concentration Screening
This section will cover the tcpl process for handling single-concentration data2.
The goal of single-concentration processing is to identify potentially active com-
pounds from a broad screen at a single concentration. After the data is loaded
into the tcpl database, the single-concentration processing consists of 2 levels
(Table 3).
Table 3: Summary of the tcpl single-concentration pipeline
Level  Description
Lvl 0  Pre-processing: Vendor/dataset-specific pre-processing to organize
       heterogeneous raw data to the uniform format for processing
       by the tcpl package†
Lvl 1  Normalize: Apply assay endpoint-specific normalization listed in
       the "sc1_aeid" table to the raw data to define response
Lvl 2  Activity Call: Collapse replicates by median response, define the
       response cutoff based on methods in the "sc2_aeid" table, and
       determine activity
† Level 0 pre-processing is outside the scope of this package
Level 1
Level 1 processing converts the assay component to assay endpoint(s) and defines
the normalized-response value field (resp); logarithm-concentration field (logc);
and optionally, the baseline value (bval) and positive control value (pval) fields.
The purpose of level 1 is to normalize the raw values to either the percentage of
a control or to fold-change from baseline. The normalization process is discussed
in greater detail in the Data Normalization section.
Before beginning the normalization process, all wells with well quality (wllq)
equal to 0 are removed.
The first step in beginning the processing is to identify which assay endpoints
stem from the assay component(s) being processed.
> tcplLoadAeid(fld = "acid", val = 1)
   acid aeid         aenm
1:    1    1 CTox_CORT_up
2:    1    2 CTox_CORT_dn
2 This section assumes a working knowledge of the concepts covered in the Data
Processing and Data Normalization sections.
With the corresponding endpoints identified, the appropriate methods can
be assigned.
> tcplAssignMthd(lvl = 1,
                 id = 1:2,
                 mthd_id = c(1, 11, 13),
                 ordr = 1:3,
                 type = "sc")
Completed delete cascade for 2 ids (0.01 secs)
> tcplAssignMthd(lvl = 1,
                 id = 2,
                 mthd_id = 16,
                 ordr = 4,
                 type = "sc")
Completed delete cascade for 1 ids (0.01 secs)
Above, methods 1, 11, and 13 were assigned for both endpoints. The method
assignments instruct the processing to: (1) calculate bval for each assay plate ID
by taking the median of all data where the well type equals "n"; (2) calculate a
fold-change over bval; (3) log-transform the fold-change values with base 2. The
second method assignment (only for AEID 2) indicates to multiply all response
values by -1.
For a complete list of normalization methods see tcplListMthd(lvl = 1,
type = "sc") or ?SC1_Methods. With the assay endpoints and normalization
methods defined, the data are ready for level 1 processing.
> ## Do level 1 processing for acid 1
> sc1_res <- tcplRun(id = 1, slvl = 1, elvl = 1, type = "sc")
Loaded L0 ACID1 (4896 rows; 0.03 secs)
Processed L1 ACID1 (9708 rows; 1.17 secs)
Writing level 1 data for 1 ids...
Completed delete cascade for 2 ids (0.01 secs)
Writing level 1 complete. (0.09 secs)
Total processing time: 0.02 mins
Notice that level 1 processing takes an assay component ID, not an
assay endpoint ID, as the input ID. As mentioned previously, the user
must assign normalization methods by assay endpoint, then do the processing by
assay component. The level 1 processing will attempt to process all endpoints
in the database for a given component. If one endpoint fails for any reason
(e.g., does not have appropriate methods assigned), the processing for the entire
component fails.
Level 2
Level 2 processing defines the baseline median absolute deviation (bmad), col-
lapses any replicates by sample ID, and determines the activity.
Before the data are collapsed by sample ID, the bmad is calculated as the
median absolute deviation of all wells with well type equal to "t." The calculation
to define bmad is done once across the entire assay endpoint. If additional
data is added to the database for an assay component, the bmad values
for all associated assay endpoints will change. Note, this bmad definition
is different from the bmad definition used for multiple-concentration screening.
To collapse the data by sample ID, the median response value is calculated at
each concentration. The data are then further collapsed by taking the maximum
of those median values (max_med).
Once the data are collapsed, such that each assay endpoint-sample pair only
has one value, the activity is determined. For a sample to get an active hit-call,
the max_med must be greater than or equal to an efficacy cutoff. The efficacy
cutoff is determined by the level 2 methods. The efficacy cutoff value (coff) is
defined as the maximum of all values given by the assigned level 2 methods.
Failing to assign a level 2 method will result in every sample being called
active. For a complete list of level 2 methods see tcplListMthd(lvl = 2,
type = "sc") or ?SC2_Methods.
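The level 2 steps described above can be sketched in base R. This is a minimal illustration on made-up data, not the package's implementation: the helper name sc2_sketch and the toy data frame are hypothetical, the cutoff is passed in directly rather than derived from assigned methods, and mad() applies R's default scaling constant (1.4826), which is an assumption about how bmad is scaled.

```r
# Hypothetical helper; the column names (spid, wllt, conc, resp) mirror the
# fields described in the text, but this function is not part of tcpl.
sc2_sketch <- function(dat, coff) {
  # bmad: median absolute deviation of all test ("t") wells for the endpoint
  bmad <- mad(dat$resp[dat$wllt == "t"], na.rm = TRUE)
  tst <- dat[dat$wllt == "t", ]
  # Median response for each sample at each concentration
  med <- aggregate(resp ~ spid + conc, data = tst, FUN = median)
  # Collapse further: maximum of the medians for each sample (max_med)
  res <- aggregate(resp ~ spid, data = med, FUN = max)
  names(res)[names(res) == "resp"] <- "max_med"
  res$bmad <- bmad
  res$coff <- coff
  # Active when max_med is greater than or equal to the cutoff
  res$hitc <- as.integer(res$max_med >= res$coff)
  res
}

# Made-up data: two test samples and two neutral control wells
toy <- data.frame(
  spid = c("s1", "s1", "s2", "s2", "ctl", "ctl"),
  wllt = c("t", "t", "t", "t", "n", "n"),
  conc = c(10, 10, 10, 10, NA, NA),
  resp = c(0.40, 0.50, 0.05, 0.10, 0.00, 0.02)
)
sc2_toy <- sc2_sketch(toy, coff = log2(1.2))
print(sc2_toy)
```

In this toy case only sample s1 has a max_med at or above log2(1.2), so only s1 receives a hit-call of 1.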
>	## Assign a cutoff value of log2(1.2)
>	tcplAssignMthd(lvl = 2,
id = 1:2,
mthd_id = 3,
type = "sc")
	 R Output 	
Completed delete cascade for 2 ids (0.01 secs)
For the example data the cutoff value is log2(1.2). If the maximum median
value (max_med) is greater than or equal to the efficacy cutoff (coff), the sample
ID is considered active and the hit-call (hitc) is set to 1.
With the methods assigned, the level 2 processing can be completed.
	 R Input 	
>	## Do level 2 processing for aeids 1 and 2
>	sc2_res <- tcplRun(id = 1:2, slvl = 2, elvl = 2, type = "sc")
	 R Output 	
Loaded L1 AEID1 (4854 rows; 0.04 secs)
Processed L2 AEID1 (4854 rows; 0.12 secs)
Loaded L1 AEID2 (4854 rows; 0.04 secs)
Processed L2 AEID2 (4854 rows; 0.15 secs)
Writing level 2 data for 2 ids...
Completed delete cascade for 2 ids (0.01 secs)
Writing level 2 complete. (0.12 secs)
Total processing time: 0.01 mins
Multiple-concentration Screening
This section will cover the tcpl process for handling multiple-concentration
data3. The goal of multiple-concentration processing is to estimate the activity,
potency, efficacy, and other parameters for sample-assay pairs. After the data
is loaded into the tcpl database, the multiple-concentration processing consists
of six levels (Table 4).
Table 4: Summary of the tcpl multiple-concentration pipeline
Level  Description
Lvl 0  Pre-processing: Vendor/dataset-specific pre-processing to organize
       heterogeneous raw data to the uniform format for processing by the
       tcpl package†
Lvl 1  Index: Define the replicate and concentration indices to facilitate
       all subsequent processing
Lvl 2  Transform: Apply assay component-specific transformations listed in
       the "mc2_acid" table to the raw data to define the corrected data
Lvl 3  Normalize: Apply assay endpoint-specific normalization listed in
       the "mc3_aeid" table to the corrected data to define response
Lvl 4  Fit: Model the concentration-response data utilizing three objective
       functions: (1) constant, (2) hill, and (3) gain-loss
Lvl 5  Model Selection/Activity Call: Select the winning model, define
       the response cutoff based on methods in the "mc5_aeid" table, and
       determine activity
Lvl 6  Flag: Flag potential false positive and false negative findings based
       on methods in the "mc6_aeid" table
† Level 0 pre-processing is outside the scope of this package
Level 1
Level 1 processing defines the replicate and concentration index fields to
facilitate downstream processing. Because of cost, availability, physicochemical,
and technical constraints, screening-level efforts utilize numerous experimental
designs and test compound (sample) stock concentrations. The resulting data
may contain inconsistent numbers of concentrations, concentration values, and
technical replicates. To enable quick and uniform processing, level 1 processing
explicitly defines concentration and replicate indices, giving integer values
1...N to increasing concentrations and technical replicates, where 1 represents
the lowest concentration or first technical replicate.
3This section assumes a working knowledge of the concepts covered in the Data Processing
and Data Normalization sections (pages 12 and 16, respectively).
To assign replicate and concentration indices we assume one of two exper-
imental designs. The first design assumes samples are plated in multiple con-
centrations on each assay plate, such that the concentration series all falls on
a single assay plate. The second design assumes samples are plated in a single
concentration on each assay plate, such that the concentration series falls across
many assay plates.
For both experimental designs, data are ordered by source file (srcf), assay
plate ID (apid), column index (coli), row index (rowi), sample ID (spid),
and concentration (conc). Concentration is rounded to three significant figures
to correct for potential rounding errors. After ordering the data we create a
temporary replicate ID, identifying an individual concentration series. For test
compounds in experimental designs with the concentration series on a single
plate and all control compounds, the temporary replicate ID consists of the
sample ID, well type (wllt), source file, assay plate ID, and concentration. The
temporary replicate ID for test compounds in experimental designs with
concentration series that span multiple assay plates is defined similarly, but does
not include assay plate ID.
Once the data are ordered, and the temporary replicate ID is defined, the
data are scanned from top to bottom and the replicate index (repi) is incremented
every time a replicate ID is duplicated. Then, for each replicate, the
concentration index (cndx) is defined by ranking the unique concentrations, with the
lowest concentration starting at 1.
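The indexing logic can be illustrated with a small base R sketch for the single-plate design. The toy data frame and the variable rpid are hypothetical, and the code is only an approximation of the steps described above, not the code the package runs.

```r
# Hypothetical toy data: one sample plated twice at three concentrations
dat <- data.frame(
  srcf = "file1.csv",
  apid = "plate1",
  coli = c(1, 2, 3, 4, 5, 6),
  rowi = 1,
  spid = "s1",
  wllt = "t",
  conc = c(0.0821, 0.2469, 0.7407, 0.0821, 0.2469, 0.7407)
)

# Round concentration to three significant figures
dat$conc <- signif(dat$conc, 3)

# Order by source file, plate, column, row, sample, and concentration
dat <- dat[order(dat$srcf, dat$apid, dat$coli, dat$rowi, dat$spid, dat$conc), ]

# Temporary replicate ID (single-plate design): sample, well type,
# source file, plate, and concentration
rpid <- paste(dat$spid, dat$wllt, dat$srcf, dat$apid, dat$conc, sep = "_")

# Scanning top to bottom, the n-th occurrence of an ID gets repi = n
dat$repi <- ave(seq_along(rpid), rpid, FUN = seq_along)

# Within each replicate, rank the unique concentrations (lowest = 1)
dat$cndx <- ave(dat$conc, dat$spid, dat$repi,
                FUN = function(x) match(x, sort(unique(x))))
print(dat[, c("spid", "conc", "cndx", "repi")])
```

For the multi-plate design, the same sketch would apply with apid dropped from the pasted replicate ID.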
The following demonstrates how to carry out the level 1 processing and look
at the resulting data:
	 R Input 	
>	## Do level 1 processing for acid 1
>	mc1_res <- tcplRun(id = 1, slvl = 1, elvl = 1, type = "mc")
	 R Output 	
Loaded L0 ACID1 (7170 rows; 0.03 secs)
Processed L1 ACID1 (7170 rows; 0.24 secs)
Writing level 1 data for 1 ids...
Completed delete cascade for 2 ids (0.07 secs)
Writing level 1 complete. (0.12 secs)
Total processing time: 0.01 mins
With the processing complete, the resulting level 1 data can be loaded to
check the processing:
	 R Input 	
>	## Load the level 1 data and look at the cndx and repi values
>	mldat <- tcplLoadData(lvl = 1,
fld = "acid",
val = 1,
type = "mc")
>	mldat <- tcplPrepOtpt(mldat)
>	setkeyv(mldat, c("repi", "cndx"))
>	mldat[chnm == "3-Phenylphenol", list(chnm, conc, cndx, repi)]
	 R Output 	
              chnm   conc cndx repi
 1: 3-Phenylphenol  0.082    1    1
 2: 3-Phenylphenol  0.247    2    1
 3: 3-Phenylphenol  0.741    3    1
 4: 3-Phenylphenol  2.222    4    1
 5: 3-Phenylphenol  6.667    5    1
 6: 3-Phenylphenol 20.000    6    1
 7: 3-Phenylphenol  0.082    1    2
 8: 3-Phenylphenol  0.247    2    2
 9: 3-Phenylphenol  0.741    3    2
10: 3-Phenylphenol  2.222    4    2
11: 3-Phenylphenol  6.667    5    2
12: 3-Phenylphenol 20.000    6    2
3-Phenylphenol contains two replicates, each with six distinct concentrations.
The package also contains a tool for visualizing the data at the assay plate level.
In Figure 1 we see the results of tcplPlotPlate. The tcplPlotPlate function
can be used to visualize the data at levels 1 to 3. The row and column indices are
printed along the edge of the plate, with the values in each well represented by
color. While the plate does not give sample ID information, the letter/number
codes in the wells indicate the well type and concentration index, respectively.
The plate display also shows the wells with poor quality (as defined by the well
quality, wllq, field at level 0) with an "X." When plotting plates at subsequent
levels, wells with poor quality will appear empty. The title of the plate display
lists the assay component/assay endpoint and the assay plate ID (apid).
	 R Input 	
> tcplPlotPlate(dat = mldat, apid = "09Apr2014.Plate.17")
[Figure 1 image: assay plate diagram titled "ACID1 (CTox_CORT): 09Apr2014.Plate.17"]
Figure 1: An assay plate diagram. The color indicates the raw values according
to the key on the right. The bold lines on the key show the distribution of values
for the plate on the scale of values across the entire assay. The text inside each
well shows the well type and concentration index. For example, "t4" indicates a
test compound at the fourth concentration. The wells with an "X" have a well
quality of 0.
Level 2
Level 2 processing removes data where the well quality (wllq) equals 0 and
defines the corrected value (cval) field. Level 2 processing allows for any
transformation of the raw values at the assay component level. Examples of
transformation methods could range from basic logarithm transformations, to complex
spatial noise reduction algorithms. Currently the tcpl package only consists
of basic transformations, but could be expanded in future releases. Level 2
processing does not include normalization methods; normalization should occur
during level 3 processing.
For the example data used in this vignette, no transformations are necessary
at level 2. To not apply any transformation methods, assign the "none" method:
	 R Input 	
> tcplAssignMthd(lvl = 2,
id = 1,
mthd_id = 1,
ordr = 1,
type = "mc")
	 R Output 	
Completed delete cascade for 2 ids (0.07 secs)
Every assay component needs at least one transformation method assigned
to complete level 2 processing. With the method assigned, the processing can
be completed.
	 R Input 	
>	## Do level 2 processing for acid 1
>	mc2_res <- tcplRun(id = 1, slvl = 2, elvl = 2, type = "mc")
	 R Output 	
Loaded L1 ACID1 (7170 rows; 0.04 secs)
Processed L2 ACID1 (7053 rows; 0.01 secs)
Writing level 2 data for 1 ids...
Completed delete cascade for 2 ids (0.08 secs)
Writing level 2 complete. (0.12 secs)
Total processing time: 0 mins
For the complete list of level 2 transformation methods currently available,
see tcplListMthd(lvl = 2, type = "mc") or ?MC2_Methods for more detail.
The coding methodology used to implement the methods is beyond the scope of
this vignette, but, in brief, the method names in the database correspond to a
function name in the list of functions returned by mc2_mthds() (the mc2_mthds
function is not exported, and not intended for use by the user). Each of the
functions in the list given by mc2_mthds() returns only expression objects that
the processing function called by tcplRun executes in the local function
environment to avoid making additional copies of the data in memory. We encourage
suggestions for new methods.
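The general pattern of a method list whose functions return expressions can be sketched as follows. Everything here is hypothetical: toy_mthds and toy_process are stand-ins invented for illustration and do not reproduce the internal mc2_mthds() function, and the sketch uses a plain data frame, so it does not show the memory behavior of the actual package.

```r
# Hypothetical method list in the spirit of mc2_mthds(); these names and
# bodies are illustrative and are not the package's internal functions.
toy_mthds <- function() {
  list(
    none  = function() quote(dat$cval <- dat$rval),
    log10 = function() quote(dat$cval <- log10(dat$cval))
  )
}

# Hypothetical processing function: look up each assigned method by name
# and evaluate the returned expression in this function's own environment,
# so the expression acts on the local 'dat' object.
toy_process <- function(dat, mthd_names) {
  mthds <- toy_mthds()
  for (nm in mthd_names) {
    eval(mthds[[nm]]())
  }
  dat
}

raw <- data.frame(rval = c(1, 10, 100))
out <- toy_process(raw, c("none", "log10"))
print(out$cval)
```

Because each method only returns an unevaluated expression, the order of the assigned method names controls the order in which the transformations are applied.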
Level 3
Level 3 processing converts the assay component to assay endpoint(s) and defines
the normalized-response value field (resp); logarithm-concentration field (logc);
and optionally, the baseline value (bval) and positive control value (pval) fields.
The purpose of level 3 processing is to normalize the corrected values to either
the percentage of a control or to fold-change from baseline. The normalization
process is discussed in greater detail in the Data Normalization section (page
16). The processing aspect of level 3 is almost completely analogous to level
2, except the user has to be careful about using assay component versus assay
endpoint.
The user first needs to check which assay endpoints stem from the assay
component queued for processing.
	 R Input 	
>	## Look at the assay endpoints for acid 1
>	tcplLoadAeid(fld = "acid", val = 1)
	 R Output 	
   acid aeid         aenm
1:    1    1 CTox_CORT_up
2:    1    2 CTox_CORT_dn
With the corresponding assay endpoints listed, the normalization methods
can be assigned.
	 R Input 	
> tcplAssignMthd(lvl = 3,
id = 1:2,
mthd_id = c(17, 9, 7),
ordr = 1:3,
type = "mc")
	 R Output 	
Completed delete cascade for 2 ids (0.03 secs)
	 R Input 	
> tcplAssignMthd(lvl = 3,
id = 2,
mthd_id = 6,
ordr = 4,
type = "mc")
	 R Output 	
Completed delete cascade for 1 ids (0.02 secs)
Above, methods 17, 9, and 7 were assigned for both endpoints. The method
assignments instruct the processing to: (1) calculate bval for each assay plate ID
by taking the median of all data where the well type equals "n" or the well type
equals "t" and the concentration index is 1 or 2; (2) calculate a fold-change over
bval; (3) log-transform the fold-change values with base 2. The second method
assignment (only for AEID 2) tells the processing to multiply all response values
by -1.
For a complete list of normalization methods see tcplListMthd(lvl = 3,
type = "mc") or ?MC3_Methods. With the assay endpoints and normalization
methods defined, the data are ready for level 3 processing.
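The three normalization steps, plus the sign flip for the second endpoint, can be sketched in base R. The data below are invented for a single plate, and the code is an illustration of the arithmetic only, not the package's level 3 methods.

```r
# Hypothetical corrected values (cval) for a single assay plate
dat <- data.frame(
  apid = "plate1",
  wllt = c("n", "n", "t", "t", "t", "t"),
  cndx = c(NA, NA, 1, 2, 3, 4),
  cval = c(100, 110, 90, 100, 200, 400)
)

# (1) bval per plate: median of neutral-control wells and test wells at
#     concentration index 1 or 2
base_well <- dat$wllt == "n" | (dat$wllt == "t" & dat$cndx %in% 1:2)
bval <- tapply(dat$cval[base_well], dat$apid[base_well], median)
dat$bval <- as.numeric(bval[dat$apid])

# (2) fold-change over bval, then (3) log2 transform
dat$resp <- log2(dat$cval / dat$bval)

# Extra step for the "down" endpoint only: multiply the response by -1
resp_dn <- -1 * dat$resp
print(round(dat$resp, 3))
```

With a baseline of 100, the wells at 200 and 400 normalize to responses of 1 and 2 for the "up" endpoint and to -1 and -2 for the "down" endpoint.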
	 R Input 	
>	## Do level 3 processing for acid 1
>	mc3_res <- tcplRun(id = 1, slvl = 3, elvl = 3, type = "mc")
	 R Output 	
Loaded L2 ACID1 (7053 rows; 0.06 secs)
Processed L3 ACID1 (AEIDS: 1, 2; 14106 rows; 3.09 secs)
Writing level 3 data for 1 ids...
Completed delete cascade for 2 ids (0.05 secs)
Writing level 3 complete. (0.5 secs)
Total processing time: 0.06 mins
Notice that level 3 processing takes an assay component ID, not
an assay endpoint ID, as the input ID. As mentioned in previous sections,
the user must assign normalization methods by assay endpoint, then do the
processing by assay component. The level 3 processing will attempt to process
all endpoints in the database for a given component. If one endpoint fails for
any reason (e.g., does not have appropriate methods assigned), the processing
for the entire component fails.
Level 4
Level 4 processing splits the data into concentration series by sample and assay
endpoint, then models the activity of each concentration series. Activity is
modeled only in the positive direction. More information on readouts with both
directions is available in the previous section.
The first step in level 4 processing is to remove the well types with only one
concentration. To establish the noise-band for the assay endpoint, the baseline
median absolute deviation (bmad) is calculated as the median absolute deviation
of the response values for test compounds where the concentration index equals
1 or 2. The calculation to define bmad is done once across the entire assay
endpoint. If additional data is added to the database for an assay
component, the bmad values for all associated assay endpoints will
change. Note, this bmad definition is different from the bmad definition used
for single-concentration screening.
Before the model parameters are estimated, a set of summary values are
calculated for each concentration series: the minimum and maximum response;
minimum and maximum log concentration; the number of concentrations, points,
and replicates; the maximum mean and median with the concentration at which
they occur; and the number of medians greater than 3bmad. When referring to
the concentration series the "mean" and "median" values are defined as the mean
or median of the response values at every concentration. In other words, the
maximum median is the maximum of all median values across the concentration
series.
Concentration series must have at least four concentrations to enter the
fitting algorithm. By default, concentration series must additionally have at
least one median value greater than 3bmad to enter the fitting algorithm. The
median value above 3bmad requirement can be ignored by setting fit_all to 1 in
the assay endpoint annotation.
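The baseline noise estimate and the two entry conditions can be sketched in base R on invented data. The column n_above and the object names are hypothetical, and mad() is used with R's default scaling constant (1.4826), which is an assumption about how the package scales bmad.

```r
# Hypothetical level 3 responses: two samples, five concentrations each,
# two replicates per concentration
dat <- data.frame(
  spid = rep(c("s1", "s2"), each = 10),
  cndx = rep(rep(1:5, each = 2), times = 2),
  resp = c(-0.1, 0.1, 0.0, 0.2, 0.1, 0.3, 1.9, 2.1, 2.0, 2.2,
           -0.2, 0.0, -0.1, 0.1, 0.0, 0.1, -0.1, 0.2, 0.0, 0.2)
)

# bmad: MAD of test-compound responses at concentration index 1 or 2,
# computed once across the whole endpoint
bmad <- mad(dat$resp[dat$cndx %in% 1:2])

# Median response at every concentration of each series
med <- aggregate(resp ~ spid + cndx, data = dat, FUN = median)

# Per-series summary: number of concentrations, maximum median, and the
# number of medians above 3 * bmad
smry <- do.call(rbind, lapply(split(med, med$spid), function(s) {
  data.frame(spid = s$spid[1],
             nconc = length(unique(s$cndx)),
             max_med = max(s$resp),
             n_above = sum(s$resp > 3 * bmad))
}))

# Gate into the fitting step; fit_all = 1 would waive the 3 * bmad rule
fit_all <- 0
smry$fit <- smry$nconc >= 4 & (fit_all == 1 | smry$n_above >= 1)
print(smry)
```

Here sample s1 has two medians above 3bmad and would be fit, while s2 stays within the noise band and would only be fit if fit_all were set to 1.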
All models draw from the Student's t-distribution with four degrees of
freedom. The wider tails in the t-distribution diminish the influence of outlier
values, and produce more robust estimates than does the more commonly used
normal distribution. The robust fitting removes the need for any outlier
elimination before fitting. The fitting algorithm uses maximum likelihood to estimate
parameters for three models, as defined below in equations 3 through 16.
Let t(z, ν) be the Student's t-distribution with ν degrees of freedom, y_i be
the observed response at the ith observation, and μ_i be the estimated response
at the ith observation. We calculate z_i as
$$z_i = \frac{y_i - \mu_i}{\exp(\sigma)}, \qquad (3)$$
where σ is the scale term.
4The AC50 is the activity concentration at 50%, or the concentration where the modeled
activity equals 50% of the top asymptote.
with the constraints
$$0 < tp < 1.2 \max \mathit{resp}, \qquad (7)$$
$$\min \mathit{logc} - 2 < ga < \max \mathit{logc} + 0.5, \qquad (8)$$
and
$$0.3 < gw < 8. \qquad (9)$$
The third model in the fitting algorithm is a constrained gain-loss model
(gnls), defined as a product of two Hill models, with a shared top asymptote
and both bottom asymptote values equal to 0. Including the scale term, the
gain-loss model has six parameters. Let tp be the shared top asymptote, ga be
the AC50 in the gain direction, gw be the Hill coefficient in the gain direction, la
be the AC50 in the loss direction, lw be the Hill coefficient in the loss direction,
and x_i be the log concentration at the ith observation. Then μ_i for the gain-loss
model is given by
$$\mu_i = tp \left( \frac{1}{1 + 10^{(ga - x_i) gw}} \right) \left( \frac{1}{1 + 10^{(x_i - la) lw}} \right), \qquad (10)$$
with the constraints
$$0 < tp < 1.2 \max \mathit{resp}, \qquad (11)$$
$$\min \mathit{logc} - 2 < ga < \max \mathit{logc}, \qquad (12)$$
$$0.3 < gw < 8, \qquad (13)$$
$$\min \mathit{logc} - 2 < la < \max \mathit{logc} + 2, \qquad (14)$$
$$0.3 < lw < 18, \qquad (15)$$
and
$$la - ga > 0.25. \qquad (16)$$
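The model forms and the robust likelihood can be written compactly in base R. This is a sketch under stated assumptions, not the package's fitting code: the Hill form is taken to be the gain factor of the gain-loss model alone, the log-likelihood is the standard one for a scaled t-distribution (log density of z minus the log scale), and the function names and example values are hypothetical. The flat line at zero is included only as a simple comparison.

```r
# Hill model with the bottom asymptote fixed at 0 (assumed form: the gain
# factor of the gain-loss model)
hill_mu <- function(x, tp, ga, gw) {
  tp / (1 + 10^((ga - x) * gw))
}

# Gain-loss model: product of a gain Hill term and a loss Hill term
gnls_mu <- function(x, tp, ga, gw, la, lw) {
  gn <- 1 / (1 + 10^((ga - x) * gw))
  ls <- 1 / (1 + 10^((x - la) * lw))
  tp * gn * ls
}

# Log-likelihood under t-distributed errors with 4 degrees of freedom;
# 'sigma' is the log of the scale, so each term subtracts sigma
t_loglik <- function(y, mu, sigma, df = 4) {
  z <- (y - mu) / exp(sigma)
  sum(dt(z, df = df, log = TRUE) - sigma)
}

# Made-up concentration series that follows a Hill curve plus small noise
logc <- log10(c(0.1, 0.3, 1, 3, 10, 30))
mu <- hill_mu(logc, tp = 1.5, ga = 0, gw = 1.2)
y <- mu + c(0.05, -0.03, 0.02, -0.04, 0.01, 0.03)

ll_hill <- t_loglik(y, mu, sigma = log(0.05))
ll_flat <- t_loglik(y, rep(0, length(y)), sigma = log(0.05))
print(c(hill = ll_hill, flat = ll_flat))
```

In an actual fit the parameters, including sigma, would be chosen by a constrained optimizer to maximize this log-likelihood within the bounds listed above.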
Level 4 does not utilize any assay endpoint-specific methods; the user only
needs to run the tcplRun function. Level 4 processing and all subsequent
processing is done by assay endpoint, not assay component. The pre-
vious section showed how to find the assay endpoints for an assay component
using the tcplLoadAeid function. The example dataset includes two assay end-
points with aeid values of 1 and 2.
>	## Do level 4 processing for aeids 1 and 2
>	mc4_res <- tcplRun(id = 1:2, slvl = 4, elvl = 4, type = "mc")
	 R Output 	
Loaded L3 AEID1 (6306 rows; 0.24 secs)
Processed L4 AEID1 (6306 rows; 17.48 secs)
Loaded L3 AEID2 (6306 rows; 0.15 secs)
Processed L4 AEID2 (6306 rows; 30.07 secs)
Writing level 4 data for 2 ids...
Completed delete cascade for 2 ids (0.02 secs)
Writing level 4 complete. (0.18 secs)
Total processing time: 0.8 mins
The level 4 data include 52 variables, including the ID fields. A complete list
of level 4 fields is available in Appendix A. The level 4 data include the fields
cnst, hill, and gnls indicating the convergence of the model where a value of 1
means the model converged and a value of 0 means the model did not converge.
N/A values indicate the fitting algorithm did not attempt to fit the model. cnst
will be N/A when the concentration series had fewer than four concentrations; hill
and gnls will be N/A when none of the medians were greater than or equal
to 3bmad. Similarly, the hcov and gcov fields indicate the success in inverting
the Hessian matrix. Where the Hessian matrix did not invert, the parameter
standard deviation estimates will be N/A. NaN values in the parameter
standard deviation fields indicate the covariance matrix was not positive definite. In
Figure 2 the hill field is used to find potentially active compounds to visualize
with the tcplPlotM4ID function.
	 R Input 	
>	## Load the level 4 data
>	m4dat <- tcplLoadData(4, type = "mc")
>	## List the first m4ids where the hill model converged
>	## for AEID 1
>	m4dat[hill == 1 & aeid == 1, head(m4id)]
	 R Output 	
[1] 3 15 16 19 21 34
The model summary values in Figure 2 include Akaike Information Criterion
(AIC), probability, and the root mean square error (RMSE). Let log(L(θ, y))
be the log-likelihood of the model θ given the observed values y, and K be the
number of parameters in θ, then,
$$\mathrm{AIC} = -2 \log(\mathcal{L}(\theta, y)) + 2K. \qquad (17)$$
>	## Plot a fit for m4id 686
>	tcplPlotM4ID(m4id = 686, lvl = 4)
[Figure 2 plot: response versus Concentration (μM) for AEID2 (CTox_CORT_dn), Norgestrel, M4ID 686, with a summary panel of model parameters and fit statistics]
Figure 2: An example level 4 plot for a single concentration series. The orange
dashed line shows the constant model, the red dashed line shows the Hill model,
and the blue dashed line shows the gain-loss model. The gray striped box shows
the baseline region, 0 ± 3bmad. The summary panel shows assay endpoint and
sample information, the parameter values (val) and standard deviations (sd) for
the Hill and gain-loss models, and summary values for each model.
The probability, w_i, is defined as the weight of evidence that model i is the best
model, given that one of the models must be the best model. Let Δ_i be the
difference AIC_i - AIC_min for the ith model. If R is the set of models, then w_i
is given by
$$w_i = \frac{\exp\left(-\frac{1}{2}\Delta_i\right)}{\sum_{r=1}^{R} \exp\left(-\frac{1}{2}\Delta_r\right)}. \qquad (18)$$
The RMSE is given by
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu_i)^2}{N}}, \qquad (19)$$
where N is the number of observations, and μ_i and y_i are the estimated and
observed values at the ith observation, respectively.
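The three summary values follow directly from their definitions and can be checked with a few lines of base R. The function names are hypothetical, and the AIC values used below (13.65, -0.44, and 3.56 for the constant, Hill, and gain-loss models) are illustrative inputs only.

```r
# AIC from a log-likelihood and a parameter count
aic <- function(loglik, k) -2 * loglik + 2 * k

# Akaike weights: relative support for each model, summing to 1
aic_weights <- function(aic_vals) {
  delta <- aic_vals - min(aic_vals)
  w <- exp(-0.5 * delta)
  w / sum(w)
}

# Root mean square error between observed and estimated values
rmse <- function(y, mu) sqrt(sum((y - mu)^2) / length(y))

aic_vals <- c(cnst = 13.65, hill = -0.44, gnls = 3.56)
w <- aic_weights(aic_vals)
print(round(w, 2))
```

Rounded to two decimals, these inputs give weights of 0, 0.88, and 0.12, so the Hill model carries most of the evidence in this example.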
Level 5
Level 5 processing determines the winning model and activity for the concen-
tration series, bins all of the concentration series into categories, and calculates
additional point-of-departure estimates based on the activity cutoff.
The model with lowest AIC value is selected as the winning model
(modl), and is used to determine the activity or hit-call for the concentration
series. If two models have equal AIC values, the simpler model (the model with
fewer parameters) wins the tie. All of the parameters for the winning model
are stored at level 5 with the prefix "modl_" to facilitate easier queries. For a
concentration series to get an active hit-call, either the Hill or gain-loss model
must be selected as the winning model. In addition to selecting the Hill or
gain-loss model, the modeled and observed response must meet an efficacy cutoff.
The efficacy cutoff is defined by the level 5 methods. The efficacy cutoff value
(coff) is defined as the maximum of all values given by the assigned level 5 meth-
ods. Failing to assign a level 5 method will result in every concentration series
being called active. For a complete list of level 5 methods see tcplListMthd(lvl
= 5) or ?MC5_Methods.
	 R Input 	
>	## Assign a cutoff value of bmad*6
>	tcplAssignMthd(lvl = 5,
id = 1:2,
mthd_id = 6,
type = "mc")
	 R Output 	
Completed delete cascade for 2 ids (0.01 secs)
For the example data the cutoff value is 6bmad. If the Hill or gain-loss
model wins, and the estimated top parameter for the winning model (modl_tp)
and the maximum median value (max_med) are both greater than or equal to
the efficacy cutoff (coff), the concentration series is considered active and the
hit-call (hitc) is set to 1.
The hit-call can be 1, 0, or -1. A hit-call of 1 or 0 indicates the concentration
series is active or inactive, respectively, according to the analysis; a hit-call of
-1 indicates the concentration series had fewer than four concentrations.
For active concentration series, two additional point-of-departure estimates
are calculated for the winning model: (1) the activity concentration at
baseline (ACB or modl_acb) and (2) the activity concentration at cutoff (ACC or
modl_acc). The ACB and ACC are defined as the concentration where the
estimated model value equals 3bmad and the cutoff, respectively. The point-of-
departure estimates are summarized in Figure 3.
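The hit-call rule and the two point-of-departure estimates can be sketched for a Hill winner in base R. The inputs are invented, hill_inv is a hypothetical helper, the Hill form tp / (1 + 10^((ga - x) * gw)) is an assumption carried over from the gain factor of the gain-loss model, and the results are on the log10 concentration scale.

```r
# Hypothetical level 5 inputs for one concentration series
bmad <- 0.164
coff <- 6 * bmad                      # cutoff from a 6 * bmad method
fit <- list(modl = "hill", tp = 1.5, ga = 0.5, gw = 1.2)
max_med <- 1.3
nconc <- 6

# Hit-call: -1 for too few concentrations, 1 when a Hill or gain-loss
# winner has both its top and max_med at or above the cutoff, otherwise 0
hitc <- if (nconc < 4) {
  -1L
} else if (fit$modl %in% c("hill", "gnls") &&
           fit$tp >= coff && max_med >= coff) {
  1L
} else {
  0L
}

# Invert the assumed Hill form to find the log10 concentration at which
# the model reaches a response y (requires 0 < y < tp)
hill_inv <- function(y, tp, ga, gw) ga - log10(tp / y - 1) / gw

acb <- hill_inv(3 * bmad, fit$tp, fit$ga, fit$gw)  # at baseline, 3 * bmad
acc <- hill_inv(coff, fit$tp, fit$ga, fit$gw)      # at the cutoff
print(c(hitc = hitc, acb = acb, acc = acc))
```

Because 3bmad is below the cutoff in this example, the ACB falls at a lower concentration than the ACC, matching the ordering described for Figure 3.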
Figure 3: The point-of-departure estimates calculated by the tcpl package. The
shaded rectangle represents the baseline region, 0 ± 3bmad. The dark striped
line represents the efficacy cutoff (coff). The vertical lines show where the point-
of-departure estimates are defined: the red line shows the ACB, the yellow line
shows the ACC, and the blue line shows the AC50.
All concentration series fall into a single fit category (fitc), defined by the
leaves on the tree structure in Figure 4. Concentration series in the same cate-
gory will have similar characteristics, and often look very similar. Categorizing
all of the series enables faster quality control checking and easier identification
of potential false results. The first split differentiates series by hit-call. Series
with a hit-call of -1 go into fit category 2. The following two paragraphs will
outline the logic for the active and inactive branches.
The first split in the active branch differentiates series by the model winner,
Hill or gain-loss. For each model, the next split is defined by the efficacy of its
top parameter in relation to the cutoff. The top value is either less than 1.2coff
or greater than or equal to 1.2coff. Finally, series on the active branch go into
leaves based on the position of the AC50 parameter in relation to the tested
concentration range. For comparison purposes, the activity concentration at
95% (AC95) is calculated, but not stored.5 Series with AC50 values less than
the minimum concentration tested (logc_min) go into the "<=" leaves, series
with AC50 values greater than the minimum tested concentration and AC95
values less than maximum tested concentration (logc_max) go into the "=="
leaves, and series with AC95 values greater than the maximum concentration
tested go into the ">=" leaves.
5 Any activity concentration value or estimated model values for a given concentration can
be calculated using the tcplACXX and tcplACVal functions, respectively.
The inactive branch is first divided by whether any median values were
greater than or equal to 3bmad. Series with no evidence of activity go into fit
category 4. Similar to the active branch, series with evidence for activity are
separated by the model winner. The Hill and gain-loss portions of the inactive
branch follow the same logic. First, series diverge by the efficacy of their top
parameter in relation to the cutoff: modl_tp < 0.8coff or modl_tp >= 0.8coff.
Then the same comparison is made on the top values of the losing model. If
the losing model did not converge, then the series go into the "DNC" category.
If the losing model top value is greater than or equal to 0.8coff, then the series
are split based on whether the losing model top surpassed the cutoff. On the
constant model branch, if neither top parameter is greater than or equal to
0.8coff, then the series goes into fit category 7. If one of the top parameters is
greater than or equal to 0.8coff, the series goes into fit category 9 or 10 based
on whether one of the top values surpassed the cutoff.
With the level 5 methods assigned, the data are ready for level 5 processing:
	 R Input 	
>	## Do level 5 processing for aeids 1 and 2
>	mc5_res <- tcplRun(id = 1:2, slvl = 5, elvl = 5, type = "mc")
	 R Output 	
Loaded L4 AEID1 (524 rows; 0.03 secs)
Processed L5 AEID1 (524 rows; 0.12 secs)
Loaded L4 AEID2 (524 rows; 0.03 secs)
Processed L5 AEID2 (524 rows; 0.11 secs)
Writing level 5 data for 2 ids...
Completed delete cascade for 2 ids (0.01 secs)
Writing level 5 complete. (0.05 secs)
Total processing time: 0.01 mins
Figure 5 shows an example of a concentration series in fit category 37,
indicating the series is active and the Hill model won with a top value less than or
equal to 1.2coff, and an AC50 value within the tested concentration range. The
tcplPlotFitc function shows the distribution of concentration series across the
fit category tree (Figure 6).
	 R Input 	
> tcplPlotM4ID(m4id = 370, lvl = 5)
[Figure 5 plot: response versus Concentration (μM) for AEID1 (CTox_CORT_up), Acid Orange 156, M4ID 370; summary panel reports COFF 0.984, HIT-CALL 1, FITC 37, ACTP 1]
Figure 5: An example level 5 plot for a single concentration series. The solid
line and model highlighting indicate the model winner. The horizontal line
shows the cutoff value. In addition to the information from the level 4 plots,
the summary panel includes the cutoff (coff), hit-call (hitc), fit category (fitc)
and activity probability (actp) values.
The distribution in Figure 6 shows 24-40 concentration series fell into fit
category 21. Following the logic discussed previously, fit category 21 indicates an
inactive series where the Hill model was selected, the top asymptote for the Hill
model was greater than 0.8coff, and the gain-loss top asymptote was greater
than or equal to the cutoff. The series in fit category 21 can be found easily in
the level 5 data.
> head(m5dat[fitc == 21,
list(m4id, hill_tp, gnls_tp,
max_med, coff, hitc)])
   m4id   hill_tp  gnls_tp   max_med      coff hitc
1:    3 1.1483868 1.148420 0.9419547 0.9836658    0
2:   21 1.2195154 1.219515 0.9644619 0.9836658    0
3:   45 1.0100353 1.010033 0.7313420 0.9836658    0
4:   46 1.0853209 1.085321 0.8303106 0.9836658    0
5:  125 0.9852517 1.008044 0.8201074 0.9836658    0
6:  174 0.9736302 1.107837 0.8692471 0.9836658    0
	 R Input 	
>	m5dat <- tcplLoadData(lvl = 5, type = "mc")
>	tcplPlotFitc(fitc = m5dat$fitc)
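[Figure 6 plot: fit category tree with circles sized and colored by the number of concentration series; legend bins 1-3, 4-5, 6-8, 9-14, 15-23, 24-40, 41-67, 68+]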
Figure 6: The distribution of concentration series by fit category for the example
data. Both the size and color of the circles indicate the number of concentration
series. The legend gives the range for number of concentration series by color.
The plot in Figure 7 shows a concentration series in fit category 21. In
the example given by Figure 7, the hill_tp and gnls_tp parameters are equal
and greater than coff; however, the maximum median value (max_med) is not
greater than the cutoff, making the series inactive.
	 R Input 	
> tcplPlotM4ID(m4id = 45, lvl = 5)
[Figure 7 plot: response versus Concentration (μM) for AEID1 (CTox_CORT_up), 4-tert-Butylphenol, M4ID 45; summary panel reports MAX_MED 0.731, COFF 0.984, HIT-CALL 0, FITC 21]
Figure 7: Level 5 plot for m4id 45 showing an example series in fit category 21.
Level 6
Level 6 processing uses various methods to identify concentration series with
etiologies that may suggest false positive/false negative results or explain
apparent anomalies in the data. Each flag is defined by a level 6 method that
has to be assigned to each assay endpoint. Similar to level 5, an assay endpoint
does not need any level 6 methods assigned to complete processing.
	 R Input 	
> tcplAssignMthd(lvl = 6,
id = 1:2,
mthd_id = c(6:8, 10:12, 15:16),
type = "mc")
	 R Output 	
Completed delete cascade for 2 ids (0.01 secs)
	 R Input 	
> tcplLoadMthd(lvl = 6, id = 1, type = "mc")
	 R Output 	
   aeid              mthd mthd_id nddr
1:    1 singlept.hit.high       6    0
2:    1  singlept.hit.mid       7    0
3:    1    multipoint.neg       8    0
4:    1             noise      10    0
5:    1        border.hit      11    0
6:    1       border.miss      12    0
7:    1      gnls.lowconc      15    0
8:    1       overfit.hit      16    0
The example above assigns the most common flags. Some of the available
flags only apply to specific experimental designs and do not apply to all data.
For a complete list of normalization methods see tcplListMthd(lvl = 6) or
?MC6_Methods.
The additional nddr field in the "mc6_methods" table (and the output from
tcplLoadMthd()/tcplListMthd() for level 6) indicates whether the method
requires additional data. Methods with an nddr value of 0 only require the
modeled/summary information from levels 4 and 5. Methods with an nddr value
of 1 also require the individual response and concentration values from level 3.
Methods requiring data from level 3 can greatly increase the processing time.
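As a minimal sketch of how the nddr field might be used to anticipate processing cost, the snippet below rebuilds the method assignments listed for assay endpoint 1 as a plain data frame and counts the methods that would need level 3 data. The object names mthds and needs_l3 are hypothetical; in practice the data frame would come from tcplLoadMthd().

```r
# Toy stand-in for the output of tcplLoadMthd(lvl = 6, id = 1, type = "mc").
# Values are copied from the method assignment example in this section.
mthds <- data.frame(
  aeid    = 1L,
  mthd    = c("singlept.hit.high", "singlept.hit.mid", "multipoint.neg",
              "noise", "border.hit", "border.miss", "gnls.lowconc",
              "overfit.hit"),
  mthd_id = c(6L, 7L, 8L, 10L, 11L, 12L, 15L, 16L),
  nddr    = 0L,
  stringsAsFactors = FALSE
)

# Methods with nddr == 1 also need the level 3 concentration-response values,
# so they are the ones likely to dominate the level 6 run time.
needs_l3   <- mthds$mthd[mthds$nddr == 1L]
n_needs_l3 <- length(needs_l3)
```

With the assignments used in this example, none of the methods require level 3 data.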
	 R Input 	
> ## Do level 6 processing
> mc6_res <- tcplRun(id = 1:2, slvl = 6, elvl = 6, type = "mc")
	 R Output 	
Loaded L5 AEID1 (524 rows; 0.04 secs)
Processed L6 AEID1 (524 rows; 2.5 secs)
Loaded L5 AEID2 (524 rows; 0.07 secs)
Processed L6 AEID2 (524 rows; 5.7 secs)
Writing level 6 data for 2 ids...
Completed delete cascade for 2 ids (0.01 secs)
Writing level 6 complete. (0.04 secs)
Total processing time: 0.14 mins
	 R Input 	
> m6dat <- tcplLoadData(lvl = 6, type = "mc")
For the two assay endpoints, 268 out of the 1048 concentration series were
flagged in the level 6 processing. Series not flagged in the level 6 processing
are not stored at level 6. Each series-flag combination is a separate entry in
the level 6 data; in other words, a series with multiple flags appears on
multiple rows in the output. For example, consider the following results:
	 R Input 	
> m6dat[m4id == 46]
	 R Output 	
   aeid m6id m4id m5id      spid mc6_mthd_id
1:    1    5   46   46 01140354A           8
2:    1  110   46   46 01140354A          12
                                       flag fval fval_unit
1: Multiple points above baseline, inactive   NA        NA
2:                      Borderline inactive   NA        NA
The data above lists two flags: "Multiple points above baseline, inactive" and
"Borderline inactive." Without knowing much about the flags one might assume
this concentration series had some evidence of activity but was not called a hit,
and could potentially be a false negative. In cases of borderline results, plotting
the curve is often helpful.
	 R Input 	
> tcplPlotM4ID(m4id = 46, lvl = 6)
[Figure 8 plot: AEID1 (CToxCORTup), 2-Aminoanthraquinone (CHID 178, CASRN 117-79-3), SPID 01140354A, M4ID 46; x-axis: Concentration (μM); flags: 8; 12.]
Figure 8: An example level 6 plot for a single concentration series. All level 6
method ID (mthd_id) values are concatenated in the flags section. If flags
have an associated value (fval), the value will be shown in parentheses to the
right of the level 6 method ID.
The evidence of true activity shown in Figure 8 could be argued either way.
Level 6 processing does not attempt to define truth in the matter of borderline
compounds or data anomalies, but rather attempts to identify concentration
series for closer consideration.
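Because each series-flag combination is stored on its own row, it can be convenient to collapse the level 6 data to one row per series before reviewing it. The following is a minimal base R sketch under that assumption; m6toy is a hypothetical stand-in holding the two rows shown earlier for m4id 46, and the collapsing step is an illustration rather than a function provided by the package.

```r
# Toy stand-in for m6dat[m4id == 46]; values copied from the output shown earlier.
m6toy <- data.frame(
  m4id        = c(46L, 46L),
  mc6_mthd_id = c(8L, 12L),
  flag        = c("Multiple points above baseline, inactive",
                  "Borderline inactive"),
  stringsAsFactors = FALSE
)

# Collapse to one row per series: concatenate the method IDs (as in the
# flags section of the level 6 plot) and count the flags.
flag_ids <- tapply(m6toy$mc6_mthd_id, m6toy$m4id, paste, collapse = "; ")
flag_n   <- tapply(m6toy$flag, m6toy$m4id, length)

flag_summary <- data.frame(
  m4id    = as.integer(names(flag_ids)),
  flags   = as.vector(flag_ids),
  n_flags = as.vector(flag_n),
  stringsAsFactors = FALSE
)
```

Sorting such a summary by n_flags is one possible way to prioritize series for manual inspection.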
A Field Explanation/Database Structure
This appendix contains reference tables that describe the structure and table
fields found in the tcpl database. The first sections of this appendix describe
the data-containing tables, followed by a section describing the additional
annotation tables.
In general, the single-concentration data and accompanying methods are
found in the "sc#" tables, where the number indicates the processing level.
Likewise, the multiple-concentration data and accompanying methods are found
in the "mc#" tables. Each processing level that has accompanying methods will
also have tables with the "_methods" and "_id" naming scheme. For example,
the database contains the following tables: "mc5" storing the data from
multiple-concentration level 5 processing, "mc5_methods" storing the available
level 5 methods, and "mc5_aeid" storing the method assignments for level 5.
Note, the table storing the method assignments for level 2
multiple-concentration processing is called "mc2_acid" because MC2 methods are
assigned by assay component ID.
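The naming scheme can be written down as a small helper. This is only a sketch: tcpl_table_names is a hypothetical function, not part of the tcpl package, and it assumes that every level other than MC2 assigns methods by assay endpoint ID. It also does not check whether a given level actually has methods.

```r
# Hypothetical helper (not part of tcpl): derive table names from the
# naming scheme described above.
tcpl_table_names <- function(type = c("mc", "sc"), lvl) {
  type <- match.arg(type)
  base <- paste0(type, lvl)
  # MC2 methods are assigned by assay component (acid). Assumption: all
  # other levels are assigned by assay endpoint (aeid).
  id_fld <- if (type == "mc" && lvl == 2) "acid" else "aeid"
  c(data        = base,
    methods     = paste0(base, "_methods"),
    assignments = paste0(base, "_", id_fld))
}

mc5_tbls <- tcpl_table_names("mc", 5)
mc2_tbls <- tcpl_table_names("mc", 2)
```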
There are two additional tables, "sc2_agg" and "mc4_agg," that link the data
in tables "sc2" and "mc4" to the data in tables "sc1" and "mc3," respectively. This
is necessary because each entry in the database before SC2 and MC4 processing
represents a single value; subsequent entries represent summary/modeled values
that encompass many values. To know what values were used in calculating the
summary/modeled values, the user must use the "_agg" look-up tables.
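The look-up through an "_agg" table amounts to a join. The sketch below illustrates the idea with toy data frames that carry only a few of the real fields; the values are invented for illustration, and with a live database the tables would be loaded through the package rather than typed in.

```r
# Toy tables with invented values; the real tables hold many more fields.
mc3 <- data.frame(m3id = 1:6,
                  logc = rep(c(-1, 0, 1), times = 2),
                  resp = c(0.1, 0.4, 0.9, 0.0, 0.1, 0.2))
mc4_agg <- data.frame(m4id = rep(c(10L, 11L), each = 3),
                      m3id = 1:6)
mc4 <- data.frame(m4id = c(10L, 11L),
                  max_med = c(0.9, 0.2))

# Recover the individual level 3 values behind level 4 series 10 by
# joining through the aggregation table.
pts <- merge(mc4_agg[mc4_agg$m4id == 10L, ], mc3, by = "m3id")
pts <- pts[order(pts$logc), ]
```

Here each concentration has a single replicate, so the largest response in pts equals the max_med stored for series 10.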
Each of the methods tables has fields analogous to mc5_mthd_id, mc5_mthd,
and desc. These fields represent the unique key for the method, the abbreviated
method name (used to call the method from the corresponding mc5_mthds
function), and a brief description of the method, respectively. The
"mc6_methods" table also includes an nddr field. More information about nddr
is available in the discussion of multiple-concentration level 6 processing.
The method assignment tables will have fields analogous to mc5_mthd_id
matching the method ID from the methods tables, an assay component or assay
endpoint ID, and possibly an exec_ordr field indicating the order in which to
execute the methods.
The method and method assignment tables will not be listed in the tables
below to reduce redundancy.
Many of the tables also include the created_date, modified_date, and modified_by
fields that store information helpful for tracking changes to the data. These fields
will not be discussed further or included in the tables below.
Many of the tables specific to the assay annotation are not utilized by the
tcpl package. The full complexity of the assay annotation used by the ToxCast
program is beyond the scope of this vignette and the tcpl package. More
information about the ToxCast assay annotation can be found at:
<http://epa.gov/ncct/toxcast/data.html>.
Single-concentration data-containing tables
Table 5: Fields in sc0 table.
Field  Description
s0id   Level 0 ID
acid   Assay component ID
spid   Sample ID
cpid   Chemical plate ID
apid   Assay plate ID
rowi   Assay plate row index
coli   Assay plate column index
wllt   Well type†
wllq   1 if the well quality was good, else 0
conc   Concentration in micromolar
rval   Raw assay component value/readout from vendor
sref   Filename of the source file containing the data
† Information about the different well types is available in Appendix B.

Table 6: Fields in sc1 table.
Field  Description
s1id   Level 1 ID
s0id   Level 0 ID
acid   Assay component ID
aeid   Assay component endpoint ID
logc   Log base 10 concentration
bval   Baseline value
pval   Positive control value
resp   Normalized response value

Table 7: Fields in sc2_agg table.
Field  Description
aeid   Assay component endpoint ID
s0id   Level 0 ID
s1id   Level 1 ID
s2id   Level 2 ID

Table 8: Fields in sc2 table.
Field    Description
s2id     Level 2 ID
aeid     Assay component endpoint ID
spid     Sample ID
bmad     Baseline median absolute deviation
max_med  Maximum median response value
hitc     Hit-/activity-call, 1 if active, 0 if inactive
coff     Efficacy cutoff value
tmpi     Ignore, temporary index used for uploading purposes
Multiple-concentration data-containing tables
The "mc0" table, other than containing m0id rather than s0id, is identical to
the "sc0" table described in the section above.
Table 9: Fields in mc1 table.
Field  Description
m1id   Level 1 ID
m0id   Level 0 ID
acid   Assay component ID
cndx   Concentration index
repi   Replicate index

Table 10: Fields in mc2 table.
Field  Description
m2id   Level 2 ID
m0id   Level 0 ID
acid   Assay component ID
m1id   Level 1 ID
cval   Corrected value

Table 11: Fields in mc3 table.
Field  Description
m3id   Level 3 ID
aeid   Assay endpoint ID
m0id   Level 0 ID
acid   Assay component ID
m1id   Level 1 ID
m2id   Level 2 ID
bval   Baseline value
pval   Positive control value
logc   Log base 10 concentration
resp   Normalized response value

Table 12: Fields in mc4_agg table.
Field  Description
aeid   Assay endpoint ID
m0id   Level 0 ID
m1id   Level 1 ID
m2id   Level 2 ID
m3id   Level 3 ID
m4id   Level 4 ID

Table 13: Fields in mc4 table (Part 1).
Field          Description
m4id           Level 4 ID
aeid           Assay endpoint ID
spid           Sample ID
bmad           Baseline median absolute deviation
resp_max       Maximum response value
resp_min       Minimum response value
max_mean       Maximum mean response value
max_mean_conc  Log concentration at max_mean
max_med        Maximum median response value
max_med_conc   Log concentration at max_med
logc_max       Maximum log concentration tested
logc_min       Minimum log concentration tested
cnst           1 if the constant model converged, 0 if it failed to converge, N/A
               if series had less than four concentrations
hill           1 if the Hill model converged, 0 if it failed to converge, N/A if
               series had less than four concentrations or if max_med < 3bmad
hcov           1 if the Hill model Hessian matrix could be inverted, else 0
gnls           1 if the gain-loss model converged, 0 if it failed to converge, N/A
               if series had less than four concentrations or if max_med < 3bmad
gcov           1 if the gain-loss model Hessian matrix could be inverted, else 0
cnst_er        Scale term for the constant model
cnst_aic       AIC for the constant model
cnst_rmse      RMSE for the constant model
cnst_prob      Probability the constant model is the true model
hill_tp        Top asymptote for the Hill model
hill_tp_sd     Standard deviation for hill_tp
hill_ga        AC50 for the Hill model
hill_ga_sd     Standard deviation for hill_ga

Table 14: Fields in mc4 table (Part 2).
Field       Description
hill_gw     Hill coefficient
hill_gw_sd  Standard deviation for hill_gw
hill_er     Scale term for the Hill model
hill_er_sd  Standard deviation for hill_er
hill_aic    AIC for the Hill model
hill_rmse   RMSE for the Hill model
hill_prob   Probability the Hill model is the true model
gnls_tp     Top asymptote for the gain-loss model
gnls_tp_sd  Standard deviation for gnls_tp
gnls_ga     AC50 in the gain direction for the gain-loss model
gnls_ga_sd  Standard deviation for gnls_ga
gnls_gw     Hill coefficient in the gain direction
gnls_gw_sd  Standard deviation for gnls_gw
gnls_la     AC50 in the loss direction for the gain-loss model
gnls_la_sd  Standard deviation for gnls_la
gnls_lw     Hill coefficient in the loss direction
gnls_lw_sd  Standard deviation for gnls_lw
gnls_er     Scale term for the gain-loss model
gnls_er_sd  Standard deviation for gnls_er
gnls_aic    AIC for the gain-loss model
gnls_rmse   RMSE for the gain-loss model
gnls_prob   Probability the gain-loss model is the true model
nconc       Number of concentrations tested
npts        Number of points in the concentration series
nrep        Number of replicates in the concentration series
nmed_gtbl   Number of median values greater than 3 bmad
tmpi        Ignore, temporary index used for uploading purposes

Table 15: Fields in mc5 table.
Field      Description
m5id       Level 5 ID
m4id       Level 4 ID
aeid       Assay endpoint ID
modl       Winning model: "cnst", "hill", or "gnls"
hitc       Hit-/activity-call, 1 if active, 0 if inactive, -1 if cannot determine
fitc       Fit category
coff       Efficacy cutoff value
actp       Activity probability (1 - cnst_prob)
modl_er    Scale term for the winning model
modl_tp    Top asymptote for the winning model
modl_ga    Gain AC50 for the winning model
modl_gw    Gain Hill coefficient for the winning model
modl_la    Loss AC50 for the winning model
modl_lw    Loss Hill coefficient for the winning model
modl_prob  Probability for the winning model
modl_rmse  RMSE for the winning model
modl_acc   Activity concentration at cutoff for the winning model
modl_acb   Activity concentration at baseline for the winning model
modl_ac10  AC10 for the winning model

Table 16: Fields in mc6 table.
Field       Description
m6id        Level 6 ID
m5id        Level 5 ID
m4id        Level 4 ID
aeid        Assay endpoint ID
m6_mthd_id  Level 6 method ID
flag        Text output for the level 6 method
fval        Value from the flag method, if applicable
fval_unit   Units for fval, if applicable
Auxiliary annotation tables
As mentioned in the introduction to this appendix, a full description of the assay
annotation is beyond the scope of this vignette. The fields pertinent to the tcpl
package are listed in the tables below.
Table 17: List of annotation tables.
Table Name                Description
assay                     Assay-level annotation
assay_component           Assay component-level annotation
assay_component_endpoint  Assay endpoint-level annotation
assay_component_map       Assay component source names and their corresponding assay
                          component ids
assay_reagent*            Assay reagent information
assay_reference*          Map of citations to assay
assay_source              Assay source-level annotation
chemical                  List of chemicals and associated identifiers
chemical_library          Map of chemicals to different chemical libraries
citations*                List of citations
gene                      Gene identifiers and descriptions
intended_target           Intended assay target at the assay endpoint level
mc5_fit_categories        The level 5 fit categories
organism*                 Organism identifiers and descriptions
sample                    Sample ID information and chemical ID mapping
technological_target*     Technological assay target at the assay component level
* indicates tables not currently used by the tcpl package

Table 18: Fields in assay.
Field            Description
aid              Assay ID
asid             Assay source ID
assay_name       Assay name (abbreviated "anm" within the package)
assay_desc       Assay description
timepoint_hr     Treatment duration in hours
assay_footprint  Microtiter plate size†
† Discussed further in the "Register and Upload New Data" section.
Table 19: Fields in assay_component.
Field                 Description
acid                  Assay component ID
aid                   Assay ID
assay_component_name  Assay component name (abbreviated "acnm" within the package)
assay_component_desc  Assay component description

Table 20: Fields in assay_source.
Field                     Description
asid                      Assay source ID
assay_source_name         Assay source name (typically an abbreviation of the
                          assay_source_long_name, abbreviated "asnm" within the package)
assay_source_long_name    The full assay source name
assay_source_description  Assay source description
Table 21: Fields in assay_component_endpoint.
Field                          Description
aeid                           Assay component endpoint ID
acid                           Assay component ID
assay_component_endpoint_name  Assay component endpoint name (abbreviated "aenm" within the
                               package)
assay_component_endpoint_desc  Assay component endpoint description
export_ready                   0 or 1, used to flag data as "done"
normalized_data_type           The units of the normalized data†
burst_assay                    0 or 1, 1 indicates the assay results should be used in calculating
                               the burst z-score
fit_all                        0 or 1, 1 indicates all results should be fit, regardless of whether
                               the max_med surpasses 3bmad
† Discussed further in the "Register and Upload New Data" section.
Table 22: Fields in assay_component_map table.
Field  Description
acid   Assay component ID
acsn   Assay component source name

Table 23: Fields in chemical.
Field  Description
chid   Chemical ID†
casn   CAS Registry Number
chnm   Chemical name
† This is the DSSTox GSID within the ToxCast data, but can be any integer
and will be auto-generated (if not explicitly defined) for newly registered
chemicals.
Table 24: Fields in chemical_library.
Field  Description
chid   Chemical ID
clib   Chemical library

Table 25: Fields in gene.
Field        Description
gene_id      Gene ID
gene_symbol  Gene symbol

Table 26: Fields in intended_target.
Field      Description
aeid       Assay endpoint ID
target_id  Target ID
source     The table to look-up the target ID, currently only supports "gene"
The "intended_target" and "gene" tables are listed above because the
tcplLoadAeidInfo function utilizes these tables. Currently, the tcplRegister
function does not have support for registering new entries in either the
"intended_target" or "gene" tables. The added complexity with the intermediate
"intended_target" table could hypothetically allow users to map non-gene
targets (e.g. a protein or cell process), but that complexity is not fully built
into the database or the tcpl package. The tcplLoadAeidInfo function assumes
all targets (target_id) listed in "intended_target" map to genes (gene_id).
Table 27: Fields in mc5_fit_categories table.
Field        Description
fitc         Fit category
parent_fitc  Parent fit category
name         Fit category name
xloc         x-axis location for plotting purposes
yloc         y-axis location for plotting purposes

Table 28: Fields in sample.
Field             Description
spid              Sample ID
chid              Chemical ID
stkc              Stock concentration
stkc_unit         Stock concentration unit
tested_conc_unit  The concentration unit for the concentration values in the
                  data-containing tables
spid_legacy       A place-holder for previous sample ID strings
The stock concentration fields in the "sample" table allow the user to track
the original concentration when the neat sample is solubilized in vehicle before
any serial dilutions for testing purposes.
B Level 0 Pre-processing
Level 0 pre-processing can be done on virtually any high-throughput/high-
content screening application. In the ToxCast program, level 0 processing is
done in R by vendor/dataset-specific scripts. The individual R scripts act as
the "laboratory notebook" for the data, with all pre-processing decisions clearly
commented and explained.
Level 0 pre-processing has to reformat the raw data into the standard format
for the pipeline, and also can make manual changes to the data. All manual
changes to the data should be very well documented with justification. Common
examples of manual changes include fixing a sample ID typo, or changing well
quality value(s) to 0 after finding obvious problems like a plate row/column
missing an assay reagent.
Each row in the level 0 pre-processing data represents one well-assay com-
ponent combination, containing 11 fields (Table 29). The only field in level 0
pre-processing not stored at level 0 is the assay component source name (acsn).
The assay component source name should be some concatenation of data from
the assay source file that identifies the unique assay components. When the
data are loaded into the database, the assay component source name is mapped
to assay component ID through the assay_component_map table in the tcpl
database. Assay components can have multiple assay component source names,
but each assay component source name can only map to a single assay compo-
nent.
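The many-to-one relationship between assay component source names and assay component IDs can be illustrated with a toy map. The acsn strings below are invented for illustration, and the lookup is a base R sketch of the idea rather than the code the package uses when writing level 0 data.

```r
# Toy assay_component_map: two source names map to component 1 and one
# source name maps to component 2. The acsn strings are hypothetical.
acmap <- data.frame(acid = c(1L, 1L, 2L),
                    acsn = c("plateA_ch1", "plateB_ch1", "plateA_ch2"),
                    stringsAsFactors = FALSE)

# Each acsn may appear only once; otherwise the lookup would be ambiguous.
acsn_is_unique <- anyDuplicated(acmap$acsn) == 0

# Map the acsn values found in pre-processed data to acid values.
l0_acsn <- c("plateB_ch1", "plateA_ch2", "plateA_ch1")
l0_acid <- acmap$acid[match(l0_acsn, acmap$acsn)]
```

Any acsn missing from the map would come back as NA from match(), which is a convenient signal that the assay component still needs to be registered.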
The well type field is used in the processing to differentiate controls from test
compounds in numerous applications, including normalization and definition of
the assay noise level. Currently, the tcpl package includes the eight well types in
Table 30. Package users are encouraged to suggest new well types and methods
to better accommodate their data.
The final step in level 0 pre-processing is loading the data into the tcpl
database. The tcpl package includes the tcplWriteLvl0 function to load data
into the database. The tcplWriteLvl0 function maps the assay component
source name to the appropriate assay component ID, checks each field for the
correct class, and checks the database for the sample IDs with well type "t." Each
test compound sample ID must be included in the tcpl database before loading
data. The tcplWriteLvl0 function also checks each test compound for
concentration values.

Table 29: Required fields in level 0 pre-processing.
Field  Description                                      N/A
acsn   Assay component source name                      No
spid   Sample ID                                        No
cpid   Chemical plate ID                                Yes
apid   Assay plate ID                                   Yes
rowi   Assay plate row index, as an integer             Yes
coli   Assay plate column index, as an integer          Yes
wllt   Well type                                        No
wllq   1 if the well quality was good, else 0           No
conc   Concentration in micromolar                      No†
rval   Raw assay component value/readout from vendor    Yes‡
sref   Filename of the source file containing the data  No
The N/A column indicates whether the field can be N/A in the pre-processed data.
† Concentration can be N/A for control values only tested at a single
concentration. Concentration cannot be N/A for any test compound (well type of
"t") data.
‡ If the raw value is N/A, well quality (wllq) has to be 0.
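The N/A rules for the level 0 fields can be checked before attempting to load data. The function below is a hypothetical base R sketch, not part of the tcpl package. It reads the final footnote as referring to well quality (wllq), since well types are letters rather than 0/1 values, and the sample IDs, plate ID, and file name in the toy data are invented.

```r
# Hypothetical checker (not part of tcpl) for the level 0 N/A rules.
check_lvl0 <- function(dat) {
  req <- c("acsn", "spid", "cpid", "apid", "rowi", "coli",
           "wllt", "wllq", "conc", "rval", "sref")
  missing_flds <- setdiff(req, names(dat))
  if (length(missing_flds) > 0) {
    return(paste("missing field:", missing_flds))
  }
  problems <- character(0)
  # Fields that may never be N/A.
  for (f in c("acsn", "spid", "wllt", "wllq", "sref")) {
    if (anyNA(dat[[f]])) problems <- c(problems, paste(f, "contains N/A"))
  }
  # Test compounds (well type "t") always need a concentration.
  if (any(dat$wllt == "t" & is.na(dat$conc), na.rm = TRUE)) {
    problems <- c(problems, "conc is N/A for a test compound")
  }
  # Assumption: a missing raw value is acceptable only when wllq is 0.
  if (any(is.na(dat$rval) & dat$wllq != 0, na.rm = TRUE)) {
    problems <- c(problems, "rval is N/A where wllq is not 0")
  }
  problems
}

# Toy level 0 data: a test well, a neutral control, and a failed test well.
l0 <- data.frame(
  acsn = "plateA_ch1", spid = c("01140352A", "DMSO", "01140354A"),
  cpid = NA_character_, apid = "ap01", rowi = 1L, coli = 1:3,
  wllt = c("t", "n", "t"), wllq = c(1L, 1L, 0L),
  conc = c(10, NA, 5), rval = c(1523, 210, NA),
  sref = "vendor_file.csv", stringsAsFactors = FALSE
)
ok <- check_lvl0(l0)

# Removing the concentration from a test compound should be reported.
bad <- l0
bad$conc[1] <- NA
bad_msgs <- check_lvl0(bad)
```

Returning a character vector of problems, rather than stopping at the first one, keeps all issues visible in the vendor-specific script that serves as the laboratory notebook.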
Table 30: Well types
Well Type  Description
t          Test compound
c          Gain-of-signal control in multiple concentrations
p          Gain-of-signal control in single concentration
n          Neutral/negative control
m          Loss-of-signal control in multiple concentrations
o          Loss-of-signal control in single concentration
b          Blank well
v          Viability control

C Burst Z-Score Calculation
The tcplVarMat function creates chemical-by-assay matrices for the level 4 and
level 5 data. When multiple sample-assay series exist for one chemical, a single
series is selected by the tcplSubsetChid function. See ?tcplSubsetChid for
more information.
The var parameter for tcplVarMat can accept any of the level 4 or level
5 fields/variables, or one of two special variables. The first special variable,
"tested", returns 0 or 1, where 1 indicates the chemical-assay pair was tested in
either multiple-concentration format or single-concentration format. Chemical-
assay pairs not tested in the multiple-concentration format will be N/A in the
hit-call matrix. The second special variable, "zscore," returns a z-score based
on the distribution of burst assays.
The burst assay endpoints are defined by the "burst_assay" field in the
assay_component_endpoint table, where 1 indicates the assay endpoint is used
in the burst distribution calculation. The example dataset is limited, so a
good illustrative example is beyond the scope of this vignette. Assay
endpoints labeled as burst assays can be identified by running
tcplLoadAeid(fld = "burst_assay", val = 1). Conversely, non-burst assays can
be identified by running the same code with "val" equal to 0 rather than 1.
For each chemical, the burst distribution is defined by the median and MAD
of the AC50 (modl_ga) values for the burst endpoints where the hit-call was 1
(active). Once the burst distribution is defined for each chemical, the global
burst MAD is defined as the median of all MAD values for chemicals with
greater than 1 active burst endpoint. The burst median for chemicals with fewer
than two active burst endpoints is set to 3 (in log base 10 micromolar units,
3 is equivalent to 1 millimolar). The burst z-score is calculated for each
AC50 value as
$$\mathrm{zscore} = -\frac{\mathrm{modl\_ga} - \mathrm{cyto\_pt}}{\mathrm{global\_mad}}, \tag{20}$$
where cyto_pt is the burst median. All of the values to define the burst
distribution are also returned by the tcplVarMat function when var is "zscore."
The burst z-score values are multiplied by -1 (the leading minus sign in
equation 20) to make values that are more potent relative to the burst
distribution a higher positive z-score.
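The calculation can be traced on a small invented dataset. The sketch below follows the description above in base R; it is not the implementation used by tcplVarMat. It assumes R's mad() with its default scaling constant (1.4826), which the text does not specify, and all values in the toy burst data frame are hypothetical.

```r
# Toy level 5 results for burst endpoints only; all values are invented.
burst <- data.frame(
  chid    = c(1, 1, 1, 2, 2, 3),
  hitc    = c(1, 1, 1, 1, 1, 1),
  modl_ga = c(1.0, 1.5, 2.0, 0.5, 1.5, 2.5)
)
act <- burst[burst$hitc == 1, ]

# Per-chemical burst median, MAD, and number of active burst endpoints.
ga_by_chem <- split(act$modl_ga, act$chid)
med <- vapply(ga_by_chem, median, numeric(1))
md  <- vapply(ga_by_chem, mad, numeric(1))  # assumption: default mad() scaling
n   <- lengths(ga_by_chem)

# Global MAD: median of the MADs for chemicals with more than one active
# burst endpoint.
global_mad <- median(md[n >= 2])

# Chemicals with fewer than two active burst endpoints get a burst median of 3.
cyto_pt <- ifelse(n >= 2, med, 3)

# The leading minus sign makes more potent values a higher positive z-score.
burst_zscore <- function(modl_ga, chid) {
  -(modl_ga - cyto_pt[as.character(chid)]) / global_mad
}

# An AC50 one log unit below the burst median of chemical 1.
z <- burst_zscore(modl_ga = 0.5, chid = 1)
```

In this toy example the AC50 is more potent than the burst median for chemical 1, so the z-score comes out positive.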