Toxicity Reference Database Version 2.1 User Guide


United States Environmental Protection Agency
Office of Research and Development
Center for Computational Toxicology and Exposure

EPA601B22001 | August 2022 | www.epa.gov/research

-------
EPA Report Number 601B22001
August 2022

Toxicity Reference Database
Version 2.1
User Guide

by

Madison Feshuk, Sean Watford, Lori Kolaczkowski,

Katie Paul Friedman
US Environmental Protection Agency
Office of Research and Development
Center for Computational Toxicology and Exposure
Research Triangle Park, North Carolina

-------
Purpose

This document describes how to access and use the Toxicity Reference Database (ToxRefDB)
version 2.1. The latest data can be accessed through EPA's Clowder site
(https://clowder.edap-cluster.com/datasets/6 7 747fefe4b0856fdc65639b#folderld=62c5cfebe4b0 Id27e3b2d85 7).
More information about ToxRefDB version 2.0 and its development can be found in
the publications below.

Watford, S., Pham, L.L., Wignall, J., Shin, R., Martin, M.T., and Friedman, K.P. (2019).
ToxRefDB version 2.0: Improved utility for predictive and retrospective toxicology
analyses. Reproductive Toxicology, 89, 145-158. DOI: 10.1016/j.reprotox.2019.07.012

Pham, L.L., Watford, S., Friedman, K.P., Wignall, J.A., and Shapiro, A.J. (2019).
Python BMDS: A Python interface library and web application for the canonical EPA
dose-response modeling software. Reproductive Toxicology. DOI:
10.1016/j.reprotox.2019.07.013

This user guide does not necessarily reflect U.S. EPA policy.


-------
Abstract

ToxRefDB contains in vivo study data from over 5900 guideline or guideline-like studies for
over 1100 chemicals. The database consists largely of curated animal study data from repeat
dose studies conducted according to Health Effects Series 870 guidelines. Many of these
studies (over 3,000) come from registrant-submitted toxicity studies, known as data
evaluation records (DERs), from the U.S. EPA's Office of Pesticide Programs (OPP). By
employing a controlled vocabulary for enhanced data quality, ToxRefDB serves as a resource
for study design, quantitative dose response, and endpoint testing status information given
guideline specifications. The database can aid in the validation of in vitro high throughput
screening of chemicals and serve as a resource for retrospective and predictive toxicology
applications.


-------
Table of Contents

Purpose
Abstract
Overview
Summary of v2.1 Update
Table 1: v2.1 Summary Statistics
Figure 1: Study-Level Data Landscape
Figure 2: Chemical-Level Data Landscape
Changes between v2.0 and v2.1
Table 2: Changes between v2.0 and v2.1
Accessing information in ToxRefDB
Installing MySQL and loading ToxRefDB
Example queries using MySQL
Programmatic Access
Python
R
Database Structure
Figure 3: ToxRefDB v2.1 ERD
Figure 4: Schema Overview
Data Curation Process
Figure 5: Data Extraction and Review Workflow
Quality Assurance in Data Extraction
Efforts to Reduce Error Rate
Unit Standardization
Study Reliability with ToxRTool
Table 3: ToxRTool Guideline Adherence Score
Guideline Profiles
Table 4: Guideline Profile Coverage
Endpoint Terminology
Figure 6: Hierarchical endpoint terminology example
Ontology mappings
Figure 7: Cross-referenced Terminology Sources
Negative Endpoints and Effects
Table 5: Endpoint Observation Status
Figure 8: Decision tree for identification of negative endpoints and effects
Figure 9: Example Observation Status Interpretation
Ongoing Work
Data Dictionary

-------
Overview

The Toxicity Reference Database (ToxRefDB) serves as a resource for structured animal
toxicity data for many retrospective and predictive toxicology applications. ToxRefDB
contains in vivo study data from over 5900 guideline or guideline-like human health relevant
studies for over 1100 chemicals.

The study types covered in ToxRefDB include the following repeat dose study designs utilizing
various administration routes (predominantly oral): chronic (CHR; 1-2 year exposures
depending on species and study design) conducted predominantly in rats, mice, and dogs;
subchronic (SUB; 90 day exposures) conducted predominantly in rats, mice, and dogs;
subacute (SAC; 14-28 day exposures depending on the source and guideline) conducted
predominantly in rats, mice, and dogs; prenatal developmental (DEV) conducted
predominantly in rats and rabbits; multigeneration reproductive toxicity studies (MGR)
conducted predominantly in rats; reproductive (REP) toxicity studies conducted largely in rats;
developmental neurotoxicity (DNT) studies conducted predominantly in rats; and a small
number of studies with designs characterized as acute (ACU), neurological (NEU), or "other"
(OTH).

Many of the studies (over 3,000) come from registrant-submitted toxicity studies known as data
evaluation records (DERs) from the U.S. EPA's Office of Pesticide Programs (OPP). Since
2009, continued curation efforts have expanded ToxRefDB to include toxicity studies from ten
additional sources, including the National Toxicology Program (NTP), peer-reviewed primary
research articles (OpenLit), and pharmaceutical pre-clinical toxicity studies (Pfizer, Sanofi,
GSK, Merck), among others (RIVM, PMRA, unpublished and unassigned sources). Ninety percent
of the studies with completed curation (processed=1) correspond to pesticide actives and inerts.
Although most studies in the database correspond to pesticides, curation of other study
sources incorporated additional functional use types of chemicals.

ToxRefDB serves as a resource for study design, quantitative dose response, and endpoint
testing status information given guideline specifications from the US Environmental Protection
Agency (US EPA) and the National Toxicology Program (NTP) headquartered at the National
Institute of Environmental Health Sciences. The legacy and current data curation workflow is
described in more detail in later sections. An important component of ToxRefDB is its
controlled vocabulary for studies and effects observed for enhanced data quality.

The first version of ToxRefDB (ToxRefDB 1.0) was initially released as a series of
spreadsheets, which are still available on EPA's FTP site and referenced in FigShare
(https://doi.org/10.23645/epacomptox.6062545.v1). ToxRefDB underwent significant updates
that are described in the recent publication (Watford et al., 2019) and was released as
ToxRefDB v2.0. ToxRefDB v2.0 and associated summary files can be found
here: https://doi.org/10.23645/epacomptox.6062545.v3.

ToxRefDB v2.1 is a minor update of ToxRefDB v2.0 to correct issues discovered with the
compilation script that caused some extracted values to not import properly from AccessDB
curation files, such as failure to import some effects. The .sql export of ToxRefDB v2.1 is
available for public download here: https://doi.org/10.23645/epacomptox.6062545. Although
the overall number of studies and chemicals remains unchanged, the v2.1 update includes
additional data as previously curated studies with extracted dose treatment groups and effects
are now fully accessible. This added data can improve the utility of ToxRefDB as a resource
for curated legacy in vivo information by providing more complete information of the past
animal studies conducted. Moving forward, an application-driven workflow with the Data
Collection Tool (DCT) will be utilized to create a more sustainable process for loading curated
information to a database and support a more regular release cycle.

In addition to access via SQL downloads, ToxRefDB information is also
summarized with calculated point-of-departure values at the chemical and study level for
inclusion in the summary-level database, the Toxicity Value Database (ToxValDB), which is
accessible via the CompTox Chemicals Dashboard. This list aggregates chemicals associated
with curations in ToxRefDB v2.0: https://comptox.epa.gov/dashboard/chemical-lists/TOXREFDB2.
ToxRefDB v2.1 values will be incorporated in the next ToxValDB release.


-------
Summary of v2.1 Update

ToxRefDB v2.1 is a minor update to ToxRefDB v2.0 to correct issues discovered with the
compilation script that caused some extracted values to not import properly from AccessDB
curation files, such as failure to import some effects. ToxRefDB v2.1 contains summary
information from 5986 studies for 1143 chemicals.

For ToxRefDB v2.0, quantitative (i.e. dose-response) data was extracted. This curation was
completed for 3871 studies with plans to extract and release the remaining data in subsequent
data releases. No additional curation was performed for the v2.1 update. To provide the reader
with a summary of the scope and coverage of the database, ToxRefDB was filtered to present
only data where a full curation with guideline profile observations was complete. This is
achieved using a 'processed' flag set to 1 within the study table.
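
The filter described above can be sketched in Python. The example below is a minimal illustration, not part of the ToxRefDB tooling: it uses an in-memory SQLite table as a stand-in for the MySQL study table so that it runs without a server, and the rows are made-up values. Only the column names (study_id, chemical_id, study_type, study_source, species, processed) are taken from the ToxRefDB schema.

```python
import sqlite3

# In-memory stand-in for the ToxRefDB `study` table (toy rows, not real data).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE study (study_id INTEGER, chemical_id INTEGER, study_type TEXT, "
    "study_source TEXT, species TEXT, processed INTEGER)"
)
toy_rows = [
    (1, 10, "CHR", "NTP", "rat", 1),
    (2, 10, "CHR", "NTP", "rat", 1),
    (3, 11, "CHR", "NTP", "mouse", 1),
    (4, 12, "SUB", "OPP DER", "dog", 0),  # curation not complete; excluded below
]
con.executemany("INSERT INTO study VALUES (?, ?, ?, ?, ?, ?)", toy_rows)

# Count studies and distinct chemicals per type/source/species,
# keeping only studies flagged as fully curated (processed = 1).
summary = con.execute(
    """
    SELECT study_type, study_source, species,
           COUNT(DISTINCT study_id)    AS n_studies,
           COUNT(DISTINCT chemical_id) AS n_chemicals
    FROM study
    WHERE processed = 1
    GROUP BY study_type, study_source, species
    ORDER BY study_type, study_source, species
    """
).fetchall()

for row in summary:
    print(row)
```

The same SELECT statement should run unchanged against the MySQL database once a connection is available; only the connection setup differs.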

Table 1 is a summary table of the number of chemicals and number of studies for each study
source, study type, and species. Study type abbreviations are as follows: CHR = Chronic, DEV
= Prenatal-Developmental, MGR = Multigeneration Reproductive, SAC = Subacute, SUB =
Subchronic.

Table 1: v2.1 Summary Statistics

Study type    Study source    Species    Number of studies    Number of chemicals
CHR           NTP             mouse      178                  173
                              rat        169                  164
              OpenLit         mouse      4                    4
                              rat        5                    5
              OPP DER         dog        331                  298
                              hamster    4                    3
                              mouse      342                  303
                              primate    1                    1
                              rat        398                  328
Total CHR                                1432                 557
DEV           NTP             mouse      1                    1
                              rabbit     3                    3
                              rat        6                    6
              OpenLit         rat        1                    1
              OPP DER         mouse      18                   16
                              rabbit     431                  372
                              rat        508                  433
              Other           mouse      1                    1
                              rabbit     1                    1
                              rat        4                    4
Total DEV                                974                  486
MGR           OpenLit         rat        1                    1
              OPP DER         mouse      2                    2
                              rat        339                  310
              Other           rat        19                   19
Total MGR                                361                  331
SAC           NTP             mouse
                              rat
              OPP DER         dog
                              mouse
                              rabbit
                              rat
Total SAC
SUB           NTP             hamster
                              mouse      119                  107
                              rat        127                  114
              OpenLit         mouse
                              rat
              OPP DER         dog        214                  195
                              hamster
                              mouse      123                  112
                              primate
                              rabbit
                              rat        418                  335
Total SUB                                1020                 498
Database totals                          3871                 748


Figure 1 depicts a breakdown of studies by study source, study type, and species.
Figure 1: Study-Level Data Landscape

[Figure 1: bar charts of the number of studies by study type (CHR, SUB, DEV, MGR, SAC), shown in three panels: Type, Species, Source.]

-------
Figure 2 depicts a breakdown of chemicals by study source, study type, and species.
Figure 2: Chemical-Level Data Landscape

[Figure 2: bar charts of the number of chemicals by study type (CHR, SUB, DEV, MGR, SAC), shown in three panels: Type, Species, Source.]

-------
Changes between v2.0 and v2.1

The following table summarizes the differences between ToxRefDB v2.0 and v2.1.
ToxRefDB v2.1 is a minor update to recover thousands of extracted values that failed to import
properly from the original AccessDB curation files, as described in the Data Curation Process
section. Although the overall number of studies and chemicals remains unchanged, previously
curated data that were missing from v2.0 are now fully accessible in v2.1: 594 additional
studies with extracted effects, 5226 additional dose treatment groups with effects, and 21756
additional effects. This added data can improve the utility of ToxRefDB as a resource for
curated legacy in vivo information by providing more complete information on the animal
studies conducted.

Table 2: Changes between v2.0 and v2.1

Output                                                              v2.0      v2.1      Change
Total number of studies with complete curation                      3882      3871      -11
Number of studies with extracted effects                            3068      3662      594
Total number of chemicals                                                     748
Total database rows, including studies with no extracted effects    328623    344868    16245
Total effects extracted                                             313525    335281    21756
Dose treatment groups with effects                                  35679     40905     5226
Unique effects: Cholinesterase endpoint category                    5323      6008      685
Unique effects: Developmental endpoint category                     8502      9640      1138
Unique effects: Reproductive endpoint category                      4691      5775      1084
Unique effects: Systemic endpoint category                          284352    302674    18322
Unique critical effects: Cholinesterase endpoint category           713       796
Unique critical effects: Developmental endpoint category            1118      1276      158
Unique critical effects: Reproductive endpoint category             488       645       157
Unique critical effects: Systemic endpoint category                 18757     20989     2232
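
As a quick arithmetic check, the Change column in Table 2 is the v2.1 value minus the v2.0 value. The short sketch below recomputes it for several rows; the dictionary keys are shortened labels chosen here for illustration, and the numbers are copied from Table 2.

```python
# (v2.0, v2.1) pairs copied from Table 2; keys are shortened labels for illustration.
table2 = {
    "studies with complete curation": (3882, 3871),
    "studies with extracted effects": (3068, 3662),
    "total database rows": (328623, 344868),
    "total effects extracted": (313525, 335281),
    "dose treatment groups with effects": (35679, 40905),
}

# Change = v2.1 - v2.0
changes = {label: v21 - v20 for label, (v20, v21) in table2.items()}

for label, delta in changes.items():
    print(f"{label}: {delta:+d}")
```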

-------
Accessing information in ToxRefDB

A MySQL database export and summary files of ToxRefDB v2.1 are available for public
download at https://doi.org/10.23645/epacomptox.6062545. The summary spreadsheet contains
study and chemical-level information for reference. ToxRefDB information is also summarized
with calculated point-of-departure values at the chemical and study level for inclusion in the
summary-level database, the Toxicity Value Database (ToxValDB), which is accessible via the
CompTox Chemicals Dashboard. ToxRefDB v2.1 values will be incorporated in the next
ToxValDB release.

Below is documentation on how to install MySQL, load ToxRefDB, and access the data either
with SQL or programmatically with Python or R. Another useful tool for accessing the data is
MySQL Workbench, which provides a graphical user interface for interacting with any MySQL
database.

Installing MySQL and loading ToxRefDB

Steps to install MySQL and load ToxRefDB are detailed below. More comprehensive
documentation for using MySQL can be found online.

• Download the ToxRefDB MySQL database.

• Download the latest version of the MySQL community server.

• Select the appropriate installer for your operating system:

o For Windows, download the MSI installer
o For macOS, download the DMG installer; for Linux, use the package provided for your
distribution

• The installer will walk you through the installation. During the installation, be sure to
copy the temporary root password. You will need it later.

o For Windows, MySQL should automatically be added to your PATH
o For macOS and Linux, if MySQL was not added to your PATH automatically, you
will have to add it manually. Open the terminal and type:

echo 'export PATH=/usr/local/mysql/bin:$PATH' >> ~/.bash_profile

Open the command line (Windows) or terminal (macOS and Linux) and log in to the MySQL
server with the command:

mysql -u root -p

Enter the temporary root password when prompted for a password. Change the root
password following the instructions in the MySQL documentation.

Create the ToxRefDB database, select it as the default database, and load the dump file
as follows:

mysql> CREATE DATABASE IF NOT EXISTS toxrefdb_2_0;
mysql> USE toxrefdb_2_0;

mysql> source toxrefdb_2_0.sql

-------
Example queries using MySQL

Once the ToxRefDB instance is established, the user is ready to begin querying the database.
These example queries can be tailored for exploratory data analysis, specific research
questions based on the individual's use case, or risk assessment workflows.

# Get number of studies per study type
SELECT study_type, COUNT(study_id) FROM study
GROUP BY study_type;

# Get number of studies per study type and species
SELECT study_type, species, COUNT(study_id) FROM study
GROUP BY study_type, species;

# Get number of studies per source
SELECT study_source, COUNT(study_id) FROM study
GROUP BY study_source;

# Get all study information for chronic studies
SELECT * FROM study WHERE study_type="CHR";

# Get all treatment group and dosing information for a single chemical
SELECT * FROM chemical
INNER JOIN study ON chemical.chemical_id=study.chemical_id
INNER JOIN tg ON tg.study_id=study.study_id
INNER JOIN dose ON dose.study_id=study.study_id
INNER JOIN dtg ON dtg.tg_id=tg.tg_id AND dose.dose_id=dtg.dose_id
WHERE casrn="42509-80-8";

# Get number of studies per endpoint
SELECT endpoint_category, endpoint_type, endpoint_target,
COUNT(DISTINCT study.study_id) AS "number of studies" FROM study
INNER JOIN tg ON study.study_id=tg.study_id
INNER JOIN tg_effect ON tg.tg_id=tg_effect.tg_id
INNER JOIN effect ON effect.effect_id=tg_effect.effect_id
INNER JOIN endpoint ON endpoint.endpoint_id=effect.endpoint_id
GROUP BY endpoint_category, endpoint_type, endpoint_target;

# Get all study-level LELs and LOAELs for effect profile 2
SELECT * FROM pod WHERE effect_profile_id=2 AND study_id IS NOT NULL AND pod_type IN ("loael","lel");

# Get chemical-level PODs for effect profile 2
SELECT * FROM pod WHERE effect_profile_id=2 AND study_id IS NULL;

# Get study-level PODs for effect profile 2 and for a specific endpoint
SELECT DISTINCT pod.* FROM pod
INNER JOIN pod_tg_effect ON pod.pod_id=pod_tg_effect.pod_id
INNER JOIN tg_effect ON tg_effect.tg_effect_id=pod_tg_effect.tg_effect_id
INNER JOIN effect ON effect.effect_id=tg_effect.effect_id
INNER JOIN endpoint ON endpoint.endpoint_id=effect.endpoint_id
WHERE effect_profile_id=2 AND study_id IS NOT NULL
AND endpoint_target LIKE "thyroid%";

# Get all dose-response data for a study
SELECT * FROM chemical
INNER JOIN study ON study.chemical_id=chemical.chemical_id
INNER JOIN tg ON tg.study_id=study.study_id
INNER JOIN dose ON dose.study_id=study.study_id
INNER JOIN dtg ON dtg.tg_id=tg.tg_id AND dose.dose_id=dtg.dose_id
INNER JOIN tg_effect ON tg.tg_id=tg_effect.tg_id
INNER JOIN effect ON effect.effect_id=tg_effect.effect_id
INNER JOIN endpoint ON endpoint.endpoint_id=effect.endpoint_id
INNER JOIN dtg_effect ON tg_effect.tg_effect_id=dtg_effect.tg_effect_id AND dtg.dtg_id=dtg_effect.dtg_id
WHERE study.study_id=687;
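
When queries like these are run from a script, values such as a CASRN are safer passed as bound parameters than pasted into the SQL string. The sketch below is a minimal illustration under stated assumptions: it uses an in-memory SQLite stand-in with made-up rows so it can run without a MySQL server, the helper name studies_for_casrn is hypothetical, and the CASRN strings are placeholders rather than real identifiers. The placeholder marker is ? in sqlite3 and %s in MySQL drivers such as pymysql.

```python
import sqlite3


def studies_for_casrn(con, casrn):
    """Return (study_id, study_type, species) rows for one chemical, using a bound parameter."""
    query = """
        SELECT study.study_id, study.study_type, study.species
        FROM chemical
        INNER JOIN study ON chemical.chemical_id = study.chemical_id
        WHERE chemical.casrn = ?
        ORDER BY study.study_id
    """
    return con.execute(query, (casrn,)).fetchall()


# Toy stand-in tables; the CASRN values below are placeholders, not real identifiers.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chemical (chemical_id INTEGER, casrn TEXT, preferred_name TEXT)")
con.execute("CREATE TABLE study (study_id INTEGER, chemical_id INTEGER, study_type TEXT, species TEXT)")
con.executemany(
    "INSERT INTO chemical VALUES (?, ?, ?)",
    [(1, "00-00-0", "example chemical A"), (2, "11-11-1", "example chemical B")],
)
con.executemany(
    "INSERT INTO study VALUES (?, ?, ?, ?)",
    [(100, 1, "CHR", "rat"), (101, 1, "SUB", "dog"), (102, 2, "DEV", "rabbit")],
)

rows = studies_for_casrn(con, "00-00-0")
print(rows)
```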

Programmatic Access

You are not limited to SQL queries in MySQL Workbench to access ToxRefDB; the data can
also be accessed programmatically from several languages. Below are examples of loading
query results into data sets for further work in Python and R. In each case, you still
connect to the database through a language-specific connector.

Python

In the example below, the Python packages sqlalchemy, pandas, and pymysql are required.
You can, however, use any type of connector. Any SQL query can replace the one provided in
this example.

# Load libraries (pymysql is the driver used by SQLAlchemy;
# writing .xlsx files also requires an Excel engine such as openpyxl)
import sqlalchemy as sa
import pandas as pd

# Establish connection
username = ""
password = ""
host = ""
database = ""

engine = sa.create_engine(f"mysql+pymysql://{username}:{password}@{host}/{database}")

# Get guideline profiles
results = pd.read_sql(
    """
    SELECT guideline.guideline_id,
           guideline.guideline_number,
           guideline.name,
           guideline.profile_name,
           guideline.description,
           guideline_profile.guideline_profile_id,
           guideline_profile.obs_status,
           guideline_profile.description,
           endpoint.endpoint_id,
           endpoint.endpoint_category,
           endpoint.endpoint_type,
           endpoint.endpoint_target
    FROM guideline
    INNER JOIN guideline_profile ON guideline.guideline_id=guideline_profile.guideline_id
    INNER JOIN endpoint ON endpoint.endpoint_id=guideline_profile.endpoint_id
    """,
    engine,
)

# Export to Excel; the context manager saves and closes the file
with pd.ExcelWriter("guideline_profiles.xlsx") as writer:
    results.to_excel(writer, index=False, merge_cells=False)
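
One practical detail when building the connection string: characters such as @, :, or / in a user name or password must be percent-encoded, or the URL will be parsed incorrectly. The sketch below uses only the Python standard library; the function name build_mysql_url and the credentials shown are hypothetical placeholders, and the database name follows the one used in the installation steps.

```python
from urllib.parse import quote


def build_mysql_url(username, password, host, database):
    """Build a mysql+pymysql URL, percent-encoding the user name and password."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"mysql+pymysql://{user}:{pwd}@{host}/{database}"


# Placeholder credentials for illustration only.
url = build_mysql_url("root", "p@ss:word/1", "localhost", "toxrefdb_2_0")
print(url)
```

The resulting string can be passed to sa.create_engine in place of a hand-built f-string.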

R

In the example below, the R package RMySQL is required. Any SQL query can replace the one
provided in this example.

# Load library
library(RMySQL)

# Establish connection (RMySQL names the database argument "dbname")
con <- dbConnect(drv = RMySQL::MySQL(),
                 username = "",
                 password = "",
                 host = "",
                 dbname = "")

# Get all ToxRefDB information for subchronic studies
output <- dbGetQuery(con, "SELECT chemical.casrn,
  chemical.preferred_name,
  study.study_id,
  study.study_type,
  study.study_year,
  study.study_source,
  study.species,
  study.strain_group,
  study.admin_route,
  study.admin_method,
  endpoint.endpoint_category,
  endpoint.endpoint_type,
  endpoint.endpoint_target,
  endpoint.endpoint_id,
  tg_effect.life_stage,
  tg_effect.tg_effect_id,
  effect.effect_id,
  effect.effect_desc,
  tg.sex,
  tg.generation,
  dose.dose_level,
  dtg.dose_adjusted,
  dtg.dose_adjusted_unit,
  dtg_effect.treatment_related,
  dtg_effect.critical_effect,
  tested_status,
  reported_status FROM chemical
  INNER JOIN study ON chemical.chemical_id=study.chemical_id
  LEFT JOIN dose ON dose.study_id=study.study_id
  LEFT JOIN tg ON tg.study_id=study.study_id
  LEFT JOIN dtg ON tg.tg_id=dtg.tg_id AND dose.dose_id=dtg.dose_id
  LEFT JOIN tg_effect ON tg.tg_id=tg_effect.tg_id
  LEFT JOIN dtg_effect ON tg_effect.tg_effect_id=dtg_effect.tg_effect_id AND dtg.dtg_id=dtg_effect.dtg_id
  LEFT JOIN effect ON effect.effect_id=tg_effect.effect_id
  LEFT JOIN endpoint ON endpoint.endpoint_id=effect.endpoint_id
  LEFT JOIN obs ON obs.study_id=study.study_id AND obs.endpoint_id=endpoint.endpoint_id
  WHERE study_type='SUB'")

# Close the connection when finished
dbDisconnect(con)

-------
Database Structure

This entity-relationship diagram (ERD) can be used to understand the relationships between
tables. BMDExpress software (Pham et al., 2019) was not run to calculate benchmark dose
values for v2.1; therefore, BMD tables were dropped from the v2.1 schema.

Figure 3: ToxRefDB v2.1 ERD

Tables and columns shown in the ERD:

toxrefdb_dd: toxrefdb_table, toxrefdb_field, description

endpoint: endpoint_id, endpoint_category, endpoint_type, endpoint_target

chemical: chemical_id, dsstox_substance_id, casrn, preferred_name

guideline: guideline_id, guideline_number, name, profile_name, description

guideline_profile: guideline_profile_id, endpoint_id, guideline_id, obs_status, description

negative_endpoint: negative_endpoint_id, endpoint_id, study_id

negative_effect: negative_effect_id, study_id, endpoint_id, effect_id

unit_standardization: unit_standardization_id, original_unit, corrected_unit

study: study_id, chemical_id, study_source_id, study_citation, study_year, study_source,
study_type, study_type_guideline, species, strain_group, strain, admin_route, admin_method,
substance_source_name, substance_purity, substance_lot_batch, substance_comment, dose_start,
dose_start_unit, dose_end, dose_end_unit, study_comment, guideline_id, processed

study_clowder: study_clowder_id, study_id, filename, filetype, clowder_uid

study_toxrtool: study_toxrtool_id, toxrtool_id, study_id, score, toxrtool_comment, filename

toxrtool: toxrtool_id, criteria_group, criteria, question

dose: dose_id, study_id, dose_level, conc, conc_unit, vehicle, dose_comment

tg: tg_id, study_id, sex, generation, dose_period, dose_duration, dose_duration_unit, n,
tg_comment

dtg: dtg_id, dose_id, tg_id, dose_adjusted, dose_adjusted_unit, dtg_comment, mg_kg_day_value

effect: effect_id, endpoint_id, effect_desc, cancer_related

tg_effect: tg_effect_id, tg_id, life_stage, effect_desc_free, target_site, direction,
effect_comment, effect_id, no_quant_data_reported

dtg_effect: dtg_effect_id, tg_effect_id, dtg_id, treatment_related, critical_effect,
sample_size, effect_val, effect_val_unit, effect_var, effect_var_type, time, time_unit,
dtg_effect_comment

obs: obs_id, study_id, endpoint_id, status, default, tested_status, reported_status,
guideline_profile_id, obs_comment

ontology: ontology_id, ontology_name, uid, uid_type, label, description, uri

ontology_toxrefdb: ontology_toxrefdb_id, ontology_id, toxrefdb_id, toxrefdb_table,
toxrefdb_field

effect_profile: effect_profile_id, effect_profile_name, effect_profile_description

effect_profile_group: effect_profile_group_id, group_id, group_name, group_description,
effect_profile_id

effect_profile_group_toxrefdb: effect_profile_group_toxrefdb_id, group_id,
effect_profile_id, tg_effect_id

pod: pod_id, pod_type, sex, admin_route, species, qualifier, pod_value, pod_unit,
mg_kg_day_value, dose_level, max_dose_level, staggered_dosing, chemical_id, study_id,
effect_profile_id, group_id

pod_tg_effect: pod_tg_effect_id, pod_id, tg_effect_id
-------
Figure 4: Schema Overview

metadata, dosing, and significant treatment-related and critical effects.
Part 2:

Observation status for ToxRefDB endpoints

• Reported status: Was the endpoint described in the study literature?

• Tested status: is "assumed" based on the default from guideline profile

• Tested status: Were data collected for the endpoint?

o No (not tested): No effect data recorded for the endpoint in database

• Treatment group effect data: life stage; direction of net change across all doses
(increase/decrease)

• Qualitative: Treatment related? Critical effect?

• Treatment-related endpoint effects: Was the data collected described as at least
one of the following?

1. Toxicologically significant
2. Biologically significant
3. Statistically significant
4. Used to derive LOEL/NOEL
5. Treatment-related or Dose-related
6. Quantitative data suggests trend across doses

o Yes: Effect data information (method information describing the data collected for
each applicable endpoint's effect)

Part 2 provides more context about the data entry method. A portion of ToxRefDB v1.0
carried over to version 2.0 unchanged. The previously extracted information from ToxRefDB v1.0
was checked for accuracy and modified/added for QA purposes.

A. Curator assigns endpoint testing status according to guideline profile. Uses decision
tree to classify 400 standardized endpoints as described in study reports. Guideline
profiles were developed that match language found in the studies. These guideline
profiles were used for inference of negative endpoints/effects.

B. Observed Endpoints classified as "tested" are evaluated for treatment-related effects.
Treatment-related effects are indexed by endpoint and method information pertaining to
the data collected.

C. Where available, complete dose-response effect qualitative and/or quantitative data for
each dose was extracted.

-------
Data Curation Process

Initially, ToxRefDB v1.0 provided only summary effect levels and lacked quantitative dose
response information. This task initially proceeded using an Excel file-based extraction;
however, the process required manual corrections after uploading study extractions to the
ToxRefDB MySQL database, including inconsistent comments, different number of animals for
the same treatment group, and added effects outside of the controlled terminology. The
quantitative information and its application in ToxRefDB v2.0 served as a strong impetus to re-
extract the studies.

An Access database file was generated from the MySQL database for each study in v1.0, and
this approach offered several improvements including standardized options for more consistent
reporting in some fields, such as the units on time and dose, dose-treatment group, and effect
information; checkbox reporting for observation status on each endpoint and effect; and a log
for tracking changes and facilitating QA. Nearly 32% of the studies were extracted using the
Excel-based approach, with the remaining studies extracted using the Access database
approach. Switching to Access database files from Excel files significantly reduced errors and
increased standardization of reporting items such as units, endpoints, and effects.

Figure 5: Data Extraction and Review Workflow

Figure 5 details the workflow of the overall data extraction process for ToxRefDB v2.0. Access
database files were generated for each study in ToxRefDB v1.0 and bundled with the
corresponding source files for data extraction. The data in the Access databases are curated
with additional data extracted from the source files with up to three levels of review. The
Access databases are returned by the reviewers and the data are imported back into the MySQL
database with the study table designation of processed=1.


-------
ToxRefDB v2.0 curation also included the implementation of guideline profiles to guide
curation. Endpoints were annotated (e.g. "required", "not required") according to guidelines for
subacute, subchronic, chronic, developmental, and multigenerational reproductive designs,
distinguishing negative responses from untested endpoints. Implementation of controlled
vocabulary improved data quality; standardization to guideline requirements and
cross-referencing with the Unified Medical Language System (UMLS) connect ToxRefDB v2.0
observations to vocabularies linked to UMLS, including PubMed Medical Subject Headings
(MeSH). The endpoint terminology and its hierarchical nature are described in later sections.

Moving forward, an application-driven workflow with the Data Collection Tool (DCT) will be
utilized to create a more sustainable process for loading curated information to a database.
The DCT improves upon the legacy ToxRefDB curation workflow to provide document
allocation, curation and workflow management among users, and management review with
data conflict resolution, resulting in records that directly link quality-controlled curations to
source documents. The DCT offers flexibility via its modular workflow for curating the
heterogeneous and complex in vivo study designs.

A multi-layer review process will continue to be implemented with the DCT to ensure data
integrity and minimize data entry error.

Quality Assurance in Data Extraction

Guidance for data extraction was stratified first according to study type (e.g., CHR, SUB, DEV,
MGR) then by study source (e.g., OPP DER and NTP) because of the differences in both
study design and adverse effects required for reporting as stated in guidelines. The process
used to extract study information was also an important aspect of QA efforts for ToxRefDB
v2.0. First, a primary reviewer extracted study, dose, treatment group, effect, and endpoint
observation information. The instructions detailed how to review the toxicological data and
extract it from the original data sources consistently across reviewers using the Access
database. This was reviewed by a second, senior reviewer, who was asked to review all
extracted information as if they were extracting it again and, also, to review the comment log
from the primary reviewer. Finally, if either the primary or secondary reviewer noted that it was
necessary, an additional senior toxicologist reviewed the comment logs, extracted information,
and resolved any conflicts or questions prior to finalization of the extraction. The final, tertiary
review occurred for approximately 10% of the studies. Review by a manager to resolve any
differences between the primary and secondary reviewer serves to inform any training needs
or gaps for the reviewers. During this process, subject matter experts can also be consulted to
resolve questions. For release of ToxRefDB v2.0, the full quantitative data extraction for all
CHR and SUB studies was completed, with quantitative data extraction completed for many
other study types and sources as well.

Efforts to Reduce Error Rate

Error rate is an inherent problem for legacy databases as much of the source information was
entered manually and human errors resulting from transcription are impossible to completely
avoid. However, as part of the ToxRefDB v2.0 curation effort, more robust QA processes were
implemented to promote greater fidelity of the information extracted, and numerous quality
control (QC) checks were added to verify data integrity.

-------
First, studies were extracted utilizing a defined QA process, with multiple levels of review and
Access form-based entry (described previously) to prevent extraction errors. Upon upload
into ToxRefDB v2.0, these extractions were required to pass specific QC checks because,
although the Access database files enforce the MySQL database constraints as well as minimize
data entry error by standardizing the vocabulary used, logical errors can persist. After the
extracted data were uploaded through the import script, a series of potential logical errors were
identified through unit tests where their curated value could be assumed. Flagged logical
errors that have been corrected included:

• Dose level numbering did not correspond to the total number of doses;

• Duplication of concentration/dose values, including two control doses;

• No concentration and no dose adjusted value for a reported effect (possible extraction
error or possibly that the effect was qualitatively reported);

• The critical effect level is at a dose below where treatment-related effects were
observed; and/or,

• The control was incorrectly identified as a critical effect level.

Any of these issues that could not be resolved systematically were flagged to undergo a
second round of extraction and review to correct. Though QC is an ongoing and evolving
process, these QC checks are serving as an improvement to the overall database and
database development process.
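
The logical checks listed above can be illustrated with a minimal Python sketch. This is not the ToxRefDB import script or its unit tests. The DoseRecord class is a hypothetical, flattened stand-in for rows that are actually spread across several tables, and the flag names are invented for illustration. The numbering check assumes dose levels run consecutively from 0, which may not hold for studies with staggered dose levels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DoseRecord:
    """Hypothetical, simplified dose row; not the actual ToxRefDB schema."""
    dose_level: int
    conc: Optional[float]
    dose_adjusted: Optional[float]
    treatment_related: bool = False
    critical_effect: bool = False


def qc_flags(doses: List[DoseRecord]) -> List[str]:
    """Return illustrative logical-error flags for one group of doses."""
    flags = []

    # Assumption: dose levels are numbered 0..n-1 for n doses.
    levels = sorted(d.dose_level for d in doses)
    if levels != list(range(len(doses))):
        flags.append("dose_level_numbering")

    # Duplicate concentration values, which also catches two controls.
    concs = [d.conc for d in doses if d.conc is not None]
    if len(concs) != len(set(concs)):
        flags.append("duplicate_conc")

    # A reported effect with neither a concentration nor an adjusted dose.
    if any(d.treatment_related and d.conc is None and d.dose_adjusted is None
           for d in doses):
        flags.append("effect_without_dose")

    # Critical effect level below the lowest treatment-related dose level.
    tr_levels = [d.dose_level for d in doses if d.treatment_related]
    ce_levels = [d.dose_level for d in doses if d.critical_effect]
    if tr_levels and ce_levels and min(ce_levels) < min(tr_levels):
        flags.append("critical_below_treatment_related")

    # Control (dose level 0) identified as a critical effect level.
    if any(d.dose_level == 0 and d.critical_effect for d in doses):
        flags.append("control_is_critical")

    return flags
```

A record set that raises any flag would, following the text above, be a candidate for a second round of extraction and review.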

Unit Standardization

An additional ongoing problem for reporting quantitative data from clinical or related laboratory
findings is unit standardization. No guidance is provided on how to report findings in the
OCSPP guidelines nor from any other sources, so units were extracted exactly as they were
presented in the reports. The units were standardized by eliminating duplicate entries for the
same units that were originally entered differently or with typographical errors. Units were only
standardized, and no conversions were introduced in the current database. Ongoing efforts
include further standardization of units and defining conversions that cannot be systematically
automated.
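
As a rough illustration of standardization without conversion, the sketch below collapses differently entered spellings of the same unit into one canonical label and leaves numeric values untouched. The synonym map is hypothetical and far smaller than any real unit list; it is not taken from ToxRefDB.

```python
import re

# Hypothetical variant-to-canonical map; the actual ToxRefDB unit
# vocabulary is not reproduced here.
UNIT_SYNONYMS = {
    "mg/dl": "mg/dL",
    "mg per dl": "mg/dL",
    "g": "g",
    "gm": "g",
    "grams": "g",
}


def standardize_unit(raw: str) -> str:
    """Map a unit string to its canonical spelling; never convert values."""
    cleaned = raw.strip()
    # Normalize case and internal whitespace before the lookup.
    key = re.sub(r"\s+", " ", cleaned.lower())
    # Unknown units pass through unchanged so nothing is silently altered.
    return UNIT_SYNONYMS.get(key, cleaned)
```

Unrecognized units are returned as entered, which mirrors the guide's statement that units were only standardized and no conversions were introduced.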

Study Reliability with ToxRTool

Most studies referenced within ToxRefDB were extracted via summaries from OPP DERs, and
these studies typically follow OCSPP 870 series Health Effects Testing Guidelines. As
ToxRefDB was expanded, additional studies needed to be assessed for reliability and
guideline adherence.

The Toxicological Data Reliability Assessment Tool (ToxRTool) was adapted for reliability
assessment. ToxRTool is an Excel application that includes questions across 5 criteria with
numerical responses that are summed to lead to a Klimisch score: a score ranging from 1-4
that captures an overall assessment of reliability.

A total of 522 OpenLit studies were assessed with the ToxRTool with scores ranging from 8 to
23. As explained in the table below, most studies reviewed for ToxRefDB v2.0 corresponded to
Klimisch quality scores of 1 (ToxRTool score of > 18) or 2 (ToxRTool score of 13-18). The
ToxRTool scores could be used as a quality flag both to qualify and prioritize studies for the
extraction process, or by users who are performing reviews of information on a single chemical
basis.
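
The score ranges quoted above can be expressed as a small lookup. The function name is hypothetical, and only the two ranges stated in this guide are encoded; scores below 13 return None because this guide does not state the corresponding Klimisch category.

```python
from typing import Optional


def klimisch_from_toxrtool(score: int) -> Optional[int]:
    """Map a ToxRTool score to a Klimisch score using the ranges in this guide."""
    if score > 18:
        return 1  # ToxRTool score of > 18
    if 13 <= score <= 18:
        return 2  # ToxRTool score of 13-18
    # Lower scores are not mapped in this guide, so no category is assumed.
    return None
```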

-------
Table 3: ToxRTool Guideline Adherence Score

Score

Description

Adheres to modern* OECD/EPA guideline for repeat-dose toxicity studies
(explicitly stated by authors; broad endpoint coverage and ability to assess
dose-response)

Adheres to an existing or previous guideline (explicitly stated by authors;
previous version of OECD/EPA guidelines or FDA guidelines)

Not stated to adhere to guideline but guideline-like in terms of endpoint
coverage and ability to assess dose-response (e.g., NTP). Please see Quick
Guide to EPA Guidelines for chronic and subchronic studies. In this table, you
can easily assess whether the study was guideline-like in terms of the animals
used (species, sex, age, number), dosing requirements, and reporting
recommendations.

Unacceptable adherence to guideline (intended to adhere to guideline but had
major deficiencies)

Unacceptable (no intention to be run as a guideline study, purely open
literature or specialized study)

* A study is considered as adhering to "modern" OECD/EPA guidelines if it was published after
1998, which is the date that many Health Effect 870 series guidelines were re-published. Note
that many of the studies extracted, particularly from sources like the NTP and OpenLit, were
never intended to adhere to a guideline and as such "unacceptable" in this case only refers to
their guideline adherence and not the study design itself.

-------
Guideline Profiles

Within a curation, study records are linked to a guideline profile. OPP DERs follow the Series
870 - Health Effects Test Guidelines, described here. NTP reports follow NTP specifications.
Other subsources cannot be uniformly mapped, but some curations may be assigned a
guideline profile based on how closely the study design adheres to a guideline.

Guideline profiles for study endpoints were created from the Office of Chemical Safety and
Pollution Prevention (OCSPP) series 870 Health Effects Testing Guidelines and NTP
specifications (Table 2). This allows for analysis of guideline adherence for both guideline and
non-guideline studies.

Table 4: Guideline Profile Coverage

Additional efforts are underway to develop new profiles. The Guideline Profile column is a
concatenated entry of ToxRefDB's guideline id, guideline number (usually OCSPP Guideline
No. or NA for NTP specifications), guideline name, and abbreviated guideline profile name.

Study Type

Guideline Profile

Guideline Profile Description

CHR-

Carcinogenicity

• 9 | 870.42 | Carcinogenicity
| CHR_carc

The objective of a long-term carcinogenicity study is to observe test
animals for a major portion of life span for development of neoplastic
lesions during or after exposure to test substance by an appropriate
route of administration. The dose period generally lasts a year or
longer, typically 12, 18, or 24 months, and observations will exclude
developmental and neurological effects. See OPPTS 870.4200
Carcinogenicity.

CHR - Chronic
Toxicity

• 17 | NA| 2-Year Toxicity |
CHR_ntp

• 8 | 870.41 | Chronic
Toxicity| CHR_chr_tox

The objective of a chronic toxicity study is to determine the effects of
a substance in a mammalian species following prolonged and
repeated exposure. A chronic toxicity study should generate data to
identify chronic effects and define long-term dose-response
relationships. The dose period generally lasts a year or longer,
typically 12, 18, or 24 months, and observations will exclude
developmental and neurological effects. See OPPTS 870.4100
Chronic Toxicity.

CHR-
Combined
Chronic Toxicity
/ Carcinogenicity

• 10 | 870.43 | Combined
Chronic Toxicity /
Carcinogenicity |
CHR_chr_canc

The objective of a combined chronic toxicity/carcinogenicity study is
to determine the effects of a substance in a mammalian species
following prolonged and repeated exposure. Following updates to the
870 Series Health Effects Guidelines in 1998, this combined study
was preferred to separate submissions of 870.4100 and 870.4200.
The design and conduct should allow for the detection of neoplastic
effects and a determination of the carcinogenic potential as well as
general toxicity. The dose period generally lasts a year or longer,
typically 12, 18, or 24 months, and observations will exclude
developmental and neurological effects. See OPPTS 870.4300
Combined Chronic Toxicity/Carcinogenicity.

DEV - Prenatal
Developmental
Toxicity Study

• 6 | 870.37 | Prenatal
Developmental Toxicity
Study | DEV_pren_dev

This guideline for developmental toxicity testing is designed to
provide general information concerning the effects of exposure of the
pregnant test animal on the developing organism; this may include
death, structural abnormalities, or altered growth and an assessment
of maternal effects. The dose period is usually gestational (in utero)
and the animal is sacrificed prior to delivery. See OPPTS 870.3700
Prenatal Developmental Toxicity Study

-------
MGR - Multi-
generational
reproductive
toxicity study

• 7 | 870.38 | Reproduction
and Fertility Effects |
MGR_rep_fert

• 13 | 870.38 |
Reproduction and Fertility
Effects |
MGR_rep_fert_pre98

Note: There are two guideline
profiles due to a 1998 guideline
change. The post-1998
guideline was likely used for
MGR studies that started in
1996.

This guideline for two-generation reproduction testing is designed to
provide general information concerning the effects of a test
substance on the integrity and performance of the male and female
reproductive systems, including gonadal function, the estrous cycle,
mating behavior, conception, gestation, parturition, lactation, and
weaning, and on the growth and development of the offspring. The
study may also provide information about the effects of the test
substance on neonatal morbidity, mortality, target organs in the
offspring, and preliminary data on prenatal and postnatal
developmental toxicity and serve as a guide for subsequent tests.
Additionally, since the study design includes in utero as well as
postnatal exposure, this study provides the opportunity to examine
the susceptibility of the immature/neonatal animal. The dose period
begins in adolescent F0 males and females and continues until the
terminal generation. Some of the litters deliver their pups, while
others may be sacrificed prior to delivery. See OPPTS 870.3800
Reproduction and Fertility Effects.

REP - Fertility
(Segment 1)

• 5 | 870.355 |
Reproduction/Development
Toxicity Screening Test |
REP_rep_dev

This guideline is designed to generate limited information concerning
the effects of a test substance on male and female reproductive
performance such as gonadal function, mating behavior, conception,
development of the conceptus, and parturition. This screening test
guideline can be used to provide initial information on possible
effects on reproduction and/or development, either at an early stage
of assessing the toxicological properties of chemicals, or on
chemicals of high concerns focused on early postnatal evaluation,
with sacrifice of dams and offspring at postnatal day 4. See OPPTS
870.3550 Reproduction and Fertility Effects.

REP - Peri- and
post-natal
toxicity study
(Segment III)

• 5 | 870.355 |
Reproduction/Development
Toxicity Screening Test |
REP_rep_dev

The study may provide information about the effects of the test
substance on neonatal morbidity, mortality, target organs in the
offspring, and preliminary data on prenatal and postnatal
developmental toxicity and serve as a guide for subsequent tests.
Additionally, since the study design includes in utero as well as
postnatal exposure, this study provides the opportunity to examine
the susceptibility of the immature/neonatal animal (F1 generation).
See OPPTS 870.3550 Reproduction and Fertility Effects.

REP-

Reproductive /
developmental
toxicity

screening test

• 5 | 870.355 |
Reproduction/Development
Toxicity Screening Test |
REP_rep_dev

This guideline is designed to generate limited information concerning
the effects of a test substance on male and female reproductive
performance such as gonadal function, mating behavior, conception,
development of the conceptus, and parturition. This screening test
guideline can be used to provide initial information on possible
effects on reproduction and/or development, either at an early stage
of assessing the toxicological properties of chemicals, or on
chemicals of high concern. See OPPTS 870.3550 Reproduction and
Fertility Effects.

SAC - Sub-
acute dermal
toxicity

• 3 | 870.325 | 90-day Dermal
Toxicity | SUB_sub_derm

A 21/28 day repeated dose dermal study will provide information on
possible health hazards likely to arise from repeated dermal
exposure to a test substance for a period of 21/28 days. Dose period
is typically 21-28 days with dermal exposure route, and observations
will exclude developmental and neurological effects. See OPPTS
870.3200 21/28-Day Dermal Toxicity.

SAC - Sub-
acute repeat
dose toxicity

• 14 | 870.305 | 28-day Oral
Toxicity in Rodents |
SAC_oral_rode_28

• 15 || 14-day Toxicity in
Rodents | SAC_ntp

The objective of a sub-acute repeat dose toxicity study is to
determine the adverse effects of a substance in a mammalian
species occurring after short-term dosing duration. Determination of
acute toxicity is usually an initial step in the assessment and
evaluation of the toxic characteristics of a substance. Dose period is
typically 21-28 days with varied exposure routes, and observations
will exclude developmental and neurological effects. See
https://www.regulations.gov/document/EPA-HQ-OPPT-2009-0156-
0009

-------
SUB-
Subchronic
dermal toxicity

• 16 | | 13-Week Toxicity in
Rodents | SUB_ntp

• 3 | 870.325 | 90-day Dermal
Toxicity | SUB_sub_derm

The subchronic dermal study has been designed to permit the
determination of the no-observed-effect level (NOEL) and toxic
effects associated with continuous or repeated exposure to a test
substance for a period of 90 days. It can provide useful information
on the degree of percutaneous absorption, target organs, the
possibilities of accumulation, and can be of use in selecting dose
levels for chronic studies and for establishing safety criteria for
human exposure. The dose period is typically 90 days or 13 weeks,
but may be as long as 6 months, via dermal routes of exposure.
Observations will exclude developmental and neurological effects.
See OPPTS 870.3250 90-Day Dermal Toxicity.

SUB-
Subchronic
inhalation
toxicity

• 4 | 870.3465 | 90-Day
Inhalation Toxicity |
SUB_sub_inha

The subchronic inhalation study has been designed to permit the
determination of the no-observed effect-level (NOEL) and toxic
effects associated with continuous or repeated exposure to a test
substance for a period of 90 days. It will provide information on target
organs and the possibilities of accumulation, and can be used to
select concentration levels for chronic studies and establishing safety
criteria for human exposure. The dose period is typically 90 days
or 13 weeks, but it may be as long as 6 months, via inhalation routes
of exposure. Observations will exclude developmental and
neurological effects. See OPPTS 870.3465 90-Day Inhalation
Toxicity.

SUB-

Subchronic oral
toxicity in
nonrodent

• 2 | 870.315 | 90-day Oral
Toxicity in Nonrodents |
SUB_oral_nonr

The subchronic oral study has been designed to permit the
determination of the no-observed-effect level (NOEL) and toxic
effects associated with continuous or repeated exposure to a test
substance for a period of 90 days. It provides information on target
organs, the possibilities of accumulation, and can be of use in
selecting dose levels for chronic studies and for establishing safety
criteria for human exposure. The dose period is typically 90 days or
13 weeks, but it may be as long as 6 months, via oral routes of
exposure in any nonrodent species. Observations will exclude
developmental and neurological effects. See OPPTS 870.3150
90-Day Oral Toxicity in Nonrodents.

SUB-

Subchronic oral
toxicity in
rodents

• 1 | 870.31 | 90-day Oral
Toxicity in Rodents |
SUB_oral_rode

• 16 | | 13-Week Toxicity in
Rodents | SUB_ntp

The subchronic oral study has been designed to permit the
determination of the no-observed-effect level (NOEL) and toxic
effects associated with continuous or repeated exposure to a test
substance for a period of 90 days. It provides information on target
organs, the possibilities of accumulation, and can be of use in
selecting dose levels for chronic studies and for establishing safety
criteria for human exposure. The dose period is typically 90 days or
13 weeks, but may be as long as 6 months, via oral routes of
exposure in rodent species, typically rats and mice. Observations will
exclude developmental and neurological effects. See OPPTS
870.3100 90-Day Oral Toxicity in Rodents.
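
Because each Guideline Profile entry in Table 4 is a pipe-delimited concatenation of guideline id, guideline number, guideline name, and abbreviated profile name, it can be split back into its parts. The sketch below is illustrative only; the GuidelineProfile type and parse_guideline_profile function are hypothetical names, not part of the ToxRefDB distribution. The guideline number is kept as text, and both "NA" and an empty field are treated as "no OCSPP number" (an assumption based on the NTP rows in the table).

```python
from typing import NamedTuple, Optional


class GuidelineProfile(NamedTuple):
    """Hypothetical container for one concatenated Guideline Profile entry."""
    guideline_id: int
    guideline_number: Optional[str]
    guideline_name: str
    profile_name: str


def parse_guideline_profile(entry: str) -> GuidelineProfile:
    """Split 'id | number | name | profile' into named fields."""
    # Drop a leading bullet and spaces, then split on the pipe delimiter.
    parts = [p.strip() for p in entry.lstrip("\u2022 ").split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-delimited fields, got {len(parts)}")
    gid, number, name, profile = parts
    # NTP specifications carry 'NA' or an empty guideline number.
    number_or_none = None if number in ("", "NA") else number
    return GuidelineProfile(int(gid), number_or_none, name, profile)
```

For example, parsing "9 | 870.42 | Carcinogenicity | CHR_carc" yields guideline id 9 and profile name CHR_carc.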

-------
Endpoint Terminology

ToxRefDB employs controlled terminology standardized to better reflect both the OCSPP
Health Effects 870 series guidelines and DER summary reporting. This hierarchical
relationship of effects and endpoints was adapted from the vocabulary developed for earlier
versions of ToxRefDB based on the data types curated. Novel values can be added when
found during a curation.

Figure 6: Hierarchical endpoint terminology example

Observation (endpoint category | endpoint type | endpoint target):
reproductive | reproductive performance | postimplantation loss

Effect (specific condition associated with endpoint target)
Effect Description: postimplantation loss

TreatmentGroup Effect (life stage, location, verbatim text)
Life Stage: adult pregnancy
Target Site: uterus
Effect Description Free: postimplantation site loss: mean

An example of the terminology hierarchy is demonstrated for an effect described as
"postimplantation loss". The finding is recorded as is in the "effect description free" field, which
is the verbatim wording used in the study report. The remaining fields are part of the ToxRefDB
controlled terminology. The endpoint category is reproductive, the endpoint type is
reproductive performance, the endpoint target is postimplantation loss, the effect description is
postimplantation loss, and the specific observation of "postimplantation loss" was made in the
adult pregnancy life-stage at the specific target site, the uterus.
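
The hierarchy in this example can be pictured as nested records. The following is a minimal sketch and not the database schema: the class and attribute names are hypothetical and were chosen to mirror the terms used in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Controlled terminology: category > type > target."""
    endpoint_category: str
    endpoint_type: str
    endpoint_target: str


@dataclass(frozen=True)
class Effect:
    """A specific condition associated with an endpoint target."""
    endpoint: Endpoint
    effect_desc: str


@dataclass(frozen=True)
class TreatmentGroupEffect:
    """An observation of an effect, with life stage, site, and verbatim text."""
    effect: Effect
    life_stage: str
    target_site: str
    effect_desc_free: str

    def path(self) -> str:
        """Join the controlled terms from broadest to most specific."""
        e = self.effect.endpoint
        return " | ".join([e.endpoint_category, e.endpoint_type,
                           e.endpoint_target, self.effect.effect_desc])


# The "postimplantation loss" example described in the text.
example = TreatmentGroupEffect(
    effect=Effect(
        Endpoint("reproductive", "reproductive performance",
                 "postimplantation loss"),
        "postimplantation loss",
    ),
    life_stage="adult pregnancy",
    target_site="uterus",
    effect_desc_free="postimplantation loss",
)
```

Keeping the verbatim wording in its own field separates what the study report said from the controlled terms assigned during curation.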

Ontology mappings

It is increasingly apparent that many toxicology research questions will require the integration
of public data resources, both with those containing the same types of information, as well as
with other databases to connect different kinds of information. ToxRefDB v2.0 allows for
increased connections to other resources, which has greatly enhanced its quantitative and
qualitative utility for predictive toxicology.

For example, efforts linking in vitro effects in ToxCast to in vivo outcomes using predictive
models may help to identify rapid, more efficient chemical screening alternatives. To connect
the ToxRefDB endpoint and effect terminology with other resources, the ToxRefDB
terminology was standardized and cross-referenced to the Unified Medical Language System
(UMLS). UMLS cross-references enable mapping of in vivo pathological effects from
ToxRefDB to PubMed (via Medical Subject Headings or MeSH terms), which may be relevant
for toxicological research and systematic review. This enables linkage to any resource that is
also connected to PubMed or indexed with MeSH.

-------
Figure 7: Cross-referenced Terminology Sources

Over 1,800 UMLS concept codes were mapped to endpoints and effects in ToxRefDB via a
manual process. Only 500 of those concept codes are a part of the CDISC-SEND terminology.
All of the concept codes are a part of vocabularies within both National Cancer Institute
Thesaurus (NCIt) as well as UMLS.

Additionally, the Entity MeSH Co-occurrence Network (EMCON) consists of ranked lists of
genes for a given topic. This resource can be used to identify genes related to adverse effects
observed in ToxRefDB. Subsequently, ToxCast can be integrated since the intended targets
are mapped to Entrez gene IDs.

The result of updating the ToxRefDB terminology and linking to the UMLS concepts is that
ToxRefDB may be used to better anchor or compare to new approach method (NAM)
information, including data from ToxCast or structure-activity relationship models, as well as
other in vivo databases of toxicological information, such as eChemPortal and e-TOX.
Integration of these data resources is a major hurdle toward evaluating the reproducibility and
biological meaning of both traditional, legacy toxicity information and the data from NAMs.

Additional work may be performed to link to other ontologies and to assist stakeholders in
mapping their ontologies to the ToxRefDB and UMLS ontologies.

-------
Negative Endpoints and Effects

As part of the v2.0 update to ToxRefDB, negative endpoints and effects can be inferred from
guideline profiles and the testing and reporting statuses of endpoints. Given the list of all
observations required for the relevant guideline profile, the curator indicates which endpoints
were missing (meaning not tested) or negative (meaning tested with no effect observed) by
setting tested and reported status accordingly. Endpoint observation status enables automated
distinction of true negatives and a better understanding of false negative effects. Users can
access the current inferred negatives and calculate inferences for a specific subset.

The MySQL database has inferred study-level negative effects and negative endpoints
available in two tables: "negative_effect" and "negative_endpoint". These tables were created
from stored procedures (repopulate_negative_effect and repopulate_negative_endpoint) that
are also available with the full MySQL database. The logic for the stored procedures follows
the inference workflow seen in Figure 8. Endpoint Observation Status distinguishes negative
and missing (not tested) effects based on the study's specific guideline requirements. An effect
is negative if the study has gone through the data extraction process, the effect was tested
(regardless of being reported), and no effect was seen in the study. An endpoint is negative for
a study if all effects for that endpoint are also negative in the study.
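
The inference rule stated above can be sketched in a few lines of Python. This is not the repopulate_negative_effect or repopulate_negative_endpoint stored procedure; the function names and the dictionary keys ("endpoint", "tested", "effect_seen") are hypothetical and serve only to illustrate the logic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def effect_is_negative(study_extracted: bool, tested: bool,
                       effect_seen: bool) -> bool:
    """An effect is negative if the study was extracted, the effect was
    tested (regardless of being reported), and no effect was seen."""
    return study_extracted and tested and not effect_seen


def negative_endpoints(study_extracted: bool,
                       effects: Iterable[dict]) -> Set[str]:
    """Return endpoints for which every effect is negative in the study."""
    by_endpoint: Dict[str, List[bool]] = defaultdict(list)
    for row in effects:
        by_endpoint[row["endpoint"]].append(
            effect_is_negative(study_extracted, row["tested"],
                               row["effect_seen"])
        )
    # An endpoint is negative only if all of its effects are negative.
    return {ep for ep, flags in by_endpoint.items() if all(flags)}
```

Under this rule an untested effect is never counted as negative, which is what allows missing data to be separated from true negatives.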

Table 5: Endpoint Observation Status

Tested
Status

Reported
Status

Assumption

Yes

The text of the study document explicitly stated the endpoint was
measured, or data was presented in tables for the endpoint. This is the
combination if required by the guideline for that study type and data is
provided within the document, even if the effects measured were not
significant.

Yes

This is the combination if the study document explicitly states the endpoint
was not measured or data was not collected, even though the endpoint
was required by the study guidelines.

Yes

The text of the study document does not state the endpoint was measured
and data for the endpoint is not present. However, other evidence
suggests that the endpoint was measured. This is the default for endpoints
required by the study guideline and should only be changed in the face of
direct evidence from the document.

Within the long table of observations from all study guidelines, this is the
default setting for the endpoints not required by the alternative study
guidelines and they should not be changed. Interpret these observations
as irrelevant since they are not serving the selected guideline, therefore
not required to be tested nor reported.

-------
Figure 8: Decision tree for identification of negative endpoints and effects

Negative endpoints and effects can only be identified in studies that have gone through data
extraction and any subsequent QA processes because this ensures confidence in decisions
made about the adherence and/or deviations from the corresponding guideline profiles. We
can infer negatives based on whether or not an endpoint was tested and no treatment group-
related effects were seen. The example below shows how reported results are interpreted
given the study's guideline profile.

Figure 9: Example Observation Status interpretation

The figure is an annotated excerpt from a scanned study report that lists urinalysis
parameters, weighed organs, and gross necropsy procedures. The legible annotations are:

• For organs with histopathology: R=1, T=1.

• R0; T1 for the "Required by Guideline" organs, unless there are results on gross
pathology for a given organ, which would make that organ R1; T1.
-------
Ongoing Work

• Replicate extraction of all of the data fields from ToxRefDB's legacy AccessDB curation
system;

• Include a "wizard" to walk the data curator through entry of study meta-data, chemical
composition information, dose information, dose-treatment group information,
quantitative data extraction for dose-treatment groups, and evaluation of the endpoint
observation status according to guideline specification;

• Offer flexibility for curating the heterogeneous and complex in vivo study designs via a
modular workflow;

• Continue to implement and improve controlled vocabularies for experimental design
elements as well as endpoint and effect language;

• Provide document allocation, curation and workflow management among users (internal
and external) with manager review and data conflict resolution for data provenance and
progress tracking;

• Link a quality-controlled curation to Clowder source documents; and

• Create a sustainable pipeline for data integration.

There are several critical advantages inherent in the success of this application. Automating
the data extraction creates a new more systematic and sustainable workflow. Following data
curation, ETL could be managed using Pentaho for direct loading to a database. Overall, this
effort would allow for the continued expansion of the ToxRefDB resource by providing a more
efficient process for curation of study information.

Following conclusion of the initial development phase, curation of developmental toxicity (DEV)
data evaluation records (DERs) from recent pesticide submissions was the selected focus for
Phase I DCT extraction. Future curation efforts were prioritized from the DER documents from
an initial web scrape of all documents that were published since 2008, adhering to existing
guideline profiles, and not currently captured in ToxRefDB. Additional extraction may include
studies previously extracted using Excel and AccessDB files followed by comparison of the
results to look for accuracy, as well as new study types following the generation of new
guideline profiles and vocabularies. Feedback from data curators will help inform further
development enhancements.

Future versions with expanded chemical and study data collected via its new application-driven
curation workflow (DCT) and the creation of a ToxRefDB dashboard will increase ToxRefDB's
utility. Standardization efforts will continue to provide more detailed effect and study-level
information and will allow for more streamlined interoperable database efforts.

Without a user interface, ToxRefDB information is only accessible from the MySQL database
download or via ToxValDB hazard summary section. Complete ToxRefDB information will
soon be integrated into the CompTox Chemicals Dashboard and available via batch search
functionality.

-------
Data Dictionary

A data dictionary is found in the database in the toxrefdb_dd table.

ToxRefDB Table

Field

Field Description

chemical

chemical_id

PK: Autoincremented unique identifier for a
chemical

dsstox_substance_id

Unique identifier from DSSTox

casrn

CAS Registry Number

preferred_name

Preferred name of the chemical substance tested
in the study.

dose

dose_id

PK: Autoincremented unique identifier for a dose

study_id

FK: A unique numeric identifier for each study in
the database.

conc

Concentration of a test chemical, typically
reported in ppm within the exposure matrix (e.g.,
feed or water).

conc_unit

Unit associated with a concentration of a test
chemical, typically reported as ppm.

dose_comment

This field can be used to explain any differences
in dosing over the dosing interval or provide
clarifying comments on how the dose was
administered. Specific concentrations of the
vehicle should be listed here when relevant. For
example, if methylcellulose was used as a
vehicle, the concentration of methylcellulose may
be included in the comment field (e.g., 0.5% w/v
aqueous methylcellulose).

dose_level

Numeric rank indicating the level of dose
administered to test animals, with lower dose
levels indicating lower concentrations of a
chemical (e.g., 0 = vehicle, 1 = lowest dose, etc.).
The dose level for some studies may be
staggered since concentrations may vary by sex
(e.g., male treatment group: 0 = vehicle, 1 =
lowest dose, 3 = second lowest dose, etc.).

vehicle

The media used in administration of chemical

dtg

dtg_id

PK: Autoincremented unique identifier for a
dosed-treatment group

dose_id

FK: A unique numeric identifier for each dose in
the database.

tg_id

FK: A unique numeric identifier for each
treatment group in the database.

dose_adjusted

The amount of the chemical administered in
mg/kg of body weight/day (mg/kg/day). This
value is typically different between male and
female groups receiving the same dose
concentration (conc) due to differences in
bodyweight. If dose_adjusted values were not
provided in a study, then they were calculated
using species scaling factors (FAO/WHO, 2000).

dose_adjusted_unit

Unit associated with the adjusted dose of a
chemical, typically reported in mg/kg/day.

dtg_comment

NULL if no additional comment needed; explains
any difference in the dose-treatment-group over
the course of the study (i.e., interim sacrifice or
changes due to toxicity and/or morbidity); quality
assurance (QA) flags indicate discrepancies
between the reported and correct values for the
study; differences in any dose_adjusted
calculations are provided.

mg_kg_day_value

The mg/kg/day species-specific, converted value
from ppm concentration

dtg_effect

dtg_effect_id

PK: Autoincremented unique identifier for a
dosed-treatment group effect

dtg_id

FK: A unique numeric identifier for each dosed
treatment group in the database.

tg_effect_id

FK: A unique numeric identifier for each
treatment group effect in the database.

critical_effect

Binary description (0,1) for an effect by dose
treatment group. "1" corresponds to a toxic or
adverse effect denoted in the study summary or
via expert judgement using a weight-of-evidence
approach. "0" indicates that although an effect is
produced at this level, it is not considered
adverse, nor an immediate precursor to specific
adverse effects. If there are several critical
effects, the no observed adverse effect level
(NOAEL) is determined from the highest dose
level without critical effects. The lowest dose
level at which the critical effect was observed in a
study is the lowest observed adverse effect level
(LOAEL).

dtg_effect_comment

NULL if no additional comment needed; provides
additional explanation of the dose-treatment-
group-effect row in the table, including statistical
significance.

effect_val

Numeric value of a measured effect; can be
continuous or dichotomous (incidence) data.

effect_val_unit

Unit associated with the effect value.

effect_var

Measurement of the variance for a set of data
associated with a measured effect, generally
reported as the standard deviation (SD) or
standard error (SE).

effect_var_type

Name of the variance metric used to determine
the effect variance, typically the standard
deviation (SD) or standard error (SE). Other
effect_var types include: interquartile range, 95%
confidence limit, and none.

sample_size

Number of animals used for an examination for a
particular effect.

-------

time

Numeric value associated with the duration of the
exposure at which a particular effect was
measured or observed, typically reported in
hours, days, weeks, or months.

treatment_related

Binary description (0,1) for an effect by dose
treatment group. "1" indicates there was a
statistically significant difference from the control
group for the effect; "0" indicates there was no
difference from the control group. The highest
dose level at which no significant adverse
effects were observed corresponds to the no
effect level (NEL). The lowest effect level (LEL)
can be inferred from treatment-related effects.
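
The way effect levels follow from the binary flags in dtg_effect can be sketched as below. This is a minimal sketch, not the procedure used to populate the pod table: the function name is hypothetical, and it assumes that the "no effect" level is the highest unflagged dose below the lowest flagged dose. Passing treatment_related flags yields (NEL, LEL); passing critical_effect flags yields (NOAEL, LOAEL).

```python
from typing import Optional, Sequence, Tuple


def effect_levels(
    rows: Sequence[Tuple[float, int]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (highest dose without the flag, lowest dose with the flag).

    rows holds (dose, flag) pairs for the dosed groups of one effect,
    where flag is treatment_related or critical_effect (0 or 1).
    """
    flagged = sorted(dose for dose, flag in rows if flag == 1)
    lowest_with = flagged[0] if flagged else None
    # Only unflagged doses below the lowest flagged dose count (assumption).
    below = [
        dose
        for dose, flag in rows
        if flag == 0 and (lowest_with is None or dose < lowest_with)
    ]
    highest_without = max(below) if below else None
    return highest_without, lowest_with


# Hypothetical dose groups (mg/kg/day) for a single effect.
nel, lel = effect_levels([(10, 0), (50, 0), (250, 1), (1000, 1)])
noael, loael = effect_levels([(10, 0), (50, 0), (250, 0), (1000, 1)])
```

If the flag is set at the lowest dose tested, the first element is None because no tested dose lies below the effect level.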

effect

effect_id

PK: Autoincremented unique identifier for an
effect

endpoint_id

FK: A unique numeric identifier for each endpoint
in the database.

effect_desc

More specific description for an effect than
endpoint_category, usually detailing a specific
condition associated with an endpoint_target
(e.g. dysplasia, atrophy, necrosis, etc.).

effect_profile

effect_profile_id

PK: Autoincremented unique identifier for an
effect profile

effect_profile_description

Description of the effect profile

effect_profile_name

Name of the effect profile

effect_profile_group

effect_profile_id

FK: A unique numeric identifier for each effect
profile in the database.

group_id

Unique identifier for a group

group_description

The description of a group

group_name

The name of a group

endpoint

endpoint_id

PK: Autoincremented unique identifier for an
endpoint

endpoint_category

The broadest descriptive term for an endpoint.
Possible endpoint categories include: systemic,
developmental, reproductive, and cholinesterase.

endpoint_target

Describes more specific information than
endpoint_type, indicating where/how the sample
was collected to supply data for a particular
endpoint. Typically describes an organ/tissue or
metabolite/protein measured.

endpoint_type

The subcategory for endpoint_category, which is
more descriptive for a particular endpoint (e.g.
pathology gross, clinical chemistry, reproductive
performance, etc.)
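
The endpoint and effect tables form a hierarchy that runs from endpoint_category through endpoint_type and endpoint_target down to effect_desc. The sketch below shows one way to flatten that hierarchy with a join. It uses an in-memory SQLite database as a stand-in for the ToxRefDB database, and the inserted rows are hypothetical examples rather than records from ToxRefDB.

```python
import sqlite3

# In-memory stand-in; only the columns described above are created.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE endpoint (
        endpoint_id INTEGER PRIMARY KEY,
        endpoint_category TEXT,
        endpoint_type TEXT,
        endpoint_target TEXT
    );
    CREATE TABLE effect (
        effect_id INTEGER PRIMARY KEY,
        endpoint_id INTEGER REFERENCES endpoint (endpoint_id),
        effect_desc TEXT
    );
    -- Hypothetical rows for illustration only.
    INSERT INTO endpoint VALUES (1, 'systemic', 'pathology gross', 'liver');
    INSERT INTO effect VALUES (1, 1, 'necrosis'), (2, 1, 'atrophy');
    """
)

# Broadest term first, most specific term last.
hierarchy = conn.execute(
    """
    SELECT e.endpoint_category, e.endpoint_type, e.endpoint_target,
           f.effect_desc
    FROM effect AS f
    JOIN endpoint AS e ON e.endpoint_id = f.endpoint_id
    ORDER BY f.effect_id
    """
).fetchall()
```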

guideline

guideline_id

PK: Autoincremented unique identifier for a
guideline

description

Information pertinent to a study guideline. For
example, MGR studies conducted post-1998
required the testing of developmental landmarks,
which is notable for observation status.

guideline_number

Number associated with the particular guideline
that a study adheres to or most closely adheres

-------

to. OPPTS/OCSPP guideline numbers are
differentiated by the distinct number following
870, as dictated by the Office of Chemical Safety
and Pollution Prevention (OCSPP).

name

Name of the particular Office of Chemical Safety
and Pollution Prevention (OCSPP) guideline that
a study adheres to or most closely adheres to.

profile_name

Abbreviated name of the particular Office of
Chemical Safety and Pollution Prevention
(OCSPP) guideline that a study adheres to or
most closely adheres to. See abbreviations
section for profile name list.

guideline_profile

guideline_profile_id

PK: Autoincremented unique identifier for a
guideline profile

endpoint_id

FK: A unique numeric identifier for each endpoint
in the database.

guideline_id

FK: A unique numeric identifier for each guideline
in the database.

description

Provides a description of the rationale for an
endpoint observation status.

obs_status

Indicates whether or not an endpoint is required
to be tested according to the particular guideline
a study adheres to. The observation status for an
endpoint can be required, not required, or
triggered.

obs

status

The status regarding whether or not an endpoint
was tested and reported in a study. Assumes that
an endpoint was tested if the guideline the study
adheres to requires that endpoint to be tested.

default

An endpoint is considered tested and reported if
the endpoint appears in the text of the study
source indicating that data was collected. If an
endpoint is required to be tested by the guideline,
tested and reported are the defaults.

tested_status

Indicates if an endpoint was tested (1) or not
tested (0). If an endpoint was tested, it was
examined or measured.

reported_status

Indicates if an endpoint was reported (1) or not
reported (0). If an endpoint was reported, it
appears somewhere in the text of the report.

ontology

ontology_id

PK: Autoincremented unique identifier for an
ontology class

description

The associated description for the identifier

label

The associated label for the identifier

uid

Unique identifier from respective terminology
resource

uid_type

Type of identifier

uri

Uniform resource identifier

ontology_toxrefdb

ontology_toxrefdb_id

PK: Autoincremented unique identifier for an
ontology class associated with a concept in
ToxRefDB

-------

ontology_id

FK: A unique numeric identifier for each ontology
class in the database.

toxrefdb_table

The associated table in ToxRefDB

toxrefdb_field

The associated field from toxrefdb_table linked to
a term

toxrefdb_id

Primary key from associated toxrefdb_table

pod

pod_id

PK: Autoincremented unique identifier for a point
of departure or associated effect level

chemical_id

FK: A unique numeric identifier for each chemical
in the database.

effect_profile_id

FK: A unique numeric identifier for each effect
profile in the database.

group_id

FK: A unique numeric identifier for each effect
profile group in the database.

study_id

FK: A unique numeric identifier for each study in
the database.

dose_level

Dose level at which the POD was seen

max_dose_level

Maximum dose level tested with relation to where
the POD was captured

mg_kg_day_value

Converted mg/kg/day value

qualifier

A
ii

ii
ii

pod_type

LEL, NEL, LOAEL, or NOAEL

pod_value

Value of the POD or associated effect level

pod_unit

Corresponding unit of the POD or associated
effect level

pod_tg_effect

pod_tg_effect_id

PK: Autoincremented unique identifier for a POD
associated with a treatment group effect

pod_id

FK: A unique numeric identifier for each POD or
associated effect level in the database.

tg_effect_id

FK: A unique numeric identifier for each
treatment group effect in the database.
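
As an illustration of how the pod table might be queried, the sketch below finds the lowest LOAEL per chemical. It runs against an in-memory SQLite stand-in with hypothetical rows, creates only a subset of the pod columns, and assumes that values are only comparable when pod_unit matches, so it filters on a single unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pod (
        pod_id INTEGER PRIMARY KEY,
        chemical_id INTEGER,
        study_id INTEGER,
        pod_type TEXT,
        pod_value REAL,
        pod_unit TEXT
    );
    -- Hypothetical rows for illustration only.
    INSERT INTO pod VALUES
        (1, 1, 11, 'NOAEL', 10.0, 'mg/kg/day'),
        (2, 1, 11, 'LOAEL', 50.0, 'mg/kg/day'),
        (3, 1, 12, 'LOAEL', 25.0, 'mg/kg/day'),
        (4, 2, 13, 'LEL', 5.0, 'mg/kg/day');
    """
)

# Lowest LOAEL per chemical, restricted to one unit so values compare.
lowest_loael = conn.execute(
    """
    SELECT chemical_id, MIN(pod_value)
    FROM pod
    WHERE pod_type = ? AND pod_unit = ?
    GROUP BY chemical_id
    ORDER BY chemical_id
    """,
    ("LOAEL", "mg/kg/day"),
).fetchall()
```

Chemical 2 does not appear in the result because its only row is an LEL. The qualifier column is left out of this sketch, but it should be taken into account before treating a pod_value as an exact value.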

study

study_id

PK: Autoincremented unique identifier for a study

chemical_id

FK: A unique numeric identifier for each chemical
in the database.

guideline_id

FK: A unique numeric identifier for each guideline
in the database.

admin_method

Describes specifically how the chemicals were
administered via the route (e.g., capsule, diet,
gavage, topical, etc.)

dose_end

Time during an animal's life that the
administration of a test substance stopped.

dose_end_unit

Unit of time associated with the end of the dose
(dose_end).

dose_start

Time during an animal's life that the
administration of a test substance began.

dose_start_unit

Unit of time associated with the start of the dose
(dose_start).

species

Species of the animal test subject used in a
study.

strain

Intraspecific description of group of animals used
in a study; generally, a stock of animals that

-------

share a uniform morphological or physiological
character, or group that is genetically uniform.

strain_group

Descriptive category for a group of test animals
that is more general than the strain.

study_comment

Pertinent information the curator deemed helpful
to be noted about the study in general, such as
poor document quality (e.g., poor scan), missing
pages, etc.

study_type

Classification to describe animal toxicity testing
that was conducted.
ACU (acute): Dose period is typically a day or
less. Excludes developmental and neurological
studies.
SAC (subacute): Dose period is typically 21-28
days. Excludes developmental and neurological
studies.
SUB (subchronic): Dose period is typically 13
weeks, but may be as long as 6 months. Excludes
developmental and neurological studies.
CHR (chronic): Dose period is typically 12, 18, or
24 months (generally any dosing lasting a year or
longer). Excludes developmental and neurological
studies.
DEV (developmental): Gestational (in utero) dose
period. Sacrificed prior to delivery.
MGR (multigenerational reproductive): Dose period
begins in adolescent F0 males and females and
continues until terminal generation. At least
some of the litters deliver their pups; some may
be sacrificed prior to delivery.
NEU (neurological): Study contains functional
observation battery or other battery of behavioral
testing that occurs during or after dosing.
Pathology has specific interest in the brain (e.g.
regions, morphology, biochemistry). Excludes
developmental studies.
DNT (developmental neurotoxicity): Dose period
occurs anytime during development (i.e. in utero,
lactational, adolescent [after weaning, before
adulthood]). Study contains functional observation
battery or other battery of behavioral testing that
occurs during or after dosing, typically during
adulthood. Pathology has specific interest in the
brain (e.g. regions, morphology, biochemistry).

study_type_guideline

Description that combines the study_type and
guideline name for a study.

substance_comment

Pertinent information regarding a substance's
origin (generally the manufacturer/importer that
produced the substance), purity, or other notable
information about the substance in general.

substance_lot_batch

Identifier specific to the origin of a batch of the
test substance used in a study.

substance_purity

Percentage of the administered solution that is
composed of the chemical to be tested after
dilution.

-------

substance_source_name

Name of the supplier that provided the chemical
substance for testing during the study.

tg

tg_id

PK: Autoincremented unique identifier for a
treatment group

study_id

FK: A unique numeric identifier for each study in
the database.

dose_duration

Amount of time a group is dosed. This varies
within studies depending on the dose period of a
particular treatment group.

dose_duration_unit

Unit of time associated with the dose duration.
Typically in days or months.

dose_period

Time point that best characterizes when the
treatment group was evaluated for effects.

Interim: Group sacrificed and examined within the
dosing period. Terminal: Group sacrificed and
examined at study completion and after the
dosing period. These animals are not mated.
Recovery: Group examined after a recovery
period that followed the dosing period at the
study end. Post first mating: Group examined
after first mating. Post second mating: Group
examined after second mating. Post third mating:
Group examined after third mating. Satellite:
Group of animals included in the design and
conduct of a toxicity study, treated, and housed
under conditions identical to those of the main
study animals, but used primarily for some
separate purpose to be defined as needed in the
Comment section. Other: Group of animals that
may have deviated from the full study design, to
be defined as needed in the Comment section.

generation

Generation of the test animal group. F0 is the
default choice for animals exposed in
non-reproductive studies (chronic CHR, subchronic
SUB, subacute SAC), dams in reproductive DEV
studies, and the first-generation mating group for
multigenerational MGR studies. F1 is the second
generation, born to F0. F2 is the third generation,
born to F1. F3 is the fourth generation, born to
F2. The fetal generation is the group produced by
F0 matings in DEV studies, typically removed
from a female via cesarean section. Pups from
live births are not fetal.

sex

Sex of a test animal group. The sex of fetal
groups is denoted as MF for both males and
females.

tg_comment

NULL if no additional comment needed; contains
information that the extractor/curator found
helpful in describing issues related to a
treatment-group (e.g. animals dosed via capsule
so concentration not reported, added recovery
groups, etc.).

-------
tg_effect

tg_effect_id

PK: Autoincremented unique identifier for a
treatment group effect

effect_id

FK: A unique numeric identifier for each effect in
the database.

tg_id

FK: A unique numeric identifier for each
treatment group in the database.

direction

Description of the net change across all doses
that indicates whether the numerical data
increased, decreased, or stayed the same. This
can also be used to describe effects that did not
have numerical data, but were still described in
the study source.

effect_comment

NULL if no additional comment needed; contains
information that the extractor/curator found
helpful in describing issues related to a
treatment-group-effect (e.g. units not reported,
effect only reported for certain treatment groups,
etc.).

effect_desc_free

Brief verbatim text from study file that was
entered if the effect description differed from
predetermined endpoint terminology.

life_stage

Life stage at which a measurement was taken.
CHR, SUB, and SAC studies typically only have
adult for life_stage, whereas DEV and MGR
studies will always be characterized by multiple
life stages. The different life stages in the
database include: fetal, juvenile, adult, adult-
pregnancy and pregnancy.

target_site

A more specific description than endpoint_target.
Can describe a specific tissue within an organ,
type of cell, etc.
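
The keys described above link study to tg, tg to tg_effect, and tg_effect to dtg_effect, with dtg_effect also pointing back to dtg. The sketch below walks that chain to list the dose-level results for one study. It is an illustration against an in-memory SQLite stand-in: the rows are hypothetical, only a few columns per table are created, and the tg_id column on dtg is assumed rather than taken from the field list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE study (study_id INTEGER PRIMARY KEY, study_type TEXT,
                        species TEXT);
    CREATE TABLE tg (tg_id INTEGER PRIMARY KEY, study_id INTEGER,
                     sex TEXT, generation TEXT);
    -- tg_id on dtg is an assumption; it is not used in the join below.
    CREATE TABLE dtg (dtg_id INTEGER PRIMARY KEY, tg_id INTEGER,
                      dose_adjusted REAL, dose_adjusted_unit TEXT);
    CREATE TABLE tg_effect (tg_effect_id INTEGER PRIMARY KEY,
                            effect_id INTEGER, tg_id INTEGER,
                            life_stage TEXT);
    CREATE TABLE dtg_effect (dtg_effect_id INTEGER PRIMARY KEY,
                             dtg_id INTEGER, tg_effect_id INTEGER,
                             treatment_related INTEGER,
                             critical_effect INTEGER);
    -- Hypothetical rows for illustration only.
    INSERT INTO study VALUES (1, 'SUB', 'rat');
    INSERT INTO tg VALUES (10, 1, 'M', 'F0');
    INSERT INTO dtg VALUES (100, 10, 5.0, 'mg/kg/day'),
                           (101, 10, 50.0, 'mg/kg/day');
    INSERT INTO tg_effect VALUES (1000, 7, 10, 'adult');
    INSERT INTO dtg_effect VALUES (1, 100, 1000, 0, 0),
                                  (2, 101, 1000, 1, 1);
    """
)

# One row per dosed treatment group and effect, lowest dose first.
query = """
    SELECT s.study_id, t.sex, d.dose_adjusted, te.effect_id,
           de.treatment_related, de.critical_effect
    FROM dtg_effect AS de
    JOIN dtg AS d ON d.dtg_id = de.dtg_id
    JOIN tg_effect AS te ON te.tg_effect_id = de.tg_effect_id
    JOIN tg AS t ON t.tg_id = te.tg_id
    JOIN study AS s ON s.study_id = t.study_id
    WHERE s.study_id = ?
    ORDER BY d.dose_adjusted
"""
dose_rows = conn.execute(query, (1,)).fetchall()
```

Joining effect and endpoint on te.effect_id would add the descriptive terms for each effect.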

toxrtool

toxrtool_id

PK: Autoincremented unique identifier for a
toxrtool question

criteria

The ToxRTool comprises a list of evaluation
criteria to assess study reliability that are
subdivided into five groups: test substance
identification, test system characterization, study
design description, study results documentation,
and plausibility of study design and data.

question

Question used as part of the ToxRTool
evaluation criteria to assess study reliability.

question_number

Number indicating the question as part of the
ToxRTool evaluation criteria to assess study
reliability.

study_toxrtool

study_toxrtool_id

PK: Autoincremented unique identifier for a
ToxRTool question associated with a study

toxrtool_id

FK: A unique numeric identifier for each
ToxRTool question in the database.

study_id

FK: A unique numeric identifier for each study in
the database.

score

The associated score for the ToxRTool question

toxrtool_comment

The corresponding comment further describing
the score
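
The study_toxrtool table holds one row per ToxRTool question and study, so a per-study summary can be obtained by aggregation. The sketch below sums the scores per study in an in-memory SQLite stand-in. The rows are hypothetical and the scores are assumed to be numeric. The sum is shown only as an example of aggregation and is not presented as the official ToxRTool reliability categorization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE study_toxrtool (
        study_toxrtool_id INTEGER PRIMARY KEY,
        toxrtool_id INTEGER,
        study_id INTEGER,
        score INTEGER,
        toxrtool_comment TEXT
    );
    -- Hypothetical rows for illustration only.
    INSERT INTO study_toxrtool VALUES
        (1, 1, 11, 1, NULL),
        (2, 2, 11, 0, 'purity not reported'),
        (3, 1, 12, 1, NULL),
        (4, 2, 12, 1, NULL);
    """
)

# Total score and number of answered questions per study.
score_totals = conn.execute(
    """
    SELECT study_id, SUM(score), COUNT(*)
    FROM study_toxrtool
    GROUP BY study_id
    ORDER BY study_id
    """
).fetchall()
```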

-------

-------

United States
Environmental Protection
Agency


Office of Research and Development (8101R)
Washington, DC 20460


-------