Generalized Read-Across (GenRA) Virtual Training Chat Questions and Answers


Below are responses to questions asked during the Generalized Read-Across (GenRA) Virtual Training
hosted by the U.S. Environmental Protection Agency's Center for Computational Toxicology and
Exposure (U.S. EPA CCTE) on May 23, 2023, presented by EPA's Dr. Grace Patlewicz, Dr. Esra Mutlu, and
Dr. Imran Shah. Attendees submitted questions throughout the presentation. Though many questions
were answered verbally during the presentation and in the Q&A box, there were some questions we
were not able to answer during the training period. All remaining questions within the scope of the
GenRA training are provided here.

For more information on GenRA, visit the GenRA Resource Hub.

Contents

Chat Questions and Answers

TRAINING RELATED QUESTIONS

ABOUT GENRA

DATA SOURCES

FINGERPRINTS

SUBSTANCE TYPES

SIMILARITY CONTEXT

ANALOGS

DATA OUTPUTS

PREDICTIONS

APPLYING GENRA

OTHER

Appendix A: List of Acronyms

TRAINING RELATED QUESTIONS

Question 1: Will the slides be available with the recording?

Question 1A: Can we receive the recorded presentation later?

Question 1B: Will you be giving us the answers to these questions in writing?

EPA Response: We will contact registrants when the training materials are ready. The slides,
recording, and breakout activity (with and without answers) will be available on the NAMs
training web site: https://www.epa.gov/chemical-research/new-approach-methods-nams-training

Question 2: Do we get a certificate for this?

EPA Response: We will shortly share the survey that will allow you to get your certificate for this
training: https://epa.gov1.qualtrics.com/jfe/form/SV_5AMRHbXKdyCDbZI
The survey will be open for a couple of weeks following the training.

Question 3: I registered only for the beginner session. Is it possible to get the worksheet for the
intermediate/advanced session as well? It would be helpful to look at those exercises after completing the
beginner exercises.

EPA Response: We have only one worksheet, with the same questions for all breakout rooms
and all sessions. We simply matched attendees who wanted more advanced
guidance with our most experienced trainers.

ABOUT GENRA

Question 4: Is there a user manual? What are ATG, BSK, and NVS?

EPA Response: Hi there. Yes, there is a user manual:
https://www.epa.gov/chemical-research/generalized-read-across-genra-manual
The most recent functionality is best described in our manuscript:
https://doi.org/10.1016/j.comtox.2022.100258
ATG, BSK, and NVS refer to three of the ToxCast platforms: Attagene, BioSeek, and NovaScreen.
More information on the Assay Platform Sources can be found here:
https://www.epa.gov/chemical-research/generating-toxcast-data-toxcast-assays

Question 5: Many thanks for your quick response. I went through the user manual very quickly, but the
acronyms are not very specific. Is there a publication or reference I can look into for the acronyms?

EPA Response: We are working to update the manual to capture missing acronyms. Some of the
common ones have been captured in the information icons that pop up under each panel in the
application itself.

Question 6: Is it possible to run GenRA for proprietary data?

EPA Response: We don't recommend running GenRA on proprietary compounds. Please reach
out to us if you wish to make use of a Docker image to instantiate GenRA behind your own
firewall. Alternatively, the genra-py package can be run on users' own datasets. See
https://academic.oup.com/bioinformatics/article/37/19/3380/6194561 for more details.

Question 7: Would it be possible to run molecules as a batch (e.g., running multiple molecules as an SD file
or in .csv file format)?

EPA Response: Please use the genra-py package for batch analysis:
https://academic.oup.com/bioinformatics/article/37/19/3380/6194561

Question 8: I have a really basic question, read-across seems quite computational and if there's a
potency function - what's the difference between read across and QSAR? Is there a context where one is
better than the other?

EPA Response: Read-across and QSAR are part of the same continuum of relating some aspects
of a chemical to an activity response. The real difference is that read-across tends to draw on
a much smaller pool of substances as part of an analogue or category approach.

Question 9: How, and with what chemicals, was GenRA validated?

EPA Response: This is described in more detail in our initial publication and the subsequent
analyses that have followed. Our most recent publication,
https://doi.org/10.1016/j.comtox.2022.100258, provides a roadmap of how GenRA has
evolved, with the relevant citations to all our previous studies.

Question 10: Does the website sit within a company's firewall? Can EPA see what structures are
entered?

EPA Response: We don't advise entering confidential information into GenRA. GenRA could
potentially be provided as a Docker image. Please reach out to the GenRA team via the web site
to discuss further.

DATA SOURCES

Question 11: In what database does it look for analogs? CompTox?

EPA Response: Yes: DSSTox, which underpins the EPA CompTox Chemicals Dashboard.

Question 12: Is the analogue selection by GenRA restricted to the substances present in the
CompTox/ToxRef DB?

EPA Response: Yes, the analogues are limited to chemicals in the CompTox Dashboard/DSSTox
database. However, the target can be any structure (one can input a new SMILES string or draw a
new chemical using the "Ketcher button").

Question 13: The Dashboard now includes genotoxicity data, which I welcome. Can GenRA be used
for read-across of genotoxicity endpoints?

EPA Response: Thanks for the question. It is in the works.

Question 14: Are the physchem data for the analogs predicted data, experimental data, or a mix of
both?

EPA Response: The physchem data are predictions obtained from the OPERA tool:
https://github.com/kmansouri/OPERA

Question 15: What technology is used for generating interactive graphs?

EPA Response: The graphs use the free force-graph tool: https://github.com/vasturiano/force-graph

Question 16: Does ToxRef contain in vivo ecotoxicity endpoints?

EPA Response: No, ToxRefDB only contains human-health related endpoints. We are planning to
include EcoTox endpoints in the future.

Question 17: Can the data matrix provide information on the effects of these molecules on the environment,
or only on human health?

EPA Response: Currently GenRA makes predictions for in vitro assay outcomes and in vivo
toxicity endpoints. We have not explored ecotoxicity predictions as yet.

Question 18: Regarding the p-chem properties, it looks as though the properties are plotted relative to
one another, but it would also be helpful if the chemicals were plotted relative to the target.

EPA Response: The PhysChem properties are obtained from OPERA:

https://github.com/kmansouri/OPERA

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0263-1

Question 19: ToxRefDB contains industry studies, correct? So, it will miss academic studies in peer-
reviewed literature?

EPA Response: More information on ToxRefDB can be found in the following article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6944327/

Question 20: Are you increasing the tox data in your database from other global databases?

EPA Response: Yes, our next priority is to include data from the Toxicity Values database which is
broader in coverage than ToxRefDB.

Question 21: Will you have a plan to incorporate more biological databases other than ToxCast and
ToxRefDB?

EPA Response: Great question. Please contact us through the website to let us know which specific
biological databases you are referring to. On the in vivo toxicity side, we are currently working
to incorporate aggregate data from the Toxicity Values database (ToxValDB) that exists in the
Dashboard.

Question 22: Are there references available for the physchem data? Is there a way to link to those? Where
do the values come from?

EPA Response: The physchem data are based on OPERA predictions. The software and methods
are described in the following paper:

Mansouri, Kamel, Chris M. Grulke, Richard S. Judson, and Antony J. Williams. 2018. "OPERA
Models for Predicting Physicochemical Properties and Environmental Fate Endpoints." Journal of
Cheminformatics 10 (1): 1-19. https://doi.org/10.1186/s13321-018-0263-1.

Question 23: What is the difference between ToxRef and ToxCast?

EPA Response: ToxRefDB covers in vivo studies, whereas the ToxCast data filter is for in vitro studies.

Question 24: Can user data be incorporated?

EPA Response: Not currently, but this is something we are exploring for a future release.

Question 25: Is ToxRef for in vivo data?

EPA Response: Yes, that's correct. Filtering analogues on ToxRef data shows only those
analogues that have in vivo data within the ToxRef database.

Question 26: Are you working with FDA on this database? I know they are not big fans of this approach.

EPA Response: No, though we have demonstrated the tool to colleagues at CFSAN.

Question 27: Are you able to see the studies associated with ToxRef data?

EPA Response: Not within the application. In a future version, we hope to provide a means of
linking back to toxicity data presented elsewhere on the Dashboard.

Question 28: Brilliant presentation, thank you! Usually in the "panel 4" we see in red some PoD related
to a certain endpoint/effect. My question is: Is there a way to retrieve the reference/study for this data
of the PoD?

EPA Response: Thank you, and thanks for your question. We hope to provide the
source/study information from ToxRefDB as part of the "download" from Panel 4, Data Matrix
view.

Question 29: Any thoughts about incorporating artificial intelligence into making comparisons?

EPA Response: Yes, we plan to incorporate additional AI/ML-based approaches in GenRA. In fact,
the similarity-weighted activity algorithm used by GenRA is based on k-nearest neighbour (kNN)
prediction, which is one of the simplest ML approaches.
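
The k-nearest-neighbour idea described in this answer can be sketched in a few lines of Python. This is a minimal illustration, not GenRA's actual implementation: the function names (`jaccard`, `knn_read_across`), the toy fingerprints, the binary activity labels, and the choice of k = 3 are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two sets of fingerprint features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_read_across(target_fp, analogues, k=3):
    """Similarity-weighted activity over the k most similar analogues.

    analogues: list of (name, fingerprint_set, activity), activity being 0 or 1.
    Returns a score between 0 and 1, or None if no analogue shares any feature.
    """
    scored = [(jaccard(target_fp, fp), activity) for _, fp, activity in analogues]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in nearest)
    if total == 0:
        return None
    return sum(sim * act for sim, act in nearest) / total


# Hypothetical toy data: feature labels and activities are invented.
target = {"f1", "f2", "f3", "f4"}
analogues = [
    ("A", {"f1", "f2", "f3"}, 1),
    ("B", {"f1", "f2", "f5", "f6"}, 0),
    ("C", {"f3", "f4"}, 1),
    ("D", {"f7"}, 0),
]
score = knn_read_across(target, analogues, k=3)
print(round(score, 3))
```

A score near 1 means the most similar analogues are mostly active; a score near 0 means they are mostly inactive. Analogue D shares no features with the target and falls outside the top three, so it does not contribute.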

FINGERPRINTS

Question 30: This is great work. I wonder if there are plans to 'marry' it with the OECD Toolbox to help
inform custom fingerprints.

EPA Response: It would be helpful to understand what specifically you have in mind. We are happy
to have a separate conversation about this.

Question 31: 1. Step 2: How do I select the fingerprints? Which one should be used when? Filter by ToxRef vs
ToxCast? 2. Step 3: The "Group by" selection? When should the drop-down be used? 3. Step 4: GenraPy vs GenraPred?
4. The most confusing step is once I download the Excel template with a lot of data. I'm lost with 100+
endpoints; how can I derive a POD/NOAEL from this step?

EPA Response: Selection of fingerprints is up to the end user and requires some investigation. If
the end user is interested in making binary predictions of in vivo toxicity, then filtering
by ToxRef, which is the default setting, is the way to go. This will ensure that the source
analogues returned are associated with in vivo data. In Panel 4, the recommendation is to use
GenraPred, as this will ensure that some confidence is calculated for the predictions
generated.

If ToxCast assay hitcall predictions are desired, then the end user needs to filter by ToxCast in
Panel 1. This will ensure that source analogues with binary hitcalls from ToxCast data are
returned in Panel 1. These will be used to make read-across predictions in Panel 4. Again, the
GenraPred engine is recommended.

The only exception to the use of GenraPred for binary predictions of in vivo or in vitro outcomes is when
hybrid fingerprints are used in Panel 1. In that case, Panel 4 will default to showing only genra-py as the
prediction engine.

If POD predictions from in vivo toxicity data are desired, then filter by ToxRef in Panel 1 and
switch the "Group by" option in Panel 3 to Tox Dosage Fingerprint. When "generate data matrix" is pressed,
a potency-based data matrix is presented in Panel 4. In this case, genra-py is the default
prediction engine. From here the end user can filter the matrix to predict only specific
study-toxicity effect PODs. Alternatively, the end user can predict across all study-toxicity effects and
then sort by value to identify the most conservative predictions.
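
Once predictions have been downloaded, the final sorting step can be done in a few lines. The sketch below assumes that "most conservative" means the lowest predicted POD; the study-effect labels, the values, the units, and the helper name `most_conservative` are invented for illustration and do not reflect the layout of the actual GenRA download.

```python
# Hypothetical predicted PODs (assumed units: mg/kg/day), keyed by (study type, effect).
predictions = {
    ("chronic", "liver"): 25.0,
    ("subchronic", "kidney"): 8.0,
    ("developmental", "body weight"): 40.0,
}


def most_conservative(preds, n=1):
    """Return the n lowest predicted PODs, lowest (most conservative) first."""
    return sorted(preds.items(), key=lambda item: item[1])[:n]


for (study, effect), pod in most_conservative(predictions, n=2):
    print(f"{study} / {effect}: {pod}")
```

Filtering the dictionary to the study-effect pairs of interest before sorting corresponds to the first option described in this answer.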

Question 32: I didn't get Bisphenol A as the first analogue with any of the fingerprinting methods. Which
approach was used in the demo?

EPA Response: AIM fingerprints.

Question 33: Which fingerprint is the Jaccard similarity metric integrated with?

EPA Response: The Jaccard similarity is used with all the fingerprints.

Question 34: Which set of fingerprints most correlates with metabolism and toxicity?

EPA Response: Metabolism considerations are not currently implemented in GenRA, but this is a
functionality we are working to incorporate. Assessments of the toxicity of the analogues on which
you base your read-across calculation are made via the comparisons in Panels 3 and 4, and
depend on whether you wish to assess that with in vivo data (ToxRef) or in vitro data (ToxCast).

Question 35: How did you know the total # of features?

EPA Response: Not sure what this question refers to - the total number of possible features in a
given fingerprint representation? We will endeavor to document this in the manual.

Question 36: Thank you. Is there a recommendation on which of the approaches for the similarity analysis is
better than the others (for example, Chem: Torsion fingerprints or AIM, etc.)?

EPA Response: No, some interactivity and judgment is required to make decisions on the best
choice of fingerprint based on the overall quality of the source analogs that are produced.

Question 37: In what cases will we need to use Morgan fingerprints? For standard risk assessment, what
parameters do you recommend we select?

EPA Response: Read-across is an interactive process, so it is difficult to recommend hard-and-fast
rules that are generalizable across all use cases. Choice of fingerprint will largely
depend on the kind of similarity that matters most for your use case. For example, if it is critical
that you find analogues that are structurally similar to your target, then using structure-based
fingerprints (e.g., Morgan, Torsion, AIM) for the Jaccard similarity calculation that populates the
radial plot in Panel 1 will be important. If measured in vivo data for specific
endpoints matter to you, then making sure to filter by ToxRef in Panel 1 is critical, and you
may not mind if your analogues are less similar in structure as long as they have a lot of
available in vivo potency data across the analogue set for your endpoints of interest. It's likely
that physicochemical parameters will also play a role in determining and filtering the best
analogues for your use case. So, the process can be an iterative one where several sets of
parameters are tested before the best compromise on a set of available analogues is settled on.

Question 38: Could you explain more about the color coding, and how does one determine the data
quality? Thank you!

EPA Response: We will answer this in more detail later; however, the color coding doesn't reflect
data quality. Each fingerprint is a binary bit vector reflecting the presence/absence of features (e.g.,
ToxPrints comprise 729 features, whereas Morgan fingerprints comprise 2048 features). The
color density is scaled by fingerprint type from light to dark and reflects a measure of 'data
availability'. The number of data records is reflected in each cell.

Question 39: Is EPA thinking of developing AIM further from its beta version? It's another excellent tool.

EPA Response: Great question. Do you have specific suggestions on what you would like to see
here?

Question 40: Step 2: Is there guidance on when to use which fingerprints/hybrid and the
filter by ToxRef vs ToxCast vs all?

EPA Response: We have systematically evaluated the utility of different fingerprints for specific
chemical clusters for inferring hazards. We hope to share this information with users in a
manner that can suitably guide them on their usage.

Question 41: What are the differences/limitations/advantages of Morgan fingerprints vs Torsion vs
ToxPrints vs AIM vs hybrid? Similarly

EPA Response: Thank you for your question. Each of these fingerprints consists of a bit vector
containing presence/absence of different structural moieties (such as in Morgan) versus
structure and bond angle vectors (as in Torsion) versus other fingerprints. Some of these
fingerprints are based on structural data, others are based on presence or absence of assay data
such as the biology-based fingerprints. It might be worth combining fingerprints in a hybrid
format depending on exactly what you would like to base your similarity metric calculation on. If
structure is your main consideration to initially generate source analogs, Morgan, AIM, Torsion,
etc. would be best.

Question 42: What's the difference between Morgan Fingerprints, Torsion Fingerprints, Toxprints and
AIM?

EPA Response: You can hover your mouse over the options in Panel 1 to get a brief summary
description of each fingerprint and its descriptors.

Question 43: Please, can you explain again the Morgan fingerprint basis?

EPA Response: It is a presence/absence bit vector of different common structural moieties found
within organic compounds.

Question 44: What do the torsion and Morgan fingerprints represent?

EPA Response: Morgan represents presence/absence of different structural moieties, whereas
Torsion captures structure and bond angle vectors.

Question 45: Can the AIM fingerprints be explained in more detail? How does this differ from the
results generated in the AIM program?

EPA Response: The AIM fingerprints are as faithful a reimplementation as possible of the fragments
that exist in the AIM standalone tool. See https://doi.org/10.1016/j.comtox.2022.100256 for a
more detailed description of the work conducted to create these fingerprint representations.
The results generated by the AIM program are likely to be different since the AIM tool relies on
an internal AIM database of analogues tagged by data availability whereas GenRA relies on the
CompTox Chemicals Dashboard database to identify analogues. Although the vast majority of
the AIM's database of analogues overlaps with the Dashboard chemicals, our filter by in vivo
data relies on ToxRefDB which is likely to be more limiting than a more general tag for data
availability. We are working on extending the coverage of toxicity databases.

Question 46: Are there descriptions of the options that can be used to sort the neighbors? e.g., what's a
torsion fingerprint versus a Morgan fingerprint?

EPA Response: Thanks for your question. You can hover your mouse over each fingerprint option
in Panel 1 to get a brief summary description of the basis set of descriptors that each fingerprint
consists of. Morgan fingerprints are based solely on the presence/absence of structural
moieties. Torsion fingerprints contain information on the bond torsion angles as well as
structural moieties. Each of the fingerprints uses different numbers of descriptors at varying
levels of granularity to define the fingerprint. Hence, some interactivity and judgment are
required to make decisions on the best choice of fingerprint based on the overall quality of the
source analogs that are produced (based on the information Grace has shown in panels 1-4).

Question 47: Is there a reference document for us to understand the different fingerprints?

Question 47A: How do AIM FPs differ from other FPs available?

EPA Response: AIM fingerprints are explained in more detail here:
https://doi.org/10.1016/j.comtox.2022.100256

ToxPrints are described in more detail here: https://pubs.acs.org/doi/full/10.1021/ci500667v

Morgan fingerprints are described here: https://pubs.acs.org/doi/10.1021/ci100050t
Torsion fingerprints are described here: https://pubs.acs.org/doi/10.1021/ci00054a008

Question 48: Does the dataset include ecotoxicology data?

EPA Response: Currently, it only includes human health endpoints from ToxRefDB. We are
planning to incorporate ecotox endpoints in the future.

Question 49: Is there any article that describes about AIM fingerprints?

EPA Response: AIM fingerprint manuscript: https://doi.org/10.1016/j.comtox.2022.100256

SUBSTANCE TYPES

Question 50: Does GenRA include much information / is it useful for metal containing substances
(industrial catalyst type substances and so on)?

EPA Response: Currently, chemical fingerprints capture the organic portion of substances. We
are exploring approaches to develop fingerprints that cover metal-containing substances. The
bioactivity fingerprints, on the other hand, represent substances based on assay results and
may cover metal-containing substances.

Question 51: If the first step is the structure, how does GenRA work for metals?

EPA Response: Metals aren't currently supported.

Question 52: Does GenRA work for polymers?

EPA Response: This is a good question. We do not currently treat polymers using any special
structure notation (for example, to capture monomeric units).

Question 53: Does GenRA allow you to search analogues by substructures? For example, if my target
compound is a nitrosamine and I am only interested in other nitrosamines as potential analogues?

EPA Response: I can see how this would be a very useful feature. Right now, the analogues are
only identified by overall similarity, and we have not implemented a "substructural moiety"
filter on the neighbourhood. Thanks for bringing this up.

Question 54: We found some mixtures and salts (two or three structures) reported as individual
neighbors and some problems of similarity (molecules containing fragments or structures very dissimilar
to the target). Were these datasets curated before selection by fingerprints? Is the search done directly
in ToxCast and ToxRefDB?

EPA Response: Please reach out to us directly and/or share the specific substances so that we
can address this issue. We use QSAR-ready structures from the CompTox Dashboard / DSSTox
to build chemical structure fingerprints, so the analogues may have some limitations when the
substance is a salt.

Question 55: It seems GenRA pulls up similar substances based on structural similarity primarily. What
about the sorting of the structures based on other aspects (e.g., physchem, structural alerts, metabolic
similarity etc?)

EPA Response: We are working on multiple contexts of similarity. Since most chemicals have
structure data, we started with chemical fingerprints. We have investigated bioactivity, phys-chem
properties, gene expression/transcriptomics, and phenotypic profiling, and these are being
introduced in GenRA. We are also actively researching how to incorporate metabolic similarity
in identifying analogues.

Question 56: By discrete organic chemicals in GenRA, does this mean that stereoisomers, for example, are more
difficult? I would just like a little clarification on "discrete" in this sense. Thank you!

EPA Response: Unfortunately, the similarity search does not consider stereo information in the
chemical structures. Let us know if this answers your question.

Question 57: Does EPA or others have visibility of what structures users enter? I'm thinking about
whether confidential structures can be analysed using the tool?

Question 58: Thanks, Imran. We tried structures ranging from simple and small (i.e., 3-aminophenol) to
larger and more complex (a fluconazole-related compound). If you have mixtures and salts, the
similarity and fingerprint calculations will be problematic, confusing halogens with fragments, and
mixtures containing unrelated fragments. The interface and workflow are very interesting, but I
would suggest adding data curation steps (normalization, canonicalization, exclusion of mixtures) and data
prepared by different fingerprints and descriptors (ECFP, MACCS, Morgan, etc.), endpoint, and others.
The workflow is very logical and science-based; with a simple workflow for data preparation and
structuring, it will probably be a very powerful tool.

EPA Response: Thank you for the feedback! We developed the current GenRA workflow based
on the use-case of conducting read-across for a single chemical. The complex mixture use-case is
certainly very interesting, and we'd be happy to talk about it further.

SIMILARITY CONTEXT

Question 59: Does the GenRA tool "just" use chemical similarity to identify read-across target
substances? Could this element of the tool be explained at a high level?

EPA Response: As you're going to hear from Grace, GenRA is designed to consider multiple
contexts of similarity: chemical structure, bioactivity, and more coming soon.

Question 60: Any idea why the similar compounds feature in the Dashboard gives only a handful of
analogs for chlorofluorocarbons? Seems like Tanimoto similarity doesn't work very well for these
compounds.

EPA Response: It is more likely that the fingerprint representations are not customised for such
substances. We have been developing new PFAS-specific ToxPrints (see
https://pubs.acs.org/doi/10.1021/acs.chemrestox.2c00403), which we are considering adding in a
subsequent version of GenRA.

Question 61: Is 0.39 maximum similarity worth pursuing?

EPA Response: (Answered live) Thanks for your question. The Jaccard similarity based on
structure is only one metric to consider in read-across. Fingerprint choice can impact the
magnitude of that similarity, depending on which structural aspects you wish to base similarity
on. As Grace mentioned, we can also look at physicochemical properties and their similarities as
another means of separating "good" from "bad" analogs, along with the presence or absence of
relevant toxicological endpoint potency data for the read-across metrics of interest between the
target chemical and the available source analogs, as we will see in future slides.

Question 62: How confident can we be when selecting an analogue that has a chemical similarity of only 0.2 or
0.3? I think we should select only those with a similarity score >0.8.

EPA Response: This is a very difficult question to answer in general. The suitable Jaccard index
will vary from one group of chemicals to another. This is why we enable users to explore the
potential analogues using different contexts (chemical, bioactivity) and evaluate the similarity in
toxicity endpoints.

Question 63: What is the ideal Jaccard score for selecting an analogue?

EPA Response: The Jaccard score is only one consideration when selecting an analogue.
The concordance and consistency of the analogue toxicity data, and physicochemical
similarity, are other considerations that should be brought to bear in making a selection.

Question 64: What is the Jaccard similarity metric?

EPA Response: It is the same as the Tanimoto similarity metric.

Question 65: If no filter is applied, how is the Jaccard similarity based on weight of evidence? I saw
that this changed the ranking completely, with a higher similarity score for a different chemical.

EPA Response: The purpose of the filters is to restrict the analogues to those with in vivo toxicity
(ToxRefDB) or in vitro bioactivity (ToxCast) data. Therefore, selecting a filter will change the number of
analogues and their level of similarity to the target.
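
Why a data filter changes the ranking can be seen in a small sketch. The code below is an illustration under stated assumptions, not GenRA's code: the candidate records, the `sources` labels, and the function names `jaccard` and `rank_analogues` are hypothetical. The point is only that restricting the candidate pool before ranking can remove the structurally closest chemical.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two sets of fingerprint features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_analogues(target_fp, candidates, require=None, top_n=10):
    """Rank candidates by similarity, optionally keeping only those with data from `require`."""
    pool = [c for c in candidates if require is None or require in c["sources"]]
    ranked = sorted(pool, key=lambda c: jaccard(target_fp, c["fp"]), reverse=True)
    return [(c["name"], round(jaccard(target_fp, c["fp"]), 2)) for c in ranked[:top_n]]


# Hypothetical candidates: X is the closest structure but has no toxicity data.
target = {"f1", "f2", "f3", "f4"}
candidates = [
    {"name": "X", "fp": {"f1", "f2", "f3", "f4", "f5"}, "sources": set()},
    {"name": "Y", "fp": {"f1", "f2", "f3"}, "sources": {"ToxRef"}},
    {"name": "Z", "fp": {"f1", "f9"}, "sources": {"ToxCast"}},
]

print(rank_analogues(target, candidates))                    # no filter
print(rank_analogues(target, candidates, require="ToxRef"))  # in vivo data only
```

With no filter, X ranks first; with the ToxRef filter, X drops out and Y becomes the top analogue at a lower similarity.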

Question 66: When using a Jaccard similarity metric, is there a cutoff that is ideal?

EPA Response: It is difficult to define a single Jaccard similarity threshold that will be ideal for all
chemicals. When using chemical structure fingerprints, it is important to visually inspect the
analogues and use expert judgement to compare them with the target to determine suitability
for read-across.

Question 67: Is the structural similarity always based on Tanimoto? Can other methods be selected?

EPA Response: We have compared Dice, Euclidean, and a couple of other similarity metrics
without seeing a substantial improvement in performance. If data suggest a particular metric would be
more advantageous, we would be open to considering it.

Question 68: What is a good similarity value?

EPA Response: See earlier answer.

Question 69: When all analogues have low similarity, do we say read across cannot be done?

EPA Response: (Answered live) Thanks for your question. The decision whether to use different
source analogs for a target should not be based solely on the fingerprint similarity metric.
Depending on the toxicological endpoint of interest, one should consider data availability of the
source analogs (Panels 2-4), similarity of physchem properties (Panel 1), as well as similarities
based on structure (Panel 1). Ultimately, it comes down to expert judgment whether analogs
produced by each fingerprint are overall "good" for read-across based on all of these factors
together. This can be an iterative process where different fingerprints and their resultant source
analogs are chosen and compared on the aforementioned criteria, whether individually or as a
hybrid.

Question 70: Could you please review how the similarity score is computed?

EPA Response: In simple terms, the Tanimoto similarity between two chemical fingerprints is
calculated by dividing the number of elements that the fingerprints have in common by the
total number of distinct elements present in either fingerprint.

For example, if chemical 1 has a fingerprint FP1 = {f1, f2, f3, f10, f11, f12, f20} and chemical 2
has a fingerprint FP2 = {f10, f11, f12, f20, f192, f243, f567}, where {f1, f2, f3, ..., f567} are
elements like structural features, then the Tanimoto similarity metric is

\[ \frac{|FP1 \cap FP2|}{|FP1 \cup FP2|} = \frac{4}{10} = 0.4 \]

On a separate note, the Tanimoto similarity and Jaccard index are equivalent when the
fingerprints are represented as binary vectors.
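
The worked example in this answer can be reproduced with Python sets. This is a minimal sketch: the helper names `tanimoto` and `tanimoto_bits` are made up for illustration and are not part of GenRA or genra-py. The second helper repeats the calculation on binary vectors to show that both forms give the same value.

```python
def tanimoto(fp_a, fp_b):
    """Shared features divided by distinct features present in either fingerprint."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def tanimoto_bits(bits_a, bits_b):
    """Same metric computed on equal-length binary (0/1) vectors."""
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    return both / either if either else 0.0


fp1 = {"f1", "f2", "f3", "f10", "f11", "f12", "f20"}
fp2 = {"f10", "f11", "f12", "f20", "f192", "f243", "f567"}

# Encode each fingerprint as a bit vector over the features seen in either one.
features = sorted(fp1 | fp2)
bits1 = [int(f in fp1) for f in features]
bits2 = [int(f in fp2) for f in features]

print(tanimoto(fp1, fp2))          # 4 shared / 10 distinct = 0.4
print(tanimoto_bits(bits1, bits2)) # 0.4
```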

Question 71: What is similarity weight?

EPA Response: GenRA uses the similarity-weighted activity to predict an endpoint for a target
using analogues. The "weight" in this case, is the Tanimoto similarity.

Question 72: What similarity index is considered good enough or more reliable?

EPA Response: The similarity metric is just one consideration in evaluating and selecting
analogues.

Question 73: Is there a typical similarity threshold cut-off?

EPA Response: See earlier answer.

Question 74: In vitro data will most often provide higher similarity scores?

EPA Response: See earlier answer.

Question 75: It is quite interesting to see that, with a data-rich compound like the one used in this demo
session, the best analogue has a chemical similarity of 0.39. I wonder how good the uncertainty score is for this
prediction?

EPA Response: Please remember that this is using neighbors by Morgan fingerprints. Depending on the
end user and the desired outcome, you can always look at alternative fingerprints or custom
fingerprints.

ANALOGS

Question 76: How can we best choose the "Neighbors by"?

EPA Response: Currently, this requires interactive exploration, starting with the default options
(Morgan fingerprints and filtered by ToxRef data).

Question 77: If we have potential analogues from other sources, how can we analyze those
analogues through GenRA?

Question 78: I know GenRA lets you deselect chemicals, but can it allow the assessor to select the
chemical analogues they want?

EPA Response: This is something we are actively working on now: allowing end users to
define their own neighborhoods or select analogues from the network exploration tool.

Question 79: Is there a best practice recommendation for evaluating the choice of "similar" analogues, i.e.,
looking at multiple methods for finding nearest neighbors?

EPA Response: Currently, this requires interactive exploration, starting with the default options
(Morgan fingerprints and filtered by ToxRef data).

We have systematically evaluated the utility of different fingerprints for specific chemical
clusters for inferring hazards. We hope to share this information with users in a manner that can
suitably guide them on their usage.

Question 80: If an identified analogue doesn't fit (not a good analogue despite a high Tanimoto
similarity), can it be excluded?

EPA Response: Yes, they can be excluded. We will show that soon. See the checkmark in Panel 4
next to the pairwise similarity.

Question 81: Is there a way to know the analogues collected are actually good enough for the read-
across? I guess sometimes, a new structure can be so different that there is no good analogue. In this
case, will the system give some "analogues" anyway?

EPA Response: Indeed, this is a value judgement by the end user. GenRA will return the
most similar analogues with data. The end user can then review them and evaluate their
relevance based on the predicted physicochemical properties and available toxicity data.

Question 82: Are the read-across predictions taken directly from analogues, or are they weighted
assemble values?

EPA Response: The read-across prediction is a similarity-weighted activity outcome. It is
calculated by multiplying each analogue's pairwise similarity by its activity outcome, summing
these products, and dividing by the sum of the pairwise similarities.
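
Written as a formula, the calculation described in this response takes the following form. The symbols are introduced here for illustration only: s_i is the pairwise similarity between the target and analogue i, a_i is that analogue's activity outcome (1 for active, 0 for inactive), and k is the number of analogues with data.

```latex
\mathrm{ACT} = \frac{\sum_{i=1}^{k} s_i \, a_i}{\sum_{i=1}^{k} s_i}
```

As a worked example with hypothetical numbers, three analogues with similarities 0.8, 0.6 and 0.4 and outcomes 1, 0 and 1 give (0.8 + 0.4) / (0.8 + 0.6 + 0.4) = 1.2 / 1.8, or approximately 0.67.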

Question 83: If you have more than one source analogue for data, is the most conservative POD
chosen?

EPA Response: The focus within GenRA should be on keeping, after filtering, the set of
analogues that best meets the criteria for your use case in terms of structural similarity, type
(in vivo/in vitro) and quantity of data available for your endpoints of interest, physicochemical
properties, etc. Because GenRA performs its read-across calculations across the set of
analogues kept after filtering, a few good representative analogues (e.g., ones that have the
endpoint data you need and are structurally and chemically similar) would be ideal for
achieving trustworthy read-across results.

Question 84: Is the analogue selection restricted to the substances present in CompTox?

EPA Response: Yes, though you can introduce any chemical of interest using Ketcher.

Question 85: Can we run read-across for more than 10 analogues?

EPA Response: Yes, you can select up to 20 analogues in Panel 1.

Question 86: Does the presence of analogues without data affect the predictions (i.e., weaken the
prediction)?

EPA Response: No, since these are not taken into account when making the prediction.
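
A minimal Python sketch of this behavior, assuming binary activity outcomes (1 = active, 0 = inactive) and using `None` to mark an analogue with no data for the endpoint. The function name and the similarity values are hypothetical. The sketch illustrates the similarity-weighted calculation described in the response to Question 82 and is not GenRA's actual implementation.

```python
def similarity_weighted_activity(analogues):
    """Similarity-weighted activity over analogues that have data.

    `analogues` is a list of (similarity, activity) pairs, where activity
    is 1, 0, or None when the analogue has no data for the endpoint.
    Returns None if no analogue with data contributes any weight.
    """
    # Analogues without data are dropped before any arithmetic is done.
    with_data = [(s, a) for s, a in analogues if a is not None]
    total_similarity = sum(s for s, _ in with_data)
    if total_similarity == 0:
        return None
    return sum(s * a for s, a in with_data) / total_similarity


# Hypothetical neighborhood: three analogues with data.
with_data_only = [(0.8, 1), (0.6, 0), (0.4, 1)]
# Same neighborhood plus one analogue that has no data for this endpoint.
with_missing = [(0.8, 1), (0.7, None), (0.6, 0), (0.4, 1)]

print(similarity_weighted_activity(with_data_only))
print(similarity_weighted_activity(with_missing))
```

Both calls print the same value, since the analogue marked `None` contributes to neither the numerator nor the denominator.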

Question 87: I did not see how to use the phys-chem properties to delete specific analogs?

EPA Response: Expand the properties in Panel 4 and use them to guide whether an analogue
ought to be deselected from consideration.

Question 88: Do you see GenRA having the option to input your own batch of chemicals to perform read
across? For example, you have identified your own class of chemicals that you want to use to inform a
data poor chemical.

EPA Response: Yes, this is something we are working on right now: how to allow the user to
identify their own analogues. If you are interested in batch processing, you may be better off
using genra-py. Please contact genra.support@epa.gov for more details. We also have an API
that you can use to run predictions.

DATA OUTPUTS

Question 89: Is metabolism considered in the predictions?

EPA Response: Metabolism is not currently considered in the predictions; however, we are
looking into this actively.

Question 89: Does GenRA consider the similarity of metabolic/clearance pathways as part of the read-
across?

EPA Response: Not currently but this is something we are actively working on.

Question 90: I don't see the Bio options online. Does anyone else?

Audience Response: Bio options are only available for some chemicals.

Question 91: What are the numbers and colors in Panel 2?

EPA Response: The color density represents a measure of 'data availability' for the target - from
light to dark. The number of data records is reflected in the box itself.

Question 92: With this tool, can you access the specific tox effect (increase/decrease body weight or
liver enzymes)?

EPA Response: Panel 4 can be filtered by study type-toxicity effect e.g., CHR (chronic)-body
weight.

Question 93: Is there a way to focus the GenRA results so that only in vivo data are included in the
output matrix view?

EPA Response: Yes, in Panel 3, select the "ToxRef Group" to see only the in vivo endpoints. We
will go through this in the breakout session.

Question 94: Is it possible to filter out substances in the initial step which do not have any relevant
data?

EPA Response: You can deselect analogues based on the endpoints shown in Panel 4.

Question 95: Is there a way to sort by lack of data?

EPA Response: In Panel 4, searching by observations will provide a view of which endpoints and
source analogues are most data poor.

Question 96: After clicking on "Run Read-Across", would it be possible to save the generated table
(with red and blue boxes)?

EPA Response: It is possible to download the information presented in Panel 4 as an Excel file or
CSV file, but the download contains the available numeric data, not the heatmap view of red
and blue coloured cells.

Question 97: Are only in vivo data shown in the data matrix (Panel 4)?

EPA Response: Either in vivo or in vitro data can be shown in Panel 4. One can choose to see the
in vitro bioactivity data in Panel 4 by selecting the "Group: ToxCast" in Panel 3. We will go over
this in the breakout session.

Question 98: How much data (in vivo and/or in vitro), or what is the minimal data, required to make a
RA that is highly probable?

EPA Response: That's a really difficult question to answer generally for all chemicals. We have
explored this systematically in our publications and have been able to find "sweet spots" for
optimal performance in many cases. It continues to be a research problem.

Question 99: I think it was mentioned that the predictions can be exported to Excel. How do we do
that?

EPA Response: Predictions can be exported in Panel 4 by clicking on the Download
option. The most useful choice is to download the predictions in Excel format.

Question 100: Can you filter by route of exposure of the studies in ToxRefDB?

EPA Response: There are a limited number of inhalation exposure guideline studies in ToxRefDB,
which is why we have mostly oral exposures in GenRA. As our sources of toxicity data grow, and
we have additional information, we plan to include functionality for filtering by routes of
exposure.

PREDICTIONS

Question 101: What does ACT stand for and what is its use?

EPA Response: ACT stands for similarity-weighted activity, i.e., the activity predicted for the
target from its analogues, weighted by similarity. AIM was used in the demo as the fingerprint
option.

Question 102: How do we get AUC and p values?

EPA Response: AUC and p values only appear after a prediction is run and typically only show up
when there is a minimum of 2 positive and 2 negative chemicals.
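
To illustrate what the AUC measures and why both positive and negative chemicals are needed, here is a minimal Python sketch. It computes the AUC as the fraction of (positive, negative) pairs in which the positive chemical receives the higher score, counting ties as half. The minimum of 2 positives and 2 negatives mirrors the response above. The function name, the pairwise formulation and the example values are assumptions for illustration and are not taken from GenRA's code.

```python
def roc_auc(labels, scores, min_per_class=2):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds the known outcomes (1 = positive, 0 = negative) and
    `scores` the corresponding predicted values (e.g., similarity-weighted
    activities). Returns None when either class has fewer than
    `min_per_class` members, since the AUC is then not reported.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if len(positives) < min_per_class or len(negatives) < min_per_class:
        return None

    # Count how often a positive outranks a negative; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes and scores for four chemicals.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # prints 0.75
# Only one positive: too few chemicals in that class, so nothing is reported.
print(roc_auc([1, 0, 0], [0.9, 0.5, 0.1]))  # prints None
```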

Question 103: In case there is residual uncertainty in the read across prediction, would you consider
using an assessment factor to account for this when setting an acceptable limit?

EPA Response: That is for an end-user to determine relative to the decision context they are
interested in.

Question 104: Results: ACT=1 pos effect, high likelihood of effect? Interpretation of Neg, ACT=0.32,
AUC=0.75, p=0.13?

EPA Response: There are several factors to consider in using GenRA predictions:

The number of analogues: The more similar and greater number of analogues that are
available, the more confident you can be in the GenRA prediction. This is because GenRA uses a
statistical method called "nearest neighbors" to make its predictions. The more analogues there
are, the more likely it is that GenRA will find a close match to the target chemical.

ACT: The similarity-weighted activity (ACT) is a value between 0 and 1 that tells you how likely
it is that the target chemical will have the same activity as the analogues. A value of 0 means
that the target chemical is very unlikely to have the same activity as the analogues, while a
value of 1 means that the target chemical is very likely to have the same activity as the
analogues.

AUC: The area under the receiver operating characteristic (ROC) curve (AUC) is a measure of
how well GenRA can distinguish between active and inactive chemicals. An AUC of 0.5 means
that GenRA is no better than chance at making this distinction, while an AUC of 1 means that
GenRA can perfectly distinguish between active and inactive chemicals. An AUC of 0.7 or higher
is generally considered to be good.

p-value: The p-value is a measure of the statistical significance of the GenRA prediction. A low
p-value means that the prediction is statistically significant, which means that it is unlikely to
have occurred by chance. If p=1, then the ACT and AUC are unreliable.

It is important to consider all four of these factors when interpreting the results of GenRA
predictions. The more analogues that are available, the more confident you can be in the
prediction. The ACT and AUC values can give you an idea of how likely it is that the target
chemical will have the same activity as the analogues. The p-value can tell you how statistically
significant the prediction is.

It is also important to remember that GenRA is a statistical method, and no statistical method
is perfect. There will always be some uncertainty in any prediction. However, by considering all
four of these factors, you can make more informed decisions about the reliability of GenRA
predictions.

Interpretation examples:

I)	ACT=1: If AUC>0.5 and p<0.1, then there is a high likelihood of an effect.

II)	ACT=0.32, AUC=0.75, p=0.13: there is a high likelihood that there is no effect, as ACT<0.5,
AUC>0.5 and p~0.1.

Question 105: Is there any cut-off value for AUC or p value above which the prediction is reliable?

EPA Response: See detailed explanation above.

APPLYING GENRA

Question 106: Is this tool accepted by reg agencies?

EPA Response: Policy determinations by EPA or other Agencies are beyond the scope of this
training.

Question 107: Are there any criteria to exclude chemicals from getting a read-across in GenRA?

EPA Response: This is currently a judgement by the end-user based on the information that is
presented in Panel 4.

OTHER

Question 108: Is GenRA considering the number of scientific publications/articles per endpoint?
Because having a list of analogues without data is not very useful for read-across.

EPA Response: Good question. There are a number of factors to consider while using literature
information in read-across predictions. We are exploring a variety of text-mining approaches to
extract information about chemical-effects from the literature. This feature may be included in
future versions.

Question 109: Is read-across from read-across advisable though (i.e., include source analogues of source
analogues of source analogues in the viewer)?

EPA Response: Recursive read-across? This is an interesting research idea. Please feel free to
reach out to us to discuss further.

Question 110: In what cases do we need to use ToxCast vs ToxRef data?

EPA Response: This depends on the end user and the outcomes they are interested in
evaluating, as well as on the target and analogues. If there are not enough endpoints from
ToxRef, the user can evaluate ToxCast.

Question 111: Why do you discard the Benzoic acid if it has a value of 3?

EPA Response: We discarded benzoic acid because its logKow value was less than 2. It was
initially kept when we were filtering on melting temperature, since its melting temperature
helped characterize it as a solid.
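
The two-step filtering described here can be sketched in Python as follows. The logKow cut-off of 2 comes from the response above. The 25 °C melting-point threshold used to flag solids, the field names, and the analogue records are hypothetical placeholders, not values from the training exercise.

```python
# Hypothetical analogue records; all values are placeholders for illustration.
analogues = [
    {"name": "analogue_A", "melting_point_c": 120.0, "log_kow": 1.9},
    {"name": "analogue_B", "melting_point_c": 95.0, "log_kow": 3.1},
    {"name": "analogue_C", "melting_point_c": -10.0, "log_kow": 2.5},
]

SOLID_ABOVE_C = 25.0  # assumed threshold for treating a chemical as a solid
MIN_LOG_KOW = 2.0     # cut-off mentioned in the response


def is_solid(record):
    """True when the melting point is above the assumed ambient threshold."""
    return record["melting_point_c"] > SOLID_ABOVE_C


def meets_log_kow(record):
    """True when logKow is at least the cut-off."""
    return record["log_kow"] >= MIN_LOG_KOW


# Step 1: keep analogues characterized as solids by melting temperature.
after_melting_filter = [r for r in analogues if is_solid(r)]
# Step 2: from those, keep analogues whose logKow meets the cut-off.
kept = [r for r in after_melting_filter if meets_log_kow(r)]

print([r["name"] for r in after_melting_filter])  # ['analogue_A', 'analogue_B']
print([r["name"] for r in kept])                  # ['analogue_B']
```

Here `analogue_A` plays the role benzoic acid played in the demo: it passes the melting temperature filter but is then discarded because its logKow is below 2.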

Question 112: By selecting Chem: AIM and ToxCast data, we may get high similarity score, however the
analogue often lacks in vivo data. Is that a good approach?

EPA Response: If you just want to pick analogues then perhaps. If you need toxicity data to infer
hazard / POD, then you will probably need to include ToxRef.

Appendix A: List of Acronyms

Acronym	Definition
CPDat	Chemical and Products Database
EC50	50 percent effect concentration
ECHA	European Chemicals Agency
ECOSAR	Ecological Structure Activity Relationships
ECOTOX	ECOTOXicology Knowledgebase
ENVIROFATE	Environmental Fate Database
EPI Suite	Estimation Program Interface Suite
GLP	Good Laboratory Practice
MEC	Measured Environment Concentration
NAMs	New Approach Methodologies
OECD	Organization for Economic Co-operation and Development
OPERA	OPEn structure-activity Relationship App
PCA/PLS	principal components analysis/partial least squares regression
QSAR	quantitative structure-activity relationship
REACH	Registration, Evaluation, Authorisation and Restriction of Chemicals
SAR	structure-activity relationship
SETAC	Society of Environmental Toxicology and Chemistry

-------