GLTED-SOP SeqAPASS CL Oct 2019 Ver. 0 (original)
Reference Number: GLTED-STB-SOP-3784-0
Revision No. 0
Date: Oct 2019
Page 0 of 78
STANDARD OPERATING PROCEDURE
GLTED-STB-SOP-3784-0
"Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) Testing"
Authors: Colin P. Finnegan, Donovan J. Blatz, and Carlie A. LaLone
Prepared by: Carlie A. LaLone Date: October 2019
Note: Original called Revision I
Reviewed by:
Team Leader: Carlie A. LaLone Date: October 2019
Branch Chief: Sigmund Degitz Date: October 2019
Approved by:
Quality Assurance Manager: Barbara Sheedy Date: October 2019
Note: Tri-annual review
-------
Revision 0: (Original)
Prepared by: Carlie A. LaLone, Research Bioinformaticist
Signature: ____ Date: October 2019
Revision 0: (Original)
Approved by Branch Chief: Sigmund Degitz, STB Branch Chief
Signature: ____ Date: October 2019
Revision 0: (Original)
QA Manager Approval by: Barbara Sheedy, QA Manager
Signature: ____ Date: October 2019
U.S. Environmental Protection Agency
Center for Computational Toxicology and Exposure
Great Lakes Toxicology and Ecology Division
Duluth, MN
-------
The Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS)
Standard Operating Procedure (SOP) has been developed to test all current functionality
of the SeqAPASS tool. It can be used at any time during development; however, it is most
critical when testing the code upon being dropped into the staging environment and the
production environment. It is anticipated that as development of the SeqAPASS tool
continues, the SeqAPASS SOP will be updated to incorporate the most up-to-date
features. Testing is divided into two components, the first focusing on the Graphical User
Interface (GUI; FrontEnd) and the second on the Data/Database (BackEnd). Depending on
whether the SeqAPASS tool is moving toward a minor or major version release, either
one or both components of the tool may be tested. This testing SOP is organized by pages
and features in the SeqAPASS tool, as indicated by the header for each testing section.
Testing of the SeqAPASS tool may begin at the beginning of the SOP and work through
the complete SOP if the entire tool is moving into staging or production, or the testing
may be conducted for select features that have been modified or changed during iterative
development of the tool. Protein and taxonomy data from the National Center for
Biotechnology Information (NCBI) databases are periodically retrieved and used to
update the back end of SeqAPASS. R code has been developed to automate data
comparisons between new and old data versions to check for anomalies. The source code
that is used in R is provided at the end of the SOP. To run the R code, the tester must
have RStudio installed on their computer.
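The kind of comparison described above can be illustrated with a short sketch. This is not the R code supplied at the end of the SOP; it is a minimal Python illustration that assumes each data version has been summarized as a count of proteins per NCBI taxonomy ID. The function name, the 10% threshold, and the sample counts are all hypothetical.

```python
from typing import Dict, List


def flag_count_anomalies(
    old_counts: Dict[str, int],
    new_counts: Dict[str, int],
    max_drop_fraction: float = 0.10,  # hypothetical threshold, not from the SOP
) -> List[str]:
    """Return warnings for taxa that vanished or lost many proteins."""
    warnings = []
    for taxid in sorted(old_counts):
        old = old_counts[taxid]
        new = new_counts.get(taxid)
        if new is None:
            warnings.append(f"{taxid}: present in old version, missing in new version")
        elif old > 0 and (old - new) / old > max_drop_fraction:
            warnings.append(f"{taxid}: protein count dropped from {old} to {new}")
    return warnings


# Made-up counts keyed by NCBI taxonomy ID, for illustration only.
old_version = {"9606": 1000, "10090": 800, "7955": 500}
new_version = {"9606": 1005, "10090": 600}
for line in flag_count_anomalies(old_version, new_version):
    print(line)
```

Growth in a count is not flagged here, because new data versions are expected to add proteins. A real check may well want to look at unusually large increases too.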
-------
▪ No username, real password
▪ Incorrect username, no password
▪ Incorrect username, incorrect password
▪ Incorrect username, correct password
▪ Correct username, no password
▪ Correct username, incorrect password
▪ Username should not be case sensitive
▪ Password should be case sensitive
▪ Log in with correct username and password
▪ Log out
o Attempt to log back in with PIV card
▪ Try to log in without PIV card
▪ PIV card in, no password
▪ PIV card in, wrong password
o Log in with PIV card in, correct password
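The username and password combinations above can be written down as a small table of expected outcomes, which makes it easy to see that every case is covered. The sketch below is illustrative only: the credentials are placeholders, the function is hypothetical, and PIV card login is not modeled.

```python
def login_should_succeed(username, password, real_username, real_password):
    """Expected outcome under the rules listed above (not the tool's own code)."""
    if not username or not password:
        return False
    # Username is compared without regard to case; password is case sensitive.
    return username.lower() == real_username.lower() and password == real_password


REAL_USER, REAL_PASS = "jdoe", "S3cret!"  # placeholder credentials

# (username entered, password entered, expected success)
CASES = [
    ("", REAL_PASS, False),                # no username, real password
    ("wrong", "", False),                  # incorrect username, no password
    ("wrong", "nope", False),              # incorrect username, incorrect password
    ("wrong", REAL_PASS, False),           # incorrect username, correct password
    (REAL_USER, "", False),                # correct username, no password
    (REAL_USER, "nope", False),            # correct username, incorrect password
    (REAL_USER.upper(), REAL_PASS, True),  # username is not case sensitive
    (REAL_USER, REAL_PASS.lower(), False), # password is case sensitive
    (REAL_USER, REAL_PASS, True),          # correct username and password
]

for user, pwd, expected in CASES:
    assert login_should_succeed(user, pwd, REAL_USER, REAL_PASS) == expected
```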
-------
HOME TAB
• Check all links
• Starting from the "Home" tab, click on the "Request SeqAPASS Run" tab to the right
of this tab, and then back to the "Home" tab. Repeat for all remaining tabs to the
right.
• Click logout, then log back in
REQUEST SEQAPASS RUN tab - By Species
Generic checks
• Check all links
o Make sure Identify a Protein Target links all work.
• Starting from the "Request SeqAPASS Run" tab, click on the "SeqAPASS Run
Status" to the right of this tab, and then back to this tab. Repeat for all remaining
tabs to the right.
• Click logout, then log back in and navigate back to this tab.
Query Species Selection
• Click into the search bar under Query Species Selection, type in a three-letter
combination, all lower case (use "hom" by default). A dropdown list should
appear. Delete the three letters. The dropdown menu should vanish. Retype the
same three letters, this time all capitals. The same list of species should drop down
again.
• Clear the search bar and click the Add Query Species button. Nothing new should
happen.
• Copy "zxvqwy5" into the search bar. The dropdown list should say "No Results
Found". Click the Add Query Species button. Nothing new should happen.
• Search for "homo sapiens", click the top result in the menu and click the Add
Query Species button. "Homo sapiens (Taxid:9606)" should appear in the box
below.
• Repeat the above step with "Homo sapiens (Taxid:9606)" already present. Nothing
new should appear.
• Search for "human", click the top result, then click the Add Query Species button.
"human (Taxid:9606)" should appear in the box below.
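The behavior expected from the Query Species search bar and the Add Query Species button can be summarized in a few lines. The sketch below is a hypothetical model, not the tool's implementation. It assumes matching is a case-insensitive substring test and that duplicates are detected by the displayed label, which is why "human (Taxid:9606)" can sit alongside "Homo sapiens (Taxid:9606)". The catalog is a made-up three-entry list.

```python
CATALOG = [  # tiny stand-in for the species list
    "Homo sapiens (Taxid:9606)",
    "human (Taxid:9606)",
    "Danio rerio (Taxid:7955)",
]


def search_species(term, catalog=CATALOG):
    """Return dropdown entries for a search term; an empty term shows nothing."""
    term = term.strip().lower()
    if not term:
        return []
    return [name for name in catalog if term in name.lower()]


def add_query_species(selected, label):
    """Add a label to the Query Species box unless it is empty or already present."""
    if label and label not in selected:
        selected.append(label)
    return selected


print(search_species("hom") == search_species("HOM"))  # same list in either case
print(search_species("zxvqwy5"))  # empty list, shown as "No Results Found"
```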
-------
• Pick one of the species in the Query Species box that has not yet been selected
and has more than 5 proteins. Using the Ctrl key, click on three different proteins and
add them simultaneously to the Final Query Protein(s) box. They will be referred
to as "X", "Y", and "Z".
SeqAPASS Submission
• Click on the protein designated "X", then click the Remove Selected Protein(s)
button. "X" should no longer be in the Final Query Protein(s) box.
• Using the Ctrl key, select both "Y" and "Z", then click the Remove Selected
Protein(s) button. "Y" and "Z" should no longer be in the Final Query Protein(s)
box.
• Click the Remove All Proteins button. The box should be empty.
• Add any protein to the Final Query Protein(s) box, then click the Clear button. The
page should return to its default state.
• Click the Remove Selected Protein(s) button. Nothing should happen.
• Click the Remove All Proteins button. Nothing should happen.
• Click the Clear button. Nothing should happen.
• Click the Request Run button. An error should appear stating "Must select query
proteins".
• Using Homo sapiens as the species, return "NP_001062.1" to the Final Query
Protein(s) box. Click Request Run. A message stating "Submitted NP_001062.1"
should appear in the top right corner of the screen.
By Accession
• Return to the top of the page and click the radio button marked "By Accession".
You should be transported to a new page on the same tab.
• Click on the link, NCBI Protein Database. This should open a new tab to the url
https://www.ncbi.nlm.nih.gov/protein. Close this tab and return to SeqAPASS.
• Click Request Run. An error should appear stating "Must enter NCBI Accession".
• Type "hello" into the NCBI Protein Accession textbox. Then click the Clear
button. The text should vanish. Type it again and then hit the Request Run button.
You should see a message stating "hello: not in database".
• Copy the contents of the text file "Accession_list.txt" into the NCBI Protein
Accession box, then click the Request Run button. You should see a series of
boxes stating that each individual accession was submitted.
-------
the only button with numbers in it, then the arrows to the right should be
greyed out too.
o If the REQUEST RUN process has been completed, there will be at least
three pages of level 1 runs which is required for satisfactory testing. If this is
not the case, return to the REQUEST RUN stage and submit all runs from
that stage of testing.
o Click on the button containing the number 2. This should highlight the button
showing 2, and the data shown in the table should change. The arrows on the
left of the buttons with numbers should no longer be greyed out.
o Click the button numbered 1. This should return you to the previous state of
the table.
o Click the button showing two arrows on the right of the numbered buttons.
This should advance you one step forward to page 2. There should be no
observable differences between having arrived at the page by clicking on the
number 2 and arriving by clicking the arrow button. Return to page one by
clicking on the button numbered 1.
o Click the button showing an arrow and a line on the right side of the
numbered buttons. This should advance you to the last numbered page,
whatever that may be. There should be no more numbered buttons to the right
of the highlighted button. The arrows on the righthand side of the numbered
buttons should be greyed out.
o Click the button showing two arrows on the left of the numbered buttons.
This should move you back one page, to the page numbered one lower than the
previous page. All arrow buttons should be highlighted. Return to the last page by
clicking on its number.
o Click the button showing an arrow and a line on the left side of the numbered
buttons. This should move you back to the first page.
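The navigation checks above amount to a small state table: which arrow buttons are active on a given page, and where each button takes you. The sketch below restates those expectations in Python. The names are hypothetical and the code models only the expected behavior, not how SeqAPASS implements its pager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagerState:
    page: int
    back_arrows_enabled: bool     # "first page" and "previous page" buttons
    forward_arrows_enabled: bool  # "next page" and "last page" buttons


def pager_state(page, total_pages):
    """Expected button state for the highlighted page number."""
    if total_pages < 1 or not 1 <= page <= total_pages:
        raise ValueError("page out of range")
    return PagerState(page, page > 1, page < total_pages)


def click(button, page, total_pages):
    """Page you should land on after clicking one of the four arrow buttons."""
    targets = {
        "first": 1,
        "previous": max(1, page - 1),
        "next": min(total_pages, page + 1),
        "last": total_pages,
    }
    return targets[button]


# With three pages of runs, page 2 is the only page where every arrow is active.
print(pager_state(2, 3))
```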
Number of entries
o Count the number of entries in the table. There should be ten by default.
Click on the number ten to the right of the navigation buttons to open a
dropdown menu and change the value to twenty. There should now be 20
entries per page. Click on the 20 and change it to 50. If there are not enough
entries to fill a full page of 50, simply check that there are more than 20
entries.
Downloads
o Click on the Excel logo after the words "Download Table:". The table should
download with an .xls extension. Open the download and check the top ten
-------
View SeqAPASS Reports
Generic checks
• Starting from this tab, click on the tab to the right of this tab, and then back to this
tab. Repeat for all remaining tabs to the left.
• Click logout, then log back in and navigate back to this tab.
Table checking
Data sorting
o The leftmost column title should be highlighted in blue and show a darker
blue downward pointing triangle after the end of the column name when you
enter the page.
o Click on the leftmost column name. The triangle should point up and the data
ought to have reorganized themselves (assuming there was more than one
entry in the table).
o Proceed to click twice on each column heading and check that the data is
being organized correctly.
▪ When the column entry is a number and you see the downward triangle,
the column should be sorted in descending numerical order, highest
values at the top. When the triangle points up, the list should be sorted
in ascending numerical order, lowest values at the top.
▪ When the column entry is a string of text and you see the downward
triangle, the column should be sorted in alphabetical order. When the
triangle points up, the list should be sorted in reverse alphabetical order.
▪ When the column entry is a date and you see the downward triangle, the
column should be sorted from most to least recent. When the triangle
points up, the list should be sorted from least recent to most recent.
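Note that the downward triangle means different directions for different data types: highest first for numbers, most recent first for dates, but A to Z for text. The sketch below encodes those three rules so an expected order can be generated for comparison. It is illustrative only; the function name is hypothetical, and ignoring letter case when sorting text is an assumption.

```python
from datetime import date


def expected_order(values, triangle):
    """Return the order a column should show for a 'down' or 'up' triangle."""
    if triangle not in ("down", "up"):
        raise ValueError("triangle must be 'down' or 'up'")
    if all(isinstance(v, str) for v in values):
        # Text: down = alphabetical, up = reverse alphabetical.
        return sorted(values, key=str.lower, reverse=(triangle == "up"))
    # Numbers and dates: down = highest or most recent first.
    return sorted(values, reverse=(triangle == "down"))


print(expected_order([3, 10, 1], "down"))
print(expected_order(["rat", "Human", "zebrafish"], "down"))
print(expected_order([date(2019, 1, 5), date(2019, 10, 1)], "down"))
```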
Navigation
o At the bottom of the table there should be a series of buttons, some with
arrows and some with numbers. The button with the number 1 in it should be
blue and the arrows to the left of it should be greyed out. If the number one is
the only button with numbers in it, then the arrows to the right should be
greyed out too.
-------
should download with a .csv extension. Open the download and check the
top ten rows against the table on your browser.
Search function
o The search function does NOT consider the following columns when filtering
the data table: SeqAPASS Run ID, Ortholog Count, NCBI Taxonomy ID or
Data Version. When entering a search term from one of these columns, you
are looking for evidence that entries are being filtered out of the data set
despite containing the search term in one of those columns.
o For each column, type in the value you observe in that column in the first
row. Except for the cases listed above, the top row should remain in the table
and any other rows that remain should have the search term somewhere in
their row, most likely in the same column (but not always).
o The search bar should not recognize search terms that span multiple columns.
Test this by entering a Query Common Name, then a space, then the
respective Accession from the first row. The table should be empty.
o Scroll to the bottom of the page and click the "Top of Page" button. This
should move you back to the top of the page without changing anything
about the data table.
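The filtering rules above can be expressed as a short reference model: a row survives if the search term appears within a single cell of any searchable column. The sketch below is a hypothetical illustration. Substring and case-insensitive matching are assumptions, and the second sample row uses a made-up accession.

```python
# Columns the search is documented to ignore.
EXCLUDED_COLUMNS = {"SeqAPASS Run ID", "Ortholog Count", "NCBI Taxonomy ID", "Data Version"}

ROWS = [  # sample rows for illustration; the second accession is made up
    {"SeqAPASS Run ID": 101, "Query Common Name": "human", "Accession": "NP_001062.1",
     "Ortholog Count": 250, "NCBI Taxonomy ID": 9606, "Data Version": 4},
    {"SeqAPASS Run ID": 102, "Query Common Name": "house mouse", "Accession": "XP_000000.1",
     "Ortholog Count": 180, "NCBI Taxonomy ID": 10090, "Data Version": 4},
]


def filter_rows(rows, term):
    """Keep rows where one searchable cell contains the whole search term."""
    term = term.lower()
    return [
        row for row in rows
        if any(term in str(value).lower()
               for column, value in row.items()
               if column not in EXCLUDED_COLUMNS)
    ]


print(len(filter_rows(ROWS, "human")))              # row kept: term is in a searchable column
print(len(filter_rows(ROWS, "9606")))               # filtered out: NCBI Taxonomy ID is ignored
print(len(filter_rows(ROWS, "human NP_001062.1")))  # empty: term spans two columns
```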
Run Selection
o The radio buttons on the left-hand side of the table are used to select a report.
Click on the button next to the topmost entry on the list. It should highlight,
and the row should turn grey.
o Click on the button next to the second highest entry on the list. The button
should highlight and the button that was previously highlighted should have
returned to its previous state.
o Using the "By accession" feature in the Request SeqAPASS Run tab, copy
the accession of the topmost entry and resubmit it. Navigate back to this tab.
You should see the new run appear above the old one.
o Using the "By accession" feature in the Request SeqAPASS Run tab, submit
accession AAA58995.1. This accession is for a partial protein. Navigate back
to the View Reports tab. The newest entry should have the Query Protein
Name highlighted in yellow. Click the box next to the words Partial Protein
Sequence next to the Request Selected Report button. The highlighting
should disappear. Click the box again; it should reappear.
-------
• The character # will not be used. Instead, when a word is bounded by #, the
expectation is that you will not copy those exact characters, but instead replace
them with the appropriate value. For example,
#Your_first name#_#Your_last name# should be written as your first name and
last name separated by an underscore rather than a space. Most commonly this will
be used in the form #Accession#, where the expected replacement is the Accession
ID of the protein you are currently looking at.
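The # convention can be illustrated with a small substitution helper. This is only a sketch of the naming rule; the function name and the sample values are hypothetical and not part of the SOP.

```python
import re


def fill_placeholders(template, values):
    """Replace every #name# in the template with values[name]."""
    def replace(match):
        key = match.group(1).strip()
        return values[key]  # raises KeyError if a placeholder has no value
    return re.sub(r"#([^#]+)#", replace, template)


print(fill_placeholders(
    "#Your_first name#_#Your_last name#",
    {"Your_first name": "Jane", "Your_last name": "Doe"},  # sample values
))
```

A missing value raises an error rather than leaving the # characters in place, since a file name that still contains # means the convention was not followed.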
Take this moment to navigate to your current working directory (e.g.
'L:/Priv/Bioinformatics Team/SeqAPASS/Saved test data'). All files will be saved to this
location unless the instructions explicitly say otherwise.
Level 1
Generic checks of report
• Click on the five links NOT in the data table on this page.
o The first, which will be the accession of your current query after Query
Accession under the Level 1 Query Protein information, should open a new
tab to the url https://www.ncbi.nlm.nih.gov/protein/[ACCESSION], where
[ACCESSION] is replaced by the relevant accession. Close this tab and
return to SeqAPASS.
o The second, NCBI Conserved Domain Database, should open a new tab to
the url
https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=live&SEQUENCE=[ACCESSION],
where [ACCESSION] is replaced by the relevant accession. Close this tab and
return to SeqAPASS.
o The third, NCBI Protein Database, should open a new tab to the url
https://www.ncbi.nlm.nih.gov/protein. Close this tab and return to
SeqAPASS.
o The fourth, NCBI COBALT, should open a new tab to the url
https://www.st-va.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi?. Close this
tab and return to SeqAPASS.
o The fifth, NCBI Taxonomy Database, should open a new tab to the url
https://www.ncbi.nlm.nih.gov/taxonomy. Close this tab and return to
SeqAPASS.
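For convenience, the five expected addresses listed above can be generated for the accession under test and compared against the browser's address bar. The sketch below only assembles the URLs given in this section; the function name and dictionary keys are hypothetical.

```python
def expected_links(accession):
    """Expected target URL for each of the five links outside the data table."""
    return {
        "Query Accession": f"https://www.ncbi.nlm.nih.gov/protein/{accession}",
        "NCBI Conserved Domain Database": (
            "https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi"
            f"?INPUT_TYPE=live&SEQUENCE={accession}"
        ),
        "NCBI Protein Database": "https://www.ncbi.nlm.nih.gov/protein",
        "NCBI COBALT": "https://www.st-va.ncbi.nlm.nih.gov/tools/cobalt/re_cobalt.cgi?",
        "NCBI Taxonomy Database": "https://www.ncbi.nlm.nih.gov/taxonomy",
    }


for name, url in expected_links("NP_001062.1").items():
    print(f"{name}: {url}")
```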
-------
nearest hundredth. Repeat this process to ensure that you have observed the value
being rounded up and rounded down.
• Set the Cut-off to default.
Primary Report Settings
• In the E-value box, enter a negative value and click "Update Report". An error
message should appear, and the box border should turn red. Enter a number greater
than 10. An error message should appear. Enter a string of non-numeric characters.
An error message should appear. Enter .01. The red border should vanish, and no
error message should appear.
• In the E-Value box, enter 1E-1, then click "Update Report". The value in the box
should be represented as .1.
• To determine that the E-value is impacting the entries in the data table, note the
number of pages in the data table (next to the navigation tools you should see "(1
of X)", where X is the number of pages). Enter "1E-100" into the E-value box and
click "Update Report". If that fails to change the number of pages, click the Full
Report radio button on the data table and check the E-value column in the data
table. All entries should be less than the number you entered. (Because of
rounding, some entries may have an E-value of 0.000E0. If this is the case, choose
a different accession and repeat this process.)
• Click "Use Default Settings". The E-Value should return to .01.
• In the dropdown menu of Sorted by taxonomic group, select each option and then
click "Update Report". After each change, look in the primary data table under the
column "Filtered Taxonomic Group" and check that the name in this column is of
the appropriate type. Note that some organisms may not have a classification for
each level SeqAPASS provides; in this case SeqAPASS will use the nearest more
specific classification that organism has. Clicking on the entry for a row in the
column "Scientific Name" will bring you to the NCBI Taxonomy Browser for that
species, which facilitates the process of confirming appropriate classifications.
• Click "Use Default Settings". The Sorted by Taxonomic Group should return to
class.
• In the Common Domains box, enter a negative value and click "Update Report".
An error message should appear, and the box border should turn red. Enter a
number greater than 100,000. An error message should appear. Enter a number that
isn't an integer. An error message should appear. Enter a string of non-numeric
-------
Navigate to visualization
o Return to the originally chosen non-partial sequence.
o Click on the "Visualize Data" button in the Visualization box below the
primary report box. A new tab should open. Navigate to that tab.
Visualization
In the new tab, under Level 1 Query Protein Information you should see the five pieces of
information listed: SeqAPASS ID, Query Protein, Query Species, Ortholog Count, and
Query Accession. The values for these categories should be the same as they were in the
main SeqAPASS page. Confirm that this is the case.
Select to Open Information or Data visualization
o You should see only a gray box with an information symbol and a small box
and whisker plot. Hover over these icons. They should enlarge and have
captions.
o Click on the Information icon. Nothing should happen.
o Click on the Box Plot icon. After a brief loading screen, the lower box
should change from "Info" to "Box Plot".
o Click on the Information icon again. The lower box should change back to
"Info". Click on the Box Plot icon again. The transition to "Box Plot" should
not take significantly longer than the first time the transition occurred.
Box Plot
Controls
Taxonomic Groups
o You should see a variety of small blue boxes with taxonomic groups that
correspond to the hierarchical level selected in the primary settings report on
the previous tab. Return to that tab and select a different Sort By Taxonomic
Group value, then click Update Report. Return to this tab. The blue boxes
should have changed to reflect the new hierarchical level (Note: this can take
a very long time if the number of species is large and you selected a very
low hierarchical grouping. Be reasonable). Change the hierarchical grouping
to Class to minimize clutter while testing.
-------
o Click the checkbox in the upper left corner. It should become checked, as
should the unchecked box, and you should see the previously removed
group's blue box return in the list. Click the box again. This should remove
all groups except for the query group. Click it again. All boxes should
become checked and blue boxes return to the list.
o Click the x below the search bar. This should close the window.
o Move down to the Boxplot section of the page. Hover the cursor over some
of the group names. The cursor should change to show an arrow with a
white x in a red circle, and a grey box should appear listing up to three
species from that group. Pick a group, note its name, and then click on it.
This should remove it from the graph. Return to the Taxonomic Groups box
and search for the blue box of the group you removed. It should not be
present. Open the list of groups and search for the name of the group you
deleted. The box next to it should not be checked. Check this box to return to
the original state of the page.
Select species for legend
o Hover the cursor over any white space in the box or the downward facing
arrow on the right side of the box. The cursor should change to a finger and
the arrow highlight blue. Click anywhere in the area outlined above. This
should open a list that is similar to the list seen in the Taxonomic Groups
step.
o Check that the search bar is functioning properly by typing the name of a
species on the list. Stop after each letter and confirm that all remaining
species in the list have the letter combination in the search bar somewhere
inside their name. The search should not be capitalization sensitive. While
there is something in the search bar, close the window and then reopen it.
The characters in the search bar should still be present. After this test, delete
whatever term is in the search bar.
o Select a species, preferably one whose taxonomic classification you are
familiar with. Find it on the list (if it is not present, choose a species that is
on the list and then resume this process). Click on the checkbox next to its
name.
o With that checkbox still checked, observe the graph. Somewhere on the
graph a legend should have appeared containing the species name that was
just selected with a colored symbol to its left. That colored symbol should
-------
o Hover over the name of a species in the legend. The cursor should change
to the arrow with an x in a red circle. Click on the name. This should
remove it from the graph. Confirm that it is removed in the Select Species
for Legend area.
o Begin selecting species to be on the legend and continue until the legend
extends off the end of the graph. From this point, retest all forms of single
deletion (by checkbox, by blue box, by name) and then all forms of deletion
by groups.
o Note: There is a historical issue where deleting a group with multiple
species while the legend is crowded will result in a graphical error where
the legend intersects with the graph. Sometimes this effect is temporary,
and other times it has persisted until another species is removed or added to
the legend. This is not intended behavior and should be reported.
o Add test on species crossing over tax groups depending on answer to non-
eukaryotes question.
Species Legend Options
o Close the visualization tab and then click on the "Visualize Data" button
again.
o Note: There is a historical issue where attempting to open the visualization
any time after the first will be extremely slow, showing a white screen for
15-45 seconds before opening. This is not intended behavior and should be
reported. If the page is not loading for longer than 45 seconds, close the tab,
navigate back to the main Request Visualization area, request the accession
you were in again, and open the visualization from that page. If you took
this action, ignore the next step.
o If you did NOT have to re-request the accession, then you should see that
the changes made to the Taxonomic Groups and Select species for legend
areas were preserved. Close the visualization tab and this time reset the
visualization page by navigating back to the main View SeqAPASS Reports
area, requesting the accession you were in again, and opening the
visualization from that page.
o The page should appear as it did the first time you opened it, with no
species selected and all Taxonomic groups present.
o If you have already identified a species in the current accession where two
or more species share the same common name, then remain on your current
-------
the graph are likely to change position and the inscription next to the red
dot in the legend should read "Threatened Species". Hover over the dots
and confirm that information is displayed as usual. Click on one of the dots.
This should open a new tab containing the ECOS page for that species.
Close this new link.
o Click on the checkbox beneath Endangered Species. The checkbox beneath
Threatened Species should automatically become unchecked. The dots in
the graph should change position and the inscription next to the red dot in
the legend should read "Endangered Species". Hover over the dots and
confirm that information is displayed as usual. Click on one of the dots.
This should open a new tab containing the ECOS page for that species.
Close this new link.
o Click on the checkbox beneath Common Model Organisms. The checkbox
beneath Endangered Species should automatically become unchecked. The
dots in the graph should change position and the inscription next to the red
dot in the legend should read "Common Model Organism". Hover over the
dots and confirm that information is displayed as usual.
o Pick a red dot that has no other red dots close to it. Change which box has
been selected if need be. Hover over that dot and note the organism it
represents. In the Select Species for Legend menu, select that species. You
should see the appropriate legend pop up and the red dot should be replaced
by the new icon. Hover over the new icon. There should be no change in
information displayed.
Download BoxPlot
o Click the Download Boxplot button. After a brief loading screen, a pop-up
menu should appear in the center of the screen, and the screen outside of
that pop-up should become grey and non-interactable.
o Initially, the radio button should be set to SVG and the width and height
text boxes should be greyed out.
o Click the Download image button. Rename the file to indicate the accession
and save it to the SOP testing folder. Click anywhere on the visualization
page outside the menu. This should not close the menu. Click on the x in
the upper right to close the menu. Find the downloaded file and open it. An
internet browser should automatically open with an image of the graph.
-------
accession to open the visualization tab, then the radio button should still be
set to PNG. Change this to JPG. Click Download Image, renaming the file
to indicate and save it to the SOP testing folder.
o Open the file and compare it to the graph on SeqAPASS. There should be
no differences.
o Repeat the process of testing different magnification levels for the JPG
format. This should include both an upscaled and downscaled image.
o After finishing that process, set the width to the value that previously got an
out of memory error, and then switch the radio button to SVG. The boxes
should become greyed out. Click download image. Notice that this should
not cause an out of memory error. Save the file and compare it to the
original SVG file. There should be no differences. Close out of the
Download Box Plot menu.
Size Controls
o Click the Open Size Controls button. A pop-up menu should appear in the
center of the screen, but this time the rest of the screen should not grey out.
o Confirm that areas outside of the pop-up menu are interactable.
o In the Bar Width area, use the arrows to adjust the number in the box. As the
number increases, you should see the width of the bars in the graph get
larger, and smaller as the number decreases. However, within the confines of
the page, this will manifest as the entire graph becoming longer or shorter,
with the actual width of the bars being relatively unchanged. What you are
effectively doing is changing the aspect ratio of the image. This can be
confirmed by checking the Download BoxPlot menu. As the bar size
changes, you can observe that the ratio of width to height is being altered.
o Return to the Size Controls. Click inside the Bar Width text box and type 16,
then hit enter. This should have the effect of setting the bar width to 16, as if
you had gotten there by clicking on the arrows.
o Type in a decimal, then hit enter. The decimal should be truncated.
o Type non-numeric characters. These should be ignored and the value in the
box changed to the last valid input.
o Click the reset button. The bar width should be changed to the box-specific
original value. If it was already 12, change it to something else and then
click the reset button again.
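The expected handling of typed input in the Bar Width box can be modeled in a few lines. The class below is a hypothetical reference model, not the tool's code. It assumes the original value is 12, as the reset step implies, and it does not model any minimum or maximum width.

```python
import math


class BarWidthBox:
    """Reference model of the Bar Width text box (assumed default of 12)."""

    def __init__(self, default=12):
        self.default = default
        self.value = default

    def type_and_enter(self, text):
        """Apply typed input: truncate decimals, ignore non-numeric text."""
        try:
            number = float(text)
        except ValueError:
            return self.value  # non-numeric: keep the last valid input
        if not math.isfinite(number):
            return self.value  # "inf" and "nan" are treated as invalid here
        self.value = int(number)  # int() drops the decimal part
        return self.value

    def reset(self):
        self.value = self.default
        return self.value


box = BarWidthBox()
print(box.type_and_enter("16"))    # width becomes 16
print(box.type_and_enter("14.7"))  # decimal part is dropped
print(box.type_and_enter("abc"))   # ignored, last valid input kept
print(box.reset())                 # back to the original value
```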
-------
o Pick one of the taxonomic groups and click on the gray bar representing the
spread of percent similarities within that group (It is easier to pick a group
with a large bar. It is possible to click on smaller ones, but for groups with 2
or 3 entries this can become a pixel hunt). A summary table should pop
open.
o The summary table should list the taxonomic group and how many species
are in that group in the upper left-hand corner. Directly underneath those
should be the mean and median percent similarity values, followed by
Susceptible Y or N, which is the susceptibility prediction for that taxonomic
group based on read-across - Y. Beneath those should be a data table like
ones that you have seen before. Download this data table as
#accession#_#Taxonomic group#_Summarylvl1 in both .xls and .csv format.
o Perform the Table Checking steps appropriate to this table. In this case, there
is no search function and no radio buttons.
o Click View Level 1 Summary Report on the Level 1 Query page. The data
present should be: taxonomic group, filtered taxonomic group, number of
species, mean percent similarity, median percent similarity, and
susceptibility prediction. Check to make sure that the data matches that of
the boxplot tables when selecting a species boxplot. (Repeat for levels 2 and
3 with their respective information).
Summary table data checking
o Return to the main SeqAPASS tab. Using the taxonomic groups selector in
the Request Level 3 box, select the taxonomic group whose summary table
you opened. In the primary data table, you should see that taxonomic
group's name appear in the search bar. Download this data table with the
name #Accession#_#Taxonomic_Group#_Summary_Table_Check.
o Return to the visualization tab. Download the summary table for that group
with the name #Accession#_#Taxonomic_Group#_Summary_Table.
o No automated method of checking yet exists for these tables, so you will
need to perform the check manually. Open both downloaded tables.
o The NCBI Accession, Taxonomic Group, Filtered Taxonomic Group,
Scientific Name, Common Name, Protein Name, and susceptibility columns
should be identical between both tables. The percent similarity should
also be identical between the tables, though the summary table will have lost a
significant digit as compared to the primary data table.
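The manual comparison above could be assisted by a short script run on the two .csv downloads. The following is only an illustrative sketch and is not part of the SeqAPASS tooling: the column header strings, the function names, and the tolerance rule (values may differ by less than one unit of the summary table's last digit, which allows for either rounding or truncation) are all assumptions to adjust against the real files.

```python
import csv
from decimal import Decimal, InvalidOperation

# Assumption: these header names match the downloaded files; adjust as needed.
KEY_COLUMN = "NCBI Accession"
IDENTICAL_COLUMNS = [
    "Taxonomic Group",
    "Filtered Taxonomic Group",
    "Scientific Name",
    "Common Name",
    "Protein Name",
    "Susceptibility Prediction",
]
PERCENT_COLUMN = "Percent Similarity"


def load_rows(path):
    """Read a downloaded .csv table into a list of dicts keyed by column header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def percent_agrees(primary: str, summary: str) -> bool:
    """True if the values differ by less than one unit of the summary's last digit."""
    try:
        primary_value, summary_value = Decimal(primary), Decimal(summary)
    except InvalidOperation:
        return False
    if not (primary_value.is_finite() and summary_value.is_finite()):
        return False
    last_digit_unit = Decimal(1).scaleb(summary_value.as_tuple().exponent)
    return abs(primary_value - summary_value) < last_digit_unit


def compare_tables(primary_rows, summary_rows):
    """Return a list of discrepancy messages; an empty list means the tables agree."""
    primary_by_key = {row[KEY_COLUMN]: row for row in primary_rows}
    problems = []
    for row in summary_rows:
        key = row[KEY_COLUMN]
        match = primary_by_key.get(key)
        if match is None:
            problems.append(f"{key}: not found in primary table")
            continue
        for column in IDENTICAL_COLUMNS:
            if row[column] != match[column]:
                problems.append(f"{key}: {column} differs")
        if not percent_agrees(match[PERCENT_COLUMN], row[PERCENT_COLUMN]):
            problems.append(f"{key}: {PERCENT_COLUMN} differs beyond the lost digit")
    return problems
```

A typical use would be compare_tables(load_rows(check_file), load_rows(summary_file)), where both paths point to the files saved in the two steps above. Any message returned still needs to be confirmed by eye before it is reported.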
Level 2
Generic Checks or Report
• By default, the Request Domain Run button should be greyed out and unclickable.
The View Level 2 Data button should be clickable. Click this button. You should
receive an error telling you to select a run from the dropdown.
• Open the dropdown menu under Functional Domains by clicking on it. If the list is
empty, go back and select a different accession to test.
• Check that the search bar can recognize strings of numbers, strings of characters,
and strings of mixed numbers and characters. Check that spaces are not treated as
an AND operator.
• Select any of the listed domains. The Request Domain Run should become
interactable. Go back into the menu and select "-Select Domain-" at the top of the
list. The button should grey out again. Reselect the previous group and click
Request Domain Run.
• Navigate to Run Status level 2 and confirm that the submission has been accepted
and is running/has completed (It should have completed already. If it has not
finished within 20 seconds, something is wrong). Navigate back to the level 1
report.
• Click the dropdown menu under Choose Domain to View. The request you just
submitted should be there. Click on it. Then click "View Level 2 Data". You
should be moved to a new page on the same tab.
Primary Report Settings
• In the E-value box, enter a negative value and click "Update Report". An error
message should appear, and the box border should turn red. Enter a number greater
than 10. An error message should appear. Enter a string of non-numeric characters.
An error message should appear. Enter .01. The red border should vanish, and no
error message should appear.
• In the E-value box, enter 1E-1, then click "Update Report". The value in the box
should be represented as .1.
• To determine that the E-value is impacting the entries in the data table, note the
number of pages in the data table (next to the navigation tools you should see "(1
of X)" where X is the number of pages). Enter "1E-100" into the E-value box and
click "Update Report". If that fails to change the number of pages, click the Full
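The E-value input rules tested in the Primary Report Settings steps can be summarized in code. This is a hedged sketch of the expected behavior only, not the application's validator: the name parse_e_value is hypothetical, and treating exactly 0 and exactly 10 as valid is an assumption, since the procedure only states that negative values and values greater than 10 are rejected.

```python
import math
from typing import Optional


def parse_e_value(text: str) -> Optional[float]:
    """Return the E-value if the input should be accepted, or None if an error is expected."""
    try:
        value = float(text.strip())
    except ValueError:
        return None  # non-numeric input should raise an error message
    if math.isnan(value) or value < 0 or value > 10:
        return None  # negative values and values above 10 should raise an error message
    return value
```

Under this model .01, 1E-1 and 1E-100 are accepted, while a negative number, 11, or a string of letters should each produce the red border and an error message.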
Primary Data Table
Data sorting
o The leftmost column title should be highlighted in blue and show a darker
blue downward pointing triangle after the end of the column name when you
enter the page.
o Click on the leftmost column name. The triangle should point up and the data
ought to have reorganized themselves (assuming there was more than one
entry in the table).
o Proceed to click twice on each column heading and check that the data is
being organized correctly.
¦ When the column entry is a number and you see the downward triangle,
the column should be sorted in descending numerical order, highest
values at the top. When the triangle points up, the list should be sorted
in ascending numerical order, lowest values at the top.
¦ When the column entry is a string of text and you see the downward
triangle, the column should be sorted in alphabetical order. When the
triangle points up, the list should be sorted in reverse alphabetical order.
¦ When the column entry is a date and you see the downward triangle, the
column should be sorted from most to least recent. When the triangle
points up, the list should be sorted from least recent to most recent.
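The sorting rules above can be expressed as a single check on the values read from one column. This is a minimal sketch under stated assumptions: the function name is hypothetical, the values are assumed to have already been converted to numbers, strings, or datetime.date objects, and text is compared case-insensitively, which the procedure does not specify.

```python
from datetime import date


def is_sorted_as_expected(values, triangle: str) -> bool:
    """Check one column against the rule for a 'down' or 'up' triangle.

    Down triangle: numbers descend, dates run most to least recent, text runs A to Z.
    Up triangle: the reverse of each.
    """
    if triangle not in ("down", "up"):
        raise ValueError("triangle must be 'down' or 'up'")
    if not values:
        return True
    text_column = isinstance(values[0], str)
    # Text is the one type whose "down" order is ascending (alphabetical).
    descending = (triangle == "down") != text_column
    keys = [v.casefold() if text_column else v for v in values]
    pairs = list(zip(keys, keys[1:]))
    if descending:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)
```

For example, is_sorted_as_expected([date(2019, 10, 1), date(2018, 1, 1)], "down") holds because the most recent date comes first.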
Navigation
o At the bottom of the table there should be a series of buttons, some with
arrows and some with numbers. The button with the number 1 in it should be
blue and the arrows to the left of it should be greyed out. If the number one is
the only button with numbers in it, then the arrows to the right should be
greyed out too.
o If there are not enough entries for at least three pages of entries, select a
different accession that does have more than three and then resume this
process.
o Click on the button containing the number 2. This should highlight the button
showing 2, and the data shown in the table should change. The arrows on the
left of the buttons with numbers should no longer be greyed out.
o Click the button numbered 1. This should return you to the previous state of
the table.
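The expected state of the navigation buttons can be tabulated from the current page and the total page count. The sketch below is illustrative only; the function name and the dictionary keys are hypothetical labels for what you should observe on screen.

```python
def expected_pager_state(current_page: int, page_count: int) -> dict:
    """Describe which page button is highlighted and which arrow groups are clickable."""
    if page_count < 1 or not 1 <= current_page <= page_count:
        raise ValueError("current_page must lie between 1 and page_count")
    return {
        "highlighted_button": current_page,
        # Arrows to the left are greyed out on the first page.
        "left_arrows_enabled": current_page > 1,
        # Arrows to the right are greyed out on the last (or only) page.
        "right_arrows_enabled": current_page < page_count,
    }
```

On entering a three-page table, expected_pager_state(1, 3) describes button 1 highlighted with only the right arrows active; after clicking 2, both arrow groups should be active.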
Search function
o The search function does NOT consider the following columns when filtering
the data table: Data Version, Protein Count, Species Tax ID, Blast Bitscore,
Ortholog Count, Cut-off, Percent Similarity and Eukaryotes. When entering a
search term from one of these columns, you are looking for evidence that
entries are being filtered out of the data set despite containing the search term
in one of those columns.
o For each column, type in the value you observe in that column in the first
row. Except for the cases listed above, the top row should remain in the table
and any other rows that remain should have the search term somewhere in
their row, most likely in the same column (but not always).
o The search bar should not recognize search terms that span multiple columns.
Test this by entering an accession, then a space, then the Common Name for
that accession. The table should be empty.
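The search behavior described above can be modeled to predict which rows should survive a given search term. This is a sketch of the expected behavior, not the site's implementation: the function name is hypothetical, the rows are assumed to be dicts keyed by column header, and case-insensitive substring matching is an assumption.

```python
# Columns the Level 2 search is stated to ignore.
IGNORED_COLUMNS = frozenset({
    "Data Version", "Protein Count", "Species Tax ID", "Blast Bitscore",
    "Ortholog Count", "Cut-off", "Percent Similarity", "Eukaryotes",
})


def expected_search_result(rows, term: str):
    """Return the rows expected to remain after typing `term` in the search bar."""
    needle = term.casefold()
    kept = []
    for row in rows:
        searchable = (str(v) for c, v in row.items() if c not in IGNORED_COLUMNS)
        # The whole term must sit inside one cell; it never spans two columns.
        if any(needle in cell.casefold() for cell in searchable):
            kept.append(row)
    return kept
```

With this model, an accession followed by a space and its Common Name matches no single cell, so the predicted table is empty, which is the outcome the last step above expects.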
Navigate to visualization
In the new tab, under Level 2 Query Protein Information you should see the
five key pieces of information listed: SeqAPASS ID, Query Protein, Query
Species, Ortholog Count, and Query Domain. The values for these categories
should be the same as they were in the main SeqAPASS page. Confirm that this is
the case.
Select to Open Information or Data visualization
o You should see only a gray box with an information symbol and a small box
and whisker plot. Hover over these icons. They should enlarge and have
captions.
o Click on the Information icon. Nothing should happen.
o Click on the Box Plot icon. After a brief loading screen, the lower box
should change from "Info" to "Box Plot".
o Click on the Information icon again. The lower box should change back to
Info. Click on the Box Plot icon again. The transition to "Box Plot" should
not take significantly longer than the first time the transition occurred.
characters in the search bar should still be present. After this test, delete
whatever term is in the search bar.
o Note the checkbox in the upper left corner. It should be unchecked. Scroll
through the list until you find the group that you removed previously. The
checkbox next to its name should be empty. Click on that checkbox. It
should become checked. You should see the blue box for that group come
back and the graph should update to display that group again. The checkbox
in the upper left corner should become checked, because all boxes should be
checked now.
o Pick a different group with a checkbox that is currently checked and click on
the checkbox. It should become unchecked. You should also see that group's
blue box disappear and the group should be removed from the boxplot. The
box in the upper left should become unchecked.
o Click the checkbox in the upper left corner. It should become checked, as
should the unchecked box and you should see the previously removed
group's blue box return in the list. Click the box again. This should remove
all groups except for the query group. Click it again. All boxes should
become checked and blue boxes return to the list.
o Click the x below the search bar. This should close the window.
o Move down to the Boxplot section of the page. Hover the cursor over some
of the group names. The cursor should change to show an arrow with a
white x in a red circle, and a grey box should appear listing up to three
species from that group. Pick a group, note its name, and then click on it.
This should remove it from the graph. Return to the Taxonomic Groups box
and search for the blue box of the group you removed. It should not be
present. Open the list of groups and search for the name of the group you
deleted. The box next to it should not be checked. Check this box to return to
the original state of the page.
Select species for legend
o Hover the cursor over any white space in the box or the downward facing
arrow on the right side of the box. The cursor should change to a finger and
the arrow highlight blue. Click anywhere in the area outlined above. This
should open a list that is similar to the list seen in the Taxonomic Groups
step.
o Note: Depending on which group was removed, the other species may have
shifted symbols, resulting in a map that looks very different. This is
intended behavior. The order of symbols is constant, the order of species is
the order in which you selected them, and so removing a species causes all
species selected after it to "move up the list" when it is removed.
o Return the group to the boxplot using the Taxonomic Groups list. As soon
as the list is returned, the species that was in that group should
automatically be added to the blue boxes in the Select Species for Legend
area and should reappear on the boxplot.
o Delete the group using the two other methods detailed earlier and confirm
that the results are the same.
o Click on the X in one of the boxes in the Select Species for Legend area.
This should remove it from that area. Open the list. The box next to the
name of the deleted group should now be unchecked.
o Hover over the name of a species in the legend. The cursor should change
to the arrow with an x in a red circle. Click on the name. This should
remove it from the graph. Confirm that it is removed in the Select Species
for Legend area.
o Begin selecting species to be on the legend and continue until the legend
extends off the end of the graph. From this point, retest all forms of single
deletion (by checkbox, by blue box, by name) and then all forms of deletion
by groups.
o Note: There is a historical issue where deleting a group with multiple
species while the legend is crowded will result in a graphical error where
the legend intersects with the graph. Sometimes this effect is temporary,
and other times it has persisted until another species is removed or added to
the legend. This is not intended behavior and should be reported.
o Add test on species crossing over tax groups depending on answer to non-
eukaryotes question.
Species Legend Options
o Close the visualization tab and then click on the "Visualize Data" button
again.
o Note: There is a historical issue where attempting to open the visualization
any time after the first will be extremely slow, showing a white screen for
15-45 seconds before opening. This is not intended behavior and should be reported.
legend should disappear. Hover over each instance of the symbol and
confirm that nothing has changed from the previous time.
Optional Selections
o Click on the checkbox beneath Ortholog Candidates. Red dots should show
up on the graph and the legend should update appropriately. Hover over
some of the red dots. They should show the same information that you
would have seen if they were in the Species Legend. Additionally, confirm
that there is a dot that is on the dotted line crossing the graph (assuming you
have not altered the cut-off). Click on one of the red dots. Nothing should
happen.
o Click on the checkbox beneath Threatened Species. The checkbox beneath
Ortholog Candidates should automatically become unchecked. The dots in
the graph are likely to change position and the inscription next to the red
dot in the legend should read "Threatened Species". Hover over the dots
and confirm that information is displayed as usual. Click on one of the dots.
This should open a new tab containing the ECOS page for that species.
Close this new link.
o Click on the checkbox beneath Endangered Species. The checkbox beneath
Threatened Species should automatically become unchecked. The dots in
the graph should change position and the inscription next to the red dot in
the legend should read "Endangered Species". Hover over the dots and
confirm that information is displayed as usual. Click on one of the dots.
This should open a new tab containing the ECOS page for that species.
Close this new link.
o Click on the checkbox beneath Common Model Organisms. The checkbox
beneath Endangered Species should automatically become unchecked. The
dots in the graph should change position and the inscription next to the red
dot in the legend should read "Common Model Organism". Hover over the
dots and confirm that information is displayed as usual.
o Pick a red dot that has no other red dots close to it. Change which box has
been selected if need be. Hover over that dot and note the organism it
represents. In the Select Species for Legend menu, select that species. You
should see the appropriate legend pop up and the red dot should be replaced
by the new icon. Hover over the new icon. There should be no change in
information displayed.
o Try to type non-numeric characters in the box. This should not work.
o Change the width to 8,300. The height should be 4,944. Click download
image. There will likely be a noticeable delay before the prompt comes up
to save the image. Do not save this image, instead click cancel. Change the
width to 10,000. Click download image. You should get an internal server
error for running out of memory.
o Note: these numbers are not exact, because the limit seems flexible
depending on how large the jump is between consecutively requested
sizes. If the initial 8,300 causes a memory error, choose a number that
is smaller. If 10,000 fails to cause an error, choose something larger.
o Close the tab with the memory error and reopen the visualization. Click the
Download BoxPlot button. If you were not forced to re-request the
accession to open the visualization tab, then the radio button should still be
set to PNG. Change this to JPG. Click Download Image, renaming the file
to indicate the format, and save it to the SOP testing folder.
o Open the file and compare it to the graph on SeqAPASS. There should be
no differences.
o Repeat the process of testing different magnification levels for the JPG
format. This should include both an upscaled and downscaled image.
o After finishing that process, set the width to the value that previously got an
out of memory error, and then switch the radio button to SVG. The boxes
should become greyed out. Click download image. Notice that this should
not cause an out of memory error. Save the file and compare it to the
original SVG file. There should be no differences. Close out of the
Download BoxPlot menu.
Size Controls
o Click the Open Size Controls button. A pop-up menu should appear in the
center of the screen, but this time the rest of the screen should not grey out.
o Confirm that areas outside of the pop-up menu are interactable.
o In the Bar Width area, use the arrows to adjust the number in the box. As the
number increases, you should see the width of the bars in the graph get
larger, and smaller as the number decreases. However, within the confines of
the page, this will manifest as the entire graph becoming longer or shorter,
with the actual width of the bars being relatively unchanged. What you are
effectively doing is changing the aspect ratio of the image. This can be
Boxplot Taxonomic groups
o Reset the Boxplot to its default state. All Taxonomic groups should be
present, no Species selected, Common name selected, and no optional
selections.
o Previously you should have tested that it is possible to delete a single group
from the x axis by clicking on it. You should also be able to perform a
multiple deletion by Test multiple deletion,
o Hover over one of the dashes on the dotted line. The line should become
bold and a text box should appear listing the percent similarity cutoff. Click
on the line. There should be no further effects.
o Pick one of the taxonomic groups and click on the gray bar representing the
spread of percent similarities within that group (It is easier to pick a group
with a large bar. It is possible to click on smaller ones, but for groups with 2
or 3 entries this can become a pixel hunt.) A summary table should pop
open.
o The summary table should list the taxonomic group and how many species
are in that group in the upper left-hand corner. Directly underneath those
should be the mean and median percent similarity values, followed by
Susceptible Y or N, which is the susceptibility prediction for that taxonomic
group based on read-across Y. Beneath those should be a data table like
ones that you have seen before. Download this data table as
#accession#_#Taxonomic group#_Summarylvl2 in both .xls and .csv format.
o Perform the Table Checking steps appropriate to this table. In this case, there
is no search function and no radio buttons.
Level 3
Check the reference explorer
Select at least one entry from the data table and then enter a valid Template
Sequence and a valid Run Name. Enter a nonsense string into Additional
Comparisons. Click "Request Residue Run". You should get an error that clears all
boxes.
Select all entries on the first page of the Primary Data Table, enter the query
accession as the Template Sequence, and use "Test1" as the Run Name. Nothing
should be entered in the Additional Comparisons slot. Click "Request Residue
Run". This request should process without error.
Repeat the previous step except select only the first nine entries in the data table
and use the Run Name "Test2". Copy the accession of the last entry on the first
page and enter it into Additional Comparisons. Click "Request Residue Run". This
request should process without error.
Repeat the process of two steps ago except instead of entering the query accession
into Template Sequence, enter the FASTA for that accession (FASTA can be
found by clicking on the link in the query details at the top of the page) and use the
Run Name "Test3". Click "Request Residue Run". This request should process
without error.
Repeat the process of three steps ago, except select only the first nine entries in the data
table and use the Run Name "Test4". Click on the link in the NCBI Accession
column of the remaining entry to find the associated FASTA and copy that FASTA
to Additional Comparisons. Click "Request Residue Run". This request should
process without error.
Navigate to Run Status level 3 and confirm that all submissions have been
accepted and are running/have completed (All should have completed already. If
one or more have not finished within 20 seconds, something is wrong). Navigate
back to the level 1 report.
Repeat any of the early steps where you submitted a run, using the same Run
Name. You should get an error stating that run names must be unique, clearing all
boxes.
In the dropdown menu under Choose Taxonomic Group(s), select the first entry on
the list. This should cause the search bar in the primary data table to automatically
fill with the taxonomic group name that was selected. Check that this occurs for
each entry on the list, including returning to "All Groups" which should clear the
search bar. In the primary report settings, change the Sorted by Taxonomic Group
value, then return to the Choose Taxonomic Group(s) menu. The list should have
changed to reflect the selected taxonomic hierarchy level. Return to default
settings.
should not change besides the run name which should contain all of the
taxonomic groups.
o Select 5-10 amino acids at random and update the report.
o Make sure that all the information is correct in the Template Protein
Information page. The difference between a single report and a combined
report is the Level 3 Run Name should contain all the species selected.
o Open up the View Level 3 Summary Report. See that all the species are
present as well.
• The table should operate identically to a single level 3 report. Test the function of the
table to make sure that it is functioning correctly and that the amino acids are
present for both the primary and full report.
Primary Data Table
Data sorting
o The leftmost column title should be highlighted in blue and show a darker
blue downward pointing triangle after the end of the column name when you
enter the page.
o Click on the leftmost column name. The triangle should point up and the data
ought to have reorganized themselves (assuming there was more than one
entry in the table).
o Proceed to click twice on each column heading and check that the data is
being organized correctly.
¦ When the column entry is a number and you see the downward triangle,
the column should be sorted in descending numerical order, highest
values at the top. When the triangle points up, the list should be sorted
in ascending numerical order, lowest values at the top.
¦ When the column entry is a string of text and you see the downward
triangle, the column should be sorted in alphabetical order. When the
triangle points up, the list should be sorted in reverse alphabetical order.
¦ When the column entry is a date and you see the downward triangle, the
column should be sorted from most to least recent. When the triangle
points up, the list should be sorted from least recent to most recent.
Navigation
entries per page. Click on the 20 and change it to 50. If there are not enough
entries to fill a full page of 50, simply check that there are more than 20
entries.
Downloads
o Click on the Excel logo after the words "Download Table". A prompt should
open to name the downloaded table. Name this table
Level3_Browser_Check. Open the download and check the top ten rows
against the table on your browser. Click on the CSV icon. The table should
download with a .csv extension. Open the download and check the top ten
rows against the table on your browser.
Search function
o The search function does NOT consider the following columns when filtering
the data table: Data Version, Protein Count, Species Tax ID, Position, Amino
Acid, and Total Match. When entering a search term from one of these
columns, you are looking for evidence that entries are being filtered out of
the data set despite containing the search term in one of those columns.
o For each column, type in the value you observe in that column in the first
row. Except for the cases listed above, the top row should remain in the table
and any other rows that remain should have the search term somewhere in
their row, most likely in the same column (but not always).
o The search bar should not recognize search terms that span multiple columns.
Test this by entering an accession, then a space, then the Common Name for
that accession. The table should be empty.
Level 3 Data Testing
Results for level 3 rely on an NCBI COBALT alignment, not a BLAST query, and
so must be checked separately from level 1 and 2.
In this section the results of a select group of accessions will be tested to ensure
that information is not being corrupted or altered somewhere in the pipeline from query
to downloaded data table.
Preparation
• Earlier in this SOP there was an instruction to submit a query of the accessions in
"accession_list.txt". Hopefully some or all of those runs have completed at this
point. If missed, those accessions are at the beginning of page 4.
• For each accession in that list, request that accession from the View SeqAPASS
Reports tab. Remember that all downloads should be saved to your working
directory (e.g. 'L:/Priv/Bioinformatics Team/SeqAPASS/Saved test data').
o Without altering any settings from the default, download the data table first
as a .xls file, then as a .csv file. These files should be named
"#Accession#_Level1_Primary_EukOnly_New.xls" and
"#Accession#_Level1_Primary_EukOnly_New.csv" respectively.
o Click on the radio button labeled "Full Report", download the data table first
as a .xls file, then as a .csv file. These files should be named
"#Accession#_Level1_Full_EukOnly_New.xls" and
"#Accession#_Level1_Full_EukOnly_New.csv" respectively.
o Deselect the checkbox reading "Show Only Eukaryotes", download the data
table first as a .xls file, then as a .csv file. These files should be named
"#Accession#_Level1_Full_NotEukOnly_New.xls" and
"#Accession#_Level1_Full_NotEukOnly_New.csv" respectively.
o Click on the radio button labeled "Primary Report", download the data table
first as a .xls file, then as a .csv file. These files should be named
"#Accession#_Level1_Primary_NotEukOnly_New.xls" and
"#Accession#_Level1_Primary_NotEukOnly_New.csv" respectively.
• Submit a level 2 run on the first common domain that appears on the common
domain list. This should only take a few seconds, but the page will need to be
refreshed for the run to appear in the Choose Domain to View dropdown menu.
Once it appears, select it and view that data.
o Without altering any settings from the default, download the data table first
as a .xls file, then as a .csv file. These files should be named
"#Accession#_Level2_Primary_EukOnly_New.xls" and
"#Accession#_Level2_Primary_EukOnly_New.csv" respectively.
take somewhere between 1 and 2 minutes. If a more exhaustive search is desired or
required, open the algorithm parameters section and set the Max target sequences
value to a larger number.
For each accession in the Excel file, try to find that accession number in the
BLAST results (This can be expedited by using Ctrl + F to search the page. Be
careful, however, as it can take some time for the search to resolve. BLAST results
are large).
Once the accession number has been located, confirm that the bitscore value from
the Excel table and the bitscore in the BLAST results are similar. Note that they
will not be perfectly identical, as SeqAPASS records values to the hundredths
place and the BLAST web client records them to the nearest whole number, but
they should be within a rounding error.
Continue doing this until one of three things occurs: 1) A difference in bitscore is
found that cannot be reasonably explained by rounding. Contact the SeqAPASS
tech staff and describe the error in detail. 2) An accession does not appear on the
BLAST results page but the associated bitscore is lower than the lowest bitscore
shown in the BLAST results. This is a success and requires no error report. 3) An
accession does not appear on the BLAST results page but the associated bitscore is
NOT lower than the lowest bitscore shown in the BLAST results. Contact the
SeqAPASS tech staff and describe the error in detail, specifying that SeqAPASS is
reporting a result that does not exist in BLAST.
o Note: With BLAST and SeqAPASS not being updated in conjunction with
each other, accessions, percent similarity and BLAST score could be different
from the downloaded data sheet. It is most important to see if the species is
still present in the BLAST data set. To see if the species database has been
updated, see if the species' protein count has been updated recently. This can
give you a better explanation of whether or not there is an error with
SeqAPASS.
If an error was reported, skip the following steps and repeat the above steps with a
new accession.
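The decision made for each accession in the steps above can be laid out as a small function. This is a sketch for illustration, not a SeqAPASS or BLAST tool: the function name, the return strings, and the dict of BLAST hits (assumed to be transcribed by hand from the results page) are hypothetical. The 0.5 tolerance follows from comparing a value recorded to the hundredths place with one recorded to the nearest whole number.

```python
from typing import Dict

# A hundredths-place value and its whole-number rounding differ by at most 0.5.
ROUNDING_TOLERANCE = 0.5 + 1e-9


def classify_bitscore_check(accession: str, seqapass_bitscore: float,
                            blast_hits: Dict[str, float]) -> str:
    """Classify one Excel row against the BLAST results (accession -> bitscore)."""
    if not blast_hits:
        raise ValueError("blast_hits must contain at least one BLAST result")
    if accession in blast_hits:
        if abs(seqapass_bitscore - blast_hits[accession]) <= ROUNDING_TOLERANCE:
            return "match: continue with the next accession"
        return "report: bitscore difference not explained by rounding"
    if seqapass_bitscore < min(blast_hits.values()):
        return "success: accession falls below the lowest BLAST bitscore shown"
    return "report: SeqAPASS result does not exist in BLAST"
```

Keep in mind the note above: because BLAST and SeqAPASS are not updated together, a "report" outcome should first be weighed against whether the species' protein count has changed recently.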
The previous step confirmed the accuracy of the data. Now it must be determined
that each accession found in SeqAPASS is the best possible sequence for its
species.
Return to the BLAST request query page. If the Max target sequences has been
altered, return it to the default value of 100. In the Organism line in the Choose
Search Set section, enter the scientific name of the species in the first row from the
to confirm that an error is truly present, and then the list should be sent on to
SeqAPASS tech staff, noting that the list contains SeqAPASS runs where the xls
file download was different from the csv download.
• Find the element in the Global Environment called emptyMatchList. To the right
of the name will be a description that reads "List of X" where X is a Natural
number. If X is 0, then there are no downloaded files that are missing their
counterpart. If X is not 0, a problem has been located. Click on emptyMatchList to
open a new tab in R that contains a list of all file names that have this problem.
Check first that each file has been correctly downloaded to your working directory,
then check that the name was written correctly. This should solve the problem. If
problems persist, contact SeqAPASS tech staff with a description of the specific
problem being encountered.
• Find the element in the Global Environment called accessionOrTaxGroup. To the
right of the name will be a description that reads "X obs. of 4 variables" where X is
a Natural number. If X is 0, then no pairs of old and new files contain a species
whose best hit protein or taxonomic group has been changed. If X is not 0, at least
one such change has been noted. Click on accessionOrTaxGroup to open a new tab
in R that contains a table with four columns. The first and third columns will
contain file names, while the second and fourth contain row numbers from the file
whose name is to their left. Each row of the table indicates the first discrepancy
that was noticed and gives the coordinates to find it. Keep this tab open as you
move on to the next step.
• Find the element in the Global Environment called bitscoreNotProtCount. To the
right of the name will be a description that reads "X obs. of 4 variables" where X is
a Natural number. If X is 0, then no pairs of old and new files contain a species
where the bitscore of the best hit protein has changed despite the protein count
being constant. If X is not 0, at least one such change has been noted. Click on
bitscoreNotProtCount to open a new tab in R that contains a table with four
columns. The first and third columns will contain file names, while the second and
fourth contain row numbers from the file whose name is to their left. Each row of
the table indicates the first discrepancy that was noticed and gives the coordinates
to find it. Keep this tab open as you move on to the next step.
• Find the element in the Global Environment called susPrediction. To the right of
the name will be a description that reads "X obs. of 4 variables" where X is a
Natural number. If X is 0, then no pairs of old and new files contain a species
whose susceptibility prediction has changed. If X is not 0, at least one such change
has been noted. Click on susPrediction to open a new tab in R that contains a table
then move on to the next row pair. If the discrepancy is inexplicable, contact
SeqAPASS tech staff with a clear description of both the inexplicable
discrepancy and the names of the files involved.
Find the element in the Global Environment called
deepDiveBitscoreNotProtGroup. To the right of the name will be a description
that reads "X obs. of 2 variables" where X is a Natural number. If X is 0, then
the two files contain no species where the bitscore of the best hit protein has
changed despite the protein count being constant. If X is not 0, at least one such
change has
been noted. The first column is a row number in the file whose name you
entered in the file1 space, and the second column is a row number in the file
whose name you entered in the file2 space. Each row in
deepDiveBitscoreNotProtGroup thus specifies a pair of rows in the pair of files
entered in the function where the old and new files contain a species where the
bitscore of the best hit protein has changed despite the protein count being
constant. This list is exhaustive, meaning that every such discrepancy is
recorded.
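
As a rough illustration of the kind of comparison this table records, the sketch below pairs rows by scientific name and keeps the pairs whose protein count is unchanged but whose bitscore differs. It is not the deep.dive function from the source code listing: the helper name find_bitscore_changes, the tolerance, and the toy data are hypothetical, and only the column names are taken from the listing.

```r
# Hypothetical helper, not part of the SOP script: list every row pair where
# the species matches, the protein count is equal, and the bitscore differs.
find_bitscore_changes <- function(old, new, tol = 1e-8) {
  # Remember each row's position so the result can point back into the files
  old$row1 <- seq_len(nrow(old))
  new$row2 <- seq_len(nrow(new))
  both <- merge(old, new, by = "Scientific.Name", suffixes = c(".old", ".new"))
  same_count <- both$Protein.Count.old == both$Protein.Count.new
  score_moved <- abs(both$BLASTp.Bitscore.old - both$BLASTp.Bitscore.new) > tol
  hits <- both[same_count & score_moved, c("row1", "row2")]
  hits <- hits[order(hits$row1), ]
  rownames(hits) <- NULL
  hits
}

# Toy data (assumed values, for illustration only)
old <- data.frame(Scientific.Name = c("Homo sapiens", "Danio rerio", "Apis mellifera"),
                  Protein.Count = c(10, 4, 2),
                  BLASTp.Bitscore = c(950.5, 610.2, 300.0),
                  stringsAsFactors = FALSE)
new <- data.frame(Scientific.Name = c("Danio rerio", "Homo sapiens", "Apis mellifera"),
                  Protein.Count = c(4, 10, 3),
                  BLASTp.Bitscore = c(615.0, 950.5, 310.0),
                  stringsAsFactors = FALSE)
changes <- find_bitscore_changes(old, new)
print(changes)
```

In this toy case only one species keeps its protein count while its bitscore moves, so the result has a single row. Row numbers here are data frame positions; they may be offset from the row numbers seen in Excel when the file has a header row.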
Open the corresponding files in Excel and examine every pair of rows listed in
deepDiveBitscoreNotProtGroup. If the discrepancy can be explained by
knowing that the newer file is expected to equal or exceed the older file in terms
of accuracy of information, number of protein hits, and quality of protein hits,
then move on to the next row pair. If the discrepancy is inexplicable, contact
SeqAPASS tech staff with a clear description of both the inexplicable
discrepancy and the names of the files involved.
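
If it is more convenient to review the listed row pairs in R before opening Excel, something like the following sketch could place each pair side by side. The helper show_row_pairs and the toy data are hypothetical and are not part of the SOP script; in practice the pairs argument would be one of the deep dive tables and the two data frames would be read from the files named in file1 and file2.

```r
# Hypothetical helper: for each listed pair, place the row from the older
# file next to the row from the newer file.
show_row_pairs <- function(pairs, file1, file2) {
  left <- file1[pairs[[1]], , drop = FALSE]
  right <- file2[pairs[[2]], , drop = FALSE]
  names(left) <- paste0(names(left), ".old")
  names(right) <- paste0(names(right), ".new")
  out <- cbind(left, right)
  rownames(out) <- NULL
  out
}

# Toy data (assumed values, for illustration only)
file1 <- data.frame(Scientific.Name = c("Homo sapiens", "Danio rerio"),
                    Susceptibility.Prediction = c("Y", "N"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Scientific.Name = c("Danio rerio", "Homo sapiens"),
                    Susceptibility.Prediction = c("Y", "Y"),
                    stringsAsFactors = FALSE)
pairs <- data.frame(firstRow1 = 2, firstRow2 = 1)
side_by_side <- show_row_pairs(pairs, file1, file2)
print(side_by_side)
```

Each output row then shows the old and new values for one species, which makes it easier to judge whether a change is consistent with the newer file being at least as complete as the older one.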
Find the element in the Global Environment called
deepDiveSusPredictionGroup. To the right of the name will be a description
that reads "X obs. of 2 variables" where X is a Natural number. If X is 0, then
the two files contain no species whose susceptibility prediction has changed.
If X is not 0, at least one such change has
been noted. The first column is a row number in the file whose name you
entered in the file1 space, and the second column is a row number in the file
whose name you entered in the file2 space. Each row in
deepDiveSusPredictionGroup thus specifies a pair of rows in the pair of files
entered in the function where the old and new files contain a species whose
susceptibility prediction has changed. This list is exhaustive, meaning that
every such discrepancy is recorded.
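
The X shown in the Environment pane is the number of rows for a table ("X obs. of ...") or the number of elements for a list ("List of X"), so the same check can be made at the console. A minimal sketch, assuming placeholder objects that stand in for the results of the testing script; the helper count_entries is hypothetical.

```r
# Placeholder results standing in for the objects created by the testing script
deepDiveSusPredictionGroup <- data.frame(firstRow1 = c(3, 7), firstRow2 = c(3, 8))
emptyMatchList <- list()

# Hypothetical helper: number of recorded entries
# (rows for tables, elements for lists)
count_entries <- function(obj) {
  if (is.data.frame(obj)) nrow(obj) else length(obj)
}

results <- list(deepDiveSusPredictionGroup = deepDiveSusPredictionGroup,
                emptyMatchList = emptyMatchList)
counts <- vapply(results, count_entries, integer(1))
print(counts)

# Names of the objects that need follow-up because X is not 0
needs_review <- names(counts)[counts > 0]
print(needs_review)
```

Any object named in needs_review would then be opened and examined as described in the surrounding steps.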
Open the corresponding files in Excel and examine every pair of rows listed in
deepDiveSusPredictionGroup. If the discrepancy can be explained by knowing
R Source Code for Data Testing
Source Code for SeqAPASS Data Testing
##############################################
#
# Author: Colin P. Finnegan
# Oak Ridge Institute for Science and Education (ORISE)
# US EPA
# Office of Research and Development
# Center for Computational Toxicology and Exposure
#
# Version: 1.0.0 October 8, 2019 Initial Write
#
#
# Purpose: Comparison of new and old data versions upon update of SeqAPASS backend with NCBI data
##############################################
##############################################
# SET-UP
##############################################
#Uncomment below to install this package if it hasn't been installed on your system
# install.packages('XLConnect')
library(XLConnect)
#Change working directory to where files are located
#Set by default to 'L:/Priv/Bioinformatics Team/SeqAPASS/Saved test data'.
setwd('L:/Priv/Bioinformatics Team/SeqAPASS/Saved test data')
#Helper function capable of evaluating floating point equality (to a small margin of error)
#in two lists of floating point values. Function assumes the fifth column of the provided
#lists will contain floating points.
float.check <- function(x,y){
  value <- TRUE
  for (i in (1:length(x[,5]))){
    value <- value & isTRUE(all.equal(as.numeric(x[i,5]),as.numeric(y[i,5])))
  }
  #
  value
}
##############################################
#BEGIN INITIAL CHECK
#Set the lists of testing filenames. If new accessions are added to the
#testing procedure, add the names of the save files to the appropriate list
"NP_524699.1_Level2_Full_EukOnly_Old.csv", "NP_524699.1_Level2_Full_NotEukOnly_Old.csv",
"WP_001138045.1_Level1_Primary_EukOnly_Old.csv", "WP_001138045.1_Level1_Primary_NotEukOnly_Old.csv",
"WP_001138045.1_Level1_Full_EukOnly_Old.csv", "WP_001138045.1_Level1_Full_NotEukOnly_Old.csv",
"WP_001138045.1_Level2_Primary_EukOnly_Old.csv", "WP_001138045.1_Level2_Primary_NotEukOnly_Old.csv",
"WP_001138045.1_Level2_Full_EukOnly_Old.csv", "WP_001138045.1_Level2_Full_NotEukOnly_Old.csv",
"NP_001032915.1_Level1_Primary_EukOnly_Old.csv", "NP_001032915.1_Level1_Primary_NotEukOnly_Old.csv",
"NP_001032915.1_Level1_Full_EukOnly_Old.csv", "NP_001032915.1_Level1_Full_NotEukOnly_Old.csv",
"NP_001032915.1_Level2_Primary_EukOnly_Old.csv", "NP_001032915.1_Level2_Primary_NotEukOnly_Old.csv",
"NP_001032915.1_Level2_Full_EukOnly_Old.csv", "NP_001032915.1_Level2_Full_NotEukOnly_Old.csv",
"CAB41615.1_Level1_Primary_EukOnly_Old.csv", "CAB41615.1_Level1_Primary_NotEukOnly_Old.csv",
"CAB41615.1_Level1_Full_EukOnly_Old.csv", "CAB41615.1_Level1_Full_NotEukOnly_Old.csv",
"CAB41615.1_Level2_Primary_EukOnly_Old.csv", "CAB41615.1_Level2_Primary_NotEukOnly_Old.csv",
"CAB41615.1_Level2_Full_EukOnly_Old.csv", "CAB41615.1_Level2_Full_NotEukOnly_Old.csv",
"NP_001062.1_Level1_Primary_EukOnly_Old.csv", "NP_001062.1_Level1_Primary_NotEukOnly_Old.csv",
"NP_001062.1_Level1_Full_EukOnly_Old.csv", "NP_001062.1_Level1_Full_NotEukOnly_Old.csv",
"NP_001062.1_Level2_Primary_EukOnly_Old.c...nly_Old.csv", "ACD44939.1_Level1_Primary_NotEukOnly_Old.csv",
"ACD44939.1_Level1_Full_EukOnly_Old.csv", "ACD44939.1_Level1_Full_NotEukOnly_Old.csv",
"ACD44939.1_Level2_Primary_EukOnly_Old.csv", "ACD44939.1_Level2_Primary_NotEukOnly_Old.csv",
"ACD44939.1_Level2_Full_EukOnly_Old.csv", "ACD44939.1_Level2_Full_NotEukOnly_Old.csv",
"NP_037166.2_Level1_Primary_EukOnly_Old.csv", "NP_037166.2_Level1_Primary_NotEukOnly_Old.csv",
"NP_037166.2_Level1_Full_EukOnly_Old.csv", "NP_037166.2_Level1_Full_NotEukOnly_Old.csv",
"NP_037166.2_Level2_Primary_EukOnly_Old.csv", "NP_037166.2_Level2_Primary_NotEukOnly_Old.csv",
"NP_037166.2_Level2_Full_EukOnly_Old.csv", "NP_037166.2_Level2_Full_NotEukOnly_Old.csv",
"BAA84101.1_Level1_Primary_EukOnly_Old.csv", "BAA84101.1_Level1_Primary_NotEukOnly_Old.csv",
"NP_001032915.1_Level1_Full_EukOnly_New.csv", "NP_001032915.1_Level1_Full_NotEukOnly_New.csv",
"NP_001032915.1_Level2_Primary_EukOnly_New.csv", "NP_001032915.1_Level2_Primary_NotEukOnly_New.csv",
"NP_001032915.1_Level2_Full_EukOnly_New.csv", "NP_001032915.1_Level2_Full_NotEukOnly_New.csv",
"CAB41615.1_Level1_Primary_EukOnly_New.csv", "CAB41615.1_Level1_Primary_NotEukOnly_New.csv",
"CAB41615.1_Level1_Full_EukOnly_New.csv", "CAB41615.1_Level1_Full_NotEukOnly_New.csv",
"CAB41615.1_Level2_Primary_EukOnly_New.csv", "CAB41615.1_Level2_Primary_NotEukOnly_New.csv",
"CAB41615.1_Level2_Full_EukOnly_New.csv", "CAB41615.1_Level2_Full_NotEukOnly_New.csv",
"NP_001062.1_Level1_Primary_EukOnly_New.csv", "NP_001062.1_Level1_Primary_NotEukOnly_New.csv",
"NP_001062.1_Level1_Full_EukOnly_New.csv", "NP_001062.1_Level1_Full_NotEukOnly_New.csv",
"NP_001062.1_Level2_Primary_EukOnly_New.csv", "NP_001062.1_Level2_Primary_NotEukOnly_New.csv",
"NP_001062.1_Level2_Full_EukOnly_New.csv", "NP_001062.1_Level2_Full_NotEukOnly_New.csv",
"ABC68616.1_Level1_Primary_EukOnly_New.csv", "ABC68616.1_Level1_Primary_NotEukOnly_New.csv",
"ABC68616.1_Level1_Full_EukOnly_New.csv", "ABC68616.1_Level1_Full_NotEukOnly_New.csv",
"ABC68616.1_Level2_Primary_EukOnly_New.csv", "ABC68616.1_Level2_Primary_NotEukOnly_New.csv",
"ABC68616.1_Level2_Full_EukOnly_New.csv", "ABC68616.1_Level2_Full_NotEukOnly_New.csv",
"NP_034125.2_Level1_Primary_EukOnly_New.csv", "NP_034125.2_Level1_Primary_NotEukOnly_New.csv",
"NP_034125.2_Level1_Full_EukOnly_New.csv", "NP_034125.2_Level1_Full_NotEukOnly_New.csv",
"NP_034125.2_Level2_Primary_EukOnly_New.csv", "NP_034125.2_Level2_Primary_NotEukOnly_New.csv",
"NP_034125.2_Level2_Full_EukOnly_New.csv", "NP_034125.2_Level2_Full_NotEukOnly_New.csv",
"NP_001117847.1_Level1_Primary_EukOnly_New.csv", "NP_001117847.1_Level1_Primary_NotEukOnly_New.csv",
"NP_001117847.1_Level1_Full_EukOnly_New.csv", "NP_001117847.1_Level1_Full_NotEukOnly_New.csv",
"NP_001117847.1_Level2_Primary_EukOnly_New.csv", "NP_001117847.1_Level2_Primary_NotEukOnly_New.csv",
"NP_001117847.1_Level2_Full_EukOnly_New.csv", "NP_001117847.1_Level2_Full_NotEukOnly_New.csv",
"NP_001009476.1_Level1_Primary_EukOnly_New.csv", "NP_001009476.1_Level1_Primary_NotEukOnly_New.csv",
"NP_001009476.1_Level1_Full_EukOnly_New.csv", "NP_001009476.1_Level1_Full_NotEukOnly_New.csv",
"NP_001009476.1_Level2_Primary_EukOnly_New.csv", "NP_001009476.1_Level2_Primary_NotEukOnly_New.csv",
"NP_001009476.1_Level2_Full_EukOnly_New.csv", "NP_001009476.1_Level2_Full_NotEukOnly_New.csv",
"AHI15744.1_Level1_Primary_EukOnly_New.csv", "AHI15744.1_Level1_Primary_NotEukOnly_New.csv",
"AHI15744.1_Level1_Full_EukOnly_New.csv", "AHI15744.1_Level1_Full_NotEukOnly_New.csv",
"AHI15744.1_Level2_Primary_EukOnly_New.csv", "AHI15744.1_Level2_Primary_NotEukOnly_New.csv",
"AHI15744.1_Level2_Full_EukOnly_New.csv", "AHI15744.1_Level2_Full_NotEukOnly_New.csv",
"CAC38767.1_Level1_Primary_EukOnly_New.csv", "CAC38767.1_Level1_Primary_NotEukOnly_New.csv",
"CAC38767.1_Level1_Full_EukOnly_New.csv", "CAC38767.1_Level1_Full_NotEukOnly_New.csv",
"CAC38767.1_Level2_Primary_EukOnly_New.csv", "CAC38767.1_Level2_Primary_NotEukOnly_New.csv",
"CAC38767.1_Level2_Full_EukOnly_New.csv", "CAC38767.1_Level2_Full_NotEukOnly_New.csv",
"NP_001296002.1_Level1_Primary_EukOnly_New.csv", "NP_001296002.1_Level1_Primary_NotEukOnly_New.csv",
"NP_001296002.1_Level1_Full_EukOnly_New.csv", "NP_001296002.1_Level1_Full_NotEukOnly_New.csv",
"NP_001296002.1_Level2_Primary_EukOnly_New.csv", "NP_001296002.1_Level2_Primary_NotEukOnly_New.csv",
"NP_001296002.1_Level2_Full_EukOnly_New.csv", "NP_001296002.1_Level2_Full_NotEukOnly_New.csv",
"WP_000529945.1_Level1_Primary_EukOnly_New.csv", "WP_000529945.1_Level1_Primary_NotEukOnly_New.csv",
"WP_000529945.1_Level1_Full_EukOnly_New.csv", "WP_000529945.1_Level1_Full_NotEukOnly_New.csv",
"WP_000529945.1_Level2_Primary_EukOnly_New.csv", "WP_000529945.1_Level2_Primary_NotEukOnly_New.csv",
"WP_000529945.1_Level2_Full_EukOnly_New.csv", "WP_000529945.1_Level2_Full_NotEukOnly_New.csv",
"ACD44939.1_Level1_Primary_EukOnly_New.csv", "ACD44939.1_Level1_Primary_NotEukOnly_New.csv",
"ACD44939.1_Level1_Full_EukOnly_New.csv", "ACD44939.1_Level1_Full_NotEukOnly_New.csv",
"ACD44939.1_Level2_Primary_EukOnly_New.csv", "ACD44939.1_Level2_Primary_NotEukOnly_New.csv",
"ACD44939.1_Level2_Full_EukOnly_New.csv", "ACD44939.1_Level2_Full_NotEukOnly_New.csv",
"NP_037166.2_Level1_Primary_EukOnly_New.csv", "NP_037166.2_Level1_Primary_NotEukOnly_New.csv",
"NP_037166.2_Level1_Full_EukOnly_New.csv", "NP_037166.2_Level1_Full_NotEukOnly_New.csv",
"NP_037166.2_Level2_Primary_EukOnly_New.csv", "NP_037166.2_Level2_Primary_NotEukOnly_New.csv",
"NP_037166.2_Level2_Full_EukOnly_New.csv", "NP_037166.2_Level2_Full_NotEukOnly_New.csv",
"BAA84101.1_Level1_Primary_EukOnly_New.csv", "BAA84101.1_Level1_Primary_NotEukOnly_New.csv",
"BAA84101.1_Level1_Full_EukOnly_New.csv", "BAA84101.1_Level1_Full_NotEukOnly_New.csv",
"BAA84101.1_Level2_Primary_EukOnly_New.csv", "BAA84101.1_Level2_Primary_NotEukOnly_New.csv",
"BAA84101.1_Level2_Full_EukOnly_New.csv", "BAA84101.1_Level2_Full_NotEukOnly_New.csv",
"WP_001180963.1_Level1_Primary_EukOnly_New.csv", "WP_001180963.1_Level1_Primary_NotEukOnly_New.csv",
"WP_001180963.1_Level1_Full_EukOnly_New.csv", "WP_001180963.1_Level1_Full_NotEukOnly_New.csv",
"WP_001180963.1_Level2_Primary_EukOnly_New.csv", "WP_001180963.1_Level2_Primary_NotEukOnly_New.csv",
"CAB41615.1_Level2_Full_EukOnly_Old.xls", "CAB41615.1_Level2_Full_NotEukOnly_Old.xls",
"NP_001062.1_Level1_Primary_EukOnly_Old.xls", "NP_001062.1_Level1_Primary_NotEukOnly_Old.xls",
"NP_001062.1_Level1_Full_EukOnly_Old.xls", "NP_001062.1_Level1_Full_NotEukOnly_Old.xls",
"NP_001062.1_Level2_Primary_EukOnly_Old.xls", "NP_001062.1_Level2_Primary_NotEukOnly_Old.xls",
"NP_001062.1_Level2_Full_EukOnly_Old.xls", "NP_001062.1_Level2_Full_NotEukOnly_Old.xls",
"ABC68616.1_Level1_Primary_EukOnly_Old.xls", "ABC68616.1_Level1_Primary_NotEukOnly_Old.xls",
"ABC68616.1_Level1_Full_EukOnly_Old.xls", "ABC68616.1_Level1_Full_NotEukOnly_Old.xls",
"ABC68616.1_Level2_Primary_EukOnly_Old.xls", "ABC68616.1_Level2_Primary_NotEukOnly_Old.xls",
"ABC68616.1_Level2_Full_EukOnly_Old.xls", "ABC68616.1_Level2_Full_NotEukOnly_Old.xls",
"NP_034125.2_Level1_Primary_EukOnly_Old.xls", "NP_034125.2_Level1_Primary_NotEukOnly_Old.xls",
"NP_034125.2_Level1_Full_EukOnly_Old.xls", "NP_034125.2_Level1_Full_NotEukOnly_Old.xls",
"NP_034125.2_Level2_Primary_EukOnly_Old.xls", "NP_034125.2_Level2_Primary_NotEukOnly_Old.xls",
"NP_034125.2_Level2_Full_EukOnly_Old.xls", "NP_034125.2_Level2_Full_NotEukOnly_Old.xls",
"NP_001117847.1_Level1_Primary_EukOnly_Old.xls", "NP_001117847.1_Level1_Primary_NotEukOnly_Old.xls",
"NP_001117847.1_Level1_Full_EukOnly_Old.xls", "NP_001117847.1_Level1_Full_NotEukOnly_Old.xls",
"NP_001117847.1_Level2_Primary_EukOnly_Old.xls", "NP_001117847.1_Level2_Primary_NotEukOnly_Old.xls",
"NP_001117847.1_Level2_Full_EukOnly_Old.xls", "NP_001117847.1_Level2_Full_NotEukOnly_Old.xls",
"NP_001009476.1_Level1_Primary_EukOnly_Old.xls", "NP_001009476.1_Level1_Primary_NotEukOnly_Old.xls",
"NP_001009476.1_Level1_Full_EukOnly_Old.xls", "NP_001009476.1_Level1_Full_NotEukOnly_Old.xls",
"NP_001009476.1_Level2_Primary_EukOnly_Old.xls", "NP_001009476.1_Level2_Primary_NotEukOnly_Old.xls",
"NP_001009476.1_Level2_Full_EukOnly_Old.xls", "NP_001009476.1_Level2_Full_NotEukOnly_Old.xls",
"AHI15744.1_Level1_Primary_EukOnly_Old.xls", "AHI15744.1_Level1_Primary_NotEukOnly_Old.xls",
"AHI15744.1_Level1_Full_EukOnly_Old.xls", "AHI15744.1_Level1_Full_NotEukOnly_Old.xls",
"AHI15744.1_Level2_Primary_EukOnly_Old.xls", "AHI15744.1_Level2_Primary_NotEukOnly_Old.xls",
"AHI15744.1_Level2_Full_EukOnly_Old.xls", "AHI15744.1_Level2_Full_NotEukOnly_Old.xls",
"CAC38767.1_Level1_Primary_EukOnly_Old.xls", "CAC38767.1_Level1_Primary_NotEukOnly_Old.xls",
"CAC38767.1_Level1_Full_EukOnly_Old.xls", "CAC38767.1_Level1_Full_NotEukOnly_Old.xls",
"CAC38767.1_Level2_Primary_EukOnly_Old.xls", "CAC38767.1_Level2_Primary_NotEukOnly_Old.xls",
"CAC38767.1_Level2_Full_EukOnly_Old.xls", "CAC38767.1_Level2_Full_NotEukOnly_Old.xls",
"NP_001296002.1_Level1_Primary_EukOnly_Old.xls", "NP_001296002.1_Level1_Primary_NotEukOnly_Old.xls",
"NP_001296002.1_Level1_Full_EukOnly_Old.xls", "NP_001296002.1_Level1_Full_NotEukOnly_Old.xls",
"NP_001296002.1_Level2_Primary_EukOnly_Old.xls", "NP_001296002.1_Level2_Primary_NotEukOnly_Old.xls",
"NP_001296002.1_Level2_Full_EukOnly_Old.xls", "NP_001296002.1_Level2_Full_NotEukOnly_Old.xls",
"WP_000529945.1_Level1_Primary_EukOnly_Old.xls", "WP_000529945.1_Level1_Primary_NotEukOnly_Old.xls",
"WP_000529945.1_Level1_Full_EukOnly_Old.xls", "WP_000529945.1_Level1_Full_NotEukOnly_Old.xls",
"WP_000529945.1_Level2_Primary_EukOnly_Old.xls", "WP_000529945.1_Level2_Primary_NotEukOnly_Old.xls",
"WP_000529945.1_Level2_Full_EukOnly_Old.xls", "WP_000529945.1_Level2_Full_NotEukOnly_Old.xls",
"ACD44939.1_Level1_Primary_EukOnly_Old.xls", "ACD44939.1_Level1_Primary_NotEukOnly_Old.xls",
"ACD44939.1_Level1_Full_EukOnly_Old.xls", "ACD44939.1_Level1_Full_NotEukOnly_Old.xls",
"ACD44939.1_Level2_Primary_EukOnly_Old.xls", "ACD44939.1_Level2_Primary_NotEukOnly_Old.xls",
"ACD44939.1_Level2_Full_EukOnly_Old.xls", "ACD44939.1_Level2_Full_NotEukOnly_Old.xls",
"NP_037166.2_Level1_Primary_EukOnly_Old.xls", "NP_037166.2_Level1_Primary_NotEukOnly_Old.xls",
"NP_037166.2_Level1_Full_EukOnly_Old.xls", "NP_037166.2_Level1_Full_NotEukOnly_Old.xls",
"NP_037166.2_Level2_Primary_EukOnly_Old.xls", "NP_037166.2_Level2_Primary_NotEukOnly_Old.xls",
"NP_037166.2_Level2_Full_EukOnly_Old.xls", "NP_037166.2_Level2_Full_NotEukOnly_Old.xls",
"BAA84101.1_Level1_Primary_EukOnly_Old.xls", "BAA84101.1_Level1_Primary_NotEukOnly_Old.xls",
"BAA84101.1_Level1_Full_EukOnly_Old.xls", "BAA84101.1_Level1_Full_NotEukOnly_Old.xls",
"BAA84101.1_Level2_Primary_EukOnly_Old.xls", "BAA84101.1_Level2_Primary_NotEukOnly_Old.xls",
"BAA84101.1_Level2_Full_EukOnly_Old.xls", "BAA84101.1_Level2_Full_NotEukOnly_Old.xls",
"WP_001180963.1_Level1_Primary_EukOnly_Old.xls", "WP_001180963.1_Level1_Primary_NotEukOnly_Old.xls",
"WP_001180963.1_Level1_Full_EukOnly_Old.xls", "WP_001180963.1_Level1_Full_NotEukOnly_Old.xls",
"WP_001180963.1_Level2_Primary_EukOnly_Old.xls", "WP_001180963.1_Level2_Primary_NotEukOnly_Old.xls",
"WP_001180963.1_Level2_Full_EukOnly_Old.xls", "WP_001180963.1_Level2_Full_NotEukOnly_Old.xls",
"NP_001171681.1_Level1_Primary_EukOnly_Old.xls", "NP_001171681.1_Level1_Primary_NotEukOnly_Old.xls",
"NP_001171681.1_Level1_Full_EukOnly_Old.xls", "NP_001171681.1_Level1_Full_NotEukOnly_Old.xls",
"NP_001171681.1_Level2_Primary_EukOnly_Old.xls", "NP_001171681.1_Level2_Primary_NotEukOnly_Old.xls",
"NP_001171681.1_Level2_Full_EukOnly_Old.xls", "NP_001171681.1_Level2_Full_NotEukOnly_Old.xls",
"NP_001123908.1_Level1_Primary_EukOnly_Old.xls", "NP_001123908.1_Level1_Primary_NotEukOnly_Old.xls",
"ABC68616.1_Level1_Full_EukOnly_New.xls", "ABC68616.1_Level1_Full_NotEukOnly_New.xls",
"ABC68616.1_Level2_Primary_EukOnly_New.xls", "ABC68616.1_Level2_Primary_NotEukOnly_New.xls",
"ABC68616.1_Level2_Full_EukOnly_New.xls", "ABC68616.1_Level2_Full_NotEukOnly_New.xls",
"NP_034125.2_Level1_Primary_EukOnly_New.xls", "NP_034125.2_Level1_Primary_NotEukOnly_New.xls",
"NP_034125.2_Level1_Full_EukOnly_New.xls", "NP_034125.2_Level1_Full_NotEukOnly_New.xls",
"NP_034125.2_Level2_Primary_EukOnly_New.xls", "NP_034125.2_Level2_Primary_NotEukOnly_New.xls",
"NP_034125.2_Level2_Full_EukOnly_New.xls", "NP_034125.2_Level2_Full_NotEukOnly_New.xls",
"NP_001117847.1_Level1_Primary_EukOnly_New.xls", "NP_001117847.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001117847.1_Level1_Full_EukOnly_New.xls", "NP_001117847.1_Level1_Full_NotEukOnly_New.xls",
"NP_001117847.1_Level2_Primary_EukOnly_New.xls", "NP_001117847.1_Level2_Primary_NotEukOnly_New.xls",
"NP_001117847.1_Level2_Full_EukOnly_New.xls", "NP_001117847.1_Level2_Full_NotEukOnly_New.xls",
"NP_001009476.1_Level1_Primary_EukOnly_New.xls", "NP_001009476.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001009476.1_Level1_Full_EukOnly_New.xls", "NP_001009476.1_Level1_Full_NotEukOnly_New.xls",
"NP_001009476.1_Level2_Primary_EukOnly_New.xls", "NP_001009476.1_Level2_Primary_NotEukOnly_New.xls",
"NP_001009476.1_Level2_Full_EukOnly_New.xls", "NP_001009476.1_Level2_Full_NotEukOnly_New.xls",
"AHI15744.1_Level1_Primary_EukOnly_New.xls", "AHI15744.1_Level1_Primary_NotEukOnly_New.xls",
"AHI15744.1_Level1_Full_EukOnly_New.xls", "AHI15744.1_Level1_Full_NotEukOnly_New.xls",
"AHI15744.1_Level2_Primary_EukOnly_New.xls", "AHI15744.1_Level2_Primary_NotEukOnly_New.xls",
"AHI15744.1_Level2_Full_EukOnly_New.xls", "AHI15744.1_Level2_Full_NotEukOnly_New.xls",
"CAC38767.1_Level1_Primary_EukOnly_New.xls", "CAC38767.1_Level1_Primary_NotEukOnly_New.xls",
"CAC38767.1_Level1_Full_EukOnly_New.xls", "CAC38767.1_Level1_Full_NotEukOnly_New.xls",
"CAC38767.1_Level2_Primary_EukOnly_New.xls", "CAC38767.1_Level2_Primary_NotEukOnly_New.xls",
"CAC38767.1_Level2_Full_EukOnly_New.xls", "CAC38767.1_Level2_Full_NotEukOnly_New.xls",
"NP_001296002.1_Level1_Primary_EukOnly_New.xls", "NP_001296002.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001296002.1_Level1_Full_EukOnly_New.xls", "NP_001296002.1_Level1_Full_NotEukOnly_New.xls",
"NP_001296002.1_Level2_Primary_EukOnly_New.xls", "NP_001296002.1_Level2_Primary_NotEukOnly_New.xls",
"NP_001296002.1_Level2_Full_EukOnly_New.xls", "NP_001296002.1_Level2_Full_NotEukOnly_New.xls",
"WP_000529945.1_Level1_Primary_EukOnly_New.xls", "WP_000529945.1_Level1_Primary_NotEukOnly_New.xls",
"WP_000529945.1_Level1_Full_EukOnly_New.xls", "WP_000529945.1_Level1_Full_NotEukOnly_New.xls",
"WP_000529945.1_Level2_Primary_EukOnly_New.xls", "WP_000529945.1_Level2_Primary_NotEukOnly_New.xls",
"WP_000529945.1_Level2_Full_EukOnly_New.xls", "WP_000529945.1_Level2_Full_NotEukOnly_New.xls",
"ACD44939.1_Level1_Primary_EukOnly_New.xls", "ACD44939.1_Level1_Primary_NotEukOnly_New.xls",
"ACD44939.1_Level1_Full_EukOnly_New.xls", "ACD44939.1_Level1_Full_NotEukOnly_New.xls",
"ACD44939.1_Level2_Primary_EukOnly_New.xls", "ACD44939.1_Level2_Primary_NotEukOnly_New.xls",
"ACD44939.1_Level2_Full_EukOnly_New.xls", "ACD44939.1_Level2_Full_NotEukOnly_New.xls",
"NP_037166.2_Level1_Primary_EukOnly_New.xls", "NP_037166.2_Level1_Primary_NotEukOnly_New.xls",
"NP_037166.2_Level1_Full_EukOnly_New.xls", "NP_037166.2_Level1_Full_NotEukOnly_New.xls",
"NP_037166.2_Level2_Primary_EukOnly_New.xls", "NP_037166.2_Level2_Primary_NotEukOnly_New.xls",
"NP_037166.2_Level2_Full_EukOnly_New.xls", "NP_037166.2_Level2_Full_NotEukOnly_New.xls",
"BAA84101.1_Level1_Primary_EukOnly_New.xls", "BAA84101.1_Level1_Primary_NotEukOnly_New.xls",
"BAA84101.1_Level1_Full_EukOnly_New.xls", "BAA84101.1_Level1_Full_NotEukOnly_New.xls",
"BAA84101.1_Level2_Primary_EukOnly_New.xls", "BAA84101.1_Level2_Primary_NotEukOnly_New.xls",
"BAA84101.1_Level2_Full_EukOnly_New.xls", "BAA84101.1_Level2_Full_NotEukOnly_New.xls",
"WP_001180963.1_Level1_Primary_EukOnly_New.xls", "WP_001180963.1_Level1_Primary_NotEukOnly_New.xls",
"WP_001180963.1_Level1_Full_EukOnly_New.xls", "WP_001180963.1_Level1_Full_NotEukOnly_New.xls",
"WP_001180963.1_Level2_Primary_EukOnly_New.xls", "WP_001180963.1_Level2_Primary_NotEukOnly_New.xls",
"WP_001180963.1_Level2_Full_EukOnly_New.xls", "WP_001180963.1_Level2_Full_NotEukOnly_New.xls",
"NP_001171681.1_Level1_Primary_EukOnly_New.xls", "NP_001171681.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001171681.1_Level1_Full_EukOnly_New.xls", "NP_001171681.1_Level1_Full_NotEukOnly_New.xls",
"NP_001171681.1_Level2_Primary_EukOnly_New.xls", "NP_001171681.1_Level2_Primary_NotEukOnly_New.xls",
"NP_001171681.1_Level2_Full_EukOnly_New.xls", "NP_001171681.1_Level2_Full_NotEukOnly_New.xls",
"NP_001123908.1_Level1_Primary_EukOnly_New.xls", "NP_001123908.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001123908.1_Level1_Full_EukOnly_New.xls", "NP_001123908.1_Level1_Full_NotEukOnly_New.xls",
"NP_001123908.1_Level2_Primary_EukOnly_New.xls", "NP_001123908.1_Level2_Primary_NotEukOnly_New.xls",
"NP_001123908.1_Level2_Full_EukOnly_New.xls", "NP_001123908.1_Level2_Full_NotEukOnly_New.xls",
"NP_001083086.1_Level1_Primary_EukOnly_New.xls", "NP_001083086.1_Level1_Primary_NotEukOnly_New.xls",
"NP_001083086.1_Level1_Full_EukOnly_New.xls", "NP_001083086.1_Level1_Full_NotEukOnly_New.xls",
"NP_001083086.1_Level2_Primary_EukOnly_New.xls", "NP_001083086.1_Level2_Primary_NotEukOnly_New.xls",
    Prior_version_xls_files[i] <- list(x)
  }
  else{
    Prior_version_xls_files[i] <- NA
  }
}
for(i in (1:length(New_version_csv_names))){
  if(file.access(toString(New_version_csv_names[i])) == 0){
    x <- as.data.frame(read.csv(toString(New_version_csv_names[i]), stringsAsFactors=F))
    x <- x[c("NCBI..Accession", "Protein.Count", "Taxonomic.Group", "Scientific.Name", "BLASTp.Bitscore",
             "Susceptibility.Prediction")]
    New_version_csv_files[i] <- list(x)
  }
  else{
    New_version_csv_files[i] <- NA
  }
}
for(i in (1:length(New_version_xls_names))){
  if(file.access(toString(New_version_xls_names[i])) == 0){
    x <- loadWorkbook(toString(New_version_xls_names[i]))
    x <- readWorksheet(x, getSheets(x)[1])
    x <- x[c("NCBI.Accession", "Protein.Count", "Taxonomic.Group", "Scientific.Name", "BLASTp.Bitscore",
             "Susceptibility.Prediction")]
    New_version_xls_files[i] <- list(x)
  }
  else{
    New_version_xls_files[i] <- NA
  }
}
#Begin comparisons
#csv vs. xls
#In all cases, these files should be identical. If they are not, either
#there was an error in naming or there is a problem with how SeqAPASS is
#parsing data.
#Set the length of the list
size <- length(Prior_version_csv_files)
#Initialize count of successful matches
goodMatch <- 0
#Initialize count of absences, where neither file exists
absence <- 0
#Initialize the count of empty matches, where one file in a comparison is
#present but the other is not
emptyMatch <- 0
#Initialize list of empty matched files
emptyMatchList <- list()
#Initialize the count of mismatches, where both files exist but are different
mismatch <- 0
#Initialize a list of mismatched file names
mismatchList <- list()
#For each element in a csv list, compare it to the corresponding element
#Initializations
#Initialize number of times comparison was not possible
noComp <- 0
#Initialize list of files not compared because of one absence
noCompList <- list()
#Initialize list of file names with differences in accession or tax group
#And the first row in which an error appears
accessionOrTaxGroup <- data.frame(file_name1 = character(0), firstRow1 = numeric(0), file_name2 = character(0),
                                  firstRow2 = numeric(0))
#Initialize list of file names with differences in bitscore but not protein count
#And the first row in which an error appears
bitscoreNotProtCount <- data.frame(file_name1 = character(0), firstRow1 = numeric(0), file_name2 = character(0),
                                   firstRow2 = numeric(0))
#Initialize list of file names with differences in susceptibility prediction
#And the first row in which an error appears
susPrediction <- data.frame(file_name1 = character(0), firstRow1 = numeric(0), file_name2 = character(0), firstRow2 = numeric(0))
#For this comparison, entries without a partner are considered to be entirely
#absent and will not be distinguished from double absences
#To save time, this analysis step will only use csv files, and not xls files
for(i in (1:size)){
  if(length(Prior_version_xls_files) != size | length(New_version_csv_files) != size | length(New_version_xls_files) != size){
    print("Error: file lists are not of the same length")
    break
  }
  if(is.na(Prior_version_csv_files[i]) & is.na(New_version_csv_files[i])){
    noComp <- noComp + 1
  }
  if(is.na(Prior_version_csv_files[i]) & !is.na
I
  #At this point we have two files that are both confirmed to exist, and
  #can be compared with one another. However, we do not expect identity
  #between files at this point, or even that the dataframes have the same
  #number of rows
  #The first comparison asks for any row in the first dataframe that shares a scientific name
  #with a row in the second dataframe whether those two rows also share identical
  #accession numbers and taxonomic groups. Since the union of the datasets is symmetrical,
  [i])) & !is.na(New_version_csv_files[i])){
      }
      #If they are the same, we leave the truth value unchanged and observe
      #the next scientific name, until we have observed all rows
    }
    #If, after checking the list, some or all of these values are TRUE,
    #then we add this file name to the appropriate list(s). If FALSE,
    #then no further action needs to be taken
    if(accTaxChange){
      accessionOrTaxGroup <- rbind(accessionOrTaxGroup, c(Prior_version_csv_names[i], AOTFirstRow, New_version_csv_names[i],
                                                          AOTOtherRow))
    }
    if(bitProtChange){
      bitscoreNotProtCount <- rbind(bitscoreNotProtCount, c(Prior_version_csv_names[i], BPCFirstRow, New_version_csv_names[i],
                                                            BPCOtherRow))
    }
    if(susChange){
      susPrediction <- rbind(susPrediction, c(Prior_version_csv_names[i], SCFirstRow, New_version_csv_names[i], SCOtherRow))
    }
  }
}
#Some final clean up. Change some of the column names to make sure they are
#human readable
names(accessionOrTaxGroup) <- c("file 1", "row in file 1", "file 2", "row in file 2")
names(bitscoreNotProtCount) <- c("file 1", "row in file 1", "file 2", "row in file 2")
names(susPrediction) <- c("file 1", "row in file 1", "file 2", "row in file 2")
#END OF INITIAL CHECK
##############################################
##############################################
#PAIRWISE COMPARISONS
##############################################
#The purpose of this function is to examine the differences between two files
#in greater depth than the overview can provide. X and Y should be the NAMES
#of the two files you wish to compare and they should be the CSV versions
#of the files. Output will be three lists detailing every instance where
#the two files are different, in categories identical to the general overview
#above. Specifically, differences are evaluated when, given that the scientific
#name in a row from the first file matches the scientific name in a row from
#the second 1) Either the Accession Number or Taxonomic groups do not match
#2) The Protein Count is the same but the Bitscore has changed or 3) The
#Susceptibility Prediction has changed.
deep.dive <- function(x,y){
  #Ensure directory is set correctly
  setwd('L:/Priv/Bioinformatics Team/SeqAPASS/Saved test data')
  #Set up lists to be returned
  deepDiveAccessionOrTaxGroup <- data.frame(firstRow1 = numeric(0), firstRow2 = numeric(0))
  deepDiveBitscoreNotProtGroup <- data.frame(firstRow1 = numeric(0), firstRow2 = numeric(0))
  deepDiveSusPredictionGroup <- data.frame(firstRow1 = numeric(0), firstRow2 = numeric(0))
  #Get files from names
  file1 <- as.data.frame(read.csv(toString(x), stringsAsFactors=F))
  file2 <- as.data.frame(read.csv(toString(y), stringsAsFactors=F))
  file1 <- file1[c("NCBI..Accession", "Protein.Count", "Taxonomic.Group", "Scientific.Name", "BLASTp.Bitscore",
                   "Susceptibility.Prediction")]
  file2 <- file2[c("NCBI..Accession", "Protein.Count", "Taxonomic.Group", "Scientific.Name", "BLASTp.Bitscore",
                   "Susceptibility.Prediction")]
#To use, replace "file1" and "file2" with the names of the files of interest
#Example:
#deepLists <- deep.dive("CAL36973.1_Level1_Primary_EukOnly_Old.csv","CAL36973.1_Level1_Primary_EukOnly_New.csv")
deepLists <- deep.dive("file1","file2")
deepDiveAccessionOrTaxGroup <- as.data.frame(deepLists[c(1,2)])
deepDiveBitscoreNotProtGroup <- as.data.frame(deepLists[c(3,4)])
deepDiveSusPredictionGroup <- as.data.frame(deepLists[c(5,6)])
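
The float.check helper near the top of the listing compares bitscores with a small tolerance rather than with exact equality. The sketch below illustrates why that matters and shows one possible vectorized form of the same idea. The function columns_match and the sample values are hypothetical and are not part of the SOP script.

```r
# Exact comparison can fail for values that differ only by rounding error
a <- 0.1 + 0.2
b <- 0.3
exact <- (a == b)
tolerant <- isTRUE(all.equal(a, b))
print(c(exact = exact, tolerant = tolerant))

# Hypothetical vectorized variant: compare two numeric columns in one call.
# all.equal() returns TRUE or a character description of the difference,
# so isTRUE() converts the result to a plain TRUE/FALSE.
columns_match <- function(x, y) {
  isTRUE(all.equal(as.numeric(x), as.numeric(y)))
}

old_scores <- c("950.5", "610.2")        # values read as text
new_scores <- c(950.5, 610.2 + 1e-12)    # same values with rounding noise
print(columns_match(old_scores, new_scores))
print(columns_match(old_scores, c(950.5, 615)))
```

The first comparison is treated as a match because the difference is well below the default tolerance of all.equal, while the second is reported as a mismatch.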