&EPA
                            United States
                            Environmental Protection
                            Agency
                         Environmental
                         Research Laboratory
                         Duluth, MN 55804
Research and Development    EPA/600/M-87/021    August 1987

ENVIRONMENTAL
RESEARCH    BRIEF
                     SMILES: A Line Notation and Computerized
                          Interpreter for Chemical Structures
                          Eric Anderson, Oilman D. Veith, and David Weininger
Introduction
As the use of structure-activity relationships matures in the
search for cost-effective molecular design and  chemical
safety evaluation, interaction with advanced computational
methods on  small computers is unavoidable. Methods for
specifying chemical structures  through  an interactive
construction vary widely  and  include specifying line
notations, atom  and bond list  matrices,  and graphical
building  blocks  of substructures. Line notations such  as
Wiswesser Line Notation (WLN) are rapid but require
extensive knowledge and experience  by the user (Smith,
1968; Granito et al.,  1972, Elkms et al. 1974), making WLN
of limited value to  the non-chemist computer user The
input of  atom  and  bond  lists  requires  a  minimum  of
knowledge by the user but is tedious and slow (Kaufmann,
1981). Moreover, lists are not efficient representations of
chemical structures in  computer memory  or storage
devices if large  sets of structures are desired. Graphically
building structures from menus of substructures (Kao et al.,
1985) is  user-friendly but requires more software overhead
and hardware costs. This paper presents a convention for
chemical structure notation which has the advantages of
line notation  and minimizes the chemical knowledge of the
user by programming many rules of chemistry into the line
notation interpreter.
SMILES  notation (Simplified Molecular  Identification and
Line Entry System) was developed by the Environmental
Research Laboratory-Duluth QSAR Research Program to
facilitate storage,  retrieval, and modeling of  chemical
structures and chemical information. This notation provides
a flexible and unambiguous method for  specifying the
topological structure of  molecules, and  interfaces with
additional software to specify the geometry of molecules.
SMILES notation reduces  the  difficulty of translating
structure into appropriate notation for humans and software
                      by focusing on common chemical conventions and only five
                      simple "rules"

                      SMILES Notation
                      To prepare SMILES notation, it is generally best to  draw
                      chemical structures on  paper, although experienced users
                      often write SMILES directly from the structure envisioned in
                      their minds. Chemical structures are most often hydrogen-
                      suppressed because the location  of hydrogens is  implicit
                      for  normal valences  of atoms. Hydrogen atom can be
                      explicitly designated in heterocyclic systems to avoid
                      ambiguities. Each atom in the structure is represented by
                      its atomic symbol. Atomic symbols having two characters,
                      such as lead, are designated with  the first character "upper
                      case" and the second character "lower case" (i.e., Pb). The
                      bond symbols are designated by special characters  -
                      explicit single bonds are designated by hyphen (-), double
                      bonds by equal sign (=), and triple bonds by (#). The bond
                      between two atoms is  represented by placing its  symbol
                      between the  atomic  symbols for the atoms which are
                      connected by the bond. If a bond is not specified, it is
                      interpreted as a single bond. Examples are.

                         Chloroethane        CH3CH2CI  C-C-CI  or  CCCI
                         Acetaldehyde        CH3CH = 0  CC = 0
                         1-Propyne           CH = CCH3  C#CC

                      The SMILES interpreter reads the SMILES string from the
                      left to  right  and identifies atoms  with  two-character
                      symbols first. There is then, no ambiguity  in the notation for
                      chloroethane  with respect to the character "C" m the
                      chlorine atom.
                      To represent branched  structures where atoms have more
                      than two atoms  connected to them, the additional
                      connections, or branches, are enclosed in parentheses. The
                      left parenthesis is interpreted to mean that all atoms in the

-------
string  until the  corresponding  right  parenthesis are
connected to the preceding atom:
    2-Methylbutane
    Di-n-butylphosphorate
                         or
CC(C)CC
0 = P(OCCCC)OCCCC
CCCCOP( = O)OCCCC
Note that atoms such as oxygen which are double-bonded
to the central atom are designated by placing the double
bond just inside the opening parenthesis
There are no limits to the  number of branches a structure
can have in SMILES notation, although associated software
in the  SMILES interpreter will  detect  disallowed  valence
states (too many connections) for the atoms.  Conceptually,
there are no limits to the number of branches that can be
designated within  other branches  because  the pairs  of
parentheses are interpreted  in logical order  This permits
very complex topological  structures  to  be  written  in  a
simple linear string of characters. The SMILES  interpreter
does,  however,  include some practical software  limitations
For example, there is usually a limit to the number of atoms
allowed  in a structure.
The only  remaining  topological feature  of  chemical
structures is that of rings or cycles  The simple SMILES
rule for  cyclic structures is that  one bond broken in  each
ring will result in a structure which can be expressed in  a
linear string. To identify the "broken" bond in  each ring, the
two atoms connected by the bond are each labeled with  a
digit, termed the  ring-closure pair  The digit  is placed
immediately following the two  atoms  connected by the
"broken" bond and the SMILES interpreter restablishes the
bond in the internal connection matrix for  the  structure For
example:
    Cyclohexane
    Benzene
CICCCCCI
CI = CC = CC = CI
To  avoid having to draw Kekule structures and designate
conjugated double bonds in aromatic structures,  SMILES
notation includes the convention of designating  atoms in
aromatic rings with lower case atom symbols
    Benzene
    Naphthalene
    4-chlorobenzoic acid
clccccc!
clcc2ccccc2ccl
O = C(O)clccc(C)ccl
More complicated structures generally require that  the
structure be drawn on paper first to facilitate "bookkeeping"
as illustrated in Figure  1. It  is obvious that  the  SMILES
notation can begin at any atom in a structure and still  be
valid. We have developed software which rapidly plots the
structure entered by the user to provide a visual verification
of the structure Moreover, we  have developed a  conical
ordering algorithm which translates all variations of  possible
SMILES  notation  for  a given structure into  a "unique"
SMILES  notation for that structure. This algorithm  and  its
use in storage and retrieval of chemical information will  be
the subject of a subsequent paper.

The  convention of using lower case symbols  for aromatic
atoms introduces  the  possibility  of ambiguous  SMILES
notation in the case where an aromatic atom has a double-
character symbol. For example, "Sn" designates the atom
tin; however, it could be interpreted as  an aliphatic sulfur
singly bonded to  an aromatic nitrogen  Also, if tin were to
be designated as aromatic, the lower  case "sn" could  be
interpreted as an aromatic sulfur connected to an aromatic
nitrogen. Wa  have  found these  ambiguities  to  occur
infrequently  and can be  omitted  by designating  double-
character atoms as aromatic  by placing the exclamation
point  (!) as a suffix for aromatic  atoms  immediately
following the atom (e.g ,  Sn!  designates aromatic  tin). In
general, aliphatic atoms connected  to aromatic rings most
frequently  would  be designated  as  a branch using
parentheses. Finally, if a user wished  to  designate an
aliphatic sulfur  connected  to an aromatic carbon,  the use of
the explicit  single bond  "S-c" would remove  ambiguity
and  prevent the  notation from  being interpreted as
scandium.

A summary of SMILES notation is as follows:

1) Represent  atoms with  their atomic symbols using
   hydrogen suppression in most cases.

2) Represent bonds  between  atoms  using  "-"  for single
   bonds, " ="  for double bonds, and "#" for triple bonds.
   Single bonds are implicit if not designated

3) Enclose branches from a central atom in parentheses.

4) Linearize  cyclic structures  by  removing one bond in
   each ring and  designating  the  atoms as  ring closure
   pairs with a digit immediately  after  the  atoms to be
   reconnected

5) Aromatic  ring  atoms  can  be represented using lower
   case  symbols  for single  character atoms and by
   appending an "!" suffix to double character atoms.

SMILES Interpreter
SMILES notation  is logically interpreted by  a  syntax
interpreter which  parses  the  character string from left to
right and assumes that.
   SMILES  = piece ([bond] piece) space

   Piece =  atom (digression)

   Digression = [bond] label |
                left	paren < [bond] piece >  right	paren

   Label =  "1" | "2" | "3" | "4" | "5" |  "6" | "7" | "8" | "9"
   Bond  =  bond_symbol [bond	qualifiers]
   Atom  = atomic	symbol [atom	qualifiers]

In this notation, (.  ) enclose  items  repeated  zero or more
times, [ .] enclose an optional item, | separates alternatives,
and  <..>   enclose items repeated  one or more times.
From this definition of SMILES, it can be seen that a space
is used to  designate the end of the  SMILES string.  The
labels are used io close rings  Consequently, a given  label
must  appear an even number of times in pairs.

A syntax diagram for the  purpose of software development
and  explanation   is  presented  in Figure  2.  Although
recursive implementations  of the syntax  diagram  are
certainly possible, we  have constructed a  non-recursive
implementation of  SMILES interpretation in FORTRAN 77.
The first subroutine in the software is designed to  identify
atoms by matching the characters  to the atomic symbols.
This routine also  records whether the atom is aliphatic or
aromatic as described   above.  The  second subroutine
connects two atoms  In addition to identifying the  explicit
bond  types,  the routine must identify implicit bonds  Implicit
bonds within a ring or fused ring system are identified as

-------
Figure 1.    Writing SMILES for branched, cyclic structures.

                                                  Mecamylamine
          CH3
                                                            Start
                   \
             CH3

             CH3
                   /
                 CH3
CH2
                                        SMILES = CNCI(C)C2CCC(C2)CI(C)C
                                                 Dimethothiazme
                                                  Start
                             SMILES  =  clcc2Sc3ccc(S( = O)( = O)N(C)C)cc3N(CC(C)N(C)C)c2ccl
aromatic  if both  atoms  connected by  the  bond  are
designated aromatic.

After processing the first atom, a WHILE  loop will check for
the presence of a terminating space. As long as the  end
has not been reached, each  pass through the loop  will
process either  one atom  (attaching it  to  the  preceding
atom), or one  end  of a ring  closure pair  When  two
matching  ends  have been located the two atoms can be
reconnected.

Within the loop, checks are made in the sequence indicated
by the diagram  for the optional presence  of parentheses
(indicating  branching)  or  explicit bonds  As opening
                            parentheses are encountered, the last atom encountered is
                            remembered on a stack (last in,  first out). This allows the
                            algorithm to return  to the  atom and proceed along  a
                            different branch after it has completed this current branch
                            Later as the matching closing parenthesis is encountered,
                            the atom is taken back off the stack and its status as the
                            "current" or "last" atom is restored  As each  new atom  is
                            encountered, it is attached to the  former "last" atom and in
                            turn becomes the new "last"  atom.

                            The routine also makes subsequent checks to make certain
                            that the stack is empty (all parentheses were matched) and
                            that all ring  closure digits used were  eventually matched.

-------
Figure 2.     Syntax diagram for SMILES notation.


                       SMILES
                        String
RING
CLOSURE


Also, each aromatic atom should have at least two aromatic
bonds associated with it. These conditions support proper
syntax checking.

It is also important  to perform various  checks on the
chemical meanmgfulness of the structure described These
include  identifying improper valence states  for a  given
atom The number of hydrogen atoms associated with each
atom is computed as part of the valence checking software.
The placement of hydrogens in heterocycles can be made
explicit  to avoid  interpretation  difficulties.  In  delocahzed
systems  such as the nitro group, we  have  adopted the
SMILES convention of double bonds to both oxygen atoms,
eg, RN( = 0) = 0, to prevent  the hydrogen  connecting
routine from  adding  a  hydrogen  on  the oxygens  These
specialized bonding systems can  be  readily detected and
modeled as delocahzed bonds as the need arises

Semantic checks on the chemistry  of the SMILES string are
best kept in  routines separate from the  interpreter so they
can be called sometime after interpreting a SMILES string
Additions  and refinements  to  such  a  routine can  then
proceed  independently,  without  affecting  the SMILES
syntax  interpreter.  In  the version used  at  ERL-D,  the
valence of carbon is 4, oxygen is 2, and halogens are all 1.
Nitrogen can have valences of 3, 4, or 5 as well as a variety
of special  states. For example, an  aliphatic nitrogen  with 4
single bonds is considered charged as designated below.
Nitrogens containing  two double bonds  decrease  the
neighboring atom  hydrogen count by one. In addition to
these common valence checks, simple checks to be certain
aromatic atoms are located in a  ring structure or  that all
atoms in an aromatic ring are so designated are made.

There  are numerous  other conventions which are being
incorporated into the SMILES interpreter to designate other
features of chemical structure. Simple conventions such as
{ + } or {-} locate explicit ionic charges on the preceding
atom The use of braces have been a convenient method of
designating specialized delocalized and tautomeric bonding
in substructures and for special  valences of inorganic
structures  in SMILES  by expanding the list of qualifiers on
atoms and bonds in the interpreter. The SMILES interpreter
described  herein is obviously capable only of the topology
of structure  and, in this simple  form, cannot  be  used to
designate conformation  or other  three-dimensional
attributes  of structure  The addition of geometry  to  the
SMILES conventions is beyond the scope of this paper and
will  be discussed separately    Moreover,   topological
structures  are adequate  for  models of  many chemical
properties and as  input  to  conformational  analysis
programs. The primary  advantages of  SMILES is  the
simplicity  of the conventions and the fact that the  software
can  be implemented  on almost  any size  computer using
BASIC, FORTRAN, Pascal or C compilers  The FORTRAN
version of the SMILES interpreter is available upon written
request to the Environmental Research Laboratory-Duluth

References
Smith, E.G  The  Wiswesser  Line-Formula Chemical
Notation; McGraw-Hill, New York,  N.Y., 1968, pp  187.

Granito, C.E , Roberts, S ; Gilbson, G W J. Chem.  Doc  12,
190-196  (1972).
Elkins, D; Leo, A.; Hansch C, J. Chem  Doc 14, 65-69
(1974)

Kaufmann, J J , Int J Quant. Chem 8, 419-439 (1981).
Kao, J , Eyerman,  C , Walt,  L ,  Maher, R., Leister, D ,  J.
Chem. Inf. Comput. Sci. 25(4), 400-409 (1985).

Authors
Eric Anderson is with the Computer Sciences  Corporation,
Falls Church, VA

David  Wemmger  is with the  Medchem  Project,  Pomona
College, Claremont, CA.

Address correspondence to:

   Oilman D. Veith
   U S. EPA
   Environmental Research Laboratory
   6201 Congdon Blvd.
   Duluth, MN 55804

-------