The molfile command is the generic command used to manipulate chemical structure and reaction files. These can be of any supported format, not just MDL molfiles.
Molfiles are major objects. They are uniquely identified by their label alone. Molfiles do not contain minor objects.
set fhandle [molfile open myfile.sdf]
set ehandle [molfile read $fhandle]
molfile get $fhandle record
As explained in more detail in the section about working with structure files, the molfile handle identifier can be replaced by a file name. This file is automatically opened, the command executed, and the file closed in a single one-shot operation.
In the context of structure files, file-related data is usually provided as attributes. However, molfiles can store property data like any other chemistry object.
molfile get $fhandle F_COMMENT
When property data is requested which is not of the molfile type, the next record is read from the file into a temporary ensemble, reaction or dataset object, depending on the file configuration. An attempt is then made to obtain the property data from that object. Afterwards, the object is automatically deleted.
set mw [molfile get "somefile.smi" E_WEIGHT]
This example temporarily opens the file, reads the first record into an ensemble, and computes the molecular weight. Both the ensemble and the molfile object are transient and no longer exist after the command completes.
This is the list of currently officially supported subcommands:
molfile add filehandle ?objecthandle/objecthandlelist?...
f.add(objectsequence/objectref,...)
Molfile.Add(filename,objectsequence/objectref,...)
If the filehandle argument refers to an open chemistry data file, this command is indistinguishable from molfile write. A difference only exists if the filehandle argument is a file name (or, in case of Python, the class method is used). In that case, molfile write overwrites an existing file, while molfile add attempts to temporarily open the file for appending, as with a molfile open filename a command. If the format of the output file supports appending, the output objects are written as new records after the last existing record.
molfile append filehandle ?property value?...
f.append({?property:value,?...})
f.append(?property,value,?...)
Standard data manipulation command for appending property data. It is explained in more detail in the section about setting property data. This is not a command to append file records. Use the molfile write command for this purpose.
The command returns the first data value.
molfile append $fh F_GAUSSIAN_JOB_PARAMS(route) "Opt=(AddRed,CalcFC)"
molfile assign filehandle srcproperty dstproperty
f.assign(srcproperty=,dstproperty=)
Assign property data to another property on the same molfile object. Both properties must be associated with the same object class. This process is more efficient than going through a pair of molfile get/molfile set commands, because in most cases no string or Tcl/Python script object representations of the property data need to be created.
Both source and destination properties may be addressed with field specifications. A data conversion path must exist between the data types of the involved properties. If any data conversion fails, the command fails. For example, it is possible to assign a string property to a numeric property, but only if all property values can be successfully converted to that numeric type. The reverse case always succeeds, out-of-memory errors and similar global events excluded.
The original property data remains valid. The command variant molfile rename directly exchanges the property name without any data duplication or conversion, if that is possible. In any case, the original property data is no longer present after the execution of this command variant.
The command returns the object handle for Tcl, or object reference for Python.
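A minimal sketch: the following copies the file comment into a second file-level property. F_BACKUP_COMMENT is a hypothetical destination property used only for illustration, assumed to be defined with a compatible string data type:
molfile assign $fhandle F_COMMENT F_BACKUP_COMMENT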
molfile backspace filehandle ?nrecords?
f.backspace(?records=?)
Position the file pointer backwards. If no record counter is specified, the file is backspaced by a single record. It is an error to attempt to reposition the file before the beginning of the file.
molfile backspace $fh
molfile set $fh record [expr [molfile get $fh record]-1]
These two sample lines provide identical functionality.
The molfile backspace command is often used in combination with the molfile copy command in order to copy records with specific properties verbatim:
set eh [molfile read $fh]
if {[structure_passes_condition $eh]} {
molfile backspace $fh
molfile copy $fh $outfilehandle
}
molfile blob enshandle/reactionhandle/datasethandle ?attribute value?...
molfile blob enshandle/reactionhandle/datasethandle ?attribute_dict?
Molfile.Blob(eref/xref/dref,?attribute,value?,...)
Molfile.Blob(eref/xref/dref,attribute_dict)
This is an alias of molfile string. Please refer to the section on that command for more information.
molfile close ?filehandle? ...
molfile close all
f.close()
Molfile.Close("all")
Molfile.Close(mrefsequence/mref/mhandle,...)
Close one or more file handles. If the file handle corresponds to a scratch file, the file is deleted. If it corresponds to a pipe, all programs in the pipe are shut down.
If all is passed instead of a set of file handles, all currently opened structure files are closed. Standard Tcl or Python files upon which a molfile handle has been piggybacked are not affected, i.e. these language channels are flushed and remain open, while the molfile object component is closed.
It is a good idea to close files when they are no longer needed. In addition, while most file format I/O modules commit all data to disk after each record has been written, so that a clean close-down is not absolutely required, there are file formats for which the I/O module has a cleanup or finalization routine which is only called if the file is properly closed.
The command returns the number of files which were closed.
set fhandle [molfile open scratch]
molfile close $fhandle
The example closes a scratch file, which is automatically deleted from disk when it is closed.
On normal interpreter program exit, the close functions of all remaining open file handles are automatically called.
molfile copy filehandle ?channel? ?count? ?startrecord/startrecordlist?
f.copy(?outfile=?,?count=?,?startrecord=?)
Copy a record to a Tcl or Python file I/O channel, to a Cactvs structure file handle, or retrieve it as a byte image. No interpretation or formatting of the data in the file record(s) takes place - the data is copied verbatim, byte by byte.
If file format conversion is desired, the data items (ensembles, reactions, datasets) must be explicitly read as chemistry objects (molfile read command) and written to another molfile opened for output in the desired format (molfile write command). That procedure involves re-formatting and potential loss of formatting or information which was not captured by the input routine, or cannot be written by the output routine.
By default, the next record after the current file pointer position is returned as a byte image. The optional parameters allow the selection of a specific start record (beginning with 1 for the first record), the copying of multiple records in one command (by default, a single record is processed), and output to alternative Tcl or Python file I/O channels or Cactvs molfile structure file handles. If an empty string, None in Python, or the value 0 is used as the start record number, the file is copied from the current position. If the start record is negative, it is interpreted as an offset from the current position. Therefore, passing -1 as parameter instructs the command to backspace by one record prior to copying. Not all files can be backspaced. The start record can also be specified as a record list (Tcl) or record sequence (Python). In that case, the input file pointer is positioned to every specified record in order, and from that position the selected number of records is copied. If the special record count values end or all are used, all remaining records in the input file are copied. Otherwise, if the number of available records is smaller than the requested copy count, an error results.
If the output channel argument is omitted, or set to an empty string, the record(s) are returned as a byte sequence command result. Otherwise, the data is written to the file handle the argument is connected to. For Cactvs molfile handles, the destination is the current write position of the underlying file handle. On Unix/Linux systems, writable active Tcl file or socket handles (in the form filexxx or sockxxx) are also supported, but not on Windows. Additionally, the special output channel names stdout and stderr can be used. If output is written to a channel, and not returned as a blob, the number of actually copied records is returned as the command result.
The I/O modules for some formats like SDF provide optimized fast copy routines and are thus notably faster to copy than other file formats without explicitly encoded record positions. These routines still need to read the file line by line and maintain a parser state, but they avoid decoding the record contents as structures or reactions.
set eh [molfile read $fhandle]
set fhout [open "metal_compounds.sdf" w]
if {[ens atoms $eh metal exists]} {
molfile copy $fhandle $fhout 1 [expr [molfile get $fhandle record]-1]
}
This example reads a structure from an input file, checks whether it contains a metal atom, and if so, copies the record unchanged to an output file, which is opened as a simple Tcl text file channel in this example. The expression which forms the last parameter backspaces the input file by one record, so that the same record which was just read can be copied. A simpler solution for the same functionality is to pass -1 as the argument. This of course only works if the input file can be repositioned backwards, i.e. normal text files are fine, while standard input or a socket connection do not work.
molfile count filehandle ?maxrecords? ?readscope?
f.count(?maxrecords=?,?readscope=?)
Molfile.Count(filename,?maxrecords=?,?readscope=?)
Count the number of records in the file.
If the file format contains an internal or external record index with information about the complete file, the answer is produced from the index, and is thus typically obtained fast. Otherwise, the file is skipped from the current position until the end, and the sum of the number of records encountered while skipping and the record index when the count started is returned. For files which are rewindable, the original input file pointer position is then restored. On non-rewindable files, the file contents are consumed, and no return to the old input position is possible. For files which are opened for writing, the count usually is simply the current output position, except for those few file formats which support in-file record replacement in combination with a complete file index. In the latter case, the count is again extracted from the index.
During the record skipping part the file contents are not physically read if possible. Rather, the skip function of the responsible file format I/O module is used to scan the file effectively. After arriving at the end of the file, a full in-memory record position index has been assembled for the file, and future record selection within files which support re-positioning is fast.
The type of record boundaries counted depends on the input scope of the file. For file formats which support multiple input modes, such as for extraction of ensembles or molecules or datasets, the count is dependent on the type of object which is configured to be read. If the file input object type is changed, the in-memory record index table is discarded.
If the maxrecords parameter is specified, and is not a negative number, it is the maximum count reported. No attempt is made to position the file beyond this mark during the count process. This has no effect on future input operations - these may still proceed beyond the reported count. This option is not intended to be generally useful, but is used for example in the structure browser csbr with the -m option to enable quick inspection of a file without full scanning.
The optional readscope parameter can be used to temporarily modify the read scope under which the file is processed. It can be any of the generally recognized values (mol, ens, reaction, dataset). If the file format does not support the specified mode, its default mode is silently used. If the file is not positioned at the beginning of the data, the count reports the sum of the currently known records as perceived by the previous read scope, and the remaining file records under the new one. If these values are different, the result may only be useful under very specific circumstances. If the parameter is not set, or an empty string is passed, the currently set read scope or, for one-shot file operations, the default read scope is used.
set nrecs [molfile count "thefile.sdf"]
set nrecs [molfile count "test.spl" -1 mol]
molfile dataset filehandle ?filterlist?
f.dataset(?filters=?)
Return the handle of the dataset associated with the file handle. If no such dataset is set, the command returns an empty string, or None for Python. The same information is also available via molfile get $filehandle dataset.
This command is different from the dataset commands for ensembles, reactions or tables, where it indicates membership in a dataset. File objects cannot be members of a dataset. This dataset association is explained in more detail in the molfile set command section.
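A small usage sketch, checking whether a dataset has been associated with the file handle:
set dhandle [molfile dataset $fhandle]
if {$dhandle ne ""} {
    puts "records read from this file are collected in dataset $dhandle"
}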
molfile defined filehandle property
f.defined(property)
This command checks whether a property is defined for the structure file. This is explained in more detail in the section about property validity checking. Note that this is not a check for the presence of property data! The molfile valid command is used for this purpose.
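The distinction matters in practice: a property may be defined for the file object while no data value is present yet. A sketch using the F_COMMENT property:
if {[molfile defined $fhandle F_COMMENT] && [molfile valid $fhandle F_COMMENT]} {
    puts [molfile get $fhandle F_COMMENT]
}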
molfile delete filehandle recordlist ?rebuildindex?
f.delete(records=,?rebuildindex=?)
Molfile.Delete(filename,records=,?rebuildindex=?)
Delete records from the file. The file must have been opened for writing or update, and be rewindable. In case the file is not a simple record sequence, the I/O module for its format must provide a deletion function, or the operation will fail.
The deletion record list is a single record number or a set of record numbers in any order. They are sorted and duplicates removed before file modification commences. It is not an error to specify an empty removal record list. The record numbering starts with one, and the record numbers refer to the record numbering at the moment the command is issued. There is no need to compensate for intermediate record numbering shifts when more than one record is deleted.
The optional index rebuild parameter, a boolean value, can be set to optimize the deletion process for files in formats which maintain field index information. By default, indices are updated as part of the deletion process. In case many records are deleted, it may be more efficient to drop the indices prior to the deletions and rebuild them after the records have been removed. In order to select this alternative procedure, a true parameter value can be set. At this time, the only file format which actually can use that parameter is the bdb database file format.
In case the file is to be truncated, the molfile truncate command is usually more efficient.
This command returns the number of deleted records. It does not close or destroy the file handle, or the underlying file.
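As an illustration, assuming $fh is a handle to a rewindable file opened for update, the following removes three records given in arbitrary order:
set ndeleted [molfile delete $fh {7 3 12}]
puts "$ndeleted records removed"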
molfile dget filehandle propertylist ?filterset? ?parameterdict?
f.dget(property=,?filters=?,?parameters=?)
Molfile.Dget(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the molfile get command. The difference between molfile get and molfile dget is that the latter does not attempt computation of property data, but rather initializes the property values to the default and returns that default if the data is not yet available. For data already present, molfile get and molfile dget are equivalent.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
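A small sketch of the difference, using F_AVERAGE_ATOM_COUNT (which appears elsewhere in this section as a computable file property) purely for illustration:
set v1 [molfile dget $fhandle F_AVERAGE_ATOM_COUNT]  ;# default value if not yet computed
set v2 [molfile get $fhandle F_AVERAGE_ATOM_COUNT]   ;# triggers computation if necessary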
molfile dup filehandle
f.dup()
This command duplicates a file handle. The duplicate handle or reference points to the same underlying file or other data channel, is opened in the same access mode, and positioned at the same record. Also, all file object attributes and file properties are set to identical values.
Currently, it is not possible to duplicate virtual file sets opened by a molfile lopen command.
molfile exists filehandle ?filterlist?
f.exists(?filters=?)
Molfile.Exists(mref,?filters=?)
Check whether a molfile handle is valid. The command returns 0 or 1. Optionally, the molfile may be filtered by a standard filter list, and if it does not pass the filter, it is reported as not valid.
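A minimal sketch, re-opening a file only if the stored handle is no longer valid (the file name is a placeholder):
if {![molfile exists $fhandle]} {
    set fhandle [molfile open "myfile.sdf"]
}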
molfile extract filehandle retrievallist
f.extract(retrievallist)
Extract the contents of data fields from the file, without reading full structure or reaction records if possible. This operation requires a support function in the I/O module for the file format. Generally, only formats optimized for query operations, such as the Cactvs bdb and cbs formats provide such a function in their I/O module.
This command is essentially a shortcut for a molfile scan command with an empty query condition and a propertylist retrieval mode. Please refer to that command for details about the possible contents of the retrieval list.
The result is a nested list of extracted property values, with one outer list element for every file record up to the end of the file, and an inner list with one element per retrieval field.
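As a sketch, assuming a file in a query-optimized format such as cbs and that E_NAME and E_WEIGHT are retrievable fields in that file (file name and field availability are assumptions), the remaining records could be tabulated like this:
set fh [molfile open "compounds.cbs"]
foreach row [molfile extract $fh {E_NAME E_WEIGHT}] {
    lassign $row name mw
    puts "$name: $mw"
}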
molfile filter filehandle filterlist
f.filter(filters)
Check whether the structure file passes a filter list. The return value is boolean 1 for success and 0 for failure.
molfile filter $fhandle $filter
molfile fullscan filehandle queryexpression ?mode? ?parameterdict?
f.fullscan(query=,?mode=?,?parameters=?)
Molfile.Fullscan(filename,query=,?mode=?,?parameters=?)
This command is the same as molfile scan, except that an automatic rewind (see molfile rewind) is performed before the query is executed. The same effect can be achieved by setting the startposition parameter value to 1.
molfile get filehandle propertylist ?filterset? ?parameterdict?
molfile get filehandle attribute
f.get(property=,?filters=?,?parameters=?)
f.get(attribute)
f[property/attribute]
f.property/attribute
Molfile.Get(filename,property=,?filters=?,?parameters=?)
Molfile.Get(filename,attribute)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
The molfile object possesses a rather extensive set of built-in attributes, which can be retrieved with the get command (but not its related subcommands like dget, sqlget, etc.). Most of them can also be manipulated with a set command. In addition, molfile objects can possess file-level properties. The standard prefix for these is F_.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
set c [molfile get $fhandle F_COMMENT]
These built-in attributes are:
filex get
command). For normal files, this attribute is empty, and setting it to a string value has no effect.
molfile close
command. The attribute is read-only.
Molfiles
which are, for example, property data values or a part of a
molfile loop
command cannot be deleted by standard means.
CR/LF
on Windows and a single
CR
on Macs. This attribute has no effect on input. All input routines automatically recognize and read all three variants on all platforms. On setting, the magic strings windows, mac (both checked for the first three characters only) as well as unix and linux are translated to the standard platform line terminators and not copied verbatim. Alternative names for these standard system encodings are crlf, cr and lf. The special value default resets the attribute to the platform-dependent default.
molfile lopen
command to access a virtual file assembled from multiple physical files, this can be a list with more than one element.
molfile scan
command) which input records must match to yield a result object when a
molfile read
command is run. The read command is automatically looped until a matching record is found, or the end of the input source is reached. Since the test is only applied after a prospective input object has already been fully read internally, this style of record filtering is in many cases considerably less effective than using
molfile scan
for file formats which possess query acceleration features, such as
CBS
,
BDB
or the
Pubchem
virtual file module. For the reading of simple text files, such as
SDF
, there is no performance difference to using
molfile scan
in the ens or reaction object retrieval mode, and this type of filter which can be easily adjusted or disabled (by setting it to an empty string) can be convenient.
filex
command for the format. The possible values are
default
(or -1), which is the default and selects the default hydrogen write mode of the file format,
none
(or 0) which suppresses hydrogen output,
special
(or 1) which writes hydrogens shown normally with a symbol only, and
all
(or 2), which writes all extant hydrogens. Since this attribute does not change the hydrogen atom set, setting for example the mode to all when there are no hydrogens attached to the structure has no effect.
hydrogen add
command, automatic hydrogen addition on input, or similar mechanisms) and
addblind
(which is the same as add, but does not register the added hydrogen atoms as implicit in property
A_IMPLICIT
). When writing a structure object to a file with enabled hydrogen processing, the original object is not changed. Hydrogen processing takes place on an ephemeral duplicate object. On input, hydrogens which are not explicitly encoded, but defined via implicit valence rules in the format specification, are still instantiated in
asis
mode. For example, a single C atom in an
MDL
Molfile is read as a single atom, because there are no default valence rules, but a C as a
SMILES
string is expanded into one carbon plus four hydrogen atoms. For a method to suppress the expansion of valence-implicit hydrogen atoms, see the
readflags
attribute.
molfile read
command. This is normally the value of the
record
molfile attribute after the read operation minus one and corresponds to the file record number of the read object in the data file.
https://loinc.org/
) describing the file contents. For I/O formats where this information is used (currently
SPL
) and the attribute is not set, the default is
64124-1
(
Indexing - substance
). Setting a
LOINC
value for the first time takes a second or two because this is a controlled vocabulary, and the term table is loaded from disk for verification.
molfile loop
statement. This is the same as the content of the loop variable. If no loop is active, this is an empty string. This is a read-only attribute.
molfile open
or
molfile lopen
command. Possible values are
append
,
pipe
,
read
,
string
,
write
and
update
. Note that in this attribute there is no difference between the standard read and the restricted read-only modes (see
molfile open
). The file mode cannot be changed at a later time by directly changing the mode attribute. However, with some limitations, a file may be switched back and forth between input and output modes with the aid of the
molfile toggle
command.
record
attribute minus one, but if reactions are read from files where reagents and products are separate sub-records, or if complete datasets are read, the difference may be larger.
molfile read
operation. To this decoded object the contents of the other columns are attached as property data. Typically the content of the structure column is a Reaction
SMILES
string or similar line notation. A negative value of this attribute indicates that the presence of structure data in a specific column is unconfirmed. In that case, an attempt is made to determine the reaction column automatically, and the attribute is updated accordingly. However, setting it explicitly may still be required in case there are multiple columns with reaction data, or there are too many unreadable or
NULL
row entries to allow automatic determination.
::cactvs(default_reaction_screen_property)
and is usually
X_SCREEN
. If a file is opened that contains information about the screen property set when the file was written (for example,
CBS
and
BDB
formats), this attribute is automatically set to the value stored in the file.
molfile append
command can also be conveniently used. There are also a few shortcut alias attribute names which set or reset selected, frequently used flags directly (
complexresolver
). The following flag names are currently recognized:
molfile scan
command and is used there in order to perform full-file queries starting from an arbitrary position in the middle of the file.
CR/LF
files on
CR
-only or
NL
-only platforms, or vice versa, which is always possible and fully automatic. This flag addresses the problem that, due to mishandling by obscure transfer software, duplicated
EOL
-markers are introduced in the file (two identical
CR/LF
, or
CR
, or
NL
pairs after each data line).
NL
(
ASCII
10) character as data content instead of examining it as potential line break symbol. This flag is necessarily ignored on Mac-style input files which only use
CR
as
EOL
markers. It is possible to set this attribute in order to reposition the file pointer. In case the file is opened for output, and is not in update or append mode, this operation truncates the file. Repositioning while reading does not modify the file. It is not possible to position the file pointer any further to the rear of a file than immediately behind the end of the last existing record. In case of virtual files, a record setting implicitly changes the vrecord attribute, not the current physical file record.
When setting the attribute, the special values end and last can be used to position the file pointer behind the last, or before the last record, and a negative value is interpreted as a backspace from the current position. The return value is the resolved record number.
molfile count
command before querying the record table.
molfile scan
command. The returned data is a
Tcl
dictionary with keys
start_time
(in seconds since 1970-1-1),
stop_time
(in seconds since 1970-1-1),
scan_time
(in seconds),
ens_read
(count of ensemble objects instantiated),
miniens_read
(count of Minimol objects decoded),
reactions_read
(count of reaction objects instantiated),
properties_read
(count of property records read),
ens_screened
(count of bit-screen filtering operations performed for substructure/superstructure searches),
reactions_screened
(count of bit-screen filtering operations performed for reaction matching),
records_examined
(count of records looked at),
records_matched
(number of matched records),
start_record
(record the scan started at),
end_record
(last visited record),
eof_reached
(boolean indicator whether the end of the file was reached),
max_mmap_used
(maximum used size of memory mapping arena),
max_mmap_requested
(maximum requested size of memory mapping arena),
records_skipped
(number of records which were skipped with need for re-synchronization),
records_repositioned
(number of records which were finished without the need for a re-synchronizing skip operation), and
scores_computed
(the number of scoring function calls executed).
::cactvs(object_scope)
is also set, the object is visible only in the
Tcl
interpreter which set the scope flag and thus claimed it. Object list commands executed in other interpreters omit this object, and attempts to decode its handle in other interpreters will fail. The most common use of this feature is the hiding of persistent chemistry objects in scripted property computation functions.
molfile set
with a list of record numbers in any order to modify the attribute resets the current flags, and creates a new set. Modifying the attribute via
molfile append
adds selection flags without resetting the current selection. The selection flag can only be set for existing records. If an attempt is made to set the selection flag ahead of the currently known position set, the command scans the record structure (as in
molfile count
), which can be a problem in case of non-rewindable input. In order to facilitate resetting of selection flags, the virtual attribute
deselection
can be accessed as the inverse of the selection. Setting it to an empty list selects all records up to the end of the file (again this triggers automatic forward scanning, if necessary), and appending a list of records removes them from the selection. The default value of the selection flag for any record is
false
.
::cactvs(default_similarity_property)
and is usually either
E_SCREEN
or
E_QUERY_SCREEN
. If a file is opened that contains information about the similarity property set when the file was written (for example,
CBS
and
BDB
formats), this attribute is automatically set to the value stored in the file.
molfile read
operation. This string is decoded and the content of the other columns is attached as property data to this object. Typically the content of the structure column is a SMILES, SLN or InChI string. A negative value of this attribute indicates that the presence of such structure data is not confirmed. In that case, an attempt is made to determine the structure column automatically, and the attribute is updated accordingly. However, setting it explicitly may still be required in case there are multiple columns with structure data, or there are too many unreadable or
NULL
row entries to allow automatic determination.
::cactvs(default_substructure_screen_property)
and is usually either
E_SCREEN
or
E_QUERY_SCREEN
. If a file is opened that contains information about the screen property set when the file was written (for example,
CBS
and
BDB
formats), this attribute is automatically set to the value found in the file.
::cactvs(default_superstructure_screen_property)
and is usually either
E_NO_HYDROGEN_SCREEN
or
E_NO_HYDROGEN_QUERY_SCREEN
. If a file is opened that contains information about the screen property set when the file was written (for example,
CBS
and
BDB
formats), this attribute is automatically set to the value found in the file.
molfile scan
command. When the time is exhausted, the scan terminates after the respective current record has been cleanly processed by all query threads, even if the end of the file has not been reached. Setting the attribute to zero, which is the default, allows an unlimited time to be spent on a query. Another function where the timeout value is used is in reading a record via an Internet connection, for example an
http
or
ftp
URL. If the timeout expires and the record has not been downloaded, an error results.
The allowed field names are
hash
,
host
,
hostname
,
href
,
pathname
,
port
,
protocol
,
search
,
user
,
password
,
directory
,
file
,
ipaddr
,
lastmodified
and
mimetype
. Note that in this context the
port
field name is the port the file is transferred via the Internet connection, which generally is not the same as the listener port for remote requests (see
molfile get
attribute
port
). Likewise, the mimetype here is the MIME type as reported by the server, not the file MIME type defined by the file format handler module. Example:
set ip [molfile get $fh url(ipaddr)]
molfile lopen
command, this attribute is the global line number in the virtual file, while
line/lc
refers to the line count within the current physical file. The attribute name
vlc
is an alias.
molfile lopen
command, this attribute is the global record number in the virtual file, while
record/rc
refers to the record count within the current physical file. The attribute name
vrc
is an alias. This attribute can be set and changing it results in repositioning of the file pointer, and potentially even a change in the active physical file.
When setting the attribute, the special values end and last can be used to position the file pointer behind the last, or before the last record, and a negative value is interpreted as a backspace from the current position. The return value is the resolved record number.
ens expand
command to achieve this). Rather, it expects the full set of atoms of the expanded form in the ensemble, plus one or more properly set up group objects indicating the atoms of the expanded form of a functional group or fragment which are not shown in the contracted style. If these groups are present, only the first atom in any group is shown, with the
G_NAME
data as atom tag, which overrides all other label information. However, the output file still contains the hidden atoms and their data. Tools like
ChemDraw
use this data to support interactive group expansion utilizing the original layout coordinates of the previously hidden atoms and other information.
The attribute list above is also referenced by the molfile set command. This is the reason why it contains information about the read-only status of the individual attributes. Only attributes that can be set can be addressed by the molfile set command.
For the use of the optional property parameter list argument, refer to the documentation of the ens get command.
Filters in the optional filter set must apply directly to the file object. Filters which operate on other object types are ignored.
Variants of the molfile get command are molfile new, molfile dget, molfile jget, molfile jnew, molfile jshow, molfile nget, molfile show, molfile sqldget, molfile sqlget, molfile sqlnew, and molfile sqlshow. These commands only work on property data and cannot be used to access attributes.
molfile getline filehandle ?skiprecord?
f.getline(?skiprecord=?)
Read a text line from the file, with repositioning of the file pointer. This operation is only possible on text files which have been opened for reading. The command is not frequently used, because it tends to disrupt the normal file record parsing.
If the skiprecord boolean argument is set, the file is positioned to the beginning of the next record after the line has been retrieved.
The command returns the line read. Line termination characters are removed.
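A usage sketch, fetching one raw text line and then positioning the file at the start of the next record (the file name is a placeholder):
set fh [molfile open "myfile.sdf"]
set line [molfile getline $fh 1]
puts "first line of the record: $line"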
molfile getparam filehandle property ?key? ?default?
f.getparam(property=,?key=?,?default=?)
Retrieve a named computation parameter from valid property data. If the key is not present in the parameter list, an empty string is returned (
None
for
Python
). If the default argument is supplied, that value is returned in case the key is not found.
If the key parameter is omitted, a complete set of the parameters used for computation of the property value is returned in dictionary format.
This command does not attempt to compute property data. If the specified property is not present, an error results.
molfile getparam $fhandle F_QUERY_GIF format
returns the actual format of the data in that property, which could be a GIF, PNG or bitmap format.
molfile hloop filehandle objvar ?maxrec? body
f.hloop(function=,?maxloop=?,?variable=?)
Molfile.Hloop(filename,function=,?maxloop=?,?variable=?)
This command is functionally equivalent to the molfile loop command. The difference is that for the duration of the loop command hydrogen addition is enabled for the file handle. The original hydrogen addition mode of the file object is restored when the loop finishes.
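The usage pattern mirrors molfile loop; a minimal sketch printing the molecular weight of every record while hydrogen addition is temporarily enabled:
molfile hloop $fhandle eh {
    puts [ens get $eh E_WEIGHT]
}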
molfile hread fhandle ?datasethandle/enshandle/#auto/new? ?recordcount?
molfile hread fhandle ?datasethandle/enshandle/#auto/new? ?parameterdict?
f.hread(?target=?,?parameters=?)
Molfile.Hread(filename,?target=?,?parameters=?)
This command is identical to the molfile read command, except that standard hydrogen addition is enabled for the duration of the command. The original hydrogen mode is reset when the command completes.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
set eh [molfile hread "myfile.mol"]
This is a simple single-record structure input with hydrogen addition, using a file name instead of a file handle. The file is automatically opened for the duration of the command and closed afterwards.
molfile jget filehandle propertylist ?filterset? ?parameterdict?
f.jget(property=,?filters=?,?parameters=?)
Molfile.Jget(filename,property=,?filters=?,?parameters=?)
This is a variant of molfile get which returns the result data as a JSON formatted string instead of Tcl or Python interpreter objects.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
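For example, retrieving the file comment as a JSON string rather than as a plain interpreter value:
set json [molfile jget $fhandle F_COMMENT]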
molfile jnew filehandle propertylist ?filterset? ?parameterdict?
f.jnew(property=,?filters=?,?parameters=?)
Molfile.Jnew(filename,property=,?filters=?,?parameters=?)
This is a variant of molfile new which returns the result data as a JSON formatted string instead of Tcl or Python interpreter objects.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile jshow filehandle propertylist ?filterset? ?parameterdict?
f.jshow(property=,?filters=?,?parameters=?)
Molfile.Jshow(filename,property=,?filters=?,?parameters=?)
This is a variant of molfile show which returns the result data as a JSON formatted string instead of Tcl or Python interpreter objects.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile list ?filterlist?
Molfile.List(?filters=?)
This command returns a list of the molfile handles currently registered in the application. This list may optionally be filtered by a standard filter list.
molfile list
molfile lock filehandle propertylist/objclass/all ?compute?
f.lock(property=,?compute=?)
Lock property data of the file handle, meaning that it is no longer subject to the standard data consistency manager control. The data consistency manager deletes specific property data if anything is done to the file handle which would invalidate the information. Property data remains locked until it is explicitly unlocked.
The property data to lock can be selected by providing a list of the following identifiers:
The lock can be released by a molfile unlock command.
This command is a generic property data manipulation command which is implemented for all major objects in the same fashion and is not related to disk file locking. Disk file locks can be set or reset by modifying the molfile object attribute lock. This is explained in more detail in the paragraph on the molfile get command.
The return value is the original molfile handle or reference.
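A small sketch, protecting a manually set file comment from the consistency manager and releasing the lock again later:
molfile set $fhandle F_COMMENT "curated subset"
molfile lock $fhandle F_COMMENT
# ... later, when invalidation is acceptable again
molfile unlock $fhandle F_COMMENT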
molfile loop filehandle objvar ?maxrec? body
f.loop(function=,?maxloop=?,?variable=?)
Molfile.Loop(filename,function=,?maxrecords=?,?variable=?)
for obj in f:
Execute a loop over the file. Objects are read from the file from the current file position onwards. The type of object read (usually ensemble or reaction, but in principle it could also be a table or dataset object) depends on the read scope of the file. In the Tcl variant, the handle of every object input from a file record is assigned to the specified Tcl object variable. Next, the Tcl script code in the body argument is executed. The body code typically uses the value of the variable to perform some operations with the currently read object. After the body code has been executed, the object which was just read is deleted, and the cycle is repeated, either until EOF has been reached on the file (the default), or the maximum number of records specified by the optional parameter has been reached, whichever comes first. In either case, no error is generated when the end of file has been reached. Setting the maximum record count parameter to an empty string, or to a negative value, results in the default processing style running until the end of the file.
For Tcl scripts, the standard Tcl break and continue commands work as expected within the loop. If the body script generates an error, the loop is terminated and the error reported. Programs should not expect that the same object handle value stored in the variable is reused in each iteration.
Since the input objects are automatically deleted after they have been processed, it is not required to delete them in the loop code. Deletion requests on the loop object executed within the loop are ignored. Any other operation on the structure object is allowed. The loop code may perform repositioning operations on the input file, but not close it.
The Python version of the loop method intentionally has a different argument sequence for convenience. The function argument may either be a multi-line string (similar to the Tcl construct), or a function reference. Functions are called with the reference of the current loop object as the single argument, and have their own context frame, so that the specification of a reference variable is not generally useful in that call style, though it is allowed. For string function blocks the code is executed in the local call frame, and the variable with the current object reference is visible locally. Script code blocks must be written with an initial indentation level of zero. Within Python functions, the normal break and continue loop control statements cannot be used due to scope limitations. Instead, the custom exceptions BreakLoop and ContinueLoop can be raised. These are automatically caught and processed in the loop body handler code.
In Python, there is also an object iterator so that simple loops over structure file contents can be written with a for statement. The molfile object iterator is of the self style (i.e. there is one per molfile, these are not independent objects), so nesting iterators over the same molfile is not possible. There is no distinct hloop iterator, but that can be emulated by setting the hydrogens attribute on the molfile object.
Python object loop constructs and their peculiarities are discussed in more detail in the general chapter on Python scripting.
The return value is the number of processed records.
set th [table create]
table addcol $th E_NAME
table addcol $th E_WEIGHT
molfile loop $myfile eh {
table addrow $th #auto end [list [ens get $eh E_NAME] [ens get $eh E_WEIGHT]]
}
This sample loop successively reads all records from the file and stores the ensemble handles in variable eh . In the loop body, the handle is used to extract name and molecular weight information from the structure and store it in a table object.
molfile lopen filelist ?mode? ?attribute value?...
molfile lopen filelist ?mode? ?attributedict?
Molfile(filenamesequence,?mode=?,?attribute=?)
Molfile.Open(filenamesequence,?mode=?,?attributes=?)
Molfile.Lopen(filenamesequence,?mode=?,?attributes=?)
Open a list of files as a virtual file. The files identified by the file list items are implicitly concatenated in the list order. In addition to normal files, the standard set of special input types such as URLs, pipes, Tcl file handles or standard channels may be used. This command returns a single file handle, regardless of the number of input files passed as parameter.
A file list can only be opened for read operations on input objects. Writing, appending, updating or string input are not supported.
Most input file operations can be performed on virtual files. One important exception is currently file scanning with query expressions. This only works for lists of standard sequential files, not files which contain optimized query layouts, such as the native Cactvs CBS and BDB file formats. These can only be used as a single file for molfile scan commands. However, simple structure input is possible across file boundaries even with these formats.
The rest of the options are processed in the same way as in the standard molfile open command.
In the Python interface, there is no distinction between the lopen and open commands, because it can be unequivocally established whether the filename argument is a sequence (tuple or list of filenames), or a single file name. The interpretation is performed according to the argument type. The Python command always uses a file attribute dictionary, not a keyword/value argument set.
set fhandle [molfile lopen [lsort [glob *.mol]]]
molfile max filehandle property ?filterset?
f.max(property=,?filters=?)
Molfile.Max(filename,property=,?filters=?)
Scan the file for the maximum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.
If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.
The property may correspond either to a data column in the file, or to a computable property on the structure or reaction objects read during the scan. Read objects are transient and automatically discarded. The property argument may contain a field specification, and in that case, only the field value is compared.
The maximum value determination uses the standard property comparison function associated with its data type. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode a proper data type of its contents this is not necessary.
The return value is the maximum property or property field value found, or an empty string if no input was processed.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
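A one-shot sketch, scanning a file addressed by name (the name is a placeholder) for the largest molecular weight among the transiently read ensembles:
set maxmw [molfile max "screening_set.sdf" E_WEIGHT]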
molfile metadata filehandle property ?field ?value??
f.metadata(property=,?field=?,?value=?)
Obtain property metadata information, or set it. The handling of property metadata is explained in more detail in its own introductory section. The related commands
molfile setparam
and
molfile getparam
can be used for convenient manipulation of specific keys in the computation parameter field. Metadata can only be read from or set on valid property data.
Valid field names are bounds, comment, info, flags, parameters and unit.
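A small sketch, first making sure the property data is valid on the file object and then reading its computation parameters (F_AVERAGE_ATOM_COUNT is used here only as an example of a computable file property):
molfile need $fhandle F_AVERAGE_ATOM_COUNT
set params [molfile metadata $fhandle F_AVERAGE_ATOM_COUNT parameters]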
molfile min filehandle property ?filterset?
f.min(property=,?filters=?)
Molfile.Min(filename,property=,?filters=?)
Scan the file for the minimum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.
If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.
The property may correspond either to a data column in the file, or to a computable property on the structure or reaction objects read during the scan. Read objects are transient and automatically discarded. The property argument may contain a field specification, and in that case, only the field value is compared.
The minimum value determination uses the standard property comparison function associated with its data type. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode a proper data type of its contents this is not necessary.
The return value is the minimum property or property field value found, or an empty string if no input was processed.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile mutex filehandle mode
f.mutex(mode)
During the execution of a script command, the mutex of the major object(s) associated with the command are automatically locked and unlocked, so that the operation of the command is thread-safe. This applies to toolkit builds that support multi-threading, either by allowing multiple parallel script interpreters in separate threads or by supporting helper threads for the acceleration of command execution or background information processing.
Going beyond this automatic per-statement protection, this command locks major objects for a period of time that exceeds a single command. A lock on the object can only be released from the same interpreter thread that set the lock. Any other threaded interpreters, or auxiliary threads, that access a locked object block until a mutex release command has been executed. This command supports the following modes:
There is no trylock command variant because the command already needs to be able to acquire a transient object mutex lock for its execution.
molfile need filehandle propertylist ?mode? ?parameterdict?
f.need(property=,?mode=?,?parameters=?)
Standard command for the computation of property data, without immediate retrieval of results. This command is explained in more detail in the section about retrieving property data.
The return value is the original file handle or reference.
molfile need $fhandle F_AVERAGE_ATOM_COUNT
molfile new filehandle propertylist ?filterset? ?parameterdict?
f.new(property=,?filters=?,?parameters=?)
Molfile.New(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the molfile get command. The difference between molfile get and molfile new is that the latter forces the re-computation of the property data, regardless of whether it is present and valid or not.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile nget filehandle propertylist ?filterset? ?parameterdict?
f.nget(property=,?filters=?,?parameters=?)
Molfile.Nget(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the molfile get command. The difference between molfile get and molfile nget is that the latter always returns numeric data, even if symbolic names for the values are available.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile nnew filehandle propertylist ?filterset? ?parameterdict?
f.nnew(property=,?filters=?,?parameters=?)
Molfile.Nnew(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data and attributes. It is explained in more detail in the section about retrieving property data.
For examples, see the molfile get command. The difference between molfile get and molfile nnew is that the latter always returns numeric data, even if symbolic names for the values are available, and that property data re-computation is enforced.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile open filename ?mode? ?attribute value?...
molfile open filename ?mode? ?attributedict?
Molfile(filenamesequence,?mode=?,?attribute=?)
Molfile.Open(filenamesequence,?mode=?,?attributes=?)
This command opens a structure file or other input source for input or output. The filename argument may be any of:
This is the most common case. File names may be absolute or relative. On the Windows platform, the path naming follows the Tcl convention, with backslashes replaced by forward slashes, and optional drive letters, in the same way as the standard Tcl open command. Tilde substitution is also supported and built into the command. In case a file name could possibly collide with a reserved name, the file name can be prefixed with ./ in order to force interpretation as a file name. File name expansion can be conveniently performed by means of the standard Tcl glob command. File names must currently be spelled in the 8-bit ISO8859-1 character set. Unicode file names are not yet supported. On Unix platforms, named pipes and sockets may also be opened with this command.
molfile open ./stdout r
molfile open ~theuser/data/newleads.sdf
molfile open C:/temp/calicheaamycin.pdb w
The file names stdout, stderr and stdin are reserved and connect the file handle to a standard I/O channel. stdout and stderr can only be opened for output, and stdin can only be read from. The character '-' (minus) is an alternative name for standard input.
molfile open stdout w format mdl
molfile open ./stdout
The first line opens an MDL file for output on standard output. The second sample line opens the file in the current directory which is named "stdout" for input. By prefixing file names with directory information, any file with a reserved name can be opened as a standard file.
The name scratch is reserved as the name of a generic scratch file. The file is initially opened for writing, but may be switched to input later by a molfile toggle command. The magic filename is translated into the name of a platform-specific temporary file. Every invocation of this command variant generates a new scratch file, with a different name. The true file name can be obtained with an attribute query:
set fh [molfile open scratch]
set name [molfile get $fh name]
Scratch files are automatically deleted when they are closed, or when the program exits.
If a file name starts with a vertical bar character "|", a pipe is opened from (in read mode) or to (in write mode) the commands listed after the bar.
molfile open "|gzip >thefile.sdf.gz" w format mdl
When the file is closed, the pipe and all programs connected to it are automatically shut down. Pipes cannot be rewound, or switched from input to output and vice versa.
The Cactvs toolkit supports reading from various types of URLs. Currently, the schemes ftp, http, file and gopher are supported. file URLs are just another notation for normal disk files, as described above. From among the other URL schemes, only ftp and http connections may be opened for writing. The support for ftp URLs includes username and password components. If the server side supports it, passive ftp is the preferred mode. Http connections opened for writing use the PUT http command, which often is not activated in standard Web server set-ups and may therefore be of limited practical usefulness. URL connections can be rewound and backspaced, but this is costly because the existing connection has to be disconnected and the initial data from the beginning of the file to the desired position needs to be re-transferred and discarded.
set fh [molfile open http://www.yourcompany.com/repository/jcamp/ir1.jcp]
molfile open ftp://yourid:yourpasswd@ftp.yourcompany.com/upload/ideas.sdf
If the target is a directory, all files in the directory are scanned. Those files which were identified as structure data files by any of the built-in or currently loaded I/O module extensions are concatenated to a virtual file which comprises all individual files. The order in which the files are concatenated is largely unpredictable, because it is defined by the order of the file name entries in the directory, and not any alphabetic sort criterion. The files may be of different formats, and may be any mixture of single-record and multi-record files. Subdirectories of the opened directory are not entered by default, but this may be activated by appending a 'd' character to the open mode. Directories may only be opened for reading.
set fh [molfile open .]
set fh [molfile open $mydir rd]
The second example opens not only perceived structure files in the source directory, but also in all subdirectories thereof.
The Cactvs toolkit can read most file formats directly from a string. There is no need to write structure data which was obtained as a string image to a temporary file to decode it. Data strings are opened as a structure file with mode 's'. Only input is possible, but navigation within the string with molfile rewind etc. works as expected. The complementary molfile string command can be used to generate a string image of a file record.
set fh [molfile open $thedatablob s]
set eh1 [molfile read $fh]
set eh2 [molfile read $fh]
molfile close $fh
Any file name beginning with file or sock, where the rest of the file name is a sequence of digits, is interpreted as a reference to a Tcl file handle.
set tcl_fh [open thefile.txt w]
set cactvs_fh [molfile open $tcl_fh w]
A
Tcl
handle can only be accessed by this command in a mode which is compatible to the mode it was opened with, i.e. it is not possible to write to a file via a
Tcl
handle if it was opened for reading. If a structure file coupled to a
Tcl
handle is closed with a
molfile close
command, the
Tcl
handle remains valid, and may be used freely once the association to the structure file I/O object is broken. Closing the
Tcl
file handle while the piggybacked structure file handle is being used is illegal. No input, output or positioning should be performed on the
Tcl
handle with standard
Tcl
commands while it is being referred to by a
molfile
object.
In the Python interface, the same mechanisms apply, except that the argument is a Python file handle object.
The
Tcl
handle functionality is not available on Windows, because on this platform
Tcl
internally uses Windows handles for I/O, while the
Cactvs
toolkit builds on standard Posix C library
FILE*
pointers.
Some I/O modules implement access to a variety of information sources as a virtual file, which neither has a presence on the local disk nor is one of the standard magic file names or access methods. Such virtual file names are by convention written with angle brackets.
set fh [molfile open <pubchem>]
This command loads the
PubChem
virtual file access module, and returns a handle which may be used in a similar fashion as, for example, a handle to a huge local SD file. Depending on the I/O module, various operations on the handle may be optimized to be performed remotely. For example, the
PubChem
module offloads as many query operations of
molfile scan
commands as possible to the NCBI computers and downloads result structures only if they are needed as results, or query sub-expressions were specified which cannot be processed by the NCBI system.
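A hedged sketch of this usage; the property threshold is illustrative, and the reclist result mode is the record-list mode used in the molfile scan examples later in this section:
set fh [molfile open <pubchem>]
# collect the record numbers of matching entries; as much of the filtering
# as possible is offloaded to the remote system
set hits [molfile scan $fh {E_WEIGHT <= 250} reclist]
molfile close $fh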
The first optional parameter is the file access mode. It may be one of:
molfile count
, are automatically blocked until the thread has completed, and then directly use its results. Operations which change the nature of the access to the file, or its record contents or positions, silently terminate the status thread.
molfile toggle
command. If the file permissions do not allow write access, the standard 'r' mode automatically falls back to this variant. Mode 'rot' is also possible and additionally starts a file status thread (see mode 'rt').
molfile read
commands. For such files, an automatic format detection would fail. The 'o' and 't' flags may also be appended, and have the same meaning as in the standard 'r' mode. For some files and file formats, two more mode characters have meaning if appended to the primary mode. They are silently ignored if the file argument or file format do not support them.
The remaining parameters of the
molfile
command are optional keyword/value pairs, or alternatively a single dictionary with the same function. The processing of these parameters is exactly the same as in the
molfile set
command.
In the
Python
interface, there is no distinction between the
lopen
and
open
commands, because it can be unequivocally established whether the filename argument is a sequence (tuple or list of filenames), or a single file name. The interpretation is performed according to the argument type. The
Python
command always uses a file attribute dictionary, not a keyword/value argument set.
set fhandle1 [molfile open thefile.pdb]
molfile set $fhandle1 hydrogens add nitrostyle ionic
set fhandle2 [molfile open thefile.pdb r hydrogens add nitrostyle ionic]
The first two lines and the final line perform exactly the same task: Open an input file, and set up input flags so that a complete set of hydrogens is added, and nitro groups and similar groups are converted to an ionic (as opposed to pentavalent) representation.
When a file is opened for reading, its format is automatically determined. Do not use the format attribute except under very special circumstances.
The command returns the file handle or reference of the opened input file. This is the handle which is required by most other
molfile
commands which refer to an opened file.
Depending on the encoding of the opened file, the actual access mode to the file may be different than expected. In case a disk file is compressed with gzip or bzip2, the file is opened via a pipe to the responsible decompressor program. Likewise, a UCS-2 encoded file is opened via a pipe to the iconv program which converts the contents to the UTF-8 encoding. Files which are opened indirectly via such helper pipes have different access characteristics than directly addressed files. For example, backspacing is expensive, because the pipe has to be closed, re-opened, and the data stream skipped to the desired position. This takes much longer than simply repositioning a file pointer.
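For example, a gzip-compressed SD file can be opened like any normal disk file, and the decompressor pipe is set up behind the scenes (the file name is illustrative):
# open a compressed file; decompression is handled transparently
set fh [molfile open compounds.sdf.gz]
set eh [molfile read $fh]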
molfile peek filehandle
f.peek()
Molfile.Peek(filename)
This is a convenience command which combines three operations: Read the next record (
molfile read
), discard whatever object is read by the command as configured by the file handle settings (
ens/reaction/dataset delete
), and backspace by one record (
molfile backspace
).
The purpose of this command is to learn more about the contents and characteristics of the file by performing a full parse of the next record. One of the most common applications of the command is to detect the field structure (such as SD data fields) of that record before the read. The detected field set is consequently the return value of the command, equivalent to a
molfile get filehandle fields
statement.
This command can only be used on files which can be backspaced, or at least rewound and skipped forward to the last position. It cannot be used on files not opened for input, on empty files, or files which are at EOF . In all these cases, an error results.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
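A minimal Tcl sketch (the file name is illustrative):
set fh [molfile open input.sdf]
# report the data fields of the next record without consuming it
set fields [molfile peek $fh]
# the peeked record is still the next one to be read
set eh [molfile read $fh]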
molfile properties filehandle ?pattern? ?noempty?
f.properties(?pattern=?,?noempty=?)
Generate a list of the names of all properties attached to the molfile object. Optionally, the list may be filtered by a string match pattern.
In most cases, this list is empty. Only structure file properties, such as
F_COMMENT
, etc., are listed, but no object attributes, such as
readflags
,
nitrostyle
, etc. Few file formats support the concept of storing file-level properties, and therefore an empty property set is usually reported. Since file objects do not contain minor objects, and currently cannot be a member of other major objects such as datasets or reactions, no properties belonging to other classes except file objects are ever listed.
If the noempty flag is set, only properties where at least one data element is not the property default value are output. By default, the filter pattern is an empty string, and the noempty flag is not set.
The property list may be modified by input operations. In some cases, the defined file-level properties may vary with the record position, or may become available only after the first input operation, not immediately after opening the file.
The command may be abbreviated to
props
instead of the full name
properties
.
set plist [molfile properties $fhandle]
molfile purge filehandle propertylist/molfile/all ?emptyonly?
f.purge(?properties=?,?emptyonly=?)
Delete property data from the molfile object. Only molfile property data may be deleted with this command (these usually have a F_ prefix). Molfile attributes are not deletable.
If the optional flag is set, only file property values which are identical to the default of the property are deleted. By default, or when this flag is 0, properties are deleted regardless of their values. In case a listed property is not present, or not a file property, the request is silently ignored, but using property names which cannot be resolved leads to an error. If the object class name molfile is used instead of a property name, all file-level property data is deleted from the molfile object.
The command returns the original molfile handle or reference.
molfile purge $fhandle F_COMMENT
molfile purge $fhandle all
The first command deletes a specific property, the second command deletes all file property data associated with the handle.
molfile putline filehandle ?lines?
f.putline(?line?,...)
Write user-specified string lines to a file, bypassing the normal record writing mechanism. This operation is only supported on files which are opened for output and contain text data. The lines should not contain end-of-line characters. These are automatically supplied depending on the file object configuration set in the eolchars attribute.
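A small sketch, assuming a text-format output file; the file name, format and line contents are illustrative:
# open a text-format output file and write two raw lines ahead of any records
set fh [molfile open header_test.sdf w format mdl]
molfile putline $fh "custom header line 1"
molfile putline $fh "custom header line 2"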
molfile read fhandle ?datasethandle/enshandle/#auto/new? ?recordcount?
molfile read fhandle ?datasethandle/enshandle/#auto/new? ?parameterdict?
m.read(?target=?,?parameters=?)
Molfile.Read(filename,?target=?,?parameters=?)
This important command reads chemistry objects from a structure or reaction file. The type of objects returned depends on the read scope of the file. They can be ensembles, reactions, or datasets. Read scope mol returns single-molecule ensembles, but (with I/O modules supporting this feature) reads only individual molecules into the output ensemble, splitting a multi-molecule file data ensemble if necessary. The return value of the command is a list of the handles or references of all objects which were generated, except when the #auto dataset creation method was used, or an unlimited number of objects was read into a dataset. In that case, the recipient dataset handle or reference is returned.
By default, the returned objects are not a member of any dataset. If a dataset handle is passed as fourth parameter, the returned objects are appended to that dataset if possible. The special value
#auto
or
new
creates a new dataset as container. This is equivalent to using the nested statement
[dataset create]
as dataset handle argument. If the fourth parameter is an ensemble handle, and the object read from the file is also an ensemble, the read data is stored in the shell of the old ensemble, after all old ensemble data has been deleted. Its object handle remains unchanged, as does its dataset membership. The reuse of reaction handles is currently not supported. This parameter can be skipped by specifying an empty string.
In addition to passing an empty string, or a simple dataset or ensemble handle or reference, as the fourth command argument, a list/tuple consisting of a handle or reference and a modifier flag set can be specified. The only flag value which is currently recognized is
checkroom
. If that flag is set, and the input objects are to become members of a dataset with enabled maximum size or insertion mode control, a test is made whether the dataset has sufficient room to allow the insertion of the new object(s), or whether a suitable alternative action is configured to handle the read object in a different fashion, such as discarding it. If that is not the case, the command returns immediately, without performing any input, and returns an empty string (
None
for
Python
). If the test succeeds, the input operation is atomic, since the dataset is locked for the full duration of the command, so that no other threads can manipulate its status between the initial check and the file input result object transfer.
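A sketch of the checkroom flag, assuming $dh is a dataset handle with a configured maximum size or insertion mode control:
# read the next record into the dataset only if it still has room
set result [molfile read $fh [list $dh checkroom]]
if {$result eq ""} {
    # the dataset could not accommodate the object; nothing was read
    puts "dataset full, record not read"
}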
The final optional parameter is either a single argument specifying the number of objects which should be read, or a dictionary with key/value attributes. The default is equivalent to passing a simple numerical value of one, in the first, simple format. In order to read until the end of the file, the special value
all
may be used instead of a numerical count. With an all parameter value, the input operation is finished when no more data is available on the file. Until this condition is met, an unlimited number of records is read. No error is generated when
EOF
is met. There are also no
EOF
errors reported if a numerical record count of more than one was specified, and at least one object could be successfully read. Another magical value of the simple argument form is
batch
, which is substituted by the batch record set size configured on the
molfile
handle (see
molfile get/set
).
In the second form of the final parameter, an attribute dictionary is applied persistently, equivalent to a
molfile set
command before the input commences. Standard file handle attributes and an input limit may be both set in parallel by using the special attribute name limit as part of the dictionary. It is only recognized in this context, but not with
molfile set
or
molfile string
. The allowed values of the limit attribute are the same as in the simple command variant.
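A sketch of this dictionary form, combining a persistent attribute change with an input limit (the attribute values are illustrative):
# add a standard hydrogen set and read the rest of the file in one call
set objects [molfile read $fh {} {hydrogens add limit all}]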
The command raises an error if input could not be completed, regardless whether the reason is a file syntax error, or simple
EOF
(but see above for exceptions). If an input error occurs, the
EOF
attribute of the file handle should therefore be checked in order to distinguish between these two conditions. In case the input file was opened for pipe reading (mode ’p’), or is connected to a
Tcl
channel, an
EOF
report may only indicate that no current data is available on the pipe or
Tcl
channel, but it could still arrive at a future point in time.
if {[catch {molfile read $fhandle} ehandle]} {
    if {![molfile get $fhandle eof]} {
        puts "Error: $ehandle"
    }
} else {
    puts "Read [ens get $ehandle E_NAME]"
}
The prototypical snippet above shows the input of the next ensemble record from a previously opened file, with proper error checking.
molfile read "acd.sdf" [dataset create] all
This sample command reads a complete input file (we are using the single-operation feature of the
molfile
command to open and close the file acd.sdf automatically for the duration of this command) into a newly created dataset in memory. Reading huge datasets is of course not necessarily a good idea without large amounts of
RAM
. On typical current workstations, 10,000 or 20,000 compounds are no problem, but beyond that the risk of running out of memory becomes real.
In default mode, hydrogens are not automatically added to the read items, with the exception of file formats where a clearly defined hydrogen set is implied (such as SMILES, but not MDL molfiles). This is probably the most common problem developers run into when using this command. Generally, Cactvs wants to operate on hydrogen-complete structures, and its internal file formats use explicit hydrogen encoding. Working with hydrogen-incomplete structures is possible, and sometimes useful, but can lead to unexpected artifacts like radical centers on atoms with missing hydrogens. In order to continue with a standard hydrogen set, the most common options are:
molfile set
to change the
hydrogens
attribute to a suitable automatic hydrogen addition mode. The
molfile open
command can also configure this attribute directly in a single statement, or you can use the attribute dictionary form of this command for the same purpose.
molfile hread
instead of
molfile read
ens/reaction/dataset hadd
command after the input object is returned and before processing the read objects further.
Molfile.Ref(identifier)
Python
only method to get a
molfile
reference from a handle or another identifier. For
molfiles
, other recognized identifiers are
molfile
references, integers encoding the numeric part of the handle string or the
UUID
of the molfile object.
molfile rename filehandle srcproperty dstproperty
f.rename(srcproperty=,dstproperty=)
This is a variant of the
molfile assign
command. Please refer to the command description in that paragraph.
molfile reorganize filehandle
f.reorganize()
This command only has an effect for file formats for which the I/O module provides a reorganizer function. This function typically optimizes and compacts the file for input and queries, and should usually be called after all records have been written. Writing to a reorganized file is typically at least initially slower than writing to a file which has not been processed.
The function returns a boolean value indicating whether any reorganization has actually been performed. In case the command is applied to a file which is not writable, an error results.
molfile rewind filehandle
f.rewind()
Reposition the file before first record, and clear all error status information. If the file is already at the first record, and no error condition is set, this command does nothing.
Not all file channels can be rewound, and for some which can, it can be an expensive operation. For example, standard input or pipe input channels are not rewindable, and an FTP URL channel has to be closed and re-opened.
Rewinding a virtual file set positions the file pointer before the first record of the first file in the set.
Standard text-stream style output files can be rewound, too. This effectively truncates them. Files which are opened for appending are truncated to their original length.
Rewinding is not necessary in all cases. The
molfile scan
command automatically rewinds the input file if it is at
EOF
at the beginning of a scan.
The return value of the command is the original file handle or reference.
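A typical sketch for making a second pass over a file which has already been read to EOF (only meaningful for channels which can actually be rewound):
# reposition before the first record once the end of the file has been reached
if {[molfile get $fh eof]} {
    molfile rewind $fh
}
set eh [molfile read $fh]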
molfile rewrite filehandle recordlist propertylist ?values? ?query? ?callback?
m.rewrite(records=,properties=,?values=?,?query=?,?callback=?)
This command updates specific property fields in a file, without rewriting the complete record. This is only supported if the file was opened for writing or updating, and the I/O module for the format of the file supports this operation by a special function. This typically limits the applicability of this command to database-style file formats such as Cactvs CBS and BDB .
The record list parameter is either a simple sequence of numerical records, with one as the first file record, or one of the special values all (all file records are updated), current , next , previous (the indicated record is updated), or a table handle, optionally followed by a table column name. In the last case, the table is expected to contain the data for rewriting, and in case a column name is specified, that column should contain the applicable record numbers. If the table version is selected without a record column, the file records from one to the number of table rows are updated. None of the special values can be combined with the simple numerical record sequence style. If the parameter is a numerical record sequence, the order of the records is significant.
The values sequence can be empty, or it must match the length of the property list. In the latter case, every specified value must be a valid value for the property in the same list index position. Note that while it is possible to manipulate multiple records in one step with this command, it is not possible to assign a different set of values to the data fields for each processed record. For this operation, multiple rewrite statements must be issued. If the value list is absent, or empty, the values are recomputed from the structure or reaction object that is temporarily read from the file record for this purpose. This is a useful feature in case the computation function for a computable property has changed. In case the record list references a table instead of a numerical record list or a magic record name, the value list is ignored. Instead, the table is expected to contain table columns which match the properties in the list, but not necessarily in the same column order, nor containing exclusively the properties in the list.
The optional query argument is a query expression in the same style as used in the
molfile scan
command. If a filter expression is supplied, only records which match the expression are changed. Non-matching records are skipped. In case no filter is used, all records selected by the record list are processed.
After processing, the file pointer is on the last processed record.
If the name of a callback procedure is specified in the Tcl interface, it is called after each processed record. The Tcl procedure arguments depend on the processing mode. In case of table-based processing, the arguments are the table handle, the current table row, the file handle and the current file record. In the Python interface, the callback is either a function name given as string, or a function reference.
This command is not fully implemented yet. CBS files currently only support re-computation of property data from object data, not updates from explicit value lists. Neither BDB nor CBS I/O modules currently call the Tcl or Python callback procedures except in table-based processing mode.
The command returns the number of updated records.
molfile rewrite $fh current E_NAME "Black tar, grade A"
molfile rewrite $fh all E_XLOGP2
molfile rewrite $fh [list $mytable records] [list E_IDENT E_REGID]
The first command changes the property field E_NAME in the current record to the specified value. The second variant recomputes all E_XLOGP2 values in the file from the stored structure data - for example after updating the computation function of that property, or having added it as a new field to the file. The final version changes the fields E_IDENT and E_REGID for the records stored in table column records, replacing them with the data found in the table columns of the same name.
A complication in the use of this command is that database-type files like the
Cactvs
CBS
and
BDB
formats store property definitions themselves. After opening the file, a newly set up property definition, which may for example possess an upgraded computation function, can have been replaced by the old definition from the file. In that case, the new property definition must be explicitly re-read to gain the upper hand again, for example with a
prop read
command.
molfile scan filehandle|remotehandle expression/queryhandle ?mode? ?parameterdict?
f.scan(query=,?mode=?,?parameters=?)
Molfile.Scan(filename,query=,?mode=?,?parameters=?)
Execute a query on the file and return results. The structure file is scanned, by default starting from its current read position, and results are gathered until either the end of the file has been reached (or the scan wrapped once around the file, if the wraparound file flag has been set) or a scan condition caused the stopping of the scan procedure. If the scan finished without reaching the end of the file, it can be resumed with another
molfile scan
command at a later time.
The file scan works in principle on any file, but with very different efficiency. Files managed by file format I/O modules which support direct field access, and can supply structure and reaction data in binary form, can be queried much faster (often by a factor of 1000 or more) than, for example, a plain SD file. In the latter format, every record needs to be fully parsed, the structure compared against the query expression, and most of the structure data is discarded immediately after the record has been checked. Files in formats which support various types of indexing for numerical values, bit-screen filtering for super- and substructure searches, hash codes for full-structure matching and other means of acceleration can be effectively queried with typical expressions in a few seconds, even when they contain millions of compounds.
The two basic built-in Cactvs formats for effective searching are CBS (static files, good performance on CD-ROM and other linear media) and BDB (efficiently updateable, and with more advanced indexing than CBS). In contrast, the systematic reading of a million-record SD file takes a few hours. Nevertheless, the feature of universal query support is very useful for working with typical data sets of a few thousand records. These do not need to be converted from their original formats to a query file for a quick exploratory data scan.
The toolkit currently supports two syntactically unrelated classes of query expressions: Native Cactvs expressions, which are described below, and Bruns/Watson structure queries as described in J. Med. Chem. 2012, 55, 9763-9772. The exact syntax supported is that of the internal Lilly suite in October 2014, which is significantly extended from the description in the paper, but also discards some outdated syntactic elements briefly mentioned in the paper.
set demerits [molfile scan $fh [read_file 9_aminoacridine.qry] {record demerit}]
This expression returns a nested list of records which match the query, and their merit/demerit score computed by that rule. Note that records which do not match the expression are omitted; they do not report a zero demerit in the result. Internally, Bruns/Watson queries are mapped to the standard toolkit query expression data structure. Many of the queries in the standard Lilly rule set can be expressed equivalently as a native query. However, at this time there are a few specific Lilly query features which cannot be expressed in native toolkit syntax.
If a query expression cannot be parsed as Bruns/Watson code, an attempt is made to interpret it as a native Cactvs expression, and all error messages relate to that interpretation attempt. The following paragraphs all apply exclusively to the native toolkit expression style.
The expression argument is a tree of individual query statements. It is formatted as a nested Tcl list. The allowed depth of branching, as well as the allowed number of leaf nodes, is unlimited. The following branch operations are supported in this tree:
If more than one branch is specified, the query expression branches (first, third, etc. argument) are linked by an identifier which determines how these branches interact under the umbrella of the bind node. The link argument is itself a list. Its first element is the link type identifier (currently one of independent, singlebond or doublebond). Except in the case of the first mode, the next element is the index (starting with 0) of the query branch in the bind node. It must refer to an existing branch index, i.e. forward declarations are not possible. For the determination of the branch index only the query branches count. The interspersed link arguments do not generate query branches.
If the mode is not independent , the allowed atoms or other minor objects which are tested in the additional branches depend on the current minor object in the referred branch. In modes singlebond and doublebond , these can only be atoms linked via the specified bond type to the referrer object, not the full atom set of the tested ensemble. In case of linked query branches, these are recursively checked. If a minor object in the leading branch matches, but fails to match in a dependent linked branch, more allowed minor object combinations are tested until they are exhausted or a combination of suitable minor objects is found which matches all branches. In any case, a minor object is only utilized once per bind node, so that for example a chain of three singlebond connected query branches needs to match three different atoms - the third branch cannot go back on the bond between the atoms selected for the first and second branch matches.
set q {
bind atom {and {A_ELEMENT in {7 8 16}} {A_NEIGHBORS = 2} {A_RING_COUNT = 0}}
{singlebond 0}
{and {A_ELEMENT = 6} {A_UNSATURATION = 0} {A_RING_COUNT = 0}}
{singlebond 1}
{and {A_ELEMENT in {7 8 16}} {A_NEIGHBORS = 2} {A_RING_COUNT = 0}}
}
This query tests for a fragment of three atoms, which are connected by single bonds and where the individual atoms are each subject to a check on a different set of atomic attribute conditions. The same query could also be realized as a SMARTS pattern. The advantage of this notation is that arbitrary properties can be used as attributes, and that an extended operator set and the full set of comparison mode flags are available. The disadvantage is a less readable pattern representation, and that no substructure query accelerator techniques such as bitvector screening are automatically employed.
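For illustration, a rough SMARTS-based counterpart of the same fragment could look like the following sketch. The pattern is an approximation only, since A_NEIGHBORS and A_UNSATURATION do not map one-to-one onto SMARTS primitives, and the reclist result mode is simply the record-list mode used in later examples of this section:
# approximate SMARTS analogue of the three-atom query above
set ss [ens create {[#7,#8,#16;X2;R0][CX4;R0][#7,#8,#16;X2;R0]} smarts]
molfile scan $fh [list structure >= $ss] reclist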
range {0-1} [list structure >= $ss1] [list structure >= $ss2] \
[list structure >= $ss3]
This expression requires that zero or one of the three test substructures match.
Here are a few simple expression patterns:
molfile scan $fh $leafexpression1
molfile scan $fh [list "and" $l1 $l2]
molfile scan $fh [list "or" $l1 [list "and" $l2 $l3 $l4]]
molfile scan $fh [list "orcontinue" [list not $l1] [list "xor" $l2 $l3]]
molfile scan $fh [list bind mol [list and $l1 $l2]]
All branch nodes need to end in leaf expression nodes. An empty query expression is valid and matches every input record. Also, it is legal and actually a common case to have an expression which is just a single leaf node expression. The order of the branches does not matter. An automatically invoked optimizer sorts the branches, and simplifies them, in order to achieve maximum performance.
These are the supported classes of leaf node expressions:
The various leaf expression classes have different syntax schemes, which are explained in the next paragraphs.
The record and vrecord expression classes are always written with three list elements: The expression class name, the operator, and the value or value list. The operators can be from the standard six numerical types, the range operator (<->), and the in or notin set operators. Numerical comparisons require a single comparison value, the range operator a pair of values, and the set operators a list. Examples:
“record <= 100”
“vrecord <-> {1 1000}”
“record in {1 7 19 230}”
The filename expression class is even simpler. It always consists of three elements: The expression class name, the operator (which can only be = or !=), and the file name. The actual file comparison operation uses device and inode identifiers on Linux/Unix platforms if the file is accessible, so the exact spelling of any path components does not matter. Example:
“filename = part1.sdf”
The isnull and notnull expression classes are written with two elements. The first element is the expression class name, and the second a property name. The property name may be qualified with an ensemble class modifier. If the modifier is not specified, the query applies to the main database structure. Otherwise, the property of the specified ensemble class is addressed. Examples:
“isnull E_NAME”
“notnull product:E_ASSAY_RESULT”
The random or subset node expression classes (these names are aliases) are written with two elements. The first element is the expression class name and the second a floating point value between zero and one. When this node is encountered for evaluation, a random number between zero and one is generated. If it is less than or equal to the specified value, the node is considered to match. Example:
“subset 0.6”
This expression will match 60% of the time and let the query proceed for further evaluation or result output.
The property query expression class is a little bit more complex. It has a variable number of elements, between three and eight. The general syntax scheme is
property {operator ?modifiers?..} value ?threshold? ?multimode? ?filter? ?c1? ?c2?
The first three elements are always the property name, which can be qualified with an ensemble class, the comparison operator, and one or more values. The number of required values is dependent on the operator. The comparison operator can be a nested list. It needs to contain as a list element the basic comparison operator (numerical, range or in/notin set operators) and may additionally contain modifier words, which are translated into flags potentially influencing the datatype-specific comparison functions. It depends on the data type of the property whether any flag word has an effect.
If the object flag word is supplied as part of the operator list, the value part of the query is parsed as a chemistry object handle, more specifically an ensemble handle, a decodable string representation of an ensemble, a reaction handle, or a decodable string representation of a reaction. The ensemble variants are accepted if the query property is attached to an ensemble or an ensemble minor object, and the reaction variants can be used if the property is reaction-related. The value of the query is then automatically extracted, even computed if needed, from the object. Properties with fields can be entered with the basic name, or any qualified field name. In addition, the property name may be prefixed by a structure class designator (see paragraph on structure queries). By default a property is assumed to be data of the main structure of the file record, or the main reaction. Examples:
“E_NAME = methane”
“solvent:E_NAME {in ignorecase} [list benzene toluene ethylbenzene]”
“E_IRSPECTRUM(source) {= shell nocase} *bruker*”
“E_WEIGHT {<= object} $ehtest”
“E_CAS {= ignoredashes ignorecase} 88337-96-6”
These are the comparison flag words which are recognized:
If the operator is the in or notin word, the value part is interpreted as a list. The value, or value list item, must be parseable according to the property data definition. Enumerated values and similar encodings may be used if properly defined in the property descriptor record.
If the comparison function computes a score (for example, the Tversky or Tanimoto variants), the next optional argument is a threshold value which needs to be exceeded to register as hit. If the threshold parameter is not specified, or given as a negative value, any score passes. Example:
“E_SCREEN {>= tanimoto object} $eh 95”
The next two optional arguments concern the case when there is more than one file data value to compare against the expression value. This generally happens when the tested property is not a major object property, but a minor object property, such as an atom or molecule property. In that case, the database record often contains multiple values, because there is more than one atom, or more than one molecule in the structure in the record. The first argument is the general match criterion. It can be set to one , all , none , or both . The default is one . Mode one means that it is sufficient if one of the record values matches. Mode all requires all to match, mode none requires that none matches, and mode both requires that there are both matches and mismatches.
The next optional parameter is a filter which can be used to restrict the values tested. If it is not present, or an empty string, no filter is applied. Example:
“A_ELEMENT = 6 {} all ringatom”
The above expression checks whether all ring atoms in the structure are carbon. Any record with a hetero ring atom fails the test.
The final two optional arguments are integer constants which may be used by the comparison operation. If they are not specified, both are implicitly passed as zero. If the first is specified, but not the second, the second is set to 100 minus the first value. Almost all comparison operations on the various data types ignore these.
One comparison mode which does make use of them is the Tversky bit vector similarity score. Here c1 and c2 are the weights of the bits in the first and second compared value. For scoring, both parameters are divided by one hundred and the floating point results are used as weight multipliers. Example:
“E_SCREEN {>= tversky object} $eh 90 {} {} 30 70”
The above expression computes a Tversky score on the standard structure search screen E_SCREEN with 30% weight for the database structure features and 70% for the query structure features (i.e. imbalanced towards a substructure rating), and reports the record if the score is 90% or higher.
Starting with version 3.358 of the toolkit, property expressions where the data type of the query property is structure or reaction are no longer parsed as standard property expression, but as structure or reaction query expressions, respectively. Example:
"V_ONTOLOGY_TERM(substructure) {>= swap stereo isotope charge} $eh"
Since the data type of the field of V_ONTOLOGY_TERM is structure, the syntax rules of normal property expressions no longer apply. Instead, the syntax for structure expressions explained below is substituted.
Structure expressions are used to invoke structure comparison operations, such as sub- and superstructure search. The expression is a list, with three to eight elements. A structure expression starts with the structure identifier, followed by the operator, which, as in property queries, may be written as a list with auxiliary modifier words, and as third mandatory argument the comparison structure source.
The structure identifier is the name of a structure class. Usually it is present as part of the record in the queried file, but some structure classes can be computed from the main structure if necessary. If a structure class can neither be found in a file record, nor computed, the node will not match. The following structure classes are supported:
At minimum, the operator section (the second, mandatory argument) contains a standard numerical operator symbol. Additionally, modifier words may be present as additional list elements. The following operators are supported.
match ss
command and the
count
modifier below). The optional fourth
argument can be used to set a range condition for this mode. If only a single number is supplied, the match count for a successful node match must be exactly this number. If a list of two numbers is used, these define a range of acceptable match counts. If no explicit range is set, its implied value is one to 65535. It is possible to use a lower bound of zero which lets structure mismatches pass the query condition. This can be useful when match-dependent data is retrieved, for example the
matchcounts
pseudo property (see below).
The default substructure match mode has the
bondorder
,
useatomtree
and
usebondtree
flags set (see
match ss
command). The initial flag set can be modified with modifier words linked to the operator. As far as it makes sense, the modifier words also change the operation of derived query modes, such as full-structure matching via hash codes.
These are the modifier words which can be used in structure expressions:
match ss
command. The normal substructure match mode is equivalent to the first mode in the
match ss
command, yielding only counts zero or one.
match ss
command). Has an effect only for substructure matches. The default substructure match mode is
first
, except if the match operator is
range
for counted pattern matches. In that case, it is
distinctinneratoms
.
match ss
command). Has an effect only for substructure matches. The default substructure match mode is
first
, except if the match operator is
range
for counted pattern matches. In that case, it is
distinctinneratoms
.
match ss
command). Has an effect only for substructure matches. The default substructure match mode is
first
, except if the match operator is
range
for counted pattern matches. In that case, it is
distinctinneratoms
.
molfile scan
modes
ens
,
enslist
,
reaction
or
reactionlist
), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles with the highlight flags in properties
B_FLAGS
and
A_FLAGS
. In case multiple matches occur, the highlight set is a union of all processed matching substructure mappings. This flag is also automatically set if the property retrieval set in the
molfile scan
command includes related pseudo properties, such as
matchatoms
or
matchbonds
.
molfile scan
modes
ens, enslist, reaction or reactionlist
), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles by attached properties
A_SSMATCH
and
B_SSMATCH
. These are set to the labels of the matching substructure atoms or bonds. Unmatched structure ensemble parts have match property values of zero. In contrast to the
sethighlight
flag, this option attaches a new match property instance for any successful and processed match. Returned ensembles may therefore possess series of property instances like
A_SSMATCH
,
A_SSMATCH/2
... and so on. Many of these global flags can be overridden, or activated on a local level, for individual atoms or bonds, in the A_QUERY and B_QUERY properties. For example, A_QUERY has fields for flags which can request the matching of stereo or charges for specific atoms, or to allow missing stereochemistry at a specific center. These per-atom or per-bond requests override global query flag settings.
The third mandatory expression list element is the structure source. It can be one of
ens create
command. The string is decoded into a transient ensemble, which is automatically discarded when it is no longer needed. The exact decoding specifications depend on the operator. For full-structure search, a fully specified structure is created, while for substructure-type queries implicit hydrogens are not attached, and the full range of query specifications of the encoding format is allowed.
molfile set
), otherwise without any conversion flags. However, since the hydrogen addition flag is the only file attribute which may be temporarily overridden, other molfile object attributes may be set before the file is used in the query expression. Of course, using a file with a huge number of records in this fashion may cause problems. In case the file does not contain any records behind the read pointer at the time the command is parsed, an error is raised.
Query specifications found in structure sources are understood in a variety of formats. Daylight and MDL formats are decoded and translated into an internal representation in an almost completely compatible fashion. That includes Recursive SMARTS, ISIS 3D queries, MDL stereo groups and MDL reaction queries. A significant range of Sybyl SLN and CambridgeSoft ChemFinder query expressions are also understood, as well as features found in the CSD ConQuest software. Finally, in Cactvs there is no fundamental difference between a query fragment and a normal structure object. Query structures are just structures with additional information stored in properties A_QUERY, B_QUERY and possibly B_REACTION_CENTER. For basic matching, any structure object will do, even if it does not possess these query attribute properties. However, an eye should be kept on the hydrogen status of query fragments. If no specific flags are set, substructure matches attempt to match hydrogen atoms just like any other atom. Example:
set ehss [ens create C]
set ehss [ens create C smarts]
The upper substructure ensemble does not, in the absence of hydrogen ignore flags, match any structure ensemble except those which contain a full methane (one C plus four H) molecule as fragment, because that is what the substructure represents. The second code line decodes the substructure in full
SMARTS
mode. Not only can the full range of SMARTS expressions now be parsed (though none are present in this example), but the structure is also created without implicit hydrogens. The first substructure could still be used in a
molfile scan
command as a simple carbon match test if the
nosubstructureh
modifier flag were supplied.
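As a sketch, the hydrogen-laden first fragment could still serve as a plain carbon test when that modifier is supplied (the reclist result mode is used as in the other examples of this section):
# the nosubstructureh modifier makes the CH4 fragment act as a simple C test
set ehss1 [ens create C]
molfile scan $fh [list structure {>= nosubstructureh} $ehss1] reclist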
In order to read query structures from a file, the following generic open statement is the standard approach:
molfile open $file r hydrogens asis readflags noimplicith
Simple query formats, such as
MDL ISIS
query
Molfiles
, are read into a flat set of attributes. More complex formats, such as
SMARTS
, may require the use of a tree of expressions on individual atoms and bonds, similar to the overall query tree with branch and leaf nodes described here for the
molfile scan
command. These complex formats are nevertheless also translated, to the degree possible, to the flat model. For example, a
SMARTS
expression which only uses simple atom lists or atom and bond query attributes, all connected just by and, can be fully represented in this way. This also means that format translation into other query file formats is possible for these simple expressions. The use of the full query trees in matching can in some cases be a performance issue. The
noquerytree
flag is available to restrict the match to those parts of the full query which can be expressed in the flat model.
The fourth and optional expression list element in the query expression is used only for a few match modes. If it is not set, the default value is minus one.
“structure ~=> $eh 90”
“product <-> C(=O)\[OH\] {2 3}”
The first sample expression is a standard Tanimoto similarity query, with a 90% threshold. The second query matches product structures with two to three carboxyl groups.
Optional expression list elements five and six correspond to the c1 and c2 parameters in property query expressions. These are currently only used in Tversky similarity queries:
“structure %>= $eh 90 30 70”
This is an expression for a skewed Tversky similarity (70% query structure, 30% file structure weight) with a 90% reporting threshold.
The seventh optional structure expression list element can be used to specify exclusion substructures. It only applies to substructure matching. In this mode, the parameter encodes a list of substructures which are matched first on the test structure, before the actual substructure match. All atoms which are matched by the exclusion substructures are blocked from consideration in the main match operation. Every exclusion list element can either be an ensemble handle, a list consisting of an ensemble handle and a molecule label, or a structure line notation string (usually a SMARTS string) which is decoded in default pattern mode. Exclusion substructures are for example useful to hide structure parts which are already matched by a different pattern, without actually removing structure atoms. Exclusion substructures are always matched exhaustively, so a single exclusion fragment can block multiple matched structure locations.
set ss [ens create {C=C=C.C=C} smarts]
set q "and {structure {<-> exactaro distinctfgatoms} {$ss 1} 1} {not {structure {>= exactaro} {$ss 2} {} {} {} {{$ss 1}}}}"
echo [dataset scan [list C=C C=C=C C=C=CCCCC=C] $q reclist] (2)
The example scan only matches the second test structure. It first tests that the first (allene) fragment is matched exactly once (under application of the distinctfgatoms duplicate filter, so two different possible positionings of the substructure on the structure count only once) by the test structure, and then checks that the second (ethylene) fragment does not match the same structure. Without an exclusion substructure on the second substructure match node, the test would always fail because the ethylene fragment also matches part of the larger allene fragment. In order to prevent this, the negative ethylene query also uses the allene fragment as exclusion fragment. In that case, all carbons in the second test structure are covered, and the query succeeds. In the third test structure, the allene exclusion fragment also covers part of the test structure, but the true simple ethylene part remains unblocked and matches the negative structure query, which results in overall rejection.
An optional eighth argument can be used to fine-tune how exclusion matches are processed. It can be a bitset combination of the enumerated values burnatoms , burncarbon , burnterminals , burnringsystems and burnaroringsystems . The default burn mode is burnatoms . In any case, the exclusion processing only applies node-locally - every node is independent. Exclusion marking does not apply to the matching of other exclusion fragments in the node in case more than one fragment is tested, so these may overlap in their matched structure parts.
set q {structure >= [c][OH] {} {} {} {{[n]}} burnaroringsystems}
echo [dataset scan {c1ccccc1O c1cccnc1O} $q reclist]
The above query for a phenolic substructure matches only the phenol (first) molecule. The hydroxypyridine (second) molecule is excluded because, with the nonstandard burn mode, the exclusion fragment (aromatic nitrogen) blocks not just the nitrogen but the whole aromatic ring it is part of, so the aromatic carbon in the main test structure can no longer match. If a test structure had both non-annulated phenol and hydroxypyridine moieties, the match would again succeed because only the aromatic carbons of the hydroxypyridine would have been excluded.
If exclusion fragments are used, the test structures must be fully expanded, i.e. a direct accelerated match on Minimols is not possible.
If the file format supports it, bitvector screening is automatically applied to reduce the number of records for which structures need to be loaded and sent to graph-based atom-by-atom substructure matching. The default structure match screening property is E_SCREEN. The standard versions of E_SCREEN implement three predefined fragment sets. The higher sets are identical to the lower ones in the leading bits. Sets zero to two yield bit vectors of increasing length and selectivity, but also increasing storage requirements. They can be requested by setting
prop setparam E_SCREEN extended 0/1/2
The bit set read from the query file must correspond to the parameter setting for E_SCREEN in the current Tcl interpreter, if the screen bits are automatically computed on the query structure. The CBS and BDB file formats, which are optimized for structure query operations, contain screen bit version information in the file header and automatically configure the property parameter setting when the file is opened. For other file formats with screen bits this needs to be done explicitly in the application script. It is also possible to change the structure bit-screen property associated with a file by setting the appropriate molfile handle attribute, so it is easily possible to use custom screen bit sets instead of the default property.
Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is structure are automatically parsed as structure expressions.
This query expression takes the same arguments as a structure expression. It is internally expanded into four alternative queries, linked by a pass-dependent switch control node. The four alternative queries are a full-structure query (equivalent to operator = in a structure query), a substructure query (operator >=), and two Tanimoto similarity queries with thresholds of 95% and 90% (operator ~>=).
When such a query expression is a component of a query expression tree, the query is first run with the full-structure query. If that query yields fewer results than the pass match limit (by default one, i.e. the query did not match anything; this can be configured via the molfile passlimit attribute), the input data source is repositioned to the original start record and then the substructure query is run, and if that run also does not yield sufficient hits, the two similarity queries are tried one after another.
Running the second and later alternatives is only possible if the data source can be repositioned to the original start position of the first pass. If that fails, the query is silently terminated early. The pass match limit comparison which triggers the possible re-execution of the query uses the global hit count of the query, not the number of hits returned by the smartquery branch. If other parts of a complex query produce sufficient hits, the query is not re-run even if a smartquery branch did not return any hits.
Hits returned in different passes can be distinguished by including the pass pseudo-property in the retrieval data.
By convention, smartsearch expressions are written with an = operator. The actual operator in a smartsearch expression is ignored, but modifiers are not. So specifying options like the use of stereochemistry or isotopes is supported and useful.
It is possible to have multiple smart search expressions in a query. The query pass index for these is incremented in parallel, not independently.
The smart search feature was inspired by a similar functionality in the Accelrys Isentris system.
“smartsearch = c1ncccc1”
“smartsearch {= stereo} \“L-lysine\””
Formula expressions are used to match file structures by element composition. Conceptually, this is a special syntax for a complex property match on the file structure properties E_ELEMENT_COUNT and M_ELEMENT_COUNT. A formula search expression is always a list of three elements. The first element is always formula, the second element the comparison operator, and the third element the formula specification. The following operators are supported:
For formula queries, there are no modifier words for the operator.
The syntax of the formula is built on the lowest level by element or pseudo-element symbols, which may be grouped into sum or difference expressions and may possess a prefixed count multiplier. The symbol or symbol group can then be suffixed by a simple count, or an open or closed count range. If no count range is specified, the default count is one. In case an element is entered more than once, all counts for that element are added. Finally, the expression may be grouped by period characters into sub-expressions to be applied to different molecular fragments in the tested structures.
Besides normal elements, the following pseudo-elements, which are compatible to the set of the CSD ConQuest software, are recognized:
Element items can be grouped with round brackets into sums or differences. However, this is no full arithmetic expression parser. Element symbols can only be used as stand-alone syntactic elements, bracketed all-sum expressions, or bracketed all-difference expressions.
An element or an arithmetic group can have an appended count. This count can be:
“formula = C6H6”
“formula = C5-6H6-”
“formula >= (Cl+Br)2”
“formula > \[4M\]>=3” or {formula > [4M]>=3}
“formula = (2C-H)-6”
“formula = CH3COOH”
“formula = \[Het\]>1” or {formula = [Het]>1}
“formula = N1-{0.25C}”
The first expression is a simple test which matches any ensemble with a composition of six carbon and six hydrogen atoms. The second looks for compounds with five to six carbons and six or more hydrogens, but no other elements. The third example finds compounds where the sum of chlorine and bromine atoms is two. Other elements may be present but are not required, so this expression matches Cl2, Br2 and ClBr as well as dichlorobenzene. The fourth expression finds structures with three or more metal atoms. The fifth expression finds compounds where twice the carbon atom count minus the count of hydrogen atoms has a value up to six. Element sum and difference multiplier factors may be floating point numbers, but the ultimate comparison step is performed with the rounded sum or difference by integer comparison. The next line finds compounds with a formula of C2H4O2. The counts for elements repeated in the formula string are summed up. The next example matches any compound with one or more hetero atoms. The square brackets in the first writing style are properly escaped to survive standard
Tcl
command parsing.
The final example shows how to use computed comparison values, which are specified within curly braces. This expression matches compounds which contain at least one nitrogen, but the number of nitrogens cannot be more than a quarter of the carbon count. For computed comparison values, only natural elements and the
[Hev]
,
[Het]
and
[Any]
pseudo elements are currently recognized. At this time, only a single element, optionally prefixed by a floating-point multiplier and adjusted by a positive or negative floating-point offset, is supported in the specification of a computed comparison value.
Vertical bars can be used to define separate formula match sections. These are applied to individual molecules in the tested structures, not the full ensemble. If a single bar is specified at the beginning or end of the expression, it signifies a single expression section to be applied to a molecule. When a test for formula sections is applied, all permutations of possible matches between the molecules in an ensemble and the formula expression sections are tried. It is neither required that there is any specific order of the molecules in the ensemble, nor a specific order in the formula expression sections, nor is there a need for a match between the molecule and formula section count. However, every expression section in a formula needs to match a different molecule in the tested ensemble for a final match.
“formula = C6H6|C7H8”
“formula = |H2O”
The first expression looks for ensembles which contain one molecule with the formula C6H6, and another with formula C7H8. The second expression matches ensembles with one or more water molecules. In both cases, molecules/fragment with different composition may be present in the record. In order to test for two or more formulae with the additional conditions that there are no other molecules/fragments, use two formula expression nodes connected with an and branch node, as in
and “formula = C6H6|C7H8” “formula = C6H6C7H8”
Element symbols which stand for specific isotopes, such as D for deuterium, are currently not processed. D and T are read as a simple alias for hydrogen, disregarding the isotope label.
It is possible to use an ensemble handle instead of a formula expression. In that case, the elemental formula of that ensemble is used in the query, as computed by property E_FORMULA .
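As a hedged sketch of this usage (the file handle $fh and the file contents are assumed), the following lines collect the records whose elemental formula matches that of a benzene ensemble created on the fly:
set eh [ens create c1ccccc1]
set hits [molfile scan $fh "formula = $eh" recordlist]
ens delete $eh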
Reaction expressions are the construct used for reaction substructure searches, for example when looking for certain bond transformations in a database of reactions. Obviously, the scanned file needs to contain reaction information for this to succeed.
An important aspect of reaction searches is the set of atom mapping numbers, which link atoms in the reagent ensemble to the product ensemble, and likewise in the transformation scheme which needs to be matched. The central property for this is A_MAPPING . If this property is present, it is used to restrict matches to those reactions which embody a certain transformation, and are not a simple pair of ensembles which match substructures of the left and right part of the query transformation somewhere in their connectivity. Nevertheless, it is still possible to query reactions without a mapping scheme. That is identical to a pair of substructure searches. Also, individual parts of a reaction (the reagent and product ensembles, but potentially also the catalyst or solvent entries) can be used as targets for single-ensemble sub/super/full-structure searches via structure query expressions (see above).
A reaction expression is a list of three to six elements. The first element is always reaction , the second element the operator, and the third element the reaction source. The following operators can be used:
Similar to structure query expressions, the operator can be modified by adding flag words as additional list elements to the operator list element. The following flags are recognized:
sethighlight
If this flag is set and one of the object-returning
molfile scan
modes is used (
ens, enslist, reaction or reactionlist
), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles with the highlight flags in properties
B_FLAGS
and
A_FLAGS
. In case multiple matches occur, the highlight set is the union of all processed matching substructure mappings. This flag is also automatically set if the data retrieval set in the
molfile scan
command includes related pseudo properties, such as
matchatoms
or
matchbonds
.
If this flag is set and one of the object-returning
molfile scan
modes is used (
ens, enslist, reaction or reactionlist
), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles by attached properties
A_SSMATCH
and
B_SSMATCH
. These are set to the labels of the matching substructure atoms or bonds. Unmatched structure ensemble parts have match property values of zero. In contrast to the
sethighlight
flag, this option attaches a new match property instance for every successful and processed match. Returned ensembles may therefore possess series of properties like
A_SSMATCH
,
A_SSMATCH/2
... and so on.
The third mandatory parameter is the query reaction source. It can be anything accepted by a
reaction create
statement, for example a Reaction
SMILES
,
SMIRKS
,
RInChI
or a
Cactvs
serialized reaction object string. This query reaction is only temporarily instantiated and automatically deleted when the command finishes. Reading one or more query reactions from a file handle directly in the query statement, as it is possible for structure queries, is currently not supported. Also, the tautomer match mode is not available for reaction matching because it interferes with atom map processing.
The optional query list items four to six are identical to those for structure query expressions. They represent a reporting threshold value and the c1 and c2 comparison algorithm parameters. Please refer to the paragraph on structure match expressions for more details.
The general approach to reaction sub- and superstructure matching is as follows:
Besides the ensemble-level query attribute properties A_QUERY and B_QUERY , reaction matches also make use of B_REACTION_CENTER (for constraints on the type of transformation a bond undergoes) and E_REACTION_ROLE (for the identification of reagent and product ensembles in the reaction object).
Reaction similarity queries use the reaction screen set (by default, property X_SCREEN ) instead of the structure screen that is used for structure similarity. This operation returns a single score. There is no scoring of the reagent or product ensembles.
Full-structure reaction matches are performed via hash code checks on both the reagent and product sides. Atom mapping information is not used for this query operation. The suitable hash code is automatically selected depending on the operator modifiers (stereo, isotopes).
Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is reaction are automatically parsed as reaction expressions.
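A minimal sketch of a reaction substructure query, assuming that $fh refers to a file of mapped reactions, that the >= operator denotes substructure matching as it does for structure expressions, and that the SMIRKS string below is acceptable reaction create input:
set hits [molfile scan $fh {reaction >= {[C:1]=[O:2]>>[C:1]-[O:2]}} recordlist]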
The return value of the
molfile scan
command depends on the query mode. The default mode is
enslist
for the
molfile scan
command, but may be different when scanning other objects, such as datasets, networks or tables. The following modes are supported for file queries via the
molfile scan
command. Scan modes for other objects may include specific additional modes, while disallowing others.
In this mode, the
molfile scan
command returns a list of the names of the created arrays. For each name, a global
Tcl
array variable or
Python
dictionary is created, and for each match, a
Tcl
array element with an element name equal to the value of the first item specification index and an element value equal to the value of the third item specification is created (or a dictionary entry with key and value for
Python
). For example, the scan mode specification
{array {E_NAME name2rec} {record rec2name E_NAME}}
results in the creation of two global Tcl arrays or Python dictionaries in the current interpreter, called name2rec and rec2name . The first has array elements (for Python , dictionary keys) where the element name is the name of the matching structure (property E_NAME ), and the value the file record number (because it is the default). The second array has elements where the record number is the array element name, and the corresponding value the structure name. The return value of the scan statement is the list (tuple for Python ) “name2rec rec2name” , containing the names of the two variables created.
If array or dictionary elements for a specific key already exist, the new value is appended as a list or tuple object. The result registration procedure does not overwrite the existing content. So, for example in above case, if there are multiple records with the same structure name, the array element indexed by name would contain a list of records, not just a single record. Since the global arrays or dictionaries are persistent, data is also appended over multiple scan statements. If this is not desired, a statement like
unset -nocomplain $arrayname
should be executed before the scan is started. It is legal to use the same array or dictionary name for the registration of multiple properties. In this case, each match appends a new list element for every reported property, though these lists will not be nested.
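A short end-to-end sketch of the array registration mode (file handle $fh assumed), clearing any stale result variables before the scan is started:
unset -nocomplain name2rec rec2name
molfile scan $fh {structure >= c1ccccc1} {array {E_NAME name2rec} {record rec2name E_NAME}}
After the scan, the name2rec array maps structure names to record numbers, and rec2name holds the reverse mapping.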
The individual properties may also each be specified as a list consisting of the property name, and an arbitrary string. In that case, the string is used as the column name. By default, the column names are the same as the name of the property they store. Example:
{table {E_NAME name} {E_CAS casno} record}
sets up a table with three columns called name , casno and record . The first two columns contain property data from the matching file records, the last one the record in the file which matched.
Instead of the keyword table , an existing table handle or reference may also be used. In that case, any existing matching table columns are automatically re-used to store result data. Additionally specified properties are added as new columns to the right of the previously existing columns. New table rows generated by matches are appended to the bottom of the table.
The row names of added table rows are set to Record%u , with the file record number as variable part.
The scan command mode returns the table handle or reference as result. The associated row objects are stored in the general namespace, and are not a member of any dataset. They are visible like any other object of their type, for example via
ens list
or
reaction list
commands. Commands
table ens
and
table reaction
are useful to get the object subset associated with this table. Note that these table-associated objects are not automatically deleted when the table is destroyed - only their association is severed. If they are no longer needed, they should be destroyed explicitly.
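A hedged clean-up sketch for the table mode, assuming the usual per-object deletion commands: the hit table and its row-associated ensembles are disposed of explicitly once they are no longer needed.
set th [molfile scan $fh {E_WEIGHT < 300} {table E_SMILES E_WEIGHT record}]
foreach eh [table ens $th] {
ens delete $eh
}
table delete $th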
If requested property data is not present on the object representing a hit, an attempt is made to compute it. If this fails, the retrieval modes
table
and
tablecollection
generate
NULL
cells, and property retrieval as list data produces empty list elements, but no errors. For minor object properties, the property list retrieval modes produce lists of all object property values instead of a single value. In
table
-based mode, only the data for the first minor object associated with the major object is retrieved, which makes this mode less suitable for direct minor object property retrieval.
The following pseudo properties can be retrieved in property/propertylist scan modes or as table values, in addition to standard property data:
match ss
command).
The optional visitation order parameter, one of the optional query parameters listed in the next section, is primarily intended to be used for convenient execution of queries on a subset of records which were selected by a previous query on the same file. It can either be a numerical record list, with the first file record indicated as record one, or one of the keywords sortup or sortdown , followed by a property name. If this parameter is not set, or set to an empty string, or the magic string all , records are visited from the current input position in simple sequential order. If the query parameter dictionary additionally contains a startposition value, this start position refers to the index (plus one) of the first element of the specified record set, not to the original underlying file.
In the record list variant of this argument, the specified (virtual) records in the file are visited in the list order, and all other file records are ignored. For optimum performance, the records should be sorted in ascending order, but this is not required; since the visitation order affects the order of the returned results, record lists sorted by some custom criterion can be useful. A suitable format for a record list is a saved result of
molfile scan
in the
recordlist
or
vrecordlist
scan modes. It is possible to use a sorted record list with a non-rewindable input file, but an unsorted list will fail in that case if the file input pointer needs to be positioned backwards.
The sort property option variant implies a visit of all file records, but in the order of the values of a property in that file, not the native record sequence in the file. Using this access method is not too much overhead for indexed file formats such as CBS or BDB with an index on the sort property , but a serious performance hit for standard text files. This method cannot be used with files which cannot be rewound and do not have the sort property data in some direct access field, since it requires a full pass through the file to gather the sort property data values before the actual query is processed.
molfile scan $fh "structure >= C1NCCC1" vrecordlist \ [dict create "order" [list 3 6 29 157]]
molfile scan $fh "structure ~>= $ehcmp 90" {table E_SMILES score} \ [dict create "order" {sortup E_WEIGHT}]
The final optional parameter is a keyword/value list of various additional attributes for fine-tuning the execution of the query. The following keywords are recognized:
scan
command no longer match regardless of the contents of additional records they are tested against. The hit count is increased whenever the branch returns a positive result, even if an overall positive match is not found because conditions in other branches are not met. The option can also be applied to logical nodes, such as and or or. In case of or nodes and in circumstances with similar optimization opportunities, the use of this option does not force the execution of lower branches if the match result of the node can already be determined by a partial testing of its branches, so the count may be less than expected.
molfile scan
command), the current number of record scans performed so far, the hit count and finally the size of the scanned object (file record count, dataset element count) as record or element count. If the object size is not known, minus one is passed. If a progress callback argument has been specified, it is passed as an additional and final parameter.
The init and final function calls are made only once each, and before respectively after any scan calls for the execution of this statement. The short form callback is an alias for this keyword. Setting the option to an empty string disables all progress callback function calls.
The arguments passed to this function are, in this order, the substructure object handle, the structure object handle, a nested list with label pairs of all matched substructure and structure atoms, and a nested list with label pairs of all matched substructure and structure bonds. In case of superstructure searches, the roles of substructure and structure are reversed, i.e. the substructure handle and the listed atoms and bonds refer to the current structure read from the scanned data source. The check function should either return 1 for a successful final check, or 0, which leads to a rejection of the match. It is also possible to raise an error, which terminates the query with an error, or exit with a break, which terminates the query without an error.
While the callback routine is free to perform any additional match analysis, it must neither delete the structure nor the substructure, nor change their connectivity (remove or add atoms and bonds), nor discard or invalidate any property data used in the matching process. The computation or setting of additional property data on the substructure or structure ensembles is allowed.
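A minimal sketch of such a check routine, assuming it has been registered via the corresponding keyword in the optional query attribute dictionary (not shown here). It accepts only matches which cover at least two bond pairs and leaves the passed objects untouched:
proc matchcheck {ssh eh atompairs bondpairs} {
# reject sparse matches, accept anything with two or more matched bond pairs
return [expr {[llength $bondpairs] >= 2}]
}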
EOF
. In that case, the file is automatically rewound. If a record visitation order list is used, the start position parameter indicates the record list index plus one to use as first file record to visit, not the file record proper.
#new
creates a new dataset. In that case, the command output is the handle or reference of the new dataset, overriding other output modes.
molfile scan $fh {structure = c1ccccc1} recordlist
molfile scan $fh {E_WEIGHT < 100} {propertylist E_SMILES E_NAME E_WEIGHT}
molfile scan $fh {notnull E_CAS} {table E_SMILES E_CAS}
molfile scan $fh {structure ~>= c1nnccc1 90} {score record}
molfile scan $fh "and {structure >= $ehss} {formula >= N3}" ens
Molfile object handles can be configured to listen on specific ports for remote scan requests. The syntax of a remote scan request is the same as for a normal file. The only exception is the handle argument. The command is executed asynchronously. Because of this, no direct results are returned; remote scans therefore typically use a mode which yields network-transferable objects (modes ens , enslist , reaction , reactionlist , table ) and specify a target dataset object on the local system.
On the local system, a typical set-up looks like this:
set dh [dataset create]
dataset set $dh port 10001
molfile scan $remotehost:10002 {structure >= c1ncccc1} \
{table record E_NAME E_CAS} {} {target $localhost:10001 startposition 1}
while {![dataset tables $dh {} count]} {
sleep 1
}
In the above code, we first create a recipient dataset object, and configure it to listen on port 10001 for incoming Cactvs objects - we are expecting a table object as result later. We then issue the query for execution on the remote host, and wait until the table object containing the results has arrived.
On the remote server, the set-up could look like this:
molfile open $dbfile r port 10002
vwait forever
Here the database file is opened, and a port for incoming requests opened. The
vwait
Tcl
statement does nothing by itself, but keeps the interpreter running while waiting for and processing events such as incoming scan commands. In this sample set-up, the remote server needs to be started first, because otherwise the connection to the remote file fails on the client.
Since execution of remote queries is asynchronous, the client could issue multiple query requests to different remote handles and then wait until results from all these requests have been collected, or a timeout or other error condition has been reached. The results could arrive in any order. The scan commands for a group of servers could, for example, specify different start positions and maximum scan values for distributed searching of a big file, or could gather results from different small files. Additionally, the use of multiple scan threads could be requested on the server by passing appropriate parameters in the control section of the command. Nevertheless, only a single remote scan command per Tcl script thread is executed on the server at any time. If multiple scans need to be executed in parallel on a single server, a collection of script threads needs to be created via the Thread package, and then every thread told to open its own port listener.
The mechanism for the reception of messages for remote scans on
molfile
handles which listen on ports is subtly different from the processing of commands sent to listening dataset objects. The execution of scans requires active collaboration of a
Tcl
interpreter. Commands are only read and processed when the interpreter is idle, for example while sitting in a
vwait
or
sleep
statement. In contrast, dataset object listeners do not rely on
Tcl
interpreters, and are implemented as independent threads. Remote dataset commands, such as
ens move
or
dataset pop
with a remote dataset handle, are therefore executed at any time when a mutex lock on the dataset object and other accessed objects can be secured.
molfile set filehandle ?property/attribute value?...
molfile set filehandle attribute_dictionary
f.set(property,value,...)
f.set({property:value,...})
f.property = value
f[property] = value
A standard data manipulation command. It is explained in more detail in the section on setting property data. The alternative short form with the single dictionary argument is functionally equivalent to using the expanded dictionary as separate property and value arguments.
molfile set $fhandle F_GAUSSIAN_JOB_PARAMS(link0) [list \ "%chk=144__303_2EVE_PDB_Opt8.chk" "%mem=128MB" "%nprocshared=2"]
The command can also be used to set a broad range of object attributes. The list of attributes is documented in the section on the
molfile get
command.
If an attribute is set for a multi-file virtual file, in most cases the attribute is set for all the files in the set. Some attributes apply to the virtual handle only, and their modification indirectly addresses only one physical file. An important example is the record attribute, which positions the record I/O pointer into the physical file which contains the requested record of the virtually concatenated file.
molfile set $fhandle record 2
The above command repositions the file read/write pointer to the second record.
This command supports a special attribute value syntax for manipulating bitset-type attributes (only attributes, not property values). If the first character of the argument is a minus character (-), the named bits in the set identified by the remainder of the argument are unset. If it is a plus (+), they are additionally set. With an equal sign (=), or no special lead character, the flag set replaces the old value. A leading caret character (^) toggles the selected bits.
molfile set $fhandle readflags +pedantic
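For illustration, the same readflags attribute manipulated with the other lead characters (a sketch re-using the pedantic flag from the example above):
molfile set $fhandle readflags -pedantic
molfile set $fhandle readflags ^pedantic
molfile set $fhandle readflags =pedantic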
molfile setparam filehandle property ?key value?...
molfile setparam filehandle property dictionary
f.setparam(property,?key,value?...)
f.setparam(property,dict)
Set or update a property computation parameter in the metadata parameter list of a valid property. This command is described in the section about retrieving property data. The current settings of the computation parameters in the property definition are not changed.
The return value is the updated property computation parameter dictionary.
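A minimal, hedged sketch; the property and parameter names below are hypothetical placeholders and need to be replaced by a property which actually defines computation parameters:
set params [molfile setparam $fhandle F_SOME_IMAGE resolution 300]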
molfile show filehandle propertylist ?filterset? ?parameterdict?
f.show(property=,?filters=?,?parameters=?)
Molfile.Show(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the
molfile get
command. The difference between
molfile get
and
molfile show
is that the latter does not attempt computation of property data, but raises an error if the data is not present and valid. For data already present,
molfile get
and
molfile show
are equivalent.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile skip filehandle ?recordcount?
f.skip(?records=?)
Skip records in a file opened for input. If the file pointer is at the beginning of a new record, this next record is the first skipped. If the file pointer is stuck in the middle of a record, for example because a
molfile read
command failed due to a file syntax error, the first record counted is the remainder of the current record. An attempt is made to re-synchronize to the beginning of the next record.
By default a single record is skipped. If the record count parameter is specified, more than one record can be skipped. Because of the partially read record re-synchronization feature, negative record counts are not allowed in this command. The
molfile backspace
and
molfile set record
commands can be used to go back in a file.
The command returns the number of the next record to be read. In case an attempt was made to position behind the end of a file, or a record re-synchronization failed, an error is reported.
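For example, skipping the next five records of an input file and capturing the number of the record that will be read next (a sketch):
set nextrec [molfile skip $fhandle 5]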
molfile sort fhandle {{property ?direction ?cmpflags ?cmpvalue???}...} ?outfile/handle?
f.sort(sortby=,?output=?)
Sort the records in the file according to the values of one or more properties or property fields contained in the file records, or computable on the objects read from the file. The output records are byte-for-byte identical images of the input records, not records reconstructed from read data objects.
The property sort set consists of a sequence of zero or more sort specification elements. Every specification element is parsed as a sublist, but only the first element therein is mandatory. This element is either a property name, a property field name, or one of the magic names
#record
or
record
(for the file record) or #
random
or
random
or rnd (for a random number assigned to that record). The optional sort direction element may be
up
/ascending or
down
/descending. The default sort direction is upwards. The third optional comparison flags parameter can be set to a combination of any of the values allowed with the
prop compare
command. The default is an empty flag set.
If a comparison value is supplied as fourth argument, the sort utilizes the comparison results of read file object property values against this value for ranking, not the direct comparison result between the read file object property values. This is for example useful when sorting according to a bitvector similarity value to an external structure.
The first property or magic name in the sort list has the highest priority. In addition to the specified properties, the original record number is implicitly added as tie breaker to yield a stable sort. This automatic value is always sorted upwards. If an empty property list is specified, the result is thus a simple file copy without record rearrangement. In order to randomize the record order in a file, use a single #random sort property.
The sort properties do not need to be already present in the file. If necessary, an attempt is made to compute these on the objects read from the file in the first pass. It is possible to sort on properties which are not of the object class read from the file, for example atom properties when ensembles are read, or ensemble properties when reactions are read. In that case, the record is output at the position determined by the lowest sort rank of the property of that object, for example the minimum or maximum value of all values of an atom property in an ensemble. Additional data instances of the property associated with a given record are ignored, so no record duplicates are output.
The optional output parameter can either be the handle or reference of an opened Tcl or Python channel, including standard output and standard error, or the name of a (preferably new) file, or a pipe construct. Output is appended to this output channel. If the parameter is omitted, the output is first written to a temporary file, the original file deleted and the temporary file renamed to the original file. In that case, the original file handle is automatically re-opened for reading on the new file. The input file handle must be positionable, because file records are accessed twice, once for reading the sort data and once for copying the records out. Sorting from standard input, pipes or other non-rewindable sources is therefore not supported, and neither is the sorting of files which are not simple record sequences. Sorting such files is currently only possible by using explicitly scripted record data buffering mechanisms.
On Windows, output to an open Tcl file handle or Python file reference is not supported, except for the standard output and error channels.
The return value of the command is the number of records written. The position of the sort file handle is set to the same location as before the command.
molfile sort $fh {{E_NAME up {dictionary nocase}}} dict.sdf
molfile sort myfile.sdf {{record down}}
set fhtcl [open "randomized.sdf" w]; molfile sort $fh {{random}} $fhtcl
molfile sort $fh {{A_ELEMENT down} {E_WEIGHT up}} "|gzip >heavy.sdf.gz"
The first example creates a new file dict.sdf which contains the remaining records in the file associated with the file handle sorted by the value of property E_NAME in case-insensitive dictionary order. The second example reverses the order of the records in the file, replacing the original file in the process. The third example randomizes the record sequence in the original file, outputting the records in a new file which was opened for writing as a normal Tcl text file. The final example outputs a compressed SD file, with structures sorted by the heaviest element in the ensembles, and using the molecular weight as tie breaker.
molfile sqldget filehandle propertylist ?filterset? ?parameterdict?
f.sqldget(property=,?filters=?,?parameters=?)
Molfile.Sqldget(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the
molfile get
command. The differences between
molfile get
and
molfile sqldget
are that the latter does not attempt computation of property data, but initializes the property value to the default and returns that default, if the data is not present and valid; and that the
SQL
command variant formats the data as
SQL
values rather than for
Tcl
or
Python
script processing.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile sqlget filehandle propertylist ?filterset? ?parameterdict?
f.sqlget(property=,?filters=?,?parameters=?)
Molfile.Sqlget(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the
molfile get
command. The difference between
molfile get
and
molfile sqlget
is that the
SQL
command variant formats the data as
SQL
values rather than for
Tcl
or
Python
script processing.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile sqlnew filehandle propertylist ?filterset? ?parameterdict?
f.sqlnew(property=,?filters=?,?parameters=?)
Molfile.Sqlnew(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the
molfile get
command. The differences between
molfile get and molfile sqlnew
are that the latter forces re-computation of the property data, and that the
SQL
command variant formats the data as
SQL
values rather than for
Tcl
or
Python
script processing.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile sqlshow filehandle propertylist ?filterset? ?parameterdict?
f.sqlshow(property=,?filters=?,?parameters=?)
Molfile.Sqlshow(filename,property=,?filters=?,?parameters=?)
Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.
For examples, see the
molfile get
command. The differences between
molfile get
and
molfile sqlshow
are that the latter does not attempt computation of property data, but raises an error if the data is not present and valid, and that the
SQL
command variant formats the data as
SQL
values rather than for
Tcl
or
Python
script processing.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile string enshandle/reactionhandle/datasethandle ?attribute value?...
molfile string enshandle/reactionhandle/datasethandle ?attribute_dict?
Molfile.String(eref/xref/dref,?attribute,value?,...)
Molfile.String(eref/xref/dref,attribute_dict)
This command creates a byte vector representation of a structure file. The third argument in the Tcl variant (first for Python ) is an ensemble, reaction or dataset handle or reference, not a file handle or reference as for other molfile commands.
If the selected output format module supports direct output into a string, the record image is created without intermediary forms. Otherwise, an anonymous temporary file is opened, the ensemble or reaction(s) written to that file, and the file content returned as string with all newlines etc. The file is then removed.
Writing to binary formats is possible. The return value of the command is a byte vector, not a simple text string, so it may contain
NUL
bytes. By default, in the absence of an explicit format specification, a
MDL
Molfile is written.
The remaining parameters are interpreted as in the
molfile set
command. There are two equivalent command variants, either using attribute and value argument pairs or a dictionary as a single argument. The parameters in the extra arguments or dictionary are typically used to set a hydrogen status, select the output format, etc.
molfile blob
is an alias to this command.
set jmestring [string trim [molfile string [ens create C1CC1] format jme]]
The example creates an input string for the popular JME Java structure editor by P. Ertl/Novartis. The
string trim
statement deletes the trailing newline. The necessary
JME
output module is automatically loaded if it is not already loaded or compiled-in when the format parameter is decoded.
String record representations generated by this command can be opened for input as string data with the s mode of the
molfile open
command:
set fh [molfile open [molfile string $eh] s]
molfile subcommands
dir(Molfile)
Lists all subcommands of the
molfile
command. Note that this command does not require a handle.
molfile sync filehandle
f.sync()
This command synchronizes the file contents with the file system. The I/O modules for most file formats automatically perform a simple file buffer flushing upon finishing the output of a record, so this command is needed only under special circumstances where complete file system synchronization is required, the file was written without immediate commits, the I/O module for the file format provides a special synchronization function, or the output was done via asynchronous I/O. In any case, every file is fully synchronized when it is closed, so calling this function for normal output operations is not required.
molfile toggle filehandle
f.toggle()
Switch a file from input to output, or vice versa. If the file was in write, append or update mode when the command is executed, the file is rewound and the read pointer is now pointing to the first record, or the original end point for append files. If the file was configured for input, the file output mode is changed to append if the file is a normal file. If the file is a scratch file, the file is truncated to an empty file and the write position set to the first record.
Not all file types can be toggled. Special file types except FTP streams cannot, and it is not possible to toggle a simple disk file which was originally opened in
read only
mode (see
molfile open
command).
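A hedged round-trip sketch: an ensemble is written to a freshly opened disk file, the handle is toggled to input mode, and the record is read back. The file name and variable names are illustrative only.
set eh [ens create c1ccccc1]
set fh [molfile open roundtrip.sdf w]
molfile write $fh $eh
molfile toggle $fh
set eh2 [molfile read $fh]
molfile close $fh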
molfile transfer filehandle propertylist ?targethandle? ?targetpropertylist?
f.transfer(properties=,?target=?,?targetproperties=?)
Copy property data from one molfile object to another molfile object or other major object, without going through an intermediate scripting language object representation, or dissociate property data from the molfile object. If a property in the argument property list is not already valid on the source file object, an attempt is made to compute it.
If a target object is specified, the return value is the handle or reference of the target object. The source and target object cannot be the same object.
If a target property list is given, the data from the source is stored as content of a different property on the target. For this, the data types of the properties must be compatible, and the object class of the target property that of the target object. No attempt is made to convert data of mismatched types. In case of multiple properties, the source property list and the target property list are stepped through in parallel. If there is no target property list, or it is shorter than the source list, unmatched entries are stored as original property values, and this implies that the object classes of the source and target objects are the same.
If no target object is specified, or it is spelled as an empty string or
Python
None
, the visible effect of the command is the same as a simple
molfile get
, i.e. the result is the property data value or value list. The property data is then deleted from the source object. In case the data type of the deleted property was a major object (i.e. an ensemble, reaction, table, dataset or network), it is only unlinked from the source object, but not destroyed. This means that the object handles returned by the command can henceforth the used as independent objects. They can be deleted by a normal object deletion command, and are no longer managed by the source object.
molfile truncate filehandle ?record?
f.truncate(?record=?)
Molfile.Truncate(filename=,?record=?)
Truncate a file. If no explicit record is given, the file is truncated after the current record. In case the current record count of the file is less than the specified record, the command raises an error.
Only files which are rewindable can be truncated. In addition, the program must have write permission to the file, although it is not required that the file handle is opened for writing. The I/O modules for file formats which are not a simple record sequence must provide a truncation function or the operation will fail.
The command returns the original file handle or reference.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
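For instance, cutting a rewindable, writable file down to its first 100 records (a sketch; the handle must satisfy the conditions described above):
molfile truncate $fhandle 100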
molfile unlock filehandle propertylist/molfile/all
f.unlock(property=)
Unlock property data for the file object, meaning that they are again under the control of the standard data consistency manager.
The property data to unlock can be selected by providing a list of the following identifiers:
Property data locks are obtained by the
molfile lock
command.
This command is a generic property data manipulation command which is implemented for all major objects in the same fashion and is not related to disk file locking. Disk file locks can be set or reset by modifying the
molfile
object attribute lock. This is explained in more detail in the paragraph on the
molfile get
command.
The return value is the original molfile handle or reference.
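For example, returning every locked property on the file object to the control of the consistency manager:
molfile unlock $fhandle all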
molfile upgrade filehandle
f.upgrade()
Molfile.Upgrade(filename)
If the I/O module provides a function to upgrade the format of an older file to the latest version of the format, for example after a support library upgrade, that function may be used. The only format which currently supports this feature is BDB .
The command returns the original molfile handle or reference.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile valid filehandle propertylist
f.valid(property/propertysequence)
Returns a list of boolean values indicating whether values for the named properties are currently set for the structure file. No attempt at computation is made.
if [molfile valid $fhandle F_COMMENT] {...}
molfile vappend filehandle objectlist
f.vappend(objectref/objectrefsequence)
Virtually append records to an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the extra records were present.
Because no actual output is generated, this command can only be applied on files opened for reading , not output files. In addition, the file handle needs to refer to a normal disk file and to support going backwards in the file, i.e. this command cannot be used on structure files opened via URL s, standard I/O channels, socket connections or composite virtual files with multiple physical files or the contents of a directory. The file format must support multiple records and the records must be encoded as a simple concatenated byte sequence. Examples for formats which work are SMILES or SD files for structures, or RXN or RD files for reactions.
The object list may contain ensemble, reaction or dataset handles. The data is split into virtual records according to the storage capabilities of the file. The format of the data written to the virtual records can be controlled by setting the writelist , droplist and hydrogens status attributes on the file handle.
When executed for the first time on a file handle for which the record count is yet unknown, the existing file records must be tallied and all current physical record positions be registered. For very large files, this can take some time. However, this is not equivalent to reading the complete file, so it does not consume much memory and the command can in principle work on arbitrarily large files.
Virtual records are held as string images in memory. A couple of thousand such records should not be a problem for typical workstations, but for systematic editing of large files where every record is touched an explicit scripted input/output loop is preferable.
The return value is the new record count of the file.
Changes to the file can be committed to disk by means of the
molfile vrewrite
command.
molfile vappend $fhandle [ens create c1ccccc1]
molfile vdelete filehandle recordlist
f.vdelete(record/recordsequence)
Virtually delete records from an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the specified records had been deleted.
Because no actual output is generated, this command can only be applied on files opened for reading , not output files. In addition, the file handle needs to refer to a normal disk file and to support going backwards in the file, i.e. this command cannot be used on structure files opened via URL s, standard I/O channels, socket connections or composite virtual files with multiple physical files or the contents of a directory. The file format must support multiple records and the records must be encoded as a simple concatenated byte sequence. Examples for formats which work are SMILES or SD files for structures, or RXN or RD files for reactions.
When executed for the first time on a file handle for which the record count is yet unknown, the existing file records must be tallied and all current physical record positions be registered. For very large files, this can take some time. However, this is not equivalent to reading the complete file, so it does not consume much memory and the command can in principle work on arbitrarily large files.
The record list is a list of integer values, with one as the first file record. The list does not need to be sorted, and duplicate record numbers or record numbers out of range are ignored. It is possible to virtually delete file records which are themselves virtual, i.e. were added by the vappend, vreplace or vinsert subcommands and are not physically present in the file.
Virtually deleted records have negligible memory demands, but will slightly slow down input operations on edited files.
The return value is the new record count of the file.
Changes to the file can be committed to disk by means of the
molfile vrewrite
command.
molfile vdelete $fhandle [list 3 9 6]
molfile verify filehandle property
f.verify(property)
Verify the values of the specified property on the molfile object. The property data must be valid, and a molfile property. If the data can be found, it is checked against all constraints defined for the property, and, if such a function has been defined, is tested with the value verification function of the property.
If all tests are passed, the return value is boolean 1; it is 0 if the data could be found but fails the tests; otherwise an error is raised.
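A minimal sketch, assuming F_COMMENT data is present and valid on the file object:
if {[molfile verify $fhandle F_COMMENT]} {
puts "F_COMMENT passes all property constraints"
}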
molfile vinsert filehandle objectlist
m.vinsert(objectref/objectrefsequence)
Insert virtual records for the specified objects into the file. The insertion position is before the current read position.
Except for the difference in the location where the virtual records are inserted, the command is equivalent to the
molfile vappend
command and has the same features and limitations. Please refer to that command for details.
The return value is the new record count of the file.
Changes to the file can be committed to disk by means of the
molfile vrewrite
command.
molfile vreplace filehandle objectlist
m.vreplace(objectref/objectrefsequence)
Insert virtual records for the specified objects into the file. The current input record is virtually overwritten.
Except for the difference in the location where the virtual records are inserted, and the fact that an existing record is replaced, the command is equivalent to the
molfile vappend
command and has the same features and limitations. Please refer to that command for details.
It is possible to replace a record which is itself virtual, i.e. was introduced by a vappend, vinsert or vreplace subcommand. If more than one output object is passed, or the object is written as multiple file records, additional virtual records are created and the record count of the file increased accordingly.
The return value is the new record count of the file.
Changes to the file can be committed to disk by means of the
molfile vrewrite
command.
set eh [molfile read $fh]
ens expand $eh
molfile backspace $fh
molfile vreplace $fh $eh
ens delete $eh
This command sequence virtually replaces a record with a version where superatoms are expanded.
molfile vrewrite filehandle ?filename?
m.vrewrite(?filename=?)
Commit all virtual record additions, deletions or replacements to a physical file. If no file name is given, the current file name is used. After writing, the file handle remains valid. It is open for reading, and positioned before the first record. At this moment, the file no longer contains any virtual modifications, but the file handle may again be subjected to virtual edit operations. In case a file name is specified, and is not the same as the name of the current file, the file handle refers to the new file when the command has finished.
All valid records are copied verbatim to the new file, without going through decoding and re-encoding of records (see
molfile copy
command). A temporary file in the same directory as the current file is created, and sufficient disk space needs to be present to hold both the original file and the edited version at the same time. In case a problem occurs, the temporary file is deleted and the current file remains active. Only if all write operations succeed is the old file deleted and the temporary file renamed if necessary. In case a file name is specified, and it is not the same as that of the current file, the original file remains untouched, but is no longer linked to the
molfile
handle. For large files, this operation can take some time because massive amounts of data may need to be moved.
If the file referenced by the file handle has not been edited with virtual record operations (
vappend, vdelete, vinsert, vreplace
), the command does nothing and is equivalent to a
molfile rewind
.
The command returns the number of records written.
set fh [molfile open "myfile.sdf"]
molfile vinsert $fh [ens create c1ncccc1]
molfile vrewrite $fh "myfile_with_pyridine_inserted_in_rec_1.sdf"
molfile write filehandle ?objecthandle/objecthandlelist?...
f.write(objectsequence/objectref,...)
Molfile.Write(filename,objectsequence/objectref,...)
This command writes structure and reaction data to a file. Object handles may be ensemble handles, reaction handles, dataset handles, or molfile handles.
If an object is an input
molfile
handle, objects are read from the file until
EOF
is encountered if the output file supports multiple records. If the output file type is single-record, only the next record is read. The types of objects which are collected from the input
molfile
handle are dependent on its read scope. These objects are then treated as if they were used as parameter objects directly. Objects obtained via a
molfile
handle are automatically deleted after they have been written. If the input file is already at
EOF
when the command is executed, no objects are read, and no error is generated. However, this does not trigger the
NULL
record output handling described below, because the file object was specified as an argument.
The type of data which is actually written to the file depends on its format. A file opened for ensemble output can be fed with any type of handle. If reactions or datasets are passed, these are taken apart and written as individual records. If the output file is a reaction file, and an ensemble is passed, the reaction it is a member of is looked up and used as output object. If the ensemble is not a reaction ensemble, an attempt is made to store it as a plain ensemble outside any reaction. If the output routine rejects this, an error is raised. In case of datasets passed as objects for reaction output, the individual dataset objects (ensembles or reactions) are written, with reaction reference substitution in case ensembles instead of reactions are found. For full-dataset output, it is legal to pass non-dataset objects. No dataset-level information is written and the objects are stored as an anonymous dataset.
It is legal to supply no object handles at all. Normally, this means that simply no output is performed. However, I/O modules for specific file formats may support the output of special
NULL
records. In that case, the output function is called once without any objects. An example are
Gaussian
job files, which allow you to write records in multi-link files, where the computation instructions are taken from the file property
F_GAUSSIAN_JOB_PARAMS
, without supplying a structure record.
As part of the output process, new information may be computed on the objects. In case the active settings on the output molfile handle demand a structural change of an object, for example the addition or removal of hydrogen atoms, or the re-coding of ionic versus pentavalent nitro groups and similar functionality, the write objects are temporarily duplicated and these duplicates undergo the structure changes. The original output objects are never indirectly edited in their connectivity by this command.
The
writelist
attribute of
molfiles
may be set to a list of properties which should be included in the output. This has an effect only for file formats which support the storage of custom data values and which can cope with the data types of the listed properties. By default, no attempt is made to actively compute these properties for output. If they are not present in the input data, their output is silently omitted, or
NULL
values are written, depending on how the output format encodes these things. However, if the
computeprops
flag is set on the output
molfile
, an attempt for computation is made, and after output, the objects retain this additional data if the computation succeeds.
If the hydrogen set mode of the output molfile calls for a change in hydrogen status, the stage when these computations are performed depends on the hydrogen addition mode. If the output mode calls for potential hydrogen additions, the computations are executed after the addition - and this means, on the temporary duplicate, so the original object does not see the new property data. If the hydrogen mode does not change the hydrogen set, or potentially removes hydrogens, computations are performed on the original objects and then the object is potentially duplicated, with all its data, for hydrogen removal and output. In the latter case, the additional property data is visible on the original input objects.
The command returns a list of the object handles or references which were actually written to file. In cases like a reaction being split into ensembles, or a dataset taken apart, this is not necessarily the same object handle collection as the input object list. For output from an input molfile argument, the total number of objects written is returned instead, because the read objects are not retained.
The Python class method is a one-shot command. The transient molfile created from the initialization items is automatically closed when the command finishes.
molfile write “myfile.sdf” $eh1 $eh2
set fhandle [molfile open z.cbin w hydrogens add format cbin]
molfile write $fhandle $dset1
molfile write $fhandle $dset2
molfile close $fhandle
The first sample line uses the single-shot file operation feature of the
molfile
command. Instead of a
molfile
handle, a file name is passed, and that file is automatically opened, the output performed, and then the file is closed. Two ensembles are written with a single statement to the output file myfile.sdf. The desired file format is guessed from the file name suffix. No change in hydrogen status, etc. is performed, and no extra data is written out.
The next four example lines show how two complete datasets can be written to a native Cactvs toolkit binary file. Hydrogens are added to structures or reactions in the dataset - but the original dataset elements are not changed, since the addition is performed on temporary object duplicates. Also, the Cactvs binary format is requested explicitly by setting the format attribute. In this case, this is not really required, since the file format could also be guessed from the file name suffix. However, in case a non-standard file name suffix is used, formats must be specified explicitly, or the default format ( MDL SD -file) is used. If the Cactvs binary file is later opened for reading with a read scope of dataset , all dataset elements plus the dataset-level property data can be recovered.