Similarity Search Programs

Introduction

This is a package of programs similar to the ones used by ChemMine for similarity searches. Provided are statically linked binaries compiled for use on x86 and x86_64 computers.

System Requirements

Operating system: x86 Linux, x86_64 Linux

Notes

Each compound must be assigned a unique integer identification number greater than 0.
Results returned by the programs are just a list of identification numbers.
Currently there is no way to delete what has been added to the database without deleting the entire database.
The 64-bit version has only been tested on AMD processors.
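
The first note can be checked before any data is loaded. The following is a minimal Python sketch and is not part of the package. It assumes the id list file holds one identification number per line, which this document does not specify, and the function name parse_id_list is hypothetical.

```python
def parse_id_list(text):
    """Return the identification numbers found in text.

    Assumption: one integer per line; the real file format is not
    documented here. Raises ValueError for any id that is not a
    unique integer greater than 0.
    """
    ids = []
    seen = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        token = line.strip()
        if not token:
            continue  # skip blank lines
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"line {lineno}: {token!r} is not a positive integer")
        value = int(token)
        if value == 0:
            raise ValueError(f"line {lineno}: ids must be greater than 0")
        if value in seen:
            raise ValueError(f"line {lineno}: duplicate id {value}")
        seen.add(value)
        ids.append(value)
    return ids
```

Under the same assumption, calling parse_id_list on the contents of example/myset would either return the ids or report the first offending line.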

Running the programs

Five programs are provided, each requiring its own specific set of parameters. All parameters must be given, in the order shown:

descriptor_gen (dbdir) (id list file) (sdf dir) (setname) (dbtype) (create)
load_smi (dbdir) (id list file) (sdf dir) (create)
descriptor_compare (dbdir) (sdf file) (dbtype) (cutoff) (sort) (set1) (set2)...
substructure (dbdir) (smiles file) (set1) (set2)
dbstat (dbdir)

(dbdir)	database directory, created beforehand (see "Create a directory for the database" in the example)
(id list file)	file containing the list of identification numbers
(sdf dir)	directory containing the sdf files
(setname)	name to give to this group of compounds
(dbtype)	type of database to create; this can only be 1 or 2
	1 for atom pair, 2 for atom sequence
(create)	whether to run the initial setup of the database; this can only be 0 or 1
	0 to not run the initial setup
	1 to run the initial setup; use 1 the first time you create a database
(sdf file)	an sdf file containing the query compound
(cutoff)	only results scoring higher than this number are returned
(sort)	1 to sort the results, 0 for unsorted
(setx)	names of the sets to search: (set1), (set2), and so on
(smiles file)	a file with the query smiles string
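
Because the parameters are positional, a small wrapper can catch out-of-range values before a program is started. This is a hedged Python sketch, not part of the package: the function names are hypothetical, the "./" program paths simply follow the example commands in this document, and the requirement of at least one set name for descriptor_compare is an assumption.

```python
def descriptor_gen_args(dbdir, id_list_file, sdf_dir, setname, dbtype, create):
    """Build the descriptor_gen argument list in the documented order."""
    if dbtype not in (1, 2):
        raise ValueError("dbtype must be 1 (atom pair) or 2 (atom sequence)")
    if create not in (0, 1):
        raise ValueError("create must be 0 or 1")
    return ["./descriptor_gen", dbdir, id_list_file, sdf_dir, setname,
            str(dbtype), str(create)]


def descriptor_compare_args(dbdir, sdf_file, dbtype, cutoff, sort, *sets):
    """Build the descriptor_compare argument list in the documented order."""
    if dbtype not in (1, 2):
        raise ValueError("dbtype must be 1 (atom pair) or 2 (atom sequence)")
    if sort not in (0, 1):
        raise ValueError("sort must be 0 or 1")
    if not sets:
        # Assumption: the program needs at least one set to search.
        raise ValueError("at least one set name is required")
    return ["./descriptor_compare", dbdir, sdf_file,
            str(dbtype), str(cutoff), str(sort), *sets]
```

The returned lists can be passed directly to subprocess.run from the Python standard library.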

Example

The example subdirectory contains the sample data that this example follows.
The files myset and myset1 each contain a list of the identification numbers assigned to the compounds.
The sdf directory contains the sdf files that are used.

Create a directory for the database
- This can be done with the mkdir command in Linux. This example assumes that there is an empty directory named db in the current directory.

Creating a database

Initialize the database and create the first compound set with the descriptor_gen and load_smi programs.
- To generate the data for a compound set execute: "./descriptor_gen db/ example/myset example/sdf/ myset 1 1"
  This generates atom pair data for the compound set defined by the file myset and names the set myset.
- To generate atom sequence data for the compound set execute: "./descriptor_gen db/ example/myset example/sdf/ myset 2 0"
  Note that the create parameter is now set to 0, since the database was already initialized by the previous command.
- To generate the smiles data for the compound set execute: "./load_smi db example/myset example/sdf/ 0"

Adding another compound set to the database

This step is similar to the steps taken to create the database. The following commands are executed:
- ./descriptor_gen db/ example/myset1 example/sdf/ myset1 1 0
- ./descriptor_gen db/ example/myset1 example/sdf/ myset1 2 0
- ./load_smi db example/myset1 example/sdf/ 0
Notice that in this step the create parameter is set to 0 in every command, since the database has already been created.
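
The loading sequence for both sets can be generated rather than typed by hand. The sketch below is an illustration in Python, not part of the package: build_plan and run_plan are hypothetical names, and run_plan assumes the programs return a nonzero exit status on failure, which this document does not state.

```python
import subprocess


def build_plan(dbdir, sdf_dir, sets, new_database=True):
    """Return the commands that load every (id_list_file, setname) pair.

    Mirrors the example: create is 1 only for the very first
    descriptor_gen call on a new database and 0 everywhere else,
    including every load_smi call.
    """
    plan = []
    create = 1 if new_database else 0
    for id_list_file, setname in sets:
        for dbtype in (1, 2):  # 1 = atom pair, 2 = atom sequence
            plan.append(["./descriptor_gen", dbdir, id_list_file, sdf_dir,
                         setname, str(dbtype), str(create)])
            create = 0  # the database exists after the first command
        plan.append(["./load_smi", dbdir, id_list_file, sdf_dir, "0"])
    return plan


def run_plan(plan):
    """Run each command in order, stopping at the first failure."""
    for argv in plan:
        # Assumption: a failing program exits with a nonzero status.
        subprocess.run(argv, check=True)
```

Calling build_plan("db/", "example/sdf/", [("example/myset", "myset"), ("example/myset1", "myset1")]) yields six commands equivalent to the ones listed in this example. Printing the plan before calling run_plan is a cheap way to review it.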

Searching the database

Similarity searches are done by executing the descriptor_compare program.
To run a similarity search with a compound against the set myset, execute:
- ./descriptor_compare db/ example/sdf/1.sdf 1 0.3 1 myset
To run the same search against both myset and myset1:
- ./descriptor_compare db/ example/sdf/1.sdf 1 0.3 1 myset myset1
To use atom sequence instead of atom pairs:
- ./descriptor_compare db/ example/sdf/1.sdf 2 0.3 1 myset
Substructure searches are done with the substructure program:
- ./substructure db/ example/test.smi myset
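
Since the programs return only a list of identification numbers, their output can be collected from a script. The following Python sketch is an assumption-laden illustration, not part of the package: it supposes that the ids are written to standard output as whitespace-separated integers, which this document does not specify, and the names parse_result_ids and run_search are hypothetical.

```python
import subprocess


def parse_result_ids(output):
    """Parse program output into a list of identification numbers.

    Assumption: the ids are printed as whitespace-separated integers.
    Raises ValueError if any token is not an integer.
    """
    return [int(token) for token in output.split()]


def run_search(argv):
    """Run a search command and return the ids it prints."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return parse_result_ids(completed.stdout)
```

For instance, run_search(["./descriptor_compare", "db/", "example/sdf/1.sdf", "1", "0.3", "1", "myset"]) would run the first search above. If the real output has a different layout, only parse_result_ids needs to change.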
