SMILES Converter
The chemrof CLI converts SMILES strings into chemrof-compliant data records.
Given a SMILES like CCO (ethanol) or [Ca+2] (calcium ion), it parses the
structure with RDKit, determines the correct chemrof type, and fills in
structural properties automatically.
Quick start
# Install (from the repo root)
uv sync
# Convert a SMILES to YAML
chemrof from-smiles "CCO"
# Multiple molecules, JSON output
chemrof from-smiles "CCO" "c1ccccc1" "[Ca+2]" --format json
# OWL output (OWL Functional Syntax)
chemrof from-smiles "CCO" "[Ca+2]" --format owl
# Pull names from PubChem
chemrof from-smiles "CCO" --enrichers pubchem
What it produces
For each SMILES, the converter outputs a dict with these slots:
| Slot | Source | Example |
|---|---|---|
id |
InChIKey (computed) | INCHIKEY:LFQSCWFLJHTTHZ-UHFFFAOYSA-N |
name |
Molecular formula (or enricher) | C2H6O |
type |
Auto-classified | chemrof:SmallMolecule |
smiles_string |
Canonical SMILES | CCO |
inchi_string |
Computed InChI | InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 |
inchi_chemical_sublayer |
Parsed from InChI | C2H6O |
empirical_formula |
RDKit | C2H6O |
molecular_mass |
Exact mass (Da) | 46.0419 |
is_organic |
Contains carbon | true |
elemental_charge |
For ions only | 2 |
has_element |
For monoatomic ions | Ca |
Auto-classification
The converter inspects the parsed molecule and picks the right chemrof class:
| Structure | chemrof type |
|---|---|
Single atom, positive charge (e.g. [Ca+2]) |
AtomCation |
Single atom, negative charge (e.g. [Cl-]) |
AtomAnion |
Single atom, neutral (e.g. [He]) |
UnchargedAtom |
Multi-atom, net positive (e.g. [NH4+]) |
MolecularCation |
Multi-atom, net negative (e.g. CC([O-])=O) |
MolecularAnion |
Multi-atom, neutral (e.g. CCO) |
SmallMolecule |
Enrichers
By default, the converter fills in only what RDKit can compute from the
structure. To pull additional data from external databases, use --enrichers:
chemrof from-smiles "CCO" --enrichers pubchem
This adds a PubChem lookup by InChIKey, filling in the preferred IUPAC name and a PubChem CID cross-reference.
Available enrichers
| Name | Status | What it does |
|---|---|---|
pubchem |
Working | Looks up the compound in PubChem by InChIKey. Fills name (IUPAC preferred) and pubchem_cid. |
chebi |
Stub | Will resolve CHEBI identifiers via the OLS API. |
wikidata |
Stub | Will resolve Wikidata QIDs via SPARQL. |
Multiple enrichers run in sequence:
chemrof from-smiles "CCO" --enrichers pubchem,chebi
Writing a custom enricher
Enrichers follow a simple protocol. Any class with a name attribute and an
enrich(obj, context) method works:
from chemrof.converter.enrichers.base import EnrichmentContext
class MyEnricher:
name = "my-source"
def enrich(self, obj: dict, context: EnrichmentContext) -> dict:
# context.inchikey, context.smiles, context.inchi, context.mol
# are available for lookups
obj["my_custom_field"] = look_up_something(context.inchikey)
return obj
The EnrichmentContext carries the RDKit mol object and computed identifiers
(InChIKey, canonical SMILES, InChI) so enrichers can use whichever key their
data source accepts.
To use a custom enricher programmatically:
from chemrof.converter.smiles import SmilesConverter
converter = SmilesConverter(enrichers=[MyEnricher()])
result = converter.convert("CCO")
OWL output
The --format owl option emits an OWL ontology in Functional Syntax.
Each entity becomes an OWL Class with a SubClassOf axiom linking it to
its chemrof type, plus annotation assertions for each structural property:
Declaration(Class(chemrof:INCHIKEY:LFQSCWFLJHTTHZ-UHFFFAOYSA-N))
SubClassOf(chemrof:INCHIKEY:LFQSCWFLJHTTHZ-UHFFFAOYSA-N chemrof:SmallMolecule)
AnnotationAssertion(rdfs:label chemrof:INCHIKEY:LFQSCWFLJHTTHZ-UHFFFAOYSA-N "C2H6O")
AnnotationAssertion(chemrof:smiles_string chemrof:INCHIKEY:LFQSCWFLJHTTHZ-UHFFFAOYSA-N "CCO")
The Python API equivalent:
from chemrof.converter.smiles import SmilesConverter
from chemrof.converter.owl_output import dicts_to_owl
converter = SmilesConverter()
objs = [converter.convert(s) for s in ["CCO", "[Ca+2]"]]
print(dicts_to_owl(objs))
Python API
The converter is also usable as a library:
from chemrof.converter.smiles import SmilesConverter
converter = SmilesConverter()
result = converter.convert("CCO")
# result is a plain dict with chemrof slots
print(result["type"]) # chemrof:SmallMolecule
print(result["empirical_formula"]) # C2H6O