📖

Gao Lab SAR Studio — User Manual

Complete guide to all features, input formats, and scientific interpretation

🚀 Quick Start Guide

Get from zero to a full SAR analysis in under 2 minutes

Step 1 — Choose your activity metric

In the left panel, select the measurement type that matches your assay data from the dropdown. Most common choices:

  • IC50 (nM): Half-maximal inhibitory concentration in nanomolar; most common for enzyme and receptor assays
  • EC50 (nM): Half-maximal effective concentration; common for cell-based and agonist assays
  • % Inhibition: Single-concentration inhibition percentage (0–100%)
  • pIC50: −log₁₀(IC50 in M); already log-transformed, so choose this if your data is in this format

Step 2 — Enter your compounds

Paste SMILES strings with activity values into the text area, one compound per line. You can also drag-and-drop a CSV file or use the Manual Entry fields below the text area.

# Format: SMILES, Name (optional), Activity value
c1ccc(F)cc1NC(=O)CCC, Cpd-001, 85
c1ccc(Cl)cc1NC(=O)CCC, Cpd-002, 15
c1ccc(Br)cc1NC(=O)CCC, Cpd-003, 340

Step 3 — Run analysis

Click ▶ Run SAR Analysis. All tabs populate simultaneously. Navigate between tabs to explore different aspects of your dataset.

Step 4 — Advanced features (optional)

For machine learning features, visit the tabs in order: 🔬 Similarity → ⚡ Act. Cliffs → 🌲 Random Forest → 📊 Cross-Validation. Then use 🔮 Predict Activity with new SMILES to get predictions.

💡 Tip: SAR Studio works best with 10–100 compounds. With fewer than 5 compounds, ML features are disabled. With fewer than 3, SAR analysis is limited.
Limitations: Descriptors are calculated from SMILES pattern matching, not a full cheminformatics engine. Results are for hypothesis generation and prioritisation — always verify with experimental data.

📥 SMILES Input Formats

Everything you need to know about entering compound data correctly

What is SMILES?

SMILES (Simplified Molecular Input Line Entry System) is a text notation for chemical structures. Each atom, bond, and ring is encoded as characters. Most chemistry databases (ChEMBL, PubChem, Reaxys, SciFinder) can export SMILES directly.

Accepted input formats

Format A — With name (recommended):

SMILES, Name, ActivityValue
c1ccc(F)cc1NC(=O)CCC, Fluorobenzamide-1, 23.5
c1ccc(Cl)cc1NC(=O)CCC, Chlorobenzamide-2, 8.1

Format B — Without name (auto-numbered):

SMILES, ActivityValue
c1ccc(F)cc1NC(=O)CCC, 23.5
c1ccc(Cl)cc1NC(=O)CCC, 8.1

Format C — CSV file (drag and drop):

smiles,name,ic50_nM # header row is auto-detected
c1ccc(F)cc1NC(=O)CCC,Cpd-1,23.5
c1ccc(Cl)cc1NC(=O)CCC,Cpd-2,8.1
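The three formats above can be read with one small routine. The following is a minimal sketch, not SAR Studio's actual parser: the function name `parse_compounds`, the `Cpd-N` auto-numbering scheme, and the rule "a non-numeric last field marks a header row" are all assumptions made for illustration. Note that `#` is a triple bond in SMILES, so only lines that start with `#` are treated as comments.

```python
def parse_compounds(text):
    """Parse 'SMILES, Name, Activity' or 'SMILES, Activity' lines into dicts."""
    compounds = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        fields = [f.strip() for f in line.split(",")]
        if len(fields) not in (2, 3):
            continue  # neither Format A nor Format B
        try:
            activity = float(fields[-1])
        except ValueError:
            continue  # non-numeric last field: assumed to be a header row
        if len(fields) == 3 and fields[1]:
            name = fields[1]
        else:
            name = f"Cpd-{len(compounds) + 1}"  # hypothetical auto-numbering
        compounds.append({"smiles": fields[0], "name": name, "activity": activity})
    return compounds


sample = """smiles,name,ic50_nM
c1ccc(F)cc1NC(=O)CCC,Cpd-1,23.5
c1ccc(Cl)cc1NC(=O)CCC, 8.1
"""
for row in parse_compounds(sample):
    print(row["name"], row["smiles"], row["activity"])
```

Splitting on commas is safe here because SMILES strings do not contain commas; a name that contains a comma would need a real CSV reader.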

SMILES notation tips

  • Lowercase letters (c, n, o) = aromatic atoms
  • Uppercase letters (C, N, O) = aliphatic atoms
  • Numbers indicate ring closures: C1CCCCC1 = cyclohexane
  • Branches in parentheses: CC(=O)N = acetamide
  • Double bonds: =, triple bonds: #
  • Stereo: @ or @@ in brackets, / and \ for double bonds

Common SMILES examples

Benzene: c1ccccc1
Pyridine: c1ccncc1
Piperazine: C1CNCCN1
Morpholine: C1COCCN1
Amide: CC(=O)NC
Sulfonamide: CS(=O)(=O)N
🔗 Free SMILES sources: PubChem (pubchem.ncbi.nlm.nih.gov), ChEMBL (ebi.ac.uk/chembl), ChemDraw (draw structure → Copy as SMILES), or MarvinSketch (free).

Activity values

  • For IC50/EC50/Ki in nM: enter the raw number, e.g. 45 or 0.5
  • For µM values: select the µM metric and enter e.g. 0.045
  • For % Inhibition: enter 0–100, e.g. 78
  • For pIC50: enter the negative log value, e.g. 7.3
⚠ All compounds in one session must use the same metric. Mixed units (some nM, some µM) will give wrong results.

📊 Overview Tab

High-level summary statistics and top compound insights at a glance

Metric cards (top row)

Compounds: Total number of loaded structures
Scaffolds: Unique core ring systems detected
Best Activity: Most potent compound (lowest IC50 or highest inhibition)
Avg MW: Mean molecular weight of the dataset
Ro5 Pass: % of compounds passing Lipinski's Rule of Five (MW ≤500, cLogP ≤5, HBD ≤5, HBA ≤10)
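As an illustration of the Ro5 Pass card, the sketch below counts Lipinski violations from the four thresholds listed above and turns them into a pass percentage. The function names are hypothetical, and whether SAR Studio counts a compound as "passing" at zero violations or at one or fewer is not stated in this manual, so `max_violations` is left as a parameter.

```python
def ro5_violations(mw, clogp, hbd, hba):
    """Count Lipinski violations: MW > 500, cLogP > 5, HBD > 5, HBA > 10."""
    return sum([mw > 500, clogp > 5, hbd > 5, hba > 10])


def ro5_pass_rate(compounds, max_violations=0):
    """Percentage of compounds with at most `max_violations` violations.

    `compounds` is a list of dicts with keys mw, clogp, hbd, hba.
    The default of 0 allowed violations is an assumption.
    """
    if not compounds:
        return 0.0
    passed = sum(1 for c in compounds if ro5_violations(**c) <= max_violations)
    return 100.0 * passed / len(compounds)


dataset = [
    {"mw": 350.0, "clogp": 3.2, "hbd": 2, "hba": 5},   # compliant
    {"mw": 620.0, "clogp": 5.8, "hbd": 2, "hba": 5},   # violates MW and cLogP
]
print(ro5_pass_rate(dataset))
```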

Top compounds panel

The ranked compound list shows each structure (drawn from SMILES), its activity, key physicochemical properties, and Lipinski compliance. Green = good drug-likeness, amber = borderline, red = violation.

What to look for

  • Is there a wide spread in activity? A large dynamic range makes SAR easier to interpret
  • Do the most active compounds share a scaffold? This suggests a privileged scaffold
  • Are the best compounds also Ro5-compliant? High potency + poor drug-likeness may be a problem
💡 The pActivity (pAct) column is the internal normalised scale used for all ML calculations: pAct = −log₁₀(IC50 in M). Higher pAct = more potent.
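The conversion behind pAct can be sketched as follows, assuming the value is a concentration (IC50, EC50 or Ki). The function name `p_activity` and the unit labels are illustrative; how SAR Studio maps % Inhibition onto the same scale is not described here and is not covered by this sketch.

```python
import math


def p_activity(value, unit="nM"):
    """Return -log10 of a concentration after converting it to molar."""
    to_molar = {"nM": 1e-9, "uM": 1e-6, "M": 1.0}
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * to_molar[unit])


print(round(p_activity(50, "nM"), 2))    # 50 nM
print(round(p_activity(0.05, "uM"), 2))  # the same concentration in uM
```

For nM input this is equivalent to pAct = 9 − log₁₀(IC50 in nM), so a tenfold drop in IC50 raises pAct by exactly one unit.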

📋 SAR Table Tab

Full sortable compound table with all descriptors and activity data

Columns explained

  • Rank — Activity rank (1 = most potent)
  • Structure — 2D structure rendered from SMILES
  • Name/ID — Your compound identifier
  • Activity — Raw activity value in your chosen metric
  • pActivity — Log-normalised activity (higher = better)
  • MW — Molecular weight (Da). Drug-like range: 150–500
  • cLogP — Calculated lipophilicity. Drug-like range: −1 to 5
  • TPSA — Topological polar surface area (Ų). Oral bioavailability: <140; CNS penetration: <90
  • HBD / HBA — H-bond donors / acceptors. Lipinski limits: ≤5 / ≤10
  • Ro5 — Number of Lipinski rule violations (0 = fully compliant)
  • Scaffold — Core ring system type

Filtering and sorting

Use the filter bar above the table to search by name, or filter by scaffold type. Click any column header to sort. Use this to quickly find your best compounds or spot outliers.

💡 Sort by pActivity descending to see your most potent compounds first. Then sort by MW or cLogP to check if potency correlates with size or lipophilicity (ligand efficiency analysis).

🏗 Scaffolds Tab

Ring system classification and scaffold–activity relationship analysis

What is a scaffold?

In medicinal chemistry, the "scaffold" is the core ring system shared by a series of compounds. Scaffold analysis reveals which ring systems are associated with high activity — these are called privileged scaffolds.

How scaffolds are detected

The app matches SMILES patterns against a library of common heterocyclic and carbocyclic ring systems (quinazoline, pyrimidine, indole, piperazine, etc.). If no match is found, the compound is classified as Bicyclic, Monocyclic, or Acyclic based on ring count.

Scaffold cards

Each scaffold card shows:

  • Number of compounds with this scaffold
  • Average pActivity for the group
  • Best (most potent) compound in the group
  • Activity range (min to max)

What to look for

  • Which scaffold has the highest average activity? That's your most promising series
  • Large activity range within one scaffold = good SAR diversity — fine-tuning is possible
  • Small activity range within one scaffold = the scaffold is the key driver, not the substituents
🔬 Med chem context: Scaffold-hopping (switching from one ring system to another while maintaining activity) is a common strategy to escape IP, improve ADMET, or overcome resistance.

📈 Charts Tab

Visual exploration of descriptor–activity relationships

Available charts

  • Activity Distribution — Histogram of pActivity values. A bell-shaped distribution is ideal. A bimodal distribution may suggest two distinct series
  • cLogP vs Activity — Scatter plot. Positive correlation = more lipophilic compounds are more potent (common for membrane targets). Watch out for this driving Ro5 violations
  • MW vs Activity — Scatter plot. Potency that increases only with MW is a red flag — this is "molecular obesity" and predicts poor ADMET
  • TPSA Distribution — Bar chart of polar surface area. Important for permeability predictions
  • Scaffold Activity Comparison — Bar chart comparing average pActivity by scaffold. Identifies privileged scaffolds visually
  • Lipinski Compliance — Pie chart of Ro5 violations across the dataset

Ligand efficiency (LE)

The MW vs Activity chart gives a rough view of ligand efficiency (LE = pActivity / heavy atom count), with MW standing in for heavy atom count. Ideal: high potency at low MW. Compounds with high MW and low potency are inefficient and carry a higher risk of ADMET problems.
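A short sketch of the LE formula quoted above. The function name is hypothetical. The optional `kcal` flag applies the widely used conversion of about 1.37 kcal/mol per log unit (2.303·RT near 298 K); whether SAR Studio reports LE in those units is not stated, so it is off by default.

```python
def ligand_efficiency(p_act, heavy_atoms, kcal=False):
    """LE = pActivity / heavy atom count, optionally scaled to kcal/mol per atom."""
    if heavy_atoms <= 0:
        raise ValueError("heavy atom count must be positive")
    le = p_act / heavy_atoms
    return 1.37 * le if kcal else le


# Two compounds with the same potency but different size
print(ligand_efficiency(7.0, 20))  # small, efficient
print(ligand_efficiency(7.0, 40))  # twice the atoms, half the efficiency
```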

💡 Hover over any data point to see the compound name and exact values. Use these charts in team meetings to communicate SAR trends visually.

🔄 MMP Analysis Tab

Matched Molecular Pair analysis — identifying which structural changes drive activity

What is an MMP?

A Matched Molecular Pair (MMP) is two compounds that differ by exactly one structural transformation — for example, H→F at a specific position, or NH₂→OCH₃ on a ring. By comparing the activity of the two compounds, you can directly measure the effect of that single change on potency.

How it works in SAR Studio

The app compares all compound pairs and identifies those that differ by a single fragment from a predefined transformation library (halogen substitutions, N-methylation, carboxylate→amide, etc.). For each transformation found, it reports:

  • Which compounds were compared
  • What structural change was made
  • The activity difference (Δ pActivity and fold change)
  • Whether the change is beneficial or detrimental
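The reported quantities for a single pair can be sketched as below. This only shows the arithmetic linking Δ pActivity to fold change (fold = 10^|Δ|); the function name and the "neutral" label for a zero difference are assumptions, and pair detection itself is not shown.

```python
def mmp_effect(p_act_before, p_act_after):
    """Return (delta pActivity, fold change, direction) for one transformation."""
    delta = p_act_after - p_act_before
    fold = 10 ** abs(delta)  # potency ratio implied by the log difference
    if delta > 0:
        direction = "beneficial"
    elif delta < 0:
        direction = "detrimental"
    else:
        direction = "neutral"
    return delta, fold, direction


# Hypothetical H -> F pair: pActivity rises from 7.0 to 8.0
print(mmp_effect(7.0, 8.0))
```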

What to look for

  • Beneficial: Large positive Δ pActivity = this transformation improves potency; prioritise it in synthesis
  • Detrimental: Large negative Δ pActivity = avoid this transformation
  • Consistent direction across multiple pairs = high confidence SAR signal
🔬 Med chem context: MMP analysis is one of the most powerful tools for lead optimisation. A consistent +1 log unit gain (10×) from H→F substitution across 3+ pairs is an actionable SAR rule.
⚠ The app's MMP detection uses SMILES substring matching, not formal bond-cutting algorithms. It may miss some pairs or produce false positives for complex structures. Treat results as hypotheses to verify.

🔗 Correlations Tab

Pearson correlation between physicochemical descriptors and biological activity

What is Pearson correlation (r)?

The Pearson r coefficient measures the linear relationship between two variables. It ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation). r = 0 means no linear relationship.

|r| ≥ 0.7: Strong correlation, likely a real SAR driver
0.4 ≤ |r| < 0.7: Moderate, suggestive but not definitive
0.2 ≤ |r| < 0.4: Weak, may be noise, especially in small datasets
|r| < 0.2: No meaningful linear relationship
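For reference, Pearson r and the strength bands above can be computed with the standard library alone. This is a sketch with hypothetical function names and made-up descriptor values, not the app's code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Map r onto the interpretation bands used in this manual."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "none"


clogp = [1.2, 2.0, 2.9, 3.5, 4.1]   # illustrative values only
p_act = [6.1, 6.4, 7.0, 7.2, 7.9]
r = pearson_r(clogp, p_act)
print(round(r, 3), strength(r))
```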

Descriptors correlated

  • cLogP ↔ Activity — Lipophilicity drives potency for many targets (positive r = more lipophilic = more active)
  • MW ↔ Activity — High correlation suggests "molecular obesity" if MW is high
  • TPSA ↔ Activity — Negative r is expected for CNS targets (high TPSA = poor BBB penetration)
  • HBD ↔ Activity — High HBD often reduces membrane permeability
  • Rotatable bonds ↔ Activity — High rotatable bond count reduces oral bioavailability (Veber limit: ≤10; this is not one of the Lipinski Ro5 criteria)
⚠ Correlation ≠ causation. A strong correlation between MW and activity may simply reflect the fact that larger compounds have more hydrophobic surface area — the real driver may be cLogP, not MW. Always cross-reference with MMP and Conclusions tabs.

💡 Conclusions Tab

Auto-generated SAR interpretation cards summarising key findings

What the Conclusions tab does

This tab synthesises the data from all other analyses into human-readable SAR conclusions. It automatically generates insight cards for:

  • Physicochemical drivers — Which descriptor (cLogP, MW, TPSA, etc.) has the strongest correlation with activity, and what that means for your series
  • Moiety analysis — Which structural fragments (e.g. fluorine, sulfonamide, carboxylic acid) are statistically associated with higher or lower activity
  • Scaffold conclusions — Which core ring system is most advantageous

Reading the moiety cards

Each moiety card compares two groups: compounds containing the fragment vs compounds without it. The % change in average pActivity is reported:

  • ✅ Favorable: Fragment is associated with ≥12% higher average pActivity; retain or expand this group
  • ❌ Unfavorable: Fragment is associated with ≥12% lower average pActivity; consider replacing
  • ~ Neutral: Less than 12% difference; fragment has little impact on potency
  • ⬡ Universal: Present in all compounds; cannot assess contribution without a comparison group
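The card labels above follow from a simple comparison of group means. The sketch below assumes the % change is measured relative to the mean pActivity of compounds without the fragment; that baseline, the function name, and the return format are assumptions, since the manual only gives the 12% threshold.

```python
def classify_moiety(p_with, p_without, threshold_pct=12.0):
    """Label a fragment from pActivity values with and without it.

    Returns (label, percent_change). percent_change is None for a
    universal fragment because there is no comparison group.
    """
    if not p_with:
        raise ValueError("no compounds contain the fragment")
    if not p_without:
        return "universal", None
    mean_with = sum(p_with) / len(p_with)
    mean_without = sum(p_without) / len(p_without)
    # Assumed baseline: the group lacking the fragment
    pct = 100.0 * (mean_with - mean_without) / mean_without
    if pct >= threshold_pct:
        label = "favorable"
    elif pct <= -threshold_pct:
        label = "unfavorable"
    else:
        label = "neutral"
    return label, pct


print(classify_moiety([8.0, 8.2], [7.0, 7.1]))
```

Because pActivity is a log scale, a 12% change in the mean is a large potency shift: going from pActivity 7.0 to 7.84 is roughly a sevenfold change in IC50.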
💡 These conclusions are the most directly actionable output for medicinal chemistry decision-making. Share this tab with your team to align on synthesis priorities.

🔬 Similarity Tab

Morgan fingerprint-based Tanimoto similarity matrix with hierarchical clustering

What are molecular fingerprints?

A molecular fingerprint is a binary bit vector that encodes which structural features are present in a molecule. SAR Studio uses circular (Morgan-style) fingerprints with radius 2 and 512 bits — similar to ECFP4 used in industry QSAR workflows.

Each bit represents a circular neighbourhood of atoms up to 2 bonds away from a given atom. If that substructure exists in the molecule, the bit is set to 1.

Tanimoto similarity

Given two fingerprints A and B, Tanimoto similarity = (bits set in both A and B) ÷ (bits set in A or B). Range: 0 (completely different) to 1 (identical).

≥ 70%: High similarity (compounds are close structural analogues)
40–70%: Moderate (related scaffolds or common fragments)
< 40%: Low (structurally diverse)
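The Tanimoto formula is easy to check by hand when each fingerprint is represented as the set of positions whose bit is 1. The sketch below uses that set representation for illustration; the bit positions are made up, and returning 0.0 for two empty fingerprints is a convention chosen here, not something the manual specifies.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


fp1 = {3, 17, 42, 101, 250}
fp2 = {3, 17, 42, 99, 250, 311}
print(round(tanimoto(fp1, fp2), 2))  # 4 shared bits out of 7 distinct bits
```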

Hierarchical clustering

The heatmap reorders compounds so that similar ones appear adjacent, revealing clusters of related structures. Each cluster likely represents a chemical series.

What to look for

  • Clear cluster blocks = distinct chemical series in your dataset
  • All compounds with high similarity = low chemical diversity (limited SAR exploration)
  • Very low similarity overall = chemically diverse set (good for building a broad model, harder for optimising one series)
🔬 Industry context: Tanimoto ≥ 0.4 is often used as the threshold for "structurally similar" in patent analysis and compound clustering. The Bemis-Murcko scaffold approach is complementary.

⚡ Activity Cliffs Tab

Pairs of similar compounds with large activity differences — the richest SAR signals

What is an activity cliff?

An activity cliff occurs when two structurally similar compounds (high Tanimoto similarity) have dramatically different biological activities. These cliffs are particularly informative because a small structural change causes a large potency jump — pinpointing the exact pharmacophoric feature responsible.

Detection thresholds used

  • Structural similarity: Tanimoto ≥ 0.35 (similar enough to be meaningful)
  • Activity difference: ΔpActivity ≥ 0.7 (equivalent to ≥5× difference in IC50)
⚡ Major cliff: ΔpActivity ≥ 2.0 (≥100× potency difference), the highest-value SAR signal
⚡ Moderate cliff: 0.7 ≤ ΔpActivity < 2.0 (5–100× difference)
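Putting the two thresholds together, cliff detection amounts to a scan over all compound pairs. The following is a sketch under stated assumptions: fingerprints are sets of on-bit positions, the function names are hypothetical, and the example fingerprints and activities are invented.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_cliffs(compounds, sim_min=0.35, delta_min=0.7, major_min=2.0):
    """Return (name1, name2, similarity, delta, kind) for each cliff pair.

    `compounds` is a list of (name, fingerprint_set, p_activity) tuples.
    """
    cliffs = []
    for (n1, fp1, p1), (n2, fp2, p2) in combinations(compounds, 2):
        sim = tanimoto(fp1, fp2)
        delta = abs(p1 - p2)
        if sim >= sim_min and delta >= delta_min:
            kind = "major" if delta >= major_min else "moderate"
            cliffs.append((n1, n2, sim, delta, kind))
    # Largest activity jump first
    return sorted(cliffs, key=lambda c: c[3], reverse=True)


data = [
    ("A", {1, 2, 3, 4}, 6.0),
    ("B", {1, 2, 3, 5}, 8.5),   # similar to A, far more potent
    ("C", {9, 10}, 9.0),        # potent but structurally unrelated
]
for cliff in find_cliffs(data):
    print(cliff)
```

Only the A/B pair is reported: C differs from both in activity but shares no bits with them, so it fails the similarity threshold.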

How to act on activity cliffs

  • Identify the structural difference between the cliff pair (use the SMILES shown on each card)
  • Design 2–3 analogues that probe this specific change to confirm the trend
  • If the cliff is consistent across multiple pairs, it is a robust SAR rule
  • If only one pair shows the cliff, it may be a noise artefact — verify experimentally
🔬 Why cliffs matter: Activity cliffs are where the most dramatic improvements in potency can be achieved with the fewest structural changes — ideal for lead optimisation where synthetic effort is limited.

🌲 Random Forest Tab

Non-linear ensemble QSAR model with permutation feature importance

What is a Random Forest?

A Random Forest is an ensemble of decision trees, each trained on a random bootstrap sample of the data using a random subset of features at each split. The final prediction averages all tree outputs. This approach:

  • Captures non-linear descriptor–activity relationships that linear regression misses
  • Is more robust to outliers than single models
  • Reduces overfitting through the averaging of many trees
  • Provides feature importance estimates without requiring a separate analysis

SAR Studio RF settings

  • 60 trees, max depth 5, minimum 2 samples per leaf
  • Features: cLogP, MW, TPSA, HBD, HBA, rotatable bonds, aromatic rings, halogens, heavy atom count
  • Feature subset per split: √9 = 3 features (the √p heuristic; the p/3 rule often used for regression also gives 3 here)

Feature importance (permutation)

Each bar shows how much the model's error increases when that feature is randomly shuffled (breaking its relationship with activity). Higher % = more important. This is more reliable than impurity-based importance used in some implementations.
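The shuffling procedure described above can be sketched independently of any particular model. In this sketch the model is any function that maps a feature row to a prediction, error is mean squared error, and importances are normalised to sum to 100%; the normalisation, the number of repeats, and all names are assumptions rather than SAR Studio's documented behaviour.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) over a dataset."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Percent importance per feature from the error increase after shuffling.

    X is a list of rows, each row a list of feature values.
    """
    rng = random.Random(seed)
    base = mse(model, X, y)
    increases = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += mse(model, shuffled, y) - base
        increases.append(max(total / n_repeats, 0.0))
    scale = sum(increases)
    return [100.0 * v / scale if scale else 0.0 for v in increases]


# Toy model that depends only on the first feature
toy_model = lambda row: 2.0 * row[0]
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]
print(permutation_importance(toy_model, X, y))
```

Shuffling the second column leaves the toy model's error unchanged, so all of the importance is assigned to the first feature.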

Predicted vs Actual scatter plot

Points close to the diagonal red dashed line = good fit. Points far from the diagonal = compounds the model predicts poorly (often outliers or compounds with unusual activity driven by features not captured by these descriptors).

⚠ The training R² shown here is optimistic — it uses the same data the model was trained on. Always check the Cross-Validation tab for an honest estimate of predictive performance on unseen compounds.
💡 Once the RF tab has been visited and the model built, the 🔮 Predict tab automatically blends RF (60%) with Ridge regression (40%) for better predictions.

📊 Cross-Validation Tab

Honest estimate of model generalisation performance using k-fold CV

Why cross-validation matters

Training R² is almost always misleadingly high — any model can memorise its training data. Cross-validation (CV) gives you an honest estimate of how well your model will predict new, unseen compounds by systematically testing on held-out data.

How k-fold CV works

The dataset is randomly divided into k folds of roughly equal size (SAR Studio uses k=5, or fewer if the dataset is small). In each iteration:

  • 1 fold is held out as the test set
  • The remaining k−1 folds are used to train the model
  • Predictions are made on the held-out fold
  • This repeats k times until every compound has been predicted exactly once
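The splitting steps above can be sketched as an index generator. The function name is hypothetical, and capping k at the number of compounds is only one plausible reading of "fewer if the dataset is small"; the manual does not give the exact rule.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    k = min(k, n)  # assumed rule for small datasets
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # sizes differ by at most one
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_indices(12, k=5):
    print(len(train), "train /", len(test), "test:", sorted(test))
```

Every compound index appears in exactly one test fold, which is what lets each compound be predicted exactly once.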

Metrics reported

  • CV-R² — Fraction of activity variance explained by the model on test data. Higher = better generalisation
  • CV-RMSE — Root mean square error in pActivity units. RMSE = 0.5 means predictions are off by ~3× on average; RMSE = 1.0 means ~10×
CV-R² ≥ 0.7: Good. The model generalises well; predictions are reliable for prioritisation
CV-R² 0.4–0.7: Moderate. Useful, but add more compounds or consider 3D descriptors
CV-R² < 0.4: Poor. The model is likely overfitting, or the dataset is too small or too diverse
RF >> Ridge: Suggests non-linear relationships; RF is the better model for this dataset
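Both metrics can be computed from the pooled held-out predictions. The sketch below assumes predictions from all folds are pooled before scoring (averaging per-fold scores is an equally common choice, and the manual does not say which the app uses); it also converts RMSE into the approximate fold error, 10^RMSE. Names and values are illustrative.

```python
import math


def cv_metrics(y_true, y_pred):
    """Return (R^2, RMSE, fold_error) for held-out predictions in pActivity units."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse, 10 ** rmse


r2, rmse, fold = cv_metrics([6.0, 7.0, 8.0], [6.5, 7.0, 7.5])
print(round(r2, 2), round(rmse, 2), round(fold, 1))
```

Note that CV-R² can be negative: that happens when the model's held-out predictions are worse than simply predicting the mean activity.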

RF vs Ridge — which to use?

Both models are run in parallel. If RF CV-R² is substantially higher than Ridge, non-linear descriptor effects dominate and RF should be trusted more. If they are similar, the dataset may be too small for RF to outperform the simpler linear model.

💡 A per-fold CV bar chart lets you spot problematic folds — if one fold has much lower R² than others, those held-out compounds may be structurally dissimilar from the training set (applicability domain issue).

🔮 Predict Activity Tab

QSAR-based activity prediction for new compounds using the trained models

How to use Predict

  1. Load your training compounds and run SAR Analysis
  2. (Recommended) Visit the 🌲 Random Forest tab to build the RF model
  3. Navigate to 🔮 Predict Activity
  4. Enter SMILES of your new compounds (one per line, with optional name)
  5. Click Predict Activity

Prediction algorithm

If the RF model has been built (RF tab visited): 60% RF + 40% Ridge regression, plus fragment-based corrections from the moiety analysis and scaffold-average bias. If RF is not built: Ridge regression only with fragment corrections.
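The weighting described above can be written out as a short sketch. The 60/40 split and the Ridge-only fallback come from this manual; the function name and the single additive `correction` term are placeholders, because the manual does not specify how the fragment and scaffold corrections are calculated.

```python
def blended_prediction(ridge_pred, rf_pred=None, rf_weight=0.6, correction=0.0):
    """Blend RF and Ridge pActivity predictions, then add a correction term.

    If no RF prediction is available, fall back to Ridge alone.
    `correction` is a hypothetical stand-in for fragment/scaffold adjustments.
    """
    if rf_pred is None:
        base = ridge_pred
    else:
        base = rf_weight * rf_pred + (1.0 - rf_weight) * ridge_pred
    return base + correction


print(blended_prediction(ridge_pred=7.0, rf_pred=8.0))  # RF built
print(blended_prediction(ridge_pred=7.0))               # Ridge only
```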

Confidence score

Confidence combines model R² with a similarity estimate between the new compound and the training set. Higher similarity to training compounds = higher confidence.

  • ≥65%: Reliable prediction; compound is well within the applicability domain
  • 40–65%: Moderate; use with caution, compound may be somewhat novel
  • <40%: Low; compound is very different from the training set, so the prediction is speculative

Predicted activity levels

  • High Activity: Predicted pActivity in the top 33% of the training set
  • Moderate: Middle 33%
  • Low Activity: Bottom 33%
Critical disclaimer: These predictions are based on only ~9 physicochemical descriptors calculated from SMILES patterns. They cannot capture 3D binding interactions, conformational effects, or ADMET. Use predictions to prioritise synthesis — always confirm with experimental assays. A compound predicted as "High Activity" may still fail in the assay.

↓ Excel & PDF Export

Exporting your SAR analysis results for reporting and archiving

Excel Export (↓ Excel)

Downloads an .xlsx file containing:

  • Compounds sheet — Full SAR table with all descriptors, ranks, and activity values
  • Scaffold Analysis sheet — Per-scaffold statistics
  • Correlations sheet — Pearson r values for all descriptor–activity pairs
  • MMP sheet — All matched molecular pairs found

PDF Export (↓ PDF)

Downloads a multi-page .pdf report containing:

  • Cover page with dataset summary and key metrics
  • Top compounds with structures drawn
  • Scaffold analysis page
  • Correlations and physicochemical summary
  • MMP analysis findings
  • SAR Conclusions narrative
  • Prediction Results page — included automatically if you have run predictions in the Predict tab (compound cards with SMILES, predicted activity, confidence bar, and fragment annotations)
💡 For the most complete PDF report: (1) Run SAR Analysis, (2) Visit the RF tab to build the RF model, (3) Run predictions in the Predict tab, (4) Then click ↓ PDF. All sections will be populated.
ℹ The PDF is generated entirely in the browser using jsPDF — no data is sent to any server. Your compound structures and activity data remain private.
Gao Lab SAR Studio
Structure–Activity Relationship Engine v2
Matched Molecular Pair (MMP) analysis identifies analog pairs in the same scaffold where a small structural change (Δ14–80 Da) produces measurable activity shifts.