🚀 Quick Start Guide
Get from zero to a full SAR analysis in under 2 minutes
Step 1 — Choose your activity metric
In the left panel, select the measurement type that matches your assay data from the dropdown. Most common choices:
- IC50 (nM): Half-maximal inhibitory concentration in nanomolar; most common for enzyme and receptor assays
- EC50 (nM): Half-maximal effective concentration; common for cell-based and agonist assays
- % Inhibition: Single-concentration inhibition percentage (0–100%)
- pIC50: −log₁₀(IC50 in M); already log-transformed, use this if your data are already in this format
Step 2 — Enter your compounds
Paste SMILES strings with activity values into the text area, one compound per line. You can also drag-and-drop a CSV file or use the Manual Entry fields below the text area.
# Format: SMILES, Name (optional), Activity value
c1ccc(F)cc1NC(=O)CCC, Cpd-001, 85
c1ccc(Cl)cc1NC(=O)CCC, Cpd-002, 15
c1ccc(Br)cc1NC(=O)CCC, Cpd-003, 340
Step 3 — Run analysis
Click ▶ Run SAR Analysis. All tabs populate simultaneously. Navigate between tabs to explore different aspects of your dataset.
Step 4 — Advanced features (optional)
For machine learning features, visit the tabs in order: 🔬 Similarity → ⚡ Act. Cliffs → 🌲 Random Forest → 📊 Cross-Validation. Then use 🔮 Predict Activity with new SMILES to get predictions.
💡 Tip: SAR Studio works best with 10–100 compounds. With fewer than 5 compounds, ML features are disabled. With fewer than 3, SAR analysis is limited.
⚠ Limitations: Descriptors are calculated from SMILES pattern matching, not a full cheminformatics engine. Results are for hypothesis generation and prioritisation — always verify with experimental data.
📥 SMILES Input Formats
Everything you need to know about entering compound data correctly
What is SMILES?
SMILES (Simplified Molecular Input Line Entry System) is a text notation for chemical structures. Each atom, bond, and ring is encoded as characters. Most chemistry databases (ChEMBL, PubChem, Reaxys, SciFinder) can export SMILES directly.
Accepted input formats
Format A — With name (recommended):
SMILES, Name, ActivityValue
c1ccc(F)cc1NC(=O)CCC, Fluorobenzamide-1, 23.5
c1ccc(Cl)cc1NC(=O)CCC, Chlorobenzamide-2, 8.1
Format B — Without name (auto-numbered):
SMILES, ActivityValue
c1ccc(F)cc1NC(=O)CCC, 23.5
c1ccc(Cl)cc1NC(=O)CCC, 8.1
Format C — CSV file (drag and drop):
smiles,name,ic50_nM # header row is auto-detected
c1ccc(F)cc1NC(=O)CCC,Cpd-1,23.5
c1ccc(Cl)cc1NC(=O)CCC,Cpd-2,8.1
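The three formats above can all be read with one small parser. The following Python sketch is only an illustration, not SAR Studio's own parser: the function name `parse_compound_lines`, the header-detection rule (skip any row whose last field is not a number), and the auto-numbering pattern `Cpd-N` are assumptions.

```python
def parse_compound_lines(text):
    """Parse 'SMILES, Name, Activity' or 'SMILES, Activity' lines.

    Hypothetical helper: the skipping rules and 'Cpd-N' names are assumptions.
    """
    compounds = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and full-line comments. A '#' inside a SMILES
        # string is a triple bond, so inline '#' is never treated as a comment.
        if not line or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2:
            continue  # no activity value on this line
        try:
            activity = float(fields[-1])
        except ValueError:
            continue  # header row or malformed line
        # Format A has three fields; Format B has two and is auto-numbered.
        name = fields[1] if len(fields) >= 3 else f"Cpd-{len(compounds) + 1}"
        compounds.append({"smiles": fields[0], "name": name, "activity": activity})
    return compounds
```

Calling `parse_compound_lines` on pasted text returns a list of dictionaries with the keys `smiles`, `name`, and `activity`; a CSV header row is dropped because its last field cannot be converted to a number.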
SMILES notation tips
- Lowercase letters (c, n, o) = aromatic atoms
- Uppercase letters (C, N, O) = aliphatic atoms
- Numbers indicate ring closures: C1CCCCC1 = cyclohexane
- Branches in parentheses: CC(=O)N = acetamide
- Double bonds: =, triple bonds: #
- Stereo: @ or @@ in brackets, / and \ for double bonds
Common SMILES examples
- Benzene: c1ccccc1
- Pyridine: c1ccncc1
- Piperazine: C1CNCCN1
- Morpholine: C1COCCN1
- Amide: CC(=O)NC
- Sulfonamide: CS(=O)(=O)N
🔗 SMILES sources: PubChem (pubchem.ncbi.nlm.nih.gov) and ChEMBL (ebi.ac.uk/chembl) are free databases. Structure editors such as ChemDraw (draw structure → Copy as SMILES) or MarvinSketch can also export SMILES.
Activity values
- For IC50/EC50/Ki in nM: enter the raw number, e.g. 45 or 0.5
- For µM values: select the µM metric and enter e.g. 0.045
- For % Inhibition: enter 0–100, e.g. 78
- For pIC50: enter the negative log value, e.g. 7.3
⚠ All compounds in one session must use the same metric. Mixed units (some nM, some µM) will give wrong results.
📊 Overview Tab
High-level summary statistics and top compound insights at a glance
Metric cards (top row)
- Compounds: Total number of loaded structures
- Scaffolds: Unique core ring systems detected
- Best Activity: Most potent compound (lowest IC50 or highest inhibition)
- Avg MW: Mean molecular weight of the dataset
- Ro5 Pass: % of compounds passing Lipinski's Rule of Five (MW ≤500, cLogP ≤5, HBD ≤5, HBA ≤10)
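The Ro5 check behind this card can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the app's code: the function names are hypothetical, and "passing" is taken here to mean zero violations (Lipinski's original formulation tolerates one violation, and the guide does not say which convention the app uses).

```python
def ro5_violations(mw, clogp, hbd, hba):
    """Count violations of Lipinski's Rule of Five using the limits listed above."""
    checks = [mw > 500, clogp > 5, hbd > 5, hba > 10]
    return sum(checks)  # True counts as 1


def ro5_pass_percent(compounds):
    """Percentage of compounds with zero violations (assumed definition of 'pass').

    compounds: list of (mw, clogp, hbd, hba) tuples.
    """
    if not compounds:
        return 0.0
    passed = sum(1 for c in compounds if ro5_violations(*c) == 0)
    return 100.0 * passed / len(compounds)
```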
Top compounds panel
The ranked compound list shows each structure (drawn from SMILES), its activity, key physicochemical properties, and Lipinski compliance. Green = good drug-likeness, amber = borderline, red = violation.
What to look for
- Is there a wide spread in activity? A large dynamic range makes SAR easier to interpret
- Do the most active compounds share a scaffold? This suggests a privileged scaffold
- Are the best compounds also Ro5-compliant? High potency + poor drug-likeness may be a problem
💡 The pActivity (pAct) column is the internal normalised scale used for all ML calculations: pAct = −log₁₀(IC50 in M). Higher pAct = more potent.
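As a worked example of the formula above: an IC50 in nM is first converted to molar (multiply by 10⁻⁹), so pAct = 9 − log₁₀(IC50 in nM). The Python sketch below shows only this nM case; the function name is hypothetical, and how the app normalises other metrics such as % Inhibition is not covered here.

```python
import math


def p_activity_from_nm(ic50_nm):
    """pActivity = -log10(IC50 in M), for an IC50 given in nM."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    # IC50 [M] = IC50 [nM] * 1e-9, so -log10(IC50 [M]) = 9 - log10(IC50 [nM])
    return 9.0 - math.log10(ic50_nm)
```

For instance, 1000 nM (1 µM) gives pAct = 6 and 1 nM gives pAct = 9, so each 10× gain in potency adds one pAct unit.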
📋 SAR Table Tab
Full sortable compound table with all descriptors and activity data
Columns explained
- Rank — Activity rank (1 = most potent)
- Structure — 2D structure rendered from SMILES
- Name/ID — Your compound identifier
- Activity — Raw activity value in your chosen metric
- pActivity — Log-normalised activity (higher = better)
- MW — Molecular weight (Da). Drug-like range: 150–500
- cLogP — Calculated lipophilicity. Drug-like range: −1 to 5
- TPSA — Topological polar surface area (Ų). Oral bioavailability: <140; CNS penetration: <90
- HBD / HBA — H-bond donors / acceptors. Lipinski limits: ≤5 / ≤10
- Ro5 — Number of Lipinski rule violations (0 = fully compliant)
- Scaffold — Core ring system type
Filtering and sorting
Use the filter bar above the table to search by name, or filter by scaffold type. Click any column header to sort. Use this to quickly find your best compounds or spot outliers.
💡 Sort by pActivity descending to see your most potent compounds first. Then sort by MW or cLogP to check if potency correlates with size or lipophilicity (ligand efficiency analysis).
🏗 Scaffolds Tab
Ring system classification and scaffold–activity relationship analysis
What is a scaffold?
In medicinal chemistry, the "scaffold" is the core ring system shared by a series of compounds. Scaffold analysis reveals which ring systems are associated with high activity — these are called privileged scaffolds.
How scaffolds are detected
The app matches SMILES patterns against a library of common heterocyclic and carbocyclic ring systems (quinazoline, pyrimidine, indole, piperazine, etc.). If no match is found, the compound is classified as Bicyclic, Monocyclic, or Acyclic based on ring count.
Scaffold cards
Each scaffold card shows:
- Number of compounds with this scaffold
- Average pActivity for the group
- Best (most potent) compound in the group
- Activity range (min to max)
What to look for
- Which scaffold has the highest average activity? That's your most promising series
- Large activity range within one scaffold = good SAR diversity — fine-tuning is possible
- Small activity range within one scaffold = the scaffold is the key driver, not the substituents
🔬 Med chem context: Scaffold-hopping (switching from one ring system to another while maintaining activity) is a common strategy to escape IP, improve ADMET, or overcome resistance.
📈 Charts Tab
Visual exploration of descriptor–activity relationships
Available charts
- Activity Distribution — Histogram of pActivity values. A bell-shaped distribution is ideal. A bimodal distribution may suggest two distinct series
- cLogP vs Activity — Scatter plot. Positive correlation = more lipophilic compounds are more potent (common for membrane targets). Watch out for this driving Ro5 violations
- MW vs Activity — Scatter plot. Potency that increases only with MW is a red flag — this is "molecular obesity" and predicts poor ADMET
- TPSA Distribution — Bar chart of polar surface area. Important for permeability predictions
- Scaffold Activity Comparison — Bar chart comparing average pActivity by scaffold. Identifies privileged scaffolds visually
- Lipinski Compliance — Pie chart of Ro5 violations across the dataset
Ligand efficiency (LE)
The MW vs Activity chart implicitly shows ligand efficiency (LE = pActivity / heavy atom count), with MW acting as a proxy for heavy atom count. Ideal: high potency at low MW. Compounds with high MW and low potency are inefficient and carry a higher risk of ADMET problems.
💡 Hover over any data point to see the compound name and exact values. Use these charts in team meetings to communicate SAR trends visually.
🔄 MMP Analysis Tab
Matched Molecular Pair analysis — identifying which structural changes drive activity
What is an MMP?
A Matched Molecular Pair (MMP) is two compounds that differ by exactly one structural transformation — for example, H→F at a specific position, or NH₂→OCH₃ on a ring. By comparing the activity of the two compounds, you can directly measure the effect of that single change on potency.
How it works in SAR Studio
The app compares all compound pairs and identifies those that differ by a single fragment from a predefined transformation library (halogen substitutions, N-methylation, carboxylate→amide, etc.). For each transformation found, it reports:
- Which compounds were compared
- What structural change was made
- The activity difference (Δ pActivity and fold change)
- Whether the change is beneficial or detrimental
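The reported quantities follow directly from the two pActivity values. The Python sketch below illustrates the arithmetic only; the function name and the returned labels are assumptions, and it does not reproduce the app's pair detection.

```python
def mmp_effect(p_act_before, p_act_after):
    """Return (delta pActivity, fold change, direction) for one matched pair."""
    delta = p_act_after - p_act_before
    # One pActivity unit is one log10 unit, i.e. a 10x change in potency.
    fold = 10 ** abs(delta)
    if delta > 0:
        direction = "beneficial"
    elif delta < 0:
        direction = "detrimental"
    else:
        direction = "neutral"
    return delta, fold, direction
```

A pair going from pActivity 6.0 to 7.0 gives Δ pActivity = +1.0, a 10-fold change, labelled beneficial.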
What to look for
- Beneficial: Large positive Δ pActivity = this transformation improves potency; prioritise it in synthesis
- Detrimental: Large negative Δ pActivity = avoid this transformation
- Consistent direction across multiple pairs = high confidence SAR signal
🔬 Med chem context: MMP analysis is one of the most powerful tools for lead optimisation. A consistent +1 log unit gain (10×) from H→F substitution across 3+ pairs is an actionable SAR rule.
⚠ The app's MMP detection uses SMILES substring matching, not formal bond-cutting algorithms. It may miss some pairs or produce false positives for complex structures. Treat results as hypotheses to verify.
🔗 Correlations Tab
Pearson correlation between physicochemical descriptors and biological activity
What is Pearson correlation (r)?
The Pearson r coefficient measures the linear relationship between two variables. It ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation). r = 0 means no linear relationship.
- |r| ≥ 0.7: Strong correlation, likely a real SAR driver
- 0.4 ≤ |r| < 0.7: Moderate, suggestive but not definitive
- 0.2 ≤ |r| < 0.4: Weak, may be noise, especially in small datasets
- |r| < 0.2: No meaningful linear relationship
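For readers who want to reproduce a value by hand, the Python sketch below computes Pearson r from two lists and applies the bands listed above. The function names and the short band labels are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r):
    """Map r onto the bands used in this guide (by absolute value)."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "none"
```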
Descriptors correlated
- cLogP ↔ Activity — Lipophilicity drives potency for many targets (positive r = more lipophilic = more active)
- MW ↔ Activity — High correlation suggests "molecular obesity" if MW is high
- TPSA ↔ Activity — Negative r is expected for CNS targets (high TPSA = poor BBB penetration)
- HBD ↔ Activity — High HBD often reduces membrane permeability
- Rotatable bonds ↔ Activity — High rotatable bond count reduces oral bioavailability (Veber's rule: ≤10; this limit is not part of Lipinski's Ro5)
⚠ Correlation ≠ causation. A strong correlation between MW and activity may simply reflect the fact that larger compounds have more hydrophobic surface area — the real driver may be cLogP, not MW. Always cross-reference with MMP and Conclusions tabs.
💡 Conclusions Tab
Auto-generated SAR interpretation cards summarising key findings
What the Conclusions tab does
This tab synthesises the data from all other analyses into human-readable SAR conclusions. It automatically generates insight cards for:
- Physicochemical drivers — Which descriptor (cLogP, MW, TPSA, etc.) has the strongest correlation with activity, and what that means for your series
- Moiety analysis — Which structural fragments (e.g. fluorine, sulfonamide, carboxylic acid) are statistically associated with higher or lower activity
- Scaffold conclusions — Which core ring system is most advantageous
Reading the moiety cards
Each moiety card compares two groups: compounds containing the fragment vs compounds without it. The % change in average pActivity is reported:
- ✅ Favorable: Fragment is associated with ≥12% higher average pActivity; retain or expand this group
- ❌ Unfavorable: Fragment is associated with ≥12% lower average pActivity; consider replacing
- ~ Neutral: Less than 12% difference; fragment has little impact on potency
- ⬡ Universal: Present in all compounds; cannot assess contribution without a comparison group
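The card logic can be sketched as follows in Python. This is a minimal illustration, assuming the % change is measured relative to the mean pActivity of the compounds without the fragment (the guide does not state the reference group); the function name and return format are hypothetical.

```python
def classify_moiety(p_with, p_without, threshold_pct=12.0):
    """Classify a fragment from pActivity values of compounds with/without it.

    Returns (label, percent_change). The percent change is relative to the
    'without' group mean, which is an assumption of this sketch.
    """
    if not p_with:
        raise ValueError("no compounds contain this fragment")
    if not p_without:
        return "universal", None  # no comparison group
    mean_with = sum(p_with) / len(p_with)
    mean_without = sum(p_without) / len(p_without)
    pct = 100.0 * (mean_with - mean_without) / mean_without
    if pct >= threshold_pct:
        label = "favorable"
    elif pct <= -threshold_pct:
        label = "unfavorable"
    else:
        label = "neutral"
    return label, pct
```

Because pActivity is a log scale, a 12% change in its mean is a large effect: moving the average from 6.0 to 6.72 corresponds to roughly a 5-fold change in potency.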
💡 These conclusions are the most directly actionable output for medicinal chemistry decision-making. Share this tab with your team to align on synthesis priorities.
🔬 Similarity Tab
Morgan fingerprint-based Tanimoto similarity matrix with hierarchical clustering
What are molecular fingerprints?
A molecular fingerprint is a binary bit vector that encodes which structural features are present in a molecule. SAR Studio uses circular (Morgan-style) fingerprints with radius 2 and 512 bits — similar to ECFP4 used in industry QSAR workflows.
Each bit represents a circular neighbourhood of atoms up to 2 bonds away from a given atom. If that substructure exists in the molecule, the bit is set to 1.
Tanimoto similarity
Given two fingerprints A and B, Tanimoto similarity = (bits set in both A and B) ÷ (bits set in A or B). Range: 0 (completely different) to 1 (identical).
- ≥ 70%: High similarity, compounds are close structural analogues
- 40–70%: Moderate, related scaffolds or common fragments
- < 40%: Low, structurally diverse
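The Tanimoto formula is easy to verify on small examples. The Python sketch below represents each fingerprint as the set of its "on" bit positions; this representation and the function name are assumptions for illustration, and the fingerprint generation itself is not shown.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)
```

Two fingerprints with bits {1, 2, 3, 4} and {3, 4, 5, 6} share 2 bits out of 6 set in either, giving 2 ÷ 6 ≈ 0.33 (33%), which falls in the low-similarity band.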
Hierarchical clustering
The heatmap reorders compounds so that similar ones appear adjacent, revealing clusters of related structures. Each cluster likely represents a chemical series.
What to look for
- Clear cluster blocks = distinct chemical series in your dataset
- All compounds with high similarity = low chemical diversity (limited SAR exploration)
- Very low similarity overall = chemically diverse set (good for building a broad model, harder for optimising one series)
🔬 Industry context: Tanimoto ≥ 0.4 is often used as the threshold for "structurally similar" in patent analysis and compound clustering. The Bemis-Murcko scaffold approach is complementary.
⚡ Activity Cliffs Tab
Pairs of similar compounds with large activity differences — the richest SAR signals
What is an activity cliff?
An activity cliff occurs when two structurally similar compounds (high Tanimoto similarity) have dramatically different biological activities. These cliffs are particularly informative because a small structural change causes a large potency jump — pinpointing the exact pharmacophoric feature responsible.
Detection thresholds used
- Structural similarity: Tanimoto ≥ 0.35 (similar enough to be meaningful)
- Activity difference: ΔpActivity ≥ 0.7 (equivalent to ≥5× difference in IC50)
- ⚡ Major cliff: ΔpActivity ≥ 2.0 (≥100× potency difference), the highest-value SAR signal
- ⚡ Moderate cliff: 0.7 ≤ ΔpActivity < 2.0 (5–100× difference)
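Applying these thresholds to every compound pair can be sketched as below. This Python example is an illustration only: the data layout (name, set of fingerprint bits, pActivity), the function names, and the set-based Tanimoto helper are assumptions, not the app's internal code.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity on sets of 'on' bit indices (assumed representation)."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def find_activity_cliffs(compounds, min_sim=0.35, min_delta=0.7, major_delta=2.0):
    """Return cliff pairs as (name1, name2, similarity, delta, kind).

    compounds: list of (name, fingerprint_bit_set, p_activity).
    """
    cliffs = []
    for (n1, fp1, p1), (n2, fp2, p2) in combinations(compounds, 2):
        sim = tanimoto(fp1, fp2)
        delta = abs(p1 - p2)
        if sim >= min_sim and delta >= min_delta:
            kind = "major" if delta >= major_delta else "moderate"
            cliffs.append((n1, n2, sim, delta, kind))
    return cliffs
```

Every pair is compared once, so the work grows with the square of the number of compounds; for the 10–100 compounds the guide recommends this is negligible.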
How to act on activity cliffs
- Identify the structural difference between the cliff pair (use the SMILES shown on each card)
- Design 2–3 analogues that probe this specific change to confirm the trend
- If the cliff is consistent across multiple pairs, it is a robust SAR rule
- If only one pair shows the cliff, it may be a noise artefact — verify experimentally
🔬 Why cliffs matter: Activity cliffs are where the most dramatic improvements in potency can be achieved with the fewest structural changes — ideal for lead optimisation where synthetic effort is limited.
🌲 Random Forest Tab
Non-linear ensemble QSAR model with permutation feature importance
What is a Random Forest?
A Random Forest is an ensemble of decision trees, each trained on a random bootstrap sample of the data using a random subset of features at each split. The final prediction averages all tree outputs. This approach:
- Captures non-linear descriptor–activity relationships that linear regression misses
- Is more robust to outliers than a single decision tree
- Reduces overfitting through the averaging of many trees
- Provides feature importance estimates without requiring a separate analysis
SAR Studio RF settings
- 60 trees, max depth 5, minimum 2 samples per leaf
- Features: cLogP, MW, TPSA, HBD, HBA, rotatable bonds, aromatic rings, halogens, heavy atom count
- Feature subset per split: √9 = 3 features (√p is the usual default for classification forests; the common regression default of p/3 also gives 3 here)
Feature importance (permutation)
Each bar shows how much the model's error increases when that feature is randomly shuffled (breaking its relationship with activity). Higher % = more important. This is more reliable than impurity-based importance used in some implementations.
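The shuffling procedure can be sketched in plain Python as follows. This is a generic illustration of permutation importance, not the app's implementation: the function names, the use of mean squared error, the number of repeats, and the fixed seed are all assumptions, and `model` stands for any callable that maps a feature row to a prediction.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions over rows X against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    X: list of feature rows (each a list); y: list of target values.
    """
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model never uses gets an importance of zero, because shuffling it leaves every prediction unchanged. To display percentages, one option is to divide each value by the sum of all importances.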
Predicted vs Actual scatter plot
Points close to the diagonal red dashed line = good fit. Points far from the diagonal = compounds the model predicts poorly (often outliers or compounds with unusual activity driven by features not captured by these descriptors).
⚠ The training R² shown here is optimistic — it uses the same data the model was trained on. Always check the Cross-Validation tab for an honest estimate of predictive performance on unseen compounds.
💡 Once the RF tab has been visited and the model built, the 🔮 Predict tab automatically blends RF (60%) with Ridge regression (40%) for better predictions.
📊 Cross-Validation Tab
Honest estimate of model generalisation performance using k-fold CV
Why cross-validation matters
Training R² is almost always misleadingly high — any model can memorise its training data. Cross-validation (CV) gives you an honest estimate of how well your model will predict new, unseen compounds by systematically testing on held-out data.
How k-fold CV works
The dataset is randomly divided into k folds of roughly equal size (SAR Studio uses k=5, or fewer if the dataset is small). In each iteration:
- 1 fold is held out as the test set
- The remaining k−1 folds are used to train the model
- Predictions are made on the held-out fold
- This repeats k times until every compound has been predicted exactly once
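The splitting steps above can be sketched in plain Python. This is a minimal illustration: the function name, the fixed seed, and the rule for reducing k on small datasets (cap k at the number of compounds) are assumptions, since the guide does not state the exact rule.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is held out exactly once."""
    if n < 2:
        raise ValueError("need at least 2 compounds for cross-validation")
    k = max(2, min(k, n))  # assumed rule for small datasets
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal shuffled indices into k folds whose sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each yielded pair gives the row indices to train on and the indices to predict; collecting the held-out predictions over all folds yields one prediction per compound.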
Metrics reported
- CV-R² — Fraction of activity variance explained by the model on test data. Higher = better generalisation
- CV-RMSE — Root mean square error in pActivity units. RMSE = 0.5 means predictions are off by ~3× on average; RMSE = 1.0 means ~10×
- CV-R² ≥ 0.7: Good. The model generalises well; predictions are reliable for prioritisation
- 0.4 – 0.7: Moderate. Useful, but add more compounds or consider 3D descriptors
- < 0.4: Poor. The model is likely overfitting or the dataset is too small or too diverse
- RF >> Ridge: Suggests non-linear relationships; RF is the better model for this dataset
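Both metrics can be computed from the pooled held-out predictions. The Python sketch below is an illustration with a hypothetical function name; whether the app pools predictions across folds or averages per-fold values is not stated, so pooling is an assumption here.

```python
import math


def cv_metrics(y_true, y_pred):
    """Return (R2, RMSE, typical fold error) for held-out predictions.

    The fold error 10**RMSE translates an RMSE in pActivity (log10) units
    into a multiplicative error on the IC50 scale.
    """
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all activities are identical")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse, 10 ** rmse
```

Unlike training R², this value can be negative: that happens when the model predicts held-out compounds worse than simply guessing the mean activity.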
RF vs Ridge — which to use?
Both models are run in parallel. If RF CV-R² is substantially higher than Ridge, non-linear descriptor effects dominate and RF should be trusted more. If they are similar, either the descriptor–activity relationship is close to linear or the dataset is too small for RF to outperform the simpler linear model.
💡 A per-fold CV bar chart lets you spot problematic folds — if one fold has much lower R² than others, those held-out compounds may be structurally dissimilar from the training set (applicability domain issue).
🔮 Predict Activity Tab
QSAR-based activity prediction for new compounds using the trained models
How to use Predict
- Load your training compounds and run SAR Analysis
- (Recommended) Visit the 🌲 Random Forest tab to build the RF model
- Navigate to 🔮 Predict Activity
- Enter SMILES of your new compounds (one per line, with optional name)
- Click Predict Activity
Prediction algorithm
If the RF model has been built (RF tab visited): 60% RF + 40% Ridge regression, plus fragment-based corrections from the moiety analysis and scaffold-average bias. If RF is not built: Ridge regression only with fragment corrections.
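The weighting step alone can be written as below. This Python sketch covers only the 60/40 blend and the Ridge-only fallback described above; the function name is hypothetical, and the fragment-based corrections and scaffold-average bias are left out because their formulas are not given in this guide.

```python
def blended_prediction(ridge_pred, rf_pred=None, rf_weight=0.6):
    """Blend RF and Ridge pActivity predictions; fall back to Ridge alone.

    Fragment and scaffold corrections are intentionally omitted (not specified).
    """
    if rf_pred is None:  # RF model has not been built
        return ridge_pred
    return rf_weight * rf_pred + (1.0 - rf_weight) * ridge_pred
```

With a Ridge prediction of 6.0 and an RF prediction of 7.0, the blend is 0.6 × 7.0 + 0.4 × 6.0 = 6.6.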
Confidence score
Confidence combines model R² with a similarity estimate between the new compound and the training set. Higher similarity to training compounds = higher confidence.
- ≥65%: Reliable prediction; compound is well within the applicability domain
- 40–65%: Moderate; use with caution, compound may be somewhat novel
- <40%: Low; compound is very different from the training set and the prediction is speculative
Predicted activity levels
- High Activity: Predicted pActivity in the top 33% of the training set
- Moderate: Middle 33%
- Low Activity: Bottom 33%
⚠ Critical disclaimer: These predictions are based on only ~9 physicochemical descriptors calculated from SMILES patterns. They cannot capture 3D binding interactions, conformational effects, or ADMET. Use predictions to prioritise synthesis — always confirm with experimental assays. A compound predicted as "High Activity" may still fail in the assay.
↓ Excel & PDF Export
Exporting your SAR analysis results for reporting and archiving
Excel Export (↓ Excel)
Downloads an .xlsx file containing:
- Compounds sheet — Full SAR table with all descriptors, ranks, and activity values
- Scaffold Analysis sheet — Per-scaffold statistics
- Correlations sheet — Pearson r values for all descriptor–activity pairs
- MMP sheet — All matched molecular pairs found
PDF Export (↓ PDF)
Downloads a multi-page .pdf report containing:
- Cover page with dataset summary and key metrics
- Top compounds with structures drawn
- Scaffold analysis page
- Correlations and physicochemical summary
- MMP analysis findings
- SAR Conclusions narrative
- Prediction Results page — included automatically if you have run predictions in the Predict tab (compound cards with SMILES, predicted activity, confidence bar, and fragment annotations)
💡 For the most complete PDF report: (1) Run SAR Analysis, (2) Visit the RF tab to build the RF model, (3) Run predictions in the Predict tab, (4) Then click ↓ PDF. All sections will be populated.
ℹ The PDF is generated entirely in the browser using jsPDF — no data is sent to any server. Your compound structures and activity data remain private.