Using ChatGPT Advanced Data Analysis to predict and qualify leads
Executive overview
Einar Vollset, a partner at TinySeed and at a sell-side M&A investment bank, argues that the real AI opportunity for bootstrapped SaaS founders is not content generation but predictive lead qualification. By feeding labelled internal data and publicly observable company signals into ChatGPT's Advanced Data Analysis mode, a non-data-scientist can build a working ML classifier in under 45 minutes for roughly 17 cents. The core insight is that ChatGPT acts as both an execution engine (a Python interpreter) and an on-demand educator, collapsing work that previously required a dedicated data team into a single exploratory session. The approach Vollset uses to predict whether a SaaS business has 100K+ MRR can be repurposed by any founder to score leads, predict churn, optimise pricing, or surface upsell opportunities, all without halting feature development.
Why asking LLMs for lists fails
- LLMs are fundamentally bad at "give me all of X" questions; they are trained to produce plausible next tokens, not exhaustive catalogues.
- Asking ChatGPT for a list of B2B SaaS businesses with 2–10M ARR returns irrelevant sources (BizBuySell, Flippa) that don't serve the actual need.
- Platforms like Flippa are too late-stage: by the time a prospect is listed, the decision has already been made.
- The information simply does not exist in the training data — hallucinated lists are worse than no list.
- The lesson applies broadly: if a lead has already chosen a competitor, the window to influence them has closed.
The BuiltWith spend-estimate dead end
- BuiltWith reports which technologies are installed on any website and provides a rough monthly software-spend estimate.
- The intuition that higher spend implies higher ARR is reasonable, but in practice the heuristic is right only about half the time.
- A 50% error rate is unacceptable when enterprise outreach is costly: effort wasted on sub-threshold companies consumes resources that should go to qualified leads.
- Website design quality is equally unreliable as a revenue signal; some high-revenue businesses have ugly sites and vice versa.
- Raw heuristics need to be replaced by a trained model that combines multiple features.
ChatGPT Advanced Data Analysis: what it is and why it matters
- Advanced Data Analysis (formerly Code Interpreter) is available to paying ChatGPT users under GPT-4 settings.
- It combines GPT-4 with a live Python interpreter and the ability to upload files up to ~100 MB.
- Think of it as a cheerful, capable second-year university statistics student available on demand.
- The tool oscillates between executor (writes and runs code) and educator (explains results at whatever level you ask).
- Avoiding "prompt engineering writer's block" matters: just talk to it conversationally and let it teach you what you need to know.
Building a lead-scoring classifier in 45 minutes
- Start with two datasets: one with known MRR for a sample of companies (the labelled set), one with BuiltWith technology-stack data for the same companies.
- Upload both files and describe the problem in plain language; ChatGPT will outline the full ML pipeline — data prep, feature engineering, model selection, training, evaluation, deployment.
- Data preparation: ChatGPT detects missing values automatically (e.g., 48 missing MRR rows) and offers to remove or impute them — tasks that would otherwise take a day of Stack Overflow research.
- Feature engineering: technology lists are one-hot encoded into 2,161 binary features; the binary target variable (MRR > $100K) is created — another half-day of work collapsed into seconds.
- Model training: a Random Forest Classifier is selected and tuned to favour precision over recall, which is the right trade-off when false positives are expensive.
- Results: the prototype model achieves 100% precision on the evaluation set (every predicted positive is correct) with ~20% recall (misses 80% of true positives) — acceptable given the goal is to avoid wasted outreach, not to find every lead.
- A second model (SVM) can be trained and compared in one further exchange.
- At the end of the session, ChatGPT produces downloadable artefacts: the serialised model file, the feature-name encoder, and a Python script to run predictions locally.
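The steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the code ChatGPT generated in the session: the data is synthetic, the column names (`mrr`, `technologies`), the technology lists, and the 0.8 probability cut-off are all assumptions, and raising the cut-off is just one common way to favour precision over recall.

```python
import random

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Synthetic stand-in for the joined MRR + technology-stack data (all values invented).
random.seed(0)
COMMON = ["Stripe", "Google Analytics", "Intercom", "Mailchimp"]
ENTERPRISE = ["Salesforce", "Segment", "Marketo", "Okta"]
records = []
for i in range(400):
    is_large = random.random() < 0.3
    techs = random.sample(COMMON, 2)
    # Larger companies are more likely, but not certain, to run enterprise tools.
    techs += [t for t in ENTERPRISE if random.random() < (0.6 if is_large else 0.1)]
    mrr = random.uniform(100_001, 500_000) if is_large else random.uniform(1_000, 100_000)
    records.append({"company": f"co{i}", "mrr": mrr, "technologies": techs})
df = pd.DataFrame(records)
df.loc[df.index[:10], "mrr"] = float("nan")  # simulate rows with missing MRR

# Data preparation: drop rows whose label is missing.
df = df.dropna(subset=["mrr"]).reset_index(drop=True)

# Feature engineering: one binary column per technology, plus the binary target.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(df["technologies"])
y = (df["mrr"] > 100_000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Raising the probability cut-off predicts fewer positives, trading recall for precision.
proba = model.predict_proba(X_test)[:, 1]
results = {}
for threshold in (0.5, 0.8):
    predicted = (proba >= threshold).astype(int)
    results[threshold] = {
        "predicted_positives": int(predicted.sum()),
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
    }
    print(threshold, results[threshold])
```

On real data the numbers will differ from this toy set. The thing to compare is how precision and recall move as the cut-off rises, since each outreach to a false positive has a cost.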
What this actually replaces
- The traditional stack for this kind of analysis costs roughly $500K/year: data analysts, BI tooling, ETL pipelines, data lakes, and dedicated engineering time.
- Most bootstrappers have none of that — and can't pause feature development to build it.
- The same outcome now costs 45 minutes of exploration and ~17 cents of API compute.
- The output files are deterministic Python/sklearn artefacts rather than probabilistic LLM text, so the same input always yields the same prediction and the hallucination concern does not apply to the scoring step.
- Vollset's live deployment: running the model against ~5,000 of 40,000 SaaS companies in the database, prioritising outreach to those predicted above the 100K MRR threshold.
Applying the pattern to your own SaaS business
- Lead lookalike scoring: label your 200 best and 200 worst customers, gather publicly observable features (BuiltWith, job postings, LinkedIn headcount), and build a classifier to score inbound or outbound prospects.
- Churn prediction: use internal clickstream or engagement data as features; binary target is churned vs. retained.
- Free-to-paid conversion: model which trial behaviours predict conversion; act on the signal before the trial ends.
- Upsell identification: among active customers, predict which are most likely to be receptive to an upgrade offer.
- PPP loan data (US-specific): pandemic-era payroll-protection disclosures correlate with headcount and salary, providing a free training signal for company size.
- The key constraint is that training labels must come from data you already own, while the features used to score new prospects must be publicly obtainable.
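Scoring a new prospect depends on encoding its publicly observed features in exactly the column order the model was trained on. The sketch below shows one way to do that for a technology list. The function names, the feature list, and the prospect are hypothetical, and the case-insensitive matching is an assumption about how the data would be cleaned.

```python
def encode_prospect(technologies, feature_names):
    """Return a 0/1 vector aligned to the training feature order."""
    present = {t.strip().lower() for t in technologies}
    return [1 if name.lower() in present else 0 for name in feature_names]


def unseen_technologies(technologies, feature_names):
    """List technologies the model never saw, so they can be reviewed rather than silently dropped."""
    known = {name.lower() for name in feature_names}
    return sorted(t for t in technologies if t.strip().lower() not in known)


# Hypothetical feature order, as it would be saved alongside the trained model.
feature_names = ["Intercom", "Salesforce", "Segment", "Stripe"]
prospect = ["Stripe", "Salesforce", "HubSpot"]

vector = encode_prospect(prospect, feature_names)
print(vector)                                        # [0, 1, 0, 1]
print(unseen_technologies(prospect, feature_names))  # ['HubSpot']
```

The resulting vector is what would be passed to the saved classifier's prediction call. A prospect with many unseen technologies is a hint that the training set may not cover that kind of company well.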
Pricing analysis as a second example
- Pricing consultants (e.g., ProfitWell) charge ~$150K per engagement because the insight is genuinely valuable.
- ChatGPT can teach demand-curve theory interactively — elasticity, shifts, revenue-maximising price points — and generate visual demand and revenue curves on demand.
- More usefully, it will produce a template spreadsheet specifying exactly which data to collect (price points, units sold, revenue, time period) to perform a real pricing analysis on your own business.
- The practical path: fill in the template with your actual data, feed it back to Advanced Data Analysis, and get a pricing recommendation, replacing a six-figure consultant engagement.
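To make the demand-curve idea concrete, here is a minimal sketch that fits a straight-line demand curve to price and units-sold observations and finds the price that maximises revenue. The linear form is a simplifying assumption, the data points are invented, and the function names are hypothetical; a real analysis would need to check whether a straight line fits at all.

```python
def fit_linear_demand(prices, units):
    """Ordinary least squares fit of units = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_u = sum(units) / n
    sxx = sum((p - mean_p) ** 2 for p in prices)
    sxy = sum((p - mean_p) * (u - mean_u) for p, u in zip(prices, units))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_p
    return intercept, slope


def revenue_maximising_price(intercept, slope):
    """Revenue p * (intercept + slope * p) peaks at p = -intercept / (2 * slope) when slope < 0."""
    if slope >= 0:
        raise ValueError("demand does not fall with price, so a linear model has no revenue peak")
    return -intercept / (2 * slope)


# Invented observations that follow units = 200 - 2 * price exactly.
prices = [29, 49, 79, 99]
units = [142, 102, 42, 2]

intercept, slope = fit_linear_demand(prices, units)
best_price = revenue_maximising_price(intercept, slope)
best_revenue = best_price * (intercept + slope * best_price)
print(intercept, slope, best_price, best_revenue)  # 200.0 -2.0 50.0 5000.0
```

Here the fitted peak lies inside the range of observed prices. If it fell outside that range, the result would be an extrapolation and should be treated with more caution.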
Data privacy and security considerations
- Uploading data to ChatGPT is functionally equivalent to sending data to any third-party cloud service (S3, etc.).
- The feeling of risk is psychological — it feels more personal because there appears to be an intelligence reading the data.
- Opt out of training-data use in the ChatGPT settings so that uploaded data is not used to train future models.
- For regulated industries, Azure OpenAI offers enterprise-grade access to the same OpenAI models with HIPAA and SOC 2 compliance.
Recommendations for building future-ready data infrastructure
- Start recording significantly more event-level data now, even if you have no immediate use for it — storage is cheap and future models will benefit.
- However, the highest-value training sets combine internal labels (you know who is a good customer) with external, publicly observable features (so the model generalises to prospects you haven't met yet).
- Fancy infrastructure (Redshift, data lakes) is helpful but not required for early-stage exploratory work — a CSV and Advanced Data Analysis will get you surprisingly far.
- The bottleneck is not technology; it is identifying which hidden variable, if known, would most improve your business decisions.