Building Flurry: How We Used Agents to Democratize Data Access at Fullscript

At Fullscript, we’ve put a lot into our data platform over the years. Most of the core data we need to understand the business lives in our Snowflake Enterprise Data Warehouse: orders, patients, practitioners, fulfillment, financials, and more. But strong infrastructure does not automatically translate into easy access.
Until recently, if someone at Fullscript had a data question, the usual paths were familiar: find the right Looker report, write SQL against Snowflake, ask the data team for help, or use another tool.
That works, but only up to a point. Most people should not need to learn SQL to answer routine business questions, and the data team's time is better spent on work that actually requires deeper analytical judgment. We saw a steady stream of requests that were important but not especially complex: the kind of questions that should be answerable through self-serve access.
That led to a pretty practical question: how do we make it possible for someone to ask a business question in plain English and get back a useful answer without needing SQL or waiting on the data team?
The Core Insight: Business Context is the Gap
When we first explored this problem, the obvious idea was to give an LLM access to Snowflake and see what happened. That sounds reasonable because the models are already good at writing SQL.
The limitation is that schema is not the same thing as business context. Take AOPA, Average Ordering Patients per Account. It is one of the ways we think about engagement because it reflects how many unique patients are actually ordering through their practitioner’s account. But it is not a field sitting in a table somewhere. It has to be calculated as unique ordering patients divided by unique ordering accounts, and that only works if the agent understands the definition we use.
Without that context, an agent can still generate SQL, but it is forced to guess. The query may look fine and still be wrong in ways that are hard to catch.
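To make the definition concrete, here is a minimal sketch of the AOPA calculation over invented sample rows (the column names mirror the warehouse convention used elsewhere in this post; the data itself is made up):

```python
# Illustrative only: AOPA over a toy set of order rows.
# PATIENT_SK / ACCOUNT_SK mirror the warehouse naming; the rows are invented.
orders = [
    {"PATIENT_SK": 1, "ACCOUNT_SK": 10},
    {"PATIENT_SK": 2, "ACCOUNT_SK": 10},
    {"PATIENT_SK": 3, "ACCOUNT_SK": 11},
    {"PATIENT_SK": 3, "ACCOUNT_SK": 11},  # a repeat order adds no new patient
]

def aopa(rows):
    """Average Ordering Patients per Account:
    distinct ordering patients / distinct ordering accounts."""
    patients = {r["PATIENT_SK"] for r in rows}
    accounts = {r["ACCOUNT_SK"] for r in rows}
    return len(patients) / len(accounts) if accounts else 0.0

print(aopa(orders))  # 3 patients / 2 accounts = 1.5
```

The point is that nothing in the schema tells the agent to deduplicate patients and accounts this way; that rule has to be supplied as business context.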
That was one of the main design constraints for Flurry. We were not trying to solve SQL generation in the abstract. We were trying to teach the system how Fullscript defines its business and metrics.
How We Construct the Agent Prompt
The agent's system prompt is assembled at startup from multiple sources using a template system. The base prompt contains placeholders that get filled with curated content:
```markdown
# base_prompt.md

You are an expert data analyst that provides thorough,
insightful analysis of healthcare practitioner data...

## Available Schemas
{{SCHEMA_DESCRIPTIONS}}

## Semantic Models
{{SEMANTIC_MODEL_DESCRIPTIONS}}

## Business Definitions
{{BUSINESS_DEFINITIONS}}

## Current Date Context
{{CURRENT_DATE_CONTEXT}}
```
Each placeholder pulls from a directory of markdown files organized by type:
```
prompts/
├── base_prompt.md        # Main agent instructions
├── definitions/          # Business glossary
│   ├── metrics.md        # Revenue, funnel, North Star definitions
│   └── milestones.md
├── models/               # Per-model descriptions
│   ├── ORDERS.md
│   ├── COMMERCIAL_FUNNEL.md
│   ├── FINANCIAL_MODEL.md
│   ├── NORTH_STAR.md
│   ├── PATIENTS.md
│   ├── PRACTITIONERS.md
│   └── ...
└── schemas/              # Schema-level documentation
    ├── CORE.md
    └── ...
```
The {{BUSINESS_DEFINITIONS}} placeholder injects a glossary of metrics, acronyms, and business terms that don't exist as literal columns in the database. Here's a snippet from the metrics definitions:
```markdown
## Core Business Metrics

| Term     | Definition                                           |
|----------|------------------------------------------------------|
| **GMV**  | What the customer paid (after discounts).            |
| **ARPU** | Average revenue per user/account for a given period. |
| **COGS** | Direct cost of products sold.                        |
...
```
The {{SEMANTIC_MODEL_DESCRIPTIONS}} placeholder loads each model file, which tells the agent when to use a given model, what tables and metrics it covers, and what types of questions it's good for. For example, here's an excerpt from the COMMERCIAL_FUNNEL model description:
```markdown
### COMMERCIAL_FUNNEL

This semantic model is specifically designed for practitioner acquisition
funnel analysis. It tracks the progression of practitioners from initial
signup through certification, ordering, and key activation milestones.

**When to use this model:**
- User explicitly mentions funnel, conversion rates, or
  stage progression (e.g., "SU to CAL", "funnel performance")
...

**When NOT to use:**
- Order revenue or GMV analysis → use ORDERS
```
The prompt_loader.py module handles the assembly: all model descriptions are loaded alphabetically and concatenated into their placeholder.
The assembled prompt is intentionally large because most of the system’s usefulness comes from the context it carries. All of this context is what enables the agent to understand our business. Adding a new semantic model is as simple as dropping a new markdown file into the models/ directory and restarting the agent.
How Flurry Works
Flurry is a conversational agent built to answer data questions inside Slack. A user can ask a question in plain English, and Flurry does more than just return query results. It interprets the request, determines what data is needed, generates and runs the SQL, and responds with analysis, visualizations, and a confidence signal that tells the user how much the system relied on verified logic versus on-the-fly inference.
In practice, Flurry works by combining the question, the relevant internal context, and a set of tools for querying and analysis. Depending on the request, it may use a verified query, generate fresh SQL against the warehouse, or pull in additional context before responding.
The first version came together quickly. We built an initial prototype in about a week using Claude Code and Claude Opus 4.5, then kept iterating on it part-time. Over time we added things like conversation memory, context compaction, PII/PHI detection, confidence scoring, and tighter integrations with the rest of our stack.
By December, it was ready for a small executive pilot. Adoption spread quickly from there, largely through word of mouth as early users shared it with their teams. What stood out in the pilot was how quickly executives became repeat users: they were not just checking top-line metrics, but using Flurry to explore conversion questions, cohort behavior, forecast scenarios, and emerging market trends through a series of follow-up questions in the same thread. It is now running on production infrastructure, handling roughly 800 queries a week across more than 50 users, and is available across the company.
Architecture Overview
Flurry is structured as a monorepo with several Python packages, each responsible for a distinct part of the system.
At the core is the agent, built on Google's Agent Development Kit (ADK) with Gemini as the underlying model. ADK gives you a framework for building agents with tools, session management, and execution runners out of the box. The agent has access to a set of tools that let it introspect the Snowflake schema, load semantic models, execute SQL queries, generate visualizations, and search Looker reports.
In front of the agent sits a FastAPI gateway that handles HTTP requests, manages sessions, streams responses via Server-Sent Events, formats output for Slack, and logs every conversation to Snowflake for analytics.
The Slack app is the primary interface. Built with Slack Bolt, it handles slash commands, mentions, direct messages, and reaction-based feedback. When a user asks a question, the Slack app streams the request to the gateway, posts intermediate progress updates in the thread, and delivers the final response as a top-level message.
Supporting all of this are a few additional packages: a Looker search service that uses vector search to find relevant dashboards, a scheduler for session cleanup and index refreshes, and a Claude research agent that can explore our DBT and Rails codebases as a last resort for questions the semantic models don't cover.
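The Looker search service relies on embedding similarity. The sketch below shows only the core idea, with hand-written three-dimensional vectors standing in for real embeddings; dashboard names and vectors are invented, and in practice the vectors would come from an embedding model and an index, not a dict.

```python
import math

# Illustrative only: vector search over dashboard embeddings.
# Real embeddings come from a model; these tiny vectors are hand-made.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

dashboards = {
    "Monthly Revenue Overview": [0.9, 0.1, 0.0],
    "Fulfillment SLA Tracker": [0.1, 0.9, 0.2],
}

def best_match(query_vec):
    """Return the dashboard whose embedding is closest to the query."""
    return max(dashboards, key=lambda name: cosine(query_vec, dashboards[name]))

print(best_match([0.8, 0.2, 0.1]))  # "Monthly Revenue Overview"
```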

Key Technology Choices
We used Google ADK as the agent framework. It handles the tool loop, session state, and context compaction for longer conversations. That was important because once users started treating Flurry like an ongoing thread, we needed a way to preserve context without carrying the full history forward every time.
Gemini 3 Flash is the main model. We chose it because it performed well in our evals and gave us a good balance of speed and quality for this workflow.
We also use Gemini to generate visualizations. Given a result set, Flurry can decide whether the output is best represented as a chart or a table and render something directly in Slack.
Langfuse handles tracing and observability across the request path, including tool use, query execution, latency, and evaluation results.
Snowflake is both the analytical backend and the logging layer. It powers the answers and stores the interaction data we use for auditing and system analysis.
Why the Data Warehouse Still Matters
Flurry only works because the underlying data foundation is strong. We have spent years investing in the warehouse itself: dbt, Kimball-style dimensional modeling, clean fact tables, clear dimensions, and business logic that is defined upstream rather than reconstructed ad hoc. The semantic layer matters, but it only works because it sits on top of data people can trust.
The Semantic Model Layer
It is relatively easy now to build an agent that can generate SQL against a database. What is still hard is getting that agent to answer business questions correctly and consistently.
For us, the difference came from adding a semantic layer between the agent and the raw Snowflake schema. We built semantic models around the main business domains at Fullscript, including orders, patients, practitioners, fulfillment, financials, and treatment plans.
Each model gives the agent more than just structural metadata. It includes business definitions for tables and columns, relationship mappings that help it join data correctly, verified SQL queries that have already been tested, and guidance on when a given model should be used instead of another. That context turned out to matter much more than raw schema access alone.
Here is a simplified excerpt from the ORDERS semantic model YAML to show what that looks like in practice (full yml spec can be found here).
```yaml
name: ORDERS
description: >
  Order and account data. Tracks revenue, discounts,
  and earnings across platforms and segments.

tables:
  - name: FCT_ORDERS
    facts:
      - name: ORDER_NET_REVENUE_USD
        synonyms: [net_revenue_usd, revenue_after_discounts]
        description: Revenue after discounts and earnings.
    metrics:
      - name: AOPA
        synonyms:
          - average_ordering_patients_per_account
          - avg_ordering_patients_per_account
          - patients_per_account
        description: >
          Average Ordering Patients per Account. Calculated as the number
          of distinct ordering patients divided by the number of distinct
          ordering accounts for the selected grain.
        expr: >
          COUNT(DISTINCT PATIENT_SK) /
          NULLIF(COUNT(DISTINCT ACCOUNT_SK), 0)

  - name: DIM_ORDERS
    dimensions:
      - name: SALES_CHANNEL
        synonyms: [revenue_stream, order_channel]
        description: Wholesale or Direct.

relationships:
  - left_table: FCT_ORDERS
    right_table: DIM_ORDERS
    columns: [ORDER_SK]

verified_queries:
  - question: "Net revenue by month since 2024"
    sql: |
      SELECT DATE_TRUNC('MONTH', fo.completed_at_utc) AS month,
             SUM(fo.order_net_revenue_usd) AS net_revenue
      FROM DW.CORE.FCT_ORDERS fo
      JOIN DW.CORE.DIM_ORDERS do ON fo.order_sk = do.order_sk
      WHERE fo.completed_at_utc >= '2024-01-01'
        AND do.is_reportable = TRUE
      GROUP BY month
      ORDER BY month DESC;
```
A few things to note:
- Every column has a plain-English description that tells the agent what it means.
- Synonyms map the different ways people might refer to the same concept, so when someone asks about "net revenue" or "product revenue after discounts," the agent knows what they mean.
- Metrics like AOPA (Average Ordering Patients per Account) include the actual SQL expression, so the agent can compute them inline.
- Relationships tell the agent exactly how to join fact and dimension tables.
- Verified queries provide tested, working SQL patterns that the agent can adapt for new questions.

The verified queries are especially important. When someone asks "What was our revenue last month?", the agent doesn't have to figure out the SQL from scratch. It loads the ORDERS semantic model, finds a verified query for revenue by month, adapts the date filter, and executes. That's a high-confidence answer. When the agent has to build SQL from scratch without a verified query, confidence drops to medium. When it has to go outside the semantic models entirely, it drops to low.
This confidence scoring is surfaced to users with every response so they know how much to trust the answer. It's a simple but powerful signal.
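The three tiers described above reduce to a small decision rule. This sketch is illustrative: the levels come from the post, but the enum and function names are assumptions, not Flurry's actual code.

```python
from enum import Enum

# Sketch of the confidence tiers described above (names are hypothetical).

class Confidence(Enum):
    HIGH = "high"      # answer adapted from a verified query
    MEDIUM = "medium"  # SQL built fresh, but within a semantic model
    LOW = "low"        # answered outside the semantic models entirely

def score_confidence(used_verified_query: bool,
                     used_semantic_model: bool) -> Confidence:
    if used_verified_query:
        return Confidence.HIGH
    if used_semantic_model:
        return Confidence.MEDIUM
    return Confidence.LOW
```

Because the signal is derived from how the answer was produced rather than from the model's self-assessment, it is cheap to compute and hard to game.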
Building Models with Agents
We did not build the semantic models entirely by hand. We used agents to accelerate that work too.
Claude Code was pointed at our dbt repos, where much of the warehouse logic is already defined, and at our Rails application, where a lot of the underlying business rules live. That gave it enough context to draft definitions, surface key metrics, and produce an initial version of the semantic model documentation.
We still reviewed and refined that work, but it dramatically reduced the amount of manual documentation involved.
Because the warehouse and the business both keep changing, this became a repeatable process rather than a one-time exercise. One of the more effective patterns in the project was using agents to help maintain the context that Flurry itself depends on.
Quality and Evals
Getting an agent to work in a demo is not the same thing as getting it to work reliably. We needed a way to measure quality, understand failure modes, and improve the system without relying on intuition.
We ended up with two layers of evaluation. The first is an offline eval suite built with DeepEval. We keep a fixed set of test questions that cover a range of Fullscript business scenarios, and we run those before shipping changes to Flurry. That gives us a quick read on whether a change is actually improving performance or creating regressions.
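The shape of such a suite is simple, even though the real one uses DeepEval with model-based metrics. The sketch below is not Flurry's actual harness: the golden set is invented, and a trivial substring check stands in for an LLM judge.

```python
# Illustrative regression loop over a fixed question set. Flurry's suite
# uses DeepEval with model-based judges; here a substring check stands in.

GOLDEN_SET = [
    {"question": "What does GMV mean?", "must_contain": "paid"},
    {"question": "Define AOPA", "must_contain": "patients"},
]

def fake_agent(question: str) -> str:
    """Stand-in for a call to the deployed agent."""
    answers = {
        "What does GMV mean?": "GMV is what the customer paid after discounts.",
        "Define AOPA": "Average ordering patients per account.",
    }
    return answers.get(question, "")

def run_suite(agent) -> float:
    """Return the pass rate over the golden set."""
    passed = sum(
        case["must_contain"] in agent(case["question"]).lower()
        for case in GOLDEN_SET
    )
    return passed / len(GOLDEN_SET)

print(run_suite(fake_agent))  # 1.0
```

Running a loop like this before every change is what turns "feels better" into a number you can compare across versions.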
The second layer is evaluation on live production traffic. Each response is scored across several dimensions using model-based judges. We look at whether the answer is factually grounded, whether it actually addresses the user’s question, whether it introduces any PII or PHI concerns, and whether the confidence score matches how the answer was produced. For example, there is a meaningful difference between a response grounded in a verified query and one that required more inference from the model.
Those evaluation signals all flow into Langfuse along with the rest of the request trace. Over time, that gives us a clearer picture of how the system is behaving. If groundedness starts to slip, we know to investigate. If certain question types repeatedly score poorly on relevance, that usually points to gaps in the semantic layer or in the agent guidance.
This ended up being one of the things that made the system practical to improve. We could move quickly, but still have a reasonably objective read on whether a change made Flurry better or worse.
Safety and Compliance Guardrails
Making data more accessible also meant being explicit about the boundaries of the system. From the beginning, we treated safety less as a feature and more as a design constraint.
Some of that is enforced at the infrastructure layer through a least-privilege access model. Flurry’s Snowflake role is read-only, so it cannot create, update, or delete anything in the warehouse and is limited to SELECT queries. Sensitive fields are further protected through column-level masking and, where applicable, de-identified representations of patient data, giving us a defense-in-depth approach where the guardrail is enforced at the data layer itself rather than relying on prompt instructions alone. These controls are designed to operate within Fullscript’s broader HIPAA privacy and security frameworks. Even when a query touches patient-level tables, protected values are masked or de-identified and cannot be returned in raw form through Flurry.
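The read-only boundary is enforced by the Snowflake role itself, but an application-side check can reject obviously unsafe statements before they are ever sent. The validator below is a hypothetical belt-and-suspenders sketch, not Flurry's actual code, and is deliberately conservative rather than a full SQL parser.

```python
import re

# Hypothetical application-side guard. The real guardrail is the
# read-only Snowflake role; this just fails fast on non-SELECT SQL.

FORBIDDEN = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|CREATE|ALTER|DROP|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Allow only statements that start with SELECT or WITH (for CTEs)."""
    first = sql.strip()
    starts_select = re.match(r"^(SELECT|WITH)\b", first, re.IGNORECASE)
    return bool(starts_select) and not FORBIDDEN.match(first)

print(is_read_only("SELECT 1"))               # True
print(is_read_only("DROP TABLE FCT_ORDERS"))  # False
```

Even if the model were coaxed into emitting a write, both this check and the role's lack of write privileges would stop it, which is the point of defense in depth.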
We also put guardrails into the agent itself. The system is explicitly instructed not to return individual patient information or other sensitive health data. If someone asks for a specific person’s medical history, order details, or similar identifying information, Flurry is expected to decline rather than answer.
That combination matters. The prompt-level rules help shape behavior, but the stronger protection comes from enforcing the boundary in the underlying systems as well.
Why Slack
One of the early decisions was where Flurry should actually live. We could have built a separate web app or another internal tool, but we chose to put it in Slack.
That choice was mostly practical. Slack is already where a lot of day-to-day work happens at Fullscript. It is where people ask questions, share context, and work through problems with each other. Putting Flurry there meant people did not have to learn a new interface or remember to open a separate tool. They could ask a question in the same place they were already working.
That turned out to matter a lot for adoption. Because Flurry was already present in Slack, using it felt lightweight. People could ask a question, follow up, refine what they meant, and keep exploring in the same thread without breaking their flow.
Slack also made feedback easier to capture. People can react directly to a response, leave comments in the thread, and signal when something was helpful or off. That gave us feedback in the course of normal usage instead of relying on a separate process that most people would probably ignore.

Results and Adoption
Flurry started as a small proof of concept in December 2025 with a limited group of executives. From there, usage expanded gradually as people shared it with their teams and others started finding their own use cases for it. It is now available across Fullscript and has become part of how people get answers to routine data questions.
One of the more interesting things has been the range of questions people ask. We expected a lot of straightforward business metric questions, things like revenue, account growth, or other top-line performance cuts. Those do come up, but the usage has been broader than that.
People also use Flurry for more operational questions, for investigating fulfillment issues, looking at promotion performance, or digging into data quality concerns. Some requests are narrow and tactical. Others are much more exploratory. That spread has been a useful reminder that once access gets easier, people use it for far more than just executive reporting.
Another useful pattern we saw was report discovery. In some cases, Flurry was not just answering the question directly, but helping users find the right existing Looker report or dashboard faster.
Key Learnings
A few things became pretty clear as we built Flurry:
- Most of the hard work ended up in the semantic layer.
- Agents were useful not just in the product but in the build process itself.
- Model quality improved more from better grounding than from model changes.
- Evals were what made fast iteration possible.
- Putting Flurry in Slack mattered a lot more for adoption than we expected.
What's Next
Flurry is already useful, but it is still early. There is more semantic coverage to build, more workflow integration to add, and more iteration to do on the prompts and evals as usage expands.
The next phase is less about proving the concept and more about extending it. That means broader semantic coverage, tighter integration with the rest of our workflow, and continued improvement in how Flurry handles more complex or ambiguous questions. It also means moving beyond a purely reactive model. Today, Flurry answers the questions people ask. Over time, there is an opportunity for it to become more proactive: surfacing changes, patterns, or emerging signals that someone might want to investigate before they think to ask. The opportunity now is to make it more comprehensive and more deeply embedded in how teams work.

