The Scaling Data Framework: Data Informed to Data Driven to Data Led

One of the most common questions I get from founders who are trying to scale data for their organizations is: “When should I hire my first data person?”

Invariably, the same types of questions are asked over the lifecycle of the company:

  • Should I hire a data engineer or a data analyst?

  • Do I need a data scientist right now?

  • What kind of analysis should the data team be doing?

  • Is the PM or data team responsible for data collection?

  • Should I be using Looker (an advanced data transformation and visualization tool)?

  • What’s the right ratio of analysts to PMs?

  • Where do analysts report into?

At the core of these questions about scaling data is the same common mistake: viewing data as a team to hire or a set of tools to implement, instead of as a strategic lever for growth. The answers to these questions depend on your product, business, and points of leverage.

In this article, I lay out:

  1. Why scaling data is not about hiring a team or implementing tools

  2. The right way to approach scale: Using data as a strategic lever for growth

  3. The 3 stages and 4 capabilities of the Scaling Data Framework

  4. Scaling Data, Stage 1: Data Informed

  5. Scaling Data, Stage 2: Data Driven

  6. Scaling Data, Stage 3: Data Led



This post is written by Crystal Widjaja, Reforge Partner/EIR, and the former SVP Growth and Business Intelligence at Gojek, one of the largest super apps in Southeast Asia. Crystal helped Gojek scale from 20K orders per day to 5M. Crystal is also a co-founder and advisor to Generation Girl, which is dedicated to helping young girls engage in STEM fields. Crystal previously wrote Why Most Analytics Efforts Fail and currently leads Reforge’s Advanced Growth Strategy program.

Thanks to contributions from Dan Wolchonok (Head of Data at Reforge), Elena Verna (EIR at Reforge, Advisor at Miro, Netlify, MongoDB), Behzod Sirjani (EIR at Reforge, ex-Slack/Facebook), Shani Hadiyanto, and Sarah Catanzaro.


Scaling Data Is Not About Hiring a Team or Implementing Tools

It is easy for orgs to settle into treating data as a set of tools to implement or a team to hire. In other words, the org thinks it just needs to adopt a new technology or grow the data team in a certain way to fix its data needs. Often, the sources of data problems are in other areas:

  1. Data Capabilities Don't Match The Product Strategy: The answers to questions about who to hire, which tools to implement, and which analyses need to be done are ultimately informed by what the product strategy is and how data helps achieve it. But oftentimes the product strategy isn't well defined, and even when it is, where data fits in isn't.

  2. Your Stage of Data Mismatches The Stage of Business: Often there is a mismatch between the stage a person has historical experience with and the stage the company is at. For example, a Data PM may come into a new company having worked only with a mature company's data. They never saw the steps it took to get from 0 to great, and end up misapplying technology, team needs, and a lot more. Scaling data requires many evolutions, and it is rare that someone has seen the entire lifecycle.

  3. Incorrect Incentives Between Data and Other Teams: Often, the culture and incentives of the org create a non-functioning data environment. For example, data teams should not be measured by the answers they give but rather the impact of those answers on the business. In a lot of culture-poor organizations, PMs or others take credit for "asking the right question" instead of attributing it to the data team. This type of system rewards bad behavior and disincentivizes the data team from doing more impactful analysis — it instead incentivizes them to design pretty result tables. It may even incentivize the data team to seek out new questions that aren't relevant to the business but can provide "interesting" answers, which leads to a negative cycle.

The Right Way to Scale: Strategy to Stage to Team to Tools

Instead, data needs to be seen as a strategic lever for growth. Viewing data from this perspective leads to different answers on the questions we started with around team and tools. What does it look like when data is treated as a strategic lever for growth? I recommend walking through four areas:

  1. Strategy - What are your points of leverage? How does data improve those points of leverage?

  2. Stage - What stage of maturity is our product in? What stage of maturity is our data in?

  3. Team - What people do we need to achieve the data strategy? Are they set up for success internally?

  4. Tools - What tools do we need to adopt to facilitate the team's impact?

Strategy: What are your points of leverage?

Everyone thinks they need to have a highly scalable, mature data organization. The reality is that most businesses don’t have the necessary scale to build advanced ML-led capabilities that could meaningfully impact the business. The answer to questions like “when do I need a data scientist?” really starts with an objective reflection of the company’s strategy, roadmap, and goals:

  • How much data do the product and business operations generate each day?

  • How can customer value props be improved by leveraging data?

  • What kinds of decisions could the data help inform today?

  • How could decision-making change if we had 1000x the data?

  • How much more efficient could business operations be with data automation?

It's more about identifying the right points of leverage — and not just jumping to the end because you think everything else will come as a result of it.

Going through some of the above questions tends to reveal some uncomfortable truths. The most common one is that the company doesn't have enough data for advanced data infrastructure to have an impact on its business operations. Even the most sophisticated data science team and infrastructure will fail to add value to a business that just isn’t generating enough usable data: there aren’t enough signups, retained users, or actions in the product for meaningful data science solutions to exist.

Stage: What stage of maturity is our product? Data?

Just because you've identified the points of leverage doesn't mean you are at the right stage in data or product maturity to execute on those strategic initiatives. Questions the team should ask:

  • How much of this data is tracked, stored, and owned by the company?

  • How consistent and descriptive is the data for our market and trends? (For example, in Covid times current data might be out of whack with historical data and make it difficult to leverage in a useful way.)

  • Are you tracking data at the right level of granularity or asset class (event-based, timebound, derived, aggregated)?

  • How deterministic is the company's data? What level of granularity do we require to make deterministic calls?

  • How differentiated and proprietary is the company's data?

  • How timely, recent, and accurate is the company's data?

  • How accessible is the company's data — by both people and systems?

Understanding both what stage of maturity your product is at and where your data is at is critical. It helps you understand where you should be, where you are, and informs the kind of tools and team you need to fill the gap. The most common scenarios I see are:

Overbuilding - Data Stage Is Ahead of Product Maturity

If the company is still searching for product-market-fit, building data infrastructure that meets the needs of a post-product-market-fit organization will actually impede growth. As an example, the first data scientist I hired at Gojek was tasked with building a fraud detection service before we even had the infrastructure in place to collect enough data. This was a mistake — after a few unproductive weeks, we realized that a few simple business logic rules could capture the majority of suspicious transactions. Many years later, those business rules are still responsible for preventing 80% of bad actors, even with advanced data science teams in place. Stories like these are incredibly common. The shiny, enterprise-level data products prevent companies from making productive use of the data they have and tend to favor high-risk, high-resource projects that fail to deliver impact.

Companies that waste resources on projects like these have failed to identify an appropriate data strategy for their stage of the business and, instead of building appropriate capabilities, have looked to solve an advanced, specific problem. The key is to identify the right sequence of problems to solve with the right foundations built in tandem. This means understanding how data should be leveraged at each stage to meet the needs of the business today in preparation for what the company will need in the (near) future.
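
To make the fraud example above concrete, here is a minimal sketch of what "a few simple business logic rules" might look like. The field names, thresholds, and rules are hypothetical illustrations, not the rules Gojek actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    order_id: str
    user_id: str
    amount: float
    account_age_days: int
    orders_last_hour: int


# Hypothetical thresholds; a real team would tune these against its own data.
NEW_ACCOUNT_DAYS = 3
MAX_AMOUNT_NEW_ACCOUNT = 500.0
MAX_ORDERS_PER_HOUR = 10


def suspicious_reasons(tx: Transaction) -> List[str]:
    """Return the reasons a transaction looks suspicious (empty list if none)."""
    reasons = []
    if tx.account_age_days < NEW_ACCOUNT_DAYS and tx.amount > MAX_AMOUNT_NEW_ACCOUNT:
        reasons.append("large order from new account")
    if tx.orders_last_hour > MAX_ORDERS_PER_HOUR:
        reasons.append("unusually high order velocity")
    return reasons


flagged = suspicious_reasons(Transaction("o1", "u1", 900.0, 1, 2))
print(flagged)
```

Rules like these need no training data and are easy for operations teams to read and adjust, which is part of why they suit a company that is not yet collecting much data.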

Underbuilding - Data Stage Is Behind Product Maturity

Underbuilding is when the maturity of the product is ahead of data maturity. You can underbuild in different areas: infrastructure, analytics, team, and operations. This is most problematic when some of the company’s business operations are at scale but are totally unprepared to leverage data as a strategic, competitive advantage. Some signals that you've underbuilt:

  • You have multiple products using inconsistent data attributes. For example, timestamp fields use different time zone logic and definitions, and the taxonomy is inconsistent from one product to the next.

  • Months or even years of data haven’t been tracked at all.

  • Data that has been tracked is stuck in a third-party system that the company doesn’t have ownership of. For example, I recently worked with a company that uses Firebase, thinking they could eventually export logs. But Firebase does not store individual event data, making this impossible. They have literally wasted years of data collection.

  • The business has been operating with sub-optimal decision-making without data for so long that it’s unlikely to change easily.

The realization of this opportunity cost is painful, and making up for it can take hours of realigning metrics definitions, sourcing available data, backfilling data pipelines, and a realignment of the company’s culture.

Team + Tools

Once you understand what role data plays in the overall strategy, and what stage the product and data are in, then you can begin to understand what team and tools you need and where there are gaps. Team and tools is not just about having the right heads in place, but about making sure that the org is working well together. Signals that teams are not aligned:

  • Teams aren't collaborating on both problems and solutions with the data team. They are instead coming to the data team with a hypothesis to validate.

  • The data and product org don't have time to align on strategic initiatives because they are bogged down by minutiae of tasks to be done.

  • Analyses aren't treated as valuable findings that help people move closer to their objectives, and instead simply evaluate whether something was a win or not.

Data scaling for startups

3 Stages (and 4 Capabilities) of the Scaling Data Framework

To help guide teams through the phases of strategy, stage, team, and tools, I have laid out a friendly framework for designing a scalable data organization. In this framework there are two parts:

  • 3 Stages of Data Maturity: What the business needs to grow and how data plays a role informs the data strategy at each stage.

  • 4 Capabilities Within Each Stage: The necessary building blocks and capabilities of each stage across 4 key work streams (infrastructure, analytics, operations, and team).


The 3 Stages of Data Maturity

Most companies can fit themselves into one of three stages:

  • Stage 1: Data Informed. These companies are focused on building the business and getting to product-market-fit (stable user retention rates). The key business need is for data to provide operational visibility.

  • Stage 2: Data Driven. These companies have reached product-market-fit and are actively optimizing for specific users, behaviors, and experiences in the product at the feature-level. The key business need is for data to support the organization’s growth with scalable tooling, data products, and deep-dive insights.

  • Stage 3: Data Led. These companies are operationally run by data products, infrastructure, and services. The key business need is the “productization” of data services that unlock Product and Data Science teams, allowing them to automate operational decision-making and user product experiences.

The successful advancement from one stage to the next requires two things:

  • Needs: The company’s activities and desired business objectives have evolved due to new levels of growth, scale, or product-market-fit

  • Capabilities: The dependencies and foundations required for the next stage have been built and unlock new leverage and capabilities

Note: Not all companies need to become Data Led.

The implication here is that the stages form a linear progression, but it’s important to note that not all companies become Data Led. Most companies today would describe themselves as Data Informed or Data Driven (or as aspiring to reach those stages), and only some businesses envision reaching the Data Led stage.

However, this stage does not apply to all businesses; it describes a globally scaled organization in which data dictates what and how you operate. Businesses with meaningful traction may find that building Stage 3: Data Led capabilities is possible, but would not dramatically impact their strategy due to the nature of their business, such as having a small number of SKUs to optimize for or a low-frequency product in an evolving market that renders prediction and forecasting models less effective.

The 4 Capabilities You Need at Each Stage of Data Maturity

Founders should use this playbook by considering the needs of their business (what they need to achieve) in comparison to the next section in this framework: the recommended capabilities (what’s needed to fulfill their business needs) for each stage.

Mature data organizations work in conjunction with mature businesses by empowering, unlocking, and building off of one another. The right strategy is to match the needs of the business at its current stage with building the appropriate data capabilities in order for the business to scale into its next growth phase. There are four pillars of data capabilities in any organization:

  • Infrastructure: the scalability of the tools, architecture, and technical projects that enable analytics

  • Analytics: the complexity and scalability of insights generated in the company

  • Operations: the level of direct impact data has on business operations

  • Team: who is hired to support the above capabilities

How and what to develop across these 4 capabilities differs by stage but ultimately leads to building sustainable infrastructure, developing compounding insights, unlocking business operations, and evolving skill sets.

Data scaling for startups, Stage 1

Scaling Data, Stage 1: Data Informed

The most common pitfall at the data informed stage is being indecisive about the truth (and allowing multiple versions to co-exist). If the company has already reached product-market-fit but is missing one of the crucial capabilities above, teams might think they have a shared understanding and a single source of truth for data when in reality they don't.

If Finance believes we gained 100 new transacting users from Facebook Ads in October, but Marketing thinks it was 120, we’re likely operating from different tools, metrics, definitions, time zones, or even accrual vs. cash-based accounting. This friction commonly leads to time wasted on alignment, frustration, and teams avoiding data altogether. Organizations that do not stamp this out quickly will fail to mature as a data-driven company.
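
A small sketch shows how easily this happens. Assume, hypothetically, that first-transaction timestamps are stored in UTC, that one team reports in UTC, and that another reports in a UTC+7 local time zone. The user IDs and timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

UTC = timezone.utc
LOCAL = timezone(timedelta(hours=7))  # assumed local reporting time zone (UTC+7)

# Hypothetical first-transaction timestamps, stored in UTC.
first_transactions: Dict[str, datetime] = {
    "user_a": datetime(2021, 9, 30, 18, 0, tzinfo=UTC),   # Oct 1, 01:00 in UTC+7
    "user_b": datetime(2021, 10, 15, 9, 0, tzinfo=UTC),
    "user_c": datetime(2021, 10, 31, 20, 0, tzinfo=UTC),  # Nov 1, 03:00 in UTC+7
    "user_d": datetime(2021, 9, 30, 20, 0, tzinfo=UTC),   # Oct 1, 03:00 in UTC+7
}


def new_users_in_month(
    first_tx: Dict[str, datetime], year: int, month: int, tz: timezone
) -> List[str]:
    """Users whose first transaction falls in the given month, as seen in tz."""
    users = []
    for user, ts in first_tx.items():
        local_ts = ts.astimezone(tz)
        if (local_ts.year, local_ts.month) == (year, month):
            users.append(user)
    return sorted(users)


print(new_users_in_month(first_transactions, 2021, 10, UTC))    # ['user_b', 'user_c']
print(new_users_in_month(first_transactions, 2021, 10, LOCAL))  # ['user_a', 'user_b', 'user_d']
```

Both answers are "correct" for their definition of October. The disagreement only goes away once the organization agrees on one definition and writes it down.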

Data Informed Business Needs

  • Business health monitoring

  • Visibility of product KPIs/success metrics

  • Functional operations support for a handful of multi-disciplinary individuals and ICs

  • Getting to product-market fit (flattened retention rates across cohorts)

  • Metrics definitions and alignment
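
"Flattened retention rates across cohorts" can be checked with very little tooling. The sketch below is one simple way to compute retention curves per signup cohort. The input shapes, the function names, and the flatness tolerance are assumptions for illustration, and the flatness check is a crude heuristic rather than a formal test.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def retention_curves(
    signup_period: Dict[str, int],
    activity: Iterable[Tuple[str, int]],
) -> Dict[int, Dict[int, float]]:
    """Map cohort -> {periods since signup -> share of the cohort active}."""
    cohort_sizes: Dict[int, int] = defaultdict(int)
    for cohort in signup_period.values():
        cohort_sizes[cohort] += 1

    # (cohort, periods since signup) -> distinct users active in that period
    active_users: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
    for user, period in activity:
        cohort = signup_period[user]
        offset = period - cohort
        if offset >= 0:  # ignore activity logged before signup
            active_users[(cohort, offset)].add(user)

    curves: Dict[int, Dict[int, float]] = defaultdict(dict)
    for cohort, offset in sorted(active_users):
        curves[cohort][offset] = len(active_users[(cohort, offset)]) / cohort_sizes[cohort]
    return dict(curves)


def has_flattened(curve: Dict[int, float], tolerance: float = 0.02) -> bool:
    """Crude check: the last two observed points differ by at most `tolerance`."""
    offsets = sorted(curve)
    if len(offsets) < 3:
        return False
    return abs(curve[offsets[-1]] - curve[offsets[-2]]) <= tolerance


# Four users sign up in period 0; two of them keep coming back.
signups = {"a": 0, "b": 0, "c": 0, "d": 0}
events = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("a", 1), ("b", 1), ("a", 2), ("b", 2)]
curves = retention_curves(signups, events)
print(curves[0], has_flattened(curves[0]))
```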

Data Informed Capabilities

  • Reliable availability of transactional and financial data

  • Off-the-shelf data visualization, integration, and tooling

  • Broad organizational understanding of unique user, retention, and monetization metrics through the use of company level KPI dashboards

  • Aligned metric definitions through the use of a data dictionary
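
A data dictionary does not need special tooling to start with. As a rough sketch, it can be a single registry that refuses to hold two definitions of the same metric. The class names, fields, and the example metric below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str
    source_table: str
    time_zone: str
    owner: str


class DataDictionary:
    """A single place where each metric has exactly one agreed definition."""

    def __init__(self) -> None:
        self._metrics: Dict[str, MetricDefinition] = {}

    def register(self, metric: MetricDefinition) -> None:
        if metric.name in self._metrics:
            # Two co-existing versions of the truth are exactly what we want to prevent.
            raise ValueError(f"'{metric.name}' is already defined")
        self._metrics[metric.name] = metric

    def lookup(self, name: str) -> MetricDefinition:
        return self._metrics[name]


dictionary = DataDictionary()
dictionary.register(
    MetricDefinition(
        name="new_transacting_users",
        description="Users whose first completed transaction falls in the period",
        source_table="transactions",  # hypothetical table name
        time_zone="UTC",
        owner="Finance",
    )
)
print(dictionary.lookup("new_transacting_users").time_zone)
```

In practice many teams begin with a shared spreadsheet or wiki page. The point of the sketch is the rule it enforces: one name, one definition, one owner.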

Data scaling for startups, Stage 2

Scaling Data, Stage 2: Data Driven

The biggest pitfall at the data driven stage is misalignment of organizational incentives in relation to data-driven decision making. Business decisions don't stop getting made in the absence of data; they just get made with no data at all. Data Driven companies are focused on scaling with the growing functional and product operations and refining their product offering for an ever-expanding set of users. This usually starts by reorganizing teams, hiring product- and feature-specific PMs, and specializing functions (e.g. separating Sales into Sales Development Reps, Account Executives, Account Managers, and operational roles).

The increased capabilities of the data function unlock deeper accountability, as specific teams can now be responsible for input metrics (# of hand-raisers on feature walls, # contacts added, days to first call, or active team members per org) instead of a generalized, org-wide shared responsibility for output metrics (revenue, retention, # of paid upgrades, etc).

Organizations at this stage leverage the Data team for decision-making guidance, as opposed to operational data retrieval and visibility. To improve data-driven decision making, the organization must have some self-serve access to information, comprehensive insights that answer why something is happening (not just what is happening), and an early set of productized data products that unlock operational capabilities.

Data Driven Business Needs

  • Feature-level product optimization

  • Smarter & faster function-specific business operations (sales, customer service, ops)

  • Data-informed decision-making guidance on marketing campaigns, growth tactics, and support operations

  • Expanded monetization of core & adjacent users

Data Driven Capabilities

  • Proactive data governance policies

  • Scalable data warehouse infrastructure and tooling through a data lake, customer data platform, and more

  • Self-serve analytics tools

  • Org-wide experimentation and decision-making guidance through experimentation tooling and a reporting framework
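
One common building block of a reporting framework for experiments is a significance check on conversion rates. The sketch below uses a standard two-proportion z-test with a pooled standard error. It is one of several reasonable choices, and the function name and sample numbers are assumptions for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conversions_a: int, users_a: int, conversions_b: int, users_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates, B minus A."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: control converts 100/1000, variant converts 130/1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Wrapping the calculation in one shared function means every team reads experiment results the same way, which is the kind of alignment a reporting framework is meant to provide.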

Data scaling for startups, Stage 3

Scaling Data, Stage 3: Data Led

The most differentiated thing about Stage 3: Data Led businesses is that they cannot operationally function without data products. The scale and complexity of both the company’s operations and its active user base are such that relying solely on business-generated recommendations, rule sets, and SOPs is not enough to maintain a defensible product experience. A good example of this would be Amazon, which could not successfully manage the scale of its business without the proprietary predictive models that power fulfillment, logistics routing, and warehouse SKU storage.

At this stage, the Data team has built out a self-service data infrastructure platform that solves for ingestion, governance, monitoring, and automation. It is no longer the data team’s sole responsibility to take care of onboarding new data sources and integrating them into the product’s feedback loops and ML models. This “productization” of data services unlocks the Product and Data Science teams and allows them to quickly build new products and features with the data they need.

Data Led Business Needs

  • Data-leveraged defensibility

  • Self-serve data onboarding and productization

  • Prescriptive, automated operational decision-making

  • Deepened user engagement, frequency, and monetization

Data Led Capabilities

  • Platformized, scalable data warehouse infrastructure and self-serve tooling

  • Feature engineering to feed and train data science models

  • Near-real-time data availability

Evolving needs and capabilities of an org

Thoughtful Sequencing, Not All At Once

A key to success is to not try to enable a stage all at once. At a high level, infrastructure and analytics must be balanced between two things:

  1. Unlocking product insights and business capabilities

  2. Scaling data operations and infrastructure

The right balance is achieved with a thoughtful sequencing of architectural engineering work, analytics, and application of analytics to business and product. The struggle for early-stage Data Informed companies will be cultivating the necessary blend of technical and business skills in the organization that can unlock meaningful insights efficiently. Teams with strong communication lines between business, product, and engineering will sequence these efforts more efficiently than teams with equivalent skill sets but siloed communications. Shared business and technical fluency encourages the right sequencing by focusing on understanding what grows the business and having the complementary technical know-how to select tools, research solutions, and implement quickly without taking on large engineering projects that would not add proportional value. It is a constant process of identifying business needs, building the necessary capabilities, and seeing them unlock growth, which leads to new business needs.