I used to believe in full data transparency. Put all data out there even in its rawest form with little context. If someone cares enough, they’ll ask about it. But over the years, I realized I was shirking responsibility and pushing it to others.

The data team should help stakeholders get business value. And that includes helping them avoid the dangers of swimming in too much data.

Danger #1: confusion and overload

What do these fields mean? Why are there five versions of a similarly named table? What is the difference between them?

A stakeholder will naturally feel confused given a bunch of data with little explanation.

If you give access to the raw data, you should help them answer those questions. So they know what they’re working with.

Writing and maintaining documentation is a good step. And cultivating a culture of reading the documentation is important as well. You should also leave time for training.

Danger #2: bad decisions made from incorrect usage

What can I do with this data? How complete is it? Is it biased? How frequently are new records added? Are past records ever updated or deleted?

If a stakeholder assumes incorrectly, they could make a bad business decision. And they would distrust the data. Then it’s an uphill battle to regain that trust.

To avoid that, make sure your stakeholder knows what they can and cannot do with the data. Is it appropriate for backtesting? Will survivorship bias skew the trend lines?

Stating the assumptions about the data explicitly in the documentation is very helpful. And more importantly, include examples of what to do and what not to do with the data.
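
One way to make those assumptions hard to miss is to keep them next to the data as a small, checkable record. Here is a minimal sketch in Python. The `DatasetNotes` record, the `check_use` helper, and the `active_accounts` example values are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetNotes:
    """Hypothetical documentation record for one dataset."""
    name: str
    refresh: str                        # how frequently new records are added
    records_mutable: bool               # are past records ever updated or deleted?
    known_biases: tuple = ()            # plain-language notes on known biases
    ok_for: frozenset = frozenset()     # uses the data team supports
    not_ok_for: frozenset = frozenset() # uses known to give misleading results


def check_use(notes: DatasetNotes, use: str) -> str:
    """Return 'ok', 'not ok', or 'undocumented' for a proposed use."""
    if use in notes.not_ok_for:
        return "not ok"
    if use in notes.ok_for:
        return "ok"
    return "undocumented"


# Illustrative values only; not taken from any real dataset.
active_accounts = DatasetNotes(
    name="active_accounts",
    refresh="daily",
    records_mutable=True,
    known_biases=("survivorship: closed accounts are deleted",),
    ok_for=frozenset({"current account reporting"}),
    not_ok_for=frozenset({"backtesting", "historical trend lines"}),
)

print(check_use(active_accounts, "backtesting"))
```

An "undocumented" answer is useful too: it tells the stakeholder to ask before assuming.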

Danger #3: time wasted on manual spreadsheet process

To work with the raw data, your stakeholder has plugged it into a spreadsheet. And now every week they load the raw data from a query, then spend an hour on spreadsheet transformations to generate a report.

Sounds innocuous at first, but once they do it a few times, that manual spreadsheet time adds up.

Not only that, but if they have an issue with their spreadsheet process, they may come to your team for support.

To save your time and your stakeholder’s time, keep tabs on how data is used. If there is a repeated spreadsheet being used, you can reach out and help them move it to something more sustainable (e.g. dbt for transformations, and Looker for business intelligence).
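
As a tool-agnostic illustration of what "more sustainable" can mean, the sketch below captures a typical hand-built pivot as a small Python function, so the weekly hour of spreadsheet work becomes a repeatable step. The column names (`order_date`, `region`, `amount`), the `weekly_report` function, and the sample rows are assumptions made up for this example. In practice the same logic might live in a dbt model instead.

```python
import csv
import io
from collections import defaultdict
from datetime import date, timedelta


def weekly_report(raw_csv: str) -> dict:
    """Sum `amount` per (week start, region), the kind of pivot often done by hand."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        day = date.fromisoformat(row["order_date"])
        # Roll each date back to the Monday of its week.
        week_start = day - timedelta(days=day.weekday())
        totals[(week_start.isoformat(), row["region"])] += float(row["amount"])
    return dict(totals)


# Made-up sample rows standing in for the weekly query export.
sample = """order_date,region,amount
2024-01-01,EU,10.0
2024-01-03,EU,5.5
2024-01-08,US,7.0
"""

report = weekly_report(sample)
for (week, region), total in sorted(report.items()):
    print(week, region, total)
```

Once the transformation is written down like this, it can be reviewed, tested, and rerun without anyone redoing the steps by hand.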

Danger #4: decreased agility

Once stakeholders are using certain raw or intermediary datasets / tables, you’re implicitly committed to keeping those datasets / tables intact.

So if you ever change the way you model data or the data you ingest, you need to re-train your stakeholders and help migrate their processes.

Ideally, you’ll have already proactively moved any recurring processes away from the raw data. Then the only things left to do are to migrate the documentation and make stakeholders aware.

Caveats

If the stakeholder is a data professional themselves, e.g. an embedded data scientist on the marketing team, then you may not need to help them think through the assumptions behind data sets.

The same goes for raw data being delivered to a client’s or customer’s data team.

So what?

TL;DR – you should hold yourself to a high standard when giving data to stakeholders. And be conservative about what you support.

Tell them what the data is
How to model it and use it
With what tools to model it
And help them move to new data when the time comes

(Photo by Zac Harris on Unsplash)