Cataloging your data sources

Data cataloging for Databricks, Fabric and other data platforms

Most organizations have data spread across multiple platforms: a data warehouse like Microsoft Fabric or Snowflake, a data lakehouse like Databricks, and a stack of operational databases on the side.

For analysts, engineers, and data consumers, navigating that landscape is its own job: "Which table is the source of truth for customer data?", "What does this column mean?", "Is this dataset reliable, or has it been deprecated, and who owns it?"

dScribe Catalog gives you a single place where every table, view, and column is discoverable, documented, owned, and connected back to the business meaning the data carries.


What you get out of it

When your data sources are cataloged in dScribe:

  • One discoverable entry for every table and view across your data platforms

  • Clear ownership for each dataset, so users know who to ask

  • Trusted descriptions that explain what each table and column is for and how to use it

  • End-to-end lineage — trace the dependencies of a dataset all the way down to the dashboards making use of it

  • Linked business context — columns are connected to the glossary definitions that explain what they mean

The result: data engineers spend less time answering "where does this come from?", analysts trust the data they're working with, and the same business definitions apply consistently from source to report.


How it works

1. Connect your data platform

dScribe pulls metadata directly from your data warehouse, lakehouse, or database via native connectors:

  • Databricks

  • Microsoft Fabric

  • Snowflake

  • Google BigQuery

  • ...

Good to know: If your organization uses a platform we don't natively support, you can build a custom connector using the dScribe data loader. See Custom connectors for setup details.

For a list of available connectors and step-by-step setup guidelines, see the connector articles.
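
To give a feel for what a connector gathers, here is a minimal sketch of metadata extraction in Python. It reads table, view, and column information from a SQLite database using only the standard library. The function name and the shape of the returned records are assumptions made for illustration; they are not the dScribe data loader's input format, which is described in the connector articles.

```python
import sqlite3


def extract_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Return one record per table or view, with its columns and declared types."""
    assets = []
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": row[1], "data_type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{name}")')
        ]
        # Hypothetical record layout, chosen for readability only.
        assets.append({"name": name, "kind": kind, "columns": columns})
    return assets


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cust_id INTEGER, email TEXT)")
    conn.execute("CREATE VIEW active_customers AS SELECT cust_id FROM customers")
    for asset in extract_metadata(conn):
        print(asset)
```

A connector for another platform would likely query that platform's own system catalog instead of sqlite_master, but the idea is the same: list the objects, then list each object's columns and types.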

2. Let dScribe catalog your assets

Once a connector is configured, dScribe catalogs your data assets automatically on a schedule. For each source, dScribe captures:

  • Tables, views, and their underlying schemas

  • Columns, with their data types and calculation formulas

  • Existing technical descriptions (where present in the source)

  • Lineage between tables, views, and downstream reports

You don't need to do anything for this step — assets appear in your Catalog as soon as the sync runs.

3. Document what matters

Synced metadata is a starting point. Add the context that turns a list of tables into a usable catalog:

  • Use no-code automations to automatically classify data and assign ownership

  • Write a description that explains the table's purpose, granularity, and any known caveats

  • Link columns to glossary definitions, so a column called cust_id is tied to the business concept of "Customer"

  • Set a validation status to indicate whether a table is trusted, in development, or deprecated

  • Flag sensitive data using custom properties (e.g. PII, financial data) so users know to handle it carefully

Don't try to document everything at once — start with the tables your reports and analyses actually rely on.
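
If you want to reason about classification rules before configuring them, the logic can be sketched in a few lines of Python. This is an illustration of rule-based flagging by column name, not how dScribe automations are implemented. The patterns, labels, and function names are all hypothetical.

```python
import re

# Hypothetical name patterns; a real rule set would follow your own naming conventions.
SENSITIVE_PATTERNS = {
    "PII": re.compile(r"(email|phone|birth|ssn|first_name|last_name)", re.IGNORECASE),
    "Financial": re.compile(r"(iban|salary|credit_card|account_number)", re.IGNORECASE),
}


def classify_column(column_name: str) -> list[str]:
    """Return the sensitivity labels whose pattern matches the column name."""
    return [
        label
        for label, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(column_name)
    ]


def flag_columns(columns: list[str]) -> dict[str, list[str]]:
    """Map each flagged column to its labels; unflagged columns are left out."""
    flagged = {}
    for column in columns:
        labels = classify_column(column)
        if labels:
            flagged[column] = labels
    return flagged


if __name__ == "__main__":
    print(flag_columns(["cust_id", "customer_email", "monthly_salary"]))
```

Name-based rules are a starting point only: a column called notes can still hold personal data, so flagged results are worth a manual review.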

4. Connect source to report

Once both your data sources and reporting layer are in dScribe, the lineage picture becomes end-to-end. A user looking at a Power BI dashboard can:

  • See which datasets the report is built on

  • Drill into the underlying tables and columns

  • Read the glossary definition tied to each measure

  • Reach the team that owns the source data

This is where your data catalog starts being a navigation layer for your whole data stack.
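
Conceptually, the drill-down from a report to its sources is a walk over a dependency graph. The sketch below shows that walk in Python as a breadth-first traversal. The asset names and the UPSTREAM mapping are invented for the example and do not reflect how dScribe stores lineage.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it reads from.
UPSTREAM = {
    "Sales dashboard": ["sales_dataset"],
    "sales_dataset": ["fact_orders", "dim_customer"],
    "fact_orders": ["raw_orders"],
    "dim_customer": ["raw_customers"],
}


def upstream_assets(start: str, upstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from an asset to everything it depends on."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:  # guard against shared or cyclic dependencies
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


if __name__ == "__main__":
    for asset in upstream_assets("Sales dashboard", UPSTREAM):
        print(asset)
```

Breadth-first order lists the nearest dependencies first, which matches how a user reads lineage: dataset, then tables, then raw sources.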


Rolling it out

You don't need to document every table on day one. A practical sequence:

  1. Connect your most-used platform first. Most organizations start with the warehouse or lakehouse that feeds their primary reporting.

  2. Document your source-of-truth tables. These are the ones that downstream reports and analyses depend on.

  3. Assign ownership before descriptions. Even an undocumented table is more useful when users know who owns it.

  4. Mark deprecated tables explicitly. A clear "do not use" status prevents new work from being built on dead data.

  5. Layer in linked glossary definitions over time. These get more valuable as your catalog grows.

Most organizations find that the top 50–100 tables generate the bulk of their analytical value. Start there.
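
One way to pick that starting set is to rank tables by how many reports depend on them. The Python sketch below assumes you can export a mapping of reports to their source tables. The mapping, the table names, and the use of report count as a proxy for analytical value are all assumptions for illustration.

```python
from collections import Counter


def rank_tables(report_sources: dict[str, list[str]], top_n: int) -> list[tuple[str, int]]:
    """Rank tables by how many distinct reports depend on them."""
    counts = Counter()
    for tables in report_sources.values():
        counts.update(set(tables))  # count each table once per report
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]


if __name__ == "__main__":
    # Hypothetical export of which tables each report reads from.
    report_sources = {
        "Revenue": ["fact_orders", "dim_customer"],
        "Churn": ["dim_customer"],
        "Operations": ["fact_orders", "dim_customer", "raw_events"],
    }
    for table, report_count in rank_tables(report_sources, top_n=2):
        print(table, report_count)
```

Report count is a rough signal. A table that feeds a single board-level report may still deserve a place near the top of the list.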


Where to go next

→ Catalog: overview of how assets work in dScribe
→ Glossary: define the business terms your tables represent
→ Documenting your reporting landscape: the matching use case for the BI side
→ Connector articles: setup guides for Databricks, Fabric, Snowflake, BigQuery, etc.


Have a question or can't find what you're looking for? Use the chat icon in the bottom right to reach the dScribe support team.
