The QVR Framework for Data Source Valuation
Given a set of candidate data sources, how do you know which of them best fit your needs?
The QVR framework offers a series of prompts to answer that question. It addresses the which and why (not) of data sources, not the what: you cannot use it to find suitable data sources in the first place.
Four stages
The QVR framework proceeds in four stages:
1. Is the quality of the data sufficient for your purposes?
   - If it is not, discard the data source.
   - If its quality is satisfactory, continue.
2. What is the business value based on the quality assessment?
   - If the benefits do not outweigh the costs, drop the data source.
   - If both the quality and value check out, proceed.
3. What are the risks associated with the data source, and how can you mitigate them?
   - If the risks cannot be mitigated sufficiently, judge whether they are acceptable given the expected value.
   - If they can be mitigated, score the data source.
4. Compute the QVR score to rank the data sources.
Stage 1: Quality
Data quality has many dimensions along which you can assess a data source’s viability. Most dimensions can be assessed based on a sample and/or aggregates. Data engineers or data scientists with domain knowledge are in the best position to assess the quality.
Availability
Is the data set available and accessible to users who are authorized to access it?
- What is the means of data delivery: REST API, SFTP, blob storage, …?
- Does data access require authentication?
- Are there limitations on data access?
- In what format is the data made available?
- Are special libraries needed to read the data?
- Who maintains these libraries?
Timeliness
Is the data set available on time?
- What is the delivery schedule: weekly, daily, hourly, every 5 minutes, …, real time?
- Is on-time vs. late delivery measured?
- Are SLAs in place?
- What recourse exists in case of delays in data availability?
- Are there any technical constraints (e.g. quotas, rate limits) that can negatively affect the timeliness?
Freshness
Is the data source up to date (i.e. fresh)?
What is the difference between timeliness and freshness?
Think of it in terms of fruit. A delivery company brings a box of apples to the supermarket every day at 8 o'clock. If the delivery is on time, the box arrives within a few minutes of 8 AM every day. The apples (i.e. the data) could have been harvested ages ago (stale) or just before loading the truck (fresh). If the apples are fresh and on time, they were plucked off the trees moments before being shipped and the shipment arrived on schedule. But apples can also arrive rotten (i.e. not fresh) yet on time, or fresh but late: they were harvested just before shipping out, but the truck got stuck in traffic.
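In code, timeliness and freshness boil down to two different time deltas. A minimal sketch (the field names and thresholds are illustrative assumptions, not part of the framework):

```python
from datetime import datetime, timedelta

# Hypothetical metadata for one delivery; field names are illustrative.
delivery = {
    "scheduled_at": datetime(2024, 1, 15, 8, 0),   # agreed delivery time
    "delivered_at": datetime(2024, 1, 15, 8, 3),   # actual arrival of the batch
    "harvested_at": datetime(2024, 1, 15, 6, 30),  # when the data (the apples) was produced
}

# Timeliness: how far off schedule did the batch arrive?
lateness = delivery["delivered_at"] - delivery["scheduled_at"]
on_time = lateness <= timedelta(minutes=5)

# Freshness: how old was the data when it arrived?
age_at_delivery = delivery["delivered_at"] - delivery["harvested_at"]
fresh = age_at_delivery <= timedelta(hours=24)

print(f"on time: {on_time} (off by {lateness}), fresh: {fresh} (age {age_at_delivery})")
```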
Completeness
Is the data set complete?
- Are all records included in the data source?
- Are record counters tracked over time?
- Are aggregates tracked over time?
- Is the metadata complete? Do all records have appropriate metadata?
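Tracking record counts over time is one way to catch incomplete deliveries mechanically. A minimal sketch (the counts and the 10% threshold are made-up assumptions):

```python
# Daily record counts from previous deliveries (illustrative numbers).
history = [10_120, 10_340, 10_095, 10_410, 10_280]
today = 7_850

# Flag today's delivery if it falls well below the trailing average.
baseline = sum(history) / len(history)
if today < 0.9 * baseline:
    print(f"possibly incomplete delivery: {today} records vs. baseline {baseline:.0f}")
```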
Validity
Are all fields semantically correct and stored with appropriate data types?
- Are the schemas documented? Are schemas backwards compatible?
- Are semantic checks in place to ensure the validity of individual fields?
- Do numerical values have ranges of validity specified?
- Do categorical values have predefined values specified?
- Is the number of significant digits published?
- Are error ranges for measurements specified?
- Are the correct data types used (where applicable)?
- Are dates and times stored properly? Do they include time zones?
- Are monetary figures stored with the right precision (e.g. 2 decimals for most currencies)?
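Several of the checks above can be automated once the schema is known. A minimal sketch in plain Python (the record, field names, ranges, and allowed categories are all assumptions for illustration):

```python
from datetime import datetime
from decimal import Decimal

# Illustrative record; field names, ranges, and categories are assumptions.
record = {
    "temperature_c": 21.4,                        # numeric field with a validity range
    "category": "apples",                         # categorical field with predefined values
    "delivered_at": "2024-01-15T08:03:00+00:00",  # timestamp that must carry a time zone
    "price": "3.99",                              # monetary figure with fixed precision
}

errors = []

# Numerical values must fall within their specified range of validity.
if not -50.0 <= record["temperature_c"] <= 60.0:
    errors.append("temperature_c outside its validity range")

# Categorical values must come from the predefined set.
if record["category"] not in {"apples", "oranges", "pears"}:
    errors.append("category not among the allowed values")

# Dates and times must parse and include a time zone.
if datetime.fromisoformat(record["delivered_at"]).tzinfo is None:
    errors.append("delivered_at lacks a time zone")

# Monetary figures should be stored with exactly two decimal places.
if Decimal(record["price"]).as_tuple().exponent != -2:
    errors.append("price is not stored with two decimal places")

print(errors or "record is valid")
```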
Consistency
Are all fields and records defined, understood, measured, and stored consistently?
- Are relationships among fields and data sets within the data source defined?
- How are identifiers assigned?
- Are identifiers stable?
- If identifiers are repurposed, is that documented?
- If identifiers are repurposed, do they all come with valid date ranges?
- Are identifiers unique across data sets within the same data source?
- Is the naming consistent across the data source?
- Are invariants for (combinations of) fields defined?
- How are missing values encoded (e.g. NULL, N/A, blank)?
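Invariants and identifier rules such as these also lend themselves to automated checks. A minimal sketch (the records and field names are assumptions, echoing the delivery-log example used later in this article):

```python
# Illustrative batch of warehouse log records; field names are assumptions.
# ISO-8601 timestamps compare correctly as strings.
records = [
    {"id": "box-1", "entered_at": "2024-01-15T08:03", "exited_at": "2024-01-15T09:10"},
    {"id": "box-2", "entered_at": "2024-01-15T08:05", "exited_at": "2024-01-15T07:55"},
]

issues = []

# Invariant: every box must be logged as entering before it exits.
for r in records:
    if r["exited_at"] <= r["entered_at"]:
        issues.append(f"{r['id']}: exit logged before entry")

# Identifiers must be unique within the batch.
ids = [r["id"] for r in records]
if len(ids) != len(set(ids)):
    issues.append("duplicate identifiers in batch")

print(issues or "batch is consistent")
```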
Correctness
Is the data set accurate, reliable, or at least fit for purpose?
- For measurements, are calibration procedures documented?
- Are the reported values plausible?
- For matching data with existing internal sources, are values corroborated?
- Is the data biased? Is there a difference in accuracy of the data for different demographic or geographic slices?
What is the difference between validity, consistency, and correctness?
Let's go back to our fruit delivery, but look at it from the perspective of the system that logs arrivals and departures of goods. The system must check if what's being delivered is valid, that is, one of the expected deliveries. If that checks out, it registers the arrival, so that the staff can take the box and place its contents in the shop. The logs will thus show the entry of a box of apples and later the exit of a box of apples. If the arrival is logged after the removal, the data would not be consistent because an invariant (enter before exit) would have been violated. However, it turns out the box was mislabelled and it contained oranges. So, the data in the system is valid and consistent but unfortunately not correct.
Trustworthiness
Is the data source trustworthy?
- Are intended use cases and limitations documented?
- What is the reputation of the data source and/or vendor?
- What is the provenance of the data? From which sources has it been derived?
- If the data is derived (i.e. not raw), what operations have been applied?
- Is there external corroboration of the data source’s quality?
- How do data distributions drift over time? Is that reasonable to expect?
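The last question, distribution drift, can be monitored mechanically once you have samples from two periods. A minimal sketch using a two-sample Kolmogorov–Smirnov test (scipy assumed available; the data is synthetic):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples of the same field from two consecutive months.
rng = np.random.default_rng(42)
last_month = rng.normal(loc=100.0, scale=10.0, size=5_000)
this_month = rng.normal(loc=104.0, scale=10.0, size=5_000)  # the mean has shifted

# A small p-value suggests the distributions differ, i.e. the data has drifted.
result = ks_2samp(last_month, this_month)
if result.pvalue < 0.01:
    print(f"distribution drift detected (KS statistic {result.statistic:.3f})")
```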
What if you cannot answer questions about provenance?
If the origins of the data are hazy, if it is unclear what transformations have been applied, or if those transformations are too complex to describe, the data source has a higher likelihood of containing bugs and is by definition less trustworthy. In such cases, independent verification (e.g. in the scientific community) or at the very least social proof (e.g. a large user base of well-known companies) is required to increase trustworthiness.
Understandability
Is it clear what the data source contains?
- Is there documentation on the data source?
- Is there documentation for each field?
- Is the documentation kept up to date?
Confidentiality
Are there any concerns with using the data source outside of a specific context? Are you free to use it in all products and for any customer or industry?
Ethics
Is there an ethical/moral dimension to the usage of the data source?
Compliance
Does the data source comply with applicable regulations in the countries in which you operate?
- Is personal information properly obfuscated?
- Are identifiers that can be used to identify people obfuscated?
- If access to un-obfuscated data is required, do you have appropriate access controls in place?
- Is personal information stored only for its required purpose and the required duration?
- Is personal information deletable upon request?
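Obfuscation can take many forms; one common form is keyed pseudonymization, where personal identifiers are replaced by stable pseudonyms. A minimal sketch (key handling and whether this satisfies your regulatory obligations are out of scope; the key and identifier are placeholders):

```python
import hashlib
import hmac

# Secret pseudonymization key; in practice this lives in a secrets manager.
KEY = b"replace-with-a-real-secret"

def pseudonymize(identifier: str) -> str:
    """Replace a personal identifier with a stable, keyed pseudonym."""
    return hmac.new(KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always yields the same pseudonym, so records remain joinable.
print(pseudonymize("jane.doe@example.com"))
```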
Stage 2: Value
The best-case value depends on the quality of the data and of course its primary purpose. This is a task for product management.
Main goal
What is the main goal of the data source?
- Replace existing data sources (e.g. deprecation)
- Improve machine learning models and/or product performance
- Expansion of capabilities
Best-case value
In the best possible scenario, what is the value of that data to the company and/or customers?
Variety
Does the data source increase overall data variety in the company or is it more of the same?
- Assess overlap with existing internal sources: do new sources complement existing sources, provide additional validation, or is there no real marginal value?
- Assess overlap with business goals and/or product weak spots: does the new source provide data in areas where you are underserving or underperforming?
Cost
What is the total expected cost of the data source?
- What is the estimated cost of acquisition, integration, and maintenance of the data source?
- What is the expected lifetime/relevance of the data?
The costs of integration and maintenance are a direct consequence of the quality of the source: if a source requires a lot of cleansing and monitoring, these costs must be factored into the calculation. Equally, an understanding of data quality is needed to compare, say, a high-quality but very expensive source against a lower-quality but cheap one; without it, the comparison would be unfair.
Stage 3: Risk
Here engineering and product come together to assess the risk associated with additional data sources.
- How unique is the data source?
- How unique is the vendor?
- What alternate/backup sources are available?
- How do the backup sources compare in terms of quality and value?
- How easy is it to swap data sources and what issues can you expect?
- How reliable is the vendor?
- What is the reputation of the vendor?
- Can the vendor be trusted to maintain the data source?
- Can the vendor be trusted to maintain the price for the data source?
- Is the vendor a possible competitor, i.e. is there a risk they may not serve you in the future?
- Do they publish (breaking) changes?
- Do they have a clear deprecation path with grace periods?
- Do the terms and conditions prohibit usage of the data in any way?
- What if the terms and conditions change over time?
Stage 4: QVR score
The QVR score for each data source is just the product of the quality, value, and risk scores:
- Quality: 1 (low/bad) – 2 (medium/acceptable) – 3 (high/good)
- Value: 0 (none) – 1 (lowest) – 2 (low) – 3 (medium) – 4 (high) – 5 (highest)
- Risk: 1 (high) – 2 (medium) – 3 (low)
There are 20 distinct QVR scores, sufficient to have a ranked list of viable data sources. A data source of high quality (3) that is expected to deliver the highest value (5) with little risk (3) has the maximum score of 45.
The reason value has a larger range is that there must be some significant value for quality or risk to matter at all; if there is no value whatsoever, neither the quality nor the risk matters. Still, a lack of (marginal) value may very well depend on the quality of the data. We can therefore not short-circuit the decision unless the lack of value is obvious from the outset.
Example
In the following I have listed five data sources (A, B, C, D, E) with different values for quality, value, and risk. Both A and B are of high quality. Even though the expected value is higher for B, the associated risk is also much higher than for A. Hence, A is the preferred choice.
E is also of high quality, but it provides no value. Hence, its QVR score is zero: the data source is irrelevant, regardless of its high quality and low risk.
| Source | Quality | Value | Risk | QVR score | Rank |
|---|---|---|---|---|---|
| A | 3 (good) | 3 (medium) | 3 (low) | 27 | 1 |
| B | 3 (good) | 5 (highest) | 1 (high) | 15 | 3 |
| C | 1 (bad) | 2 (low) | 3 (low) | 6 | 4 |
| D | 2 (acceptable) | 4 (high) | 2 (medium) | 16 | 2 |
| E | 3 (good) | 0 (none) | 3 (low) | 0 | N/A |
If all sources are interchangeable (i.e. similar), D and B can act as backups for A. The difference in risk comes from, say, a vendor being unreliable or having unfavourable terms and conditions.
If the sources are not interchangeable (i.e. dissimilar), a decision can still be made whether B’s highest value is worth the risk. Such a risk could arise from a data source being unique in the market; there may be no backup (vendor lock-in). An exclusive partnership or licence may be a way of decreasing that risk.
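For completeness, a minimal sketch of how the scores and ranking in the table above can be computed (quality and risk run from 1 to 3, value from 0 to 5, and higher is better throughout):

```python
# (quality, value, risk) per source, taken from the table above.
sources = {
    "A": (3, 3, 3),
    "B": (3, 5, 1),
    "C": (1, 2, 3),
    "D": (2, 4, 2),
    "E": (3, 0, 3),
}

# The QVR score is simply the product of the three components.
scores = {name: q * v * r for name, (q, v, r) in sources.items()}

# Sources with a zero score offer no value and drop out of the ranking.
ranking = sorted((name for name, score in scores.items() if score > 0),
                 key=scores.get, reverse=True)

print(scores)   # {'A': 27, 'B': 15, 'C': 6, 'D': 16, 'E': 0}
print(ranking)  # ['A', 'D', 'B', 'C']
```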
FAQ
Why do you need QVR at all?
Suppose you have a set of potentially interesting data sources and are asked to decide whether to use any of them. How do you approach that problem, particularly if at least one is free and another is expensive?
An engineer would likely approach the problem from the perspective of data quality and the difficulty of integration, possibly combined with the acquisition cost. A product manager would look at the potential value of that data to the business and risks associated with it, including perhaps the trustworthiness of the source. The problem is that quality and value are linked. QVR approaches the valuation of data sources as a combined effort.
What other frameworks exist?
Many frameworks for the valuation of data sources focus exclusively on quality (e.g. ORME-DQ, Databand) or on (organizational) maturity (e.g. FAIR, DataFlux). Others are overly generic (e.g. Anmut, OFDV) or mostly relevant to internal data assets (e.g. Deloitte). Some treat quality assessment as a one-off exercise rather than something to be monitored continuously (e.g. BDQ), and others ignore the business implications, such as value and risk (e.g. DVF).
Various measures of information value are either lagging indicators (i.e. they do not tell you anything ahead of time) or they are leading indicators but very limited in scope.
Is the assessment of quality before value/risk not wasteful?
No. The QVR framework ensures a reasonable balance of quality, value, and risk. If a decision were based on value alone, there might be nasty surprises once the data is in use: the quality may turn out to be insufficient or the costs prohibitive. Likewise, if a decision were based on quality alone, the value may turn out to be limited.
Why not perform A/B tests for the estimation of value?
Because you only obtain an answer to whether a data source is valuable after doing all the work of acquiring and integrating it. It may be possible to check whether a sample of data has any measurable impact, but that needs to be assessed on a case-by-case basis.