Add class ReferenceData #1193

Open
6 tasks
rjyounes opened this issue Feb 12, 2025 · 18 comments
Labels
impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range)

Comments

@rjyounes
Collaborator

rjyounes commented Feb 12, 2025

  • Add class (no definition yet)
  • Make gist:Aspect a subclass of gist:ReferenceData.
  • Make gist:UnitOfMeasure a subclass of gist:ReferenceData.
  • Make gist:Language a subclass of gist:ReferenceData.
  • Make gist:Category a subclass of gist:ReferenceData.
gist:Category
	a owl:Class ;
	rdfs:subClassOf
		gist:ReferenceData ,
		[
			a owl:Restriction ;
			owl:onProperty gist:isAllocatedBy ;
			owl:someValuesFrom [
				a owl:Class ;
				owl:unionOf (
					gist:IntellectualProperty
					gist:Organization
					gist:Person
				) ;
			] ;
		]
	.
  • Check disjointness. E.g., currently gist:Organization is disjoint with gist:UnitOfMeasure. This should be changed to gist:ReferenceData, presumably.
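As a rough sketch of what the disjointness task above might amount to: the snippet below assumes the axiom is written with `owl:disjointWith` directly on `gist:Organization` (gist may serialize it differently, for example with `owl:AllDisjointClasses`) and that `gist:ReferenceData` is the class proposed in this issue, which does not exist in gist.

```turtle
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

# Current axiom, as described in the task list (shown commented out):
# gist:Organization owl:disjointWith gist:UnitOfMeasure .

# Presumed replacement. gist:ReferenceData is the PROPOSED class, not part of gist.
gist:Organization owl:disjointWith gist:ReferenceData .
```

If the subclass axioms in the task list were also adopted, this one axiom would make gist:Organization disjoint with gist:Aspect, gist:Category, gist:Language, and gist:UnitOfMeasure as well, which is a wider change than the current axiom and part of why it needs checking.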
@rjyounes rjyounes added the impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range) label Feb 12, 2025
@rjyounes rjyounes changed the title Make gist:Category a subclass of gist:ReferenceData Add class ReferenceData Feb 12, 2025
@Jamie-SA
Contributor

Would you really want to say that you can have Organizations as reference data? What about importing a list of Organizations from IndustryKG, DnB, or maybe some day from an official Organization licensing entity?

@philblackwood
Contributor

I'm not sure reference data is a real-world thing. It's more of a way we think about and use a set of data.

@uscholdm
Contributor

This does not make any sense to me. What is the rationale? What are the pros and cons? One con is that it is not intuitive.

@rjyounes
Collaborator Author

rjyounes commented Feb 13, 2025

> Would you really want to say that you can have Organizations as reference data?

Organizations are not reference data in this model; categories are both types of reference data and allocated by organizations (or people or IP). This is the same thing we do now with IDs.

@rjyounes
Collaborator Author

rjyounes commented Feb 27, 2025

Jamie: Reference data is a feature of the data and how it's used, not inherent to the data. Organizations could be reference data if you use a controlled vocabulary.
Phil: Ditto
Michael: Data is a triple, so this is a misnomer.
Peter: Is this a synonym for master data?
Dave: Master data is often things like customers, which we wouldn't consider reference data.
Ryan: Example: reference data from Mitre. It's a set of triples that are relevant to my use case and provided by an external source.
Dan: Just as we say that one person's metadata is another person's data, so one person's reference data is another person's non-reference data. Not appropriate as a class.

Dave: I want a covering concept for things we make up and are not real-world things. A lot of things refer to them, but they don't refer to other things (much). Originally thought of this as immutable, whereas master data is highly mutable.

Michael: These things are similar at the metadata level (information about classes), not about the instances of the classes.
Dan: We introduced Artifact as a covering concept and have now backed away from it. It feels like we are now hiding the concepts that you actually want to get at.

Rebecca: take language out - you can say a lot of things about languages.
Dave: OK with taking Language out.

Michael/David: What is the use case? Is it just to tidy up the top level of the class hierarchy?
Dave: Things that don't have independent existence.
Peter: Reference data tends to be static. Things that you look up in another source.
Ryan: Reference data is standard to an industry. New entities can be added. It's stable but can grow.
Michael: Having a covering class doesn't hurt anything, so why not just have it?
Scott: In principle, objects to defining something just because it doesn't do any harm. Wants clarification about independent existence.
Dave: Tables, chairs, persons exist independently.
Scott: What about Ownership?
Dave: That's a modeling pattern allowing us to say things about relationships.
Ryan: We may be making gist hard to use by adding concepts like this.

SchemaMetaData will be gone in gist 14.

Rebecca: we need to pin down exactly what this concept means. We've mentioned various criteria but haven't agreed on which are defining:

  • Immutable
  • Don't have independent existence
  • Things refer to them but they don't refer to other things much
  • Industry standard
  • Come from an external source

@MichaelSullivanArchitect

MichaelSullivanArchitect commented Feb 28, 2025

On a somewhat related note, we are considering adding a new extension for our concept of a Person, namely that of Persona. Reason? An online presence is NOT the same thing as the Person that "manages" that online presence. And individuals (and organizations too) typically have many personas. To wit: my LinkedIn persona is not "me". Neither is my Facebook persona. One is focused and tailored to business purposes, the other to social/family. In addition to those, I also have political personas, art-related personas, automobile-related personas, motorcycle-related personas, etc., with very little overlap. So while there can be persona-to-persona relationships (e.g., someone "follows" one of your personas), these may or may not correspond with actual person-to-person relationships. Anyway, I'd like to suggest considering such a refinement.

@rjyounes
Collaborator Author

@MichaelSullivanArchitect I think the concept of Persona has value, but I don't see it as a subclass of Person. People have personas, but they are not themselves personas.

@MichaelSullivanArchitect

Agree on not subclassing it. Thanks for the feedback.

@rjyounes
Collaborator Author

rjyounes commented Mar 13, 2025

SA dev team meeting:

Phil: Just proposed in order to make the periodic table work?

Rebecca: I feel like there's something there, but not sure how to define it.

Mark: Feels like it's proposed as a covering class, when not necessary (as Dan said last meeting); similar to Artifact. Also agree with Phil. Periodic table shouldn't be so important that it makes us put in classes we don't need. It's a useful visual.

Cheryl: Is an example something like tax tables (look-up data)?

Mark: If those are rules, that sounds like a spec. Reference data could be an aspect of something.

Doug: Likes Dan's comment. Categories are stubs for things we don't want to model out in detail, but they could be with different use cases. So not inherently reference data.

We asked if anyone wants the class and is willing to defend it; no one raised their hand.

Pel: Not useful in enough contexts to warrant it being included in gist.

DECISION: Do not define this class.

@mkumba please note. Can keep as column header in gist 14 periodic table.

@rjyounes rjyounes removed the status in gist Version 14.0.0 Mar 13, 2025
@rjyounes rjyounes closed this as not planned Mar 13, 2025
@mkumba
Contributor

mkumba commented Mar 13, 2025

Most large companies have "reference data." The finance industry calls it ref data. (For the longest time at Goldman Sachs I thought people were saying "rough data," which I supposed was related to "rough computing" (https://www.sciencedirect.com/science/article/abs/pii/S0952197620302529), but it turned out to have nothing to do with that; it was ref data.) Most other industries have different names for it, but everyone has it. It is mostly static, simple lookup tables, and many companies refer to it as their lookup tables. About every decade or so they have a project to round up all the enums and lookup tables in all their systems and put them in one place. Then they go back to sprawl for a while. I think it's a real thing (ask Mike Atkin, by the way).

The real problem is its scope. Most firms include country codes in their ref data; clearly we would not. Or maybe we should: we could have the ISO codes in ref data but the real country in geospatial. That's not a bad idea, given the ambiguity around the land mass and the government. We could have both the Ukrainian land mass (constantly changing) and the Ukrainian government (hopefully prevailing) point to UKR and .ua. ISO seems to favor land mass over government (Antarctica and Western Sahara) but has Palestine, and in most uses seems not to distinguish. So maybe the way to maintain the ambiguity that most people seem to like is to keep the ambiguity in the ref data.

Most companies include currency codes in ref data, and we will if we have units of measure in our ref data. Some have units of measure, although it's curious that not all do. Most of what is in most ref data is what we call categories. The classic ref data is gender. I think what most firms think of are things that they can put in a single large table where the columns are "group" (what we would call the subcategory), "code", "short label", "description" and "definition". People often like to have codes in their ref data so they can provide labels in multiple languages.

Note: Master data is not ref data. Specifications (which is really product master data) is not ref data.

I suggested putting "language" in here and you guys talked me out of it, but the more I think about it, the more I think language belongs here as well. The main use of language as reference data is the short codes that go in language-coded strings, and references to which languages are spoken in which countries.

Most reference data has a very simple (and probably should have a nearly identical) structure, like the table above; it should change very slowly (most companies update their ref data annually, so it is almost immutable); and it has what Katariina Kari calls high page rank items (lots of links in, few links out).

Tibco's definition is ok (except they want to make it part of the domain of master data, and then turn around and include transaction codes, which clearly aren't)
https://www.tibco.com/glossary/what-is-reference-data

Wikipedia was pretty good, although I didn't follow their calendar example. (Although, as I think of it again, financial services have a lot of codes about whether to use 360, 365 or 365.25 days in the calculation of interest, and things like that all have codes.) Again, check with Mike Atkin; he's working on this stuff as we speak.
https://en.wikipedia.org/wiki/Reference_data

Collibra is also pretty good and very similar, though they lumped in product codes (and pricing!), which I think is wrong.
https://www.collibra.com/blog/what-is-reference-data

Atlan picked up one I'd missed: TimeZones
https://atlan.com/reference-data/#5-fundamental-examples-of-reference-data

Starburst makes a few interesting distinctions. They correctly distinguish ref data from master, transaction, and analytic data. They point out that changing master data typically doesn't affect workflow, whereas changing ref data might. They distinguish external ref data (country and currency codes) from internal (business units and transaction types) and point out the value in doing data virtualization.
https://www.starburst.io/data-glossary/reference-data/

Even IBM likes reference data
https://www.ibm.com/docs/en/cloud-paks/cp-data/5.1.x?topic=artifacts-reference-data

But if you guys really want to ditch it and make the periodic table look like crap, ok.

@mkumba
Contributor

mkumba commented Mar 15, 2025

Also, don't know if you caught it, but even TopQuadrant, in Steve Hedden's architecture, had a prominent place for Reference Data.

@uscholdm
Contributor

@mkumba
In my understanding, there is broad consensus that

  • the concept of reference data is real, important and widely used in a typical enterprise.
  • any deployed knowledge graph at any client site will almost certainly include reference data.
  • reference data would ideally show up in the gist periodic table (GPT) if only as a covering concept/category like Aggregate and Temporal.
  • :ReferenceData is not appropriate as a class
    • Mark: Periodic table shouldn't be so important that it makes us put in classes we don't need.

Let's imagine a class called data; reference data would be a subclass. Let's call an instance a datum.

Points against include:

  • Jamie: Reference data is a feature of the data and how it's used, not inherent to the data.
    • Some but not all categories will be considered to be reference data, so :Category cannot be a subclass of :ReferenceData
  • it is not clear what an individual datum would look like.
  • No one could come up with a compelling use case for having such a class.

On the penultimate point, my first thought is that an individual reference datum is a piece of information, an assertion represented as a triple.

  • :_Gender_male a :Gender
  • :_Country_USA :isIdentifiedBy :_CountryCode_USA

Or no, maybe it's just an individual? We don't want to be reifying triples.

  • :_Gender_male
  • :_CountryCode_USA
  • :_Country_USA

@MichaelSullivanArchitect

Just to add to this: we describe any data that directly supports the ontology as "curated" data. Do with that as you will.

@rjyounes
Collaborator Author

rjyounes commented Mar 27, 2025

> I suggested putting "language" in here and you guys talked me out of it, but the more I think about it, the more I think language belongs here as well. The main use of language as reference data is the short codes that go in language-coded strings, and references to which languages are spoken in which countries.

We can do the same thing here that you propose with countries and country codes: language codes are reference data, but the languages themselves are not. They have internal properties, relationships to people and countries who speak them, etc. (In fact, the Tibco and Atlan references list language codes, but not languages, as reference data.)

Starburst includes customer segments and pricing as reference data. It's not clear that all uses would consider these unstructured categories.

Looking at the examples that recur in all the sources, we have things like postal codes, country codes, language codes, industry codes, etc. These can be modeled as identifiers. Not all categories that we define are reference data. So the concept still doesn't seem well-defined enough to constitute a class.
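A minimal sketch of the "codes as identifiers" reading above, keeping the language separate from its code. The `ex:` namespace and both individuals are hypothetical, and the use of `gist:ID`, `gist:isIdentifiedBy`, and `gist:uniqueText` here is an assumption about how one might model it, not a pattern agreed in this thread.

```turtle
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.com/> .    # hypothetical namespace

# The language itself: it can carry its own properties and relationships.
ex:_Language_French
	a gist:Language ;
	gist:isIdentifiedBy ex:_LanguageCode_fr ;
	.

# The code: a plain identifier, the kind of thing the cited sources list as reference data.
ex:_LanguageCode_fr
	a gist:ID ;
	gist:uniqueText "fr"^^xsd:string ;
	.
```

On this reading the code is already covered by the identifier pattern, so no additional class is needed to represent it.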

@rjyounes
Collaborator Author

Reopening to discuss additional comments from Dave.

@rjyounes
Collaborator Author

gist dev team meeting 4/24:

Rebecca: Distinguish codes from the things they identify - e.g., language codes vs languages, country codes vs countries, etc.

Pel: Data is reference data relative to how it's used, not in and of itself. Relativizing things seems like the role of a predicate rather than a class.

Dave: The idea that some people's reference data is someone else's data is interesting but not widely accepted. Master data is vendors, customers, products, etc. - different from reference data.

Rebecca: A client could define a class as a subclass of ReferenceData that another wouldn't.
Libraries have traditionally thought of data about works as metadata (e.g., author).

Jamie: Schema metadata argument not relevant to reference data, because it's not reference data.
Makes sense to think of ReferenceData as a class, but not as part of the hierarchy. You could double-class things so you know it's reference data.

Rebecca: Units of measure are definitely reference data.

Dan: Reference data feels like artifact - a class that we add as an organizing, covering class, and then decide that's not the way we want to organize things.

Michael: Would call Aspect SchemaMetaData.

Straw poll: Define ReferenceData. (subclasses a separate issue)
For: 4
Against: 3
Neutral: 4
Abstain: 1
DECISION: Provide a definition, including subclasses, and review as a group. Assigned to Dave.
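For illustration only, the two usage patterns raised in these notes (Jamie's double-classing and Rebecca's client-specific subclassing) might look roughly like the sketch below. `gist:ReferenceData` is the class still under discussion, and the `ex:` namespace, `ex:Gender`, `ex:IndustryCode`, and `ex:_Gender_male` are hypothetical names.

```turtle
@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <https://example.com/> .    # hypothetical client namespace

# PROPOSED class, declared here without placing it in the gist hierarchy.
gist:ReferenceData a owl:Class .

# Double-classing: the individual is marked as reference data
# without gist:Category itself becoming a subclass of gist:ReferenceData.
ex:Gender rdfs:subClassOf gist:Category .
ex:_Gender_male a ex:Gender , gist:ReferenceData .

# Client-specific subclassing: one client treats all of its industry codes
# as reference data; another client is free not to assert this.
ex:IndustryCode rdfs:subClassOf gist:Category , gist:ReferenceData .
```

Either pattern leaves the choice with the client, which fits the point made above that data is reference data relative to how it is used.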

@rjyounes
Collaborator Author

rjyounes commented Apr 25, 2025

Definition from Dave:

gist:ReferenceData
	a owl:Class ;
	skos:definition "Reference data is data used to classify or categorize other data (https://en.wikipedia.org/wiki/Reference_data). Typically, they are static or slowly changing over time."^^xsd:string ;
	skos:prefLabel "Reference Data"^^xsd:string ;
	skos:scopeNote "We agree with Wikipedia's definition generally; in particular, they note that it differs from master data; we would therefore not include corporate codes (as they had)."^^xsd:string ;
	.

gist:Aspect rdfs:subClassOf gist:ReferenceData .

gist:Category rdfs:subClassOf gist:ReferenceData .

gist:UnitOfMeasure rdfs:subClassOf gist:ReferenceData .

We may want to add something to the definition about the fact that reference data is frequently (though not always) a standard across an industry or domain and shared across organizations.

Projects
None yet
Development

No branches or pull requests

6 participants