In Search of Metadata, an Open-Source Perspective

We live in a software-defined world. We use many software applications every day, yet silos persist: the effort of hooking up different systems is huge and painful, if not impossible. In an LF-AI initiative named Open BI/AI Connectivity (OBAIC), we are exploring ways of connecting BI and AI systems.

So, what is the scenario and the missing piece here? Suppose you are a business analyst working on a BI dashboard, and a fellow data scientist tells you that he has a predictive model that can forecast your inventory pressure for the next three months. This sounds very appealing, but the model was built and is hosted on a separate AI platform owned by the data science team, so how would you call it to get the prediction? Ideally, the BI system could ask the AI platform for the details of that model, organize data per the model’s requirements, and invoke some sort of RESTful API provided by the AI platform to make the prediction. From here we can see that there are two types of things the two systems need to communicate: the model itself, and the data fed into and yielded by the model. This article documents some of my research in search of that metadata.

Metadata for Model

Let’s look at the model itself first. It is not necessary for the data science team to share the whole model with the BI system. The AI platform, or the data science team, still owns the model, but it needs to let the business user know what kind of problem the model can solve, how good the model is, and furthermore how to use it (input) and interpret it (output).

There are many ways of representing models. In the open-source world, if your background is in machine learning you probably know the Predictive Model Markup Language (PMML) or its successor, the Portable Format for Analytics (PFA). If you work in the field of deep learning, the Open Neural Network Exchange (ONNX) shouldn’t sound unfamiliar to you. You might also be using models created with open libraries in Python, R, Julia, and so on.

No matter how the model was created, the thing to figure out here is how an AI platform manages and publishes that information. If you choose not to implement your own, there are some open-source projects for reference. ACUMOS is an LF-AI project that provides a model marketplace for interested parties. On the marketplace, you can find the expected input of a model (such as the default algorithm/strategy used during prediction) as well as its expected output, which may vary with different model settings, for instance a prediction along with an associated confidence.

MLX is another LF-AI project that provides catalogs for users to define data and model pipelines. It allows the upload, registration, execution, and deployment of AI pipelines, pipeline components, models, datasets, and notebooks.

To recap, when we narrow the scope down to the model, we purposefully separate information about it into the following two categories (a minimal sketch follows the list):

  • Metadata for human reading, such as name, algorithm, tags, creator, description, performance, user ratings, etc.
  • Metadata for machine processing, such as ID, revision, format, dependency, input/output schema, parameters, etc.
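
To make the two categories concrete, here is a minimal sketch of what such model metadata could look like. The field names and values are purely illustrative assumptions, not taken from PMML, PFA, ONNX, ACUMOS, or MLX.

```python
# Illustrative model metadata, split into the two categories above.
# All names and values are assumptions for this sketch.

model_card = {
    # metadata for human reading
    "name": "Inventory Pressure Forecast",
    "algorithm": "gradient boosted trees",
    "creator": "data-science-team",
    "tags": ["forecasting", "inventory"],
    "description": "Forecasts inventory pressure for the next three months.",
    "performance": {"rmse": 0.12},
    "user_rating": 4.5,
}

model_descriptor = {
    # metadata for machine processing
    "id": "inventory-forecast",
    "revision": 3,
    "format": "onnx",                          # or pmml, pfa, pickle, ...
    "dependencies": ["onnxruntime>=1.8"],
    "input_schema": [{"name": "units_sold", "type": "integer"}],
    "output_schema": [{"name": "prediction", "type": "double"},
                      {"name": "confidence", "type": "double"}],
    "parameters": {"horizon_months": 3},
}
```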

Metadata for Data

Now let’s check the data part. BI and AI systems do not necessarily share the same storage. Consider the most basic, pass-by-value scenario: the AI system tells the BI system what kind of data the model expects and what type of output the model will produce; the BI system then sends a chunk of data to the AI system and checks for the result in the agreed format.
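
As a rough sketch of that pass-by-value exchange, the BI side might post a batch of rows to a scoring endpoint and read back the prediction. The endpoint, payload shape, and response shape below are hypothetical; the real contract is whatever the two platforms agree on.

```python
import requests  # third-party HTTP client

# Hypothetical scoring endpoint exposed by the AI platform.
SCORING_URL = "https://ai-platform.example.com/models/inventory-forecast/v1/predict"

payload = {
    # column names in the agreed input schema
    "columns": ["store_id", "sku", "month", "units_sold"],
    "rows": [
        ["S001", "SKU-42", "2021-07", 1250],
        ["S001", "SKU-42", "2021-08", 1310],
    ],
}

response = requests.post(SCORING_URL, json=payload, timeout=30)
response.raise_for_status()

# e.g. {"predictions": [...], "confidence": [...]} in the agreed output format
result = response.json()
```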

In order to send data back and forth, a schema is needed. A schema carries a set of fields (or columns), and the metadata of each field includes information for both human reading and machine processing, just like the metadata of the model shown above (a per-field sketch follows the list):

  • Metadata for human reading, such as name, taxonomy, example, description, etc.
  • Metadata for machine processing, such as type (an operational type like categorical, ordinal, or continuous, or a data type like integer, float, double, boolean, date, time, datetime, or string) and other properties like missing value, uniqueness, categories, and statistics (minimum and maximum, for example)
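
Here is a minimal per-field sketch in the same spirit; the property names are assumptions for illustration, not taken from any specific schema standard.

```python
# Illustrative metadata for a single field, covering both categories above.
field_metadata = {
    # for human reading
    "name": "units_sold",
    "taxonomy": "sales/volume",
    "example": 1250,
    "description": "Units sold per store and SKU in a calendar month",
    # for machine processing
    "op_type": "continuous",        # categorical | ordinal | continuous
    "data_type": "integer",
    "nullable": False,
    "missing_value": -1,
    "unique": False,
    "statistics": {"min": 0, "max": 50000},
}
```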

What I would like to highlight is that the data part is a little tricky, at least not as straightforward as it looks. Consider the data passed between the BI and AI systems: even when we declare the type of a specific field to be INTEGER, the actual representation still depends on the underlying programming language and operating system. A number that is trivial for one system could easily overflow the other.
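
A small sketch of that pitfall: the same “INTEGER” value serializes fine as a 64-bit integer but overflows a 32-bit one.

```python
import struct

value = 3_000_000_000                  # fits comfortably in a signed 64-bit integer

packed_64 = struct.pack("<q", value)   # fine: 8-byte little-endian integer

try:
    struct.pack("<i", value)           # the same "INTEGER" as a signed 32-bit value
except struct.error as err:
    print(f"overflow when narrowing to 32 bits: {err}")
```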

So, what is needed here is something like Data Mapping. Basically, the data has to be stored somewhere and fetched upon analysis. Here I would like to borrow the two indices introduced in the New Theory of Disuse (Bjork & Bjork, 1992), storage strength (SS) and retrieval strength (RS), to evaluate the strategies for serving data in a platform. The key idea for present purposes is that the conditions that most rapidly increase retrieval strength differ from the conditions that maximize the gain in storage strength.

When we define these mappings, there are a few things for both sides to consider. Traditional data mapping is embedded in the data-movement process as part of the ETL capability; it requires strong typing and is implementation dependent. Newer approaches include data-driven mapping and semantic mapping. Data mapping is part of the Data Provision capability, where data is created and shared by data owners, cataloged and governed by the platform to ensure security and privacy compliance, and made available to consumers via Data Integration / Data Movement. Previously, when we talked about data provision, we meant the capabilities of data collection, data extraction, data cleansing, and data integration. With modern data warehouse / data lake architectures, the majority of the heavy lifting in collection, extraction, cleansing, validation, and transformation has been built into data pipelines for integration, so we can put more focus on how to better serve the data.
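
As a minimal illustration of the declarative, strongly typed kind of mapping described above, the sketch below translates a BI system’s column types into the types a model expects. Both sets of type names are assumptions for this sketch.

```python
# Hypothetical mapping from BI column types to the model's expected field types.
BI_TO_MODEL_TYPES = {
    "NUMBER(10,0)": "integer",
    "NUMBER(18,4)": "double",
    "VARCHAR": "string",
    "DATE": "date",
    "BOOLEAN": "boolean",
}

def map_field(bi_name: str, bi_type: str) -> dict:
    """Translate a BI column declaration into the model's expected field spec."""
    return {"name": bi_name, "type": BI_TO_MODEL_TYPES[bi_type]}

print(map_field("units_sold", "NUMBER(10,0)"))   # {'name': 'units_sold', 'type': 'integer'}
```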

Back to the above scenario: once we know how to map data from one system to another, we can design the strategy to serialize the data over the wire. Simply put, no matter what type of data store a system uses, be it an RDBMS, a key-value database, a wide-column store, or a graph database, a schema should be attached to the data so that the other side knows how to interpret it.

For example, when we fetch data using the Delta Sharing protocol, we need to reference its schema object as specified here: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#schema-object, which defines a Field struct containing name, type, nullable, and metadata information. When we read data in Parquet format, we read the footer at the end of the data file to get the metadata and parse the file along with its Thrift definition. Another example of data transport using a columnar format is Apache Arrow, where we need to read the Schema message before we can understand the structure of a RecordBatch. Which protocol to adopt is the secret sauce of each platform, but as we can see, an explicit schema is required for us to understand the format.
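
The Arrow case is easy to show end to end: the IPC stream begins with a Schema message, and the reader uses it to interpret the RecordBatches that follow. The schema and sample data below are assumptions for this sketch.

```python
import pyarrow as pa

# The schema travels with the data: the IPC stream writer emits a Schema
# message first, and the reader parses it before decoding any RecordBatch.
schema = pa.schema([
    pa.field("sku", pa.string()),
    pa.field("units_sold", pa.int64()),
])

batch = pa.record_batch(
    [pa.array(["SKU-42", "SKU-7"]), pa.array([1250, 980])],
    schema=schema,
)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema) as writer:   # writes the Schema message
    writer.write_batch(batch)

reader = pa.ipc.open_stream(sink.getvalue())      # reads the Schema message back
print(reader.schema)
for received in reader:                           # then the RecordBatches
    print(received.to_pydict())
```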

Metastore

Now that we know enough about the metadata required for calling an AI model, how can we manage it effectively? No matter how the mapping is defined between systems, in practice there needs to be a catalog for the metadata, i.e., a metadata registry (for providers to publish data and for consumers to use data).

A metastore is such a registry: it is a place to hold metadata, which declares what the actual data looks like in storage, and we can use it to keep the metadata of both the model and the data schema for external systems to consume. The good news is that LF-AI has such a project to assist. Marquez is an open-source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It helps address the requirements for metadata-management capabilities like data lineage, data democratization, data discovery, data governance, and data quality, and its metadata repository and storage are organized around its data model. Again, this is another secret sauce for the two systems to implement, but with all these projects, we have a very promising protocol on the way.
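
To tie the pieces together, here is a deliberately simplified, in-memory sketch of what such a registry holds and how the two sides would use it. This is not Marquez’s API; a real metastore persists this state and exposes it over a service interface.

```python
# A toy metastore holding both model descriptors and dataset schemas.
class Metastore:
    def __init__(self):
        self._models: dict[str, dict] = {}
        self._schemas: dict[str, list] = {}

    # the AI platform (provider) publishes model metadata
    def register_model(self, model_id: str, descriptor: dict) -> None:
        self._models[model_id] = descriptor

    # the data owner publishes a dataset schema
    def register_schema(self, dataset: str, fields: list) -> None:
        self._schemas[dataset] = fields

    # the BI system (consumer) looks metadata up before calling the model
    def lookup_model(self, model_id: str) -> dict:
        return self._models[model_id]

    def lookup_schema(self, dataset: str) -> list:
        return self._schemas[dataset]


store = Metastore()
store.register_model("inventory-forecast", {"format": "onnx", "revision": 3})
store.register_schema("sales_monthly", [{"name": "units_sold", "type": "integer"}])
print(store.lookup_model("inventory-forecast"))
```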

References

Bjork, R. A., & Bjork, E. L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. Healy, S. Kosslyn, & R. Shiffrin (Eds.), From learning processes to cognitive processes: Essays in honor of William K. Estes (Vol. 2, pp. 35–67). Hillsdale, NJ: Erlbaum.

https://wiki.lfaidata.foundation/display/DL/OBAIC

https://landscape.lfai.foundation/?selected=acumos

https://landscape.lfai.foundation/?selected=machine-learning-e-xchange

https://en.wikipedia.org/wiki/Data_mapping

https://databricks.com/blog/2021/05/26/introducing-delta-sharing-an-open-protocol-for-secure-data-sharing.html

https://github.com/apache/parquet-format

https://arrow.apache.org

https://marquezproject.github.io/marquez/quickstart.html
