Implement Conceptual Mapping
Introduction
The Implement Conceptual Mapping step is where you finally convert your data from its original structure and ontology into LINCS RDF.
Resources Needed
For TEI data and natural language data, your team will do this step using LINCS tools. Our tools use templates for common relationships, so you can get output in a few minutes, though you may need to spend some time experimenting with how you process your source data to get the output you want.
For structured data and semi-structured data, we still have tools to help, but our approach is customized to each dataset, so the process takes longer. An experienced user could convert a dataset in a few days, but we find this step takes a few weeks to a few months for the average project once you account for training, implementation, and troubleshooting. For these workflows, it will be a combined effort between the LINCS Conversion Team and your Research Team.
| | Research Team | Ontology Team | Conversion Team | Storage Team |
|---|---|---|---|---|
| Set Up your Data | ✓ | ✓ | ✓ | |
| Transform your Data | ✓ | ✓ | ✓ | |
Set Up your Data
To proceed with this step, you must have a conceptual mapping developed for your specific data. Ideally this mapping will be final so that you do not need to redo this implementation step later on. That said, it is fine to start with a mapping that only covers certain relationships of interest and then extend the mapping, and this implementation step, in phases.
It is best if you have already cleaned your data before this step. However, if your implementation uses code or a tool that can be rerun easily, it is fine to start this step before you have finished data cleaning. You can rerun the implementation when the final cleaned data is ready.
Transform Your Data
Whenever possible, use tools or scripts that let you easily edit and rerun this step. That way, if you find errors in your source data or receive more data later on, you can rerun this step to convert everything quickly.
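For example, a rerunnable conversion script can be as simple as one that reads whatever source files currently exist and regenerates all of its output from scratch. A minimal sketch in Python (the folder layout and file names are hypothetical):

```python
from pathlib import Path

SOURCE_DIR = Path("source")  # hypothetical folder of cleaned source files
OUTPUT_DIR = Path("output")  # regenerated in full on every run

def convert_file(source_file: Path) -> str:
    """Apply your conceptual mapping to one source file; returns Turtle."""
    # Placeholder: real logic would parse the file and emit triples
    return f"# converted from {source_file.name}\n"

def main() -> None:
    # Rebuild all output from scratch so reruns stay consistent
    OUTPUT_DIR.mkdir(exist_ok=True)
    for source_file in sorted(SOURCE_DIR.glob("*.xml")):
        rdf = convert_file(source_file)
        (OUTPUT_DIR / f"{source_file.stem}.ttl").write_text(rdf, encoding="utf-8")

if __name__ == "__main__":
    main()
```

Because the script never edits its old output in place, you can delete the output folder and rerun it at any point without worrying about stale results.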
The transformation workflow depends on the type of data you are starting from:

- Structured Data
- Semi-Structured Data
- TEI Data
- Natural Language Data
Structured Data

Every dataset in this category comes with a unique starting structure and, by this point, should have its own conceptual mapping. To extract each piece of information from the source data and reconnect it as CIDOC CRM triples, LINCS prefers to use the 3M mapping tool.
The 3M mapping tool takes XML documents as input and, through its graphical user interface, allows users to select data from their source files and map it into custom CIDOC CRM triples. We have found that this is the easiest method to get consistently converted data. LINCS has developed 3M documentation to guide you through creating your first mapping file, and the Ontology Team and Conversion Team can provide support as you get started.
You may choose to use 3M if:
- Your data is already in XML or is in a format that can be easily converted to XML (e.g., a spreadsheet or JSON files)
- You do not have a team member with programming experience and need a tool with a graphical user interface
- Your data contains many relationships, so the reliability of 3M's output and its handling of intermediate nodes will be a large benefit
Alternatively, you may choose to write custom scripts to convert your data instead of using 3M; a sketch of this approach follows the list below. Custom scripts may be the right choice if:
- You have a team member who understands the source data, understands the conceptual mapping, and has sufficient programming experience
- Your data only covers a small number of relationships so learning 3M is not worth the time investment
- Your data is in a highly normalized relational database and the code needed to transform the relational data into XML would be equivalent to the code needed to output triples directly
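If you take the custom-script route, the core task is turning each source field into the CIDOC CRM pattern from your conceptual mapping, including any intermediate nodes. A minimal sketch in Python with rdflib, assuming a hypothetical source row holding a person's name and birthplace; the URI scheme and field names are illustrative, not a LINCS standard:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
BASE = "https://example.org/entity/"  # hypothetical URI base

def row_to_triples(graph: Graph, row: dict) -> None:
    """Map one source row to CIDOC CRM, with E67_Birth as an intermediate node."""
    person = URIRef(BASE + row["person_id"])
    birth = URIRef(BASE + row["person_id"] + "/birth")
    place = URIRef(BASE + "place/" + row["place_id"])

    graph.add((person, RDF.type, CRM.E21_Person))
    graph.add((person, RDFS.label, Literal(row["name"])))
    # The birth event sits between the person and the place
    graph.add((birth, RDF.type, CRM.E67_Birth))
    graph.add((birth, CRM.P98_brought_into_life, person))
    graph.add((birth, CRM.P7_took_place_at, place))
    graph.add((place, RDF.type, CRM.E53_Place))
    graph.add((place, RDFS.label, Literal(row["place_name"])))

g = Graph()
g.bind("crm", CRM)
row_to_triples(g, {"person_id": "p1", "name": "Ada Lovelace",
                   "place_id": "london", "place_name": "London"})
print(g.serialize(format="turtle"))
```

Note how the E67_Birth event acts as an intermediate node between the person and the place: your script has to mint and manage those nodes itself, which is exactly the bookkeeping 3M's interface handles for you.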
3M requires your data to be input as XML. If your structured data is not in XML, you can convert it following our Preparing Data for 3M documentation. This documentation also gives suggestions for ways to edit your XML data to make working in 3M easier.
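If your source is tabular, a small script can produce simple record-oriented XML for 3M. A sketch in Python (the file name and element names are placeholders; see the Preparing Data for 3M documentation for recommended structures):

```python
import csv
from xml.etree import ElementTree as ET

# Hypothetical input: a spreadsheet exported as people.csv
# Assumes the column headers are valid XML element names
root = ET.Element("records")
with open("people.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # One element per column keeps the XPaths simple in 3M
            ET.SubElement(record, column).text = value

ET.ElementTree(root).write("people.xml", encoding="utf-8", xml_declaration=True)
```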
The Conversion Team and Ontology Team use 3M or custom scripts to write out the conceptual mapping and run the transformation on the data, resulting in LINCS RDF. To make sure that the output from 3M is correct, the Research Team or the Conversion and Ontology Teams transform a small sample of the data and vet the results using the built-in 3M visualization tools and a manual comparison process. The full dataset is then converted.
Semi-Structured Data

For semi-structured data, LINCS recommends that you use LEAF-Writer, a web-based editor that allows you to mark up XML documents, including tagging and reconciling entities. The tool does not require any programming knowledge, but it does take manual effort to tag the documents. While this can be time-consuming, we have found that for unique semi-structured data a manual approach is worth the resulting quality. With LEAF-Writer, you can continue to mark up your documents gradually and re-extract the output until you are happy with it.
In the future, LINCS will provide tools to convert the web annotation data produced in LEAF-Writer into CIDOC CRM and to publish the output with LINCS. While that process is being developed, the Conversion Team can work with you through this step.
If LEAF-Writer is not the right fit, you can use other XML processing tools or create custom extraction scripts that execute the conceptual mapping for a given dataset.
Development of custom scripts and work done in LEAF-Writer would both be completed by your Research Team, with the Conversion Team available to offer advice.
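If you do write custom extraction scripts, a common pattern is to select the tagged entities with an XML library and hand each one to your conceptual mapping. A sketch with Python's lxml, assuming hypothetical markup in which people are tagged as `<person ref="...">` elements:

```python
from lxml import etree

tree = etree.parse("document.xml")  # hypothetical marked-up file
for person in tree.iter("person"):
    ref = person.get("ref")  # the reconciled entity URI, if present
    name = "".join(person.itertext()).strip()
    if ref:
        # In a real script, pass these to your conceptual mapping
        # to build full CIDOC CRM triples
        print(f'<{ref}> rdfs:label "{name}" .')
```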
TEI Data

For TEI data, you can use the LINCS instance of XTriples to select a conversion template and automatically extract CIDOC CRM triples from your TEI files.
The LINCS XTriples templates expect your source files to conform to the LEAF-Writer templates. If your files do not, transform your TEI using the XSLTs linked from our XTriples documentation, which also has details on the conversion templates.
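If you need to apply those XSLTs yourself, any XSLT processor will work. A sketch with Python's lxml, assuming the stylesheets are XSLT 1.0 (which lxml supports; for XSLT 2.0 or later, use a processor such as Saxon) and with placeholder file names:

```python
from lxml import etree

# Placeholder names: your TEI file and an XSLT linked from the XTriples docs
tei = etree.parse("my_document.xml")
transform = etree.XSLT(etree.parse("leaf_writer_template.xsl"))
result = transform(tei)
result.write_output("my_document_transformed.xml")
```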
For additional extractions not covered by the XTriples templates, you will need to follow another workflow: if you have fairly regularized XML data, follow the structured data workflow; otherwise, follow the semi-structured data workflow. If there is significant natural language text contained in the TEI documents, use the natural language workflow to extract facts from those textual elements.
Natural Language Data

The Natural Language Data workflow is still in progress. Check back over the next few months as we release the tools described here.
The task of extracting triples from natural language text in an automated way, without a human manually marking up a document, breaks down into named entity recognition (NER) and relation extraction (RE): a computer system predicts which words or phrases in the text represent named entities, and what relationships the text expresses between those entities.
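To make the NER half of this concrete, here is what entity prediction looks like with spaCy, a general-purpose NLP library used purely as a stand-in illustration here; it is not one of the LINCS APIs:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Margaret Laurence wrote The Stone Angel in Vancouver.")

for ent in doc.ents:
    # Each predicted entity has a type such as PERSON, GPE, or WORK_OF_ART
    print(ent.text, ent.label_)
```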
LINCS has developed APIs to make these automated tasks accessible. These APIs take plain text as input and output either:
- Triples where the predicate (the relationship) must come from a list of allowable predicates
- Triples where the predicate can be any word or phrase from the text
The first option is the fastest and most reliable way to generate valid LOD, but the restricted predicate list limits the number of triples you will get. The second option will give you many more triples to start with, and it can act as a productive first step in a more manual approach where a human cleans up the extracted triples. LINCS tools provide both of these options.
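As a rough sketch, calling such an extraction API from Python might look like the following. The endpoint, parameters, and response shape here are hypothetical stand-ins; check the LINCS-API documentation for the real interface:

```python
import requests

# Hypothetical endpoint and payload; the real LINCS-API may differ
response = requests.post(
    "https://api.example.org/extract-triples",
    json={
        "text": "Margaret Laurence wrote The Stone Angel.",
        "mode": "allowable-predicates",  # vs. "open-predicates"
    },
    timeout=30,
)
response.raise_for_status()
for triple in response.json().get("triples", []):
    print(triple["subject"], triple["predicate"], triple["object"])
```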
This level of automation is meant to be a faster, though less precise, conversion method than the structured conversion workflow or a manual treatment of natural language texts. If your Research Team has the time, you can put more manual curation into the results, using the tools as a starting point.
These extraction APIs will be part of LINCS-API and made accessible through programming notebooks and, eventually, through tools such as NERVE. In the meantime, NERVE is a great starting point for creating LOD from natural language texts even without the future relation extraction functionality. It allows you to tag entities in the text, reconcile them against external LOD sources, and then connect the mentions of those entities to the source text using the Web Annotation Data Model (WADM). LINCS can then help you transform NERVE's output into CIDOC CRM triples ready for publication with LINCS.
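For orientation, a minimal web annotation in the WADM shape, linking a mention in a text to a reconciled external entity, might look like the following (built here as a Python dict; the document URL is hypothetical and the Wikidata URI is just an example):

```python
import json

# Illustrative annotation connecting a tagged mention to an external LOD entity
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": "http://www.wikidata.org/entity/Q42",  # example reconciled entity
    "target": {
        "source": "https://example.org/texts/letter-01",  # hypothetical document
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "Douglas Adams",
        },
    },
}
print(json.dumps(annotation, indent=2))
```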
One of the systems that we are using behind the scenes of these APIs is made possible through our collaboration with Diffbot. We have worked with them to tailor their Natural Language Processing API to handle the unique challenges of processing humanities texts.
The triples output from these automated systems may change slightly each time you run them if you edit the input text or if the system has been updated since your last run. If you plan to run the tools on the same texts multiple times, consider how you will merge your results afterward. For example, if you are manually reviewing the output to remove incorrect facts, keep track of those facts so that you can automatically remove them from future results.
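One low-tech way to handle this is to keep the triples you have rejected in a file under version control and filter them out of each new run. A sketch (the file name and triple format are up to you):

```python
import csv
from pathlib import Path

def load_rejected(path: str) -> set:
    """Read previously rejected triples from a CSV kept under version control."""
    p = Path(path)
    if not p.exists():
        return set()
    with p.open(newline="", encoding="utf-8") as f:
        return {tuple(row) for row in csv.reader(f)}

def filter_triples(new_triples: list, rejected: set) -> list:
    """Drop any newly extracted triple that was rejected in an earlier review."""
    return [t for t in new_triples if tuple(t) not in rejected]

# Hypothetical file and triples, purely for illustration
rejected = load_rejected("rejected_triples.csv")
extracted = [("Alice", "born in", "Paris"), ("Alice", "wrote", "Paris")]
print(filter_triples(extracted, rejected))
```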
After extracting triples with these tools, you may want to inspect the results, removing inaccurate extractions and potentially adding missed triples, depending on how you choose to balance time against data quality. Finally, LINCS has an additional API to transform triples with predicates from our allowable list into CIDOC CRM triples ready for the Validate and Enhance step.
You should now have RDF data that follows LINCS’s ontology and vocabulary standards. Your data may not be quite ready for ingestion into the LINCS triplestore yet, but it will be after some final cleanup in the next step.