Structured Content in the Age of AI: The Ticket to the Promised Land?

July 17, 2024 Carrie Hane

The following is the text and slides from my April 2024 Information Architecture Conference talk.

Computers entered my life circa 1980, when I got to go to my school’s computer room to play Oregon Trail and Lemonade Stand on the teletype machine as a reward for something. Shortly thereafter, my elementary school got an Apple IIe. It was on a cart that went from classroom to classroom for us to play games on a screen!

40 years later, I carry more computing power in my pocket and rely on computers for just about everything: work, scheduling, communication, and staying in touch with my family and friends.

We are in AI’s infancy, similar to where we were when that Apple IIe first arrived at my 5th grade classroom’s door. People ooh and aah about it, but we aren’t quite sure how it will change our lives, including how we work.

But the hype is there in ways we humans in the early 1980s couldn’t even imagine.

When ChatGPT launched in November 2022, I mostly ignored it. I don’t work as a content creator, and I was skeptical about what it could do. I hoped it was a fad.

The hype died down by the summer of 2023, once people discovered that there were massive limitations and ethical concerns, and it was not a complete solution to anything.

And that’s when things started to change for me. I saw people ask, “What can we use GenAI for that’s useful?” I was relieved when Sanity, the content management system company I work(ed) for, moved toward having it handle the chores of content creation rather than churn out low-quality content faster.

Turns out, we can expect it to do many things:

Summarizing documents
Synthesizing information – data, text, images, space
Editing written content
Translation
Coding
Categorization
Text to speech
Search
Medical research and diagnosis

Things that involve looking for patterns and connections.

This is what Generative AI is by definition:

artificial intelligence capable of generating text, images, videos, or other data using generative models in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics. (Wikipedia, March 26, 2024)

It generates something from a pool of other things.

It does not create novel content.

It does not learn.

It is not innovative.

It predicts what is next based on patterns it finds in the things it is fed.

Underlying generative AI is a Large Language Model — or LLM. It is a machine-learning model trained on immense amounts of data that predicts and generates natural language responses to prompts or queries.

I’m sure you’ve used ChatGPT and have seen how this works:

You enter a prompt (a question or a set of directions)
It returns an answer in natural language
You refine your prompt and get a new output
And so on…

You have no way to verify whether the output is true or not. And it falls short of expectations most of the time.

It is prone to hallucinations — things that sound plausible but are not necessarily accurate. It makes stuff up. It’s like a know-it-all who has something to say about everything, whether or not it is true.

You don’t actually know when it will be accurate because it is inconsistent and unpredictable.

Therefore, it is untrustworthy.

When it comes to business analytics, it is inaccurate most of the time.

In a 2023 study conducted by Data.World, the best an LLM using data stored in a SQL database could do was be accurate a little over one-third of the time.

When it came to questions related to metrics, KPIs, and strategic planning, the LLM failed to return an accurate answer. It was never right.

At best, it produces decent first drafts of whatever you are producing. One friend told me that his consulting company bought a tool to generate proposals. The material it used to generate new proposals was a catalog of previous proposals humans had written. The good news is that the humans involved in writing proposals still have jobs. Because the proposals that were generated needed a lot of work to go out the door.

This is just one of many similar stories we all could tell.

This is because the content these LLMs rely on for their answers is not the best.

The source of the content is unstructured, inconsistent, and not always accurate.

In essence, garbage in, garbage out. Computers still do what they are told to do.

How do we move beyond decent first drafts or summarizations? How do we make it so that we can have the computers do the mundane work so we can have time for creativity and critical thinking?

Structure.

To answer the question asked in the title of this talk: Is structured content the ticket to the promised land?

YES. Well, sort of. Structured content is a ticket to the promised land, if not the ticket.

Here’s how I define structured content.

content that is broken into its smallest reasonable pieces, which are explicitly organized and classified to be understandable by computers and humans

It is:

Object-oriented
Componentized
Algorithmically predictable

It is a container that describes the intent of what the object — or entity — is, not what it looks like.

Structured content is semantic. It contains meaning.

There’s also structured data, which is organized and formatted in a specific way, allowing humans and computers to access it efficiently.

Neither is widespread. Most content and data stored today are unstructured.

Google told us what to do about this 12 years ago:

Turn strings into things.

Strings in this case means a sequence of characters.

In their release post, they used the example of [taj mahal].

A string search would look for that sequence of characters t-a-j-_-m-a-h-a-l.

But what is [taj mahal] as a thing — or an entity?

It could be the monument in India, or a musician, the casino in Atlantic City, or even your local Indian restaurant.

[taj mahal] as a string is ambiguous.

[taj mahal] as a thing is not. You can assign meaning to the thing.

This is what structured content does.

It turns a STRING into a THING that is neatly defined and organized.
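
Here is a minimal sketch of that difference in Python. The `Entity` class, its fields, and the sample records are hypothetical illustrations, not Google's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A 'thing': a name plus an explicit type and a short description."""
    name: str
    kind: str          # what the thing is, e.g. "Monument" or "Musician"
    description: str


# Hypothetical records: four different things that share one string.
ENTITIES = [
    Entity("Taj Mahal", "Monument", "Monument in India"),
    Entity("Taj Mahal", "Musician", "A musician"),
    Entity("Taj Mahal", "Casino", "Casino in Atlantic City"),
    Entity("Taj Mahal", "Restaurant", "A local Indian restaurant"),
]


def string_search(query: str) -> list[Entity]:
    """String matching: every record containing the characters is a hit."""
    return [e for e in ENTITIES if query.lower() in e.name.lower()]


def thing_search(query: str, kind: str) -> list[Entity]:
    """Entity matching: the explicit type narrows the string to one meaning."""
    return [e for e in string_search(query) if e.kind == kind]


print(len(string_search("taj mahal")))        # every record matches the string
print(thing_search("taj mahal", "Monument"))  # only the monument matches the thing
```

The string alone cannot tell the four records apart. The `kind` field is the explicit meaning that lets a computer pick the right one.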

Turns out the bots need what structured content provides:

Context
Explicitness
Relevance
Connections
Semantics

Without all of this, we’ll continue to get the same results we’re getting now.

Source: promptengineering.org/master-prompt-engineering-llm-embedding-and-fine-tuning/

One might argue that we could fine-tune the bots to recognize all these things. But that is time-consuming, expensive, and involves a lot of human effort to review and revise the prompts to get the right output. And still, the accuracy will not be high enough to make it trustworthy.

Fine-tuning does these things well:

Teaches new tasks or patterns
Originally created for image models, now applies to NLP tasks
Used for classification, sentiment analysis, and named entity recognition

But fine-tuning is:

Not able to teach new information, only new tasks
Prone to confabulation and hallucination
Expensive, slow, and difficult to implement
Not scalable for large datasets

Fine-tuning was originally created for image models and has been applied to natural language processing tasks. As such it is less efficient. And it isn’t scalable.

Source: aws.amazon.com/what-is/data-labeling/

Likewise with training a bot.

Supervised learning requires a labeled set of data for the model to learn from. Initially, a human needs to manually label the data so that the model can learn.

From there, a machine-learning model can be trained on a subset of the human-labeled raw information. Maybe one day, we can remove the humans from the equation, but not yet.

Both fine-tuning and supervised learning rely on working with an entire corpus of content.

Source: aws.amazon.com/what-is/retrieval-augmented-generation/

The good news is that we have a framework for solving the problem: RAG — retrieval-augmented generation — offers a better solution and a better cost-benefit ratio.

RAG adds to — or augments — the LLM.

With RAG, GenAI-powered solutions can enhance their own knowledge and content generation by retrieving information from external sources, instead of relying only on the data they were trained on.

In this framework, a retriever uses an index to locate the most relevant information rather than scanning the entire corpus.

Once it finds the relevant documents, the raw data gets turned into a coherent and contextually relevant query to the LLM, which then delivers a plain language response to the prompt.
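
The retrieve-then-generate flow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the documents are made up, the index is a plain keyword index rather than the vector or graph index a production system would likely use, and `fake_llm` is a placeholder that stands in for a real model call.

```python
import re

# Hypothetical document store; in practice these would be your content chunks.
DOCUMENTS = {
    "returns": "Items can be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 5 business days.",
    "warranty": "Hardware is covered by a one-year warranty.",
}


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# The index is built once, so a query never has to scan the raw corpus.
INDEX = {}
for doc_id, text in DOCUMENTS.items():
    for word in tokenize(text):
        INDEX.setdefault(word, set()).add(doc_id)


def retrieve(question, top_k=1):
    """Rank documents by how many question words they share, using the index."""
    scores = {}
    for word in tokenize(question):
        for doc_id in INDEX.get(word, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]


def build_prompt(question):
    """Augment the user's question with the retrieved context."""
    context = "\n".join(DOCUMENTS[d] for d in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt):
    """Placeholder for a real model call (assumption: no actual LLM is used)."""
    return f"[model response to a prompt of {len(prompt)} characters]"


print(build_prompt("How long does shipping take?"))
print(fake_llm(build_prompt("How long does shipping take?")))
```

The model only ever sees the question plus the retrieved chunk, which is why the quality of the chunks matters so much.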

Source: falkordb.com/knowledge-graph-vs-vector-database

Underpinning the RAG framework are 2 types of technologies that add context and organize data:

Vector databases
Knowledge graphs

Warning! I’m going to take a brief detour into math and computer science.

Don’t worry, it will be short!

Sources: pinecone.io/learn/vector-embeddings-for-developers, weaviate.io/blog/vector-embeddings-explained, https://www.osedea.com/en/blog/article-vector-databases

Vectors are objects that have both a magnitude and direction.

Embeddings are a type of vector data representation that carries the semantic information with it. They represent multiple dimensions of the data that help in understanding patterns, relationships, and underlying structures. They represent the meaning of a word, a phrase, or an entire document as a numerical vector. Embeddings are numbers.

A vector database computes an embedding for each data object. The embeddings are placed into an index for fast searching.

For each query, an embedding is computed and an algorithm finds the closest vectors to the vector of the query, which implies a similar meaning.

For example, a vector database can tell you that “Sacramento” and “California” are more related than “Sacramento” and “Washington,” based on their vector distances.
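
A small sketch of that comparison follows. The three-dimensional vectors are invented for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions. The search here is an exact brute-force comparison, not the approximate index a vector database would use.

```python
import math

# Made-up embeddings for illustration only (assumption, not real model output).
EMBEDDINGS = {
    "Sacramento": [0.9, 0.8, 0.1],
    "California": [0.8, 0.9, 0.2],
    "Washington": [0.1, 0.3, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Return the candidate whose vector is most similar to the query's vector."""
    return max(
        candidates,
        key=lambda name: cosine_similarity(EMBEDDINGS[query], EMBEDDINGS[name]),
    )


print(nearest("Sacramento", ["California", "Washington"]))  # California
```

Note what the result does and does not say: the two words are close, but nothing in the numbers explains why.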

Source: wikipedia.org/wiki/Knowledge_graph

Graphs are representations of a network and describe the relationship between lines and points. Each object on a graph is called a node. Each relationship is an edge. They are multidimensional and allow us to connect many things in many ways.

Knowledge graphs represent data as a network of nodes and edges. They can handle complex, nuanced queries based on the types of connection and the nature of their nodes, structures, and properties. They can also capture rich semantic relationships that could get lost in a vectorized embedded space.

For example, a knowledge graph can tell you that “Sacramento” is the capital of “California,” based on their edge label.

Knowledge graphs map data to meaning, capturing both semantics and context. They are human-readable.
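
As a rough sketch, a knowledge graph can be modeled as a list of labeled edges. The triples and helper functions below are hypothetical and far simpler than a real graph database, but they show how an edge label answers a question that vector distance alone cannot.

```python
# Each edge is a (subject, label, object) triple: two nodes joined by a named edge.
TRIPLES = [
    ("Sacramento", "capital_of", "California"),
    ("Olympia", "capital_of", "Washington"),
    ("California", "part_of", "United States"),
    ("Washington", "part_of", "United States"),
]


def objects(subject, label):
    """Follow edges with this label outward from a node."""
    return [o for s, l, o in TRIPLES if s == subject and l == label]


def subjects(label, obj):
    """Follow edges with this label backward into a node."""
    return [s for s, l, o in TRIPLES if l == label and o == obj]


def country_of_capital(city):
    """Two-hop traversal: city -> state it is the capital of -> country."""
    return [
        country
        for state in objects(city, "capital_of")
        for country in objects(state, "part_of")
    ]


print(subjects("capital_of", "California"))   # ['Sacramento']
print(country_of_capital("Sacramento"))       # ['United States']
```

Because every edge is labeled, a person can read the triples directly and trace exactly how an answer was reached.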

End of detour. Back to AI.

So which type of augmentation do you use?

Of course, it depends.

It depends on what you are trying to achieve.

Vector databases are particularly good for high-dimensional data like images, audio, and video: things whose meaning is harder to capture in words.

They are also good for similarity searches and recommendation systems because they use mathematical computations to find things that are most alike.

They are efficient because they use approximate nearest-neighbor algorithms, which keep queries fast.

They are good for:

Recommendation systems
Anomaly detection
Semantic searches
A wide array of queries in a closed system

For example, vector databases excel at helping customer service representatives find the best answer to a customer’s question in internal documentation.

Knowledge graphs can handle more complexity and nuance.

They add reasoning, understanding, and contextual awareness to return more precise and traceable responses. They can also evolve as new information is added and new relationships are established.

They are good for:

Connecting data across multiple schemas
Returning precise responses
Generating insights, not just establishing facts

An example of where a knowledge graph helps is in the insurance claims adjustment field, where adjusters have to consider policies, claims, customers, markets, and more.

“[Adding RAG] makes knowledge representation multi-dimensional and highly expressive beyond what is stored in data repositories.” —Nate Davis, Chief Information Architect, Methodbrain

Back to the Data.World study. They found that adding a knowledge graph improved LLM response accuracy by 3x across 43 business questions — even complex questions that had to look at multiple data tables to produce a response.

A lot better, but we still have to “trust but verify.”

“It's hard for traditional RAG to be right when the data looks like what it does at most companies: highly repetitive, some out of date, and all on roughly the same topic.” —May Habib, CEO, Writer.com

Technology applications help, but we still have this problem with the underlying content.

Back to things, not strings… or blobs vs chunks.

If your content is unstructured, it’s a series of blobs.

Vectors and graphs (embeddings and knowledge graphs) rely on entities, nodes, things — CHUNKS — for accuracy.

The vector or graph database will relate chunks. And those chunks might not get classified properly.

Structured content gives you control over the size and meaning of the chunks. Instead of having to determine the entities and relationships in the sentence “The quick brown fox jumped over the lazy dog,” we know that

Fox = animal

Dog = animal

Action = jumpedOver

(Yes, this looks a bit like sentence diagramming that you might recognize from the 20th century!)

Each animal has its own set of entities that can be used in different ways.
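
One way to picture the fox-and-dog example as structured content is sketched below. The `Animal` and `Action` classes and the `render` function are hypothetical names chosen for illustration; a real content model would define its own types and fields.

```python
from dataclasses import dataclass, field


@dataclass
class Animal:
    """An entity with its own reusable attributes."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Action:
    """An explicit relationship between two entities."""
    verb: str
    actor: Animal
    target: Animal


fox = Animal("fox", ["quick", "brown"])
dog = Animal("dog", ["lazy"])
jump = Action("jumpedOver", actor=fox, target=dog)

# Display wording is kept separate from the stored meaning.
VERB_PHRASES = {"jumpedOver": "jumped over"}


def phrase(animal):
    """Join an animal's attributes and name into a noun phrase."""
    return " ".join(animal.attributes + [animal.name])


def render(action):
    """Rebuild a sentence from the structured pieces."""
    verb = VERB_PHRASES[action.verb]
    return f"The {phrase(action.actor)} {verb} the {phrase(action.target)}."


print(render(jump))     # The quick brown fox jumped over the lazy dog.
print(jump.actor.name)  # fox
```

Nothing has to be inferred from the sentence: who acted, on whom, and how are all stored explicitly, and the same entities can be reused in other sentences or interfaces.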

The structure you give to your entities gives the people creating content a guide for what needs to be captured and recorded for your situation. It frees up mental space to be creative within the constraints of the structure.

This is what we mean when we say structured content is good for humans and computers.

Content model for a live music venue.

Avalara is one of the companies leading the way in experimenting with GenAI and knowledge graphs. They did a study of structured and unstructured content repositories and found that with more intelligence (i.e., knowledge graphs and structure), there were fewer hallucinations with the structured content. The AI even hallucinated less than humans.

It’s proof that if you really want to take advantage of AI, and be among the early ones who do, you need to get your content structured now.

And it isn’t just for generative AI — it is useful for all the traditional places we need to publish content.

To scale, you need clean data and content.

With RAG + Structured Content you get a winning combination.

You can reduce the amount of training you need to do on your data or content.
The cost is lower.
Humans spend less time adjusting their prompts, verifying results, or cleaning up the source data.
Accuracy of the results is greatly improved.

I feel a little gratified that structured content is one of the answers to making AI better. Seven years ago, Mike Atherton and I wrote a book about structured content and how to model digital products. And before that, we had both been talking about structured content (object-oriented content) and modeling for over 5 years. And we didn’t make it up. We learned from others.

People working in technical communications have been talking about it for decades.

So you don’t need to make this up yourselves. There are plenty of examples and guides to follow. Here are just a few people who you can learn from.

Here’s more good news…

You don’t have to do everything all at once.

Just like any transformation, the best way to get started is to start small.

Experiment
Iterate
Adjust your structure
Clean up your content and data

AI can make your life easier when you think about what computers do best: the repetitive things, the chores, the computations, the pattern matching. Things that take humans hours (or even days) that computers can predictably and accurately do in minutes.

It will take people with a “content science” background to make the content intelligent.

—Michael Iantosca, Senior Director of Content Platforms and Knowledge Engineering, Avalara.

It will take people who care about the “fundamental nature of information and its relationship to human cognition, experience, and society” (ChatGPT in response to “What do information architects care about?”)

To get better, AI needs more…

Ontology
Taxonomy
Provenance acknowledgement
Metadata
Semantics
Behavior analysis
Explicit relationships

And we are the people who fit all these descriptions. We IAs are the people who care about…

Meaning
Understanding
Consistency
Sensemaking

Structured content is the ticket to the land where computers do what computers do best and humans do what humans do best. This is definitely what was promised to me as a child.

In 40 years, maybe one of our kids will be talking to their peers about how silly the early days of AI were.

I hope I am around to see it.