Knowledge Graph in Machine Learning: All You Need to Know

Danfeng Xu | CTO & Co-Founder | April 8, 2024

How do knowledge graphs optimize machine learning and data analytics in AI? This article reveals the mechanics and advantages of integrating knowledge graphs and machine learning, from enhancing model precision to uncovering intricate data relationships. Delve into the essentials and see how this synergy is reshaping our understanding of data and AI.

Key Takeaways

  • Knowledge graphs organize information in a structured graph format, detailing relationships between entities within a domain. They are essential for enhancing data analysis and retrieval, and they aid machine learning by providing context and facilitating data integration.
  • In machine learning applications, knowledge graphs improve model training, especially with limited data, and contribute to the explainability and accuracy of AI systems through their systematic data structure and by improving comprehension of context in text data.
  • Building a knowledge graph requires careful planning and expert input to align data and create a consistent ontology, but it delivers substantial benefits, including unifying diverse data sources, revealing hidden patterns, and enabling informed decision-making.

What is a Knowledge Graph?

A knowledge graph is akin to a treasure map of graph data. It captures vital information about key entities within a specific domain and the relationships between them, all organized in a structured graph format. The term was popularized by Google in 2012 when it enhanced its search functionality, shifting from mere keyword matching to understanding the deeper semantics and context of queries, drawing on ideas from the semantic web.

A knowledge graph example using PuppyGraph

Within a knowledge graph, nodes represent entities such as people, places, and events, while the interactions or relationships between those entities form the edges that link the nodes. This graph structure provides a contextualized view of data, enabling us to decipher complex relationships, like the intricate connections between various services or the web of customer interactions.
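To make this concrete, here is a minimal sketch in Gremlin (the graph traversal language used later in this article) that builds a toy knowledge graph; the entity names, labels, and edge types are purely illustrative:

// Add entity nodes (a person, a place, and an event) and the edges that relate them.
g.addV('person').property('name', 'Marie Curie').as('marie').
  addV('place').property('name', 'Warsaw').as('warsaw').
  addV('event').property('name', 'Nobel Prize in Physics 1903').as('nobel').
  addE('born_in').from('marie').to('warsaw').
  addE('awarded').from('nobel').to('marie').
  iterate()

// Ask a contextual question: which events are connected to people born in Warsaw?
g.V().has('place', 'name', 'Warsaw').in('born_in').in('awarded').values('name')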

Originating from the field of knowledge representation in artificial intelligence, knowledge graphs aim to:

  • Merge data and business logic into a shared representation
  • Leverage ontologies to create a semantic layer that describes the types and relationships of entities
  • Serve as a map to search for data before delving into the data itself, facilitating efficient data retrieval.

In practical applications, knowledge graphs are behind many of the search engines, recommendation systems, and AI applications we use every day. They help in improving search results, enabling question answering systems, and supporting the development of intelligent applications that can understand and reason about the world in a more human-like manner.

Additionally, extensive public knowledge graphs like DBpedia and GeoNames consolidate data from many sources, laying the groundwork for other graphs and the graph databases built on them. This enhances data analysis and interoperability, proving invaluable in today’s data-driven world. Therefore, be it developers, data scientists, or business analysts, each perceives a knowledge graph differently, appreciating its unique offerings in their respective roles.

What is a Knowledge Graph in Machine Learning?

In machine learning and neural networks, knowledge graphs act as the connective layer that binds together disparate data sources, forging connections between entities such as people, places, or events. By adding context and depth to AI techniques, they make it easier to feed richer, more diverse data into algorithms, thereby enhancing the performance of machine learning models.

An essential aspect of knowledge graphs in machine learning is their capability to augment training data, particularly in situations where there is insufficient data for machine learning models. This greatly improves the learning capabilities of the models. Moreover, the decision-making process of machine learning systems can be summarized by mapping explanations to nodes in a knowledge graph, thereby enhancing the explainability and trustworthiness of results.

Knowledge graphs also enable the integration of heterogeneous data sources, providing a unified view of information. This capability is particularly valuable in scenarios where data comes from diverse domains or formats, allowing ML models to leverage a broader range of data for training and inference.

Recent trends reveal the ascending importance of knowledge graphs, which play a pivotal role in improving the accuracy of systems and extending the range of machine learning capabilities. They contribute to systematic improvement by providing clear and structured relationships between data points. Applications of knowledge graphs in machine learning are diverse, extending to question answering, recommendation systems, and supply chain management, among others.

How Do Knowledge Graphs Solve Machine Learning Problems?

Knowledge graphs play a crucial role in addressing several fundamental problems in machine learning, enhancing the capabilities of machine learning models across various domains. By leveraging the structured, interconnected nature of knowledge graphs, machine learning systems can overcome challenges related to data sparsity, context understanding, and feature extraction, among others. Here are some examples.

Addressing Data Sparsity

Data sparsity occurs when there is insufficient data for machine learning models to learn effectively, leading to poor performance, especially in domains with complex or rare phenomena. Knowledge graphs mitigate this problem by enriching sparse data with additional context and connections derived from the graph.

Example: In recommendation systems, a knowledge graph can provide additional attributes and relationships between items and users, filling in the gaps where interaction data is sparse. For instance, if a user likes certain movies, the knowledge graph can suggest similar movies based on genres, directors, or actors connected in the graph, even if those movies have not been rated by many users.

Credit: Movie recommendation and search using an IMDB knowledge graph by AWS ML blog
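As a rough sketch of this idea in Gremlin (the user/movie/genre labels and the liked/has_genre edges are hypothetical, not taken from the AWS example), a traversal can surface candidate movies through shared genres even when direct rating data is sparse:

// Collect the movies a user already liked, hop through shared genres to other
// movies, exclude what the user has seen, and rank candidates by how many
// genre connections they share with the user's liked movies.
g.V().has('user', 'name', 'alice').
  out('liked').aggregate('seen').
  out('has_genre').in('has_genre').
  where(without('seen')).
  groupCount().by('title').
  order(local).by(values, desc).
  limit(local, 5)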

Enhancing Context Understanding

Understanding the context in which data exists is crucial for machine learning models to make accurate predictions or generate relevant outputs. Knowledge graphs contribute rich, structured information that models can use to grasp the broader context of the data they are processing.

Example: In natural language processing (NLP), a knowledge graph can help a model understand that the term "Jaguar" refers to either an animal or a car brand, depending on the context of the conversation. This semantic understanding significantly improves the accuracy of tasks like entity recognition, disambiguation, and sentiment analysis.

Credit: A Knowledge Graph-Based Approach for Predicting Future Research Collaborations
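A hedged sketch of how a graph lookup could support that disambiguation (the entity names, the related_to edge, and the scoring are invented for illustration): given other terms detected in the same sentence, score each candidate sense of "Jaguar" by how much of its graph neighborhood overlaps with that context.

// Candidate senses of the surface form "Jaguar" and context terms from the sentence.
candidates = ['Jaguar (animal)', 'Jaguar (car brand)']
context    = ['engine', 'horsepower', 'dealership']

// Score each candidate by how many of its neighbors appear in the context,
// then keep the highest-scoring sense.
g.V().has('entity', 'name', within(candidates)).
  group().
    by('name').
    by(both('related_to').has('name', within(context)).count()).
  order(local).by(values, desc).
  limit(local, 1)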

Improving Feature Extraction

Feature extraction involves identifying the most relevant information from raw data that a machine learning model can use for learning. Knowledge graphs inherently provide a wealth of structured features that models can exploit, enhancing their learning efficiency and accuracy.

Example: In fraud detection, a knowledge graph that includes entities such as transactions, accounts, and users, along with their relationships (e.g., transactions between accounts, ownership of accounts by users), can help models identify patterns indicative of fraudulent activity more effectively than traditional data representations.

Credit: A knowledge graph showing BTC transaction chain by Coinbase
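As an illustrative sketch (the account label, the transacted_with edge, and the flagged and account_id properties are hypothetical), a traversal can surface accounts that sit within two transaction hops of a known-fraudulent account, a pattern that is awkward to express over flat tables:

// Find accounts within two transaction hops of an account already flagged as fraudulent.
g.V().hasLabel('account').has('flagged', true).
  repeat(both('transacted_with').simplePath()).times(2).
  dedup().
  values('account_id')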

Enhancing Predictive Modeling Accuracy

The relational structure of knowledge graphs allows machine learning models to leverage the connections between entities to make more accurate predictions.

Example: In the healthcare domain, a knowledge graph encompassing diseases, symptoms, medications, and patient histories can improve the accuracy of diagnostic models. By understanding the relationships between symptoms and diseases, models can predict potential diagnoses based on symptom combinations, potentially uncovering less obvious conditions.

Credit: An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries
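A minimal sketch of that idea in Gremlin (the disease/symptom labels and the has_symptom edge are assumptions for illustration): rank candidate diseases by how many of the observed symptoms they are connected to in the graph.

// Observed symptoms for a patient (names are illustrative).
symptoms = ['fever', 'joint pain', 'rash']

// Walk from the symptom nodes to connected diseases and rank diseases
// by how many of the observed symptoms point to them.
g.V().hasLabel('symptom').has('name', within(symptoms)).
  in('has_symptom').
  groupCount().by('name').
  order(local).by(values, desc).
  limit(local, 3)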

How to Build a Knowledge Graph

Building a knowledge graph is a meticulous process that requires careful planning and implementation. The initial step consists of pinpointing the use cases, spanning from product lifecycle management to artificial intelligence projects such as recommendation engines or chatbots. The next step requires identifying the necessary data by working with subject matter experts to define business questions and understand the field of knowledge relevant to your use case. It often requires collecting data from various sources relevant to the domain, such as databases, APIs, web scraping, and public datasets.

Once the initial data is ready, it is time to identify the key entities that will form the nodes of your graph, such as people, places, organizations, and products. This is typically followed by creating an ontology that defines the classes (types of entities), properties (attributes of entities), and relations (types of relationships) in your knowledge graph; the ontology serves as a schema or blueprint for organizing the information. After that, choose the appropriate technologies and tools for building and storing the knowledge graph.
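In property-graph terms, the classes in your ontology typically map to vertex labels, properties to vertex and edge properties, and relations to edge labels. A purely illustrative sketch in Gremlin (retail domain, all names invented) showing data inserted to conform to such an ontology:

// Ontology (illustrative): classes Customer and Product, relation PURCHASED,
// with name, category, and date as properties.
g.addV('Customer').property('name', 'Acme Corp').as('c').
  addV('Product').property('name', 'Graph Widget').property('category', 'analytics').as('p').
  addE('PURCHASED').from('c').to('p').property('date', '2024-01-15').
  iterate()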

Once you have created the first version of the knowledge graph, it is time for further iterations with two focuses:

  • Validate the structure and content of your knowledge graph to ensure it accurately represents the knowledge domain. Check for inconsistencies, missing data, and incorrect relationships.
  • Integrate additional data sources into your knowledge graph to enrich it. This can include importing data from external databases, linking to other knowledge graphs, or incorporating data from text using natural language processing techniques.

Building a knowledge graph is an iterative and ongoing process that requires careful planning, execution, and maintenance. By following these steps, you can create a robust knowledge graph that serves as a valuable resource for various applications, from enhancing search functionalities to powering complex machine learning algorithms.

Build a Knowledge Graph With PuppyGraph

PuppyGraph is the first and only graph query engine on the market that allows you to query one or more of your existing SQL data stores as a unified graph. This means you can query the same copy of the tabular data as a graph (using Gremlin or Cypher) and in SQL at the same time, with no ETL required.

This section walks through how to build a knowledge graph from a dataset of Twitch gamers using PuppyGraph on top of Postgres. You can download the data from SNAP, and you can access the project here.

The dataset contains sampled Twitch gamer accounts as well as the mutual follower relationships between them.

Step 1: Download the data from SNAP and unzip the data.

wget https://snap.stanford.edu/data/twitch_gamers.zip
unzip twitch_gamers.zip -d twitch_gamers

The folder should contain the following files.

$ ls twitch_gamers
README.txt  large_twitch_edges.csv  large_twitch_features.csv

Step 2: Start Postgres and PuppyGraph With Docker Compose

Start the Postgres and PuppyGraph containers using Docker with the following docker-compose.yaml file.

version: "3"

services:
  puppygraph:
    image: puppygraph/puppygraph:stable
    pull_policy: always
    container_name: puppygraph
    environment:
      - PUPPYGRAPH_USERNAME=puppygraph
      - PUPPYGRAPH_PASSWORD=puppygraph123
    networks:
      postgres_net:
    ports:
      - "8081:8081"
      - "8182:8182"
      - "7687:7687"
  postgres:
    image: postgres:14.1-alpine
    container_name: postgres
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres123
    networks:
      postgres_net:
    ports:
      - "5432:5432"
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
networks:
  postgres_net:
    name: puppy-postgres

Once the file is created, run docker compose up -d to start the Postgres and PuppyGraph containers.

Step 3: Load Twitch Data Into Postgres and Create Tables

We first load the data into Postgres. Save the SQL below to a file named twitch.sql; we will use it to create the tables that hold the Twitch data.

CREATE SCHEMA IF NOT EXISTS twitch;

CREATE TABLE IF NOT EXISTS twitch.features
(
    views        BIGINT,
    mature       BOOLEAN,
    life_time    BIGINT,
    created_at   TIMESTAMP,
    updated_at   TIMESTAMP,
    numeric_id   BIGINT,
    dead_account BOOLEAN,
    language     VARCHAR(128),
    affiliate    BOOLEAN
);

CREATE TABLE IF NOT EXISTS twitch.edges
(
    id       BIGSERIAL PRIMARY KEY,
    follower BIGINT,
    followee BIGINT
);

Copy the twitch_gamers folder and the twitch.sql file into the Postgres container.

docker cp twitch_gamers postgres:/twitch_gamers/
docker cp twitch.sql postgres:/twitch_gamers/

Enter the Postgres container and start the console.

docker exec -it postgres psql -U postgres

Execute the following commands to create the tables and import the data.

\i /twitch_gamers/twitch.sql
\copy twitch.features FROM /twitch_gamers/large_twitch_features.csv WITH CSV HEADER;
\copy twitch.edges(follower, followee) FROM /twitch_gamers/large_twitch_edges.csv WITH CSV HEADER;

Step 4: Query The Twitch Data As A Graph

Now the tables have been created from the Twitch data, and we can query them:

postgres=# \d twitch.features;
                           Table "twitch.features"
    Column    |            Type             | Collation | Nullable | Default 
--------------+-----------------------------+-----------+----------+---------
 views        | bigint                      |           |          | 
 mature       | boolean                     |           |          | 
 life_time    | bigint                      |           |          | 
 created_at   | timestamp without time zone |           |          | 
 updated_at   | timestamp without time zone |           |          | 
 numeric_id   | bigint                      |           |          | 
 dead_account | boolean                     |           |          | 
 language     | character varying(128)      |           |          | 
 affiliate    | boolean                     |           |          | 

postgres=# \d twitch.edges;
                                Table "twitch.edges"
  Column  |  Type  | Collation | Nullable |                 Default                  
----------+--------+-----------+----------+------------------------------------------
 id       | bigint |           | not null | nextval('twitch.edges_id_seq'::regclass)
 follower | bigint |           |          | 
 followee | bigint |           |          | 
Indexes:
    "edges_pkey" PRIMARY KEY, btree (id)

postgres=# select count(*) from twitch.features;
 count  
--------
 168114
(1 row)

postgres=# select count(*) from twitch.edges;
  count  
---------
 6797557
(1 row)

Now we can also query the same copy of the data as a graph.

PuppyGraph is running at port 8081. Open localhost:8081 in your browser to access it.

Log in with the username puppygraph and password puppygraph123.

PuppyGraph Login Screen

After that, upload the following schema.json file to build the graph on top of the Postgres tables.

{
    "catalogs": [
      {
          "name": "gamers",
          "type": "postgresql",
          "jdbc": {
              "username": "postgres",
              "password": "postgres123",
              "jdbcUri": "jdbc:postgresql://postgres:5432/postgres",
              "driverClass": "org.postgresql.Driver"
          }
      }
    ],
    "vertices": [
      {
        "label": "account",
        "mappedTableSource": {
          "catalog": "gamers",
          "schema": "twitch",
          "table": "features",
          "metaFields": {"id": "numeric_id"}
        },
        "attributes": [
          { "type": "Long"    , "name": "views"        },
          { "type": "Long"    , "name": "life_time"    },
          { "type": "DATETIME", "name": "created_at"   },
          { "type": "DATETIME", "name": "updated_at"   },
          { "type": "String"  , "name": "language"     },
          { "type": "Boolean" , "name": "mature"       },
          { "type": "Boolean" , "name": "dead_account" },
          { "type": "Boolean" , "name": "affiliate"    }
        ]
      }
    ],
    "edges": [
      {
        "label": "follows",
        "mappedTableSource": {
          "catalog": "gamers",
          "schema": "twitch",
          "table": "edges",
          "metaFields": {"id": "id", "from": "follower", "to": "followee"}
        },
        "from": "account",
        "to": "account",
        "attributes": []
      }
    ]
  }

Once the JSON file is uploaded, PuppyGraph will visualize the graph schema.

PuppyGraph Data Visualization Screen

PuppyGraph supports both Gremlin and openCypher. In this tutorial, we will use Gremlin.

PuppyGraph offers both Gremlin and Cypher support

Next, we will get the top-5 accounts whose last update time was the earliest. Below is the query we’ll run.

puppy-gremlin> g.V().order().by('updated_at').limit(5).elementMap()
Done! Elapsed time: 0.052s, rows: 5
==>map[affiliate:false created_at:2012-12-22 00:00:00 +0000 UTC dead_account:true id:account[7017] label:account language:OTHER life_time:266 mature:false updated_at:2013-09-14 00:00:00 +0000 UTC views:0]
==>map[affiliate:false created_at:2013-12-21 00:00:00 +0000 UTC dead_account:true id:account[140843] label:account language:OTHER life_time:52 mature:false updated_at:2014-02-11 00:00:00 +0000 UTC views:0]
==>map[affiliate:false created_at:2012-10-04 00:00:00 +0000 UTC dead_account:true id:account[32194] label:account language:OTHER life_time:506 mature:false updated_at:2014-02-22 00:00:00 +0000 UTC views:0]
==>map[affiliate:false created_at:2011-12-18 00:00:00 +0000 UTC dead_account:true id:account[111748] label:account language:OTHER life_time:811 mature:true updated_at:2014-03-08 00:00:00 +0000 UTC views:0]
==>map[affiliate:false created_at:2013-03-19 00:00:00 +0000 UTC dead_account:true id:account[104409] label:account language:OTHER life_time:414 mature:false updated_at:2014-05-07 00:00:00 +0000 UTC views:0]

Now we know that the account with id account[7017] was the least recently updated one. With graphs, we can easily traverse from a given account to its followers.

Next, we’ll find the top 5 most-viewed accounts among the 2-hop followers (followers of followers) of account[7017].

puppy-gremlin> g.V('account[7017]').both().both().order().by('views', desc).limit(5).elementMap()
Done! Elapsed time: 0.729s, rows: 5
==>map[affiliate:false created_at:2011-05-20 00:00:00 +0000 UTC dead_account:false id:account[32338] label:account language:EN life_time:2702 mature:false updated_at:2018-10-12 00:00:00 +0000 UTC views:202142952]
==>map[affiliate:false created_at:2007-06-28 00:00:00 +0000 UTC dead_account:false id:account[58773] label:account language:EN life_time:4124 mature:false updated_at:2018-10-12 00:00:00 +0000 UTC views:25063546]
==>map[affiliate:false created_at:2011-04-14 00:00:00 +0000 UTC dead_account:false id:account[56352] label:account language:EN life_time:2738 mature:false updated_at:2018-10-12 00:00:00 +0000 UTC views:21717613]
==>map[affiliate:false created_at:2015-01-18 00:00:00 +0000 UTC dead_account:false id:account[94108] label:account language:EN life_time:1363 mature:false updated_at:2018-10-12 00:00:00 +0000 UTC views:12124358]
==>map[affiliate:false created_at:2011-05-30 00:00:00 +0000 UTC dead_account:false id:account[131835] label:account language:EN life_time:2691 mature:true updated_at:2018-10-11 00:00:00 +0000 UTC views:4202097]

Here is a visualization of the Twitch dataset. It allows you to get a sense of the dataset as a graph and get an initial impression of your data. Click “Visualize” on the left panel to access it. You can even view the graph full-screen!

Full visualization of the Twitch dataset within PuppyGraph

You can also zoom in on an individual node (each node represents an account) to see the attributes associated with that account.

Zoom in on a particular node within PuppyGraph

In PuppyGraph, you can also build a dashboard that provides a high-level overview of the dataset. For example, from this dashboard we can quickly see that there are 168,114 nodes connected by 6,797,557 edges. You can also display a table of sample accounts and their attributes.

PuppyGraph Dashboard
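If you prefer the console, a quick sanity check along these lines should reproduce those totals against the schema defined above (counting the account vertices and the follows edges):

puppy-gremlin> g.V().hasLabel('account').count()
puppy-gremlin> g.E().hasLabel('follows').count()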

Why PuppyGraph

PuppyGraph sets itself apart by decoupling storage from computation, capitalizing on the advantages of columnar data lakes to deliver significant scalability and performance gains. When conducting intricate graph queries like multi-hop neighbor searches, the need arises to join and manipulate numerous records. The columnar approach to data storage enhances read efficiency, allowing for the quick fetching of only the relevant columns needed for a query, thus avoiding the exhaustive scanning of entire rows.

PuppyGraph Architecture

With PuppyGraph, you can use the SQL data stores as you normally would, while reaping the benefits of graph-specific use cases such as complex pattern matching and efficient pathfinding. It avoids the additional complexity and resource consumption of maintaining a separate graph database and the associated ETL pipelines.

Advantages of Using Knowledge Graphs

Reducing hallucinations in generated content or predictions

Knowledge graphs can significantly reduce hallucinations in machine learning outputs, particularly in natural language processing (NLP) and generative models. Hallucinations in machine learning refer to instances where models generate incorrect, fabricated, or nonsensical information that is not supported by the input data or factual knowledge. This can be a significant issue in applications like chatbots, content generation, and automated decision-making systems, where accuracy and reliability are critical. 

Knowledge graphs offer a grounded context for machine learning models by linking data to real-world entities and their relationships. By structuring data in a way that reflects real-world knowledge and relationships, knowledge graphs ensure that models are trained on high-quality, consistent data. This reduces the risk of models learning from erroneous or conflicting data, which can lead to hallucinations in the generated outputs. In other words, knowledge graphs help ensure that the information generated by the model is based on factual and verifiable data. For example, in text generation tasks, a model trained with a knowledge graph can reference the graph to verify facts before including them in the generated text, reducing the likelihood of generating false or nonsensical information. 

Knowledge graphs can be used as a reference for fact-checking and validation during or after the generation process. By comparing generated content against the information in the knowledge graph, models can verify the accuracy of the information they produce. This capability is especially useful in automated content generation and question-answering systems, where factual accuracy is paramount.
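As an illustrative sketch of such a check (the company label, the acquired edge, and the entity names are hypothetical), a generated claim like "Company X acquired Company Y" can be validated by testing whether the corresponding path actually exists in the graph before the text is emitted:

// Returns true only if the knowledge graph contains an 'acquired' edge
// from Company X to Company Y; a false result signals a likely hallucination.
g.V().has('company', 'name', 'Company X').
  out('acquired').
  has('name', 'Company Y').
  hasNext()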

Improving the interpretability of machine learning models

Knowledge graphs contribute to the interpretability of machine learning models by providing a clear and logical structure of the data and its underlying relationships. This clarity can help explain the rationale behind model predictions, making it easier to trust and validate the models' outputs.

In traditional machine learning models, features are often engineered from raw data through transformations that can obscure their original meaning, making it difficult to interpret what the model is basing its decisions on. Knowledge graphs, however, provide a direct mapping of features to real-world entities and their relationships, enhancing transparency. 

Knowledge graphs can be used to generate explanations for model predictions by highlighting the paths and entities within the graph that were most influential in reaching a decision. This capability is particularly valuable in fields requiring a high degree of accountability and transparency, such as healthcare diagnostics or financial risk assessment. By providing clear, traceable explanations for its decisions, a model becomes more trustworthy and easier to validate against domain knowledge.
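A hedged sketch of that idea (patient/symptom/disease labels and names are invented for illustration): once a diagnostic model predicts a disease, the supporting evidence can be surfaced as the explicit symptom-to-disease paths in the graph.

// Show the explicit paths that connect a patient's recorded symptoms to the
// predicted disease, as human-readable evidence for the prediction.
g.V().has('patient', 'name', 'Jane Doe').
  out('reports').                 // patient -> symptom
  out('indicates').               // symptom -> disease
  has('name', 'Lyme disease').
  path().by('name')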

Finally, the interpretability provided by knowledge graphs helps build trust among end users, who can better understand and validate the rationale behind the model's predictions. This trust is essential for the adoption of ML solutions in sensitive and impactful areas, where stakeholders must be confident in the model's accuracy and fairness.

Challenges of Building and Using Knowledge Graphs

Despite the undeniable benefits of knowledge graphs, their construction and usage present certain challenges. Aligning data with standards is a challenge when building knowledge graphs as it involves curating unstructured scientific text with ontologies to contextualize data for computer understanding. Data harmonization is another critical challenge due to the varied terminologies different authors use to describe the same concepts.

Extracting relations from data to identify specific associations between entities is challenging, particularly when the relationships are ambiguous and context-dependent. Generating the schema for knowledge graphs is complex as it involves creating a meta-graph that illustrates the relevant entities and their interrelationships. The necessity to integrate data from multiple sources can also be hindered by synonymy and terminological differences across datasets.

Discerning when a specific association exists between two entities, as opposed to mere co-mentioning in the same document, is a nuanced challenge necessitating meticulous relation extraction techniques. The use of machine learning models to help identify relationships in specific contexts is necessary but also adds complexity to the process of building knowledge graphs.

Furthermore, developing a high-level meta-graph through schema generation requires decisions about the representation and format, such as the Resource Description Framework, that will best suit the intended application of the knowledge graph.

Summary

Knowledge graphs and machine learning are key tools that drive advancements in AI and data science. Knowledge graphs provide a structured representation of knowledge, effectively encapsulating entities and their relationships, adding context to data, and enhancing its navigability and analysis. Machine learning leverages this structured data to build models capable of identifying patterns, making predictions, and generating insights.

Ready to add a knowledge graph on top of your existing SQL data? Download the forever free PuppyGraph Developer Edition or begin your free 30-day trial of the Enterprise Edition today.

Danfeng Xu, CTO and Co-founder of PuppyGraph, is a passionate learner with extensive experience across online platforms, streaming services, big data, and developer productivity. He previously worked at LinkedIn, where he led a unified server platform strategy for thousands of microservices and modernized the engagement platform to deliver dynamic, personalized and engaging user experiences. He holds a Master's degree in Computer Science from UCLA.
