Episode 7: Knowledge Graph, storage and retrieval
📅 Published on September 6, 2024
The “divide and conquer” strategy breaks a complex problem into smaller, manageable parts. Applied to knowledge graphs, it means modeling each business entity (talents, projects, clients, products, contracts) as a node in a graph database. Edges connect these nodes and represent the relationships between them. Together, nodes and edges hold the contextual information we want to retrieve later.
I. Storage: Organizing data into a knowledge graph
The first phase of working with data in this context is storage. Most businesses already have existing data environments that need to be transformed into structured knowledge graphs. This involves organizing the data into nodes and establishing relationships between these nodes, effectively linking different entities. An essential aspect of this process is setting up data streaming to ensure continuous updates as new information becomes available.
To convert raw data into indexed and clustered knowledge, I recommend using a large language model (LLM) fine-tuned on your business data. A well-tuned LLM connected to a graph database like Neo4j can autonomously create nodes, assign properties and labels for classification, and establish relationships between them. This captures the necessary context while filtering out unnecessary information, so that only the most relevant data is stored for future retrieval. The relationships worth modeling fall into six broad types:
1. Ownership
Relation: Between stakeholders and projects or talents.
Explanation: Indicates that a stakeholder “owns” or is responsible for a certain project or talent.
2. Hierarchical
Relation: Between projects or technologies.
Explanation: Represents a hierarchy between projects or technologies, for example a main project that oversees multiple sub-projects, or technologies that depend on each other.
3. Causal
Relation: Between projects or between technologies and projects.
Explanation: Indicates that a project or technology causes or directly influences another project. For instance, the success of one project might lead to the start of a new project.
4. Sequential
Relation: Between projects.
Explanation: Represents a temporal order between projects, where one project starts after the completion of another or follows a certain development sequence.
5. Temporal
Relation: Between projects or between talents and projects.
Explanation: Marks the duration during which a project is active or the period during which a talent is engaged in a project.
6. Correlation
Relation: Between projects, technologies, or talents.
Explanation: Indicates that there is an association or similarity between different projects, technologies, or talents, but without implying a direct causal relationship. For example, two projects may be correlated because they use similar technologies.
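To make the six relationship types concrete, here is a minimal sketch of how they could be turned into Cypher statements. The relationship names (`OWNS`, `PARENT_OF`, and so on), the `id` property, and the helper functions are my own assumptions for illustration, not a fixed schema. The helpers only build the query text and its parameters, so nothing here requires a running database.

```python
from enum import Enum


class RelationType(Enum):
    """The six relationship types above; the Cypher names are hypothetical."""
    OWNERSHIP = "OWNS"
    HIERARCHICAL = "PARENT_OF"
    CAUSAL = "CAUSES"
    SEQUENTIAL = "PRECEDES"
    TEMPORAL = "ENGAGED_IN"
    CORRELATION = "CORRELATES_WITH"


def merge_node(label, node_id, properties=None):
    """Build an idempotent Cypher statement and parameters for one node."""
    # Labels cannot be passed as query parameters, so validate before interpolating.
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    query = f"MERGE (n:{label} {{id: $id}}) SET n += $props"
    return query, {"id": node_id, "props": properties or {}}


def merge_relationship(source_id, relation, target_id, properties=None):
    """Build a Cypher statement linking two existing nodes with a typed edge."""
    # Matching on id alone keeps the sketch short; matching on labels as well
    # would let the database use an index.
    query = (
        "MATCH (a {id: $source_id}), (b {id: $target_id}) "
        f"MERGE (a)-[r:{relation.value}]->(b) SET r += $props"
    )
    params = {
        "source_id": source_id,
        "target_id": target_id,
        "props": properties or {},
    }
    return query, params


if __name__ == "__main__":
    print(merge_node("Project", "p1", {"name": "Atlas"}))
    print(merge_relationship("t1", RelationType.TEMPORAL, "p1",
                             {"from_date": "2024-01-01"}))
```

Because both helpers use `MERGE`, replaying the same update from a data stream does not create duplicates. Each returned pair could then be handed to a database client, for example the `session.run(query, params)` call of the official Neo4j Python driver.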
The golden rule for storage
The key principle here is to define your storage and ontological strategies based on the specific purpose of your LLM. This involves making careful decisions about which nodes, properties, labels, and relationships should be created, ensuring that they align with the system’s objectives.
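One way to enforce this rule is to write the ontology down as data and check every proposed edge against it before it reaches the graph. The sketch below assumes a small, made-up ontology; the labels, relationship names, and the `validate_edge` helper are hypothetical and would be replaced by whatever your own use case requires.

```python
# Hypothetical ontology: which labels exist and which edges are allowed between them.
ALLOWED_LABELS = {"Stakeholder", "Talent", "Project", "Technology"}
ALLOWED_TRIPLES = {
    ("Stakeholder", "OWNS", "Project"),
    ("Stakeholder", "OWNS", "Talent"),
    ("Project", "PARENT_OF", "Project"),
    ("Project", "PRECEDES", "Project"),
    ("Talent", "ENGAGED_IN", "Project"),
}


def validate_edge(source_label, relation, target_label):
    """Return a list of problems; an empty list means the edge fits the ontology."""
    problems = []
    for label in (source_label, target_label):
        if label not in ALLOWED_LABELS:
            problems.append(f"unknown label: {label}")
    triple = (source_label, relation, target_label)
    if not problems and triple not in ALLOWED_TRIPLES:
        problems.append(
            f"relationship not in ontology: "
            f"({source_label})-[{relation}]->({target_label})"
        )
    return problems


if __name__ == "__main__":
    print(validate_edge("Stakeholder", "OWNS", "Project"))
    print(validate_edge("Talent", "OWNS", "Project"))
```

A check like this is useful when an LLM proposes the nodes and relationships, because anything outside the agreed ontology can be rejected or flagged for review instead of silently stored.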
II. Retrieval: Accessing knowledge efficiently
Once data is stored in the knowledge graph, it becomes knowledge that can be retrieved by querying. In Neo4j, retrieval is done with Cypher, a declarative query language built around matching patterns of nodes and relationships. In our Turing environment, the LLM is fine-tuned to generate precise Cypher queries from user input, so that the right information is retrieved quickly and efficiently.
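As an illustration of the kind of query involved, here is a sketch that answers “which talents worked on projects owned by a given stakeholder?”. The labels, relationship names, and property names are assumptions of mine, not the schema of our Turing environment. The user-supplied value travels as a parameter rather than being pasted into the query text.

```python
def talents_on_stakeholder_projects(stakeholder_name):
    """Build a parameterized Cypher query for talents on a stakeholder's projects."""
    # Hypothetical schema: (Stakeholder)-[:OWNS]->(Project)<-[:ENGAGED_IN]-(Talent)
    query = (
        "MATCH (s:Stakeholder {name: $name})-[:OWNS]->(p:Project)"
        "<-[e:ENGAGED_IN]-(t:Talent) "
        "RETURN t.name AS talent, p.name AS project, e.from_date AS from_date "
        "ORDER BY project, talent"
    )
    return query, {"name": stakeholder_name}


if __name__ == "__main__":
    query, params = talents_on_stakeholder_projects("Ada")
    print(query)
    print(params)
```

Keeping values in parameters matters even more when an LLM writes the query, since it separates the generated pattern from whatever the user typed.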
However, the complexity of these queries increases with the number of relationships between nodes. As a reminder, relationships can be ownership, hierarchical, causal, sequential, temporal, or correlation connections. In more complex, multidimensional environments, traditional queries may become cumbersome, and this is where vector-based retrieval comes into play.
In such cases, similarity-based retrieval using vectors offers a more effective solution. This method works on high-dimensional embeddings and finds semantically related items without spelling out every relationship path, which helps when the context is broad and complex. This is why vector databases, like ChromaDB, are becoming increasingly popular in AI and knowledge graph ecosystems. ChromaDB is particularly useful because it integrates with LangChain, LlamaIndex, and other libraries we are using, and because it is open source and can run locally inside the application, unlike a managed cloud service such as Pinecone.
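To show what similarity-based retrieval means in practice, here is a minimal sketch in plain Python. The three-dimensional vectors and document ids are invented for illustration; real embeddings come from a model and have hundreds of dimensions, and a vector database would replace this linear scan with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy embeddings, made up for this example.
INDEX = {
    "project:atlas": [0.9, 0.1, 0.0],
    "project:borealis": [0.8, 0.3, 0.1],
    "contract:2024-17": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.0, 0.0], INDEX):
        print(doc_id, round(score, 3))
```

Notice that the query never names a relationship type. Items are returned because they sit close together in the embedding space, which is what makes this approach a good complement to explicit graph queries.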
Next Steps: Exploring RAG and vector databases
In the next phase, we will dive deeper into Retrieval-Augmented Generation (RAG) and explore how vector databases like ChromaDB enhance retrieval efficiency. This hybrid approach combines the strengths of traditional knowledge graphs and advanced vector databases, ensuring both structured data storage and rapid, high-dimensional data retrieval.
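As a preview of that hybrid idea, the sketch below first uses vector similarity to pick an entry node and then follows the graph's edges to collect its structured context. Everything in it is a simplifying assumption: the embeddings are made-up unit vectors (so the dot product equals cosine similarity), the edges live in a plain list instead of a graph database, and `hybrid_retrieve` is a hypothetical helper.

```python
# Made-up unit-length embeddings for two nodes.
EMBEDDINGS = {
    "project:atlas": [0.6, 0.8],
    "project:borealis": [1.0, 0.0],
}

# Made-up edges as (source, relation, target) triples.
EDGES = [
    ("talent:ada", "ENGAGED_IN", "project:atlas"),
    ("stakeholder:kim", "OWNS", "project:atlas"),
    ("project:atlas", "PRECEDES", "project:borealis"),
    ("talent:lin", "ENGAGED_IN", "project:borealis"),
]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_retrieve(query_vector, k=1):
    """Pick the k most similar nodes, then gather every edge touching them."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda node: dot(query_vector, EMBEDDINGS[node]),
        reverse=True,
    )
    seeds = ranked[:k]
    context = [
        (source, relation, target)
        for source, relation, target in EDGES
        if source in seeds or target in seeds
    ]
    return seeds, context


if __name__ == "__main__":
    seeds, context = hybrid_retrieve([0.6, 0.8])
    print(seeds)
    for edge in context:
        print(edge)
```

The vector step handles fuzzy matching, while the graph step returns explicit, explainable relationships that can be passed to the LLM as context.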
By combining storage and retrieval strategies with the right tools, businesses can leverage their data more effectively, making it a crucial part of decision-making processes and operational efficiency.