Generative language models like ChatGPT can answer almost any question instantly and are easy to use. However, upon closer inspection, some issues become apparent. One major issue is hallucinations: the models sometimes generate incorrect information. When a model lacks specific information, it fabricates details, presenting them convincingly enough to seem plausible. For example, an early version of Llama mistakenly identified Heise Verlag as the organizer of CeBIT due to its association with publications, IT, and Hannover, where the world’s largest computer fair was once held. Such well-crafted misinformation can easily be believed.
Training large language models is also extremely resource-intensive, often requiring thousands of GPU-years. As a result, models are rarely retrained, meaning they lack the latest information. Even relatively new models like Llama 3.1 have a knowledge cutoff from the previous year. Public language models struggle with internal information, such as company-specific content, since it isn’t included in their training data. While fine-tuning models is possible, it is labor-intensive and must be repeated for every new document added.
Combining large language models (LLMs) with retrieval-augmented generation (RAG) can help. Documents are indexed using embedding models, which are a type of large language model. Similarity metrics are used to find documents or passages that best answer a question. This “context” is then provided to a generative model, which summarizes the results and aligns them with the query.
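To make the retrieval step concrete, here is a minimal sketch assuming the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative choices, and the three documents are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative mini-corpus; in practice this would be the full document collection.
documents = [
    "CeBIT was a computer expo held annually in Hannover.",
    "Retrieval-augmented generation combines document search with text generation.",
    "Knowledge graphs store entities and the relationships between them.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing: embed every document once.
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Retrieval: embed the question and rank the documents by cosine similarity.
question = "How does retrieval-augmented generation work?"
question_embedding = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, doc_embeddings)[0]

# The best-matching passages form the context that is handed to the generative model.
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```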
RAG methods have become very popular and caused a small revolution in information retrieval last year, delivering significantly better results than previous approaches. Many frameworks now implement RAG. Using it effectively is nevertheless complex, because there are several dimensions to optimize: different embedding models, rerankers, and generative models can be combined, and it takes experience to select the right mix.
What RAG alone currently cannot do, however, is extract formalized knowledge from documents, even though such knowledge would improve model responses. This is where knowledge graphs come in, so it makes sense to combine the two ideas.
Microsoft introduced GraphRAG, a hierarchical approach to RAG, in contrast to purely semantic searches over text fragments. The process extracts a knowledge graph from raw text and builds a community hierarchy with content summaries, which are then used during retrieval to formulate better answers. Unlike many Microsoft projects, the GraphRAG implementation remains rather opaque: the existing Jupyter notebooks rely heavily on Azure and OpenAI and transfer all information to the cloud, and much of the logic is buried in classes, which makes it difficult to follow what happens behind the scenes.
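The underlying indexing idea can still be sketched conceptually. The following is not Microsoft's code: the hard-coded triples stand in for LLM-driven entity extraction, and NetworkX's Louvain algorithm stands in for the community hierarchy whose groups GraphRAG would then summarize with an LLM:

```python
import networkx as nx

# Triples as they might come out of LLM-driven entity extraction (illustrative).
triples = [
    ("machine learning", "subfield_of", "artificial intelligence"),
    ("deep learning", "subfield_of", "machine learning"),
    ("CeBIT", "held_in", "Hannover"),
    ("Hannover", "located_in", "Germany"),
]

# Build the knowledge graph from the extracted relations.
graph = nx.Graph()
for subject, relation, obj in triples:
    graph.add_edge(subject, obj, relation=relation)

# Group the graph into communities; GraphRAG builds a hierarchy of such groups.
communities = nx.community.louvain_communities(graph, seed=42)

for community in communities:
    # At this point GraphRAG would have an LLM write a content summary per community;
    # those summaries are what retrieval later draws on.
    print("Community:", ", ".join(sorted(community)))
```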
Fortunately, alternative sources and implementations exist. The introductory article by neuml is recommended, providing a clearer view of the process. An embedding model (intfloat/e5-base) enables similarity queries. Existing Wikipedia embeddings, available via Hugging Face, serve as the data basis. The implementation indexes a subset (the top 100,000 articles) and returns a graph for a query like “Machine Learning.”
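A condensed sketch of that setup, assuming neuml's txtai library; the graph settings follow the neuml example, and the two records are placeholders for the Wikipedia subset:

```python
from txtai import Embeddings

embeddings = Embeddings(
    path="intfloat/e5-base",                        # embedding model for the similarity queries
    content=True,                                   # keep the original text alongside the vectors
    graph={"approximate": False, "minscore": 0.7},  # build a semantic graph while indexing
)

# Placeholder for the top Wikipedia articles as (id, text, tags) tuples.
documents = [
    ("Machine learning", "Machine learning is a field of study in artificial intelligence ...", None),
    ("Artificial intelligence", "Artificial intelligence is the intelligence of machines ...", None),
]
embeddings.index(documents)

# graph=True returns the hits as a semantic graph instead of a flat result list.
graph = embeddings.search("Machine Learning", 50, graph=True)
```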
The nodes in the upper part of the graph are more densely connected, indicating a higher information density. Since the graph is modeled as a NetworkX graph in Python, its methods can be used to determine paths between nodes or to identify the nodes with the highest centrality. The results reveal relevant Wikipedia pages, and the scores from the similarity analysis suggest a different ranking than the graph analysis. Combining both methods identifies the documents most relevant to the query and the context.
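Continuing the sketch: the graph object returned by txtai wraps a NetworkX graph, here assumed to be reachable via graph.backend, so the usual NetworkX analysis functions apply:

```python
import networkx as nx

g = graph.backend  # underlying NetworkX graph (attribute name assumed)

# Nodes with the highest centrality are the densely connected, information-rich pages.
centrality = sorted(nx.degree_centrality(g).items(), key=lambda item: item[1], reverse=True)
print(centrality[:5])

# A path between the two most central result nodes (assuming they are connected).
if len(centrality) >= 2:
    source, target = centrality[0][0], centrality[1][0]
    print(nx.shortest_path(g, source, target))
```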
The topics that appear in the results are interesting; they are calculated during indexing, a resource-intensive process that runs much faster on a GPU than on a CPU. Powerful hardware is also needed for RAG’s final phase, text generation. Neuml uses a Mistral-7B-OpenOrca model, shrunk with Activation-aware Weight Quantization (AWQ) so that it also runs on less powerful GPUs. This allows answers to be generated quickly and facts to be extracted from the graph. The “Machine Learning” query is answered using Wikipedia:
Machine learning is a field in artificial intelligence focused on developing statistical algorithms that learn from and generalize to unseen data. Generative artificial neural networks have surpassed many previous approaches. Machine learning algorithms can inherit and amplify biases in training data, leading to skewed representations or unfair demographic treatment. Supervised learning uses input objects and desired outputs to train models, enabling accurate predictions for unseen data. An autoencoder is a neural network used for unsupervised learning, efficient coding of unlabeled data, and dimensionality reduction.
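That answer is produced roughly as follows; a sketch assuming txtai's LLM pipeline and an AWQ-quantized checkpoint such as TheBloke/Mistral-7B-OpenOrca-AWQ (the exact model name is an assumption), with the context standing in for the text of the most relevant graph nodes:

```python
from txtai.pipeline import LLM

# AWQ-quantized model so that generation also works on smaller GPUs (checkpoint name assumed).
llm = LLM("TheBloke/Mistral-7B-OpenOrca-AWQ")

# In the real pipeline, the context is assembled from the most relevant graph nodes.
context = "Machine learning is a field of study in artificial intelligence ..."
prompt = (
    "Answer the question using only the context below.\n\n"
    "Question: What is machine learning?\n\n"
    f"Context: {context}"
)
print(llm(prompt))
```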
The response focuses on generative AI, reflecting the information available on Wikipedia, which apparently has more on generative models than other machine learning topics. Surprisingly, the model explains supervised learning and autoencoders but not unsupervised learning.
Neuml also provides a Streamlit application on GitHub for experimentation, which allows users to index their own documents.