GitHub: urban-air-quality-kg Documentation

Project Overview

The Urban Air Quality KG project, the outcome of a Built Environment Fellowship funded by the 1851 Royal Commission, aims to enhance understanding of urban air pollution by building a semantically enriched knowledge graph (KG) of air quality data. The KG represents key concepts such as pollutants, pollution sources, meteorological factors, and mitigation measures in a structured graph database (Neo4j). This enables users to explore urban air quality phenomena (e.g. causes and effects of pollution) and associated mitigation strategies through natural language queries and graph-based reasoning.

Key Components: The system integrates several advanced tools: a Neo4j graph database for storing and querying the knowledge graph, SentenceTransformers (a library of embedding language models) to add semantic vector representations to graph nodes, and a Retrieval-Augmented Generation (RAG) pipeline that uses a local large language model to answer questions by retrieving relevant knowledge from the graph. In summary, raw textual information about urban air quality is converted into a structured KG, enriched with semantic embeddings, and made queryable via an LLM-based Q&A interface.

Project Goals: By combining knowledge graphs with LLMs, this project allows users to query and interact with air quality data in natural language, obtaining insightful answers that are grounded in the structured knowledge. The goal is to support researchers and policymakers in exploring how different pollutants, environmental conditions, and interventions relate to each other in urban contexts.

An interactive visualization of the knowledge graph is available at: https://xiangx91.github.io/urban-air-quality-kg/visualisation

Nodes in the visualization are colored by category (e.g. pollutants, sources, environmental factors, mitigations) and edges illustrate the relationships between them. Such a graph structure enables complex queries and multi-hop reasoning about air quality in cities.

System Architecture

Architecture Overview: The Urban Air Quality KG system can be seen as a pipeline with three main stages: Knowledge Graph construction, Semantic Embedding & Retrieval, and LLM-driven Q&A. Each stage corresponds to specific components in the repository:

  • Knowledge Graph Construction: Unstructured text about air quality (e.g. research reports, policy documents) is processed to extract structured knowledge. This uses Large Language Models (LLMs) guided by a predefined ontology to identify entities (pollutants, sources, etc.) and relationships, outputting JSON data. The JSON is then imported into Neo4j, creating nodes and relationships according to the project’s ontology/schema. Neo4j serves as the KG storage, managing data on pollutants, emission sources, environmental conditions, and mitigation measures. The ontology (defined in a YAML file) ensures that extracted knowledge fits a consistent schema.
  • Semantic Embedding & Similarity Retrieval: Once the KG is in Neo4j, the project employs SentenceTransformers to generate vector embeddings for each node (capturing semantic meaning). These embeddings are stored as properties in Neo4j for use in similarity search. When a user poses a query, the system can embed the query in the same vector space and compare it to node embeddings to find relevant nodes (for example, finding which entities are semantically related to the query). The script neo4j_similarity_search.py performs this semantic similarity search in the KG, retrieving closely related entities based on the query vector. This allows the system to identify not just direct matches, but conceptually similar information (e.g. a query about “smog” might retrieve nodes related to particulate matter and traffic emissions even if the word “smog” isn’t explicitly a node).
  • Retrieval-Augmented Generation (RAG) Q&A: The final layer is a local Q&A system that uses a Retrieval-Augmented Generation approach. Here, a local LLM (such as a Mistral 7B model, provided in the models/ directory) is run via Llama.cpp to generate answers. The pipeline (neo4j_local_rag.py) takes a natural language question from the user, embeds it, retrieves relevant knowledge graph nodes (and their associated info) via the similarity search, and then feeds that contextual information along with the question into the LLM. The LLM, armed with factual data from the KG, composes a natural language answer for the user. This ensures the answers are grounded in the curated knowledge graph content (improving accuracy and relevance). The RAG approach effectively combines the KG’s structured data with the generative abilities of the LLM.

In summary, the architecture marries a Neo4j knowledge graph (for structured, queryable data) with embedding-based retrieval and a local AI model for answering questions. This design enables interactive exploration: users ask questions in plain English, the system finds pertinent facts in the graph, and the LLM formulates a detailed answer based on those facts.

Repository Structure and Key Files

The repository is organized to reflect the stages above, with directories for data, ontology, source code, models, etc. Below is a breakdown of the repository structure (with key files):

urban-air-quality-kg/
├── data/
│ ├── example_txt/ # Example text files for knowledge extraction input
│ ├── baseline_KG/ # Baseline knowledge graph data (structured JSON files)
│ ├── RAG/ # Neo4j database dump of the baseline KG (for quick start with RAG)
│ └── output/ # Outputs of knowledge extraction/merging (JSON files)
├── images/ # Documentation visuals (graphs, diagrams)
├── models/ # Local LLM models for RAG (e.g. Mistral 7B in GGUF format)
├── notebooks/ # Jupyter notebooks demonstrating usage of various components
│ ├── Knowledge_extraction.ipynb # How to extract knowledge from text
│ ├── Knowledge_enrich_and_validation.ipynb # Merging new knowledge and validating JSON
│ ├── Embedding_and_similarity_search.ipynb # Generating embeddings & finding similar nodes
│ └── Explicit_local_RAG_QA.ipynb # Running the local RAG Q&A pipeline
├── ontology/
│ └── urban_air_quality.yaml # Ontology definitions (schema of entities/relations)
├── src/ # Python scripts for KG construction and query pipeline
│ ├── extraction.py # Text-to-JSON knowledge extraction using LLMs
│ ├── jsonvalidator.py # Validate JSON data against the ontology schema
│ ├── merge_knowledge.py # Merge multiple JSON knowledge files, resolve duplicates
│ ├── neo4j_local_import.py # Import JSON data into Neo4j (creating nodes/relationships)
│ ├── neo4j_embedding_pipeline.py # Generate and store node embeddings (SentenceTransformers)
│ ├── neo4j_similarity_search.py # Perform semantic similarity search in the KG
│ └── neo4j_local_rag.py # Run the local RAG Q&A (retrieval + LLM answer)
├── visualisation/ # Tools or scripts for visualizing the KG (if provided)
├── requirements.txt # Python dependencies for the project
└── README.md # Project documentation and instructions

Each of the key Python scripts in src/ corresponds to a specific function in the pipeline (as noted by in-line comments above). For example, extraction.py handles extracting structured facts from raw text using an LLM (guided by the ontology), outputting a JSON file with entities and relationships. The neo4j_local_import.py script then reads such JSON and imports it into the Neo4j graph, mapping the data into the pre-defined graph schema (creating nodes labeled as Pollutant, Source, etc., and relationships among them). The neo4j_embedding_pipeline.py computes embeddings for each node via SentenceTransformers and stores these vectors in Neo4j, while neo4j_similarity_search.py can query those embeddings to find related nodes. Finally, neo4j_local_rag.py ties it all together to enable question-answering over the graph with a local LLM.

The Jupyter notebooks in notebooks/ serve as tutorials/demos for each stage of usage. For instance, Knowledge_extraction.ipynb demonstrates using the extraction script on the files in data/example_txt/ to produce knowledge JSON, and Explicit_local_RAG_QA.ipynb shows how to ask the system questions and get answers via the RAG pipeline. These notebooks are a good starting point for working through the system step by step in an interactive environment.

The ontology YAML (urban_air_quality.yaml) defines the schema of the knowledge graph – i.e., what entity types exist (pollutants, environmental factors, etc.), what relationships link them, and any attributes. This ontology is used by the extraction and validation steps to ensure consistency (e.g. only valid entity types are created, required fields are present, etc.).
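
As a quick way to see what the ontology allows, you can load the YAML in Python. This is only a minimal sketch: the top-level key names used below ("entities", "relations") are assumptions for illustration, so check urban_air_quality.yaml for the actual structure.

import yaml  # PyYAML

# Load the ontology file and list its top-level sections
with open("ontology/urban_air_quality.yaml") as f:
    ontology = yaml.safe_load(f)

print(list(ontology.keys()))
# If the schema has sections such as "entities" / "relations" (an assumption),
# you could inspect them, e.g.:
# for name, spec in ontology.get("entities", {}).items():
#     print(name, spec)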

Installation and Setup

To set up the project locally, follow these steps (the instructions assume a Unix-like environment; Windows users can adjust the commands accordingly):

1. Clone the Repository: Start by downloading the code from GitHub. In a terminal, run:

git clone https://github.com/XiangX91/urban-air-quality-kg.git 
cd urban-air-quality-kg

2. Create a Virtual Environment: It’s recommended to use a Python virtual environment for the project. You can create one using venv or Conda. For example, with Python’s built-in venv:

python3 -m venv venv
source venv/bin/activate # (On Windows: venv\Scripts\activate)

This will activate a virtual environment named “venv” for the project.

3. Install Python Dependencies: Once the virtual env is active, install all required Python libraries by running:

pip install -r requirements.txt

This will download and install all packages listed in requirements.txt. These likely include libraries such as the Neo4j Python driver (or Py2Neo) for graph database access, SentenceTransformers for embedding generation, pandas for data handling, and possibly llama-cpp-python or similar to interface with the local LLM model. (Ensure you have an appropriate compiler set up if llama-cpp-python is used, as it may need to compile the LLM backend.)

4. Install Neo4j: Neo4j is the graph database used to store the knowledge graph. You need Neo4j (Community or Enterprise Edition) installed on your system. You can download it from the official Neo4j site and follow their installation instructions. After installation, start the Neo4j server and set a password for the neo4j user (the default username is neo4j; you will be prompted to set an admin password on first launch). Make sure Neo4j is running locally (by default on bolt://localhost:7687) so that the scripts can connect to it.
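
Once the server is up, a quick connectivity check from Python can save debugging later. This is a minimal sketch using the official Neo4j Python driver; adjust the URI, username, and password to match your installation.

from neo4j import GraphDatabase

# Verify that the scripts will be able to reach the local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))
with driver.session() as session:
    record = session.run("RETURN 1 AS ok").single()
    print("Connected to Neo4j:", record["ok"] == 1)
driver.close()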

5. Configure APOC (if not already enabled): APOC is a library of procedures for Neo4j that this project uses (likely for importing data from JSON or performing graph algorithms). In your Neo4j installation, locate the apoc.conf or neo4j.conf file (in the conf/ directory of Neo4j). Open this config file and ensure the following line is present (add it if not):

apoc.export.file.enabled=true

This setting allows APOC to export to/import from files, which might be needed for the KG import or export functionality. After adding the line, restart the Neo4j server to apply the changes. (Note: Some Neo4j versions bundle APOC by default but disable file access for security; this line explicitly enables it. In some cases, you might also need dbms.security.procedures.unrestricted=apoc.* in the config to allow all APOC procedures.)

6. (Optional) Set up the Local LLM Model: The project includes a models/ directory, which in the repository listing shows a file named mistral-7b-instruct-v0.2.Q4_K_M.gguf. This appears to be a quantized GGUF-format model of the Mistral 7B Instruct LLM. If this file is not present (it may be large and possibly handled via Git LFS or a separate download), you will need to obtain it. Make sure the model file is placed in models/ and that the path or name is correctly referenced in the code. This model will be used by the Llama.cpp backend to generate answers for the RAG pipeline. If you prefer, you can substitute your own GGUF model, but ensure it is an instruct-tuned model (so it responds well to questions) and update the code accordingly if needed. No additional installation is required for the model aside from having the file; the llama-cpp library will load it at runtime.
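
To confirm the model loads before wiring it into the pipeline, you can point llama-cpp-python at the file directly. This is a minimal sketch; the context size and thread count below are illustrative defaults, not values taken from the repository.

from llama_cpp import Llama

# Load the quantized Mistral model and run a one-off prompt
llm = Llama(
    model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,      # context window; adjust to your needs
    n_threads=8,     # CPU threads; tune for your machine
)
out = llm("[INST] What is PM2.5? [/INST]", max_tokens=128)
print(out["choices"][0]["text"])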

After completing the above steps, you should have all components in place: Python environment ready, dependencies installed, Neo4j running with the required configuration, and the local LLM model available. You’re now ready to build and explore the urban air quality knowledge graph!

Running the Project Locally (Step-by-Step Usage)

Once setup is complete, you can proceed to construct the knowledge graph and run queries. Below is a typical workflow with the corresponding scripts/notebooks:

1. Knowledge Extraction from Text: Start by extracting knowledge from unstructured text files. The repository provides example text files under data/example_txt/ that describe various air quality facts and measures. You can run the extraction in two ways:

  • Via script: Execute the extraction.py script on a folder of text files. By default, it may be configured to read from data/example_txt/ and output JSON to data/output/. For example, run: python src/extraction.py (Check the script for any arguments or configuration; it might output a combined JSON or one per text file.)
  • Via notebook: Open and run the steps in Knowledge_extraction.ipynb, which will load an LLM (possibly using OpenAI or a local model) to process each text and extract structured data according to the ontology. This extraction uses predefined prompts to identify entities like pollutants and their relationships in the text. After running this, you should obtain one or more JSON files in data/output/ containing the extracted knowledge. Each JSON entry will likely have fields identifying an entity (with a type and name) and how it connects to others; for example, a JSON might state that the pollutant “NO₂” is emitted by the source “Vehicles” under certain conditions. A hypothetical record of this kind is sketched below.
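
The snippet below shows, as Python data, what such an extracted record could look like. It is purely illustrative: the real field names and structure are determined by the ontology and the extraction prompts, not by this example.

# Hypothetical extracted record (field names are assumptions for illustration)
example_record = {
    "entities": [
        {"name": "NO2", "type": "Pollutant"},
        {"name": "Vehicles", "type": "Source"},
    ],
    "relationships": [
        {"source": "Vehicles", "relation": "emits", "target": "NO2"},
    ],
}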

2. Validate and Merge Knowledge (optional): If multiple JSON files are produced from different sources or texts, you might want to merge them into a single knowledge base and eliminate duplicates. The script merge_knowledge.py can merge new knowledge into an existing JSON dataset, using fuzzy matching to avoid duplicating the same entity. Likewise, jsonvalidator.py can be used to ensure the JSON conforms to the ontology schema (e.g., all required fields are present, entity types are valid). For example:

python src/jsonvalidator.py data/output/extracted_knowledge.json

This would print any validation errors against the ontology (defined in urban_air_quality.yaml). It’s good practice to run this before import. If you have multiple JSON fragments (e.g., one per input document), you can merge them:

python src/merge_knowledge.py data/output/knowledge1.json data/output/knowledge2.json -o data/output/combined.json

(Assuming the script takes input files and an output path via arguments – refer to its help or the notebook for exact usage.)
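
The merge step relies on fuzzy matching to decide when two differently written names refer to the same entity. The sketch below illustrates one simple way such matching could work, using Python's standard difflib; it is not the actual logic of merge_knowledge.py, and the 0.9 threshold is an arbitrary illustrative value.

from difflib import SequenceMatcher

def same_entity(name_a, name_b, threshold=0.9):
    # Treat two entity names as duplicates if their similarity ratio is high enough
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold

print(same_entity("Low Emission Zone", "Low emission zones"))  # True: near-identical names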

3. Import into Neo4j: With a consolidated and validated JSON of knowledge, the next step is to load it into the Neo4j graph database. Use the neo4j_local_import.py script for this. You may need to configure connection details (host, user, password) either inside this script or via environment variables. By default, it might assume bolt://localhost:7687 and user neo4j. Run the import:

python src/neo4j_local_import.py data/output/combined.json

This script will connect to Neo4j and create nodes and relationships as defined in the JSON. For example, a pollutant entity in JSON will become a node labeled “Pollutant” (or similar) in Neo4j, a mitigation measure becomes a node of type “Mitigation”, etc., and relations like “affects” or “emitted_by” will become relationships between nodes. After running this, you will have a populated knowledge graph in Neo4j. You can verify by opening Neo4j Browser (at http://localhost:7474) and running a simple Cypher query like:

MATCH (n) RETURN labels(n), count(*);

to see the count of nodes per label, or

MATCH p=(n)-[r]->(m) RETURN p LIMIT 20;

to see a sample of the graph relationships.
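
For readers who want to see the shape of such an import, the sketch below MERGEs entities and relationships from a JSON file into Neo4j with the official Python driver. It assumes the JSON holds simple "entities" and "relationships" lists (as in the hypothetical record above) and is only illustrative; neo4j_local_import.py may work quite differently (for example, via APOC).

import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with open("data/output/combined.json") as f:
    kg = json.load(f)  # assumed to hold "entities" and "relationships" lists

with driver.session() as session:
    for ent in kg.get("entities", []):
        # The ontology type becomes the node label; MERGE avoids duplicate nodes
        session.run(f"MERGE (n:{ent['type']} {{name: $name}})", name=ent["name"])
    for rel in kg.get("relationships", []):
        session.run(
            f"MATCH (a {{name: $src}}), (b {{name: $dst}}) "
            f"MERGE (a)-[:{rel['relation'].upper()}]->(b)",
            src=rel["source"], dst=rel["target"],
        )
driver.close()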

4. Generate Semantic Embeddings: Once the graph is populated, generate embeddings for the nodes to enable semantic similarity queries. Run neo4j_embedding_pipeline.py:

python src/neo4j_embedding_pipeline.py

This will likely connect to Neo4j, retrieve all nodes (or all node IDs and their text representation), compute an embedding for each using a SentenceTransformers model, and then store the embedding vector back onto each node in Neo4j (possibly as a property, e.g., embedding). The SentenceTransformers model could be a pre-trained model (e.g. all-MiniLM-L6-v2 or another suitable model for short phrases). The notebook Embedding_and_similarity_search.ipynb demonstrates this process as well. After running this, each node in the KG has a numerical vector representation capturing its meaning.
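
A compact version of this step might look like the sketch below. The model name (all-MiniLM-L6-v2), the use of the name property as the text to embed, and the embedding property key are all assumptions; check neo4j_embedding_pipeline.py for the actual choices.

from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with driver.session() as session:
    # Fetch node names, embed them, and write the vectors back as a node property
    nodes = list(session.run("MATCH (n) WHERE n.name IS NOT NULL RETURN n.name AS name"))
    for record in nodes:
        vector = model.encode(record["name"]).tolist()
        session.run("MATCH (n {name: $name}) SET n.embedding = $vector",
                    name=record["name"], vector=vector)
driver.close()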

5. Perform Query (Semantic Similarity Search): You can now query the knowledge graph for relevant information. One way is to use the semantic similarity search script directly. For example:

python src/neo4j_similarity_search.py "What are sources of PM2.5?"

The script will take the query string, embed it in the same vector space, and find the nearest neighbor nodes in the graph by cosine similarity. It might then print out the top matching nodes and maybe their relationships. In this example, a query about “sources of PM2.5” might retrieve nodes like “Vehicle emissions (PM2.5 particles)” or “Construction dust (PM2.5)” if those are in the graph. Essentially, this gives you a way to discover which parts of the KG are most related to your question. You can also do this interactively in the provided notebook, which will show how to formulate queries and interpret the results.
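
Under the hood, such a search amounts to embedding the query and ranking nodes by cosine similarity. The sketch below shows this brute-force approach in plain Python (assuming the embedding property written in the previous step); the actual script may instead use a Neo4j vector index or an APOC/GDS similarity procedure.

import numpy as np
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

query_vec = model.encode("What are sources of PM2.5?")

with driver.session() as session:
    rows = list(session.run(
        "MATCH (n) WHERE n.embedding IS NOT NULL RETURN n.name AS name, n.embedding AS emb"))
driver.close()

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank nodes by similarity to the query and show the top matches
for row in sorted(rows, key=lambda r: cosine(query_vec, r["emb"]), reverse=True)[:5]:
    print(row["name"])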

6. Ask Questions via RAG (Natural Language Q&A): The highlight of the project is the ability to ask complex questions in natural language and get answers backed by the KG. Use the neo4j_local_rag.py script to do this. For instance:

python src/neo4j_local_rag.py

(It might drop you into an interactive prompt or you might modify it to answer a single question.) You can then ask something like: “How can we reduce NO₂ levels in urban areas?”. Behind the scenes, the script will embed your question, find relevant nodes (e.g. the NO₂ pollutant node, and mitigation nodes like “Low Emission Zone”, “Electric vehicle adoption”, etc.), retrieve facts or descriptions associated with those nodes, and feed that into the local LLM (the Mistral model) to generate a coherent answer. For example, the system might respond with an answer along the lines of:

“To reduce NO₂ levels in cities, common strategies include promoting public transit and electric vehicles (to cut down vehicle exhaust emissions, a major source of NO₂), establishing Low Emission Zones or Clean Air Zones to restrict high-NO₂ emitters, and improving traffic flow to minimize congestion. Additionally, encouraging urban green spaces can help as vegetation can absorb some pollutants.”

The answer is formulated by the LLM but grounded in the knowledge graph content (e.g. it knows vehicles cause NO₂, and that clean air zones and EVs are mitigation measures, etc., because those relationships exist in the KG). The notebook Explicit_local_RAG_QA.ipynb demonstrates such Q&A usage with example questions. It provides a step-by-step of how the question is processed and how the answer is generated, which can help in understanding and debugging the pipeline.
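
Putting the pieces together, a minimal RAG loop can be sketched as below: embed the question, pull the most similar node names from Neo4j as context, and ask the local Mistral model to answer using only that context. This is an illustrative simplification; neo4j_local_rag.py may retrieve richer context (e.g. relationships and descriptions) and format the prompt differently.

import numpy as np
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

embedder = SentenceTransformer("all-MiniLM-L6-v2")
llm = Llama(model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=4096)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

question = "How can we reduce NO2 levels in urban areas?"
q_vec = embedder.encode(question)

# Retrieve the node names most similar to the question as grounding context
with driver.session() as session:
    rows = list(session.run(
        "MATCH (n) WHERE n.embedding IS NOT NULL RETURN n.name AS name, n.embedding AS emb"))
driver.close()

cosine = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
top = sorted(rows, key=lambda r: cosine(q_vec, np.asarray(r["emb"])), reverse=True)[:5]
context = "; ".join(r["name"] for r in top)

# Ask the local LLM to answer using only the retrieved facts
prompt = (f"[INST] Using only this context from a knowledge graph: {context}\n"
          f"Answer the question: {question} [/INST]")
print(llm(prompt, max_tokens=256)["choices"][0]["text"])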

By following the above steps, you can reproduce the pipeline: ingest new knowledge, build the graph, and query it. The provided examples and notebooks are a good guide – you can start by running them with the included data, and then extend with your own air quality documents to grow the knowledge graph.

Notable Dependencies and Configuration Tips

When working with the Urban Air Quality KG, keep in mind a few important dependencies and configurations:

  • Neo4j Database: As mentioned, Neo4j is central to this project. You should have Neo4j 4.x or 5.x (preferably the latest stable version) installed. The project uses Neo4j’s Bolt protocol to connect, so ensure the Neo4j Python driver (neo4j package or py2neo) is installed (it should be via requirements). After installation, the database should be running and you should know the Bolt URI, username, and password. If the code doesn’t prompt for it, you might have to open the scripts and set the credentials (for simplicity, you could use the default neo4j/neo4j and update the password in the Neo4j browser, then use that in the scripts). The APOC configuration (apoc.export.file.enabled=true) is needed to allow import/export procedures – without it, the import script might fail if it relies on APOC to load JSON or CSV data.
  • SentenceTransformers Model: The embedding generation will likely download a pre-trained model the first time it runs (if not provided in the repo). Ensure you have internet access when running neo4j_embedding_pipeline.py for the first time, so it can fetch the model (unless the model is a local file and the code is pointed to it). Common models for sentence embeddings (like sentence-transformers/all-MiniLM-L6-v2) are a few hundred MB downloads. After download, they cache in ~/.cache/torch/sentence_transformers/. If running in an offline environment, you might need to manually provide the model. Check the script to see which model name is used and adjust if necessary (you can change to any SentenceTransformer model that suits short text).
  • Local LLM and Llama.cpp: The RAG QA uses a local large language model. The project provides a quantized Mistral 7B Instruct model (GGUF format) which is run with Llama.cpp. Make sure you have a compatible setup for running this:
    • The Python package llama-cpp-python (if used) will need to be installed (it is likely in requirements.txt). On first use, it may compile the C++ backend, so you might need a C/C++ compiler installed beforehand (gcc or clang on Linux/macOS, MSVC's cl.exe on Windows). If that is troublesome, another approach is to use the command-line llama.cpp separately, but the provided integration is probably easier.
    • The model file (GGUF) should be present in the models/ directory. If it’s not included due to size, get the Mistral 7B v0.2 instruct model in 4-bit format (the filename suggests Q4_K_M quantization) and place it there. This file can be several GB (quantized maybe ~4GB). Ensure you have disk space and use Git LFS if provided.
    • If you prefer or have a better model, you can use that – just update neo4j_local_rag.py to point to your model’s path. Keep in mind model size vs your RAM; 7B 4-bit should run on CPU with <8GB RAM. Larger models (13B, 70B) may not fit or will be very slow on CPU.

Conclusion

With the Urban Air Quality KG project set up, you have a powerful tool at your disposal: a combination of a knowledge graph and AI that can answer complex questions about air pollution. The repository provides everything from data ingestion to query interfaces. By following this tutorial, you should be able to recreate the knowledge graph on your machine, understand its structure (Neo4j, nodes/relationships), and interact with it either through direct queries or via the intelligent RAG pipeline for natural language Q&A. This not only serves as a practical guide to this specific project but also illustrates how to integrate knowledge graphs with modern AI models – an approach that can be extended to many other domains beyond air quality.

Happy exploring, and may your insights help in devising cleaner air solutions!

Acknowledgement

This work was supported by the Built Environment Fellowship awarded to Dr Xiang Xie from the 1851 Royal Commission.

Contributing

Contributions are welcome! Get in touch if you have any queries or would like to collaborate. Email xiang.xie@ncl.ac.uk