{"id":105,"date":"2025-05-10T16:17:42","date_gmt":"2025-05-10T16:17:42","guid":{"rendered":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/?page_id=105"},"modified":"2025-05-12T15:13:57","modified_gmt":"2025-05-12T15:13:57","slug":"github-urban-air-quality-kg-documentation","status":"publish","type":"page","link":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/github-urban-air-quality-kg-documentation\/","title":{"rendered":"Github: urban-air-quality-kg Documentation"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Project Overview<\/h2>\n\n\n\n<div class=\"wp-block-group is-content-justification-space-between is-nowrap is-layout-flex wp-container-core-group-is-layout-1 wp-block-group-is-layout-flex\">\n<p><strong>GITHUB:<\/strong> <a href=\"https:\/\/github.com\/XiangX91\/urban-air-quality-kg\">https:\/\/github.com\/XiangX91\/urban-air-quality-kg<\/a><\/p>\n\n\n\n<p><strong>Funded by:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"663\" height=\"564\" src=\"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/files\/2025\/05\/logo_1851-3.png\" alt=\"\" class=\"wp-image-125\" style=\"width:auto;height:60px\" \/><\/figure>\n\n\n\n<p><strong>Based in<\/strong>:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"780\" height=\"582\" src=\"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/files\/2025\/05\/newcastle-logo.png\" alt=\"\" class=\"wp-image-126\" style=\"width:auto;height:60px\" \/><\/figure>\n<\/div>\n\n\n\n<p>The <strong>Urban Air Quality KG<\/strong> Github project, as the outcome of the 1851 funded <a href=\"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/selfprogramming-agi\/\" data-type=\"link\" data-id=\"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/selfprogramming-agi\/\">Built Environment fellowship<\/a>, aims to enhance understanding of urban air pollution by building a semantically enriched <strong>knowledge graph<\/strong> (KG) of air quality 
data. The KG represents key concepts like pollutants, pollution sources, meteorological factors, and mitigation measures in a structured graph database (Neo4j). This enables users to explore <strong>urban air quality phenomena<\/strong> (e.g. causes and effects of pollution) and associated mitigation strategies through natural language queries and graph-based reasoning.<\/p>\n\n\n\n<p><strong>Key Components:<\/strong> The system integrates several advanced tools: a Neo4j graph database for storing and querying the knowledge graph, <strong>SentenceTransformers<\/strong> (a type of language model for embeddings) to add semantic vector representations to graph nodes, and a <strong>Retrieval-Augmented Generation (RAG)<\/strong> pipeline that uses a local large language model to answer questions by retrieving relevant knowledge from the graph. In summary, raw textual information about urban air quality is converted into a structured KG, enriched with semantic embeddings, and made queryable via an LLM-based Q&amp;A interface.<\/p>\n\n\n\n<p><strong>Project Goals:<\/strong> By combining knowledge graphs with LLMs, this project allows users to query and interact with air quality data in natural language, obtaining insightful answers that are grounded in the structured knowledge. 
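<\/p>\n\n\n\n<p>As a toy illustration of the multi-hop reasoning the graph enables (the node names and relation types below are invented for illustration; the real KG lives in Neo4j and is queried with Cypher or via embeddings):<\/p>\n\n\n\n

```python
# Toy in-memory stand-in for the knowledge graph.
# Node names and relation types here are hypothetical examples;
# in the project these are Neo4j nodes and relationships.
edges = [
    ("Vehicles", "EMITS", "NO2"),
    ("Vehicles", "EMITS", "PM2.5"),
    ("Low Emission Zone", "MITIGATES", "Vehicles"),
    ("Urban Green Space", "ABSORBS", "NO2"),
]

def mitigations_for(pollutant):
    """Two-hop lookup: measures that absorb the pollutant directly,
    plus measures that target its emission sources."""
    sources = {s for s, rel, o in edges if rel == "EMITS" and o == pollutant}
    direct = {s for s, rel, o in edges if rel == "ABSORBS" and o == pollutant}
    via_source = {s for s, rel, o in edges if rel == "MITIGATES" and o in sources}
    return sorted(direct | via_source)

print(mitigations_for("NO2"))  # ['Low Emission Zone', 'Urban Green Space']
```

\n\n\n\n<p>In the actual system the equivalent traversal is a Cypher pattern match over the Neo4j graph, which scales to many entity types and hops.<\/p>\n\n\n\n<p>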
The goal is to support researchers and policymakers in exploring how different pollutants, environmental conditions, and interventions relate to each other in urban contexts.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1723\" height=\"918\" src=\"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/files\/2025\/05\/visualisation-preview.png\" alt=\"\" class=\"wp-image-106\" \/><\/figure>\n\n\n\n<p class=\"has-medium-font-size\"><strong><em>Please click:<\/em><\/strong> <a href=\"https:\/\/xiangx91.github.io\/urban-air-quality-kg\/visualisation\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/xiangx91.github.io\/urban-air-quality-kg\/visualisation<\/a><\/p>\n\n\n\n<p><em>An interactive visualization of the Urban Air Quality Knowledge Graph. Nodes are colored by category (e.g. pollutants, sources, environmental factors, mitigations) and edges illustrate relationships between them. Such a graph structure enables complex queries and multi-hop reasoning about air quality in cities.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">System Architecture<\/h2>\n\n\n\n<p><strong>Architecture Overview:<\/strong> The Urban Air Quality KG system can be seen as a pipeline with three main stages: <strong>Knowledge Graph construction<\/strong>, <strong>Semantic Embedding &amp; Retrieval<\/strong>, and <strong>LLM-driven Q&amp;A<\/strong>. Each stage corresponds to specific components in the repository:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Knowledge Graph Construction:<\/strong> Unstructured text about air quality (e.g. research reports, policy documents) is processed to extract structured knowledge. This uses Large Language Models (LLMs) guided by a predefined ontology to identify entities (pollutants, sources, etc.) and relationships, outputting JSON data. The JSON is then imported into Neo4j, creating nodes and relationships according to the project\u2019s ontology\/schema. 
Neo4j serves as the KG storage, managing data on pollutants, emission sources, environmental conditions, and mitigation measures. The ontology (defined in a YAML file) ensures that extracted knowledge fits a consistent schema.<\/li>\n\n\n\n<li><strong>Semantic Embedding &amp; Similarity Retrieval:<\/strong> Once the KG is in Neo4j, the project employs <strong>SentenceTransformers<\/strong> to generate vector embeddings for each node (capturing semantic meaning). These embeddings are stored as properties in Neo4j for use in similarity search. When a user poses a query, the system can embed the query in the same vector space and compare it to node embeddings to find relevant nodes (for example, finding which entities are semantically related to the query). The script <code>neo4j_similarity_search.py<\/code> performs this semantic similarity search in the KG, retrieving closely related entities based on the query vector. This allows the system to identify not just direct matches, but conceptually similar information (e.g. a query about \u201csmog\u201d might retrieve nodes related to particulate matter and traffic emissions even if the word \u201csmog\u201d isn\u2019t explicitly a node).<\/li>\n\n\n\n<li><strong>Retrieval-Augmented Generation (RAG) Q&amp;A:<\/strong> The final layer is a local Q&amp;A system that uses a <strong>Retrieval-Augmented Generation<\/strong> approach. Here, a <strong>local LLM<\/strong> (such as a Mistral 7B model, provided in the <code>models\/<\/code> directory) is run via Llama.cpp to generate answers. The pipeline (<code>neo4j_local_rag.py<\/code>) takes a natural language question from the user, embeds it, retrieves relevant knowledge graph nodes (and their associated info) via the similarity search, and then feeds that contextual information along with the question into the LLM. The LLM, armed with factual data from the KG, composes a natural language answer for the user. 
This ensures the answers are grounded in the curated knowledge graph content (improving accuracy and relevance). The RAG approach effectively combines the KG\u2019s structured data with the generative abilities of the LLM.<\/li>\n<\/ul>\n\n\n\n<p>In summary, the architecture marries a <strong>Neo4j knowledge graph<\/strong> (for structured, queryable data) with <strong>embedding-based retrieval<\/strong> and a <strong>local AI model<\/strong> for answering questions. This design enables interactive exploration: users ask questions in plain English, the system finds pertinent facts in the graph, and the LLM formulates a detailed answer based on those facts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Repository Structure and Key Files<\/h2>\n\n\n\n<p>The repository is organized to reflect the stages above, with directories for data, ontology, source code, models, etc. Below is a breakdown of the repository structure (with key files):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>urban-air-quality-kg\/<br>\u251c\u2500\u2500 data\/ <br>\u2502   \u251c\u2500\u2500 example_txt\/          # Example text files for knowledge extraction input<br>\u2502   \u251c\u2500\u2500 baseline_KG\/          # Baseline knowledge graph data (structured JSON files)<br>\u2502   \u251c\u2500\u2500 RAG\/                  # Neo4j database dump of the baseline KG (for quick start with RAG)<br>\u2502   \u2514\u2500\u2500 output\/               # Outputs of knowledge extraction\/merging (JSON files)<br>\u251c\u2500\u2500 images\/                   # Documentation visuals (graphs, diagrams)<br>\u251c\u2500\u2500 models\/                   # Local LLM models for RAG (e.g. 
Mistral 7B in GGUF format)<br>\u251c\u2500\u2500 notebooks\/                # Jupyter notebooks demonstrating usage of various components<br>\u2502   \u251c\u2500\u2500 Knowledge_extraction.ipynb           # How to extract knowledge from text<br>\u2502   \u251c\u2500\u2500 Knowledge_enrich_and_validation.ipynb # Merging new knowledge and validating JSON<br>\u2502   \u251c\u2500\u2500 Embedding_and_similarity_search.ipynb # Generating embeddings &amp; finding similar nodes<br>\u2502   \u2514\u2500\u2500 Explicit_local_RAG_QA.ipynb           # Running the local RAG Q&amp;A pipeline<br>\u251c\u2500\u2500 ontology\/<br>\u2502   \u2514\u2500\u2500 urban_air_quality.yaml # Ontology definitions (schema of entities\/relations)<br>\u251c\u2500\u2500 src\/                      # Python scripts for KG construction and query pipeline<br>\u2502   \u251c\u2500\u2500 extraction.py             # Text-to-JSON knowledge extraction using LLMs<br>\u2502   \u251c\u2500\u2500 jsonvalidator.py          # Validate JSON data against the ontology schema<br>\u2502   \u251c\u2500\u2500 merge_knowledge.py        # Merge multiple JSON knowledge files, resolve duplicates<br>\u2502   \u251c\u2500\u2500 neo4j_local_import.py     # Import JSON data into Neo4j (creating nodes\/relationships)<br>\u2502   \u251c\u2500\u2500 neo4j_embedding_pipeline.py # Generate and store node embeddings (SentenceTransformers)<br>\u2502   \u251c\u2500\u2500 neo4j_similarity_search.py  # Perform semantic similarity search in the KG<br>\u2502   \u2514\u2500\u2500 neo4j_local_rag.py          # Run the local RAG Q&amp;A (retrieval + LLM answer) <br>\u251c\u2500\u2500 visualisation\/            # Tools or scripts for visualizing the KG (if provided)<br>\u251c\u2500\u2500 requirements.txt          # Python dependencies for the project<br>\u2514\u2500\u2500 README.md                 # Project documentation and instructions<br><\/code><\/pre>\n\n\n\n<p>Each of the key Python scripts in <strong><code>src\/<\/code><\/strong> 
corresponds to a specific function in the pipeline (as noted by in-line comments above). For example, <code>extraction.py<\/code> handles extracting structured facts from raw text using an LLM (guided by the ontology), outputting a JSON file with entities and relationships. The <code>neo4j_local_import.py<\/code> script then reads such JSON and imports it into the Neo4j graph, mapping the data into the pre-defined graph schema (creating nodes labeled as Pollutant, Source, etc., and relationships among them). The <code>neo4j_embedding_pipeline.py<\/code> computes embeddings for each node via SentenceTransformers and stores these vectors in Neo4j, while <code>neo4j_similarity_search.py<\/code> can query those embeddings to find related nodes. Finally, <code>neo4j_local_rag.py<\/code> ties it all together to enable question-answering over the graph with a local LLM.<\/p>\n\n\n\n<p>The <strong>Jupyter notebooks<\/strong> in <code>notebooks\/<\/code> serve as tutorials\/demos for each stage of usage. For instance, <code>Knowledge_extraction.ipynb<\/code> demonstrates using the extraction script on the files in <code>data\/example_txt\/<\/code> to produce knowledge JSON, and <code>Explicit_local_RAG_QA.ipynb<\/code> shows how to ask questions to the system and get answers via the RAG pipeline. These notebooks are a great starting point to interact with the system step-by-step in an interactive environment.<\/p>\n\n\n\n<p>The <strong>ontology YAML<\/strong> (<code>urban_air_quality.yaml<\/code>) defines the schema of the knowledge graph \u2013 i.e., what entity types exist (pollutants, environmental factors, etc.), what relationships link them, and any attributes. This ontology is used by the extraction and validation steps to ensure consistency (e.g. 
only valid entity types are created, required fields are present, etc.).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Installation and Setup<\/h2>\n\n\n\n<p>To set up the project locally, follow these steps (the instructions assume a Unix-like environment; Windows users can adjust the commands accordingly):<\/p>\n\n\n\n<p><strong>1. Clone the Repository:<\/strong> Start by downloading the code from GitHub. In a terminal, run:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>git clone https:\/\/github.com\/XiangX91\/urban-air-quality-kg.git \ncd urban-air-quality-kg<\/code><\/pre>\n\n\n\n<p><strong>2. Create a Virtual Environment:<\/strong> It\u2019s recommended to use a Python virtual environment for the project. You can create one using venv or Conda. For example, with Python\u2019s built-in venv:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python3 -m venv venv<br>source venv\/bin\/activate   # (On Windows: venv\\Scripts\\activate)<\/code><\/pre>\n\n\n\n<p>This will activate a virtual environment named \u201cvenv\u201d for the project.<\/p>\n\n\n\n<p><strong>3. Install Python Dependencies:<\/strong> Once the virtual env is active, install all required Python libraries by running:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>pip install -r requirements.txt<\/code><\/pre>\n\n\n\n<p>This will download and install all packages listed in <code>requirements.txt<\/code>. These likely include libraries such as <strong>Neo4j Python driver<\/strong> (or Py2Neo) for graph database access, <strong>SentenceTransformers<\/strong> for embedding generation, <strong>pandas<\/strong> for data handling, and possibly <strong>llama-cpp-python<\/strong> or similar to interface with the local LLM model. (Ensure you have an appropriate compiler setup if llama-cpp-python is used, as it may need to compile the LLM backend.)<\/p>\n\n\n\n<p><strong>4. Install Neo4j:<\/strong> Neo4j is the graph database used to store the knowledge graph. 
You need Neo4j (Community or Enterprise Edition) installed on your system; download it from the official Neo4j site and follow the installation instructions. After installation, <strong>start the Neo4j server<\/strong> and set a password for the <code>neo4j<\/code> user (the default username is <code>neo4j<\/code>; you will be prompted to change the initial password on first login). Make sure Neo4j is running locally (by default on bolt:\/\/localhost:7687) so that the scripts can connect to it.<\/p>\n\n\n\n<p><strong>5. Configure APOC (if not already enabled):<\/strong> APOC is a library of procedures for Neo4j that this project uses (likely for importing data from JSON or performing graph algorithms). In your Neo4j installation, locate the <code>apoc.conf<\/code> or <code>neo4j.conf<\/code> file (in the <code>conf\/<\/code> directory of Neo4j). Open this config file and ensure the following line is present (add it if not):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>apoc.export.file.enabled=true<\/code><\/pre>\n\n\n\n<p>This setting allows APOC to export to files, which may be needed for the KG export functionality. After adding the line, restart the Neo4j server to apply the changes. (Note: some Neo4j versions bundle APOC by default but disable file access for security; this line explicitly enables it. If the import path relies on <code>apoc.load.json<\/code>, you may also need <code>apoc.import.file.enabled=true<\/code>, and in some cases <code>dbms.security.procedures.unrestricted=apoc.*<\/code> to allow all APOC procedures.)<\/p>\n\n\n\n<p><strong>6. (Optional) Set up the Local LLM Model:<\/strong> The project includes a directory <code>models\/<\/code> which in the repository listing shows a file named <code>mistral-7b-instruct-v0.2.Q4_K_M.gguf<\/code>. This appears to be a quantized, GGUF-format build of the Mistral 7B Instruct LLM. 
If this file is not present (it may be large and possibly handled via Git LFS or a separate download), you will need to obtain it. Make sure the model file is placed in <code>models\/<\/code> and that the path or name is correctly referenced in the code. This model will be used by the Llama.cpp backend to generate answers for the RAG pipeline. If you prefer, you could substitute your own GGUF model, but ensure it\u2019s an instruct-tuned model (so it responds well to questions) and update the code accordingly if needed. No additional installation is required for the model aside from having the file; the <code>llama-cpp<\/code> library will load it at runtime.<\/p>\n\n\n\n<p>After completing the above steps, you should have all components in place: Python environment ready, dependencies installed, Neo4j running with the required configuration, and the local LLM model available. You\u2019re now ready to build and explore the urban air quality knowledge graph!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Running the Project Locally (Step-by-Step Usage)<\/h2>\n\n\n\n<p>Once setup is complete, you can proceed to construct the knowledge graph and run queries. Below is a typical workflow with the corresponding scripts\/notebooks:<\/p>\n\n\n\n<p><strong>1. Knowledge Extraction from Text:<\/strong> Start by extracting knowledge from unstructured text files. The repository provides example text files under <code>data\/example_txt\/<\/code> that describe various air quality facts and measures. You can run the extraction in two ways:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Via script:<\/strong> Execute the <code>extraction.py<\/code> script on a folder of text files. By default, it may be configured to read from <code>data\/example_txt\/<\/code> and output JSON to <code>data\/output\/<\/code>. 
For example, run <code>python src\/extraction.py<\/code> (check the script for any arguments or configuration; it may output one combined JSON file or one per text file).<\/li>\n\n\n\n<li><strong>Via notebook:<\/strong> Open and run the steps in <strong><code>Knowledge_extraction.ipynb<\/code><\/strong>, which loads an LLM (possibly using OpenAI or a local model) to process each text and extract structured data according to the ontology. This extraction uses predefined prompts to identify entities like pollutants and their relationships in the text. After running this, you should obtain one or more JSON files in <code>data\/output\/<\/code> containing the extracted knowledge. Each JSON entry will likely have fields identifying an entity (with a type and name) and how it connects to others (for example, an entry might state that the pollutant \u201cNO\u2082\u201d is emitted by the source \u201cVehicles\u201d under certain conditions).<\/li>\n<\/ul>\n\n\n\n<p><strong>2. Validate and Merge Knowledge (optional):<\/strong> If multiple JSON files are produced from different sources or texts, you might want to merge them into a single knowledge base and eliminate duplicates. The script <code>merge_knowledge.py<\/code> can merge new knowledge into an existing JSON dataset, using fuzzy matching to avoid duplicating the same entity. Likewise, <code>jsonvalidator.py<\/code> can be used to ensure the JSON conforms to the ontology schema (e.g., all required fields are present and entity types are valid). For example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/jsonvalidator.py data\/output\/extracted_knowledge.json<\/code><\/pre>\n\n\n\n<p>This would print any validation errors against the ontology (defined in <code>urban_air_quality.yaml<\/code>). It\u2019s good practice to run this before import. 
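<\/p>\n\n\n\n<p>Conceptually, the validation step boils down to checking each record against the ontology\u2019s allowed types and required fields. A stripped-down sketch (the mini-schema below is invented; the real rules live in <code>ontology\/urban_air_quality.yaml<\/code>):<\/p>\n\n\n\n

```python
# Hypothetical mini-ontology for illustration; the project's real schema
# in ontology/urban_air_quality.yaml is richer than this.
ONTOLOGY = {
    "entity_types": {"Pollutant", "Source", "Mitigation"},
    "required_fields": {"name", "type"},
}

def validate(record):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    missing = ONTOLOGY["required_fields"] - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if record.get("type") not in ONTOLOGY["entity_types"]:
        errors.append(f"unknown entity type: {record.get('type')!r}")
    return errors

print(validate({"name": "NO2", "type": "Pollutant"}))  # []
print(validate({"name": "Smog"}))  # reports the missing field and unknown type
```

\n\n\n\n<p>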
If you have multiple JSON fragments (e.g., one per input document), you can merge them:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/merge_knowledge.py data\/output\/knowledge1.json data\/output\/knowledge2.json -o data\/output\/combined.json<\/code><\/pre>\n\n\n\n<p>(Assuming the script takes input files and an output path via arguments \u2013 refer to its help or the notebook for exact usage.)<\/p>\n\n\n\n<p><strong>3. Import into Neo4j:<\/strong> With a consolidated and validated JSON of knowledge, the next step is to load it into the Neo4j graph database. Use the <code>neo4j_local_import.py<\/code> script for this. You may need to configure connection details (host, user, password) either inside this script or via environment variables. By default, it might assume bolt:\/\/localhost:7687 and user <code>neo4j<\/code>. Run the import:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/neo4j_local_import.py data\/output\/combined.json<\/code><\/pre>\n\n\n\n<p>This script will connect to Neo4j and create nodes and relationships as defined in the JSON. For example, a pollutant entity in JSON will become a node labeled &#8220;Pollutant&#8221; (or similar) in Neo4j, a mitigation measure becomes a node of type &#8220;Mitigation&#8221;, etc., and relations like &#8220;affects&#8221; or &#8220;emitted_by&#8221; will become relationships between nodes. After running this, you will have a <strong>populated knowledge graph<\/strong> in Neo4j. You can verify by opening Neo4j Browser (at http:\/\/localhost:7474) and running a simple Cypher query like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>MATCH (n) RETURN labels(n), count(*);<\/code><\/pre>\n\n\n\n<p>to see the count of nodes per label, or<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>MATCH p=(n)-[r]-&gt;(m) RETURN p LIMIT 20;<\/code><\/pre>\n\n\n\n<p>to see a sample of the graph relationships.<\/p>\n\n\n\n<p><strong>4. 
Generate Semantic Embeddings:<\/strong> Once the graph is populated, generate embeddings for the nodes to enable semantic similarity queries. Run <code>neo4j_embedding_pipeline.py<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/neo4j_embedding_pipeline.py<\/code><\/pre>\n\n\n\n<p>This will likely connect to Neo4j, retrieve all nodes (or all node IDs and their text representation), compute an embedding for each using a SentenceTransformers model, and then store the embedding vector back onto each node in Neo4j (possibly as a property, e.g., <code>embedding<\/code>). The SentenceTransformers model could be a pre-trained model (e.g. <code>all-MiniLM-L6-v2<\/code> or another suitable model for short phrases). The notebook <code>Embedding_and_similarity_search.ipynb<\/code> demonstrates this process as well. After running this, each node in the KG has a numerical vector representation capturing its meaning.<\/p>\n\n\n\n<p><strong>5. Perform Query (Semantic Similarity Search):<\/strong> You can now query the knowledge graph for relevant information. One way is to use the semantic similarity search script directly. For example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/neo4j_similarity_search.py \"What are sources of PM2.5?\"<\/code><\/pre>\n\n\n\n<p>The script will take the query string, embed it in the same vector space, and find the nearest neighbor nodes in the graph by cosine similarity. It might then print out the top matching nodes and maybe their relationships. In this example, a query about \u201csources of PM2.5\u201d might retrieve nodes like <strong>\u201cVehicle emissions (PM2.5 particles)\u201d<\/strong> or <strong>\u201cConstruction dust (PM2.5)\u201d<\/strong> if those are in the graph. Essentially, this gives you a way to discover which parts of the KG are most related to your question. 
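<\/p>\n\n\n\n<p>The nearest-neighbour step can be pictured as follows (toy three-dimensional vectors stand in for real SentenceTransformer embeddings, which typically have several hundred dimensions; the node names are hypothetical):<\/p>\n\n\n\n

```python
import math

# Toy embeddings; in the project these vectors are computed by a
# SentenceTransformers model and stored as a property on each Neo4j node.
node_vectors = {
    "Vehicle emissions": [0.9, 0.1, 0.0],
    "Construction dust": [0.8, 0.3, 0.1],
    "Urban green space": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_matches(query_vec, k=2):
    """Rank nodes by cosine similarity to an already-embedded query."""
    ranked = sorted(node_vectors,
                    key=lambda n: cosine(query_vec, node_vectors[n]),
                    reverse=True)
    return ranked[:k]

# A query about particulate sources embeds near the first two nodes:
print(top_matches([1.0, 0.2, 0.0]))  # ['Vehicle emissions', 'Construction dust']
```

\n\n\n\n<p>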
You can also do this interactively in the provided notebook, which will show how to formulate queries and interpret the results.<\/p>\n\n\n\n<p><strong>6. Ask Questions via RAG (Natural Language Q&amp;A):<\/strong> The highlight of the project is the ability to ask complex questions in natural language and get answers backed by the KG. Use the <code>neo4j_local_rag.py<\/code> script to do this. For instance:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>python src\/neo4j_local_rag.py<\/code><\/pre>\n\n\n\n<p>(It might drop you into an interactive prompt or you might modify it to answer a single question.) You can then ask something like: <strong>\u201cHow can we reduce NO\u2082 levels in urban areas?\u201d<\/strong>. Behind the scenes, the script will embed your question, find relevant nodes (e.g. the NO\u2082 pollutant node, and mitigation nodes like \u201cLow Emission Zone\u201d, \u201cElectric vehicle adoption\u201d, etc.), retrieve facts or descriptions associated with those nodes, and feed that into the local LLM (the Mistral model) to generate a coherent answer. For example, the system might respond with an answer along the lines of:<\/p>\n\n\n\n<p><em>\u201cTo reduce NO\u2082 levels in cities, common strategies include promoting public transit and electric vehicles (to cut down vehicle exhaust emissions, a major source of NO\u2082), establishing Low Emission Zones or Clean Air Zones to restrict high-NO\u2082 emitters, and improving traffic flow to minimize congestion. Additionally, encouraging urban green spaces can help as vegetation can absorb some pollutants.\u201d<\/em><\/p>\n\n\n\n<p>The answer is formulated by the LLM but grounded in the knowledge graph content (e.g. it knows vehicles cause NO\u2082, and that clean air zones and EVs are mitigation measures, etc., because those relationships exist in the KG). The notebook <code>Explicit_local_RAG_QA.ipynb<\/code> demonstrates such Q&amp;A usage with example questions. 
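<\/p>\n\n\n\n<p>The \u201cfeed the context and question into the LLM\u201d step is essentially prompt assembly. A simplified sketch (the actual template and variable names in <code>neo4j_local_rag.py<\/code> may differ):<\/p>\n\n\n\n

```python
def build_prompt(question, facts):
    """Assemble a grounded prompt: retrieved KG facts first, then the question."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer the question using only the facts below.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

facts = [
    "NO2 is emitted by road vehicles.",
    "Low Emission Zones restrict high-emission vehicles.",
]
prompt = build_prompt("How can we reduce NO2 levels in urban areas?", facts)

# The prompt string would then be handed to the local model, e.g. with
# llama-cpp-python (model path as shipped in the repo's models/ directory):
#   llm = Llama(model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
#   answer = llm(prompt, max_tokens=256)
```

\n\n\n\n<p>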
That notebook walks step by step through how the question is processed and how the answer is generated, which helps in understanding and debugging the pipeline.<\/p>\n\n\n\n<p>By following the above steps, you can reproduce the pipeline: ingest new knowledge, build the graph, and query it. The provided examples and notebooks are a good guide \u2013 you can start by running them with the included data, and then extend with your own air quality documents to grow the knowledge graph.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Notable Dependencies and Configuration Tips<\/h2>\n\n\n\n<p>When working with the Urban Air Quality KG, keep in mind a few important dependencies and configurations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Neo4j Database:<\/strong> As mentioned, Neo4j is central to this project. You should have Neo4j 4.x or 5.x (preferably the latest stable version) installed. The project uses Neo4j\u2019s Bolt protocol to connect, so ensure the Neo4j Python driver (<code>neo4j<\/code> package or py2neo) is installed (it should be via requirements). After installation, the database should be running and you should know the <strong>Bolt URI, username, and password<\/strong>. If the code doesn\u2019t prompt for them, you may have to open the scripts and set the credentials (for simplicity, you could log in with the default neo4j\/neo4j credentials, update the password in the Neo4j Browser, and then use that password in the scripts). The APOC configuration (<code>apoc.export.file.enabled=true<\/code>) is needed to allow import\/export procedures \u2013 without it, the import script might fail if it relies on APOC to load JSON or CSV data.<\/li>\n\n\n\n<li><strong>SentenceTransformers Model:<\/strong> The embedding generation will likely download a pre-trained model the first time it runs (if not provided in the repo). 
Ensure you have internet access when running <code>neo4j_embedding_pipeline.py<\/code> for the first time, so it can fetch the model (unless the model is a local file and the code is pointed to it). Common models for sentence embeddings (like <code>sentence-transformers\/all-MiniLM-L6-v2<\/code>) are a few hundred MB downloads. After download, they cache in <code>~\/.cache\/torch\/sentence_transformers\/<\/code>. If running in an offline environment, you might need to manually provide the model. Check the script to see which model name is used and adjust if necessary (you can change to any SentenceTransformer model that suits short text).<\/li>\n\n\n\n<li><strong>Local LLM and Llama.cpp:<\/strong> The RAG QA uses a local large language model. The project provides a quantized <strong>Mistral 7B Instruct<\/strong> model (GGUF format) which is run with Llama.cpp. Make sure you have a compatible setup for running this:\n<ul class=\"wp-block-list\">\n<li>The Python package <code>llama-cpp-python<\/code> (if used) will need to be installed (it likely is in requirements). On first use, it may compile the C++ backend. You might need to install a C compiler (like <code>gcc<\/code> or <code>cl.exe<\/code> on Windows, or <code>clang<\/code> on Mac) beforehand. If that\u2019s troublesome, another approach could be using the command-line llama.cpp separately, but the provided integration is probably easier.<\/li>\n\n\n\n<li>The model file (GGUF) should be present in the <code>models\/<\/code> directory. If it\u2019s not included due to size, get the <strong>Mistral 7B v0.2 instruct<\/strong> model in 4-bit format (the filename suggests Q4_K_M quantization) and place it there. This file can be several GB (quantized maybe ~4GB). Ensure you have disk space and use Git LFS if provided.<\/li>\n\n\n\n<li>If you prefer or have a better model, you can use that \u2013 just update <code>neo4j_local_rag.py<\/code> to point to your model\u2019s path. 
Keep in mind model size against your available RAM; a 7B model at 4-bit quantization should run on CPU with &lt;8GB RAM. Larger models (13B, 70B) may not fit or will be very slow on CPU.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>With the Urban Air Quality KG project set up, you have a powerful tool at your disposal: a combination of a knowledge graph and AI that can answer complex questions about air pollution. The repository provides everything from data ingestion to query interfaces. By following this tutorial, you should be able to recreate the knowledge graph on your machine, understand its structure (Neo4j, nodes\/relationships), and interact with it either through direct queries or via the intelligent RAG pipeline for natural language Q&amp;A. This not only serves as a practical guide to this specific project but also illustrates how to integrate knowledge graphs with modern AI models \u2013 an approach that can be extended to many other domains beyond air quality.<\/p>\n\n\n\n<p>Happy exploring, and may your insights help in devising cleaner air solutions!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Acknowledgement<\/h2>\n\n\n\n<p>This work was supported by the Built Environment Fellowship awarded to Dr Xiang Xie by the 1851 Royal Commission.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Contributing<\/h2>\n\n\n\n<p class=\"has-small-font-size\"><mark class=\"has-inline-color has-blue-color\"><strong>Contributions are welcome! <em>Get in touch if you have any queries or would like to collaborate. Email xiang.xie@ncl.ac.uk<\/em><\/strong><\/mark><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Project Overview GITHUB: https:\/\/www.github.com\/XiangX91\/urban-air-quality-kg Funded by: Based in: The Urban Air Quality KG GitHub project, an outcome of the 1851-funded Built Environment Fellowship, aims to enhance understanding of urban air pollution by building a semantically enriched knowledge graph (KG) of air quality data. 
The KG represents key concepts like pollutants, pollution sources, meteorological [&hellip;]<\/p>\n","protected":false},"author":11184,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-105","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/pages\/105","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/users\/11184"}],"replies":[{"embeddable":true,"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/comments?post=105"}],"version-history":[{"count":16,"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/pages\/105\/revisions"}],"predecessor-version":[{"id":142,"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/pages\/105\/revisions\/142"}],"wp:attachment":[{"href":"https:\/\/www.staff.ncl.ac.uk\/xiangxie\/wp-json\/wp\/v2\/media?parent=105"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}