Model Architecture

In the first part of this series, I introduced the idea of federated language models, where we take advantage of a capable cloud-based large language model (LLM) and a small language model (SLM) running at the edge.

To recap, an agent sends the user query (1) along with the available tools (2) to a cloud-based LLM to map the prompt into a set of functions and arguments (3). It then executes the functions to generate appropriate context from a database (4a). If there are no tools involved, it leverages the simple RAG mechanism to perform a semantic search in a vector database (4b). The context gathered from either of the sources is then sent (5) to an edge-based SLM to generate a factually correct response. The response (6) generated by the SLM is sent as the final answer (7) to the user query.

This tutorial focuses on the practical implementation of a federated LM architecture based on the below components:

OpenAI GPT-4 Omni as an LLM.
Microsoft Phi-3 as a SLM.
Ollama as the inference engine for Phi-3.
Nvidia Jetson AGX Orin as the edge device to run Ollama.
MySQL database and a Flask API server running locally.
Chroma as the local vector store for semantic search.

Refer to the tutorial on setting up Ollama on Jetson Orin and implementing the RAG agent for additional context and details.

Start by cloning the Git repository https://github.com/janakiramm/federated-llm.git, which has the scripts, data, and Jupyter Notebooks. This tutorial assumes that you have access to OpenAI and an Nvidia Jetson Orin device. You can also run Ollama on your local workstation and change the IP address in the code.

Step 1: Run Ollama on Jetson Orin

SSH into Jetson Orin and run the commands mentioned in the file, setup_ollama.sh.

Verify that you are able to connect to and access the model by running the below command on your workstation, where you run Jupyter Notebook.

123456789101112131415

curl http://localhost:11434/v1/chat/completions \ -H “Content-Type: application/json” \ -d ‘{ “model”: “phi3:mini”, “messages”: [ { “role”: “system”, “content”: “You are a helpful assistant.” }, { “role”: “user”, “content”: “What is the capital of France?” } ] }’

Replace localhost with the IP address of your Jetson Orin device. If everything goes well, you should be able to get a response from the model.

Congratulations, your edge inference server is now ready!

Step 2: Run MySQL DB and Flask API Server

The next step is to run the API server, which exposes a set of functions that will be mapped to the prompt. To make this simple, I built a Docker Compose file that exposes the REST API endpoint for the MySQL database backend.

Switch to the api folder and run the below command:

1	start_api_server.sh

Check if two containers are running on your workstation with the docker ps command.

If you run the command curl "http://localhost:5000/api/sales/trends?start_date=2023-05-01&end_date=2023-05-30", you should see the response from the API.

Step 3: Index the PDF and Ingest the Embeddings in Chroma DB

With the API in place, it’s time to generate the embeddings for the datasheet PDF and store them in the vector database.

For this, run the Index_Datasheet.ipynb Jupyter Notebook, which is available in the notebooks folder.

A simple search retrieves the phrases that semantically match the query.

Step 4: Run the Federated LM Agent

The Jupyter Notebook, Federated-LM.ipynb, has the complete code to implement the logic. Let’s understand the key sections of the code.

We will import the API client that exposes the tools to the LLM.

We will import the API client that exposes the tools to the LLM.

First, we initialize two LLMs: GPT-4o (Cloud) and Phi3:mini (Edge)

After creating a Python list with the signatures of the tools, we will let GPT-4o map the prompt to appropriate functions and their arguments.

For example, passing the prompt What was the top selling product in Q2 based on revenue? to GPT-4o results in the model responding with the function get_top_selling_products and the corresponding arguments. Notice that a capable model is able to translate Q2 into date range, starting from April 1st to June 30th. This is exactly the power we want to exploit from the cloud-based LLM.

Once we enumerate the tool(s) suggested by GPT-4o, we execute, collect, and aggregate the output to form the context.

If the prompt doesn’t translate to tools, we attempt to use the retriever based on the semantic search from the vector database.

To avoid sending sensitive context to the cloud-based LLM, we leverage the model (edge_llm) at the edge for generation.

Finally, we implement the agent that orchestrates the calls between the cloud-based LLM and the edge-based LLM. It checks if the tools list is empty and then moves to the retriever to generate the context. If both are empty, the agent responds with the phrase “I don’t know.”

Below is the response from the agent based on tools, retriever, and unknown context.

To summarize, we implemented a federated LLM approach where an agent sends the user query along with available tools to a cloud-based LLM, which maps the prompt into functions and arguments. The agent executes these functions to generate context from a database. If no tools are involved, a simple RAG mechanism is used for semantic search in a vector database. The context is then sent to an edge-based SLM to generate a factually correct response, which is provided as the final answer to the user query.

Tag: Model Architecture

How To Run an Agent on Federated Language Model Architecture

Step 1: Run Ollama on Jetson Orin

Step 2: Run MySQL DB and Flask API Server

Step 3: Index the PDF and Ingest the Embeddings in Chroma DB

Step 4: Run the Federated LM Agent