After chatting locally with an LLM via Ollama, is it possible to make the local LLM remember what has been talked about?
Yes, it is possible to make a locally running LLM (via Ollama) remember what has been talked about during a conversation. This concept is often referred to as conversation history, context retention, or memory management. By maintaining a record of previous interactions, the model can generate responses that are contextually aware and consistent with earlier parts of the conversation.
Here’s how you can implement this functionality:
1. Understanding Context Retention in LLMs
LLMs like those run via Ollama don’t inherently "remember" conversations across sessions because they process each input independently. However, you can simulate memory by:
- Storing Conversation History: Save the conversation history (user inputs and model responses) in a structured format.
- Appending History to Prompts: When sending a new prompt to the model, include the conversation history as part of the input. This allows the model to consider past interactions when generating responses, as illustrated by the sketch below.
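For instance, if the stored history is joined into a single prompt, the text actually sent to the model might look like the following (a purely illustrative sketch; the names and wording are made up):

User: Hi, my name is Sam.
Model: Nice to meet you, Sam! How can I help?
User: What's my name?
Model:

Because the earlier exchange is part of the input, the model can answer "Sam" even though it has no built-in memory between requests.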
2. Implementing Memory in Ollama
Step 1: Store Conversation History
You can store the conversation history in a simple data structure like a list or dictionary. For example:
- Use a Python list to store alternating user inputs and model responses.
- Alternatively, use a database or file system for persistent storage if you want the conversation to persist across sessions.
Example: Storing History in Python
conversation_history = []  # will hold alternating "User: ..." and "Model: ..." strings
Step 2: Append History to Prompts
When interacting with the model, combine the conversation history (including the new user input) into a single prompt and send that to the model.
Example: Sending History with Each Prompt
import requests

# Initialize an empty conversation history
conversation_history = []

def chat_with_model(prompt):
    global conversation_history
    # Append the user's input to the history
    conversation_history.append(f"User: {prompt}")
    # Combine the history into a single string
    full_prompt = "\n".join(conversation_history) + "\nModel:"
    # Send the full prompt to the model ("stream": False returns a single JSON object instead of a stream)
    response = requests.post('http://localhost:11434/api/generate', json={
        "model": "llama2",
        "prompt": full_prompt,
        "stream": False
    })
    # Extract the model's response
    model_response = response.json()["response"]
    # Append the model's response to the history
    conversation_history.append(f"Model: {model_response}")
    return model_response

# Example usage
while True:
    user_input = input("You: ")
    if user_input.lower() in ["exit", "quit"]:
        break
    response = chat_with_model(user_input)
    print(f"Model: {response}")
In this example:
- The conversation_history list stores all user inputs and model responses.
- Each time the user sends a new prompt, the entire conversation history is included in the prompt sent to the model.
- The model generates a response based on the full context of the conversation.
3. Managing Context Length
LLMs have a maximum context length (e.g., 2048 tokens for some models). If the conversation history becomes too long, it may exceed this limit, causing errors or truncated responses. To handle this, you can:
- Truncate Older Messages: Remove older messages from the conversation history to keep it within the token limit.
- Summarize History: Periodically summarize the conversation history and replace it with a concise summary (a sketch of this approach follows the truncation example below).
Example: Truncating History
MAX_HISTORY_LENGTH = 10  # Maximum number of messages to keep

def chat_with_model(prompt):
    global conversation_history
    # Append the user's input to the history
    conversation_history.append(f"User: {prompt}")
    # Truncate the history if it exceeds the maximum length
    if len(conversation_history) > MAX_HISTORY_LENGTH:
        conversation_history = conversation_history[-MAX_HISTORY_LENGTH:]
    # Combine the history into a single string
    full_prompt = "\n".join(conversation_history) + "\nModel:"
    # Send the full prompt to the model
    response = requests.post('http://localhost:11434/api/generate', json={
        "model": "llama2",
        "prompt": full_prompt,
        "stream": False
    })
    # Extract the model's response
    model_response = response.json()["response"]
    # Append the model's response to the history
    conversation_history.append(f"Model: {model_response}")
    return model_response
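The summarization option can use the model itself to compress older turns. Below is a minimal sketch of that idea, reusing the same /api/generate endpoint; the SUMMARIZE_THRESHOLD constant and the helper names are illustrative choices for this example, not part of Ollama's API:

SUMMARIZE_THRESHOLD = 20  # Illustrative: compact the history once it grows past this many messages

def summarize_history(messages):
    # Ask the model to condense the given messages into a short summary
    summary_prompt = (
        "Summarize the following conversation in a few sentences, "
        "keeping any facts that may matter later:\n\n" + "\n".join(messages)
    )
    response = requests.post('http://localhost:11434/api/generate', json={
        "model": "llama2",
        "prompt": summary_prompt,
        "stream": False
    })
    summary = response.json()["response"]
    # Replace the old messages with a single summary entry
    return [f"Summary of earlier conversation: {summary}"]

def maybe_compact_history():
    global conversation_history
    if len(conversation_history) > SUMMARIZE_THRESHOLD:
        # Keep the most recent turns verbatim and summarize everything older
        older, recent = conversation_history[:-6], conversation_history[-6:]
        conversation_history = summarize_history(older) + recent

Calling maybe_compact_history() before building full_prompt keeps recent turns intact while older ones are collapsed into a summary, at the cost of an extra model call whenever the threshold is crossed.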
4. Persistent Storage for Long-Term Memory
If you want the model to remember conversations across sessions (e.g., after restarting your computer), you can store the conversation history in a file or a database (a small database sketch follows the file-based example below).
Example: Saving History to a File
import json
import requests

HISTORY_FILE = "conversation_history.json"

# Load conversation history from file
def load_history():
    try:
        with open(HISTORY_FILE, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        return []

# Save conversation history to file
def save_history(history):
    with open(HISTORY_FILE, "w") as f:
        json.dump(history, f)

# Load history at the start
conversation_history = load_history()

def chat_with_model(prompt):
    global conversation_history
    # Append the user's input to the history
    conversation_history.append(f"User: {prompt}")
    # Combine the history into a single string
    full_prompt = "\n".join(conversation_history) + "\nModel:"
    # Send the full prompt to the model
    response = requests.post('http://localhost:11434/api/generate', json={
        "model": "llama2",
        "prompt": full_prompt,
        "stream": False
    })
    # Extract the model's response
    model_response = response.json()["response"]
    # Append the model's response to the history
    conversation_history.append(f"Model: {model_response}")
    # Save the updated history to file
    save_history(conversation_history)
    return model_response
In this example:
- The conversation history is saved to a JSON file (conversation_history.json) after each interaction.
- When the script starts, it loads the history from the file, allowing the model to "remember" past conversations.
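If you prefer a database over a flat file, the sketch below uses Python's built-in sqlite3 module; the database file name, table name, and schema are illustrative choices for this example:

import sqlite3

DB_FILE = "conversation_history.db"

def init_db():
    # Create the database (and table) if it does not exist yet
    conn = sqlite3.connect(DB_FILE)
    conn.execute("CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, role TEXT, content TEXT)")
    conn.commit()
    return conn

def save_message(conn, role, content):
    # Persist a single message, e.g. save_message(conn, "User", "Hello")
    conn.execute("INSERT INTO messages (role, content) VALUES (?, ?)", (role, content))
    conn.commit()

def load_messages(conn):
    # Rebuild the history strings in the order they were stored
    rows = conn.execute("SELECT role, content FROM messages ORDER BY id").fetchall()
    return [f"{role}: {content}" for role, content in rows]

You would call save_message() for both the user input and the model response inside chat_with_model(), and build conversation_history from load_messages() at startup.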
5. Advanced Memory Techniques
For more advanced use cases, you can implement techniques like:
- Vector Databases: Store conversation history in a vector database (e.g., Pinecone, Weaviate) and retrieve relevant parts of the conversation using semantic search (a minimal sketch follows this list).
- External Memory Systems: Use tools like LangChain or AutoGPT to manage memory and context more effectively.
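As a rough illustration of the vector-database idea, the sketch below keeps embeddings in a plain Python list and ranks them with cosine similarity instead of using a real vector database. It assumes an embedding model such as nomic-embed-text has been pulled into Ollama and uses the /api/embeddings endpoint; the helper names are made up for this example:

import math
import requests

memory_store = []  # list of (text, embedding) pairs; stands in for a real vector database

def embed(text):
    # Ask Ollama for an embedding (assumes an embedding model such as nomic-embed-text is available)
    response = requests.post('http://localhost:11434/api/embeddings', json={
        "model": "nomic-embed-text",
        "prompt": text
    })
    return response.json()["embedding"]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def remember(text):
    # Store a message along with its embedding
    memory_store.append((text, embed(text)))

def recall(query, top_k=3):
    # Return the stored messages most similar to the query
    query_embedding = embed(query)
    scored = sorted(memory_store, key=lambda item: cosine_similarity(query_embedding, item[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

Instead of resending the whole history, you would call remember() after each turn and prepend the messages returned by recall(user_input) to the prompt, so only the most relevant past exchanges are included.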
6. Limitations
- Token Limits: Most LLMs have a maximum token limit for input. If the conversation history becomes too long, it will need to be truncated or summarized (a rough character-budget sketch is shown after this list).
- Privacy: If you’re storing conversation history locally or in a database, ensure that sensitive information is handled securely.
- Performance: Including long conversation histories in each prompt can increase processing time and resource usage.
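Exact token counts depend on the model's tokenizer, so a common rough workaround is to budget by characters (roughly 4 characters per token is a frequently used rule of thumb for English text). The constants below are illustrative:

MAX_PROMPT_CHARS = 8000  # Illustrative budget: about 2000 tokens at ~4 characters per token

def trim_history_to_budget(history, budget=MAX_PROMPT_CHARS):
    # Drop the oldest messages until the joined prompt fits within the character budget
    trimmed = list(history)
    while trimmed and len("\n".join(trimmed)) > budget:
        trimmed.pop(0)
    return trimmed

Calling trim_history_to_budget(conversation_history) before joining the prompt keeps it inside the budget, though this only approximates the real token count.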
7. Conclusion
By maintaining and appending conversation history to prompts, you can make a locally running LLM (via Ollama) "remember" past interactions. This can be implemented using simple data structures like lists or more advanced systems like databases or vector stores. Additionally, you can persist the conversation history across sessions by saving it to files or databases.
This approach allows you to create more interactive and context-aware applications, such as chatbots, virtual assistants, or conversational agents, while leveraging the power of local LLMs.