Summarisation via LangChain
This tutorial shows how to use large language models (LLMs) with the LangChain framework. This chapter focuses on summarisation—a task well suited to LLMs—while highlighting potential pitfalls. For an introduction to the framework, see LangChain expression language (basic usage).
In this tutorial, we summarise a large fictional text and a plausible email conversation from a software development team.
Configure access to STACKIT LLM instances
After you create a STACKIT AI Model Serving auth token (see Manage auth tokens), provide it as model_serving_auth_token. From Available shared models, choose a model and provide the model name and the base URL.
import os
from dotenv import load_dotenv
load_dotenv("../.env")
model = os.environ["STACKIT_MODEL_SERVING_MODEL"]  # Select a chat model from https://support.docs.stackit.cloud/stackit/en/models-licenses-319914532.html
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]  # For example: "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
model_serving_auth_token = os.environ["STACKIT_MODEL_SERVING_AUTH_TOKEN"]  # For example: "ey..."
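Optionally, a small sanity check right after loading the environment can save debugging time later. This is our own minimal sketch, not part of the original setup; it only uses the variable names shown above.
# Fail early if any of the required environment variables is missing or empty.
required_vars = (
    "STACKIT_MODEL_SERVING_MODEL",
    "STACKIT_MODEL_SERVING_BASE_URL",
    "STACKIT_MODEL_SERVING_AUTH_TOKEN",
)
missing = [name for name in required_vars if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")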
Utilities
We use helper functions to ingest data, including code that reads files and returns their payload in a formatted manner.
import json
from pathlib import Path


def read_gutenberg(filename: Path | str) -> str:
    """Consume the specified Project Gutenberg ebook, extract the payload, and return it as a string."""
    payload_boundary = "***"
    fp = Path(filename)
    _verify_file_path(path=fp)

    with fp.open(mode="r", encoding="utf-8") as f:
        text = f.read()

    fragments = text.split(payload_boundary)
    if len(fragments) != 5 or ("PROJECT GUTENBERG" not in fragments[1]):
        raise ValueError("Specified file is not a Project Gutenberg ebook.")

    return fragments[2]


def read_mail_conversation(filename: Path | str) -> str:
    """Load messages from a JSON file and format them as a single string."""
    message_delimiter = "\n\n" + "#" * 32 + "\n\n"
    fp = Path(filename)
    _verify_file_path(path=fp)

    with fp.open(mode="r", encoding="utf-8") as f:
        payload = json.load(f)
    return message_delimiter.join(
        "\n".join(f"{k.strip()}:\t{v.strip()}" for k, v in msg.items())
        for msg in payload
    )


def _verify_file_path(path: Path) -> None:
    """Raise FileNotFoundError if the specified path does not point to a file."""
    if not path.exists():
        raise FileNotFoundError(f"Could not find file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"Expected {path} to be a file, but found a directory.")
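To see what read_mail_conversation returns, the following hedged sketch formats a tiny two-message conversation written to a temporary file. The file name and message contents are invented purely for illustration.
import tempfile

# Hypothetical two-message conversation, only to illustrate the output format.
sample_messages = [
    {"sender": "a@example.com", "receiver": "b@example.com", "subject": "Kickoff", "message": "Hi B,\n\nshall we start?\n\nBest,\nA"},
    {"sender": "b@example.com", "receiver": "a@example.com", "subject": "Re: Kickoff", "message": "Hi A,\n\nyes, let's go.\n\nBest,\nB"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "sample_conversation.json"
    sample_file.write_text(json.dumps(sample_messages), encoding="utf-8")
    # Each key/value pair is printed on its own line (e.g. "sender:\ta@example.com"),
    # and the two messages are separated by a line of 32 '#' characters.
    print(read_mail_conversation(sample_file))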
Plain summary
We define a LangChain runnable to obtain a summary of a fictional text. As noted in LangChain expression language (basic usage), the prompt specifies the LLM’s task.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers.string import StrOutputParser
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model=model,
    base_url=base_url,
    api_key=model_serving_auth_token,
    temperature=0.01,
)
summary_prompt = ChatPromptTemplate([
    ("system", "You are a helpful AI bot."),
    ("human", "Summarise the following text:\n\n{text_to_summarize}"),
])
summary_chain = summary_prompt | llm | StrOutputParser()
First, look at the text to summarise: “Das Urteil” is one of the most well-known stories by Franz Kafka; it was written in 1912 and published one year later.
text_to_summarize = read_gutenberg(Path.cwd().parent / "data" / "das_urteil.txt")
summary = summary_chain.invoke(text_to_summarize)
print(f"Amount of words original: ca. {len(text_to_summarize.split(' '))}")print(f"Amount of words summary: ca. {len(summary.split(' '))}\n")
print(summary)
# output
#> Amount of words original: ca. 3953
#> Amount of words summary: ca. 258
#>
#>
#> Der Text ist die Erzählung “Das Urteil” von Franz Kafka. Die Geschichte handelt von Georg Bendemann, einem jungen Kaufmann, der in einem Brief an seinen Freund in Petersburg seine Verlobung mit Fräulein Frieda Brandenfeld ankündigt. Georgs Vater ist jedoch nicht erfreut über die Nachricht und beschuldigt Georg, ihn und seine Mutter verraten zu haben.
#> Der Vater offenbart, dass er Georgs Briefe an den Freund in Petersburg gelesen hat und dass er selbst Briefe an den Freund geschrieben hat, in denen er Georgs Verhalten kritisiert. Der Vater wirft Georg vor, dass er ihn und seine Mutter verraten hat, indem er sich mit Fräulein Brandenfeld verlobt hat.
#> Georg ist schockiert und verletzt von den Vorwürfen seines Vaters und fühlt sich aus dem Zimmer gejagt. Er läuft aus dem Haus und springt von der Brücke in den Fluss, wo er sich das Leben nimmt.
#> Die Geschichte ist ein typisches Beispiel für Kafkas Stil, der von existenzieller Angst, Isolation und dem Konflikt zwischen Vater und Sohn geprägt ist. Der Text ist auch ein Kommentar zur Gesellschaft und Kultur der Zeit, in der er geschrieben wurde.
#> Einige wichtige Themen in der Geschichte sind:
#> - Der Konflikt zwischen Vater und Sohn: Der Vater repräsentiert die Autorität und Tradition, während der Sohn die Moderne und die Freiheit verkörpert.
#> - Die Isolation und Einsamkeit: Georg ist ein einsamer Mensch, der sich von seiner Familie und seiner Gesellschaft abgeschnitten fühlt.
#> - Die existenzielle Angst: Georgs Tod ist ein Symbol für die Angst vor dem Unbekannten und die Unsicherheit des Lebens.
#>
#> Insgesamt ist “Das Urteil” eine komplexe und vielschichtige Geschichte, die viele Interpretationen und Analysen ermöglicht.
The result appears satisfactory. However, we identified several issues to address:
- The language switched.
- The summary length was not constrained.
- The model’s world knowledge influenced the output.
Be specific in prompts. Refine the summary prompt to address these issues.
refined_summary_prompt = ChatPromptTemplate([
    ("system", "You are a helpful AI bot."),
    ("human", "Complete the following task:\n- Summarise the text provided at the end of this instruction.\n- Use the same language as the original text.\n- Do not exceed 100 words.\n- Do not add external information; stick to the text.\n\nHere is the text:\n\n{text_to_summarize}"),
])
refined_summary_chain = refined_summary_prompt | llm | StrOutputParser()
refined_summary = refined_summary_chain.invoke(text_to_summarize)
print(f"Amount of words original: ca. {len(text_to_summarize.split(' '))}")print(f"Amount of words summary: ca. {len(refined_summary.split(' '))}\n")
print(refined_summary)
The refined summary stays within the given word limit, and it no longer contains comments on the broader work of the well-known author.
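If you want to enforce the word limit rather than merely request it, one simple option is a retry loop such as the following sketch; max_words is our own name and mirrors the limit stated in the prompt.
# Re-invoke the chain (at most three times) if the summary exceeds the requested limit.
max_words = 100
for _ in range(3):
    if len(refined_summary.split()) <= max_words:
        break
    refined_summary = refined_summary_chain.invoke(text_to_summarize)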
Summarise Emails
Keeping in mind the issues from the previous section, we switch to a topic where the model’s world knowledge is of no help: emails from a software team that develops an app for teaching musical instruments. A product owner, a Scrum master, and a few developers exchange about a dozen messages, and someone wants to catch up on the latest activities. A particular requirement here is that several topics run through the conversation and must be kept apart, which the prompt has to address. Having learned our lesson from the summary attempt above, we start with a precise prompt right away.
First, let’s have a look at the data. While this conversation is fairly easy to grasp, the task becomes considerably harder for larger teams and longer periods of time.
The example is based on a file called music_app_conversation.json:
[ { "sender": "alex.johnson@musicapp.com", "receiver": "emma.thompson@musicapp.com", "date": "2025-01-18T09:00:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Upcoming Sprint Goals", "message": "Hi Team,\n\nFor the next sprint, I want us to focus on the following items:\n1. Implementing a new feature for piano chord recognition.\n2. Addressing the bug in the progress tracking dashboard reported by several users.\n3. Refactoring the onboarding module for better maintainability.\n\nLet me know if you have any questions or concerns.\n\nBest regards,\nAlex" }, { "sender": "emma.thompson@musicapp.com", "receiver": "alex.johnson@musicapp.com", "date": "2025-01-18T09:15:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Upcoming Sprint Goals", "message": "Hi Alex,\n\nThank you for outlining the goals. I’ll create the sprint backlog and ensure the team understands the priorities. I’ll follow up after the planning meeting tomorrow.\n\nBest,\nEmma" }, { "sender": "michael.brown@musicapp.com", "receiver": "emma.thompson@musicapp.com", "date": "2025-01-18T10:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Question About Piano Chord Recognition Feature", "message": "Hi Emma,\n\nI have a quick question about the piano chord recognition feature. Are we integrating it into the current practice mode or developing it as a standalone feature?\n\nThanks,\nMichael" }, { "sender": "emma.thompson@musicapp.com", "receiver": "michael.brown@musicapp.com", "date": "2025-01-18T11:00:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Question About Piano Chord Recognition Feature", "message": "Hi Michael,\n\nIt should be integrated into the current practice mode. Let’s discuss the technical details in our next stand-up.\n\nBest,\nEmma" }, { "sender": "sophia.miller@musicapp.com", "receiver": "team@musicapp.com", "date": "2025-01-18T12:00:00Z", "cc": "emma.thompson@musicapp.com", "bcc": "", "subject": "Bug Report: Progress Tracking Dashboard", "message": "Hi Team,\n\nI looked into the bug reported in the progress tracking dashboard. It seems to be an issue with the API returning incorrect data. I’ll investigate further, but let me know if anyone has insights or has worked on this part of the app before.\n\nThanks,\nSophia" }, { "sender": "daniel.wilson@musicapp.com", "receiver": "sophia.miller@musicapp.com", "date": "2025-01-18T12:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Bug Report: Progress Tracking Dashboard", "message": "Hi Sophia,\n\nI worked on the progress tracking feature last quarter. The issue might be related to a recent database schema update. Let’s pair up to debug this.\n\nBest,\nDaniel" }, { "sender": "jessica.taylor@musicapp.com", "receiver": "emma.thompson@musicapp.com", "date": "2025-01-18T13:00:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Confusion About Onboarding Refactoring Task", "message": "Hi Emma,\n\nI’m unclear about the refactoring task for the onboarding module. Are we expected to rewrite the entire module or just optimize specific functions?\n\nThanks,\nJessica" }, { "sender": "emma.thompson@musicapp.com", "receiver": "jessica.taylor@musicapp.com", "date": "2025-01-18T13:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Confusion About Onboarding Refactoring Task", "message": "Hi Jessica,\n\nThe focus is on optimizing specific functions, especially those impacting performance. 
Let’s talk more about this in the next sprint planning session.\n\nBest,\nEmma" }, { "sender": "user_feedback@musicapp.com", "receiver": "alex.johnson@musicapp.com", "date": "2025-01-18T14:00:00Z", "cc": "", "bcc": "", "subject": "Feedback: Missing Clarinet Tutorials", "message": "Hi,\n\nI’ve been using your app to learn piano, and it’s fantastic! However, I noticed there are no tutorials for the clarinet. Are there any plans to add this instrument in the future?\n\nThanks,\nA User" }, { "sender": "alex.johnson@musicapp.com", "receiver": "emma.thompson@musicapp.com", "date": "2025-01-18T14:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Feedback: Missing Clarinet Tutorials", "message": "Hi Team,\n\nWe’ve received feedback requesting clarinet tutorials. Let’s consider adding this to the roadmap and discuss during the next planning session.\n\nBest,\nAlex" }, { "sender": "jessica.taylor@musicapp.com", "receiver": "team@musicapp.com", "date": "2025-01-18T15:00:00Z", "cc": "emma.thompson@musicapp.com", "bcc": "", "subject": "Issue With Piano Chord Recognition Implementation", "message": "Hi Team,\n\nI’ve started implementing the piano chord recognition feature, but I realized I misunderstood the integration point. I thought this was for a new module, not the current practice mode. Should I restart, or can I modify my approach?\n\nThanks,\nJessica" }, { "sender": "emma.thompson@musicapp.com", "receiver": "jessica.taylor@musicapp.com", "date": "2025-01-18T15:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Issue With Piano Chord Recognition Implementation", "message": "Hi Jessica,\n\nThanks for flagging this. It’s essential to align with the requirements. Let’s discuss during the stand-up tomorrow to find the best way forward without losing too much time.\n\nBest,\nEmma" }, { "sender": "sophia.miller@musicapp.com", "receiver": "team@musicapp.com", "date": "2025-01-18T16:00:00Z", "cc": "emma.thompson@musicapp.com", "bcc": "", "subject": "Progress on Bug Fix for Dashboard", "message": "Hi Team,\n\nDaniel and I identified the root cause of the progress tracking bug. It was indeed related to the database schema update. We’ve implemented the fix and it has passed initial tests. I’ll deploy the update this afternoon.\n\nBest,\nSophia" }, { "sender": "daniel.wilson@musicapp.com", "receiver": "team@musicapp.com", "date": "2025-01-18T17:00:00Z", "cc": "emma.thompson@musicapp.com", "bcc": "", "subject": "Clarification Needed: Legacy Refactoring Task", "message": "Hi Team,\n\nFor the onboarding module refactoring, do we have specific performance metrics to target, or should we focus on general code improvements?\n\nBest,\nDaniel" }, { "sender": "emma.thompson@musicapp.com", "receiver": "daniel.wilson@musicapp.com", "date": "2025-01-18T17:30:00Z", "cc": "team@musicapp.com", "bcc": "", "subject": "Re: Clarification Needed: Legacy Refactoring Task", "message": "Hi Daniel,\n\nThe priority is on improving load times and reducing memory usage during the onboarding process. Please document your findings as you progress.\n\nBest,\nEmma" }]emails_to_summarize = read_mail_conversation(Path.cwd().parent / "data" / "music_app_conversation.json")
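If you want to inspect what the model will actually receive (the delimiter-joined string rather than the raw JSON), you can, for example, print the first formatted message. The delimiter below simply repeats the one defined in read_mail_conversation.
# Show only the first formatted message, i.e. everything before the first delimiter.
print(emails_to_summarize.split("\n\n" + "#" * 32 + "\n\n")[0])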
llm = ChatOpenAI(
    model=model,
    base_url=base_url,
    api_key=model_serving_auth_token,
)
mail_summary_prompt = ChatPromptTemplate([
    ("system", "You are a helpful AI bot."),
    ("human", "Summarise the emails provided at the end of this instruction. The conversation concerns software development. Distinguish: accomplished tasks, new issues, ongoing tasks, reported bugs, and user feedback. Keep the summary brief and concise.\n\nHere is the conversation:\n\n{text_to_summarize}"),
])
mail_summary_chain = mail_summary_prompt | llm | StrOutputParser()
mail_summary = mail_summary_chain.invoke(emails_to_summarize)
print(mail_summary)
# Output
#> Here’s a concise summary of the emails:
#>
#> Accomplished Tasks:
#> - Sophia and Daniel identified and fixed the bug in the progress tracking dashboard.
#>
#> New Issues:
#> - Jessica realized she misunderstood the integration point for the piano chord recognition feature.
#> - Daniel requested clarification on performance metrics for the onboarding module refactoring.
#> Ongoing Tasks:
#> - Implementing the piano chord recognition feature.
#> - Refactoring the onboarding module.
#> Reported Bugs:
#> - Progress tracking dashboard bug (resolved).
#> - No new bugs reported.
#> User Feedback:
#> - A user requested clarinet tutorials, and the team will consider adding this to the roadmap.
This concludes the summarisation tutorial. Key takeaways:
- Precise prompts yield precise results, even for seemingly simple tasks.
- Input formatting may need iteration.
Two final notes:
- The data sources here were simple (text and JSON files). Real projects often require substantial data preparation.
- Prompt compliance is not verified in this tutorial. You can add verification steps, for example by checking whether the summary matches the source and regenerating if necessary (a rough sketch follows below). These topics are beyond the scope of this tutorial.
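As a rough illustration of such a verification step, the following sketch reuses the chains defined above and lets the model judge its own summary; the prompt wording and the single-retry policy are our own assumptions, not a recommendation.
# Sketch of a simple LLM-based check: ask whether the summary is supported by the source.
verification_prompt = ChatPromptTemplate([
    ("system", "You are a strict reviewer."),
    ("human", "Answer only 'yes' or 'no': Is every statement in the following summary supported by the source text?\n\nSource:\n{source}\n\nSummary:\n{summary}"),
])
verification_chain = verification_prompt | llm | StrOutputParser()
verdict = verification_chain.invoke({"source": text_to_summarize, "summary": refined_summary})
if not verdict.strip().lower().startswith("yes"):
    refined_summary = refined_summary_chain.invoke(text_to_summarize)  # regenerate once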