
    Information extraction with LLMs using Amazon SageMaker JumpStart

    May 7, 2024

    Large language models (LLMs) have unlocked new possibilities for extracting information from unstructured text data. Although much of the current excitement is around LLMs for generative AI tasks, many of the key use cases that you might want to solve have not fundamentally changed. Tasks such as routing support tickets, recognizing customer intents from a chatbot conversation session, extracting key entities from contracts, invoices, and other types of documents, as well as analyzing customer feedback are examples of long-standing needs.

    What makes LLMs so transformative, however, is their ability to achieve state-of-the-art results on these common tasks with minimal data and simple prompting, and their ability to multitask. Rather than requiring extensive feature engineering and dataset labeling, LLMs can be fine-tuned on small amounts of domain-specific data to quickly adapt to new use cases. By handling most of the heavy lifting, services like Amazon SageMaker JumpStart remove the complexity of fine-tuning and deploying these models.

    SageMaker JumpStart is a machine learning (ML) hub with foundation models (FMs), built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on predefined quality and responsibility metrics to perform tasks like article summarization and image generation.

    This post walks through examples of building information extraction use cases by combining LLMs with prompt engineering and frameworks such as LangChain. We also examine the uplift from fine-tuning an LLM for a specific extractive task. Whether you’re looking to classify documents, extract keywords, detect and redact personally identifiable information (PII), or parse semantic relationships, you can start ideating your use case and use LLMs for your natural language processing (NLP) tasks.

    Prompt engineering

    Prompt engineering enables you to instruct LLMs to generate suggestions, explanations, or completions of text in an interactive way. Prompt engineering relies on large pretrained language models that have been trained on massive amounts of text data. There is often no single best way to design a prompt, and different LLMs might work better or worse with different prompts. Therefore, prompts are often iteratively refined through trial and error to produce better results. As a starting point, you can refer to the model documentation, which typically includes recommendations and best practices for prompting the model, as well as the examples provided in SageMaker JumpStart.

    In the following sections, we focus on the prompt engineering techniques required for extractive use cases. These techniques help unlock the power of LLMs by providing helpful constraints that guide the model toward its intended behavior. We discuss the following use cases:

    Sensitive information detection and redaction
    Entity extraction: generic and specific entities with structured formats
    Classification, using prompt engineering and fine-tuning

    Before we explore these use cases, we need to set up our development environment.

    Prerequisites

    The source code accompanying this example is available in this GitHub repo. It consists of several Jupyter notebooks and a utils.py module. The utils.py module houses the shared code that is used throughout the notebooks.

    The simplest way to run this example is by using Amazon SageMaker Studio with the Data Science 3.0 kernel or an Amazon SageMaker notebook instance with the conda_python3 kernel. For the instance type, you can choose the default settings.

    In this example, we use ml.g5.2xlarge and ml.g5.48xlarge instances for endpoint usage, and ml.g5.24xlarge for training job usage. Use the Service Quotas console to make sure you have sufficient quotas for these instances in the Region where you’re running this example.

    We use Jupyter notebooks throughout this post. Before we explore the examples, it’s crucial to confirm that you have the latest version of the SageMaker Python SDK. This SDK offers a user-friendly interface for training and deploying models on SageMaker. To install or upgrade to the latest version, run the following command in the first cell of your Jupyter notebook:

    %pip install --quiet --upgrade sagemaker

    Deploy Llama-2-70b-chat using SageMaker JumpStart

    There are many LLMs available in SageMaker JumpStart to choose from. In this example, we use Llama-2-70b-chat, but you might use a different model depending on your use case. To explore the list of SageMaker JumpStart models, see JumpStart Available Model Table.

    To deploy a model from SageMaker JumpStart, you can use either the APIs, as demonstrated in this post, or the SageMaker Studio UI. After the model is deployed, you can test it by asking the model a question:

    from sagemaker.jumpstart.model import JumpStartModel

    model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
    endpoint_name = model_id
    instance_type = "ml.g5.48xlarge"

    # role_arn is your SageMaker execution role ARN (for example, sagemaker.get_execution_role())
    model = JumpStartModel(
        model_id=model_id, model_version=model_version, role=role_arn
    )
    predictor = model.deploy(
        endpoint_name=endpoint_name, instance_type=instance_type
    )

    If no instance_type is provided, the SageMaker JumpStart SDK will select the default type. In this example, you explicitly set the instance type to ml.g5.48xlarge.
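    As a quick smoke test, you can send a simple question to the deployed endpoint. The payload shape below mirrors the llama2_chat helper defined later in this post; the question text itself is only an illustration.

    # Quick test of the deployed Llama-2-70b-chat endpoint.
    # The payload format matches the one used by llama2_chat later in this post.
    payload = {
        "inputs": [[{"role": "user", "content": "What is prompt engineering?"}]],
        "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.1},
    }
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print(response)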

    Sensitive data extraction and redaction

    LLMs show promise for identifying sensitive information for redaction. Prompt engineering techniques help here: priming the model to understand the redaction task and providing examples can both improve performance. For example, priming the model by stating “redact sensitive information” and demonstrating a few examples of redacting names, dates, and locations can help the LLM infer the rules of the task.

    More in-depth forms of priming the model include providing positive and negative examples, demonstrations of common errors, and in-context learning to teach the nuances of proper redaction. With careful prompt design, LLMs can learn to redact information while maintaining readability and utility of the document. In real-life applications, however, additional evaluation is often necessary to improve the reliability and safety of LLMs for handling confidential data. This is often achieved through the inclusion of human review, because no automated approach is entirely foolproof.
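    For instance, a few-shot variant of a redaction prompt can embed a couple of worked examples directly in the system message. The following sketch is illustrative only; the example pairs are hypothetical and are not part of the accompanying notebooks.

    few_shot_redaction_system = """
    Your task is to redact Personally Identifiable Information (PII) from the text.
    Replace each name, date, and location with four asterisks (****).

    EXAMPLE INPUT:
    On March 3rd, 2022, Maria from Spain placed an order.
    EXAMPLE OUTPUT:
    On ****, **** from **** placed an order.

    EXAMPLE INPUT:
    Contact David in Toronto before Friday.
    EXAMPLE OUTPUT:
    Contact **** in **** before ****.
    """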

    The following are a few examples of using prompt engineering for the extraction and redaction of PII. The prompt consists of multiple parts: the report_sample, which contains the text in which you want to identify and mask PII, and the instructions (or guidance) passed to the model as the system message.

    report_sample = """
    This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer Alice from US placed an order with total of $2190. Following her, on Nov 7th, Bob from UK ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with Jane from Australia, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer John, located in Singapore, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.
    """

    system = """
    Your task is to precisely identify Personally Identifiable Information (PII) and identifiable details, including name, address, and the person's country, in the provided text. Replace these details with exactly four asterisks (****) as the masking characters. Use '****' for masking text of any length. Only write the masked text in the response.
    """

    In the following example, you define the llama2_chat function that encapsulates sending the prompt to the Llama-2 model. You reuse this function throughout the examples.

    def llama2_chat(
        predictor,
        user,
        temperature=0.1,
        max_tokens=512,
        top_p=0.9,
        system=None,
    ):
        """Constructs the payload for the llama2 model, sends it to the endpoint,
        and returns the response."""

        inputs = []
        if system:
            inputs.append({"role": "system", "content": system})
        if user:
            inputs.append({"role": "user", "content": user})

        payload = {
            "inputs": [inputs],
            "parameters": {
                "max_new_tokens": max_tokens,
                "top_p": top_p,
                "temperature": temperature,
            },
        }
        response = predictor.predict(payload, custom_attributes="accept_eula=true")
        return response

    Use the following code to call the function, passing your parameters:

    response = utils.llama2_chat(
        predictor,
        system=system,
        user=report_sample,
    )
    print(utils.llama2_parse_output(response))

    You get the following output:

    This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer ***** from ***** placed an order with total of $2190. Following her, on Nov 7th, ***** from ***** ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with ***** from *****, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer *****, located in *****, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.

    Entity extraction

    Entity extraction is the process of identifying and extracting key information entities from unstructured text. This technique helps create structured data from unstructured text and provides useful contextual information for many downstream NLP tasks. Common applications for entity extraction include building a knowledge base, extracting metadata to use for personalization or search, and improving user inputs and conversation understanding within chatbots.

    You can effectively use LLMs for entity extraction tasks through careful prompt engineering. With a few examples of extracting entities from text, explanatory prompts, and the desired output format, the model can learn to identify and extract entities such as people, organizations, and locations from new input texts. In the following examples, we demonstrate a few different entity extraction tasks ranging from simpler to more complex using prompt engineering with the Llama-2-70b-chat model you deployed earlier.

    Extract generic entities

    Use the following code to extract generic entities (in this example, email addresses) from the text:

    email_sample = "Hello, My name is John. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0008 has a minimum payment of $24.53 that is due by July 31st. Based on your autopay settings, we will withdraw your payment on the due date from your bank account number XXXXXX1111 with the routing number XXXXX0000. Customer feedback for Sunshine Spa, 123 Main St, Anywhere. Send comments to Alice at alice_aa@anycompany.com and Bob at bob_bb@anycompany.com. I enjoyed visiting the spa. It was very comfortable but it was also very expensive. The amenities were ok but the service made the spa a great experience."

    system = """
    Your task is to precisely identify any email addresses from the given text and then write them, one per line. Remember to ONLY write an email address if it's precisely spelled out in the input text. If there are no email addresses in the text, write "N/A". DO NOT write anything else.
    """

    result = utils.llama2_chat(predictor, system=system, user=email_sample)
    print(utils.llama2_parse_output(result))

    You get the following output:

    alice_aa@anycompany.com
    bob_bb@anycompany.com

    Extract specific entities in a structured format

    Using the previous sample report, you can extract more complex information in a structured manner. This time, you provide a JSON template for the model to use and return the output in JSON format.

    With LLMs generating JSON documents as output, you can effortlessly parse them into a range of other data structures. This enables simple conversions to dictionaries, YAML, or even Pydantic models using third-party libraries, such as LangChain’s PydanticOutputParser. You can see the implementation in the GitHub repo.

    import json

    system = """
    Your task is to precisely extract information from the text provided, and format it according to the given JSON schema delimited with triple backticks. Only include the JSON output in your response. If a specific field has no available data, indicate this by writing `null` as the value for that field in the output JSON. In cases where there is no data available at all, return an empty JSON object. Avoid including any other statements in the response.

    ```
    {json_schema}
    ```
    """

    json_schema = """
    {
        "orders":
            [
                {
                    "name": "<customer_name>",
                    "location": "<customer_location>",
                    "order_date": "<order_date in format YYYY-MM-DD>",
                    "order_total": "<order_total>",
                    "order_items": [
                        {
                            "item_name": "<item_name>",
                            "item_quantity": "<item_quantity>"
                        }
                    ]
                }
            ]
    }
    """

    response = utils.llama2_chat(
        predictor,
        system=system.format(json_schema=json_schema),
        user=report_sample,
    )
    json_str = utils.llama2_parse_output(response)
    print(json_str)

    You get the following output:

    {
        "orders": [
            {
                "name": "Alice",
                "location": "US",
                "order_date": "2023-11-05",
                "order_total": 2190,
                "order_items": [
                    {
                        "item_name": null,
                        "item_quantity": null
                    }
                ]
            },
            {
                "name": "Bob",
                "location": "UK",
                "order_date": "2023-11-07",
                "order_total": 1000,
                "order_items": [
                    {
                        "item_name": "ergonomic keyboards",
                        "item_quantity": 25
                    }
                ]
            },
            {
                "name": "Jane",
                "location": "Australia",
                "order_date": "2023-11-12",
                "order_total": 9000,
                "order_items": [
                    {
                        "item_name": "high-definition monitors",
                        "item_quantity": 10
                    }
                ]
            },
            {
                "name": "John",
                "location": "Singapore",
                "order_date": "2023-11-30",
                "order_total": 3600,
                "order_items": [
                    {
                        "item_name": "USB-C docking stations",
                        "item_quantity": 15
                    }
                ]
            }
        ]
    }
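    Because the model returns plain JSON text, downstream parsing is straightforward. The following is a minimal sketch of the pattern mentioned earlier; the OrderItem, Order, and Orders Pydantic models are hypothetical illustrations rather than code from the repository, and LangChain's PydanticOutputParser offers a more complete version of this approach.

    import json
    from typing import List, Optional
    from pydantic import BaseModel

    class OrderItem(BaseModel):
        item_name: Optional[str] = None
        item_quantity: Optional[int] = None

    class Order(BaseModel):
        name: str
        location: str
        order_date: str
        order_total: float
        order_items: List[OrderItem] = []

    class Orders(BaseModel):
        orders: List[Order] = []

    # json_str is the JSON text returned by the model in the previous step.
    parsed = Orders(**json.loads(json_str))
    print(parsed.orders[0].name)  # "Alice"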

    Classification using prompt engineering

    LLMs are a useful tool for information extraction tasks such as text classification. Common applications include classifying the intents of user interactions via channels such as email, chatbots, and voice, or categorizing documents to route their requests to downstream systems. The initial step involves identifying the intent or class of the user’s request or the document. These intents or classes could take many forms—from short single words to thousands of hierarchical classes and sub-classes.

    In the following examples, we demonstrate prompt engineering on synthetic conversation data to extract intents. Additionally, we show how pre-trained models can be assessed to determine if fine-tuning is needed.

    Let’s start with the following example. You have a list of customer interactions with an imaginary health and life insurance company. To start, use the Llama-2-70b-chat model you deployed in the previous section:

    inference_instance_type = "ml.g5.48xlarge"

    # Llama-2-70b chat
    model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
    endpoint_name = model_id

    predictor = utils.get_predictor(
        endpoint_name=endpoint_name,
        model_id=model_id,
        model_version=model_version,
        inference_instance_type=inference_instance_type,
    )

    The get_predictor function is a helper that creates a predictor object from a model ID and version. If the specified endpoint doesn’t exist, it creates a new endpoint and deploys the model; if the endpoint already exists, it reuses the existing endpoint.
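    The exact implementation lives in utils.py in the repository. The following is a minimal sketch of how such a helper could be written with the SageMaker SDK; the existence check via boto3 and the use of retrieve_default are assumptions for illustration, not the repository's verbatim code.

    import boto3
    from botocore.exceptions import ClientError
    from sagemaker.jumpstart.model import JumpStartModel
    from sagemaker.predictor import retrieve_default

    def get_predictor(endpoint_name, model_id, model_version, inference_instance_type):
        """Return a predictor for endpoint_name, deploying the JumpStart model first
        if the endpoint does not exist yet. Sketch only; see utils.py for the real code."""
        sm_client = boto3.client("sagemaker")
        try:
            sm_client.describe_endpoint(EndpointName=endpoint_name)
        except ClientError:
            # Endpoint not found: deploy the JumpStart model to a new endpoint.
            model = JumpStartModel(model_id=model_id, model_version=model_version)
            return model.deploy(
                endpoint_name=endpoint_name,
                instance_type=inference_instance_type,
                accept_eula=True,
            )
        # Endpoint already exists: attach a predictor with the default (de)serializers.
        return retrieve_default(
            endpoint_name=endpoint_name,
            model_id=model_id,
            model_version=model_version,
        )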

    customer_interactions = [
        """Hello, I've recently moved to a new state and I need to update my address for my health insurance policy.
        Can you assist me with that?
        """,
        """Good afternoon! I'm interested in adding dental coverage to my existing health plan.
        Could you provide me the options and prices?
        """,
        """I had a disappointing experience with the customer service yesterday regarding my claim.
        I want to file a formal complaint and speak with a supervisor.
        """,
    ]

    system = """
    Your task is to identify the customer intent from their interactions with the support bot in the provided text. The intent output must not be more than 4 words. If the intent is not clear, please provide a fallback intent of "unknown".
    """

    def get_intent(system, customer_interactions):
        for customer_interaction in customer_interactions:
            response = utils.llama2_chat(
                predictor,
                system=system,
                user=customer_interaction,
            )
            content = utils.llama2_parse_output(response)
            print(content)

    get_intent(system, customer_interactions)

    You get the following output:

    Update Address
    Intent: Informational
    Intent: Escalate issue

    Looking at the output, the intents seem reasonable. However, the format and style of the intents can vary depending on the language model. Another limitation of this approach is that the intents are not confined to a predefined list, which means the language model might generate and word the intents differently each time you run it.

    To address this, you can use the in-context learning technique in prompt engineering to steer the model towards selecting from a predefined set of intents, or class labels, that you provide. In the following example, alongside the customer conversation, you include a list of potential intents and ask the model to choose from this list:

    system = """
    Your task is to identify the intent from the customer interaction with the support bot. Select from the intents provided in the following list delimited with ####. If the intent is not clear, please provide a fallback intent of "unknown". ONLY write the intent.

    ####
    - information change
    - add coverage
    - complaint
    - portal navigation
    - free product upgrade
    ####
    """

    get_intent(system, customer_interactions)

    You get the following output:

    information change
    add coverage
    complaint

    Reviewing the results, it’s evident that the language model performs well in selecting the appropriate intent in the desired format.

    Sub-intents and intent trees

    In many real-life use cases, the preceding scenario becomes more complex: intents can span a large number of categories and be organized hierarchically, which makes the classification task more challenging for the model. To handle this, you can further improve your prompt by providing examples to the model, also known as n-shot learning, k-shot learning, or few-shot learning.

    The following is the intent tree to use in this example. You can find its source code in the utils.py file in the code repository.

    INTENTS = [
        {
            "main_intent": "profile_update",
            "sub_intents": [
                "contact_info",
                "payment_info",
                "members",
            ],
        },
        {
            "main_intent": "health_cover",
            "sub_intents": [
                "add_extras",
                "add_hospital",
                "remove_extras",
                "remove_hospital",
                "new_policy",
                "cancel_policy",
            ],
        },
        {
            "main_intent": "life_cover",
            "sub_intents": [
                "new_policy",
                "cancel_policy",
                "beneficiary_info",
            ],
        },
        {
            "main_intent": "customer_retention",
            "sub_intents": [
                "complaint",
                "escalation",
                "free_product_upgrade",
            ],
        },
        {
            "main_intent": "technical_support",
            "sub_intents": [
                "portal_navigation",
                "login_issues",
            ],
        },
    ]

    Using the following prompt (which includes the intents), you can ask the model to pick from the provided list of intents:

    system = """
    Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text using the provided intent tree delimited with ####. The intents are defined in classes and sub-classes. Write the intent in this format: <main-intent>:<sub-intent>. ONLY write the intent.

    OUTPUT EXAMPLE:
    profile_update:contact_info

    OUTPUT EXAMPLE:
    customer_retention:complaint

    ####
    {intents}
    ####
    """

    intents_json = json.dumps(utils.INTENTS, indent=4)
    system = system.format(intents=intents_json)
    get_intent(system, customer_interactions)

    You get the following output:

    profile_update:contact_info
    health_cover:add_extras
    customer_retention:complaint

    Although LLMs can often correctly identify intent from a list of possible intents, they may sometimes produce additional outputs or fail to adhere to the exact intent structure and output schema. There are also scenarios where intents are not as straightforward as they initially seem or are highly specific to a business domain context that the model doesn’t fully comprehend.

    As an example, in the following sample interaction, the customer ultimately wants to change their coverage, but their immediate question and interaction intent is to get help with portal navigation. Similarly, in the second interaction, the more appropriate intent is “free product upgrade” which the customer is requesting. However, the model is unable to detect these nuanced intents as accurately as desired.

    customer_interactions = [
        "I want to change my coverage plan. But I'm not seeing where to do this on the online website. Could you please point me to it?",
        "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
    ]

    get_intent(system, customer_interactions)

    You get the following output:

    profile_update:contact_info
    customer_retention:complaint

    Prompt engineering can often successfully extract specific intents from text. However, for some use cases, relying solely on prompt engineering has limitations. Scenarios where additional techniques beyond prompt engineering may be needed include:

    Conversations with a large number of intent classes or long contexts that exceed the language model’s context window size, or that make queries more computationally expensive
    Desired outputs in specific formats that the model struggles to adopt
    Enhancing model understanding of the domain or task to boost performance

    In the following section, we demonstrate how fine-tuning can boost the accuracy of the LLM for the intent classification task attempted earlier.

    Fine-tuning an LLM for classification

    The following sections detail the fine-tuning process for the FlanT5-XL and Mistral 7B models using SageMaker JumpStart. We use the FlanT5-XL and Mistral 7B models to compare their accuracy. Both models are significantly smaller than Llama-2-70b-chat. The goal is to determine whether smaller models can achieve state-of-the-art performance on specific tasks after they’re fine-tuned.

    We fine-tuned both the Mistral 7B and FlanT5-XL models. You can see the details of the Mistral 7B fine-tuning in the code repository. In the following sections, we outline the steps for fine-tuning and evaluating FlanT5-XL.

    Initially, you deploy (or reuse) the FlanT5 endpoint as the base_predictor, which represents the base model prior to any fine-tuning. Subsequently, you assess the performance of the models by comparing them after the fine-tuning process.

    inference_instance_type = "ml.g5.2xlarge"

    model_id, model_version = "huggingface-text2text-flan-t5-xl", "2.0.0"
    base_endpoint_name = model_id

    base_predictor = utils.get_predictor(
        endpoint_name=base_endpoint_name,
        model_id=model_id,
        model_version=model_version,
        inference_instance_type=inference_instance_type,
    )

    Prepare training data for fine-tuning

    Preparing for fine-tuning requires organizing several files, including the dataset and template files. The dataset is structured to align with the required input format for fine-tuning. For example, each record in our training dataset adheres to the following structure:

    {"query": "customer query", "response": "main-intent:sub-intent"}

    In this example, you use a synthesized dataset comprising customer interactions with a fictional insurance company. To learn more about the data and gain access to it, refer to the source code.

    intent_dataset_file = "data/intent_dataset.jsonl"
    intent_dataset_train_file = "data/intent_dataset_train.jsonl"
    intent_dataset_test_file = "data/intent_dataset_test.jsonl"
    ft_template_file = "data/template.json"
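    The repository takes care of splitting the dataset into the train and test files listed above. If you want to reproduce that step yourself, a simple random split along the following lines would work; the 90/10 ratio matches the split described later in this post, and the helper below is illustrative rather than the repository's code.

    import json
    import random

    def split_dataset(src_file, train_file, test_file, train_ratio=0.9, seed=42):
        """Randomly split a JSONL dataset into train and test files."""
        with open(src_file) as f:
            records = [json.loads(line) for line in f if line.strip()]
        random.Random(seed).shuffle(records)
        cut = int(len(records) * train_ratio)
        for path, subset in [(train_file, records[:cut]), (test_file, records[cut:])]:
            with open(path, "w") as f:
                for record in subset:
                    f.write(json.dumps(record) + "\n")

    split_dataset(intent_dataset_file, intent_dataset_train_file, intent_dataset_test_file)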

    The following is the prompt for fine-tuning. The prompt has the query parameter, which is set during the fine-tuning using the SageMaker JumpStart SDK.

    FT_PROMPT = """Identify the intent classes from the given user query, delimited with ####. Intents are categorized into two levels: main intent and sub intent. In your response, provide only ONE set of main and sub intents that is most relevant to the query. Write your response ONLY in this format <main-intent>:<sub-intent>. ONLY Write the intention.

    OUTPUT EXAMPLE:
    profile_update:contact_info

    OUTPUT EXAMPLE:
    technical_support:portal_navigation

    #### QUERY:
    {query}
    ####
    """

    The following creates a template file that will be used by the SageMaker JumpStart framework to fine-tune the model. The template has two fields, prompt and completion. These fields are used to pass labeled data to the model for the fine-tuning process.

    template = {
        "prompt": utils.FT_PROMPT,
        "completion": "{response}",
    }

    with open(ft_template_file, "w") as f:
        json.dump(template, f)

    The training data is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, setting the stage for the actual fine-tuning process.

    train_data_location = utils.upload_train_and_template_to_s3(
        bucket_prefix="intent_dataset_flant5",
        train_path=intent_dataset_train_file,
        template_path=ft_template_file,
    )
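    upload_train_and_template_to_s3 is another small helper from utils.py. Conceptually, it does something like the following, assuming the default SageMaker bucket is used; this is a sketch under that assumption, not the repository's exact code.

    import sagemaker

    def upload_train_and_template_to_s3(bucket_prefix, train_path, template_path):
        """Upload the training JSONL file and template.json to S3 and return the S3 prefix."""
        session = sagemaker.Session()
        bucket = session.default_bucket()  # default SageMaker bucket for the account and Region
        for local_path in (train_path, template_path):
            session.upload_data(path=local_path, bucket=bucket, key_prefix=bucket_prefix)
        return f"s3://{bucket}/{bucket_prefix}"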

    Fine-tune the model

    Configure the JumpStartEstimator, specifying your chosen model and other parameters like instance type and hyperparameters (in this example, you use five epochs for the training). This estimator drives the fine-tuning process.

    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(
        model_id=model_id,
        disable_output_compression=True,
        instance_type="ml.g5.24xlarge",
        role=utils.get_role_arn(),
    )

    estimator.set_hyperparameters(
        instruction_tuned="True", epochs="5", max_input_length="1024"
    )

    estimator.fit({"training": train_data_location})

    Deploy the fine-tuned model

    After fine-tuning, deploy the fine-tuned model:

    finetuned_endpoint_name = "flan-t5-xl-ft-infoext"
    finetuned_model_name = finetuned_endpoint_name
    # Deploying the finetuned model to an endpoint
    finetuned_predictor = estimator.deploy(
        endpoint_name=finetuned_endpoint_name,
        model_name=finetuned_model_name,
    )

    Use the following code to test the fine-tuned model against its base model with ambiguous queries, which you saw in the previous section:

    ambiguous_queries = [
        {
            "query": "I want to change my coverage plan. But I'm not seeing where to do this on the online site. Could you please show me how?",
            "main_intent": "techincal_support",
            "sub_intent": "portal_navigation",
        },
        {
            "query": "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
            "main_intent": "customer_retention",
            "sub_intent": "free_product_upgrade",
        },
    ]
    for query in ambiguous_queries:
        question = query["query"]
        print("query:", question, "\n")
        print(
            "expected intent: ", f"{query['main_intent']}:{query['sub_intent']}"
        )

        prompt = utils.FT_PROMPT.format(query=question)
        response = utils.flant5(base_predictor, user=prompt, max_tokens=13)
        print("base model: ", utils.parse_output(response))

        response = utils.flant5(finetuned_predictor, user=prompt, max_tokens=13)
        print("finetuned model: ", utils.parse_output(response))
        print("-" * 80)

    You get the following output:

    query: I want to change my coverage plan. But I’m not seeing where to do this on the online site. Could you please show me how?
    expected intent: techincal_support:portal_navigation
    base model: main_intent>:sub_intent> change
    finetuned model: technical_support:portal_navigation
    --------------------------------------------------------------------------------
    query: I’m unhappy with the current benefits of my plan and I’m considering canceling unless there are better alternatives. What can you offer?

    expected intent: customer_retention:free_product_upgrade
    base model: main_intent>:sub_intent> cancel
    finetuned model: customer_retention:free_product_upgrade
    --------------------------------------------------------------------------------

    As shown in this example, the fine-tuned model is able to classify the ambiguous queries correctly.

    In evaluations, fine-tuned models performed better in identifying the correct class for both clear and ambiguous intents. The following section details the benchmark’s performance overall, and against each intent.

    Performance comparisons and considerations

    In this section, we present the evaluation results and performance benchmarks for each model, before and after fine-tuning, as well as a comparison between prompt engineering and fine-tuning the LLM. The dataset consists of 7,824 examples, with a 90% split for training (including validation) and 10% for testing.
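    The evaluation notebooks in the repository compute these numbers; conceptually, the overall and per-intent accuracy come down to an exact-match comparison of the predicted main-intent:sub-intent string against the label. The following simplified sketch illustrates the idea; helper names such as utils.flant5, utils.parse_output, and utils.FT_PROMPT come from the repository, but the loop itself is illustrative rather than the repository's evaluation code.

    from collections import defaultdict

    def evaluate(predictor, test_records):
        """Exact-match accuracy overall and per intent label (illustrative sketch)."""
        per_label = defaultdict(lambda: {"correct": 0, "total": 0})
        for record in test_records:  # each record: {"query": ..., "response": "main-intent:sub-intent"}
            prompt = utils.FT_PROMPT.format(query=record["query"])
            response = utils.flant5(predictor, user=prompt, max_tokens=13)
            prediction = utils.parse_output(response).strip()
            label = record["response"]
            per_label[label]["total"] += 1
            per_label[label]["correct"] += int(prediction == label)
        overall = sum(v["correct"] for v in per_label.values()) / sum(
            v["total"] for v in per_label.values()
        )
        return overall, per_label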

    | Model | Overall Accuracy | Fine-tuning Duration (minutes) | Notes |
    | --- | --- | --- | --- |
    | Mistral-7b (fine-tuned five epochs, without classes in the prompt) | 98.97% | 720 | Given Mistral-7b's nature as a text generation model, parsing its output to extract intent can be challenging due to tendencies for character repetition and generation of additional characters. Improved performance with more epochs: 98% accuracy for five epochs compared to 92% for one epoch. |
    | Flan-T5-XL (fine-tuned five epochs, without classes in the prompt) | 98.46% | 150 | Marginal improvement in accuracy with increased epochs: from 97.5% (one epoch) to 98.46% (five epochs). |
    | Llama-2-70b-chat (with classes in the prompt) | 78.42% | N/A | Low accuracy in ambiguous scenarios. |
    | Llama-2-70b-chat (without classes in the prompt) | 10.85% | N/A | |
    | Flan-T5-XL (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format. |
    | Mistral-7b (base model, without classes in the prompt) | 0.0% | N/A | Unable to identify any of the intent classes with the expected format. |

    The following table contains a breakdown of models’ accuracy for each intent class.

    | Main Intent | Sub-intent | Example Count | Llama2-70b (without classes in prompt) | Llama2-70b (with classes in prompt) | Flan-T5-XL (fine-tuned) | Mistral-7b (fine-tuned) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Customer Retention | Complaint | 63 | 7.94% | 44.44% | 98.41% | 98.41% |
    | Customer Retention | Escalation | 49 | 91.84% | 100% | 100% | 100% |
    | Customer Retention | Free Product Upgrade | 50 | 0.00% | 64.00% | 100% | 100% |
    | Health Cover | Add Extras | 38 | 0.00% | 100% | 97.37% | 100% |
    | Health Cover | Add Hospital | 44 | 0.00% | 81.82% | 100% | 97.73% |
    | Health Cover | Cancel Policy | 43 | 0.00% | 100% | 100% | 97.67% |
    | Health Cover | New Policy | 41 | 0.00% | 82.93% | 100% | 100% |
    | Health Cover | Remove Extras | 47 | 0.00% | 85.11% | 100% | 100% |
    | Health Cover | Remove Hospital | 53 | 0.00% | 84.90% | 100% | 100% |
    | Life Cover | Beneficiary Info | 45 | 0.00% | 100% | 97.78% | 97.78% |
    | Life Cover | Cancel Policy | 47 | 0.00% | 55.32% | 100% | 100% |
    | Life Cover | New Policy | 40 | 0.00% | 90.00% | 92.50% | 100% |
    | Profile Update | Contact Info | 45 | 35.56% | 95.56% | 95.56% | 95.56% |
    | Profile Update | Members | 52 | 0.00% | 36.54% | 98.08% | 98.08% |
    | Profile Update | Payment Info | 47 | 40.43% | 97.87% | 100% | 100% |
    | Technical Support | Login Issues | 39 | 0.00% | 92.31% | 97.44% | 100% |
    | Technical Support | Portal Navigation | 40 | 0.00% | 45.00% | 95.00% | 97.50% |

    This comparative analysis illustrates the trade-offs between fine-tuning time and model accuracy. It highlights the ability of models like Mistral-7b and FlanT5-XL to achieve higher classification accuracy through fine-tuning. Additionally, it shows how smaller models can match or surpass the performance of larger models on specific tasks when fine-tuned, contrasted with using prompt engineering alone on the larger models.

    Clean up

    Complete the following steps to clean up your resources:

    Delete the SageMaker endpoints, configuration, and models.
    Delete the S3 bucket created for this example.
    Delete the SageMaker notebook instance (if you used one to run this example).

    Summary

    Large language models have revolutionized information extraction from unstructured text data. These models excel in tasks such as classifying information and extracting key entities from various documents, achieving state-of-the-art results with minimal data.

    This post demonstrated the use of large language models for information extraction through prompt engineering and fine-tuning. While effective, relying solely on prompt engineering can have limitations for complex tasks that require rigid output formats or a large number of classes. In these scenarios, fine-tuning even smaller models on domain-specific data can significantly improve performance beyond what prompt engineering alone can achieve.

    The post included practical examples highlighting how fine-tuned smaller models can surpass prompt engineering with larger models for such complex use cases. Although prompt engineering is a good starting point for simpler use cases, fine-tuning offers a more robust solution for complex information extraction tasks, ensuring higher accuracy and adaptability to specific use cases. SageMaker JumpStart tools and services facilitate this process, making it accessible for individuals and teams across all levels of ML expertise.

    Additional reading

    You can read more on using SageMaker JumpStart for intelligent document processing, fine-tuning, and evaluation of LLMs in the following resources:

    Enhancing AWS intelligent document processing with generative AI
    Fine-tune and Deploy Mistral 7B with Amazon SageMaker JumpStart
    Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data
    Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart
    Evaluate large language models for quality and responsibility

    About the Authors

    Pooya Vahidi is a Senior Solutions Architect at AWS, passionate about computer science, artificial intelligence, and cloud computing. As an AI professional, he is an active member of the AWS AI/ML Area-of-Depth team. With a background spanning over two decades of expertise in leading the architecture and engineering of large-scale solutions, he helps customers on their transformative journeys through cloud and AI/ML technologies.

    Dr. Romina Sharifpour is a Senior Machine Learning and Artificial Intelligence Solutions Architect at Amazon Web Services (AWS). She has spent over 10 years leading the design and implementation of innovative end-to-end solutions enabled by advancements in ML and AI. Romina’s areas of interest are natural language processing, large language models, and MLOps.
