24  Using LLMs in Python for Text Generation

24.1 Intro

In this tutorial, we’ll explore how to leverage Large Language Models (LLMs) to generate text using OpenAI’s API. We’ll use the gpt-4o-mini model to generate responses to fixed and variable prompts, streamline our code with helper functions and vectorization, and handle data using pandas DataFrames.

24.2 Learning Objectives

  • Set up the OpenAI client
  • Define and use simple functions to generate text
  • Use vectorization to apply functions to DataFrames

24.3 Setting Up the OpenAI Client

First, we need to set up the OpenAI client using your API key. Here, we store the key in a file called local_settings.py, then import it into our script.

from openai import OpenAI
import pandas as pd
import numpy as np
from local_settings import OPENAI_KEY

# Initialize the OpenAI client with your API key
client = OpenAI(api_key=OPENAI_KEY)

Alternatively, you can pass your API key directly as a string to the api_key argument, but be careful not to expose it in your code, especially if you plan to share or publish it.
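For illustration, here is a minimal sketch of another common setup that reads the key from an environment variable instead of a Python file. It assumes you have set a variable named OPENAI_API_KEY in your shell; this is also the variable the OpenAI client reads by default when no key is passed explicitly.

import os
from openai import OpenAI

# A sketch assuming OPENAI_API_KEY has been exported in your shell beforehand
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

This keeps the key out of your source files entirely, which makes the script safer to share.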

24.4 Making an API Call

Let’s make an API call to the gpt-4o-mini model to generate a response to a prompt.

response = client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": "What is the most tourist-friendly city in France?"}]
)
print(response.choices[0].message.content)
Paris is widely regarded as the most tourist-friendly city in France. The city's rich history, iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, as well as its vibrant culture and cuisine, attract millions of visitors each year. 

Paris also offers a well-developed public transportation system, including the Metro and buses, making it easy for tourists to navigate the city. Additionally, there are numerous resources available for visitors, such as tourist information centers, guided tours, and multilingual services, enhancing the overall travel experience. 

While other cities like Nice, Lyon, and Marseille also have their own appeal, Paris remains the quintessential destination for those looking to experience the charm and beauty of France.

24.5 Defining Helper Functions

To simplify our code and avoid repetition, we’ll define a helper function for making API calls. API calls contain a lot of boilerplate code, so encapsulating this logic in a function makes our code cleaner and more maintainable.

If you ever forget how to structure the API calls, refer to the OpenAI API documentation or search for “OpenAI Python API example” online.

Here’s how we can define the llm_chat function:

def llm_chat(message):
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

This function takes a message as input, sends it to the LLM, and returns the generated response. The model parameter specifies which model to use; here we use gpt-4o-mini for its balance of quality, speed, and cost. If you want a more capable model, you can use gpt-4o, but be careful not to exceed your API quota.
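As a side note, if you later want to experiment with other models or settings without rewriting the call each time, one option is to expose them as parameters in the helper. The sketch below is an optional variant, not something the rest of the tutorial relies on; model and temperature are standard parameters of the chat completions endpoint, and the defaults shown are just illustrative choices.

def llm_chat_custom(message, model="gpt-4o-mini", temperature=0.7):
    # Same call as llm_chat, but with the model and sampling temperature exposed
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content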

24.6 Fixed Questions

Let’s start by sending a fixed question to the gpt-4o-mini model and retrieving a response.

# Example usage
response = llm_chat("What is the most tourist-friendly city in France?")
print(response)
Paris is often considered the most tourist-friendly city in France. As the capital and a major cultural hub, it offers a wide range of attractions, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city is well-equipped for tourists, featuring extensive public transportation, multilingual signage, and a plethora of dining and accommodation options catering to various budgets. Additionally, Paris hosts numerous events and activities throughout the year, making it accessible and engaging for visitors from around the world. Other cities like Nice, Lyon, and Marseille also offer tourist-friendly amenities, but Paris remains the most popular and recognized destination.

24.7 Practice Q: Get tourist-friendly city in Brazil

Use the llm_chat function to ask the model for the most tourist-friendly city in Brazil. Store the response in a variable called rec_brazil. Print the response.

# Your code here

24.8 Variables as Prompt Inputs

Often, you’ll want to generate responses based on varying inputs. Let’s create a function that takes a country as input and asks the model for the most tourist-friendly city in that country.

def city_rec(country):
    prompt = f"What is the most tourist-friendly city in {country}?"
    return llm_chat(prompt)

Now, you can get recommendations for different countries by calling city_rec("Country Name"):

city_rec("Nigeria")
'Lagos is often considered the most tourist-friendly city in Nigeria. As the largest city and a major financial hub, it offers a vibrant atmosphere with a rich cultural heritage, diverse nightlife, beautiful beaches, and various attractions. Visitors can explore places like Lekki Conservation Centre, Nike Art Gallery, and the National Museum, as well as enjoy local cuisine at numerous restaurants and food markets. Additionally, Lagos hosts various events and festivals that showcase Nigerian culture, making it an appealing destination for tourists. Other cities like Abuja, Port Harcourt, and Calabar also offer unique experiences, but Lagos tends to be the most accessible and engaging for tourists.'

However, if we try to use this function on a list of countries or a DataFrame column directly, it won’t process each country individually. Instead, the entire collection will be converted to a single string and inserted into one prompt, which isn’t the desired behavior.

# Incorrect usage
country_df = pd.DataFrame({"country": ["Nigeria", "Chile", "France", "Canada"]})

response = city_rec(country_df["country"])

print(response)
Determining the "most tourist-friendly" city can be subjective and depend on various factors such as infrastructure, safety, attractions, hospitality, and overall visitor experience. However, I can provide some insights into some of the major cities in each of the listed countries that are often regarded as tourist-friendly:

1. **Nigeria**: Lagos is often considered the most tourist-friendly city in Nigeria, known for its vibrant culture, music, and entertainment scene. However, security concerns can influence tourists' experiences.

2. **Chile**: Santiago, the capital, is a popular destination for tourists. It offers a mix of urban and natural attractions, with beautiful parks, museums, and proximity to the Andes mountains.

3. **France**: Paris is renowned as one of the most tourist-friendly cities in the world, with iconic landmarks, rich culture, and excellent public transportation.

4. **Canada**: Toronto and Vancouver are both highly regarded for their tourist-friendly nature. Vancouver is praised for its natural beauty and outdoor activities, while Toronto offers a diverse cultural scene.

Overall, Paris, France, is often considered the most tourist-friendly city among the options provided due to its well-developed tourist infrastructure and abundant attractions.

To process each country individually, we can use NumPy’s vectorize function. This transforms city_rec so that it can accept arrays (like lists, NumPy arrays, or pandas Series) and applies the function element-wise.

# Vectorize the function
city_rec_vec = np.vectorize(city_rec)

# Apply the function to each country
country_df["city_rec"] = city_rec_vec(country_df["country"])
country_df
country city_rec
0 Nigeria Lagos is often considered the most tourist-fri...
1 Chile The most tourist-friendly city in Chile is oft...
2 France Paris is often considered the most tourist-fri...
3 Canada While "most tourist-friendly" can be subjectiv...

This code will output a DataFrame with a new column city_rec containing city recommendations corresponding to each country.
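As an alternative to np.vectorize, pandas’ built-in apply method gives the same element-wise behavior; a minimal sketch under the same setup:

# Equivalent approach: apply city_rec to each value in the column
country_df["city_rec"] = country_df["country"].apply(city_rec)

Both approaches loop over the values in Python and make one API call per row, so for small examples like this the choice is mostly a matter of preference.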


24.9 Practice Q: Get local dishes

Create a function called get_local_dishes that takes a country name as input and returns some of the most famous local dishes from that country. Then, vectorize this function and apply it to the country_df DataFrame to add a column with local dish recommendations for each country.

# Your code here

24.10 Automated Summary: Movies Dataset

In this example, we’ll use the movies dataset from vega_datasets to generate automated summaries for each movie. We’ll convert each movie’s data into a dictionary and use it as input for the LLM to generate a one-paragraph performance summary.

First, let’s load the movies dataset and preview the first few rows:

import pandas as pd
import vega_datasets as vd

# Load the movies dataset
movies = vd.data.movies().head()  # Using only the first 5 rows to conserve API credits
movies
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0

Next, we’ll convert each row of the DataFrame into a dictionary. This will be useful for passing the data to the LLM.

# Convert each movie's data into a dictionary
movies.to_dict(orient="records")
[{'Title': 'The Land Girls',
  'US_Gross': 146083.0,
  'Worldwide_Gross': 146083.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 8000000.0,
  'Release_Date': 'Jun 12 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Gramercy',
  'Source': None,
  'Major_Genre': None,
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.1,
  'IMDB_Votes': 1071.0},
 {'Title': 'First Love, Last Rites',
  'US_Gross': 10876.0,
  'Worldwide_Gross': 10876.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 300000.0,
  'Release_Date': 'Aug 07 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Strand',
  'Source': None,
  'Major_Genre': 'Drama',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.9,
  'IMDB_Votes': 207.0},
 {'Title': 'I Married a Strange Person',
  'US_Gross': 203134.0,
  'Worldwide_Gross': 203134.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 250000.0,
  'Release_Date': 'Aug 28 1998',
  'MPAA_Rating': None,
  'Running_Time_min': nan,
  'Distributor': 'Lionsgate',
  'Source': None,
  'Major_Genre': 'Comedy',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': nan,
  'IMDB_Rating': 6.8,
  'IMDB_Votes': 865.0},
 {'Title': "Let's Talk About Sex",
  'US_Gross': 373615.0,
  'Worldwide_Gross': 373615.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 300000.0,
  'Release_Date': 'Sep 11 1998',
  'MPAA_Rating': None,
  'Running_Time_min': nan,
  'Distributor': 'Fine Line',
  'Source': None,
  'Major_Genre': 'Comedy',
  'Creative_Type': None,
  'Director': None,
  'Rotten_Tomatoes_Rating': 13.0,
  'IMDB_Rating': nan,
  'IMDB_Votes': nan},
 {'Title': 'Slam',
  'US_Gross': 1009819.0,
  'Worldwide_Gross': 1087521.0,
  'US_DVD_Sales': nan,
  'Production_Budget': 1000000.0,
  'Release_Date': 'Oct 09 1998',
  'MPAA_Rating': 'R',
  'Running_Time_min': nan,
  'Distributor': 'Trimark',
  'Source': 'Original Screenplay',
  'Major_Genre': 'Drama',
  'Creative_Type': 'Contemporary Fiction',
  'Director': None,
  'Rotten_Tomatoes_Rating': 62.0,
  'IMDB_Rating': 3.4,
  'IMDB_Votes': 165.0}]

Let’s store these dictionaries as a new column in the DataFrame:

movies["full_dict"] = movies.to_dict(orient="records")
movies
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes full_dict
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0 {'Title': 'The Land Girls', 'US_Gross': 146083...
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0 {'Title': 'First Love, Last Rites', 'US_Gross'...
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0 {'Title': 'I Married a Strange Person', 'US_Gr...
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN {'Title': 'Let's Talk About Sex', 'US_Gross': ...
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0 {'Title': 'Slam', 'US_Gross': 1009819.0, 'Worl...

Now, let’s define a function movie_performance that takes a movie’s data dictionary, constructs a prompt, and calls the llm_chat function to get a summary:

def movie_performance(movie_data):
    prompt = f"Considering the following data on this movie {movie_data}, provide a one-paragraph summary of its performance for my report."
    return llm_chat(prompt)

We’ll vectorize this function so we can apply it to the entire full_dict column:

import numpy as np

# Vectorize the function to apply it to the DataFrame
movie_performance_vec = np.vectorize(movie_performance)

Let’s test our function with an example:

# Example usage
movie_performance("Name: Kene's Movie, Sales: 100,000 USD")
"Kene's Movie has demonstrated impressive financial performance, achieving sales of $100,000 USD. This figure indicates a strong reception in the market, reflecting positive audience engagement and effective marketing strategies. The impressive sales suggest that the film resonated well with its target demographic, contributing to its overall success. This robust sales figure provides a solid foundation for potential future projects or sequels, highlighting Kene's Movie as a noteworthy entry in its genre."

Finally, we’ll apply the vectorized function to generate summaries for each movie:

# Generate summaries for each movie
movies["llm_summary"] = movie_performance_vec(movies["full_dict"])

You can now save the DataFrame with the generated summaries to a CSV file:

# Save the results to a CSV file
movies.to_csv("movies_output.csv", index=False)

This approach allows you to generate detailed summaries for each movie based on its full set of data, which can be incredibly useful for automated reporting and data analysis.


24.11 Practice Q: Weather Summary

Using the first 5 rows of the seattle_weather dataset from vega_datasets, create a function that takes all weather columns for a particular day and generates a summary of the weather conditions for that day. The function should use the LLM to generate a one-paragraph summary for a report, considering the data provided. Store the results in a column called weather_summary.

weather = vd.data.seattle_weather().head()
weather
date precipitation temp_max temp_min wind weather
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle
1 2012-01-02 10.9 10.6 2.8 4.5 rain
2 2012-01-03 0.8 11.7 7.2 2.3 rain
3 2012-01-04 20.3 12.2 5.6 4.7 rain
4 2012-01-05 1.3 8.9 2.8 6.1 rain
# Your code here

24.12 Wrap-up

In this tutorial, we learned the basics of using OpenAI’s LLMs in Python for text generation, created helper functions, and applied these functions to datasets using vectorization.

In the next lesson, we’ll look at structured outputs that allow us to specify the format of the response we want from the LLM. We’ll use this to extract structured data from unstructured text, a common task in data analysis.