24 Using LLMs in Python for Text Generation

24.1 Intro

In this tutorial, we'll explore how to leverage Large Language Models (LLMs) to generate text using OpenAI's API. We'll use the gpt-4o-mini model to generate responses to fixed and variable prompts, optimize our code with helper functions and vectorization, and handle data using pandas DataFrames.

24.2 Learning Objectives

- Set up the OpenAI client
- Define and use simple functions to generate text
- Use vectorization to apply functions to DataFrames

24.3 Setting Up the OpenAI Client

First, we need to set up the OpenAI client using your API key. Here, we store the key in a file called local_settings.py, then import it into our script.

from openai import OpenAI
import pandas as pd
import numpy as np

from local_settings import OPENAI_KEY

# Initialize the OpenAI client with your API key
client = OpenAI(api_key=OPENAI_KEY)

Alternatively, you can pass your API key directly when setting the api_key argument, but be cautious not to expose it in your code, especially if you plan to share or publish it.
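If you prefer not to keep a local_settings.py file, one common alternative (sketched below; not part of this lesson's setup) is to read the key from an environment variable:

import os
from openai import OpenAI

# Assumes you have set the OPENAI_API_KEY environment variable beforehand,
# e.g. export OPENAI_API_KEY="sk-..." in your shell
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])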
24.4 Making an API Call
Let’s make an API call to the gpt-4o-mini
model to generate a response to a prompt.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the most tourist-friendly city in France?"}],
)
print(response.choices[0].message.content)
While many cities in France are known for their tourist appeal, Paris is often considered the most tourist-friendly city. It offers a wealth of attractions, including iconic landmarks like the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris also has an extensive public transportation system that makes it easy for visitors to navigate the city.
In addition to Paris, other cities such as Nice, Lyon, and Bordeaux are also popular among tourists and offer their own unique charm, cultural experiences, and amenities tailored to visitors. Ultimately, the "most tourist-friendly" city may vary depending on individual preferences and interests.
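As an aside, the response object returned by client.chat.completions.create also reports how many tokens the call used, which is handy for keeping an eye on costs. A small sketch, reusing the response from the call above:

# Inspect token usage for the call above (counts vary from request to request)
print(response.usage.prompt_tokens)      # tokens in the prompt
print(response.usage.completion_tokens)  # tokens in the generated reply
print(response.usage.total_tokens)       # total tokens for the request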
24.5 Defining Helper Functions
To simplify our code and avoid repetition, we’ll define a helper function for making API calls. API calls contain a lot of boilerplate code, so encapsulating this logic in a function makes our code cleaner and more maintainable.
If you ever forget how to structure the API calls, refer to the OpenAI API documentation or search for “OpenAI Python API example” online.
Here’s how we can define the llm_chat
function:
def llm_chat(message):
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content
This function takes a message
as input, sends it to the LLM, and returns the generated response. The model
parameter specifies which model to use—in this case, gpt-4o-mini
. We use this model for its balance of quality, speed, and cost. If you want a more performant model, you can use gpt-4o, but be careful not to exceed your API quota.
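If you expect to switch between gpt-4o-mini and gpt-4o often, one option is to expose the model as a parameter with a default. The helper below (llm_chat_model is a hypothetical variant, not part of the lesson's code) sketches this idea:

def llm_chat_model(message, model="gpt-4o-mini"):
    # Same call as llm_chat, but the model can be overridden per call
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

# Example: llm_chat_model("What is the most tourist-friendly city in France?", model="gpt-4o")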
24.6 Fixed Questions
Let’s start by sending a fixed question to the gpt-4o-mini
model and retrieving a response.
# Example usage
response = llm_chat("What is the most tourist-friendly city in France?")
print(response)
Paris is widely considered the most tourist-friendly city in France. As the capital and one of the most iconic cities in the world, it offers a wealth of attractions, including the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and charming neighborhoods like Montmartre and Le Marais. Paris is well-equipped for tourists with an extensive public transportation system, numerous hotels, restaurants, and services catering to visitors. Additionally, English is commonly spoken in tourist areas, making it easier for international travelers to navigate the city.
24.7 Practice Q: Get tourist-friendly city in Brazil
Use the llm_chat
function to ask the model for the most tourist-friendly city in Brazil. Store the response in a variable called rec_brazil
. Print the response.
# Your code here
rec_brazil = llm_chat("What is the most tourist-friendly city in Brazil?")
print(rec_brazil)
One of the most tourist-friendly cities in Brazil is Rio de Janeiro. Known for its breathtaking landscapes, iconic beaches like Copacabana and Ipanema, and famous landmarks such as Christ the Redeemer and Sugarloaf Mountain, Rio offers a vibrant culture, rich history, and a range of activities for visitors.
Other cities that are also considered tourist-friendly include São Paulo, with its diverse gastronomy and cultural attractions, and Salvador, known for its Afro-Brazilian culture and colonial architecture. Each city has its unique charm and offers different experiences for travelers. Ultimately, the best choice depends on your interests and what you're looking to explore in Brazil.
24.8 Variables as Prompt Inputs
Often, you’ll want to generate responses based on varying inputs. Let’s create a function that takes a country as input and asks the model for the most tourist-friendly city in that country.
def city_rec(country):
= f"What is the most tourist-friendly city in {country}?"
prompt return llm_chat(prompt)
Now, you can get recommendations for different countries by calling city_rec("Country Name")
:
"Nigeria") city_rec(
'Lagos is often considered the most tourist-friendly city in Nigeria. As the largest city in the country, it offers a vibrant mix of culture, entertainment, and attractions. Visitors can enjoy beautiful beaches, a lively nightlife, art galleries, museums, and markets showcasing local crafts and cuisine.\n\nAdditionally, Lagos hosts various festivals and events throughout the year, highlighting its rich cultural heritage. Other notable cities for tourism in Nigeria include Abuja, the capital city, known for its modern architecture and green spaces, and Calabar, famous for its Carnival and rich history. However, Lagos remains the most prominent destination for tourists seeking a diverse and lively experience.'
However, if we try to use this function on a list of countries or a DataFrame column directly, it won’t process each country individually. Instead, it will attempt to concatenate the list into a single string, which isn’t the desired behavior.
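To see why, here is a quick illustration (not part of the original lesson, and it makes no API calls) of what the prompt becomes when a whole column is interpolated into the f-string; the full incorrect usage follows below.

# Illustration: interpolating a whole Series produces one long prompt string
sample = pd.Series(["Nigeria", "Chile"])
print(f"What is the most tourist-friendly city in {sample}?")
# The Series' printed representation (index labels, values, and dtype) ends up
# inside a single question, rather than one question per country.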
# Incorrect usage
country_df = pd.DataFrame({"country": ["Nigeria", "Chile", "France", "Canada"]})

response = city_rec(country_df["country"])
print(response)
The most tourist-friendly city among the options provided (Nigeria, Chile, France, Canada) would likely be a city in France. Paris, for example, is widely regarded as a major tourist destination known for its cultural attractions, hospitality, and infrastructure catering to tourists. Additionally, cities like Vancouver in Canada and Santiago in Chile are also very friendly to tourists, but France, particularly Paris, often stands out in terms of global recognition and tourist services.
If you are looking for a specific city from each country:
- **Nigeria**: Lagos could be considered, but it's often less tourist-friendly due to safety concerns.
- **Chile**: Santiago is a good choice for its modern amenities and attractions.
- **France**: Paris is the most tourist-friendly city with extensive tourist infrastructure.
- **Canada**: Cities like Vancouver or Toronto are quite welcoming to tourists.
So, based on these considerations, Paris in France would be the most tourist-friendly city.
To process each country individually, we can use NumPy’s vectorize
function. This function transforms city_rec
so that it can accept arrays (like lists or NumPy arrays) and apply the function element-wise.
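Before applying this to our API function, here is a tiny toy example (an illustration only; it makes no API calls) of what np.vectorize does:

# Toy demonstration: np.vectorize applies a plain Python function element-wise
def shout(word):
    return word.upper() + "!"

shout_vec = np.vectorize(shout)
print(shout_vec(["hello", "world"]))  # ['HELLO!' 'WORLD!']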
# Vectorize the function
city_rec_vec = np.vectorize(city_rec)

# Apply the function to each country
country_df["city_rec"] = city_rec_vec(country_df["country"])
country_df
|   | country | city_rec |
|---|---|---|
| 0 | Nigeria | Lagos is often considered the most tourist-fri... |
| 1 | Chile | Santiago, the capital of Chile, is often consi... |
| 2 | France | Paris is often considered the most tourist-fri... |
| 3 | Canada | One of the most tourist-friendly cities in Can... |
This code will output a DataFrame with a new column city_rec
containing city recommendations corresponding to each country.
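As an aside, pandas offers a similar tool of its own: the .apply method on a Series also calls a function once per element. Assuming city_rec is defined as above, the following one-liner would produce the same column (and would likewise issue one API call per row):

# Equivalent approach using pandas .apply instead of np.vectorize
country_df["city_rec"] = country_df["country"].apply(city_rec)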
24.9 Practice Q: Get local dishes
Create a function called get_local_dishes
that takes a country name as input and returns some of the most famous local dishes from that country. Then, vectorize this function and apply it to the country_df
DataFrame to add a column with local dish recommendations for each country.
# Your code here
def get_local_dishes(country):
= f"What are some of the most famous local dishes from {country}?"
prompt return llm_chat(prompt)
# Vectorize the function
= np.vectorize(get_local_dishes)
get_local_dishes_vec
# Apply to the DataFrame
'local_dishes'] = get_local_dishes_vec(country_df['country'])
country_df[ country_df
|   | country | city_rec | local_dishes |
|---|---|---|---|
| 0 | Nigeria | Lagos is often considered the most tourist-fri... | Nigeria is known for its rich and diverse culi... |
| 1 | Chile | Santiago, the capital of Chile, is often consi... | Chile boasts a rich culinary tradition influen... |
| 2 | France | Paris is often considered the most tourist-fri... | France is renowned for its diverse and rich cu... |
| 3 | Canada | One of the most tourist-friendly cities in Can... | Canada has a rich culinary heritage influenced... |
24.10 Automated Summary: Movies Dataset
In this example, we’ll use the movies dataset from vega_datasets
to generate automated summaries for each movie. We’ll convert each movie’s data into a dictionary and use it as input for the LLM to generate a one-paragraph performance summary.
First, let’s load the movies dataset and preview the first few rows:
import pandas as pd
import vega_datasets as vd
# Load the movies dataset
movies = vd.data.movies().head()  # Using only the first 5 rows to conserve API credits
movies
|   | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
Next, we’ll convert each row of the DataFrame into a dictionary. This will be useful for passing the data to the LLM.
# Convert each movie's data into a dictionary
="records") movies.to_dict(orient
[{'Title': 'The Land Girls',
'US_Gross': 146083.0,
'Worldwide_Gross': 146083.0,
'US_DVD_Sales': nan,
'Production_Budget': 8000000.0,
'Release_Date': 'Jun 12 1998',
'MPAA_Rating': 'R',
'Running_Time_min': nan,
'Distributor': 'Gramercy',
'Source': None,
'Major_Genre': None,
'Creative_Type': None,
'Director': None,
'Rotten_Tomatoes_Rating': nan,
'IMDB_Rating': 6.1,
'IMDB_Votes': 1071.0},
{'Title': 'First Love, Last Rites',
'US_Gross': 10876.0,
'Worldwide_Gross': 10876.0,
'US_DVD_Sales': nan,
'Production_Budget': 300000.0,
'Release_Date': 'Aug 07 1998',
'MPAA_Rating': 'R',
'Running_Time_min': nan,
'Distributor': 'Strand',
'Source': None,
'Major_Genre': 'Drama',
'Creative_Type': None,
'Director': None,
'Rotten_Tomatoes_Rating': nan,
'IMDB_Rating': 6.9,
'IMDB_Votes': 207.0},
{'Title': 'I Married a Strange Person',
'US_Gross': 203134.0,
'Worldwide_Gross': 203134.0,
'US_DVD_Sales': nan,
'Production_Budget': 250000.0,
'Release_Date': 'Aug 28 1998',
'MPAA_Rating': None,
'Running_Time_min': nan,
'Distributor': 'Lionsgate',
'Source': None,
'Major_Genre': 'Comedy',
'Creative_Type': None,
'Director': None,
'Rotten_Tomatoes_Rating': nan,
'IMDB_Rating': 6.8,
'IMDB_Votes': 865.0},
{'Title': "Let's Talk About Sex",
'US_Gross': 373615.0,
'Worldwide_Gross': 373615.0,
'US_DVD_Sales': nan,
'Production_Budget': 300000.0,
'Release_Date': 'Sep 11 1998',
'MPAA_Rating': None,
'Running_Time_min': nan,
'Distributor': 'Fine Line',
'Source': None,
'Major_Genre': 'Comedy',
'Creative_Type': None,
'Director': None,
'Rotten_Tomatoes_Rating': 13.0,
'IMDB_Rating': nan,
'IMDB_Votes': nan},
{'Title': 'Slam',
'US_Gross': 1009819.0,
'Worldwide_Gross': 1087521.0,
'US_DVD_Sales': nan,
'Production_Budget': 1000000.0,
'Release_Date': 'Oct 09 1998',
'MPAA_Rating': 'R',
'Running_Time_min': nan,
'Distributor': 'Trimark',
'Source': 'Original Screenplay',
'Major_Genre': 'Drama',
'Creative_Type': 'Contemporary Fiction',
'Director': None,
'Rotten_Tomatoes_Rating': 62.0,
'IMDB_Rating': 3.4,
'IMDB_Votes': 165.0}]
Let’s store this new column in the DataFrame:
"full_dict"] = movies.to_dict(orient="records")
movies[ movies
|   | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes | full_dict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 | {'Title': 'The Land Girls', 'US_Gross': 146083... |
| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 | {'Title': 'First Love, Last Rites', 'US_Gross'... |
| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 | {'Title': 'I Married a Strange Person', 'US_Gr... |
| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN | {'Title': 'Let's Talk About Sex', 'US_Gross': ... |
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 | {'Title': 'Slam', 'US_Gross': 1009819.0, 'Worl... |
Now, let’s define a function movie_performance
that takes a movie’s data dictionary, constructs a prompt, and calls the llm_chat
function to get a summary:
def movie_performance(movie_data):
= f"Considering the following data on this movie {movie_data}, provide a one-paragraph summary of its performance for my report."
prompt return llm_chat(prompt)
We’ll vectorize this function so we can apply it to the entire full_dict
column:
import numpy as np
# Vectorize the function to apply it to the DataFrame
movie_performance_vec = np.vectorize(movie_performance)
Let’s test our function with an example:
# Example usage
"Name: Kene's Movie, Sales: 100,000 USD") movie_performance(
'"Kene\'s Movie" has achieved notable commercial success, generating sales of $100,000. This impressive revenue indicates a strong reception among audiences and suggests effective marketing and distribution strategies. The financial performance of the film not only highlights its popularity but also reflects potential for continued interest, possibly leading to further opportunities in ancillary markets. Overall, Kene\'s Movie\'s sales performance positions it as a noteworthy contender within its genre, warranting further analysis for future projects.'
Finally, we’ll apply the vectorized function to generate summaries for each movie:
# Generate summaries for each movie
"llm_summary"] = movie_performance_vec(movies["full_dict"]) movies[
You can now save the DataFrame with the generated summaries to a CSV file:
# Save the results to a CSV file
"movies_output.csv", index=False) movies.to_csv(
This approach allows you to generate detailed summaries for each movie based on its full set of data, which can be incredibly useful for automated reporting and data analysis.
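One practical caveat: a vectorized LLM function issues one API call per row, so larger tables can hit rate limits. A minimal, hypothetical retry wrapper (llm_chat_with_retry is not part of the lesson) might look like this:

import time
import openai

def llm_chat_with_retry(message, retries=3):
    # Retry the API call with a simple exponential backoff on rate-limit errors
    for attempt in range(retries):
        try:
            return llm_chat(message)
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, then 2s, then 4s
    raise RuntimeError("Exceeded retry limit for the OpenAI API")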
24.11 Practice Q: Weather Summary
Using the first 5 rows of the seattle_weather
dataset from vega_datasets
, create a function that takes all weather columns for a particular day and generates a summary of the weather conditions for that day. The function should use the LLM to generate a one-paragraph summary for a report, considering the data provided. Store the generated summaries in a column called weather_summary.
weather = vd.data.seattle_weather().head()
weather
|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
# Your code here
# Step 1: Load the dataset
weather = vd.data.seattle_weather().head()

# Step 2: Convert each row into a dictionary and add it to the DataFrame
weather_dicts = weather.to_dict(orient="records")
weather["full_dict"] = weather_dicts

# Step 3: Define the function to generate summaries
def weather_summary(weather_data):
    prompt = (
        f"Considering the following weather data: {weather_data}, "
        "provide a one-paragraph summary of the weather conditions for my report."
    )
    return llm_chat(prompt)

# Step 4: Vectorize the function
weather_summary_vec = np.vectorize(weather_summary)

# Step 5: Apply the function to generate summaries
weather["weather_summary"] = weather_summary_vec(weather["full_dict"])
24.12 Wrap-up
In this tutorial, we learned the basics of using OpenAI’s LLMs in Python for text generation, created helper functions, and applied these functions to datasets using vectorization.
In the next lesson, we’ll look at structured outputs that allow us to specify the format of the response we want from the LLM. We’ll use this to extract structured data from unstructured text, a common task in data analysis.