
Retrieving the Meeting Data

Posted on: November 4, 2023 at 04:58 PM

I’m starting with the basic fetching and summarization module that will pull minutes from the Cincinnati City Council meeting items on Archive.org. I’ll be using Python to write the scripts that fetch and summarize the minutes, with ChatGPT and Phind as pair programmers. I don’t have a ton of Python experience, so I’ll be leaning heavily on the output from these LLMs to help me produce the necessary code.

Let’s get started.

Jeremy Howard from FastAI put together a nice set of custom instructions for ChatGPT that I’ve found helpful to reduce noise and improve answers, specifically around coding projects:

You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. If you think there might not be a correct answer, you say so.
Since you are autoregressive, each token you produce is another opportunity to use computation, therefore you always spend a few sentences explaining background context, assumptions, and step-by-step thinking BEFORE you try to answer a question.
Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. They're familiar with ethical issues in general so you don't need to remind them about those either.
Don't be verbose in your answers, but do provide details and examples where it might help the explanation. When showing code, minimize vertical space.

These instructions cut a lot of the boilerplate that comes with vanilla ChatGPT and also help reduce white space in code examples.

I plan to prototype the minutes fetching in a Jupyter notebook hosted on GitHub Codespaces. I’ll ask ChatGPT to use Archive.org’s Python package to search for and fetch the relevant files with the following prompts, first describing the overall scope of this module and then providing more clarifying details.

I'm working on a project where I need to fetch Cincinnati City Council meeting minutes, details, and metadata from Archive.org. I need these meeting minutes saved in a folder alongside the other metadata. I will eventually place all of this information in a SQL database, so please save the output in an accessible format to easily enable this. We will begin by grabbing all Cincinnati City Council meetings for 2023 up until November 1, 2023.

Upon retrieval, each meeting should be placed in a new, unique folder with the complete minutes and metadata. 

We will be using the Archive.org Python package to fetch the minutes. The minutes are saved as a closed-captioned txt file. We will need to use the package's search functionality to find the correct account and then filter out any meetings that do not have the associated closed captioning.

An example URL of a meeting is: https://archive.org/details/10230322-coun The slug of the URL represents the date. For this particular meeting the date is March 22, 2023. Another example is https://archive.org/details/10230802-coun and the date for this is August 2, 2023. I'm not sure what the `10` at the beginning of the slug represents, but the overall pattern is `10YYMMDD-coun`

Each meeting that has a corresponding closed caption file has it stored in this URL structure: https://ia802704.us.archive.org/19/items/10230802-coun/10230802Coun.disc.txt

The following URL correctly filters the City Council meetings from 2023. Reference this in our API request using the Archive.org Python package: https://archive.org/search?query=subject%3A%22Cincinnati+City+Council%22&and%5B%5D=year%3A%222023%22

Generate the necessary code to:
* Pull all Cincinnati City Council meeting minutes using the captions to a unique local directory
* Add relevant and important metadata, such as: URL, time, date, anything else that might be useful
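
Before handing this prompt off, a quick sanity check on the identifier format. This is my own minimal sketch (not part of the prompt, and it assumes the leading 10 really is a constant prefix) that parses a 10YYMMDD-coun slug into a real date:

from datetime import datetime

def parse_meeting_date(identifier):
    # "10230322-coun" -> 2023-03-22, assuming the leading "10" is constant
    yymmdd = identifier.split('-')[0][2:]  # "230322"
    return datetime.strptime(yymmdd, '%y%m%d').date()

print(parse_meeting_date('10230322-coun'))  # 2023-03-22
print(parse_meeting_date('10230802-coun'))  # 2023-08-02

Both example slugs parse to the dates I expect, so the 10YYMMDD-coun assumption seems safe enough to build on.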

ChatGPT returned the following code, but it’s not exactly runnable just yet:

import os
import json
from internetarchive import search_items, get_item

# Function to download a single meeting's minutes and metadata
def download_meeting(slug, directory):
    item = get_item(slug)
    files = item.files

    # Create a directory for the meeting
    os.makedirs(directory, exist_ok=True)

    # Find the closed captioned txt file
    caption_file = next((f for f in files if f['name'].endswith('.disc.txt')), None)
    if caption_file:
        item.download(files=[caption_file['name']], destdir=directory, no_directory=True)

        # Save the metadata
        metadata = {
            'url': f'https://archive.org/details/{slug}',
            'time': item.metadata['publicdate'],
            'date': slug[2:10], # Assuming the '10' is a constant prefix and can be skipped
            # Add any other relevant metadata here
        }
        with open(os.path.join(directory, 'metadata.json'), 'w') as meta_file:
            json.dump(metadata, meta_file)

# Function to search and download all meetings
def download_all_meetings(base_url, query, end_date):
    # Perform the search
    search_results = search_items(query)
    for result in search_results:
        # Filter based on the date
        date_slug = result['identifier'][2:10]
        if date_slug <= end_date:
            directory_name = f"Council_Meetings_{date_slug}"
            download_meeting(result['identifier'], directory_name)

# Your specific query parameters
base_url = 'https://archive.org/details/'
query = 'subject:"Cincinnati City Council" AND year:"2023"'
end_date = '20231101' # Format YYYYMMDD

download_all_meetings(base_url, query, end_date)

There are a few errors here. The first is the end_date check: the date_slug sliced from the identifier isn’t in the same format as end_date, so the comparison doesn’t do what it’s meant to. Since our query already limits results to 2023, we can remove the conditional and change the code to:

# Function to search and download all meetings
def download_all_meetings(base_url, query, end_date):
    # Perform the search
    search_results = search_items(query)
    for result in search_results:
        # Filter based on the date
        date_slug = result['identifier'][2:10]
        directory_name = f"Council_Meetings_{date_slug}"
        download_meeting(result['identifier'], directory_name)
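
As an aside, here’s why the original date filter silently failed. The slice of the identifier isn’t a YYYYMMDD string at all, so the comparison against end_date can never pass for a 2023 meeting:

identifier = '10230223-coun'
date_slug = identifier[2:10]    # '230223-c', i.e. six date digits plus '-c'
print(date_slug <= '20231101')  # False, so every 2023 meeting would be skipped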

This now pulls down all of 2023’s meetings with the following metadata:

{"url": "https://archive.org/details/10230223-coun", "time": "2023-02-23 21:00:39", "date": "230223-c"}

There are a few things I don’t like here. The date isn’t stored in a proper format, and I suspect there’s more useful metadata beyond the url, time, and date. Let’s fix these things. I’m going back to ChatGPT to help me address them.

I’ll first tell ChatGPT that I removed the conditional and then ask it to print out all of the possible metadata we could use for each meeting so I can decide what to keep in the generated metadata.json file.

I removed the conditional in download_all_meetings because our query specifies 2023. We won't need to double check this. I also would like to examine all of the metadata in each meeting so I can decide what to keep. Please write a function that will output all metadata for each meeting.
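
For reference, the inspection helper ended up being only a few lines. A minimal sketch (the function name here is my own) using the package's get_item looks like this:

import json
from internetarchive import get_item

def print_meeting_metadata(slug):
    # Dump every metadata field Archive.org stores for this item
    print(json.dumps(get_item(slug).metadata, indent=4, ensure_ascii=False))

print_meeting_metadata('10230802-coun')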

After a few quick iterations, ChatGPT and I came up with the following Python code to download all meeting minutes from 2023 and save each meeting’s metadata alongside it in a metadata.json file.

import os
import json
from datetime import datetime
from internetarchive import search_items, get_item, download

# Converts dates from 'YYYY-MM-DD HH:MM:SS' to ISO 8601 'YYYY-MM-DDTHH:MM:SSZ'
def convert_date_to_iso(date_str):
    dt = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S')
    return dt.isoformat() + 'Z'

def download_meeting(slug, output_directory):
    metadata = get_item(slug).metadata
    
    # Convert dates to ISO 8601 format
    if 'publicdate' in metadata:
        metadata['publicdate'] = convert_date_to_iso(metadata['publicdate'])
    if 'addeddate' in metadata:
        metadata['addeddate'] = convert_date_to_iso(metadata['addeddate'])
    
    # The directory for this meeting
    meeting_dir = os.path.join(output_directory, slug)
    os.makedirs(meeting_dir, exist_ok=True)
    
    # Define the caption file name
    cc_file_name = f"{slug}.disc.txt"
    
    # Check if closed captioning is available
    if metadata.get('closed_captioning') == 'yes':
        # Download closed caption file
        files = download(slug, destdir=meeting_dir, no_directory=True, glob_pattern="*.txt")
        
        # Rename the closed caption file if necessary
        downloaded_files = os.listdir(meeting_dir)
        for file in downloaded_files:
            if file.endswith(".txt") and not file.endswith(cc_file_name):
                os.rename(os.path.join(meeting_dir, file), os.path.join(meeting_dir, cc_file_name))
                break

    # Save metadata to a JSON file
    with open(os.path.join(meeting_dir, 'metadata.json'), 'w', encoding='utf-8') as f:
        json.dump(metadata, f, ensure_ascii=False, indent=4)

# Function to search and download all meetings
def download_all_meetings(base_url, query):
    # Perform the search
    search_results = search_items(query)
    for result in search_results:
        # Build the directory name from the date portion of the identifier
        date_slug = result['identifier'][2:10]
        directory_name = f"Council_Meetings_{date_slug}"
        download_meeting(result['identifier'], directory_name)

# Your specific query parameters
base_url = 'https://archive.org/details/'
query = 'subject:"Cincinnati City Council" AND year:"2023"'

download_all_meetings(base_url, query)

This works great, but it unnecessarily nests the files, e.g. /Council_Meetings_230125-c/10230125-coun. I want each file placed directly in its correct directory, Council_Meetings_230125-c; there's no need for the 10230125-coun directory.

I pasted the above verbatim into ChatGPT and had it fix the download_meeting method:

def download_meeting(slug, output_directory):
    metadata = get_item(slug).metadata
    
    # Convert dates to ISO 8601 format
    if 'publicdate' in metadata:
        metadata['publicdate'] = convert_date_to_iso(metadata['publicdate'])
    if 'addeddate' in metadata:
        metadata['addeddate'] = convert_date_to_iso(metadata['addeddate'])
    
    # Define the directory for this meeting based on the slug
    meeting_dir = os.path.join(output_directory, f'Council_Meetings_{slug[2:8]}-c')
    os.makedirs(meeting_dir, exist_ok=True)
    
    # Check if closed captioning is available
    if metadata.get('closed_captioning') == 'yes':
        # Download closed caption file directly into the meeting directory
        download(slug, destdir=meeting_dir, no_directory=True, glob_pattern="*.txt")

    # Save metadata to a JSON file in the meeting directory
    with open(os.path.join(meeting_dir, 'metadata.json'), 'w', encoding='utf-8') as f:
        json.dump(metadata, f, ensure_ascii=False, indent=4)
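
To sanity-check the fix on a single meeting before re-running the full search, this can be called from the notebook with one of the example slugs from earlier, passing the current directory as the output root:

# Files and metadata.json land in ./Council_Meetings_230802-c
download_meeting('10230802-coun', '.')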

I performed a few more back-and-forth iterations and came up with the following code to pull all meetings with captions and their relevant metadata:

from internetarchive import search_items, get_item, download
import os
import json
from datetime import datetime

def fetch_council_meetings(year):
    search_query = f'subject:"Cincinnati City Council" AND year:"{year}"'
    for result in search_items(search_query):
        item_id = result['identifier']
        download_and_save_meeting(item_id, year)

def download_and_save_meeting(item_id, year):
    item = get_item(item_id)
    metadata = item.metadata

    # Extract the date from the identifier and create the directory
    date_part = item_id.split('-coun')[0][-6:]
    dir_name = f'Council_Meetings_{date_part}-c'
    nested_dir_path = os.path.join(dir_name, dir_name)

    # Check if the nested directory exists, if not create it
    if not os.path.exists(nested_dir_path):
        os.makedirs(nested_dir_path)

    # Download the closed caption file into the nested directory
    cc_glob_pattern = f'*{date_part}Coun.disc.txt'
    download(item_id, destdir=nested_dir_path, no_directory=True, glob_pattern=cc_glob_pattern)

    # Refine metadata and convert date and time to machine-readable format
    refined_metadata = {
        'identifier': metadata['identifier'],
        'url': f'https://archive.org/details/{item_id}',
        'date': metadata['date'],
        'time': datetime.strptime(metadata['publicdate'], '%Y-%m-%d %H:%M:%S').isoformat(),
        'title': metadata.get('title', ''),
        'creator': metadata.get('creator', ''),
        'subjects': metadata.get('subject', []),
        'description': metadata.get('description', ''),
        'language': metadata.get('language', 'eng'),
        'license_url': metadata.get('licenseurl', ''),
        'closed_captioning': metadata.get('closed_captioning', 'no')
    }

    # Save refined metadata into the nested directory
    metadata_path = os.path.join(nested_dir_path, 'metadata.json')
    with open(metadata_path, 'w', encoding='utf-8') as f:
        json.dump(refined_metadata, f, ensure_ascii=False, indent=4)

fetch_council_meetings('2023')

This will now pull all meetings from 2023 and save them into specific directories that I can process later.

This is a great start! From here we’ll want to move this out of the Jupyter notebook and into a script that can be run with a cron job to periodically check when new meetings are added. I’ll use ChatGPT to help me further refine this as my next step.
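
As a rough sketch of where that’s headed (the file name, path, and schedule below are placeholders, not decisions I’ve made yet):

# fetch_meetings.py: a hypothetical wrapper, assuming the functions above live in this file
import sys

if __name__ == '__main__':
    # Allow the year to be passed on the command line, defaulting to 2023
    year = sys.argv[1] if len(sys.argv) > 1 else '2023'
    fetch_council_meetings(year)

# A crontab entry like this would then check for new meetings every Monday at 6 AM:
# 0 6 * * 1 /usr/bin/python3 /path/to/fetch_meetings.py 2023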