Exercises

Note

Optional: Use Google Colab to evaluate Python, R, and/or Julia code generated by the LLM:

  1. Select your programming language under Runtime > Change runtime type > [Python 3 | R | Julia].

  2. Paste the code in a cell

  3. Click the Run button.

    (NB: A Google account is required for Google Colab access.)

Hands-On Exercise: Screening Patient Data for Clinical Trial Eligibility

  1. Search clinicaltrials.gov for an active kidney-related trial.
  2. Copy the trial's Participation Criteria.
  3. Prompt an LLM to write SQL (or Python, R, Julia, SPSS, Stata, or another language of your choice) that identifies eligible patients using the relevant columns in Epic's Clarity database schema.
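
As a rough idea of what the generated screening code might look like, the sketch below runs an eligibility query against an in-memory SQLite database from Python. The tables (patients, labs), their columns, the sample rows, and the criteria (adult, most recent eGFR between 20 and 59, not on dialysis) are all hypothetical stand-ins; they are not real Clarity tables or the criteria of any particular trial.

```python
import sqlite3

# Hypothetical, simplified tables; NOT real Clarity table or column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age_years INTEGER, on_dialysis INTEGER);
CREATE TABLE labs (patient_id INTEGER, lab_name TEXT, value REAL, result_date TEXT);
INSERT INTO patients VALUES (1, 54, 0), (2, 17, 0), (3, 66, 1), (4, 71, 0), (5, 60, 0);
INSERT INTO labs VALUES
  (1, 'egfr', 42.0, '2024-01-10'),
  (1, 'egfr', 38.0, '2024-03-02'),
  (2, 'egfr', 40.0, '2024-02-11'),
  (3, 'egfr', 12.0, '2024-02-20'),
  (4, 'egfr', 75.0, '2024-01-05'),
  (5, 'egfr', 55.0, '2024-02-28');
""")

# Illustrative criteria only: adults, most recent eGFR 20-59, not on dialysis.
# Dates are ISO strings, so MAX(result_date) picks the latest result.
ELIGIBILITY_SQL = """
SELECT p.patient_id
FROM patients AS p
JOIN labs AS l ON l.patient_id = p.patient_id
WHERE l.lab_name = 'egfr'
  AND l.result_date = (
        SELECT MAX(result_date) FROM labs
        WHERE patient_id = p.patient_id AND lab_name = 'egfr')
  AND p.age_years >= 18
  AND l.value BETWEEN 20 AND 59
  AND p.on_dialysis = 0
ORDER BY p.patient_id;
"""

eligible_ids = [row[0] for row in conn.execute(ELIGIBILITY_SQL)]
print(eligible_ids)
```

Testing the query on a handful of rows whose eligibility you can work out by hand is a quick way to check whether the LLM translated each criterion correctly.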

Example Actively Recruiting Trial Participation Criteria

Optional Extensions

Prompt an LLM to create:

  • Patient-Screener Web App

  • CONSORT Diagram of Patient Eligibility Criteria

  • Synthetic Dataset to Test Code
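
One possible shape for the last two extensions, sketched with the Python standard library: generate a small synthetic cohort, then apply exclusion criteria in order and record how many patients each step removes, which are the counts a CONSORT-style diagram needs. The field names, value ranges, and criteria are illustrative assumptions, and every record is randomly generated.

```python
import random

def make_synthetic_patients(n, seed=0):
    """Generate fake patient records; every value is random, not real data."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": i,
            "age_years": rng.randint(10, 90),
            "egfr": round(rng.uniform(5, 110), 1),
            "on_dialysis": rng.random() < 0.1,
        }
        for i in range(1, n + 1)
    ]

# Ordered (label, exclusion test) pairs; the criteria are placeholders.
EXCLUSIONS = [
    ("Age under 18", lambda p: p["age_years"] < 18),
    ("eGFR outside 20-59", lambda p: not 20 <= p["egfr"] <= 59),
    ("On dialysis", lambda p: p["on_dialysis"]),
]

def consort_counts(patients):
    """Apply exclusions in order; return per-step counts and who remains."""
    remaining = list(patients)
    steps = []
    for label, is_excluded in EXCLUSIONS:
        kept = [p for p in remaining if not is_excluded(p)]
        steps.append((label, len(remaining) - len(kept)))
        remaining = kept
    return steps, remaining

patients = make_synthetic_patients(200, seed=42)
steps, eligible = consort_counts(patients)

print(f"Assessed for eligibility: {len(patients)}")
for label, n_excluded in steps:
    print(f"  Excluded ({label}): {n_excluded}")
print(f"Eligible: {len(eligible)}")
```

Because exclusions are applied sequentially, each patient is counted at the first criterion they fail, so the excluded counts and the eligible count always add up to the number assessed.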


Additional Hands-On Exercises

Exercise 1: Code Explanation

Prompt an LLM for a detailed explanation of the code, including the libraries or packages it uses, the data types it requires, and the outputs of its functions.

Options:

  1. Use your own code.

  2. Search for and copy code from GitHub.

GitHub Search Documentation

Example Searches:

  • language:Python clinical trial

  • language:R renal

  • language:SAS clinical

  • language:Python nephrology
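
If you want to share or bookmark one of these searches, a query string can be turned into a link. The helper below is a small sketch; the function name is made up, and it assumes GitHub's web search accepts the query in the q parameter and the result kind in the type parameter.

```python
from urllib.parse import urlencode

def github_search_url(query, search_type="repositories"):
    """Build a GitHub web-search link (URL pattern is an assumption)."""
    return "https://github.com/search?" + urlencode({"q": query, "type": search_type})

for query in ["language:Python clinical trial", "language:R renal"]:
    print(github_search_url(query))
```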

Extension: Prompt an LLM for ways to improve the code, for example by reducing run or compilation time or by improving readability.

Exercise 2: Data Reporting, Visualization, and Predictive Modeling

Copy the clinical trial's variables and data descriptions below into an LLM, then prompt it to:

  1. Create a JAMA-style Table 1 in Markdown.

  2. Explore and visualize one or more important relationships between the predictors and remission.

  3. Provide and explain code for building multiple predictive models for remission (remission = 1).

  4. Optional: Download the dataset and evaluate the code in Google Colab.
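
For item 1, the LLM's code will probably resemble the following sketch, which summarizes each variable as median (IQR) by remission group and prints a Markdown table. It uses only the standard library; the eight rows are made up for illustration and are not taken from the dataset, and median (IQR) is just one of several summaries a Table 1 might report.

```python
from statistics import median, quantiles

# Tiny made-up rows for illustration only; None marks a missing value.
rows = [
    {"remission": 1, "hgb": 13.1, "alb": 4.2},
    {"remission": 1, "hgb": 12.4, "alb": 3.9},
    {"remission": 1, "hgb": None, "alb": 4.4},
    {"remission": 1, "hgb": 14.0, "alb": 4.0},
    {"remission": 0, "hgb": 10.2, "alb": 3.1},
    {"remission": 0, "hgb": 11.5, "alb": 3.6},
    {"remission": 0, "hgb": 9.8, "alb": 2.9},
    {"remission": 0, "hgb": 12.0, "alb": None},
]

def median_iqr(values):
    """Format non-missing values as 'median (Q1-Q3)'."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    return f"{median(present):.1f} ({q1:.1f}-{q3:.1f})"

def table1_markdown(rows, variables, group_col="remission"):
    """Build a Markdown table with one column per group."""
    groups = sorted({r[group_col] for r in rows})
    header = ["Variable"] + [
        f"{group_col} = {g} (n = {sum(r[group_col] == g for r in rows)})"
        for g in groups
    ]
    lines = ["| " + " | ".join(header) + " |", "|" + " --- |" * len(header)]
    for var in variables:
        cells = [
            median_iqr([r[var] for r in rows if r[group_col] == g])
            for g in groups
        ]
        lines.append("| " + " | ".join([var] + cells) + " |")
    return "\n".join(lines)

print(table1_markdown(rows, ["hgb", "alb"]))
```

Missing values are dropped per variable before summarizing, so it is worth asking the LLM to also report how many values were missing in each group.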

Variables
days_of_life - Age in days. Numeric. Range: 1207-32356. 1 missing value.
plt - Platelet Count. Numeric. Range: 11-1114. 4 missing values.
mpv - Mean Platelet Volume. Numeric. Range: 5.3-13.5. 21 missing values.
un - Blood Urea Nitrogen. Numeric. Range: 2-118. 53 missing values.
wbc - White Blood Cell Count. Numeric. Range: 0.7-33.5. No missing values.
hgb - Hemoglobin. Numeric. Range: 4.5-18.6. 4 missing values.
hct - Hematocrit. Numeric. Range: 13.7-55.2. 3 missing values.
rbc - Red Blood Cell Count. Numeric. Range: 1.57-7.04. 3 missing values.
mcv - Mean Corpuscular (RBC) Volume. Numeric. Range: 56.5-124. 3 missing values.
mch - Mean Corpuscular (RBC) Hemoglobin. Numeric. Range: 16.7-42.3. 7 missing values.
mchc - Mean Corpuscular (RBC) Hemoglobin per Cell. Numeric. Range: 28.2-38.0. 7 missing values.
rdw - Red cell Distribution Width. Numeric. Range: 11.3-39.7. 3 missing values.
neut_percent - Percent of Neutrophils in WBC count. Numeric. Range: 17-98.1. No missing values.
lymph_percent - Percent of Lymphocytes in WBC count. Numeric. Range: 1-67.9. No missing values.
mono_percent - Percent of Monocytes in WBC count. Numeric. Range: 0-30.3. No missing values.
eos_percent - Percent of Eosinophils in WBC count. Numeric. Range: 0.5-29.3. 6 missing values.
baso_percent - Percent of Basophils in WBC count. Numeric. Range: 0.2-5.3. 6 missing values.
sod - Sodium. Numeric. Range: 116-151. No missing values.
pot - Potassium. Numeric. Range: 2.6-10.1. 1 missing value.
chlor - Chloride. Numeric. Range: 83-126. No missing values.
co2 - Bicarbonate (CO2). Numeric. Range: 12-40. 5 missing values.
creat - Creatinine. Numeric. Range: 0.2-8.4. No missing values.
gluc - Glucose. Numeric. Range: 41-486. No missing values.
cal - Calcium. Numeric. Range: 6.5-11.8. 1 missing value.
prot - Protein. Numeric. Range: 2.9-10. No missing values.
alb - Albumin. Numeric. Range: 1.2-5.5. No missing values.
ast - Aspartate Transaminase. Numeric. Range: 5-7765. No missing values.
alt - Alanine Transaminase. Numeric. Range: 1-10666. 18 missing values.
alk - Alkaline Phosphatase. Numeric. Range: 13-1938. No missing values.
tbil - Total Bilirubin. Numeric. Range: 0.09-27. No missing values.
active - Active Inflammation despite Thiopurines for > 12 weeks. Numeric. Range: 0-1. No missing values.
remission - Remission of Inflammation after Thiopurines for > 12 weeks. Numeric. Range: 0-1. No missing values.
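
The ranges listed above can double as sanity checks when you evaluate LLM-generated code, for instance to confirm that a synthetic or transformed dataset still looks plausible. The sketch below checks a record against a subset of those ranges; the function name and the sample records are hypothetical.

```python
# Ranges copied from the variable list above (subset shown).
CODEBOOK_RANGES = {
    "sod": (116, 151),
    "pot": (2.6, 10.1),
    "creat": (0.2, 8.4),
    "remission": (0, 1),
}

def find_problems(row):
    """Return a list of (column, issue) pairs for one record."""
    problems = []
    for col, (low, high) in CODEBOOK_RANGES.items():
        value = row.get(col)
        if value is None or value == "":
            problems.append((col, "missing"))
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append((col, f"not numeric: {value!r}"))
            continue
        if not low <= number <= high:
            problems.append((col, f"out of range: {number}"))
    return problems

# Hypothetical records as they might arrive from csv.DictReader (all strings).
good = {"sod": "140", "pot": "4.1", "creat": "0.9", "remission": "1"}
bad = {"sod": "14", "pot": "", "creat": "abc", "remission": "1"}
print(find_problems(good))
print(find_problems(bad))
```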

Download CSV

Description

Data Codebook

Paper

Citation: Higgins P (2023). medicaldata: Data Package for Medical Datasets. https://higgi13425.github.io/medicaldata/, https://github.com/higgi13425/medicaldata/.

Extension: Try using Paper Banana to create a visualization or CONSORT diagram.

Exercise 3: Munging Messy Data

Prompt an LLM to clean and validate the messy_aki dataset in preparation for analysis and modeling.

  1. Download the messy_aki dataset from Dr. Peter Higgins’ {medicaldata} R package.
  2. Prompt an LLM to write code to clean the data.
  3. Evaluate the results and adjust your prompt to address any missed issues.
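
Cleaning code for a dataset like this typically comes down to a few small normalizing functions. The sketch below shows the general pattern with the standard library. The column names (sex, admit_date, creatinine), the missing-value tokens, and the accepted date formats are assumptions for illustration, not the actual contents of messy_aki; in particular, it assumes slash-separated dates are month/day/year.

```python
import re
from datetime import datetime

# Assumed conventions; adjust after inspecting the real data.
MISSING_TOKENS = {"", "na", "n/a", "null", "missing", "unknown"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")
SEX_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}

def clean_text(value):
    """Strip whitespace and map common missing-value tokens to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in MISSING_TOKENS else text

def clean_number(value):
    """Pull the first numeric token out of strings like ' 1.4 mg/dL'."""
    text = clean_text(value)
    if text is None:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def clean_date(value):
    """Try each known format; return an ISO date string or None."""
    text = clean_text(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_sex(value):
    """Normalize free-text sex labels; unrecognized labels become None."""
    text = clean_text(value)
    return SEX_MAP.get(text.lower()) if text else None

def clean_record(raw):
    """Clean one row; the column names here are hypothetical."""
    return {
        "sex": clean_sex(raw.get("sex")),
        "admit_date": clean_date(raw.get("admit_date")),
        "creatinine": clean_number(raw.get("creatinine")),
    }

print(clean_record({"sex": " F ", "admit_date": "03/07/2021", "creatinine": " 1.4 mg/dL"}))
```

Values that cannot be parsed become None rather than raising an error, so counting the None values per column before and after cleaning is a simple way to spot issues the prompt missed.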

Data: messy_aki

Optional Extension: Use the LLM to visualize eGFR trends for each patient over time.
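
Before plotting, the data usually needs to be grouped by patient and ordered in time. The sketch below shows that step with the standard library and prints a text summary in place of a chart; the records and the column names (patient_id, day, egfr) are hypothetical, and a plotting library would normally take the grouped series from here.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per patient per measurement day.
records = [
    {"patient_id": "A", "day": 0, "egfr": 62.0},
    {"patient_id": "A", "day": 2, "egfr": 41.0},
    {"patient_id": "A", "day": 5, "egfr": 55.0},
    {"patient_id": "B", "day": 1, "egfr": 30.0},
    {"patient_id": "B", "day": 0, "egfr": 48.0},
]

def egfr_series_by_patient(records):
    """Group (day, egfr) points by patient and sort each series by day."""
    series = defaultdict(list)
    for r in records:
        series[r["patient_id"]].append((r["day"], r["egfr"]))
    return {pid: sorted(points) for pid, points in series.items()}

series = egfr_series_by_patient(records)
for pid, points in series.items():
    first, last = points[0][1], points[-1][1]
    trend = " -> ".join(f"{value:g}" for _, value in points)
    print(f"{pid}: {trend} (change {last - first:+g})")
```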