Section 1: Data Science Basics for Pharmacists
A foundational introduction to the core concepts of data science and machine learning, translated into a clinical context. We’ll demystify terms like regression, classification, and clustering with practical, pharmacy-centric examples.
Translating Your Clinical Intuition into Scalable, Data-Driven Algorithms.
24.1.1 The “Why”: From Evidence-Based Practice to Data-Driven Precision
As a pharmacist, your entire career has been built upon a foundation of evidence-based medicine. You are trained to critically evaluate clinical trials, interpret statistical significance, and apply population-level data to the care of an individual patient. You take abstract concepts like “number needed to treat” and use them to make concrete recommendations. This rigorous, scientific mindset is the bedrock of your profession. Data science is not a replacement for this foundation; it is the next logical, powerful evolution of it.
Evidence-based medicine, for all its power, has historically relied on looking in the rearview mirror. It analyzes what happened to large groups of patients in the past under controlled conditions. Data science gives us the tools to look forward—to make predictions about the specific, individual patient in front of us right now, using the messy, real-world data generated within our own health system. It allows us to move from population-level probability to personalized prediction.
Think about your “clinical intuition.” It’s that gut feeling you get about a patient. The feeling that tells you a specific patient is at high risk for non-adherence, that a particular prescription seems “off,” or that a patient’s combination of medications is a ticking time bomb for an adverse event. This intuition isn’t magic. It’s a sophisticated pattern-recognition engine you’ve built in your brain over thousands of hours of practice. You are subconsciously processing hundreds of data points—the patient’s age, their lab values, their refill history, the prescriber’s specialty, the jittery way the patient is speaking—and synthesizing them into a prediction. Data science is the discipline of teaching a computer to do the exact same thing, but with millions of data points, at the scale of your entire patient population, 24 hours a day.
This module will demystify this world. We will not be writing complex code. Instead, we will focus on translating the language of data science into the language of pharmacy. You will learn the core concepts so you can become an intelligent partner in analytics projects. You’ll learn to identify opportunities within the pharmacy department where these tools can be applied, how to ask the right questions of the data, and how to critically interpret the results of a predictive model. The goal is to empower you to be the essential clinical bridge between the data scientists and the frontline clinicians, ensuring that these powerful technologies are used safely, ethically, and effectively to improve patient care.
Pharmacist Analogy: The Clinical Intuition Algorithm
Imagine you’re reviewing the profile of a 72-year-old patient being discharged on a new prescription for apixaban after a pulmonary embolism. Your internal “risk algorithm” immediately starts running. You’re not just looking at the one prescription; you’re scanning the entire picture.
- Data Point 1 (Feature): The patient is also on high-dose ibuprofen for arthritis. Your internal rule fires: High bleeding risk.
- Data Point 2 (Feature): You check their latest labs and see a creatinine clearance of 28 mL/min. Your internal rule fires: Standard dose is inappropriate; dose reduction required.
- Data Point 3 (Feature): You look at their home medication list and see they were previously on warfarin, managed by a chaotic anticoagulation clinic with poor INR control. Your internal rule fires: History of adherence challenges.
- Data Point 4 (Feature): The discharge summary notes they live alone. Your internal rule fires: Social determinant of health; potential for missed doses or confusion.
Based on these four “features,” your brain makes a prediction (a classification): “This patient is at HIGH RISK for a 30-day readmission due to a bleeding event or subtherapeutic anticoagulation.” You don’t just dispense the prescription. You intervene. You call the doctor to discuss the ibuprofen and the renal dosing. You counsel the patient extensively, perhaps recommending a pillbox. You may even flag them for a follow-up call from a transitions-of-care pharmacist.
This is data science in its purest form. You took multiple inputs (data), processed them through a set of rules you’ve learned (the model), and generated a predictive output (the risk classification) that changed your actions. A machine learning model does the exact same thing, just more formally. It would learn from the data of 10,000 other patients discharged on apixaban what the most important risk factors (features) are for readmission (the label), and it would build a mathematical model to score every new patient, just like you did. Your clinical expertise is the human-powered prototype for the algorithms we will explore.
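For readers curious to see what a formalized version of this intuition looks like, here is a minimal Python sketch of the apixaban example written as an explicit rule-based scorer. The point values and cutoffs are invented to mirror the narrative above; they are not a validated risk tool.

```python
# A hand-built "clinical intuition algorithm" for the apixaban example.
# Point values and cutoffs are illustrative only, not a validated score.
def readmission_risk(on_nsaid: bool, crcl_ml_min: float,
                     poor_inr_history: bool, lives_alone: bool) -> str:
    points = 0
    if on_nsaid:
        points += 2  # Feature 1: high bleeding risk
    if crcl_ml_min < 30:
        points += 2  # Feature 2: renal dose reduction required
    if poor_inr_history:
        points += 1  # Feature 3: history of adherence challenges
    if lives_alone:
        points += 1  # Feature 4: social determinant of health
    if points >= 4:
        return "HIGH RISK"
    return "MEDIUM RISK" if points >= 2 else "LOW RISK"

# The 72-year-old patient from the analogy above:
print(readmission_risk(on_nsaid=True, crcl_ml_min=28,
                       poor_inr_history=True, lives_alone=True))  # HIGH RISK
```

A machine learning model does the same thing, except it learns the point values and cutoffs from historical outcomes instead of from your training and experience.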
24.1.2 Demystifying the Terminology: A Pharmacist’s Glossary
Before we dive into the methods, we need to establish a common language. The world of data science is filled with jargon that can be intimidating. Let’s translate the most important terms into concepts you already understand.
| Data Science Term | Plain English Definition | Direct Pharmacy Analogy |
|---|---|---|
| Algorithm | A set of step-by-step rules or instructions a computer follows to solve a problem. | A dosing nomogram (like the vancomycin or heparin nomograms) is a perfect example of a clinical algorithm. It’s a set of “if-then” rules to get to a specific output. |
| Model (or Machine Learning Model) | The output of an algorithm after it has been “trained” on data. It’s a mathematical representation of the patterns found in that data. | Your clinical experience is your personal “model.” After seeing hundreds of diabetic patients, your brain has learned the patterns that connect diet, medication, and A1c levels. The model in your head can predict how a dose change might affect a patient. |
| Feature | An input variable used by the model to make a prediction. It’s a piece of data, a characteristic of the thing you’re analyzing. | When you assess a patient, every piece of information is a feature: age, weight, serum creatinine, number of active medications, allergy to penicillin, zip code. |
| Label (or Target Variable) | The output variable; the thing you are trying to predict. | The clinical outcome you’re interested in is the label: 30-day readmission (Yes/No), next A1c value, length of stay (in days), risk of opioid overdose (High/Medium/Low). |
| Training Data | The historical dataset used to teach the model. The model looks at the features in this data and the corresponding labels to learn the patterns. | Your years of pharmacy practice and residency were your training data. You learned by observing thousands of “features” and seeing the resulting “labels” (patient outcomes). |
| Test Data | A separate dataset, which the model has never seen before, used to evaluate the model’s performance and see how well it generalizes to new situations. | The NAPLEX and your board exams were your test data. They evaluated how well your internal “model” could apply its learned patterns to new, unseen patient cases. |
24.1.3 The Three Core Questions: The Main Types of Machine Learning
At its core, supervised machine learning (the most common type used in healthcare) is about using data to answer three fundamental types of questions. Your role as a pharmacy informatics analyst is to learn how to frame your clinical and operational problems as one of these three questions. If you can do that, you can work with a data scientist to solve it.
1. Regression
“How much?” or “How many?”
This type of modeling predicts a continuous numerical value. You’re trying to land on a specific point on a number line.
- How many vials of IVIG will we need next month?
- How much will this patient’s A1c drop with this new medication?
- What will be the expected cost of this patient’s therapy over the next year?
- What will this patient’s vancomycin trough level be on this dose?
2. Classification
“Which category?” or “Is this A or B?”
This type of modeling predicts a discrete category or class. You’re trying to put something into a specific bucket.
- Will this patient be readmitted within 30 days (Yes/No)?
- Is this patient at high, medium, or low risk for an adverse drug event?
- Is this prescription likely fraudulent (Yes/No)?
- Will this patient be adherent to their medication (Adherent/Non-Adherent)?
3. Clustering
“What are the natural groups?”
This is an “unsupervised” method. You don’t have a specific outcome to predict. Instead, you ask the machine to find hidden structures or groupings within the data itself.
- Can we identify distinct types of polypharmacy patients based on their medication patterns?
- Are there groups of prescribers with unusual antibiotic ordering habits?
- Can we segment our non-adherent patient population into different personas for targeted interventions?
Masterclass Deep Dive #1: Regression
The Core Task: Predicting a number. Of all the data science concepts, regression is the one you are already doing every single day, even if you don’t use the term. Every time you perform a pharmacokinetic calculation, you are performing a regression. You take input features (patient’s weight, age, kidney function) and use a mathematical formula (the model) to predict a numerical label (the dose, the resulting drug level). Machine learning simply allows us to do this for more complex problems where the formula isn’t already known.
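To see the point, consider the Cockcroft-Gault estimate of creatinine clearance: a fixed formula that maps input features to a predicted number. Here is a minimal sketch, with an illustrative patient resembling the one from the earlier analogy:

```python
# Cockcroft-Gault: a "regression" every pharmacist already runs by hand.
def cockcroft_gault(age: int, weight_kg: float, scr_mg_dl: float,
                    female: bool) -> float:
    """Estimated creatinine clearance in mL/min."""
    crcl = ((140 - age) * weight_kg) / (72 * scr_mg_dl)
    return crcl * 0.85 if female else crcl

# Illustrative patient: 72 years old, 66 kg, SCr 1.9 mg/dL, female
print(f"{cockcroft_gault(72, 66, 1.9, female=True):.0f} mL/min")  # ~28 mL/min
```

The only difference with machine learning is that the coefficients (140, 72, 0.85) would be learned from data rather than published in a paper.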
The Foundation: Linear Regression
The simplest and most intuitive form of regression is linear regression. It tries to find the best-fitting straight line that describes the relationship between an input feature and an output label. You all remember the equation for a line from algebra: $$ y = mx + b $$
In data science, we just use slightly different notation, but the concept is identical: $$ \text{Predicted Label} = (\text{Coefficient}_1 \times \text{Feature}_1) + \text{Intercept} $$
- The Predicted Label (y): This is the number we are trying to predict. Example: The patient’s next Hemoglobin A1c.
- The Feature (x): The input variable we think influences the label. Example: The daily dose of metformin.
- The Coefficient (m): This is the most important part. It’s the slope of the line. It tells us, for every one-unit increase in our feature (e.g., for every extra 500 mg of metformin), how much we expect our label to change (e.g., how much the A1c will decrease). The model “learns” this value from the training data.
- The Intercept (b): This is where the line crosses the y-axis. It’s the predicted value of our label if the feature were zero.
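Here is what fitting such a line looks like in practice, as a minimal scikit-learn sketch. The five dose/A1c pairs are invented for illustration; a real project would train on thousands of patient records.

```python
# Minimal linear regression sketch: predicting A1c from metformin dose.
# The dose/A1c pairs below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

doses = np.array([[500], [1000], [1500], [2000], [2500]])  # Feature (x), mg/day
a1c = np.array([8.9, 8.4, 8.1, 7.6, 7.2])                  # Label (y), %

model = LinearRegression().fit(doses, a1c)

print(f"Coefficient (slope): {model.coef_[0]:.5f} A1c points per mg")
print(f"Intercept: {model.intercept_:.2f}% (predicted A1c at a dose of zero)")
print(f"Predicted A1c at 1750 mg/day: {model.predict([[1750]])[0]:.2f}%")
```

The model “learns” the coefficient and the intercept from the training data, exactly as described above.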
Critical Concept: Correlation is NOT Causation
A regression model can tell you that two variables are strongly correlated (e.g., patients on higher doses of statins tend to have lower LDL cholesterol). This means they move together. It cannot, by itself, prove that one causes the other. While in this case the causal link is well-established by clinical trials, a model might also find that ice cream sales are highly correlated with shark attacks. This doesn’t mean ice cream causes shark attacks; it means a hidden variable (summer heat) is causing both. As a clinician, your role is to apply your domain expertise to determine if a correlation found by a model is clinically plausible and potentially causal, or just a statistical coincidence.
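A toy simulation makes the ice cream/shark example concrete. Here, a hidden variable (summer heat) drives both series, and the two end up strongly correlated even though neither causes the other. All numbers are invented.

```python
# Toy demonstration of a confounder producing a spurious correlation.
import numpy as np

rng = np.random.default_rng(0)
heat = rng.uniform(10, 35, 200)                      # daily temperature (hidden driver)
ice_cream = 50 + 8 * heat + rng.normal(0, 20, 200)   # sales rise with heat
sharks = 0.2 * heat + rng.normal(0, 1, 200)          # attacks also rise with heat

r = np.corrcoef(ice_cream, sharks)[0, 1]
print(f"Correlation between ice cream sales and shark attacks: r = {r:.2f}")
# Strong positive correlation, yet neither variable causes the other.
```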
Deep Dive Use Case: Forecasting Drug Expenditures
The Clinical Problem: The Director of Pharmacy needs to set the budget for next fiscal year. One of the most expensive and volatile line items is the budget for high-cost biologics used in inflammatory bowel disease (e.g., infliximab, adalimumab). Over-budgeting ties up capital unnecessarily; under-budgeting could lead to a crisis mid-year. The director asks you, the informatics analyst, to create a more accurate forecast.
Your Task (Framed as a Regression Problem): Can you predict the total expenditure (in dollars) for our IBD biologics for the next quarter?
1. Feature Engineering (Gathering Your Data): Your clinical brain tells you what drives this cost. You would work to extract the following features from your EHR and purchasing systems for the last 3 years:
- Historical Data (Time Series):
- `total_expenditure_previous_quarter`
- `total_expenditure_same_quarter_last_year` (to account for seasonality)
- Volume & Patient Metrics:
- `number_of_active_patients_on_biologics`
- `number_of_new_starts_previous_quarter`
- External Factors:
- `wholesale_acquisition_cost` (WAC) for each drug (did the price change?)
- `new_biosimilar_launched` (a binary 1 or 0 feature)
2. Model Training: A data scientist would take this historical data and train a regression model (likely a more complex one than simple linear regression, like a time-series model, but the principle is the same). The model’s job is to learn the mathematical relationship between all those input features and the label you’re trying to predict: `expenditure_next_quarter`.
3. Interpreting the Output: The model might produce an equation that looks something like this (in concept):
`Predicted_Expenditure` = ($5,000 * `num_active_patients`) + ($10,000 * `num_new_starts`) + (0.8 * `expenditure_previous_quarter`) - ($50,000 * `new_biosimilar_launched`) + …
As a pharmacist, you can now interpret this! The model has learned that each active patient adds about $5,000 to the quarterly cost, each new start adds $10,000, and that the launch of a biosimilar saved about $50,000. This is no longer a black box; it’s a financial model grounded in your operational reality.
4. Actionable Insights: You can now use this model to run scenarios. “What if we get 10 more new starts than expected next quarter?” You can plug that number into the model and get a precise budget impact estimate. This transforms you from a simple record-keeper into a strategic financial forecaster for the department.
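Under the hood, “running a scenario” is just plugging new feature values into the learned equation. Here is a minimal sketch using the hypothetical coefficients from step 3; every dollar figure and input value is illustrative, not real model output.

```python
# Scenario analysis with the (hypothetical) coefficients from step 3.
def predicted_expenditure(num_active_patients: int, num_new_starts: int,
                          expenditure_previous_quarter: float,
                          new_biosimilar_launched: int) -> float:
    return (5_000 * num_active_patients
            + 10_000 * num_new_starts
            + 0.8 * expenditure_previous_quarter
            - 50_000 * new_biosimilar_launched)

baseline = predicted_expenditure(120, 15, 900_000, 0)  # illustrative inputs
scenario = predicted_expenditure(120, 25, 900_000, 0)  # 10 extra new starts
print(f"Budget impact of 10 extra new starts: ${scenario - baseline:,.0f}")
# $100,000 - exactly 10 times the $10,000 per-new-start coefficient
```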
Masterclass Deep Dive #2: Classification
The Core Task: Predicting a category. This is the workhorse of clinical predictive modeling. So many critical questions in a hospital come down to a categorical choice: Is the patient going to get better or worse? Do we intervene or not? Is this safe or unsafe? Classification models are designed to automate this process of sorting and flagging, allowing you to focus your clinical attention on the highest-risk patients.
An Intuitive Model: The Decision Tree
While models like logistic regression are powerful, the most intuitive classification model for a clinician is a decision tree. It works exactly like a clinical workflow or diagnostic algorithm, by asking a series of “yes/no” questions to arrive at a final classification. The algorithm learns from the data which questions to ask, in what order, and what the best cut-off points are for each question to best separate the categories.
Imagine a model designed to predict if a patient is at high risk for an opioid-induced respiratory depression event. The trained decision tree might look like this:
- Is Total Daily MME > 90?
  - No → CLASSIFICATION: LOW RISK
  - Yes → Is the patient on a benzodiazepine?
    - No → CLASSIFICATION: MEDIUM RISK
    - Yes → CLASSIFICATION: HIGH RISK
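The same tree can be written as ordinary “if/then” code, which is exactly how your brain applies it. The 90 MME cutoff and the benzodiazepine split come from the example tree above, not from a validated clinical model.

```python
# The example decision tree, expressed as explicit if/then rules.
def respiratory_depression_risk(total_daily_mme: float,
                                on_benzodiazepine: bool) -> str:
    if total_daily_mme <= 90:
        return "LOW RISK"
    if not on_benzodiazepine:
        return "MEDIUM RISK"
    return "HIGH RISK"

print(respiratory_depression_risk(45, on_benzodiazepine=False))   # LOW RISK
print(respiratory_depression_risk(120, on_benzodiazepine=False))  # MEDIUM RISK
print(respiratory_depression_risk(120, on_benzodiazepine=True))   # HIGH RISK
```

The only difference with a learned tree is that the algorithm, not a committee, chooses the questions and the cutoff points from the training data.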
Evaluating a Classification Model: The Confusion Matrix
For a regression model, we can see how “wrong” it is by measuring the distance between the predicted number and the actual number. But for classification, it’s either right or wrong. The tool we use to understand the performance of a classification model is the confusion matrix. It’s a simple table that shows us the four possible outcomes of a prediction.
Let’s use a critical clinical example: a model that predicts which ICU patients will develop sepsis within the next 12 hours.
Confusion Matrix: Sepsis Prediction Model
| | Actual: Sepsis | Actual: No Sepsis |
|---|---|---|
| Predicted: Sepsis | True Positive (TP): 90 (Model correctly predicted sepsis) | False Positive (FP): 100 (Model predicted sepsis, but patient was fine – Alert Fatigue) |
| Predicted: No Sepsis | False Negative (FN): 10 (Model missed the sepsis case – Catastrophic Failure) | True Negative (TN): 9,800 (Model correctly predicted no sepsis) |
Performance Metrics: It’s a Balancing Act
From the confusion matrix, data scientists calculate several key metrics. As a clinician, you need to understand what they mean, because there is always a trade-off.
- Accuracy: (TP + TN) / Total. “Overall, how often was the model right?” In our example, it’s (90 + 9800) / 10000 = 98.9%. This looks amazing, but it’s misleading because the vast majority of patients don’t have sepsis. Accuracy is a poor metric for rare events.
- Precision (Positive Predictive Value): TP / (TP + FP). “Of all the patients the model flagged, how many actually had sepsis?” In our example, 90 / (90 + 100) = 47.4%. This tells you about alert fatigue. Less than half of the alerts from this model would be real.
- Recall (Sensitivity): TP / (TP + FN). “Of all the patients who truly had sepsis, how many did the model catch?” In our example, 90 / (90 + 10) = 90%. This is your safety metric. The model caught 90% of the true cases.
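These three metrics are one-line calculations once you have the four counts. A minimal sketch using the sepsis numbers above:

```python
# Computing the metrics from the sepsis confusion matrix above.
tp, fp, fn, tn = 90, 100, 10, 9_800

accuracy = (tp + tn) / (tp + fp + fn + tn)  # misleading for rare events
precision = tp / (tp + fp)                  # the alert-fatigue metric
recall = tp / (tp + fn)                     # the safety metric

print(f"Accuracy:  {accuracy:.1%}")   # 98.9%
print(f"Precision: {precision:.1%}")  # 47.4%
print(f"Recall:    {recall:.1%}")     # 90.0%
```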
The Clinical Trade-Off: You can tune a model to have higher recall (catch more cases), but it will almost always result in lower precision (more false alarms). Conversely, you can make it more precise (fewer false alarms), but you risk missing more true cases (lower recall). Your job as the clinical expert is to help the team decide on the acceptable balance. For a deadly condition like sepsis, you would always prioritize high recall, even if it means dealing with more false positives. For a less critical prediction, you might prioritize precision to avoid alert fatigue.
Deep Dive Use Case: Predicting 30-Day Hospital Readmissions
The Clinical Problem: Hospital readmissions are costly, disruptive for patients, and a major quality metric that incurs financial penalties from payers. The transitions-of-care (TOC) pharmacy team is small and cannot possibly provide intensive follow-up for every discharged patient. They need to focus their efforts on the patients most likely to come back.
Your Task (Framed as a Classification Problem): Can we predict which patients, at the time of discharge, have a high probability of being readmitted within 30 days (Yes/No)?
1. Feature Engineering (Gathering Your Data): This is a classic problem with well-established risk factors. You would mine the EHR for features like:
- Demographics: `age`, `gender`, `insurance_type`
- Clinical History: `LACE_score` (Length of stay, Acuity of admission, Comorbidities, Emergency department visits), specific comorbidities like `has_CHF`, `has_COPD`, `has_diabetes`.
- Current Admission Data: `admitted_from_SNF`, `discharge_disposition` (home, home health, SNF), `number_of_procedures`.
- Medication-Related Features: `number_of_discharge_meds`, `is_on_anticoagulant`, `is_on_insulin`, `had_med_reconciliation_by_pharmacy`.
- Social Determinants: `zip_code` (as a proxy for socioeconomic status), `has_documented_caregiver`.
2. Model Training & Output: A classification model (like logistic regression or a more advanced model like a gradient-boosted tree) would be trained on data from tens of thousands of past discharges. For each new patient being discharged, the model wouldn’t just give a “Yes” or “No” answer. It would output a probability score, a number between 0 and 1 (or 0% and 100%). For example, Patient A might get a score of 0.08 (8% chance of readmission), while Patient B gets a score of 0.65 (65% chance of readmission).
3. Actionable Insights & Workflow Integration: This probability score is incredibly powerful. Now you can build an automated workflow:
- The model runs automatically on every patient scheduled for discharge in the next 24 hours.
- A report or dashboard is generated, ranking all discharging patients by their readmission risk score.
- The TOC pharmacy team gets this report every morning. They can now ignore the hundreds of patients with <10% risk and focus their entire day on the 20 patients with >50% risk.
- For these high-risk patients, they can perform intensive interventions: a bedside medication delivery, a scheduled 3-day post-discharge follow-up call, and coordination with the outpatient pharmacy.
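Here is that minimal sketch of the modeling step, assuming a handful of the features listed above. The six-patient dataset is invented purely to make the example run; a real model would train on tens of thousands of discharges.

```python
# Sketch: a classifier that outputs readmission probabilities, not just Yes/No.
# The tiny dataset and feature subset are invented for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "age":                      [82, 45, 70, 55, 78, 33],
    "number_of_discharge_meds": [14,  4, 11,  6, 16,  3],
    "is_on_anticoagulant":      [ 1,  0,  1,  0,  1,  0],
    "readmitted_30d":           [ 1,  0,  1,  0,  1,  0],  # the label
})

features = ["age", "number_of_discharge_meds", "is_on_anticoagulant"]
model = LogisticRegression().fit(train[features], train["readmitted_30d"])

new_patient = pd.DataFrame([[72, 12, 1]], columns=features)
risk = model.predict_proba(new_patient)[0, 1]  # probability of readmission
print(f"Predicted 30-day readmission risk: {risk:.0%}")
```

The TOC team’s morning report is just this `predict_proba` call run over every discharging patient, sorted from highest to lowest risk.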
This is how predictive analytics transforms a pharmacy service from a reactive, scattershot approach to a proactive, data-driven, and highly efficient operation. You are using data to apply your most valuable resource—your clinical expertise—to the patients who need it most.
Masterclass Deep Dive #3: Clustering
The Core Task: Finding hidden groups. Clustering is different from regression and classification. With those, you had a specific target you were trying to predict (the “label”). This is called supervised learning. Clustering is a form of unsupervised learning. You don’t have a label. You just have a dataset full of features, and you say to the computer, “I don’t know what the patterns are in here. You tell me. Find the natural, meaningful groups in this data.”
An Intuitive Model: K-Means Clustering
The most common clustering algorithm is K-Means. The “K” just stands for the number of clusters you tell the algorithm to find. The process is surprisingly simple and intuitive:
- Step 1 (Choose K): You decide how many clusters you want to find. Let’s say you want to find 3 types of prescribers, so K=3.
- Step 2 (Random Start): The algorithm randomly places 3 points (called “centroids”) into your dataset.
- Step 3 (Assign): It assigns every single data point (each prescriber) to the nearest centroid. This creates 3 initial, rough groups.
- Step 4 (Update): It calculates the new center of each of the 3 groups it just created, and moves the centroid to that new center.
- Step 5 (Repeat): It repeats steps 3 and 4 over and over. Each time, the centroids get closer and closer to the true center of the natural clusters in the data, until they stop moving. The final groups of data points surrounding each centroid are your clusters.
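In code, the entire loop above is handled by one library call. A minimal sketch with K=3 and two invented prescriber features (in a real project you would scale the features first; see the Feature Scaling note later in this section):

```python
# K-Means sketch: grouping prescribers by two invented features.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one prescriber: [antibiotic orders/month, fraction broad-spectrum]
prescribers = np.array([
    [12, 0.10], [15, 0.12], [14, 0.08],  # low volume, narrow spectrum
    [60, 0.15], [55, 0.18], [58, 0.12],  # high volume, narrow spectrum
    [40, 0.70], [45, 0.65], [38, 0.75],  # heavy broad-spectrum use
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(prescribers)
print("Cluster assignments:", kmeans.labels_)
print("Final centroids:\n", kmeans.cluster_centers_)
```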
Deep Dive Use Case: Patient Segmentation for MTM Services
The Clinical Problem: Your health system wants to improve medication adherence for its diabetic population. A “one-size-fits-all” approach (like a generic refill reminder text) isn’t very effective. You believe there are different “types” of non-adherent patients who would respond to different kinds of interventions. But you don’t know what those types are.
Your Task (Framed as a Clustering Problem): Can we analyze our diabetic patient population and identify distinct, meaningful segments (clusters) based on their demographic and medication-taking behaviors?
1. Feature Engineering (Gathering Your Data): You don’t have a label for “patient type,” so you just gather all the relevant features you can find for your population of 20,000 diabetic patients:
- Demographics: `age`, `zip_code`
- Clinical Data: `hemoglobin_a1c`, `number_of_comorbidities`, `years_since_diagnosis`
- Medication Data: `proportion_of_days_covered` (PDC), `number_of_diabetes_meds`, `is_on_insulin`
- Engagement Data: `number_of_primary_care_visits_last_year`, `uses_patient_portal` (Yes/No)
Critical Concept: Feature Scaling
Clustering works by measuring “distance” between data points. But you can’t directly compare `age` (which ranges from maybe 20-90) to `hemoglobin_a1c` (which ranges from 5-14). The age variable would completely dominate the distance calculation. Before clustering, data must be scaled or normalized, usually by converting each feature to a common scale (for example, min-max scaling to a 0-1 range, or a z-score with mean 0 and standard deviation 1). This is a critical pre-processing step that ensures all features contribute fairly to the clustering algorithm.
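A minimal sketch of what that looks like, using scikit-learn’s standard z-scoring transformer and two invented patient features:

```python
# Z-scoring features so age and A1c contribute on a comparable scale.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), hemoglobin A1c (%)
patients = np.array([
    [25,  6.1],
    [50,  7.8],
    [75, 10.5],
    [88,  9.0],
])

scaled = StandardScaler().fit_transform(patients)
print(scaled)  # each column now has mean 0 and standard deviation 1
```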
2. Model Training & Interpreting the Clusters: You tell the K-Means algorithm to find, let’s say, K=4 clusters. The algorithm runs and assigns every one of the 20,000 patients to one of four groups. Now comes the most important part, which requires your clinical expertise: profiling the clusters. You would look at the average values of all the features for the patients in each cluster to understand what makes them unique. You might find:
| Cluster Profile | Key Characteristics (Averages) | Your Clinical Interpretation & Persona | Targeted Intervention |
|---|---|---|---|
| Cluster 1 (n=5,000) | Age: 75, Comorbidities: 8, PDC: 95%, Portal Use: No | “The Adherent but Complex Elderly.” These patients are taking their meds but are at high risk for polypharmacy issues and drug interactions. | Comprehensive Medication Review (CMR) focused on de-prescribing and simplification. |
| Cluster 2 (n=3,000) | Age: 35, Comorbidities: 1, PDC: 50%, A1c: 10.5% | “The Young and Disengaged.” These patients are relatively healthy but struggling with adherence, likely due to behavioral factors or disease understanding. | Motivational interviewing, connecting them with a diabetes educator, and leveraging mobile app-based reminders. |
| Cluster 3 (n=8,000) | Age: 55, Comorbidities: 3, PDC: 90%, High-deductible plan | “The Stable but Cost-Conscious.” These patients are trying to be adherent but are likely facing financial barriers to care and medication access. | Proactive outreach from a pharmacy tech specializing in patient assistance programs and identifying lower-cost therapeutic alternatives. |
| Cluster 4 (n=4,000) | Age: 60, Comorbidities: 5, PDC: 65%, On Insulin, High Portal Use | “The Struggling and Tech-Savvy.” These patients have complex regimens and are trying to manage their disease but are failing. They are comfortable with technology. | Enrollment in a remote monitoring program using continuous glucose monitors (CGMs) with pharmacist-led telehealth check-ins. |
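Mechanically, profiling is a group-by-and-average operation. Here is a minimal sketch with a six-patient invented dataset; the `cluster` column is what K-Means assigns (`kmeans.labels_`):

```python
# Profiling clusters: average each feature per cluster label.
import pandas as pd

df = pd.DataFrame({
    "age":                        [75, 78, 35, 33, 55, 58],
    "proportion_of_days_covered": [0.95, 0.93, 0.50, 0.48, 0.90, 0.88],
    "hemoglobin_a1c":             [7.2, 7.4, 10.5, 10.8, 7.9, 8.1],
    "cluster":                    [0, 0, 1, 1, 2, 2],  # from kmeans.labels_
})

# One row per cluster: the raw material for the personas above
print(df.groupby("cluster").mean().round(2))
```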
3. Actionable Insights: By using clustering, you have transformed a generic problem (“improve adherence”) into four specific, targeted problems with clear, actionable solutions. You can now design four separate MTM campaigns, each tailored to the unique needs of the patient persona you discovered. This is a level of strategic, personalized care that is impossible to achieve without using data science to uncover the hidden structures within your patient population.