In needs assessments, respondents are frequently asked to rank unmet needs, survival challenges or sectoral priorities. The so-called Plackett-Luce model estimates from the ranking data their relative strength for the sample or population. The strength is measured at the ratio level; i.e., ratios between items are legitimate. In addition, Plackett-Luce trees identify and visualize statistically distinct need / priority etc. profiles by social group in response to vulnerability attributes measured at the unit level (e.g., households) (these attributes are the “covariates” in the model). This is an example of a simple tree:
Our demonstration dataset is from a Multi-Sectoral Needs Assessment (MSNA) in the Rohingya refugee camps in Bangladesh in 2019. We provide R code for both the unconditional strength estimates as well as for group-sensitive trees. Plackett-Luce models may be particularly relevant in assessments with limited access (Covid-19!) and short questionnaires that still have room for questions about first and second (and ideally also third) priorities. We also solve an apparent shortcoming in the original R package (PlackettLuce
) options for the tree legend, making for attractive tree plots in presentations and reports. We show how to filter observations such that the trees can be calculated using sampling weights, a key requirement for survey data analysis.
We present the code in chapters and chapter segments, with numerous annotations for the technically interested reader. We point out several potential issues and illustrate them in the appendix:
Some households will state only one priority. Plackett-Luce excludes these observations. We demonstrate for our dataset that the missing priorities are not random. If these observations are numerous, they should be investigated for differences in first priorities and in co-variates vs. the included, to assess the potential bias.
Analysts may want to select from the set of ranked items, in order to focus on the highest priorities. The more items are excluded, the smaller the proportion of observations with at least two remaining priorities. We demonstrate that with more exclusions the effective sample size diminishes acceleratingly.
We investigate the impact of assessment design on priority profiles. Qualitative studies in the Rohingya camps (Acaps et.al. 2019) compellingly demonstrated the particular vulnerability of households headed by unmarried women. The quantitatively oriented MSNA paired enumerators and respondents by gender. The tied enumerator/respondent gender is liable to act as a confounder, given the correlation between the genders of respondent and household head. We find that the design variable overwhelms the substantive covariates. However, the bias can be circumvented to a degree and even affords new insight. Analysts should compare Plackett-Luce trees between versions that include suspected confounders in the covariates vs. those that don’t.
Like other statistical models, notably regression and cluster analysis, Plackett-Luce trees are sensitive to covariate transformations, particularly categorizations of numeric variables. We show the effect of different transformations, some neutral, others leading to vastly different trees.
As a novice R user, I would not have been able to write this code without the extensive help of my son Attilio Benini. The two of us gradually developed this project over several months together. In the text, therefore, I use first-plural pronouns.
I thank Sudeep Shrestha, Acaps, for supplying me with a clean and documented dataset as the basis of our demonstration data.
I am grateful also to Edouard Legoupil, UNHCR, who in December 2019 added a comparison between the Borda Count, another measure of sectoral priorities, and the Plackett-Luce model, providing R code for both (he modestly attributed this contribution to “Benini & Sons”, but the novel elements are all his) (Legoupil 2019).
I thank Prof. Heather Turner, co-creator of the PlackettLuce
package, for advice and code examples (personal email).
Neither Attilio nor myself have a financial interest in this project. Readers should be free to use and modify the code as they see fit, with a reference to our work and to the authors of the Plackett-Luce package (see references), where appropriate.
Aldo Benini
A downloadable zip-file containing
PLtrees_msna.Rmd
generating this .html,PLtrees_msna.Rproj
,msna2019.RData
, holding the demonstration data msna2019
and its metadata msna2019_meta_data
is available from the same web page as this .html. Make sure to extract the zipped files to the same working directory.
The .html does not show the entire code of the .Rmd. The complete plain code can be extracted using the code below within the same environment as the .Rmd (copy to a script or use the console).
"PLtrees_msna.Rmd" %in% list.files(getwd())
# check whether the .Rmd to extract is within the current environment
knitr::knit(
input = "PLtrees_msna.Rmd",
tangle = TRUE,
quiet = TRUE,
output = "PLtrees_msna_code_extracted.R"
)
The code was written using R version 4.0.2. and RStudio version 1.3.959. The dependencies of the used packages require R version >= 3.2.3. The used Pandoc version was 2.9.2, which came with RStudio.
The Plackett-Luce model was presented to the humanitarian data analyst community in Benini, A. (2019) “Priorities and preferences in humanitarian needs assessments - Their measurement by rating and ranking methods”. However, at the time we did not understand the R code for Plackett-Luce trees, an analytical as well as graphical technique to identify significantly different preference profiles, measured by ranked items, in response to covariates of the units (e.g., households) expressing the preferences. This facility is part of the R package PlackettLuce
. It was developed by, and described in, Turner, van Etten, Firth and Kosmidis (2018b). Some of the concepts, in particular ranking networks, are succinctly explained in Turner et al. (2018a).
As the authors explain, the algorithm producing the Plackett-Luce trees conducts an iterative process:
Fit a Plackett-Luce model to the full data.
Assess the stability of the worth parameters with respect to each available covariate.
If there is signiﬁcant instability, split the full data by the covariate with the strongest instability and use the cut-point with the highest improvement in model ﬁt.
Repeat steps 1-3 until there are no more signiﬁcant instabilities, or a split produces a sub-group below a given size threshold (Turner et al., 2018b, op.cit).
Point no. 3 is key. The association of priorities with other variables - notably the measured attributes of vulnerable groups - can, of course, be explored with the means of descriptive statistics, and group profiles with statistics of priorities and attributes of interest can be contrasted. Plackett-Luce adds value in two respects:
First, it expresses the sample- or population-wide preference for each ranked item as a value on a latent variable (called “worth”), calculated from the stated item ranks.
Second, its tree algorithm finds the most strongly distinct group profiles on the basis of the user-defined set of attributes, retaining those attributes which simultaneously cause the strongest distinctions. Not only does it find the subset; it also determines the sequence among the retained attributes that divide the sample into successively refined groups as well as the attribute values at which these splits are made.
The result is a set of “worth” estimates for each item in each group so defined. The “worth” measure is ratio-level, so that items can be compared as ratios (i.e., not only as differences without natural zero point). Similarly, for a given item, groups can be compared by ratios between the worth that each of them assigns to it (e.g., the relative priority that men and woman give to food aid).
Finally, the algorithm can be used to visualize retained attributes, splitting values and group-specific preference profiles in tree plots that resemble graphic decision trees.
In a minor change of terminology, we replace in what follows “worth” with “strength”, as in the strength with which participants in humanitarian assessments hold their beliefs, preferences and priorities. We believe that this wording is more appropriate to humanitarian assessments.
This document demonstrates code for estimating the unconditional (no group differences) Plackett-Luce model (PlackettLuce
) and for Plackett-Luce trees (pltree
), which identify groups with distinct profiles. Our dataset is a segment from the 2019 Rohingya MSNA (ISCG et al. 2019b).
One of us extensively worked with those data using Stata 15. We mention this because Stata estimates some version of Plackett-Luce models with the command “rologit — Rank-ordered logistic regression”; in the statistical community, this model is also known as the “exploded logit model” and as the “choice-based method of conjoint analysis”. However, both the data organization - rologit demands long-shape data - and the model differ. rologit requires that the independent variables vary between the items of a given respondent, at least for some of the respondents. For instance, a respondent’s gender cannot directly affect his or her rankings; it could affect the rankings only via an interaction with a variable that does vary over items. The R version, however, does not require that. Moreover, it offers extensive data preparation tools.
In a recent paper, the authors of PlackettLuce
(Turner et.al., 2020) compare its features to those of other ranking-data software. In its appendix, interested R users will find code examples for several alternative procedures. We do not discuss these here.
These packages are required for the subsequent code to run properly:
packages_required <- c(
"tidyverse", # Packages for data manipulation and plotting
"ggrepel", # Automatically Position Non-Overlapping Text Labels with 'ggplot2'
"rmarkdown", # Convert R Markdown documents into HTML, MS Word, PDF and Beamer.
"knitr", # General-Purpose Package for Dynamic Report Generation in R
"bookdown", # Authoring Books and Technical Documents with R Markdown
"kableExtra", # For constructing complex tables
"summarytools", # For writing some types of tables
"PlackettLuce", # Evaluates strength (worth) of priorities and computes Plackett-Luce trees
"png", # Read and write PNG images
"ridittools" # To construct the riddit-transformed education proxy
)
# Determine new packages to install:
new_packages <- packages_required[!packages_required %in% installed.packages()]
# If any new ones are indeed required, install them:
if(length(new_packages) > 0){install.packages(new_packages)}
# Load / activate the required packages:
lapply(X = packages_required,
function(X)
suppressPackageStartupMessages(
library(package = X, character.only = T)
)
)
For brevity, in the above code chunk the output including conflict warnings is suppressed. To see them, run this code line in the console:
lapply(X = packages_required, function(X) library(package = X, character.only = T))
The dataset used here is a segment of the shared December 2019 version of the dataset from the “Joint Multi-Sector Needs Assessment (J-MSNA) - Rohingya Refugees - September 2019.” Cox’s Bazar, Bangladesh (ISCG and et.al. 2019b). A recently (March 2020) updated version of the dataset (in .xlsx format) can be downloaded here.
The original dataset is large, with 536 variables at the household level. We removed all variables but the 20 used here. We replaced the original record ID with a sequential numeric one and removed the camp markers. No records were entirely deleted; thus n = 3418.
Several more calculated variables are added during execution.
load("msna2019.RData")
# Load data plus meta-data from a file stored in the same directory as this .Rmd
Although this ID is unique by design, in other circumstances it will be reassuring to do a check, using a line of code like this:
n_distinct(msna2019$id, na.rm = TRUE) == nrow(msna2019)
## [1] TRUE
The assessment enumerators requested the respondents to name their first three priorities from a list of 12 needs areas. These twelve are, in the logic of Plackett-Luce, the items. A minimum of two ranked items is needed for an observation to be in the effective estimation sample. For the way the priorities were coded, see further below.
We include up to five covariates in the Plackett-Luce tree models, as well as a sixth (a design variable) in some, in order to make a methodological point. Descriptive statistics of the covariates follow after this table; for the items we give them in histograms further below.
variable | description |
---|---|
id | ID (record identifier) |
sampling_weight | Sampling weight |
resp_gender | Gender of the respondent (design variable). 2 categories: ‘Male’, ‘Female’. |
hh_gender_marital | Household head gender and marital status (key vulnerability indicator). 4 categories: ‘Fem-Single’, ‘Fem-Married’, ‘Male-Single’, ‘Male-Married’. |
males_18_59 | Males 18-59 years in household (household capacity indicator). 2 categories: ‘At least one’, ‘None’. |
educ_max_years | Max. years education all household members (excl. madrassa) (household capacity indicator). Range years (0-12) of schooling. |
working | Some household members working (household capacity indicator). 2 categories: ‘At least one’, ‘None’. |
fcs | Food Consumption Score (FCS) (household poverty indicator) |
items | |
food | Food |
shelter | Shelter |
water | Water |
latrines | Latrines |
electricity | Electricity |
cooking | Cooking utensils |
clothing | Clothing |
health_care | Health care |
education | Education |
safety | Safety |
fuel | Cooking fuel |
income | Income |
“A total of 3418 households, composed of 17162 individuals, were surveyed across 34 camps. The households were identified through simple random sampling of Open Street Map (OSM) shelter footprints. Each survey was conducted with an adult household representative responding on behalf of the household and its members.” (Excel data file, sheet “Read Me”).
as well as of the sampling weights and of the design variable resp_gender
## Data Frame Summary
## msna2019
## Dimensions: 3418 x 7
## Duplicates: 62
##
## -------------------------------------------------------------------------------
## Variable Stats / Values Freqs (% of Valid) Missing
## -------------------- -------------------------- --------------------- ---------
## fcs Mean (sd) : 44.5 (11.1) 149 distinct values 0
## [numeric] min < med < max: (0%)
## 13 < 43.5 < 103
## IQR (CV) : 13 (0.3)
##
## working 1. At least one 1827 (53.4%) 0
## [factor] 2. None 1591 (46.6%) (0%)
##
## educ_max_years Mean (sd) : 3.2 (3.1) 13 distinct values 1
## [integer] min < med < max: (0.03%)
## 0 < 3 < 12
## IQR (CV) : 5 (1)
##
## males_18_59 1. At least one 2937 (85.9%) 0
## [factor] 2. None 481 (14.1%) (0%)
##
## hh_gender_marital 1. Fem-Single 235 ( 6.9%) 0
## [factor] 2. Fem-Married 331 ( 9.7%) (0%)
## 3. Male-Single 77 ( 2.2%)
## 4. Male-Married 2775 (81.2%)
##
## resp_gender 1. Male 1669 (48.8%) 0
## [factor] 2. Female 1749 (51.2%) (0%)
##
## sampling_weight Mean (sd) : 1 (0.4) 34 distinct values 0
## [numeric] min < med < max: (0%)
## 0.2 < 1 < 1.9
## IQR (CV) : 0.7 (0.4)
## -------------------------------------------------------------------------------
Item statistics follow below in graphic form.
Plackett-Luce trees are sensitive to transformations of covariates (the independent variables of the model). Different trees may result from the original numeric vs. from derived categorical versions of the same covariate. This is best discussed by way of example. For this we use the education of household members, a household capacity indicator in our dataset.
Education is likely to influence the way the respondent reasons about priorities, both cognitively (how he/she understands questions and frames the response) as well as through the effect that education has on the actual household situation, particularly in employment and dealings with outside agencies.
However, members’ levels have been observed incompletely. The MSNA was limited to a proxy measure, with one recorded observation per household. This is the maximum number of years of schooling completed among the members (educ_max_years
).
educ_max_years
is a count variable with multiple peaks. Already in the original dataset, a dichotomous indicator educ_min_6_years
was calculated, whether the maximum number of years of schooling was higher than Class 5, or not.
Neither the raw count nor the chosen break point for the indicator are necessarily optimal for finding distinct priority patterns along the education dimension. Therefore, we construct two more derivatives from educ_max_years
, one continuous, the other dichotomous.
The continuous version, named educ_ridit
, proxies for the education level of the household. Formally, the educ_ridit
is defined as the riddit of educ_max_years
. In different formulation, it is is the probability that the household’s educ_max_years
is higher than that of a randomly chosen household plus half the probability that it is the same.
The intuition is that this measure is likely substantially correlated with the unknown average education level of the adult members and, if so, with the ability to interact usefully with outside agencies and potential employers and clients.
educ_max_years
and educ_ridit
map strictly monotonously to each other. The advantage of educ_ridit
is that the distances between adjacent values are symptomatic of the effort to get from one level to the next in the education system, such as from never-school to first grade completed vs. first to second grade, etc.
As such, educ_ridit
may be better suited to reveal education-related effects on priority profiles. It is a continuous variables within the interval (0, 1) whereas educ_max_years is a count variable with min = 0. But will this produce a different tree from that obtained using the original count variable?
Further, we dichotomize educ_max_years
in any_school_ever
, which takes the value “No one” if no one in the household has ever completed Class 1, and “At least one” otherwise. The intuition is that total non-exposure to the formal education system indexes a wider lack of social capital, carried over from Myanmar and unremedied in camp life. It thus expresses more than the education status.
This code creates a comparison table for the four measures and merges the two new ones back to the dataframe msna2019
.
# generate factor educ_min_6_years
msna2019 %>%
mutate(
educ_min_6_years = ifelse(educ_max_years < 6, "0-5 years", "min. 6 years"),
educ_min_6_years = factor(educ_min_6_years, levels = c("0-5 years", "min. 6 years"))
) -> msna2019
(msna2019 %>%
filter(!is.na(educ_max_years)) %>%
group_by(educ_max_years, educ_min_6_years) %>%
count() %>%
ungroup() %>%
mutate(
proportion = n/sum(n),
educ_ridit = ridittools::toridit(proportion),
any_school_ever = ifelse(educ_max_years == 0, "No one", "At least one"),
any_school_ever = factor(x = any_school_ever, levels = c("No one", "At least one"))
) %>%
select(
educ_max_years,
n:educ_ridit,
educ_min_6_years,
any_school_ever
) -> tab_educ_ridit)
## # A tibble: 13 x 6
## educ_max_years n proportion educ_ridit educ_min_6_years any_school_ever
## <int> <int> <dbl> <dbl> <fct> <fct>
## 1 0 1042 0.305 0.152 0-5 years No one
## 2 1 226 0.0661 0.338 0-5 years At least one
## 3 2 359 0.105 0.424 0-5 years At least one
## 4 3 373 0.109 0.531 0-5 years At least one
## 5 4 393 0.115 0.643 0-5 years At least one
## 6 5 273 0.0799 0.740 0-5 years At least one
## 7 6 155 0.0454 0.803 min. 6 years At least one
## 8 7 165 0.0483 0.850 min. 6 years At least one
## 9 8 152 0.0445 0.896 min. 6 years At least one
## 10 9 77 0.0225 0.930 min. 6 years At least one
## 11 10 162 0.0474 0.965 min. 6 years At least one
## 12 11 13 0.00380 0.990 min. 6 years At least one
## 13 12 27 0.00790 0.996 min. 6 years At least one
tab_educ_ridit %>%
select(educ_max_years, educ_ridit, any_school_ever) %>%
left_join(x = msna2019, y = ., by = "educ_max_years") -> msna2019
For space reasons, we run models each using one of the four versions of the education proxy in this appendix. We find that Plackett-Luce trees indeed are not indifferent to all covariate transformations:
For bijective numeric-to-numeric transformations, the trees are the same (except for the re-scaled split values). In our example, the trees with educ_max_years
and educ_ridit
both come with the exactly same four terminal nodes. Note that this robustness is not trivial; it does not hold for other types of models, e.g. regression on, or cluster analysis with, transformed covariates.
Categorization of originally numeric covariates, however, is consequential. The specification with educ_min_6_years
returns a tree with two terminal nodes only; with any_school_ever
Plackett-Luce splits the sample into three groups.
The point to retain is that we ought to be wary of reducing numeric covariates to categorical ones in the preparation stage. First we may want to try out a Plackett-Luce tree using the numeric versions. If these trees have no sensible interpretation, only then should we work with the categorical transforms. The split values in the numeric ones may even suggest meaningful categories.
For several reasons, we need an overview of the observations that have missing values. Usually one would create a table with a row for each combination of missing values and a column for their frequency, then consider further analysis strategies. Our demonstration msna2019
dataset has only one such observation; so we simply print it out in its entirety.
msna2019_incomplete <- msna2019 %>% filter(!complete.cases(.))
nrow(msna2019_incomplete)
## [1] 1
msna2019_incomplete
## id sampling_weight resp_gender hh_gender_marital males_18_59
## 1 1569 1.149119 Female Fem-Married At least one
## educ_max_years working fcs food shelter water latrines electricity cooking
## 1 NA None 41 3 1 0 0 2 0
## clothing health_care education safety fuel income educ_min_6_years educ_ridit
## 1 0 0 0 0 0 0 <NA> NA
## any_school_ever
## 1 <NA>
And only one observation has missing values, in the maximum number of years of schooling that any of the household members attained and in the three transformations of this variable. The items have no missing values because PlackettLuce
requires all unranked items in an observation to be coded 0.
That coding convention must not be confused with the problem that for a number of sample households only one priority was recorded. Plackett-Luce drops them from the estimation sample. Yet, these observations have value. If we do not analyze separately, we may not know whether the results are significantly biased. This may happen when missing priorities are not random. We will discuss this in the next section.
In the original dataset, the priorities are organized as item rankings held in a separate variable for each item. They are coded naturally, i.e., first rank = 1, second = 2, etc., plus: unranked = 0. For the PL-tree model, it is convenient (although not absolutely necessary at this point) to reorganize the priorities into j = 1, 2,.., K polytomous variables, each j-th one holding all the observed items of j-th priority, K being the the lowest rank given in the entire item set. In our case, with three priorities elicited, this boils down to the calculation of priority_1
, priority_2
, and priority_3
(A reminder for readers familiar with the Borda count: the coding of the ranked items is reversed; for unranked ones, it is 0, too.) . This arrangement is also helpful for other preparation goals: ordered frequency plots, and a common factor coding across the three priorities.
With this code, we obtain the three priority variables such that they have …
msna2019 %>%
select(id, food:income) %>%
pivot_longer(-id, names_to = "item", values_to = "priority") %>%
filter(priority != 0) %>%
mutate(priority = paste0("priority_", priority)) %>%
pivot_wider(names_from = priority, values_from = item, names_sort = TRUE) -> priorities
… the same set of factor levels. The levels are sorted by frequency of items in the first priority; subsequently, items that never were the first priority are sorted by their frequencies as second priority, etc.
priorities %>%
select(starts_with("priority")) %>%
mutate_all(fct_infreq) %>%
# factorize each priority ordered by number of observations with its levels
fct_unify() %>%
# unify the levels of the different priorities
data.frame(id = priorities$id, .) %>%
left_join(x = msna2019, y = ., by = "id") -> msna2019
We joined these derivatives by the id
to their origin (for households that gave fewer than three priorities (0 - 2), NA-values are filled in as needed). And now we check if these priorities really all have the same factor levels:
msna2019 %>%
select(starts_with("priority")) %>%
lapply(levels) %>%
unique()
## [[1]]
## [1] "food" "shelter" "electricity" "water" "income"
## [6] "latrines" "health_care" "cooking" "fuel" "education"
## [11] "safety" "clothing"
The combined histogram renders the distribution of priorities over the items:
msna2019 %>%
select(starts_with("priority")) %>%
pivot_longer(cols = names(.), names_to = "priority", values_to = "item") %>%
mutate(item = factor(item, levels = levels(msna2019$priority_1))) %>%
ggplot(aes(item, fill = item)) +
geom_histogram(stat = "count") +
facet_wrap(~priority, ncol = 1) +
theme_bw() +
# ggtitle(paste0("items sorted by priority factor levels (n = ",
# nrow(msna2019), ")")) +
labs(y = "counts") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")