Summary

In needs assessments, respondents are frequently asked to rank unmet needs, survival challenges or sectoral priorities. The so-called Plackett-Luce model estimates from the ranking data their relative strength for the sample or population. The strength is measured at the ratio level; i.e., ratios between items are legitimate. In addition, Plackett-Luce trees identify and visualize statistically distinct need / priority etc. profiles by social group in response to vulnerability attributes measured at the unit level (e.g., households) (these attributes are the “covariates” in the model). This is an example of a simple tree:

Example of a Plackett-Luce tree. Data from a needs assessment in Rohingya refugee camps in 2019 (see main part). y-axis values express the strengths of priorities, with mean = 1/12 (horizontal line). Households headed by unmarried women place higher urgency on shelter improvements (item no.2) than other household types.

Figure 1: Example of a Plackett-Luce tree. Data from a needs assessment in Rohingya refugee camps in 2019 (see main part). y-axis values express the strengths of priorities, with mean = 1/12 (horizontal line). Households headed by unmarried women place higher urgency on shelter improvements (item no.2) than other household types.

Our demonstration dataset is from a Multi-Sectoral Needs Assessment (MSNA) in the Rohingya refugee camps in Bangladesh in 2019. We provide R code for both the unconditional strength estimates as well as for group-sensitive trees. Plackett-Luce models may be particularly relevant in assessments with limited access (Covid-19!) and short questionnaires that still have room for questions about first and second (and ideally also third) priorities. We also solve an apparent shortcoming in the original R package (PlackettLuce) options for the tree legend, making for attractive tree plots in presentations and reports. We show how to filter observations such that the trees can be calculated using sampling weights, a key requirement for survey data analysis.

We present the code in chapters and chapter segments, with numerous annotations for the technically interested reader. We point out several potential issues and illustrate them in the appendix:

  1. Some households will state only one priority. Plackett-Luce excludes these observations. We demonstrate for our dataset that the missing priorities are not random. If these observations are numerous, they should be investigated for differences in first priorities and in co-variates vs. the included, to assess the potential bias.

  2. Analysts may want to select from the set of ranked items, in order to focus on the highest priorities. The more items are excluded, the smaller the proportion of observations with at least two remaining priorities. We demonstrate that with more exclusions the effective sample size diminishes acceleratingly.

  3. We investigate the impact of assessment design on priority profiles. Qualitative studies in the Rohingya camps (Acaps et.al. 2019) compellingly demonstrated the particular vulnerability of households headed by unmarried women. The quantitatively oriented MSNA paired enumerators and respondents by gender. The tied enumerator/respondent gender is liable to act as a confounder, given the correlation between the genders of respondent and household head. We find that the design variable overwhelms the substantive covariates. However, the bias can be circumvented to a degree and even affords new insight. Analysts should compare Plackett-Luce trees between versions that include suspected confounders in the covariates vs. those that don’t.

  4. Like other statistical models, notably regression and cluster analysis, Plackett-Luce trees are sensitive to covariate transformations, particularly categorizations of numeric variables. We show the effect of different transformations, some neutral, others leading to vastly different trees.

Acknowledgment

As a novice R user, I would not have been able to write this code without the extensive help of my son Attilio Benini. The two of us gradually developed this project over several months together. In the text, therefore, I use first-plural pronouns.

I thank Sudeep Shrestha, Acaps, for supplying me with a clean and documented dataset as the basis of our demonstration data.

I am grateful also to Edouard Legoupil, UNHCR, who in December 2019 added a comparison between the Borda Count, another measure of sectoral priorities, and the Plackett-Luce model, providing R code for both (he modestly attributed this contribution to “Benini & Sons”, but the novel elements are all his) (Legoupil 2019).

I thank Prof. Heather Turner, co-creator of the PlackettLuce package, for advice and code examples (personal email).

Neither Attilio nor myself have a financial interest in this project. Readers should be free to use and modify the code as they see fit, with a reference to our work and to the authors of the Plackett-Luce package (see references), where appropriate.

Aldo Benini

How to obtain data and plain code

The .Rmd generating this .html document as well as the relevant .RData file (which contains the demonstration data and its metadata) are available as a zip-file from the same Web page as the .html. Make sure to extract them to the same working directory. To extract the plain code from this .Rmd, use the code below within the same environment as the .Rmd (copy to a script or use the console).

"PLtrees_msna.Rmd" %in% list.files(getwd())
# check whether the .Rmd to extract is within the current environment

knitr::knit(
  input  = "PLtrees_msna.Rmd",
  tangle = TRUE,
  quiet  = TRUE,
  output = "PLtrees_msna_code_extracted.R"
  )

The code was written using R version 4.0.1. and RStudio version 1.3.959. The dependencies of the used packages require R version >= 3.2.3. The used Pandoc version was 2.9.2, which came with RStudio.

Introduction

Background

The Plackett-Luce model was presented to the humanitarian data analyst community in Benini, A. (2019) “Priorities and preferences in humanitarian needs assessments - Their measurement by rating and ranking methods”. However, at the time we did not understand the R code for Plackett-Luce trees, an analytical as well as graphical technique to identify significantly different preference profiles, measured by ranked items, in response to covariates of the units (e.g., households) expressing the preferences. This facility is part of the R package PlackettLuce, which calls extensively on the packages psychotree, partykit and qvcalc. It was developed by, and described in, Turner, van Etten, Firth and Kosmidis (2018b). Some of the concepts, in particular ranking networks, are succinctly explained in Turner et al. (2018a).

As the authors explain, the algorithm producing the Plackett-Luce trees conducts an iterative process:

  1. Fit a Plackett-Luce model to the full data.

  2. Assess the stability of the worth parameters with respect to each available covariate.

  3. If there is significant instability, split the full data by the covariate with the strongest instability and use the cut-point with the highest improvement in model fit.

  4. Repeat steps 1-3 until there are no more significant instabilities, or a split produces a sub-group below a given size threshold (Turner et al., 2018b, op.cit).

Point no. 3 is key. The association of priorities with other variables - notably the measured attributes of vulnerable groups - can, of course, be explored with the means of descriptive statistics, and group profiles with statistics of priorities and attributes of interest can be contrasted. Plackett-Luce adds value in two respects:

First, it expresses the sample- or population-wide preference for each ranked item as a value on a latent variable (called “worth”), calculated from the stated item ranks.

Second, its tree algorithm finds the most strongly distinct group profiles on the basis of the user-defined set of attributes, retaining those attributes which simultaneously cause the strongest distinctions. Not only does it find the subset; it also determines the sequence among the retained attributes that divide the sample into successively refined groups as well as the attribute values at which these splits are made.

The result is a set of “worth” estimates for each item in each group so defined. The “worth” measure is ratio-level, so that items can be compared as ratios (i.e., not only as differences without natural zero point). Similarly, for a given item, groups can be compared by ratios between the worth that each of them assigns to it (e.g., the relative priority that men and woman give to food aid).

Finally, the algorithm can be used to visualize retained attributes, splitting values and group-specific preference profiles in tree plots that resemble graphic decision trees.

In a minor change of terminology, we replace in what follows “worth” with “strength”, as in the strength with which participants in humanitarian assessments hold their beliefs, preferences and priorities. We believe that this wording is more appropriate to humanitarian assessments.

This document demonstrates code for estimating the unconditional (no group differences) Plackett-Luce model (PlackettLuce) and for Plackett-Luce trees (pltree), which identify groups with distinct profiles. Our dataset is a segment from the 2019 Rohingya MSNA (ISCG et al. 2019b).

One of us extensively worked with those data using Stata 15. We mention this because Stata estimates some version of Plackett-Luce models with the command “rologit — Rank-ordered logistic regression”; in the statistical community, this model is also known as the “exploded logit model” and as the “choice-based method of conjoint analysis”. However, both the data organization - rologit demands long-shape data - and the model differ. rologit requires that the independent variables vary between the items of a given respondent, at least for some of the respondents. For instance, a respondent’s gender cannot directly affect his or her rankings; it could affect the rankings only via an interaction with a variable that does vary over items. The R version, however, does not require that. Moreover, it offers extensive data preparation tools.

In a recent paper, the authors of PlackettLuce (Turner et.al., 2020) compare its features to those of other ranking-data software. In its appendix, interested R users will find code examples for several alternative procedures. We do not discuss these here.

Packages

These packages are required for the subsequent code to run properly. For brevity, the output including conflict warnings is suppressed. If users apprehend conflicts between functions of these packages, they may want to run the below code without suppressPackageStartupMessages(), but with library(package = X, character.only = T).

packages_required <- c(
  "tidyverse",      # Packages for data manipulation and plotting
  "rmarkdown",      # Convert R Markdown documents into HTML, MS Word, PDF and Beamer.
  "knitr",          # General-Purpose Package for Dynamic Report Generation in R
  "bookdown",       # Authoring Books and Technical Documents with R Markdown
  "kableExtra",     # For constructing complex tables
  "summarytools",   # For writing some types of tables
  "PlackettLuce",   # Evaluates strength (worth) of priorities and computes Plackett-Luce trees
  "png",            # Read and write PNG images 
  "ridittools"      # To construct the riddit-transformed education proxy
  )  
# Determine new packages to install:
new_packages <- packages_required[!packages_required %in% installed.packages()]
# If any new ones are indeed required, install them:
if(length(new_packages) > 0){install.packages(new_packages)}
# Load / activate the required packages:
lapply(X = packages_required,
       function(X)
         suppressPackageStartupMessages(
           library(package = X, character.only = T)
           )
       )

Data

Source

The dataset used here is a segment of the shared December 2019 version of the dataset from the “Joint Multi-Sector Needs Assessment (J-MSNA) - Rohingya Refugees - September 2019.” Cox’s Bazar, Bangladesh (ISCG and et.al. 2019b). A recently (March 2020) updated version of the dataset (in .xlsx format) can be downloaded here.

Original and imported variables

The original dataset is large, with 536 variables at the household level. We explored a small selection in the statistical application Stata v.15.1. The dataset, initially with 46 variables kept and all 3,418 observations, was imported to R using the function readstata13::read.dta13. We removed all variables but the few needed here. For greater anonymity, we replaced the original record ID with a sequential numeric one and removed the camp markers. No records were entirely deleted; thus n = 3418.

This .Rmd adds several more calculated variables during execution.

load("msna2019.RData")
# Load data plus meta-data from a file stored in the same directory as this .Rmd

Although this ID is unique by design, in other circumstances it will be reassuring to do a check, using a line of code like this:

n_distinct(msna2019$id, na.rm = TRUE) == nrow(msna2019)
## [1] TRUE

Variables of interest

The assessment enumerators requested the respondents to name their first three priorities from a list of 12 needs areas. These twelve are, in the logic of Plackett-Luce, the items. A minimum of two ranked items is needed for an observation to be in the effective estimation sample. For the way the priorities were coded, see further below.

We include five covariates in most Plackett-Luce tree models, as well as a sixth (a design variable) in some, in order to make a methodological point. Descriptive statistics of the covariates follow after this table; for the items we give produce them in histograms further below.

Table 1: Variables of dataset msna2019 as documented in the dataset msna2019_meta_data. Both datasets are included in the file msna2019.RData, which is downloadable with …. (s. above. ….)
variable description
id ID (record identifier)
sampling_weight Sampling weight
resp_gender Gender of the respondent (design variable). 2 categories: ‘Male’, ‘Female’.
hh_gender_marital Household head gender and marital status (key vulnerability indicator). 4 categories: ‘Fem-Single’, ‘Fem-Married’, ‘Male-Single’, ‘Male-Married’.
males_18_59 Males 18-59 years in household (household capacity indicator). 2 categories: ‘At least one’, ‘None’.
educ_max_years Max. years education all household members (excl. madrassa) (household capacity indicator). Range years (0-12) of schooling.
working Some household members working (household capacity indicator). 2 categories: ‘At least one’, ‘None’.
fcs Food Consumption Score (FCS) (household poverty indicator)
items
food Food
shelter Shelter
water Water
latrines Latrines
electricity Electricity
cooking Cooking utensils
clothing Clothing
health_care Health care
education Education
safety Safety
fuel Cooking fuel
income Income

Sample

“A total of 3418 households, composed of 17162 individuals, were surveyed across 34 camps. The households were identified through simple random sampling of Open Street Map (OSM) shelter footprints. Each survey was conducted with an adult household representative responding on behalf of the household and its members.” (Excel data file, sheet “Read Me”).

Descriptive statistics of the covariates

as well as of the sampling weights and of the design variable resp_gender

## Data Frame Summary  
## msna2019  
## Dimensions: 3418 x 7  
## Duplicates: 62  
## 
## -------------------------------------------------------------------------------
## Variable             Stats / Values             Freqs (% of Valid)    Missing  
## -------------------- -------------------------- --------------------- ---------
## fcs                  Mean (sd) : 44.5 (11.1)    149 distinct values   0        
## [numeric]            min < med < max:                                 (0%)     
##                      13 < 43.5 < 103                                           
##                      IQR (CV) : 13 (0.3)                                       
## 
## working              1. At least one            1827 (53.4%)          0        
## [factor]             2. None                    1591 (46.6%)          (0%)     
## 
## educ_max_years       Mean (sd) : 3.2 (3.1)      13 distinct values    1        
## [integer]            min < med < max:                                 (0.03%)  
##                      0 < 3 < 12                                                
##                      IQR (CV) : 5 (1)                                          
## 
## males_18_59          1. At least one            2937 (85.9%)          0        
## [factor]             2. None                     481 (14.1%)          (0%)     
## 
## hh_gender_marital    1. Fem-Single               235 ( 6.9%)          0        
## [factor]             2. Fem-Married              331 ( 9.7%)          (0%)     
##                      3. Male-Single               77 ( 2.2%)                   
##                      4. Male-Married            2775 (81.2%)                   
## 
## resp_gender          1. Male                    1669 (48.8%)          0        
## [factor]             2. Female                  1749 (51.2%)          (0%)     
## 
## sampling_weight      Mean (sd) : 1 (0.4)        34 distinct values    0        
## [numeric]            min < med < max:                                 (0%)     
##                      0.2 < 1 < 1.9                                             
##                      IQR (CV) : 0.7 (0.4)                                      
## -------------------------------------------------------------------------------

Item statistics follow below in graphic form.

Transformed covariates: Education proxy

Plackett-Luce trees are sensitive to transformations of covariates (the independent variables of the model). Different trees may result from the original numeric vs. from derived categorical versions of the same covariate. This is best discussed by way of example. For this we use the education of household members, a household capacity indicator in our dataset.

Education is likely to influence the way the respondent reasons about priorities, both cognitively (how he/she understands questions and frames the response) as well as through the effect that education has on the actual household situation, particularly in employment and dealings with outside agencies.

However, members’ levels have been observed incompletely. The MSNA was limited to a proxy measure, with one recorded observation per household. This is the maximum number of years of schooling completed among the members (“educ_max_years”).

Max. of years of formal education among the members of the household (n = 3418).

Figure 2: Max. of years of formal education among the members of the household (n = 3418).

“educ_max_years” is a count variable with multiple peaks. Already in the orginal dataset, a dichotomous indicator “educ_min_6_years” was calculated, whether the maximum number of years of schooling was higher than Class 5, or not.

Neither the raw count nor the chosen break point for the indicator are necessarily optimal for finding distinct priority patterns along the education dimension. Therefore, we construct two more derivatives from “educ_max_years”, one continuous, the other dichotomous.

The continuous version, named “educ_ridit”, proxies for the education level of the household. Formally, the “educ_ridit” is defined as the riddit of “educ_max_years”. In different formulation, it is is the probability that the household’s “educ_max_years” is higher than that of a randomly chosen household plus half the probability that it is the same.

The intuition is that this measure is likely substantially correlated with the unknown average education level of the adult members and, if so, with the ability to interact usefully with outside agencies and potential employers and clients.

“educ_max_years” and “educ_ridit” map strictly monotonously to each other. The advantage of “educ_ridit” is that the distances between adjacent values are symptomatic of the effort to get from one level to the next in the education system, such as from never-school to first grade completed vs. first to second grade, etc.

As such, “educ_ridit” may be better suited to reveal education-related effects on priority profiles. It is a continuous variables within the interval (0, 1) whereas educ_max_years is a count variable with min = 0. But will this produce a different tree from that obtained using the original count variable?

Further, we dichotomize “educ_max_years” in “any_school_ever”, which takes the value “No one” if no one in the household has ever completed Class 1, and “At least one” otherwise. The intuition is that total non-exposure to the formal education system indexes a wider lack of social capital, carried over from Myanmar and unremedied in camp life. It thus expresses more than the education status.

This code creates a comparison table for the four measures and merges the two new ones back to the dataframe “msna2019”.

# generate factor educ_min_6_years
msna2019 %>%
  mutate(
    educ_min_6_years = ifelse(educ_max_years < 6, "0-5 years", "min. 6 years")  %>%
                       factor(levels = c("0-5 years", "min. 6 years"))
  ) -> msna2019

(msna2019 %>% 
  filter(!is.na(educ_max_years)) %>%
  group_by(educ_max_years, educ_min_6_years) %>% 
  count() %>% 
  ungroup() %>%
  mutate(
    proportion      = n/sum(n),
    educ_ridit      = ridittools::toridit(proportion),
    any_school_ever = ifelse(educ_max_years == 0, "No one", "At least one"),
    any_school_ever = factor(x = any_school_ever, levels = c("No one", "At least one"))
    ) %>% 
  select(
    educ_max_years,
    n:educ_ridit,
    educ_min_6_years,
    any_school_ever
    ) -> tab_educ_ridit)
## # A tibble: 13 x 6
##    educ_max_years     n proportion educ_ridit educ_min_6_years any_school_ever
##             <int> <int>      <dbl>      <dbl> <fct>            <fct>          
##  1              0  1042    0.305        0.152 0-5 years        No one         
##  2              1   226    0.0661       0.338 0-5 years        At least one   
##  3              2   359    0.105        0.424 0-5 years        At least one   
##  4              3   373    0.109        0.531 0-5 years        At least one   
##  5              4   393    0.115        0.643 0-5 years        At least one   
##  6              5   273    0.0799       0.740 0-5 years        At least one   
##  7              6   155    0.0454       0.803 min. 6 years     At least one   
##  8              7   165    0.0483       0.850 min. 6 years     At least one   
##  9              8   152    0.0445       0.896 min. 6 years     At least one   
## 10              9    77    0.0225       0.930 min. 6 years     At least one   
## 11             10   162    0.0474       0.965 min. 6 years     At least one   
## 12             11    13    0.00380      0.990 min. 6 years     At least one   
## 13             12    27    0.00790      0.996 min. 6 years     At least one
tab_educ_ridit %>% 
  select(educ_max_years, educ_ridit, any_school_ever) %>% 
  left_join(x = msna2019, y = ., by = "educ_max_years") -> msna2019

For space reasons, we run models each using one of the four versions of the education proxy in this appendix. We find that Plackett-Luce trees indeed are not indifferent to all covariate transformations:

For bijective numeric-to-numeric transformations, the trees are the same (except for the re-scaled split values). In our example, the trees with “educ_max_years” and “educ_ridit” both come with the exactly same four terminal nodes. Note that this robustness is not trivial; it does not hold for other types of models, e.g. regression on, or cluster analysis with, transformed covariates.

Categorization of originally numeric covariates, however, is consequential. The specification with “educ_min_6_years” returns a tree with two terminal nodes; with “any_school_ever” Plackett-Luce splits the sample into two groups only.

The point to retain is that we ought to be wary of reducing numeric covariates to categorical ones in the preparation stage. First we may want to try out a Plackett-Luce tree using the nu