Summary

In needs assessments, respondents are frequently asked to rank unmet needs, survival challenges or sectoral priorities. The Plackett-Luce model estimates, from the ranking data, the relative strength of the ranked items for the sample or population. Strength is measured at the ratio level; i.e., ratios between items are meaningful. In addition, Plackett-Luce trees identify and visualize statistically distinct need or priority profiles by social group, in response to vulnerability attributes measured at the unit level (e.g., households); these attributes are the “covariates” of the model. This is an example of a simple tree:

Figure 1: Example of a Plackett-Luce tree. Data from a needs assessment in Rohingya refugee camps in 2019 (see main part). y-axis values express the strengths of priorities, with mean = 1/12 (horizontal line). Households headed by unmarried women place higher urgency on shelter improvements (item no.2) than other household types.

Our demonstration dataset is from a Multi-Sectoral Needs Assessment (MSNA) in the Rohingya refugee camps in Bangladesh in 2019. We provide R code both for the unconditional strength estimates and for group-sensitive trees. Plackett-Luce models may be particularly relevant in assessments with limited access (Covid-19!) and short questionnaires that still have room for questions about first and second (and ideally also third) priorities. We also work around an apparent shortcoming in the tree legend options of the original R package (PlackettLuce), making for attractive tree plots in presentations and reports. And we show how to filter observations such that the trees can be calculated with sampling weights, a key requirement for survey data analysis.

We present the code in chapters and chapter segments, with numerous annotations for the technically interested reader. We point out several potential issues and illustrate them in the appendix:

  1. Some households will state only one priority. The Plackett-Luce model excludes these observations. We demonstrate for our dataset that the missing priorities are not random. If such observations are numerous, they should be compared with the included ones with respect to first priorities and covariates, in order to assess the potential bias.

  2. Analysts may want to select from the set of ranked items, in order to focus on the highest priorities. The more items are excluded, the smaller the proportion of observations with at least two remaining priorities. We demonstrate that with every additional exclusion the effective sample size shrinks at an accelerating rate.

  3. We investigate the impact of assessment design on priority profiles. Qualitative studies in the Rohingya camps (Acaps et al. 2019) compellingly demonstrated the particular vulnerability of households headed by unmarried women. The quantitatively oriented MSNA paired enumerators and respondents by gender. This tied enumerator/respondent gender is liable to act as a confounder, given the correlation between the genders of respondent and household head. We find that the design variable overwhelms the substantive covariates. However, the bias can be circumvented to a degree and even affords new insight. Analysts should compare Plackett-Luce trees between versions that include suspected confounders among the covariates and versions that do not.

  4. Like other statistical models, notably regression and cluster analysis, Plackett-Luce trees are sensitive to covariate transformations, particularly categorizations of numeric variables. We show the effect of different transformations, some neutral, others leading to vastly different trees.

Acknowledgment

As a novice R user, I would not have been able to write this code without the extensive help of my son Attilio Benini. The two of us gradually developed this project together over several months. In the text, therefore, I use first-person plural pronouns.

I thank Sudeep Shrestha, Acaps, for supplying me with a clean and documented dataset as the basis of our demonstration data.

I am grateful also to Edouard Legoupil, UNHCR, who in December 2019 added a comparison between the Borda Count, another measure of sectoral priorities, and the Plackett-Luce model, providing R code for both (he modestly attributed this contribution to “Benini & Sons”, but the novel elements are all his) (Legoupil 2019).

I thank Prof. Heather Turner, co-creator of the PlackettLuce package, for advice and code examples (personal email).

Neither Attilio nor I have a financial interest in this project. Readers should be free to use and modify the code as they see fit, with a reference to our work and to the authors of the PlackettLuce package (see references), where appropriate.

Aldo Benini

How to obtain data and plain code

A downloadable zip-file containing

  • PLtrees_msna.Rmd generating this .html,
  • PLtrees_msna.Rproj,
  • and the file msna2019.RData, holding the demonstration data msna2019 and its metadata msna2019_meta_data

is available from the same web page as this .html. Make sure to extract the zipped files to the same working directory.

The .html does not show the entire code of the .Rmd. The complete plain code can be extracted by running the code below in the same working directory as the .Rmd (copy it to a script or use the console).

"PLtrees_msna.Rmd" %in% list.files(getwd())
# check whether the .Rmd to extract is within the current environment

knitr::knit(
  input  = "PLtrees_msna.Rmd",
  tangle = TRUE,
  quiet  = TRUE,
  output = "PLtrees_msna_code_extracted.R"
  )
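
Equivalently, knitr::purl() - a wrapper around knit(tangle = TRUE) - extracts the plain code in a single call (shown here with the same file names as above):

# One-call alternative to the chunk above:
knitr::purl(
  input  = "PLtrees_msna.Rmd",
  output = "PLtrees_msna_code_extracted.R",
  quiet  = TRUE
  )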

The code was written using R version 4.0.2 and RStudio version 1.3.959. The dependencies of the packages used require R version >= 3.2.3. The Pandoc version used was 2.9.2, which came bundled with RStudio.

Introduction

Background

The Plackett-Luce model was presented to the humanitarian data analyst community in Benini, A. (2019) “Priorities and preferences in humanitarian needs assessments - Their measurement by rating and ranking methods”. However, at the time we did not understand the R code for Plackett-Luce trees, an analytical as well as graphical technique to identify significantly different preference profiles, measured by ranked items, in response to covariates of the units (e.g., households) expressing the preferences. This facility is part of the R package PlackettLuce. It was developed by, and described in, Turner, van Etten, Firth and Kosmidis (2018b). Some of the concepts, in particular ranking networks, are succinctly explained in Turner et al. (2018a).

As the authors explain, the algorithm producing the Plackett-Luce trees conducts an iterative process:

  1. Fit a Plackett-Luce model to the full data.

  2. Assess the stability of the worth parameters with respect to each available covariate.

  3. If there is significant instability, split the data by the covariate with the strongest instability, at the cut-point yielding the highest improvement in model fit.

  4. Repeat steps 1-3 until there are no more significant instabilities, or until a split would produce a sub-group below a given size threshold (Turner et al. 2018b, op. cit.).

Point no. 3 is key. The association of priorities with other variables - notably the measured attributes of vulnerable groups - can, of course, be explored by means of descriptive statistics, and group profiles (statistics of the priorities together with the attributes of interest) can be contrasted. Plackett-Luce adds value in two respects:

First, it expresses the sample- or population-wide preference for each ranked item as a value on a latent variable (called “worth”), calculated from the stated item ranks.

Second, its tree algorithm finds the most strongly distinct group profiles on the basis of the user-defined set of attributes, retaining those attributes which simultaneously cause the strongest distinctions. Not only does it find this subset; it also determines the sequence in which the retained attributes divide the sample into successively refined groups, as well as the attribute values at which these splits are made.

The result is a set of “worth” estimates for each item in each group so defined. The “worth” measure is ratio-level, so that items can be compared as ratios (i.e., not only as differences without a natural zero point). Similarly, for a given item, groups can be compared by the ratio between the worths that each of them assigns to it (e.g., the relative priority that men and women give to food aid).
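
For example, with the strengths estimated further below (food ≈ 0.39, shelter ≈ 0.14 on the normalized scale), food is roughly 0.39 / 0.14 ≈ 2.7 times as strong a priority as shelter.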

Finally, the algorithm can be used to visualize retained attributes, splitting values and group-specific preference profiles in tree plots that resemble graphic decision trees.

In a minor change of terminology, we replace in what follows “worth” with “strength”, as in the strength with which participants in humanitarian assessments hold their beliefs, preferences and priorities. We believe that this wording is more appropriate to humanitarian assessments.

This document demonstrates code for estimating the unconditional (no group differences) Plackett-Luce model (PlackettLuce) and for Plackett-Luce trees (pltree), which identify groups with distinct profiles. Our dataset is a segment from the 2019 Rohingya MSNA (ISCG et al. 2019b).

One of us extensively worked with those data using Stata 15. We mention this because Stata estimates some version of Plackett-Luce models with the command “rologit — Rank-ordered logistic regression”; in the statistical community, this model is also known as the “exploded logit model” and as the “choice-based method of conjoint analysis”. However, both the data organization - rologit demands long-shape data - and the model differ. rologit requires that the independent variables vary between the items of a given respondent, at least for some of the respondents. For instance, a respondent’s gender cannot directly affect his or her rankings; it could affect the rankings only via an interaction with a variable that does vary over items. The R version, however, does not require that. Moreover, it offers extensive data preparation tools.

In a recent paper, the authors of PlackettLuce (Turner et al. 2020) compare its features to those of other ranking-data software. In its appendix, interested R users will find code examples for several alternative procedures. We do not discuss these here.

Packages

These packages are required for the subsequent code to run properly:

packages_required <- c(
  "tidyverse",      # Packages for data manipulation and plotting
  "ggrepel",        # Automatically Position Non-Overlapping Text Labels with 'ggplot2'
  "rmarkdown",      # Convert R Markdown documents into HTML, MS Word, PDF and Beamer.
  "knitr",          # General-Purpose Package for Dynamic Report Generation in R
  "bookdown",       # Authoring Books and Technical Documents with R Markdown
  "kableExtra",     # For constructing complex tables
  "summarytools",   # For writing some types of tables
  "PlackettLuce",   # Evaluates strength (worth) of priorities and computes Plackett-Luce trees
  "png",            # Read and write PNG images 
  "ridittools"      # To construct the riddit-transformed education proxy
  )  
# Determine new packages to install:
new_packages <- packages_required[!packages_required %in% installed.packages()]
# If any new ones are indeed required, install them:
if(length(new_packages) > 0){install.packages(new_packages)}
# Load / activate the required packages:
lapply(X = packages_required,
       function(X)
         suppressPackageStartupMessages(
           library(package = X, character.only = T)
           )
       )

For brevity, in the above code chunk the output including conflict warnings is suppressed. To see them, run this code line in the console:

lapply(X = packages_required, function(X) library(package = X, character.only = T))

Data

Source

The dataset used here is a segment of the shared December 2019 version of the dataset from the “Joint Multi-Sector Needs Assessment (J-MSNA) - Rohingya Refugees - September 2019”, Cox’s Bazar, Bangladesh (ISCG et al. 2019b). A recently (March 2020) updated version of the dataset (in .xlsx format) can be downloaded here.

Original and imported variables

The original dataset is large, with 536 variables at the household level. We removed all variables but the 20 used here. We replaced the original record ID with a sequential numeric one and removed the camp markers. No records were entirely deleted; thus n = 3418.

Several more calculated variables are added during execution.

load("msna2019.RData")
# Load data plus meta-data from a file stored in the same directory as this .Rmd

Although this ID is unique by design, in other circumstances it will be reassuring to do a check, using a line of code like this:

n_distinct(msna2019$id, na.rm = TRUE) == nrow(msna2019)
## [1] TRUE

Variables of interest

The assessment enumerators requested the respondents to name their first three priorities from a list of 12 needs areas. These twelve are, in the logic of Plackett-Luce, the items. A minimum of two ranked items is needed for an observation to be in the effective estimation sample. For the way the priorities were coded, see further below.

We include up to five covariates in the Plackett-Luce tree models, as well as a sixth (a design variable) in some, in order to make a methodological point. Descriptive statistics of the covariates follow after this table; for the items we give them in histograms further below.

Table 1: Variables of dataset msna2019 as documented in msna2019_meta_data. Both datasets are included in the file msna2019.RData, which is downloadable as part of this zip-file.
variable description
id ID (record identifier)
sampling_weight Sampling weight
resp_gender Gender of the respondent (design variable). 2 categories: ‘Male’, ‘Female’.
hh_gender_marital Household head gender and marital status (key vulnerability indicator). 4 categories: ‘Fem-Single’, ‘Fem-Married’, ‘Male-Single’, ‘Male-Married’.
males_18_59 Males 18-59 years in household (household capacity indicator). 2 categories: ‘At least one’, ‘None’.
educ_max_years Max. years education all household members (excl. madrassa) (household capacity indicator). Range years (0-12) of schooling.
working Some household members working (household capacity indicator). 2 categories: ‘At least one’, ‘None’.
fcs Food Consumption Score (FCS) (household poverty indicator)
items (each coded 1 = first, 2 = second, 3 = third priority; 0 = unranked)
food Food
shelter Shelter
water Water
latrines Latrines
electricity Electricity
cooking Cooking utensils
clothing Clothing
health_care Health care
education Education
safety Safety
fuel Cooking fuel
income Income

Sample

“A total of 3418 households, composed of 17162 individuals, were surveyed across 34 camps. The households were identified through simple random sampling of Open Street Map (OSM) shelter footprints. Each survey was conducted with an adult household representative responding on behalf of the household and its members.” (Excel data file, sheet “Read Me”).

Descriptive statistics of the covariates

The summary below covers the covariates as well as the sampling weights and the design variable resp_gender.

## Data Frame Summary  
## msna2019  
## Dimensions: 3418 x 7  
## Duplicates: 62  
## 
## -------------------------------------------------------------------------------
## Variable             Stats / Values             Freqs (% of Valid)    Missing  
## -------------------- -------------------------- --------------------- ---------
## fcs                  Mean (sd) : 44.5 (11.1)    149 distinct values   0        
## [numeric]            min < med < max:                                 (0%)     
##                      13 < 43.5 < 103                                           
##                      IQR (CV) : 13 (0.3)                                       
## 
## working              1. At least one            1827 (53.4%)          0        
## [factor]             2. None                    1591 (46.6%)          (0%)     
## 
## educ_max_years       Mean (sd) : 3.2 (3.1)      13 distinct values    1        
## [integer]            min < med < max:                                 (0.03%)  
##                      0 < 3 < 12                                                
##                      IQR (CV) : 5 (1)                                          
## 
## males_18_59          1. At least one            2937 (85.9%)          0        
## [factor]             2. None                     481 (14.1%)          (0%)     
## 
## hh_gender_marital    1. Fem-Single               235 ( 6.9%)          0        
## [factor]             2. Fem-Married              331 ( 9.7%)          (0%)     
##                      3. Male-Single               77 ( 2.2%)                   
##                      4. Male-Married            2775 (81.2%)                   
## 
## resp_gender          1. Male                    1669 (48.8%)          0        
## [factor]             2. Female                  1749 (51.2%)          (0%)     
## 
## sampling_weight      Mean (sd) : 1 (0.4)        34 distinct values    0        
## [numeric]            min < med < max:                                 (0%)     
##                      0.2 < 1 < 1.9                                             
##                      IQR (CV) : 0.7 (0.4)                                      
## -------------------------------------------------------------------------------

Item statistics follow below in graphic form.

Transformed covariates: Education proxy

Plackett-Luce trees are sensitive to transformations of covariates (the independent variables of the model). Different trees may result from the original numeric vs. from derived categorical versions of the same covariate. This is best discussed by way of example. For this we use the education of household members, a household capacity indicator in our dataset.

Education is likely to influence the way the respondent reasons about priorities, both cognitively (how he/she understands questions and frames the response) as well as through the effect that education has on the actual household situation, particularly in employment and dealings with outside agencies.

However, the members’ education levels were observed only incompletely. The MSNA was limited to a proxy measure, with one recorded observation per household: the maximum number of years of schooling completed among the members (educ_max_years).

Figure 2: Max. of years of formal education among the members of the household (n = 3418).

educ_max_years is a count variable with multiple peaks. Already in the original dataset, a dichotomous indicator, educ_min_6_years, was calculated, recording whether or not the maximum number of years of schooling was higher than Class 5.

Neither the raw count nor the chosen break point for the indicator is necessarily optimal for finding distinct priority patterns along the education dimension. Therefore, we construct two more derivatives from educ_max_years, one continuous, the other dichotomous.

The continuous version, named educ_ridit, proxies for the education level of the household. Formally, educ_ridit is defined as the ridit of educ_max_years. Put differently, it is the probability that the household’s educ_max_years is higher than that of a randomly chosen household, plus half the probability that it is the same.
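
As a quick check of this definition, ridits can be computed by hand from the category proportions; the result agrees with ridittools::toridit, which we use further below (a toy sketch with made-up proportions):

# Toy check of the ridit definition: proportion of the lower categories
# plus half the proportion of the category itself.
p_toy <- c(0.5, 0.3, 0.2)          # made-up proportions of three ordered categories
cumsum(p_toy) - p_toy / 2          # by-hand ridits: 0.25, 0.65, 0.90
ridittools::toridit(p_toy)         # same values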

The intuition is that this measure is likely substantially correlated with the unknown average education level of the adult members and, if so, with the ability to interact usefully with outside agencies and potential employers and clients.

educ_max_years and educ_ridit map strictly monotonically onto each other. The advantage of educ_ridit is that the distances between adjacent values reflect the effort needed to get from one level to the next in the education system, such as from no schooling to first grade completed vs. from first to second grade, etc.

As such, educ_ridit may be better suited to reveal education-related effects on priority profiles. It is a continuous variable within the interval (0, 1), whereas educ_max_years is a count variable with min = 0. But will this produce a different tree from that obtained using the original count variable?

Further, we dichotomize educ_max_years in any_school_ever, which takes the value “No one” if no one in the household has ever completed Class 1, and “At least one” otherwise. The intuition is that total non-exposure to the formal education system indexes a wider lack of social capital, carried over from Myanmar and unremedied in camp life. It thus expresses more than the education status.

This code creates a comparison table for the four measures and merges the two new ones back to the dataframe msna2019.

# generate factor educ_min_6_years
msna2019 %>%
  mutate(
    educ_min_6_years = ifelse(educ_max_years < 6, "0-5 years", "min. 6 years"),
    educ_min_6_years = factor(educ_min_6_years, levels = c("0-5 years", "min. 6 years"))
  ) -> msna2019

(msna2019 %>% 
  filter(!is.na(educ_max_years)) %>%
  group_by(educ_max_years, educ_min_6_years) %>% 
  count() %>% 
  ungroup() %>%
  mutate(
    proportion      = n/sum(n),
    educ_ridit      = ridittools::toridit(proportion),
    any_school_ever = ifelse(educ_max_years == 0, "No one", "At least one"),
    any_school_ever = factor(x = any_school_ever, levels = c("No one", "At least one"))
    ) %>% 
  select(
    educ_max_years,
    n:educ_ridit,
    educ_min_6_years,
    any_school_ever
    ) -> tab_educ_ridit)
## # A tibble: 13 x 6
##    educ_max_years     n proportion educ_ridit educ_min_6_years any_school_ever
##             <int> <int>      <dbl>      <dbl> <fct>            <fct>          
##  1              0  1042    0.305        0.152 0-5 years        No one         
##  2              1   226    0.0661       0.338 0-5 years        At least one   
##  3              2   359    0.105        0.424 0-5 years        At least one   
##  4              3   373    0.109        0.531 0-5 years        At least one   
##  5              4   393    0.115        0.643 0-5 years        At least one   
##  6              5   273    0.0799       0.740 0-5 years        At least one   
##  7              6   155    0.0454       0.803 min. 6 years     At least one   
##  8              7   165    0.0483       0.850 min. 6 years     At least one   
##  9              8   152    0.0445       0.896 min. 6 years     At least one   
## 10              9    77    0.0225       0.930 min. 6 years     At least one   
## 11             10   162    0.0474       0.965 min. 6 years     At least one   
## 12             11    13    0.00380      0.990 min. 6 years     At least one   
## 13             12    27    0.00790      0.996 min. 6 years     At least one
tab_educ_ridit %>% 
  select(educ_max_years, educ_ridit, any_school_ever) %>% 
  left_join(x = msna2019, y = ., by = "educ_max_years") -> msna2019

For space reasons, we run the models, each using one of the four versions of the education proxy, in this appendix. We find that Plackett-Luce trees are indeed not indifferent to all covariate transformations:

For bijective numeric-to-numeric transformations, the trees are the same (except for the re-scaled split values). In our example, the trees with educ_max_years and educ_ridit both come with exactly the same four terminal nodes. Note that this robustness is not trivial; it does not hold for other types of models, e.g., regression on, or cluster analysis with, transformed covariates.

Categorization of originally numeric covariates, however, is consequential. The specification with educ_min_6_years returns a tree with two terminal nodes only; with any_school_ever Plackett-Luce splits the sample into three groups.

The point to retain is that we ought to be wary of reducing numeric covariates to categorical ones in the preparation stage. We may first want to try out a Plackett-Luce tree using the numeric versions. Only if these trees have no sensible interpretation should we work with the categorical transforms. The split values found with the numeric versions may even suggest meaningful categories.

Observations with missing values

For several reasons, we need an overview of the observations that have missing values. Usually one would create a table with a row for each combination of missing values and a column for their frequency, then consider further analysis strategies. Our demonstration msna2019 dataset has only one such observation; so we simply print it out in its entirety.

msna2019_incomplete <- msna2019 %>% filter(!complete.cases(.))
nrow(msna2019_incomplete)
## [1] 1
msna2019_incomplete
##     id sampling_weight resp_gender hh_gender_marital  males_18_59
## 1 1569        1.149119      Female       Fem-Married At least one
##   educ_max_years working fcs food shelter water latrines electricity cooking
## 1             NA    None  41    3       1     0        0           2       0
##   clothing health_care education safety fuel income educ_min_6_years educ_ridit
## 1        0           0         0      0    0      0             <NA>         NA
##   any_school_ever
## 1            <NA>

Thus only one observation has missing values: in the maximum number of years of schooling that any of the household members attained, and in the three transformations of this variable. The items have no missing values because PlackettLuce requires all unranked items in an observation to be coded 0.

That coding convention must not be confused with the problem that for a number of sample households only one priority was recorded. Plackett-Luce drops them from the estimation sample. Yet, these observations have value. If we do not analyze them separately, we may not know whether the results are significantly biased. This can happen when missing priorities are not random. We will discuss this in the next section.
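
For larger assessments with many incomplete observations, the pattern table mentioned at the beginning of this section (one row per combination of missing variables, with its frequency) can be built along these lines (a minimal base-R sketch; the object name miss_pattern is ours):

miss_pattern <- apply(is.na(msna2019), 1,
                      function(x) paste(names(msna2019)[x], collapse = ", "))
# frequencies of the non-empty patterns, most frequent first
sort(table(miss_pattern[miss_pattern != ""]), decreasing = TRUE)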

Preparations

Reconstructing priorities

In the original dataset, the priorities are organized as item rankings held in a separate variable for each item. They are coded naturally, i.e., first rank = 1, second = 2, etc., plus: unranked = 0. For the PL-tree model, it is convenient (although not absolutely necessary at this point) to reorganize the priorities into j = 1, 2, .., K polytomous variables, the j-th one holding all the observed items of j-th priority, K being the lowest rank given in the entire item set. In our case, with three priorities elicited, this boils down to the calculation of priority_1, priority_2 and priority_3. (A reminder for readers familiar with the Borda count: there the coding of the ranked items is reversed; for unranked ones, it is 0, too.) This arrangement is also helpful for other preparation goals: ordered frequency plots, and a common factor coding across the three priorities.

With this code, we obtain the three priority variables such that they have …

msna2019 %>%
  select(id, food:income) %>%
  pivot_longer(-id, names_to = "item", values_to = "priority") %>%
  filter(priority != 0) %>%
  mutate(priority = paste0("priority_", priority)) %>%
  pivot_wider(names_from = priority, values_from = item, names_sort = TRUE) -> priorities

… the same set of factor levels. The levels are sorted by frequency of items in the first priority; subsequently, items that never were the first priority are sorted by their frequencies as second priority, etc.

priorities %>% 
  select(starts_with("priority")) %>% 
  mutate_all(fct_infreq) %>%
  # factorize each priority ordered by number of observations with its levels 
  fct_unify() %>%
  # unify the levels of the different priorities 
  data.frame(id = priorities$id, .) %>% 
  left_join(x = msna2019, y = ., by = "id") -> msna2019

We joined these derivatives by the id to their source dataframe (for households that gave fewer than three priorities (0-2), NA values are filled in as needed). And now we check whether these priorities really all have the same factor levels:

msna2019 %>%
  select(starts_with("priority")) %>% 
  lapply(levels) %>% 
  unique()
## [[1]]
##  [1] "food"        "shelter"     "electricity" "water"       "income"     
##  [6] "latrines"    "health_care" "cooking"     "fuel"        "education"  
## [11] "safety"      "clothing"

The combined histogram renders the distribution of priorities over the items:

msna2019 %>%
  select(starts_with("priority")) %>%
  pivot_longer(cols = names(.), names_to = "priority", values_to = "item") %>% 
  mutate(item = factor(item, levels = levels(msna2019$priority_1))) %>% 
  ggplot(aes(item, fill = item)) +
  geom_histogram(stat = "count") +
  facet_wrap(~priority, ncol = 1) +
  theme_bw() +
  # ggtitle(paste0("items sorted by priority factor levels (n = ",
  #               nrow(msna2019), ")")) +
  labs(y = "counts") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")

Figure 3: Items sorted by priority factor levels (n = 3418).

ggsave(filename = "figure_priorities.png", 
       width = 6, height = 5)

Non-randomly missing priorities

As a potential problem

In needs assessments, a number of respondents may state one priority only (or, due to an error, only one is recorded). This group is omitted from Plackett-Luce models because the model works with comparisons (rankings) of more than one item within each observation.

If this group is of considerable size (how large is “considerable” is impossible to say as a general rule), its omission can distort the results of estimated Plackett-Luce models. This happens when the group (e.g., of households) with only one stated priority significantly differs from those with two or more with respect to

  • the priorities themselves and/or
  • the covariates of the PL-trees.

In our data, about seven percent of the respondents gave only one priority. We can demonstrate that this group is indeed different in those two regards. For space reasons we provide the statistics in this appendix.

msna2019 %>%
  select(food:income) %>%
  # select all 12 items
  apply(1, function(x){sum(x > 0)}) -> msna2019$n_priorities
  # add variable with number of recorded priorities per observation

Figure 4: Respondents by number of priorities recorded.
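
A quick probe of whether the one-priority group differs on the covariates and on the first priorities is possible with standard tests; the fuller statistics are in the appendix. A sketch (the choice of tests here is illustrative):

# Food Consumption Score: households with fewer than two recorded priorities vs. the rest
t.test(fcs ~ (n_priorities >= 2), data = msna2019)
# First priorities: association with inclusion in the estimation sample
chisq.test(table(msna2019$priority_1, msna2019$n_priorities >= 2))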

Dropping observations with missing values

In the data passed to PlackettLuce and pltree, we drop observations with missing values in the ID, the items, the potential PL-tree covariates, as well as the variable n_priorities. We call this new dataframe msna2019_filtered. In a second step we drop observations with fewer than two recorded priorities (n_priorities).

nrow(msna2019)
## [1] 3418
vars_but_priorities <- msna2019 %>% select(!starts_with("priority")) %>% names()
# includes id, items, potential covariates and n_priorities
msna2019_filtered   <- msna2019 %>% filter(complete.cases(.[ ,vars_but_priorities]))
nrow(msna2019_filtered)
## [1] 3417
msna2019_filtered <- msna2019_filtered %>% filter(n_priorities >= 2)
nrow(msna2019_filtered)
## [1] 3188

This is the set of complete observations to estimate the models with the stated priorities and with the current set of covariates.

Plackett-Luce model

Selection of relevant items

For our demonstration of a PL-tree we work with all items from the dataset msna2019_filtered:

(items_sel <- levels(msna2019_filtered$priority_1))
##  [1] "food"        "shelter"     "electricity" "water"       "income"     
##  [6] "latrines"    "health_care" "cooking"     "fuel"        "education"  
## [11] "safety"      "clothing"

The chosen items are arranged in the same order as the factor levels of the reconstructed priorities, viz. by their first-priority frequencies, or if some items never occur as first priorities, by their second-priority frequencies, and analogously for only-third-priority items.

In general, one ought to be wary of excluding items from the original set on which the rankings were elicited. With every additional item excluded from the Plackett-Luce model, the number of viable observations is liable to decrease at an accelerating rate. This issue is illustrated in this appendix.
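
A quick way to gauge this cost for any candidate exclusion is to count the observations that would retain at least two ranked items (a sketch; the items excluded here and the object names are illustrative):

items_keep <- setdiff(items_sel, c("cooking", "clothing"))   # hypothetical exclusion
n_left <- msna2019_filtered %>%
  select(all_of(items_keep)) %>%
  apply(1, function(x){sum(x > 0)})
sum(n_left >= 2)    # observations that would still have at least two ranked items
mean(n_left >= 2)   # as a share of the current estimation sample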

Strength

In standard Plackett-Luce terminology, the quantity of greatest interest returned, the measure of preference strength, is called “worth” - the worth of a ranked item compared to that of other items. In the humanitarian context of assessing unmet needs, however, it seems more appropriate to think of the strength with which respondents hold their priorities. We, therefore, use the term “strength” instead of “worth”.

# Code for unweighted observations:
# strength <- PlackettLuce(msna2019_filtered[ ,items_sel])

# Code with sampling weights:
strength <- PlackettLuce(
              msna2019_filtered[ ,items_sel],
              weights = msna2019_filtered$sampling_weight
              )
## Recoded rankings that are not in dense form

This warning is inconsequential. It appears because the respondents ranked at most three out of the selected items. “Dense rankings” would imply that all 12 items were ranked, in every observation. With a maximum of three priorities, we deal with “sparse rankings”.

Inference

The standard null hypothesis would be that the item of interest is a priority of zero strength; the test would establish the probability that the data are compatible with it. However, as the authors of the package (Turner et al., 2018b, op. cit., 7) note, “this hypothesis is generally not of interest for the worth parameters in a Plackett-Luce model: we expect most items to have some worth, the question is whether the items differ in their worth”.

To obtain the statistical significance of differences between item strengths, one of the items has to be designated as the reference category. The comparison is then between this item and all the others.

In this example, we are interested in the substantive question whether finding more earning opportunities (income) (mostly for adult workers) commands higher priority than education (of children and adolescents). We make income the reference category.

PL_est <- summary(strength, ref = "income")
PL_est
## Call: PlackettLuce(rankings = msna2019_filtered[, items_sel], 
##     weights = msna2019_filtered$sampling_weight)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## food         2.59911    0.08973  28.967  < 2e-16 ***
## shelter      1.59650    0.08505  18.771  < 2e-16 ***
## electricity  0.78216    0.08413   9.297  < 2e-16 ***
## water        1.34034    0.10485  12.783  < 2e-16 ***
## income       0.00000         NA      NA       NA    
## latrines     1.04800    0.10249  10.225  < 2e-16 ***
## health_care  0.30187    0.09440   3.198 0.001385 ** 
## cooking      0.09714    0.09833   0.988 0.323167    
## fuel         0.53533    0.14107   3.795 0.000148 ***
## education    0.01927    0.14110   0.137 0.891381    
## safety      -0.47076    0.18002  -2.615 0.008922 ** 
## clothing    -0.81147    0.11498  -7.057  1.7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual deviance:  8002.7 on 8702.8 degrees of freedom
## AIC:  8024.7 
## Number of iterations: 20

We find that the strength estimates between income and education are not significantly different.

Strength on a natural scale

Out of technical necessity, the test returns the estimates on the logarithmic scale. To obtain them on a natural scale, we have to take a detour using quasi-variances and omitting the reference category (On quasi-variance, see the Wikipedia article and/or the help file in the already installed package qvcalc). This approach provides quasi-standard errors for all items. By exponentiating the estimates and their confidence interval bounds, we obtain the relevant quantities on a ratio-scale. expLCI is the lower bound of the 95-percent confidence interval, expUCI is the upper bound. expEst holds the estimated strength, the key result.

The scale is valid up to a normalizing factor. The strength values become intuitive when they are re-scaled to sum to 1. This happens in the following table, with column names that directly refer to strength. The items are sorted by descending strength.

On this scale, the strength of an item is the probability that a randomly selected respondent will choose it as his/her first priority.

qv <- qvcalc(strength, ref = NULL) 
# summary(qv) # Not really needed for our purpose.

qv$qvframe %>% # qv table 
  data.frame(item = row.names(.), ., row.names = NULL) %>%
  # Computes additional columns needed for 95%-CIs,
  # Exponentiates coefficient and CI bound estimates, in accordance with the PL model.
  mutate(
    quasiSD  = sqrt(quasiVar),
    quasiLCI = estimate - quasiSD * qnorm(0.975, 0, 1),
    quasiUCI = estimate + quasiSD * qnorm(0.975, 0, 1),
    expLCI   = exp(quasiLCI),
    expEst   = exp(estimate),
    expUCI   = exp(quasiUCI)
    ) -> qv_estim_CI
# normed strength-of-priorities estimates
qv_estim_CI %>% 
  select(item, '95%-LB' = expLCI, strength = expEst, '95%-UB' = expUCI) %>% 
  mutate_if(is.numeric, function(x){x/sum(.$strength)}) %>% 
  mutate(item = factor(item, rev(levels(msna2019$priority_1)))) -> strength_normed

mutate_if(strength_normed, is.numeric, function(x){round(x, 3)})
##           item 95%-LB strength 95%-UB
## 1         food  0.354    0.390  0.429
## 2      shelter  0.132    0.143  0.155
## 3  electricity  0.058    0.063  0.069
## 4        water  0.096    0.111  0.128
## 5       income  0.025    0.029  0.033
## 6     latrines  0.072    0.083  0.095
## 7  health_care  0.034    0.039  0.045
## 8      cooking  0.028    0.032  0.037
## 9         fuel  0.039    0.050  0.063
## 10   education  0.023    0.030  0.038
## 11      safety  0.013    0.018  0.025
## 12    clothing  0.011    0.013  0.015

We see in this coefficient plot that food is by far the highest priority, followed by shelter and water. Their confidence intervals do not overlap, i.e. statistically their strengths are significantly different from each other. The code for this visualization was adapted from Legoupil 2019, op.cit.

strength_normed %>%
  mutate(item = fct_reorder(item, strength)) %>%
  ggplot(aes(x = item, y = strength, ymin = `95%-LB`, ymax = `95%-UB`)) +
  geom_point(color = "purple", shape = 18, size = 3) +
  geom_errorbar(
    # ymin and ymax are inherited from the main aes() call
    color = "black",
    width = 0.3,
    size  = 0.35
    ) +
  coord_flip() +  # flip coordinates (puts labels on y axis)
  xlab("Items") +
  ylab("Strength") +
  theme(
    axis.text          = element_text(size = 11),
    axis.title.x       = element_text(size = 14),   
    axis.title.y       = element_text(size = 14), 
    panel.grid.major.x = element_line(color = "#cbcbcb"),
    panel.grid.major.y = element_blank(),
    strip.text.x       = element_text(size = 11)
    ) +
  scale_y_continuous(limits = c(0, ceiling(max(strength_normed$`95%-UB`)*10)/10))

Figure 5: Strength of sectoral priorities. 95%-confidence intervals. Strength coefficients normed to sum to 1. Effective sample size (households with > 1 ranked items) = 3,188.

ggsave(filename = "figure_coef_plot.png", 
       width = 6, height = 4)
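
The visual impression of non-overlap can also be checked formally, e.g., by re-running the z-tests with shelter as the reference item (a one-line sketch):

summary(strength, ref = "shelter")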

The interpretation of the normalized strength of an item as the probability that a randomly selected additional respondent in the population will choose it as his/her first priority makes it easier to communicate the meaning of the estimates to non-statistical audiences.

Plackett-Luce trees

Key points about trees

With the tree facility, we compute and visualize statistically distinct worth profiles - in our parlance: strength profiles - for items, separated by groups of observations. The groups are defined by values of categorical variables or by value ranges of continuous ones. These variables are called covariates. Our model covariates have already been described further above.

The key advantage of the pltree function is that its algorithm detects among the included covariates those associated with distinct profiles. The influential covariates are shown as nodes in a hierarchical tree, with the outgoing arrows from each node indicating which values / value ranges differentiate between strength profiles. The terminal nodes appear as line diagrams showing the strengths of the selected items in the particular profile.

Conversely, covariates that are not associated with distinct profiles do not appear in the tree, nor in the attendant coefficient tables.

The term “covariate” is purely formal. Substantively, humanitarian data analysts will find covariates of interest chiefly among attributes of vulnerability that differentiate between affected survey units. Hence the title, “plot of ranked needs, by vulnerable groups”. Of course, variables on other topics, such as on previous relief distributions, might become covariates too.

Back to the formal side: the importance of two optional arguments of pltree must be underlined (they are defined in partykit::mob_control, which pltree calls internally):

  • minsize = [integer > 0] specifies the minimum number of observations in a node. The model does not split the sample at the next internal node if the split would produce a node with membership smaller than minsize.

  • maxdepth = [integer >= 2]: The maximum depth of the tree. Counting starts from the top node and includes the internal nodes and the terminal node on one of the longest paths. Generally, trees of depth 4 or higher are too unwieldy for a legible plot. In many (most?) cases the maximum depth is not reached because significant splits are found only to a smaller depth.

The editorial options in pltree are limited. In particular, there is no way to place item names on the x-axis vertically, nor an immediate legend option. Our approach, therefore, is to overlay the tree onto a blank canvas into which the separately constructed legend is inscribed. We create a function pl_tree_plot that allows for the specification of plot parameters (see further below).

Ranking and grouping objects

In order to generate the tree, two intermediate objects are needed. From the priorities recorded in the selected items (items_sel) we first create an object of class "rankings". From this we create an object of class "grouped_rankings".

In the following, we call these objects R and G, respectively. In this case, the difference between them is trivial (R and G are identical, except for the way they express missing values). In other situations - see section “4.1 Beans example” in Turner et al., op. cit. - separately obtained rankings for the same observation have to be collated, with the result that the objects of the classes "rankings" and "grouped_rankings" become genuinely different. The reader need not worry about that here.

Rankings

Create a “rankings” object/variable R from selected items:

msna2019_filtered$R <- msna2019_filtered[ ,items_sel] %>% as.matrix() %>% as.rankings()
## Recoded rankings that are not in dense form
head(msna2019_filtered$R, 2)
## [1] "health_care > electricity > food" "food > electricity"

As already noted further above, this warning is inconsequential. It appears because the respondents ranked at most three out of the selected items. “Dense rankings” would imply that all 12 items were ranked, in every observation. With a maximum of three priorities, we deal with “sparse rankings”.

Grouped rankings

Use the PlackettLuce::group function to generate the object G:

msna2019_filtered$G <- group(x = msna2019_filtered$R, index = 1:nrow(msna2019_filtered))

head(msna2019_filtered$G, 2)
##                      1                      2 
## "health_care > el ..."   "food > electricity"

group takes two arguments. The object R supplies the first. index, the second, directs group to extract for each observation the appropriate element from R. The expression 1:nrow(msna2019_filtered) ensures that each element of the R vector is used exactly once. In our situation that is trivial (every respondent had exactly one opportunity to rank items), but since index has no default, it has to be made explicit.
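
Readers who want to see the two classes on a minimal example can run this toy sketch (the rankings are made up; three items, two respondents):

# Toy illustration: a rankings object and its grouped version
toy <- matrix(c(1, 2, 0,
                0, 1, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("food", "water", "shelter")))
R_toy <- as.rankings(toy)
G_toy <- group(R_toy, index = 1:nrow(toy))
R_toy
G_toy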

The tree

Estimation

The model formula in pltree calls for the grouped rankings object as well as the selected covariates. The other four arguments are understood from the context and/or have been discussed above.

pl_tree <- pltree(G ~ fcs + educ_min_6_years + working + males_18_59 + hh_gender_marital, 
                  data     = msna2019_filtered, 
                  minsize  = 150,
                  maxdepth = 4 , 
                  weights  = msna2019_filtered$sampling_weight
                  )

The tree in simple text format

The pl_tree object returns, in text format, the branch-and-node structure of the tree, the calculated values or value ranges for each branch, and the membership size (weighted number of observations) and the item strengths in each terminal node. The strengths are rendered on the logarithmic scale, with the first item as a reference point (its log-strength constrained to 0). Below the tree, pl_tree adds a statistical summary.

pl_tree
## Plackett-Luce tree
## 
## Model formula:
## G ~ fcs + educ_min_6_years + working + males_18_59 + hh_gender_marital
## 
## Fitted party:
## [1] root
## |   [2] hh_gender_marital in Fem-Single: n = 214.706025764346
## |              food     shelter electricity       water      income    latrines 
## |         0.0000000  -0.5279754  -1.6180376  -1.2097145  -3.5313738  -1.8428231 
## |       health_care     cooking        fuel   education      safety    clothing 
## |        -3.4227368  -2.7991705  -2.5009129  -3.5387742  -3.5902143  -3.7039538 
## |   [3] hh_gender_marital in Fem-Married, Male-Single, Male-Married
## |   |   [4] hh_gender_marital in Fem-Married, Male-Single: n = 379.019926950336
## |   |              food     shelter electricity       water      income    latrines 
## |   |          0.000000   -1.138336   -1.911570   -1.065725   -2.332029   -1.292029 
## |   |       health_care     cooking        fuel   education      safety    clothing 
## |   |         -2.245821   -2.200293   -2.326054   -2.010709   -2.090687   -3.668319 
## |   |   [5] hh_gender_marital in Male-Married: n = 2577.62592074275
## |   |              food     shelter electricity       water      income    latrines 
## |   |          0.000000   -1.014020   -1.830034   -1.301218   -2.597797   -1.582128 
## |   |       health_care     cooking        fuel   education      safety    clothing 
## |   |         -2.254997   -2.535779   -1.971760   -2.645893   -3.264664   -3.368565 
## 
## Number of inner nodes:    2
## Number of terminal nodes: 3
## Number of parameters per node: 12
## Objective function (negative log-likelihood): 3977.032

The log-scaled strengths can be returned as a dataframe using the function coef, as shown in the next chunk. This format is inconvenient for many analytic purposes. A second function, itempar, returns a table with strengths that sum to 1 over the items in each node. The options selected below make for a natural scale and no reference category. These values are the ones on the y-axis in the line plots of the terminal nodes in the graphic representation further below.

# We show "coef(pl_tree)" in this example. Later, we will show only "itempar(pl_tree)", which is sufficient for most needs. Both arrive as matrices; for easier reading we transpose them and for possible further investigation we convert them to dataframes.

## 1. "coef(pl_tree)" returns strengths on  the log scale. The first item serves as a reference.
mat <- coef(pl_tree) # 
# save matrix as an object, then add a prefix to row names of matrix:
dimnames(mat)[[1]] <- paste0("Node ", dimnames(mat)[[1]])
dimnames(mat)[[1]]
## [1] "Node 2" "Node 4" "Node 5"
# transpose matrix, turn into a df, add variable name "item" to df:
plt_coef_df <- t(mat) %>% as.data.frame() %>% data.frame(item = row.names(.), ., row.names = NULL)
# round the numeric variables to three fractional digits:
mutate_if(plt_coef_df, is.numeric, function(x){round(x, 3)})
##           item Node.2 Node.4 Node.5
## 1         food  0.000  0.000  0.000
## 2      shelter -0.528 -1.138 -1.014
## 3  electricity -1.618 -1.912 -1.830
## 4        water -1.210 -1.066 -1.301
## 5       income -3.531 -2.332 -2.598
## 6     latrines -1.843 -1.292 -1.582
## 7  health_care -3.423 -2.246 -2.255
## 8      cooking -2.799 -2.200 -2.536
## 9         fuel -2.501 -2.326 -1.972
## 10   education -3.539 -2.011 -2.646
## 11      safety -3.590 -2.091 -3.265
## 12    clothing -3.704 -3.668 -3.369
## 2. "itempar(pl_tree, ..)" with arguments "ref = NULL, log = FALSE", returns strength on the natural scale; no reference item.

mat <- itempar(pl_tree, ref = NULL, log = FALSE)
# save matrix as an object and add a prefix to row names of matrix:
dimnames(mat)[[1]] <- paste0("Node ", dimnames(mat)[[1]])
dimnames(mat)[[1]]
## [1] "Node 2" "Node 4" "Node 5"
# transpose matrix, turn into a df, add variable name "item" to df:
plt_itempar_df <- t(mat) %>% as.data.frame() %>% data.frame(item = row.names(.), ., row.names = NULL)
# round the numeric variables to three fractional digits:
mutate_if(plt_itempar_df, is.numeric, function(x){round(x, 3)})
##           item Node.2 Node.4 Node.5
## 1         food  0.395  0.359  0.393
## 2      shelter  0.233  0.115  0.143
## 3  electricity  0.078  0.053  0.063
## 4        water  0.118  0.124  0.107
## 5       income  0.012  0.035  0.029
## 6     latrines  0.063  0.099  0.081
## 7  health_care  0.013  0.038  0.041
## 8      cooking  0.024  0.040  0.031
## 9         fuel  0.032  0.035  0.055
## 10   education  0.011  0.048  0.028
## 11      safety  0.011  0.044  0.015
## 12    clothing  0.010  0.009  0.014

The tree as plot

The output of pltree() can simply be plotted with the generic plot function, which under the hood calls the method plot.pltree, which in turn calls the method plot.bttree. But due to limited graphical options, the result does not always satisfy. The methods plot.pltree and plot.bttree are immune to graphical parameters offered by par from the package graphics, as well as to additional annotations made with text(), mtext() or legend().

The function pl_tree_plot() below is a workaround. Its arguments width (default = 1000), height (default = 0.7*width) and pointsize (default = 18) permit varying the size of the plot.pltree output relative to its text elements; the user may need to experiment with different values in order to obtain a satisfying appearance. The argument yscale defines the range of the y-axis values in the terminal-node subplots; its default range [0, 0.5] needs to be extended if the maximum item strength exceeds 0.5, or shrunk if the maximum falls way below. Moreover, pl_tree_plot() adds a legend with item codes and names. The number of columns can be set with the argument ncol_item_coding (default = 2). The legend position can be set with position_item_coding (default = "topleft"). With cex_item_coding (default = 0.7), the font size of the item-coding legend can be adjusted.

pl_tree_plot <- function(
  pltree,
  width     = 1000,
  height    = 0.7*width,
  pointsize = 18,
  yscale    = c(0, 0.5),
  position_item_coding = "topleft",
  ncol_item_coding = 2,
  cex_item_coding  = 0.7
  ){
  # write pl-tree plot as a .png to temp. directory
  png(
    filename = file.path(tempdir(), "pl_tree_plot.png"),
    width = width, height = height, units = "px", pointsize = pointsize
    )
  # plot pl-tree
  plot(
    pltree,            # input data for the pl-tree plot
    names  = FALSE,    # replaces default abbreviated item names in x-axes with natural numbering
    yscale = yscale,   # shown y-range of plots
    cex    = 0.8,      # size of dots in plots
    refcol = "gray40"  # color of the mean strength line
    )     
  dev.off() # end of .png writing
  
  # save current plot margin setting
  mar_set <- par("mar")
  
  # read the .png from the temp. directory into a raster array
  img <- png::readPNG(file.path(tempdir(), "pl_tree_plot.png"))
  # set all 4 plot margins to 0
  par(mar = rep(0,4)) 
  # fake plot showing nothing, but needed as canvas ...
  plot(0:1, 0:1, ann = FALSE, axes = FALSE, type = "n")
  # ... for the already read-in plot/.png ...
  rasterImage(img, xleft = 0, ybottom = 0, xright = 1, ytop = 1)
  # ..., on top of which in turn the text explaining the item coding ...
  
  items_chr <- attr(attr(pltree$data[ ,1], "rankings"), "dimnames")[[2]]
  # extract items from the argument pltree (which is of the class pltree) as character vector
  
  item_coding <- paste0(
    ifelse(seq_along(items_chr) < 10, "  ", ""), # gap filler to align code
    seq_along(items_chr), # item numbering / coding
    ": ",                 # separation between code and item
    items_chr             # items in plain text
    )
  # ... is set into a table implemented as the legend
  legend(position_item_coding, legend = item_coding,
         title = "item coding", title.adj = 0, x.intersp = -1,
         ncol = ncol_item_coding,
         bty = "n", cex = cex_item_coding, inset = 0.04)
  
  par(mar = mar_set) # reset margins
  }

Then we call the function to render the plot, using the already calculated pl_tree.

png(filename = "figure_PL_treeplot.png", 
    width = 2200, height = 1500, pointsize = 40)
pl_tree_plot(pltree = pl_tree, width = 2200, pointsize = 30, position_item_coding = "topright")
dev.off() # Saving has to be done in R, not in knitting
# `ggsave` does not work with objects of the class `pltree`.

Note the horizontal grey line at y = 1 / (number of items) = 1/12. This is the mean strength; the line makes it easier to spot items with above-average strength in each terminal node.

Adding node membership as a new variable

The helper function below extracts from an object of the class pltree the fitted terminal-node membership and returns it as a factor with levels ordered according to the numbering of the terminal nodes.

get_pl_tree_nodes <- function(pltree){
  predict(object = pltree, type = "node") %>%
    # create node membership variable (integer as character)
    as.integer() %>%
    # convert to integer 
    (function(x){
      chr    <- stringr::str_c("Node ", x)
      levels <- stringr::str_c("Node ", sort(unique(x)))
      factor(chr, levels = levels)
      }
     )
  # create factor variables with levels sorted by integer value + prefix "Node ".
  }

We add the node membership extracted from the object pl_tree to the data.frame msna2019_filtered as the new variable node. Recall that pl_tree originated from msna2019_filtered.

 # Creates the new variable "node":
msna2019_filtered$node <- get_pl_tree_nodes(pltree = pl_tree)
table(msna2019_filtered$node, useNA = "always") %>% addmargins()
## 
## Node 2 Node 4 Node 5   <NA>    Sum 
##    224    375   2589      0   3188

node can be joined, using id, to the unfiltered original dataset, where it may be further analyzed jointly with other variables of interest:

msna2019 <- left_join(msna2019, msna2019_filtered[ ,c("id", "node")], by = "id")
table(msna2019$node, useNA = "always") %>% addmargins()
## 
## Node 2 Node 4 Node 5   <NA>    Sum 
##    224    375   2589    230   3418

Finally, we verify that the households excluded from the PL model indeed differ from the included ones with regard to their food consumption scores.

msna2019 %>%
  group_by(node, resp_gender) %>%
  summarise(
    n            = n(),
    "mean fcs"   = mean(fcs),
    "median fcs" = median(fcs),
    .groups      = "drop"
    )
## # A tibble: 8 x 5
##   node   resp_gender     n `mean fcs` `median fcs`
##   <fct>  <fct>       <int>      <dbl>        <dbl>
## 1 Node 2 Male           31       40.8         40.5
## 2 Node 2 Female        193       41.2         40  
## 3 Node 4 Male          150       47.5         45  
## 4 Node 4 Female        225       45.1         43  
## 5 Node 5 Male         1420       44.8         43.5
## 6 Node 5 Female       1169       44.9         43.5
## 7 <NA>   Male           68       52.2         52.5
## 8 <NA>   Female        162       37.0         35.5

The excluded households with female respondents score lower on food consumption than the rest of the sample; their male counterparts score higher. The pattern holds for both mean and median fcs. This boxplot visualizes the dispersion around the group medians.

counts <- msna2019 %>% group_by(node, resp_gender) %>% tally

msna2019 %>% 
  ggplot(aes(x = node, y = fcs, fill = resp_gender))  +
  stat_boxplot(geom ='errorbar', width = 0.5, position=position_dodge(0.7)) +
  geom_boxplot(width = 0.6,
               position=position_dodge(0.7),
               outlier.shape = 1,
               outlier.size  = 2,
               outlier.alpha = 0.5) +
  theme_bw() +
  scale_y_continuous(breaks = seq(10,110,20), limits = c(2.5,110)) +
  # ggtitle(paste0("...\nThere are (n = ", nrow(msna2019), ") observations."))  +
  labs(fill = "Gender of\nthe respondent", 
       x = "Terminal node",
       y = "Food Consumption Score") +
  # theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(
    data      = counts,
    aes(label = paste0("n = ", n), y = 10),
    position  = position_dodge(0.7),
    angle     = -30,
    hjust     = 0,
    size      = 2.5
    )
Food Consumption Scores, by terminal nodes, resp. excluded observations (NA). There are  n = 3188 observations.

Figure 6: Food Consumption Scores, by terminal nodes, resp. excluded observations (NA). There are n = 3188 observations.

ggsave(filename = "figure_boxplot1.png", 
       width = 6, height = 4)

If the user wants to see the detailed frequencies by

  • number of recorded priorities,
  • item that is the first priority,
  • respondent gender and
  • inclusion/exclusion from the model,

this code produces a long table (not shown here), helpful for spotting unexpectedly large groups, particularly among the excluded:

msna2019 %>% 
  group_by(
    excluded_from_pl_tree_model = is.na(node),
    resp_gender,
    priority_1,
    n_priorities
    ) %>% 
  count() %>% 
  arrange(n_priorities, resp_gender, excluded_from_pl_tree_model, -n) %>% 
  data.frame() -> tab_insight_missing_rankings

More compelling is the visual approach, as in this multiple barplot. With a glance at the distribution of priorities in the panel “number of priorities recorded = 1”, one spots the importance of shelter for female respondents in this group:

tab_insight_missing_rankings %>% 
  mutate(n_priorities = paste0("number of priorities recorded = ", n_priorities)) %>% 
  ggplot(aes(priority_1, n, fill = resp_gender)) +
  geom_col(position = position_dodge()) +
  facet_wrap(~n_priorities, ncol = 1, scales = "free_y") +
  theme_bw() +
  labs(x = "items of priority 1", y = "counts", fill = "gender of\nrespondent") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
First priority counts, by item, number of priorities recorded and respondent gender (n = 3418).

Figure 7: First priority counts, by item, number of priorities recorded and respondent gender (n = 3418).

ggsave(filename = "figure_priorities_recorded.png", 
       width = 9, height = 6)

Discussion

Why Plackett-Luce?

The Plackett-Luce model is attractive both from the theoretical and from the applied-research (i.e., needs assessment) perspectives. In the former, a strong point is that the model flows from a well-defined data-generating process, which in turn is based on psychological decision theory (“Luce’s axiom of choice”, Luce, R.D. 1977). It returns, as we have noted, a measure at the ratio level. Inference on the difference between the strengths of preference for two or more items in the population is unproblematic. Preference heterogeneity across groups characterized by observed attributes is revealed and visualized in Plackett-Luce trees.

For analysts of humanitarian research, and specifically of needs assessments,

  • fractional sampling weights,
  • comparisons of priorities via strength ratios, and
  • detection of the vulnerability attributes that differentiate priority profiles

stand out most strongly as key advantages. The PlackettLuce R package allows us to exploit them conveniently.
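
As a minimal sketch of the first two advantages, the following lines assume that a rankings object R and a numeric weight vector w already exist (the main part builds these from the priority columns, possibly under different names); the object name pl_fit is illustrative:

pl_fit <- PlackettLuce(rankings = R, weights = w)  # weighted, unconditional fit
worth  <- itempar(pl_fit)          # item worths normed to sum to 1 (ratio level)
worth["food"] / worth["shelter"]   # e.g., how many times stronger food is than shelter
qv <- qvcalc(pl_fit)               # quasi-variances for item-wise comparisons
plot(qv)                           # comparison intervals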

Plackett-Luce and the Borda count

These features elevate the Plackett-Luce model above the better-known Borda count (Benini 2019, op.cit.; Legoupil 2019, op.cit.). The Borda count makes demanding assumptions about psychological mechanisms. In particular, it assumes that, in the mind of the respondent, the difference in strength between the first and second preferences equals that between the second and third (and the third and fourth, etc., if more than three items can be ranked). The Borda count therefore permits, at most, item comparisons at the interval level. And while descriptive breakdowns by pre-defined vulnerability groups are a natural way to look for different priority profiles, we do not know of a Borda-based procedure that simultaneously detects the best differentiating attributes.

However, this will not sound the death knell for the Borda count in humanitarian analysis. For one thing, this measure is easy to calculate in spreadsheets. As such, it stays at the fingertips of junior analysts who have heard of it, including those with only very basic notions of R, or none at all.

Moreover, when identifying, from partial ranking data, the most pressing needs in the affected population, Plackett-Luce and Borda will often (though not always) find the same leaders. In our example, both measures single out food and shelter as first and second priorities. In Plackett-Luce, food drastically outdistances all other items. Shelter, water and latrines, while showing significantly different strengths from each other, form a second tier. In the Borda count, the triad of food, shelter and electricity is clearly offset from the rest.

Borda count vs. Plackett-Luce.

Figure 8: Borda count vs. Plackett-Luce.

A further plus for the Borda count: unlike Plackett-Luce, it includes observations with only one ranked item. Also, because the Borda count is essentially a voting method, the counts of substantively closely connected items can be pooled. Think, e.g., of safety and electricity, the latter being understood as a key component for reducing nighttime crime and accidents in the camps. With rankings as its basis, Plackett-Luce, in its current form, does not allow that.
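
A small sketch of such pooling, using the borda_pl_info table built in the Borda appendix below (the item grouping is purely illustrative):

borda_pl_info %>%
  mutate(item_pooled = ifelse(item %in% c("safety", "electricity"),
                              "safety_electricity", item)) %>%
  group_by(item_pooled) %>%
  summarise(borda_counts = sum(borda_counts), .groups = "drop") %>%
  arrange(-borda_counts)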

However, the Plackett-Luce model is preferable when we believe that distinct priority profiles between vulnerable groups are the rule rather than the exception, and that tree models will reveal (or confirm qualitative knowledge of) their defining attributes.

Code for Borda count statistics (item-wise counts, means and normed means) as well as for the above diagram can be found in this appendix.

Hopes and wishes

We hope that our code is correct, sufficiently detailed and lucidly explained to give the interested analyst a head start. We also give several caveats, which we exemplify in the appendix. We admit that our own understanding of the R package PlackettLuce is incomplete. Notably, the workings of the function pltree and of the object that it creates are not fully transparent to us. For one thing, using sampling weights requires tedious filtering. Moreover, adjusting the text size of the plotted tree and annotating it with further text elements are not straightforward. The standard formatting options seem very limited. We have added code for nicer item legends. When sampling weights are used, we have not found a way to round the terminal node frequencies (the “n = ..” on top of the plot rectangles) to legible numbers. Perhaps some of the packages that PlackettLuce depends on offer more options. We will be glad to be told about them, or else look forward to future versions resolving these shortcomings.

In the meantime we look forward to receiving from you, our readers, suggestions for improvements in any aspects that this note has touched.

References

Acaps, NPM Analysis Hub and REACH Initiative. (2019). “Vulnerabilities in the Rohingya refugee camps [December 2019]”, Cox’s Bazar, Bangladesh. Retrieved 5 June 2020, from https://www.acaps.org/sites/acaps/files/products/files/20191220_acaps_analysis_hub_in_coxs_vulnerabilities_in_the_rohingya_refugee_camps_0.pdf.

Benini, A. (2019) “Priorities and preferences in humanitarian needs assessments - Their measurement by rating and ranking methods.” from http://aldo-benini.org/Level2/HumanitData/OkularAnalytics_Priorities_and_Preferences_v2019_01_23.pdf.

ISCG et al. (2019a). “Bangladesh - 2019 Multi Sectoral Needs Assessment # 2 (‘In-depth’ Assessment for 2020 JRP) [Dataset, updated March 23, 2020].” Cox’s Bazar, UN Inter Sector Coordination Group (ISCG), UNHCR, IOM Needs and Population Monitoring (NPM), ACAPS, WFP VAM, Translators without Borders, and REACH. from https://data.humdata.org/dataset/3b28e3a6-16fc-4cab-81be-2fa086c0352f/resource/9e6ff150-a95e-48dd-9809-d2a11794acf5/download/bgd_dataset_joint-msna_refugee_september-2019.xlsx.

ISCG et al. (2019b, 4 May 2020). “Joint Multi-Sector Needs Assessment (J-MSNA) - Rohingya Refugees - September 2019.” Cox’s Bazar, Bangladesh, UN Inter Sector Coordination Group (ISCG), European Union, UNHCR, IOM Needs and Population Monitoring (NPM), ACAPS, WFP VAM, Translators without Borders, and REACH. from https://reliefweb.int/sites/reliefweb.int/files/resources/bgd_report_2019_jmsna_refugee_community_december_2019_to_share_v3.pdf.

Legoupil, E. (2019). Estimation of Sectoral Priorities: Borda Count and Plackett-Luce Model (December 3, 2019) [incorrectly attributed to Benini & Sons], Humanitarian R User Group. https://humanitarian-user-group.github.io/post/plackett-luce/.

Luce, R. D. (1977) “The choice axiom after twenty years.” Journal of Mathematical Psychology 15, 215-233. Retrieved June 10, 2020, from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.336.1366&rep=rep1&type=pdf.

Turner, H. L. (2020). Plackett-Luce questions. Email msg. to Aldo Benini, 20 May 2020.

Turner, H., I. Kosmidis, D. Firth and J. van Etten. (2018). PlackettLuce: Plackett-Luce Models for Rankings. URL https://CRAN.R-project.org/package=PlackettLuce, R package version 0.2-3.

Turner, H. L., J. van Etten, D. Firth and I. Kosmidis. (2018a). “Modelling item worth based on rankings.” Retrieved 28 April 2020, from https://www.heatherturner.net/modelling_item_worth.html#/.

Turner, H. L., J. van Etten, D. Firth and I. Kosmidis. (2018b). “Modelling rankings in R: the PlackettLuce package.” Retrieved 10 December 2018, from https://arxiv.org/pdf/1810.12068.

Turner, H. L., J. van Etten, D. Firth and I. Kosmidis (2020) “Modelling rankings in R: the PlackettLuce package.” Computational Statistics DOI: 10.1007/s00180-020-00959-3. from https://link.springer.com/content/pdf/10.1007/s00180-020-00959-3.pdf.

Wikipedia. (2019). “Quasi-variance.” Retrieved 4 May 2020, from https://en.wikipedia.org/wiki/Quasi-variance.

Appendices

Selection of relevant items and side effects on the number of priorities

Back to the previous point in section Selection of relevant items.

We demonstrate several methods to make selections of items from the original full set:

Method 1: Hand-picked items

items_sel_1 <- c("food", "shelter", "electricity", "water", "income", "latrines")

Method 2: A predefined selection, i.e. a vector of character strings as shown above, here comprising all items from the dataset msna2019:

(items_sel_2 <- levels(msna2019$priority_1))
##  [1] "food"        "shelter"     "electricity" "water"       "income"     
##  [6] "latrines"    "health_care" "cooking"     "fuel"        "education"  
## [11] "safety"      "clothing"

Method 3: A statistical criterion, such as a minimum number of times an item appeared as any priority
(overall count = occurrences of the item as priority 1, 2 or 3):

(msna2019 %>%
    select(food:income) %>% 
    apply(2, function(x){sum(x > 0)}) %>% 
    # count column-wise all records > 0 (0 means: "not recorded as a priority")
    tibble(item = names(.), 'overall count' = .) %>% 
    arrange(-`overall count`) -> item_overall_count)
## # A tibble: 12 x 2
##    item        `overall count`
##    <chr>                 <int>
##  1 food                   1796
##  2 electricity            1639
##  3 shelter                1569
##  4 income                  724
##  5 health_care             709
##  6 cooking                 693
##  7 clothing                662
##  8 water                   533
##  9 latrines                506
## 10 education               216
## 11 fuel                    212
## 12 safety                  137
min_count <- 250

(item_overall_count %>% 
  filter(`overall count` >=  min_count) %>%
  .$item -> items_sel_3)
## [1] "food"        "electricity" "shelter"     "income"      "health_care"
## [6] "cooking"     "clothing"    "water"       "latrines"

In general, one ought to be wary of excluding items from the original set on which the rankings were elicited. With every additional item excluded from the Plackett-Luce model, the number of viable observations is liable to decrease at an accelerating rate. This multi-panel histogram details the loss for this dataset, in response to increasingly high item frequency thresholds:

The downside of selecting items.

Figure 9: The downside of selecting items.
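
The code behind Figure 9 is not reproduced here. A minimal sketch of the underlying calculation, assuming the rank columns food:income hold 0 for “not ranked” and using purely illustrative cut-offs, could look like this:

thresholds <- c(0, 150, 250, 550, 750, 1600)   # hypothetical overall-count cut-offs
sapply(thresholds, function(th){
  kept_items <- item_overall_count %>%
    filter(`overall count` >= th) %>%
    pull(item)
  msna2019 %>%
    select(all_of(kept_items)) %>%
    apply(1, function(x) sum(x > 0)) %>%   # remaining priorities per household
    {sum(. >= 2)}                          # viable for Plackett-Luce: at least two ranks
  })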

Back to the previous point in section Selection of relevant items.

Non-randomness of missing priorities

Back to the relevant main text.

We demonstrate the non-random missingness in two regards: (1) the association with poverty; (2) differences in priorities.

Association with poverty

Among female respondents, a single recorded priority occurred about twice as often as among their male counterparts, as this table shows:

## 
## ------------- -------------- ---------- ------------ ------------- -------------- ---------------
##                 n_priorities          0            1             2              3           Total
##   resp_gender                                                                                    
##          Male                  2 (0.1%)    66 (4.0%)   220 (13.2%)   1381 (82.7%)   1669 (100.0%)
##        Female                  0 (0.0%)   161 (9.2%)   178 (10.2%)   1410 (80.6%)   1749 (100.0%)
##         Total                  2 (0.1%)   227 (6.6%)   398 (11.6%)   2791 (81.7%)   3418 (100.0%)
## ------------- -------------- ---------- ------------ ------------- -------------- ---------------

Taking the Food Consumption Score (fcs) as a measure of poverty, we see that this group appears to be poorer than the rest of the groups shown in the boxplot:

Households with female respondents with only one recorded priority tend to be poorer

Figure 10: Households with female respondents with only one recorded priority tend to be poorer

Different priorities

To compare those households that stated only one priority to those with two or three, we look at the distribution of first priorities over the items, by number of priorities stated:

## 
## ------------- -------------- -------------- -------------- --------------- ---------------
##                 n_priorities              1              2               3           Total
##    priority_1                                                                             
##          food                   40 ( 17.6%)   128 ( 32.2%)   1265 ( 45.3%)   1433 ( 41.9%)
##       shelter                   79 ( 34.8%)    83 ( 20.9%)    535 ( 19.2%)    697 ( 20.4%)
##   electricity                   37 ( 16.3%)    82 ( 20.6%)    318 ( 11.4%)    437 ( 12.8%)
##         water                    3 (  1.3%)    17 (  4.3%)    160 (  5.7%)    180 (  5.3%)
##        income                   11 (  4.8%)    13 (  3.3%)    146 (  5.2%)    170 (  5.0%)
##      latrines                    6 (  2.6%)    20 (  5.0%)    109 (  3.9%)    135 (  4.0%)
##   health_care                   19 (  8.4%)     8 (  2.0%)     85 (  3.0%)    112 (  3.3%)
##       cooking                   13 (  5.7%)    19 (  4.8%)     54 (  1.9%)     86 (  2.5%)
##          fuel                    5 (  2.2%)     0 (  0.0%)     49 (  1.8%)     54 (  1.6%)
##     education                    4 (  1.8%)     9 (  2.3%)     33 (  1.2%)     46 (  1.3%)
##        safety                    5 (  2.2%)     2 (  0.5%)     22 (  0.8%)     29 (  0.8%)
##      clothing                    4 (  1.8%)     5 (  1.3%)     15 (  0.5%)     24 (  0.7%)
##          <NA>                    1 (  0.4%)    12 (  3.0%)      0 (  0.0%)     13 (  0.4%)
##         Total                  227 (100.0%)   398 (100.0%)   2791 (100.0%)   3416 (100.0%)
## ------------- -------------- -------------- -------------- --------------- ---------------

The 13 observations with NA as first priority, yet with one or two recorded priorities, are probably the result of processing errors: these priorities were recorded as second and/or third ones.

Going through the column-wise proportions, one notices that among the 227 respondents who stated only one priority, shelter and health care are more frequent whereas the frequency of food as first priority is much lower.

Thus, in both regards, missing priorities are not randomly distributed. Data analysts should take care to produce such cross-tabulations and, if the number of exclusions from the Plackett-Luce model is substantial, report noticeable deviations in the priorities of the excluded as additional findings.
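
A sketch of such a cross-tabulation, here with column percentages by number of recorded priorities (plainer output than the formatted table above):

msna2019 %>%
  filter(n_priorities > 0) %>%
  count(n_priorities, priority_1) %>%
  group_by(n_priorities) %>%
  mutate(share = round(100 * n / sum(n), 1)) %>%
  arrange(n_priorities, -n)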

Back to the relevant main text.

Respondent gender as a confounder

This appendix is not necessary for the understanding of Plackett-Luce trees. However, it reminds analysts that design factors affect model outcomes; where such factors are observed and recorded at the observation (or group) level, it is good practice to run tree models with and without them and to compare the estimates.

Potential bias from survey design

By design, in the MSNA 2019 enumerators and respondents were paired by gender (while preserving the probability character of the sample). The respondent was not always the household head; this was accepted as long as an able substitute matching the gender of the visiting enumerator was found. Still, the correlation between respondent and household head genders is significant, as this table shows:

## 
## ------------------- ------------- -------------- -------------- ---------------
##                       resp_gender           Male         Female           Total
##   hh_gender_marital                                                            
##          Fem-Single                   31 (13.8%)    193 (86.2%)    224 (100.0%)
##         Fem-Married                   87 (28.6%)    217 (71.4%)    304 (100.0%)
##         Male-Single                   63 (88.7%)      8 (11.3%)     71 (100.0%)
##        Male-Married                 1420 (54.8%)   1169 (45.2%)   2589 (100.0%)
##               Total                 1601 (50.2%)   1587 (49.8%)   3188 (100.0%)
## ------------------- ------------- -------------- -------------- ---------------

Naturally, the question arises whether the gender pairing confounds the true effect that the household head status (gender plus married/single) has on preferences. In other words, whether there is a significant interviewer/respondent bias along gender lines.

An appearance of bias

In fact, when we add resp_gender as a control, it completely overwhelms the effect of the other covariates, including the previously dominant hh_gender_marital. This is true even when we set a very lenient significance threshold for splits (alpha = 0.20).

plt_resp_gender <- pltree(G ~ fcs + educ_min_6_years + working  + males_18_59 + hh_gender_marital + resp_gender,
                           data     = msna2019_filtered, 
                           minsize  = 150, 
                           maxdepth = 4 ,
                           weights  = msna2019_filtered$sampling_weight,
                           control  = mob_control(alpha = 0.20) # See Help for partykit::mob, a function on which PlackettLuce::pltree makes calls.
                           )

pl_tree_plot(pltree = plt_resp_gender, width = 1350, pointsize = 24)
PL-tree plot of the model with `resp_gender`, entire sample

Figure 11: PL-tree plot of the model with resp_gender, entire sample

The tree returned has only two terminal nodes. Households with female enumerators / respondents emphasize shelter improvements and more/better latrines more strongly. Among items with strengths < 1/12 (those below the horizontal line), male preference for (cheaper? more regularly supplied?) cooking fuel is a multiple of the strength that women give this item.

The highly vulnerable status “single female household head” is no longer distinctive in this specification.

Assessing the bias

However, the verdict of design-induced bias is not compelling. To arrive at a more valid perspective, we re-estimate the model on the subsample of households headed by married males, the cultural default, accounting for over 80 percent of the sample. Almost half of these households (45 percent) had female enumerators / respondents. For them, obviously, resp_gender is well balanced.

msna2019_confound <- msna2019_filtered %>% filter(hh_gender_marital == "Male-Married") 
nrow(msna2019_confound)
## [1] 2589

On this subsample, we test whether resp_gender still dominates. hh_gender_marital defines the subsample; thus it is no longer in the model. We also lower the minimum size for terminal nodes:

pl_tree_confound <- pltree(G ~ fcs + educ_min_6_years + working  + males_18_59 + resp_gender,
                           data     = msna2019_confound, 
                           minsize  = 50, 
                           maxdepth = 4 ,
                           weights  = msna2019_confound$sampling_weight,
                           control  = mob_control(alpha = 0.20)
                           )
# pl_tree_confound
pl_tree_plot(pltree = pl_tree_confound, width = 2200, pointsize = 30)
PL-tree plot of the model with `resp_gender`, subsample = households headed by married men.

Figure 12: PL-tree plot of the model with resp_gender, subsample = households headed by married men.

Despite these adjustments, enumerator / respondent gender is still dominant. A new distinction is added. Where the interview took place between women, the priority profiles are distinct depending on whether at least one household member was earning or not. The differences between these two profiles, however, appear minor and are difficult to interpret - why should the group with no one earning (node no.5) express a weaker priority of food aid? The major difference persists between genders (nodes 2 vs. 4 and 5), expressed chiefly in items nos. 2, 6 and 9, as before.

Node memberships compared

Between the two models, we compare terminal nodes and excluded observations by their size and by their average Food Consumption Scores.

In the first step, we create the node membership variable; we recall that this model only includes households with married male heads. We tabulate it:

msna2019_confound$node_confound <- get_pl_tree_nodes(pltree = pl_tree_confound) 
table(msna2019_confound$node_confound, useNA = "always") %>% addmargins()
## 
## Node 3 Node 4 Node 5   <NA>    Sum 
##    878    542   1169      0   2589

There are no missing values.

Next, we merge this new variable with msna2019 and tabulate it there. As expected, observations with household heads other than married men appear as NAs, in addition to those with fewer than two priorities recorded or with incomplete covariates.

msna2019_confound %>%
  select(id, node_confound) %>%
  left_join(x = msna2019, y = ., by = "id") -> msna2019
table(msna2019$node_confound, useNA = "always") %>% addmargins()
## 
## Node 3 Node 4 Node 5   <NA>    Sum 
##    878    542   1169    829   3418

Third, we tabulate the node memberships for both models side by side and calculate two (unweighted, i.e. for the sample only) Food Consumption Score statistics.

msna2019 %>%
  group_by(
    node,
    node_confound
    ) %>% 
  summarize(
    n          = n(),
    fcs_mean   = mean(fcs) %>% round(1),
    fcs_median = median(fcs),
    .groups      = "drop"
    ) %>% 
  data.frame()
##     node node_confound    n fcs_mean fcs_median
## 1 Node 2          <NA>  224     41.1       40.0
## 2 Node 4          <NA>  375     46.0       44.0
## 3 Node 5        Node 3  878     45.8       44.0
## 4 Node 5        Node 4  542     43.2       42.5
## 5 Node 5        Node 5 1169     44.9       43.5
## 6   <NA>          <NA>  230     41.5       40.5

Although the partitions both consist of four elements, neither is a function of the other (they are neither injective nor surjective in any direction). Still, in combination with fcs, the mapping is informative. Not only are households with single female heads poorer (node 2 in the model without resp_gender) - the same is true, to a smaller degree, of households led by married men without an earner (node 5 with resp_gender). This lends some additional credibility to both models.

Is the bias serious?

The bias is not serious: in all our tree models, just as in the unconditional Plackett-Luce estimates, the same three items have the greatest strengths - food (by far!), shelter, and water. The gender of the enumerators / respondents does not change this ranking.

Moreover, the bias is useful in the sense that it lets us see, once more, what the voices of women add to the needs assessment. Clearly, when the interviews are between women, shelter improvements are desired more strongly - independently of the gender/marital status of the household head, as the “confound” model demonstrates. Women in Rohingya culture are more strictly confined to their homes, and take greater responsibility for running them, than men (although the camp situation offers neither gender much latitude away from their shelters). When talking with empathetic outsiders asking questions about shelter quality (as done in this MSNA), they stress the importance of improvements.

Certainly, qualitative enquiries elicit more numerous and more refined insights from women than this statistical model does. Still, testing the robustness of the effect of a key covariate (in our case, the status of the household head) by controlling for a design variable is helpful. Where possible, it should be done with Plackett-Luce tree models.

Robustness to covariate transformations

Back to the previous point in section Transformed covariates: Education proxy.

As argued in the main section, covariates may be passed to pltree in their original shape or after some transformation. Transformed variants may generate different trees. In particular, the effect on the tree of (generally information-richer) numeric vs. (easier-to-interpret) categorical variants may differ dramatically.

We illustrate this with the education indicators. For the above tree, we used the categorical educ_min_6_years, telling us whether the highest education among any household members was at least six years of schooling. educ_min_6_years itself is the result of a transformation of the original count variable educ_max_years, which has a range from 0 to 12 years of schooling.

For further probing, we create PL-trees using two more versions: the binary any_school_ever, which takes the value “No one” if no one in the household has ever completed Class 1 and “At least one” otherwise; and educ_ridit, the distribution-driven ridit transform of educ_max_years.

For rationale and calculation, refer to section “Education proxy” above.
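
For readers who skip back and forth, this is a minimal sketch of a ridit transform (not necessarily the exact code used in that section): the ridit of a value is the proportion of observations below it plus half the proportion tied with it.

ridit <- function(x){
  p <- table(x) / length(x)        # relative frequency of each distinct value
  r <- cumsum(p) - p / 2           # ridit score per distinct value
  as.numeric(r[as.character(x)])   # map the scores back onto the observations
  }
# e.g., educ_ridit could be obtained as ridit(msna2019$educ_max_years)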

The models here work with only two covariates, the education proxy and the gender of the enumerator / respondent. We allow small groups with minsize = 100 and splits up to maxdepth = 4.
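
The chunk that produced Figures 13 to 16 is not shown. A hedged sketch of the kind of call involved, here for the educ_max_years variant with the parameter values just mentioned (the object name plt_educ_max is illustrative), would be:

plt_educ_max <- pltree(G ~ educ_max_years + resp_gender,
                       data     = msna2019_filtered,
                       minsize  = 100,
                       maxdepth = 4,
                       weights  = msna2019_filtered$sampling_weight)
pl_tree_plot(pltree = plt_educ_max, width = 2200, pointsize = 30)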

Using the original educ_max_years

Tree with respondent gender and max. years education

Figure 13: Tree with respondent gender and max. years education

## 
## Node 2 Node 5 Node 6 Node 7   <NA>    Sum 
##   1601    536    270    781    230   3418

One notices immediately that for households with female respondents and a maximum of Class 1 or 2 completed among household members, the priority placed on food is much lower. Education, it seems, has a non-linear effect on the priority profiles of female respondents, and no discernible one on the male side. This is particular to this contrived model and not at all generalizable. Obviously it drives the splits here.

Using educ_ridit

Tree with respondent gender and riditized max. years education

Figure 14: Tree with respondent gender and riditized max. years education

## 
## Node 2 Node 5 Node 6 Node 7   <NA>    Sum 
##   1601    536    270    781    230   3418

The terminal node plots are exactly the same as in the educ_max_years version. Since years are easier to understand than ridit split values, this model has no value beyond demonstrating that the tree is robust to transformations that are one-to-one and order-preserving.

Using educ_min_6_years

Tree with respondent gender and categorical minimum 6 years of education

Figure 15: Tree with respondent gender and categorical minimum 6 years of education

## 
## Node 2 Node 3   <NA>    Sum 
##   1601   1587    230   3418

Categorical, returning a tree with two terminal nodes.

Using any_school_ever

Tree with respondent gender and categorical any household member finished at least Class 1

Figure 16: Tree with respondent gender and categorical any household member finished at least Class 1

## 
## Node 2 Node 4 Node 5   <NA>    Sum 
##   1601    536   1051    230   3418

Categorical, returning a tree with three terminal nodes.

Comparing node frequencies

Table 2: Model comparison of terminal nodes.

---------------------- ------------------- -------- -------------- ----------------- ------
                        Original variable   Transformed to
Gender of respondent    Max years           Ridit    Min. 6 years   Any school ever        n
---------------------- ------------------- -------- -------------- ----------------- ------
Male                    Node 2              Node 2   Node 2         Node 2              1601
Male                    NA                  NA       NA             NA                    68
Female                  Node 5              Node 5   Node 3         Node 4               536
Female                  Node 6              Node 6   Node 3         Node 5               270
Female                  Node 7              Node 7   Node 3         Node 5               781
Female                  NA                  NA       NA             NA                   162
---------------------- ------------------- -------- -------------- ----------------- ------

The table summarizes the findings that matter here: the original and the transformed numeric variants produce exactly the same partitions of the sample; the categorical transforms return trees with two and three terminal nodes, respectively. The partition under any_school_ever maps surjectively to the one under educ_min_6_years (both nodes 4 and 5 |-> node 3); the reverse is not true.

Back to the previous point in section Transformed covariates: Education proxy.

Borda count

Back to the relevant discussion point.

Generating Borda scores

We use this helper function to convert natural ranks to Borda scores:

convert_2_borda <- function(data, items){
 data_only_items <- data[ ,items]        # keep only the rank columns
 last_priority   <- max(data_only_items) # highest rank recorded (here: 3)
 get_borda_score <- function(x){
   # rank 1 gets the top score, the last rank gets 1; unranked (0) stays 0
   ifelse(x != 0, yes = last_priority + 1 - x, no = x)
   }
 mutate_all(.tbl = data_only_items, .funs = get_borda_score) 
 }
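
A toy illustration (hypothetical data): natural ranks 1, 2 and 3 become scores 3, 2 and 1, while unranked items (coded 0) stay at 0.

toy <- data.frame(food = c(1, 0), shelter = c(2, 1), water = c(0, 2), safety = c(3, 0))
convert_2_borda(data = toy, items = names(toy))
##   food shelter water safety
## 1    3       2     0      1
## 2    0       3     2      0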

Item-wise statistics

The conversion takes place in this chunk, which also produces the item-wise statistics and a dataframe to which we join the PL-strengths:

items        <- msna2019 %>% select(food:income) %>% names()
# all items stored as a vector
borda_scores <- convert_2_borda(data = msna2019, items = items)
# convert natural ranks to Borda scores with the helper function

borda_scores_weighted <- borda_scores*msna2019$sampling_weight
# multiply the Borda scores by the sampling weights

data.frame(
  item         = items,
  borda_counts = apply(borda_scores_weighted, 2, sum),
  borda_means  = apply(borda_scores_weighted, 2, mean),
  row.names    = NULL
  ) %>%
  mutate(borda_normed = borda_means/sum(borda_means)) %>%
  left_join(
    strength_normed %>% select(item, strength),
    by = "item"
    ) %>%
  rename("pl_strength_normed" = strength) -> borda_pl_info

Table 3: Borda count statistics (sampling-weighted).

------------- ----------- -------- --------
                          Borda
item             counts     means   normed
------------- ----------- -------- --------
food            4907.774    1.436    0.254
shelter         3722.790    1.089    0.193
water           1094.105    0.320    0.057
latrines        1088.450    0.318    0.056
electricity     3015.396    0.882    0.156
cooking         1123.639    0.329    0.058
clothing         817.365    0.239    0.042
health_care     1253.208    0.367    0.065
education        376.151    0.110    0.019
safety           246.734    0.072    0.013
fuel             396.500    0.116    0.021
income          1279.132    0.374    0.066
------------- ----------- -------- --------

Note that the Borda counts in the table above were computed from the unfiltered msna2019 data frame. As a check on comparability with the Plackett-Luce strengths, they were also calculated separately from msna2019_filtered. The differences are minimal, appearing only in the third decimal place of the normed Borda means.
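
A sketch of how that check can be reproduced (assuming msna2019_filtered still carries the item rank columns):

borda_w_filtered <- convert_2_borda(data = msna2019_filtered, items = items) *
  msna2019_filtered$sampling_weight
borda_means_filtered <- apply(borda_w_filtered, 2, mean)
round(borda_means_filtered / sum(borda_means_filtered), 3)   # compare with Table 3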

Code for the comparison plot

This is code for the plot shown in the “Discussion” section:

borda_pl_info %>%
ggplot(aes(x = pl_strength_normed, y = borda_normed)) + 
  theme_bw() + 
  ggrepel::geom_text_repel(
    aes(label = item),
    box.padding = unit(0.45, "lines")
    ) +
  geom_point(colour = "lightgreen", size = 4) + 
  geom_point(pch = 1, size = 4) + 
  scale_x_continuous(breaks = seq(0,0.4,0.1), limits = c(0,0.4)) +
  scale_y_continuous(breaks = seq(0,0.4,0.1), limits = c(0,0.4)) +
  geom_abline(intercept = 0, slope = 1, colour = "red", lty = 2) +
  coord_fixed() + # ensures aspect ratio = 1
  labs(x = "Plackett-Luce (weighted, normed)", y = "Borda count (weighted, normed)")

ggsave(filename = "figure_PL_vs_Borda.png", 
       width = 9, height = 6)

Back to the relevant discussion point.