Shortening the Edinburgh postnatal depression scale using optimal test assembly methods: Development of the EPDS‐Dep‐5

This study used a large database to develop a reliable and valid shortened form of the Edinburgh Postnatal Depression Scale (EPDS), a self‐report questionnaire used for depression screening in pregnancy and postpartum, based on objective criteria.


| INTRODUCTION
Depression is a leading cause of disability among women. 1 Although the 7-13% prevalence of major depression during pregnancy and postpartum 2-5 is similar to rates among women during non-childbearing periods, 3,6-10 perinatal depression is associated with adverse outcomes for the mother, developing child, mother-infant relationship and marital quality. [11][12][13] Most women with depression in the perinatal period, however, do not receive adequate care. [14][15][16] Rapidly identifying women with depression to improve their care is a high clinical priority. 17 The 10-item Edinburgh Postnatal Depression Scale (EPDS) is the most commonly used self-report questionnaire in pregnancy and postpartum for screening, and it is also used as a continuous scale for symptom monitoring clinically and for research. 16,18 Scores on each EPDS item reflect the frequency of symptoms in the last two weeks and range from 0 to 3, with questions 3 and 5-10 reverse coded. Total scores range from 0 to 30. Higher scores indicate greater depressive symptomatology. As completing measures can be demanding, shortened versions with scores that perform comparably well with original full-length versions may help reduce the burden placed on respondents, as well as decrease the time it takes to administer the scale. However, shortening a scale is only advisable if it does not adversely affect measurement and screening accuracy properties of the scale.
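The scoring rule described above (item scores from 0 to 3, items 3 and 5-10 reverse coded, totals from 0 to 30) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that each response is recorded as the position of the ticked option (0 for the first listed option, 3 for the last) and that reverse coding maps a position r to 3 − r; the function names and the input convention are assumptions introduced here.

```python
# Item numbers that the text states are reverse coded.
REVERSE_CODED = {3, 5, 6, 7, 8, 9, 10}


def epds_item_scores(option_positions):
    """Convert ticked option positions (0 = first option, 3 = last) to item scores."""
    if len(option_positions) != 10:
        raise ValueError("the EPDS has 10 items")
    scores = []
    for item, position in enumerate(option_positions, start=1):
        if position not in (0, 1, 2, 3):
            raise ValueError(f"item {item}: position must be 0, 1, 2, or 3")
        # Reverse-coded items score 3 for the first option and 0 for the last.
        scores.append(3 - position if item in REVERSE_CODED else position)
    return scores


def epds_total(option_positions):
    """Total EPDS score (0 to 30); higher means greater symptomatology."""
    return sum(epds_item_scores(option_positions))
```

Under this convention, ticking the first option on every item gives a total of 21, because the seven reverse-coded items each contribute 3.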
Shortened forms of the full 10-item EPDS have been developed (Table 1). [19][20][21][22][23][24] These include two two-item forms, 19,24 a five-item form, 20 three- and seven-item subscales that measure symptoms of anxiety and depression separately, 21,24 a three-item form, 22 and an eight-item form. 23 None of the development processes for these shortened forms used pre-specified performance criteria to determine how many items to remove from the full 10-item EPDS. Furthermore, only three studies shortening the EPDS validated the shortened form against major depression classification status, 20,22,24 and these studies included only 63, 19, and 9 major depression cases. The extent to which the existing shortened forms retain the measurement and diagnostic properties of the full scale is unclear. Individual participant data meta-analysis (IPDMA), in which participant-level data from many studies are synthesized, allows for the development of a shortened form using data from a large number of participants.
Optimal test assembly (OTA) is a mixed-integer programming procedure that uses an estimated item response theory (IRT) model to select the subset of items that maximizes performance with respect to a given metric while satisfying pre-specified constraints. 25 While more commonly used in the development of high-stakes educational tests, 26 OTA is being increasingly used to develop shortened versions of patient-reported outcome measures. [27][28][29] This procedure was also shown to be replicable, reproducible, and to produce shortened forms of minimal length compared to alternative methods. 30

Significant Outcomes
• A 5-item short form of the EPDS can be used to screen for depression in the perinatal period.
• The 5-item short form was shown to be valid and reliable in a sample of 5157 participants.
• Optimal test assembly methods provide a replicable and reproducible methodology to shorten patient-reported outcomes.

Limitations
• This study was not able to obtain data from 25 of 81 eligible datasets.
• There exists substantial heterogeneity across studies in terms of country and language of administration of the semi-structured interview.
• The optimal test assembly procedure is data-driven and should be replicated.

| Aims of the study
The objective of the present study was to apply optimal test assembly methods to a large database in order to develop a shortened version of the Edinburgh Postnatal Depression Scale. We (1) used confirmatory factor analysis to verify the unidimensionality of the underlying construct measured by the Edinburgh Postnatal Depression Scale; (2) applied optimal test assembly methods to obtain candidate forms of each possible length; and (3) selected the shortest possible form that performed similarly to the full form on pre-specified validity, reliability, and screening accuracy criteria.

| MATERIALS AND METHODS
This study used a subset of data accrued for an IPDMA on the diagnostic accuracy of the EPDS for screening to detect major depression among pregnant and postpartum women. This IPDMA was registered in PROSPERO (CRD42015024785) and a protocol was published. 31 The protocol for the main IPDMA did not include methods for the present study. A protocol for the present study was uploaded to the Open Science Framework repository prior to initiating the study (https://osf.io/3cepr/).

| Study eligibility for the main IPDMA
Datasets from articles in any language were eligible if they included women ≥18 years who were pregnant or had given birth in the previous year and both: (a) EPDS scores and (b) diagnostic classification for a current Major Depressive Episode (MDE) using Diagnostic and Statistical Manual of Mental Disorders (DSM) or International Classification of Diseases (ICD) criteria based on a validated semi-structured or fully structured interview, administered within two weeks of each other. Participants recruited from psychiatric settings or settings where scales or interviews were administered because of reported symptoms of depression were excluded, since screening is done to identify previously unrecognized cases. 32 Not all participants in a dataset needed to be eligible, if primary data allowed the selection of eligible participants.

| Database searches and study selection
A medical librarian searched Medline, Medline In-Process & Other Non-Indexed Citations and PsycINFO via OvidSP, and Web of Science Core Collections via ISI Web of Knowledge from inception to October 3, 2018, using a peer-reviewed 33 search strategy (Methods S1). We reviewed reference lists of relevant reviews and queried contributing authors about non-published studies. Search results were uploaded into RefWorks (RefWorks-COS). After de-duplication, remaining citations were uploaded into DistillerSR (Evidence Partners) for processing review results. Two investigators independently reviewed titles and abstracts. If either deemed a study potentially eligible, full-text review was done by two investigators, independently, with disagreements resolved by consensus, consulting a third investigator when necessary.

| Data contribution, extraction, and synthesis
Authors of eligible datasets were invited to contribute deidentified primary data, including EPDS item scores and major depression status. We emailed corresponding authors of eligible primary studies at least three times, as necessary. If there was no response, we emailed co-authors and attempted phone contact.
Individual participant data were converted to a standard format and synthesized into a single dataset. We compared published participant characteristics and accuracy results with results from raw datasets and resolved any discrepancies in consultation with primary investigators.
For defining major depression, we considered MDD or MDE based on the DSM or ICD. If more than one was reported, we prioritized MDE over MDD. This is because screening would attempt to detect depressive episodes; further interview would determine if the episode is related to MDD, bipolar disorder, or persistent depressive disorder. We also prioritized DSM over ICD.
When datasets included statistical weights to reflect sampling procedures, we used the provided weights. For studies where sampling procedures merited weighting (e.g., all participants with positive screens and a random subset of participants with negative screens received a diagnostic interview), but the original study did not weight, we used inverse selection probabilities.
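The inverse selection probability weighting described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the study's code: it assumes selection for the diagnostic interview depends only on a screening stratum (for example, positive or negative screen), so that each interviewed participant is weighted by the number screened in the stratum divided by the number interviewed in it. The function name and the stratum labels are hypothetical.

```python
from collections import Counter


def inverse_selection_weights(screened_strata, interviewed_strata):
    """Weight per stratum = 1 / P(interviewed | stratum).

    screened_strata: stratum label for every screened participant.
    interviewed_strata: stratum label for every interviewed participant.
    """
    n_screened = Counter(screened_strata)
    n_interviewed = Counter(interviewed_strata)
    return {
        stratum: n_screened[stratum] / n_interviewed[stratum]
        for stratum in n_interviewed
    }


# Hypothetical design: all 100 positive screens and a random 90 of 900
# negative screens receive the diagnostic interview.
screened = ["positive"] * 100 + ["negative"] * 900
interviewed = ["positive"] * 100 + ["negative"] * 90
weights = inverse_selection_weights(screened, interviewed)
```

In this made-up design each interviewed participant with a negative screen stands in for ten screened participants, while each participant with a positive screen represents only herself.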

| Data eligibility for present study
For the present study, from the main IPDMA dataset, we only included primary studies that classified major depression based on the Structured Clinical Interview for DSM (SCID). 34 The SCID is a semi-structured diagnostic interview that was designed to be conducted by experienced diagnosticians. It requires clinical judgment and allows rephrasing questions and probes to follow up responses. Fully structured interviews, on the other hand, are fully scripted, with no allowance for deviation from the script. These interviews remove clinical judgement from the process, allowing lay interviewers, rather than clinicians, to perform the assessment. Because of this, they may sacrifice validity. In recent analyses using three large IPDMA databases, [35][36][37] it was found that compared to semi-structured interviews, fully structured interviews, which are designed for administration by lay interviewers, may identify more patients with low-level symptoms as depressed but fewer patients with high-level symptoms. Furthermore, a very brief version, the Mini International Neuropsychiatric Interview, identified far more participants as being depressed across the symptom spectrum. [35][36][37] These results were consistent with the idea that semi-structured interviews most closely replicate clinical interviews done by trained professionals, whereas fully structured interviews are less rigorous reference standards. They are less resource-intensive options that can be administered by research staff without diagnostic skills but may misclassify major depression in substantial numbers of patients. Semi-structured interviews replicate diagnostic standards more closely than other types of interviews, and the SCID is by far the most commonly used semi-structured diagnostic interview for depression research [34][35][36]. In our main EPDS IPDMA database, 34 of 36 studies that used semi-structured interviews to classify major depression status used the SCID. 
Therefore, we only included SCID studies.
In addition, as EPDS item-level data was necessary for the proposed analyses, we only included studies in which EPDS item-level data (not just total scores) were available. For studies that collected data at multiple time points, we selected the time point with the most participants. If there was a tie, we selected the time point with the most major depression cases.
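The time-point rule above (most participants, with ties broken by the number of major depression cases) amounts to a lexicographic comparison. A minimal sketch, assuming each time point is summarized by a dictionary with hypothetical keys `label`, `n_participants`, and `n_cases`:

```python
def select_time_point(time_points):
    """Pick the time point with the most participants; break ties by case count."""
    return max(time_points, key=lambda t: (t["n_participants"], t["n_cases"]))


# Made-up example: two time points tie on participants, so cases decide.
candidates = [
    {"label": "pregnancy", "n_participants": 200, "n_cases": 18},
    {"label": "6 weeks postpartum", "n_participants": 200, "n_cases": 25},
    {"label": "6 months postpartum", "n_participants": 150, "n_cases": 30},
]
chosen = select_time_point(candidates)
```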

| Statistical analyses
All analyses were conducted using R version 3.6.0.

| Verification of unidimensionality of the EPDS
Robust weighted least squares estimation in R was used to fit a single-factor confirmatory factor analysis model of EPDS items. 38 The model was first fit without allowing for any residual correlations among the items. If there was poor model fit, and if warranted by theoretical justification, modification indices were to be used to identify item pairs that would improve model fit by allowing their residuals to correlate. 39 Model fit was evaluated concurrently, using the χ² statistic, Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and Root Mean Square Error of Approximation (RMSEA). 40 Priority was given to CFI, TLI, and RMSEA, because the χ² test may reject well-fitting models when sample size is large. 41 Model fit was considered to be adequate if CFI and TLI were ≥0.95 and RMSEA ≤0.08. 42 The confirmatory factor analysis was fit using the lavaan package. 43

| Item response theory model and optimal test assembly
A generalized partial credit model (GPCM) was fit to the EPDS items, pooling data from all included studies. 44 The GPCM is an IRT model that relates a latent trait, representing severity of depressive symptomatology, to the distribution of observed item-level responses. The GPCM estimates two types of item-specific parameters: a discrimination parameter and threshold parameters. From these item-level parameter estimates, an information function was calculated for each item, as well as a test information function (TIF), obtained by summing the item information functions. Because the TIF is inversely related to the standard error of measurement of the latent trait, high amounts of information represent greater precision for measuring depressive symptomatology. The GPCM was fit using the ltm package. 45
Next, we used OTA, a mixed-integer programming technique, to systematically search for the short form that maximized the TIF, subject to the constraint of fixing the number of items included in each short form. By using the TIF as the objective function, the procedure optimizes the precision of the short form in estimating participants' level of depressive symptomatology. 25,46 The shape of the TIF was anchored at five points. 25 Thus, for each short-form length from 1 to 9 items, OTA selected the items from the full set of EPDS items that maximized the test information. The OTA analysis was conducted using the lpSolveAPI package.
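To make the steps above concrete, the following sketch computes GPCM category probabilities and item information, and then picks the fixed-length subset of items whose TIF is largest at a set of anchor points. It is an illustration under stated assumptions, not the authors' implementation: the study solved a mixed-integer program with lpSolveAPI in R, whereas this sketch simply enumerates all subsets (feasible for a 10-item scale), uses a maximin objective over the anchor points, and relies on made-up item parameters, anchor values, and target weights. All function and variable names are hypothetical.

```python
import itertools
import math


def gpcm_probs(theta, a, thresholds):
    """Category probabilities P(X = 0..m | theta) under the GPCM."""
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, a, thresholds):
    """GPCM item information: a^2 times the conditional variance of the item score."""
    probs = gpcm_probs(theta, a, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second - mean * mean)


def assemble_short_form(items, length, anchors, targets):
    """Choose `length` items maximizing the smallest ratio TIF(anchor) / target."""
    info = {name: [item_information(t, a, b) for t in anchors]
            for name, (a, b) in items.items()}
    best_form, best_value = None, -math.inf
    for form in itertools.combinations(sorted(items), length):
        tif = [sum(info[name][k] for name in form) for k in range(len(anchors))]
        value = min(t / w for t, w in zip(tif, targets))
        if value > best_value:
            best_form, best_value = form, value
    return best_form, best_value


# Made-up (discrimination, thresholds) pairs for four hypothetical items.
ITEMS = {
    "item_a": (2.0, [-0.5, 0.5, 1.5]),
    "item_b": (1.0, [-1.0, 0.0, 1.0]),
    "item_c": (0.5, [0.0, 1.0, 2.0]),
    "item_d": (1.5, [0.0, 0.8, 1.6]),
}
ANCHORS = [-1.0, 0.0, 1.0, 2.0, 3.0]  # five hypothetical anchor points
TARGETS = [1.0] * 5                   # equal relative weight at each anchor

form, value = assemble_short_form(ITEMS, 2, ANCHORS, TARGETS)
```

Because the standard error of the latent trait estimate at a given trait level is 1 / sqrt(TIF), a subset with higher information at the anchor points measures more precisely there. Enumeration checks every subset of the requested length, so it returns an exact optimum of the objective as formulated in this sketch; a mixed-integer solver becomes necessary only when the item pool is too large to enumerate.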
For each of the 9 candidate short forms and the full-length form, two scoring procedures were used to obtain estimates of each participant's level of depressive symptomatology. First, the summed scores across all items included in the short form were calculated. Second, factor scores were estimated for each participant. Although summed scores are typically relied upon for clinical use, the factor scores are considered to provide a better estimate of the latent trait because of well-known limitations of the summed score under the GPCM. 47,48

| Selection of final short form
The elimination of items necessarily reduces information compared to a full-length form. Thus, to guarantee adequate performance, the selection of the final short form was based on the following five criteria: reliability, concurrent validity of summed scores, concurrent validity of factor scores, and non-inferior sensitivity and specificity.
Reliability of each candidate short form was assessed with Cronbach's alpha. 49 Concurrent validity was assessed with the correlation between candidate short form scores and scores on the full-length EPDS, for summed scores and for factor scores. It was required a priori to be ≥0.90. 30
Diagnostic accuracy of each candidate short form was assessed through a three-step process. First, pooled sensitivity and specificity of each candidate short form (compared to the SCID) for each of its possible cutoff summed score values were estimated with a bivariate random-effects model. Second, for each candidate short form, an optimal cutoff score was selected using Youden's J statistic (sensitivity + specificity − 1). 50,51 The bivariate random-effects model was fit using the lme4 package. 52 Third, two non-inferiority tests were conducted for each of the 9 candidate forms to compare sensitivity and specificity, separately, to the full-length form. Non-inferiority tests assess whether the sensitivity or specificity of the short form is lower than that of the full-length form by no more than a pre-specified clinically significant tolerance of δ = 0.05. 53 To conduct the non-inferiority test, the sampling distribution of the test statistic was generated through the bootstrap method. 54 Bootstrapping resamples the original dataset with replacement to generate new, artificial datasets. 55 For each non-inferiority test, 2000 bootstrap iterations were conducted, controlling in each for the number of respondents with and without major depression. For each bootstrap iteration, the bivariate random-effects model was fit to each of the 9 candidate short forms and the full-length form, and the sensitivities and specificities were computed based on their cutoff scores. To account for the multiple testing in the 18 total non-inferiority tests, Benjamini-Hochberg adjusted p-values were used to determine the significance of the tests at the 0.05 significance level. 56
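Three of the quantities named above have simple closed forms, sketched here for illustration. The sketch makes several assumptions: Cronbach's alpha is computed from sample variances of raw item scores; the Youden's J helper takes sensitivity and specificity values that have already been estimated (in the study they come from a bivariate random-effects model, which is not reproduced here); and the Benjamini-Hochberg helper implements the standard step-up adjustment. Function names and all example numbers are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores: one list of k item scores per respondent."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def youden_cutoff(accuracy_by_cutoff):
    """accuracy_by_cutoff: {cutoff: (sensitivity, specificity)} -> (cutoff, J)."""
    best = max(accuracy_by_cutoff, key=lambda c: sum(accuracy_by_cutoff[c]) - 1)
    return best, sum(accuracy_by_cutoff[best]) - 1


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank among ascending p-values
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up accuracy estimates for three candidate cutoffs of one short form.
cutoff, j_stat = youden_cutoff({3: (0.90, 0.70), 4: (0.80, 0.85), 5: (0.65, 0.92)})
```

A test is then declared significant when its adjusted p-value falls below the chosen significance level.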

| Funding and ethics
The study sponsors had no role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the paper for publication. DH had full access to all data in the study and had final responsibility for the decision to submit for publication. As this study involved secondary analysis of de-identified previously collected data, the Research Ethics Committee of the Jewish General Hospital declared that this project did not require research ethics approval. However, for each included dataset, we confirmed that the original study received ethics approval and that all patients provided informed consent.

| Search results and inclusion of primary data
Of 4434 unique titles and abstracts identified from the database search, 4056 were excluded after title and abstract review and 257 after full-text review, leaving 121 eligible articles with data from 81 unique participant samples, of which 56 (69%) contributed datasets (Figure S1). Authors of included studies contributed data from two additional studies that were not retrieved by the search, for a total of 58 datasets. Of these, we excluded 24 studies that used a diagnostic interview other than the SCID and 12 more studies that did not have EPDS item scores available. In total, 5157 participants (765 major depression cases) from 22 primary studies were included. These studies were conducted in 18 different countries, with 17 different languages. The mean age of the sample was 29.1 years. See Table 2 for descriptive sample statistics and Table S1 for characteristics of each included study.

| Unidimensionality of the EPDS
A single-factor model was fit to the EPDS-10 with residuals modeled as uncorrelated (χ² [df = 65] = 663.1, p < 0.0001, TLI = 0.992, CFI = 0.988, RMSEA = 0.042). As this model was deemed to be well fitting, no modification indices were used. Factor loadings for items were all high, with a median of 0.97 and a range of 0.88 to 1.15.

| Item response theory model and optimal test assembly
The discrimination parameters for each item based on the GPCM are presented in Table 3. The information functions of each of the 10 items, as well as the total TIF, are shown in Figure 1. Item 8 had the greatest discrimination parameter and thus has the most peaked information function in Figure 1. Table 4 shows the items that were included in each of the 9 candidate short forms from the OTA analysis. Item 8 was included in all candidate short forms, with items 3, 5, and 6 quickly dropped.

| Selection of final short form
Cronbach's alpha values and concurrent validity correlations for the 9 candidate short forms are presented in Table 5, and results of the non-inferiority tests for both sensitivity and specificity are presented in Table 6. The 5-item short form (EPDS-Dep-5) was the shortest form that fulfilled all criteria. The form included item 1 ("I have been able to laugh and see the funny side of things"), item 2 ("I have looked forward with enjoyment to things"), item 8 ("I have felt sad or miserable"), item 9 ("I have been so unhappy that I have been crying"), and item 10 ("The thought of harming myself has occurred to me"). The EPDS-Dep-5 maintained high reliability with a Cronbach's alpha of 0.82 (95% CI, 0.81, 0.83) compared to 0.88 (95% CI, 0.87, 0.88) for the full-length form. Correlations of the summed and factor scores between the EPDS-Dep-5 and EPDS-10 were 0.91 (95% CI, 0.91, 0.92) and 0.95 (95% CI, 0.91, 0.97), respectively. Youden's J for the full EPDS and EPDS-Dep-5, at their optimal cutoffs of 11 or greater and 4 or greater, respectively, were both 0.68. Receiver operating characteristic curves for the full EPDS and EPDS-Dep-5 are presented in Figure S2. The sensitivity and specificity of the EPDS-Dep-5 at its optimal cutoff of 4 or greater were 0.83 (95% CI, 0.73, 0.89) and 0.86 (95% CI, 0.80, 0.90), respectively. Both sensitivity and specificity were non-inferior to the sensitivity (0.80; 95% CI, 0.71, 0.86) and specificity (0.88; 95% CI, 0.83, 0.92) of the full-length form.

| DISCUSSION
This study used OTA to shorten the EPDS to a 5-item version (EPDS-Dep-5) while maintaining comparable measurement properties and screening accuracy to detect major depression among women in pregnancy and postpartum. The implication of this research is that shortening this scale allows for shorter administration times and places lower burden on respondents without significantly reducing the ability of the scale to measure depressive symptomatology.
The EPDS-Dep-5 maintained similar sensitivity and specificity to those of the full-length form and resulted in a minimal loss of information. Furthermore, the shortened form maintained reliability and validity that were comparable to the full-length form based on pre-specified criteria. Cronbach's alpha of the EPDS-Dep-5 was within 0.06 of that for the full-length form, and correlations of the summed scores and factor scores of the EPDS-Dep-5 and EPDS-10 were 0.91 and 0.95, respectively. Per pre-specified criteria, the sensitivity and specificity of the EPDS-Dep-5 (0.825 and 0.859, respectively) were non-inferior to those of the EPDS-10 (0.797 and 0.880, respectively).
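As a quick arithmetic check on the point estimates reported above, Youden's J (sensitivity + specificity − 1) and the short-form minus full-form differences can be worked out directly. This uses point estimates only; the study's non-inferiority conclusions rest on the bootstrap tests with tolerance δ = 0.05, not on this comparison.

```latex
\begin{align*}
J_{\text{EPDS-Dep-5}} &= 0.825 + 0.859 - 1 = 0.684 \\
J_{\text{EPDS-10}} &= 0.797 + 0.880 - 1 = 0.677 \\
\Delta_{\text{sensitivity}} &= 0.825 - 0.797 = 0.028 \\
\Delta_{\text{specificity}} &= 0.859 - 0.880 = -0.021 > -0.05
\end{align*}
```

Both J values round to 0.68, and the only negative difference (specificity) is smaller in magnitude than the 0.05 tolerance.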
The EPDS-Dep-5 comprises items 1, 2, 8, 9, and 10 from the original EPDS. These items cover the two core symptoms of depression, low mood (items 8 and 9) and anhedonia (items 1 and 2), as well as self-harm (item 10). Of note, although they were included as potential items for the final shortened form, none of the 3 anxiety items (items 3 [blame], 4 [anxious], and 5 [scared]) were retained in the EPDS-Dep-5. Our short form selection procedure assessed screening accuracy for detecting depression, not anxiety, and short form development for that purpose would need to be done separately.
Most existing studies developing shortened EPDS forms compared the shortened forms to the full EPDS rather than comparing to diagnostic classification for depression. Only three studies validated their shortened forms against major depression classification based on DSM or ICD diagnostic criteria, but these studies included only 63, 19, and 9 major depression cases, 20,22,24 limiting their ability to draw conclusions about the shortened scales' measurement properties. Table 1 presents the items included in each study's shortened form as well as the methods used to create that version. The development of the EPDS-Dep-5 in the present study used data that originated from an IPDMA and thus (1) had the largest total sample size (5157 participants), with data from multiple settings and countries, (2) included by far the largest number of major depression cases (765 cases), (3) used a validated semi-structured diagnostic interview as the reference standard for major depression classification (the SCID), and (4) used screening accuracy as part of the development process, not solely as a tool for validation. It was also the only study that used objective, pre-specified criteria for empirical selection of items to include in the short form.
This study showed that an EPDS-Dep-5 cutoff ≥4 maximized combined sensitivity and specificity using Youden's J. 51 However, clinicians and researchers may consider use of a higher cutoff if their goal is to only capture patients with high depressive symptom levels or a lower cutoff if their goal is to avoid false negatives.
There are several limitations of this study that must be considered. First, for the collection of data for the full IPDMA, it was not possible to obtain primary data from 25 of the 81 eligible datasets. In addition, of the 34 studies using the SCID that provided data for the full IPDMA, 12 did not provide EPDS item scores and thus could not be included in the present study. Second, although we included data from 22 studies that fulfilled strict inclusion criteria, including the use of the rigorous semi-structured SCID interview, there was still substantial heterogeneity across studies in terms of country and language. This heterogeneity allows the results to be generalized to larger and more diverse populations, but it also means that the selected short form may not be optimal for each individual context. Third, the present study did not conduct a risk of bias assessment; however, the full IPDMA from which a subset of data was selected for this study did conduct a risk of bias assessment using QUADAS-2. No QUADAS-2 domain items were consistently associated