Nuffield Early Language Intervention

A school-based program designed to improve children's vocabulary, narrative skills, active listening, and confidence in independent speaking.

Fact Sheet

Program Outcomes

Preschool Communication/ Language Development
School Readiness

Program Type

Academic Services
School - Individual Strategies
Skills Training

Program Setting

School

Continuum of Intervention

Selective Prevention

Age

Early Childhood (3-4) - Preschool
Late Childhood (5-11) - K/Elementary

Gender

Both

Race/Ethnicity

Endorsements

Blueprints: Promising

Program Information Contact

Denise Cripps
Executive Officer to the President
St. John's College, Oxford
OX1 3JP
denise.cripps@sjc.ox.ac.uk
Tel: 01 865 277456

Program Developer/Owner

Maggie Snowling
St. John's College, University of Oxford

Brief Description of the Program

Nuffield Early Language Intervention is a 30-week language intervention program delivered in the final term in Nursery school (ages 3-4) and the first two terms in Reception class (age 5). The program comprises activities targeting spoken language skills for the first 20 weeks, supplemented for the final 10 weeks with training in two critical components of the alphabetic principle, letter-sound knowledge and phoneme awareness. A second 20-week version begins upon entry into primary school, rather than beginning in preschool.

The first 10 weeks involve three 15-minute group sessions (2-4 children per group) per week delivered in preschool. This increases to three 30-minute group sessions plus two 15-minute individual sessions per week in Reception class. A separate 20-week version skips the initial 10-week preschool portion and begins with the 30-minute sessions in primary school.

Children are taught using multi-sensory techniques within a standard framework. The oral language program aims to improve children's vocabulary, develop narrative skills, encourage active listening and build confidence in independent speaking. New vocabulary is selected with reference to themes common in Early Years' settings and includes nouns, verbs, adjectives, prepositions, pronouns and question words. Narrative work encourages expressive language and grammatical competence. Activities revolve around creating and acting out stories, sequencing and story elements. Listening skills are specifically targeted in the first 20 weeks during the Sound/Listening Game incorporating ideas from Letters and Sounds: Phase 1 (DfES, 2007). This section is extended in the last 10 weeks by activities to promote phoneme awareness (blending and segmenting) and letter-sound knowledge.

Outcomes

Primary Evidence Base for Certification

Study 1

Fricke et al. (2013) found, compared to the control group, the program significantly improved scores for the treatment children on:

Language skills (vocabulary, grammar and listening comprehension)
Narrative awareness
Phonological awareness
Reading comprehension

Study 2

Sibieta et al. (2016) found at the posttest, compared to the control group, the 30-week treatment group showed significantly improved:

Language skills (vocabulary, grammar, and listening comprehension)

Study 3

Fricke et al. (2017) found at posttest and delayed 6-month follow-up, compared to children in the control group, children in treatment (both in the 30-week and 20-week intervention groups) showed greater improvements in:

Language skills (vocabulary, grammar, and listening comprehension)

Study 4

Dimova et al. (2020) found at the posttest, children in the intervention condition, compared to the control condition, demonstrated significantly better:

Language skills
Reading ability

Brief Evaluation Methodology

Primary Evidence Base for Certification

Blueprints has reviewed four studies, and all meet Blueprints evidentiary standards (specificity, evaluation quality, impact, dissemination readiness). Studies 1 and 3 were conducted by the developer, and Studies 2 and 4 were conducted by independent evaluators.

Study 1

Fricke et al. (2013) conducted a randomized controlled trial with 180 children from 15 nursery schools in Yorkshire, England. From each school, 12 children with the lowest mean verbal composite scores were selected as participants in the trial. The waitlist control group received no additional service during the duration of the study. Children were assessed on language and literacy skills at baseline, posttest, and 6 months after the completion of the intervention.

Study 2

Sibieta et al. (2016) conducted a randomized controlled trial with 394 children from 34 nursery schools in Yorkshire, England. From each school, approximately 12 children with the lowest mean verbal composite scores were selected as participants in the trial. Children were assigned to a 30-week treatment group, 20-week treatment group, or waitlisted control group. The study conducted assessments to measure language and literacy skills at baseline, posttest, and 6 months after completion of the intervention.

Study 3

Fricke et al. (2017) conducted a replication and extension of the study by Fricke et al. (2013) that involved a randomized controlled trial with 394 children from 34 nursery schools in England (17 in Greater London and 17 in Yorkshire/Nottinghamshire). Within each school, children were assigned to the 30-week intervention, the 20-week intervention, or the waitlist control group. Literacy and language skills were measured at baseline, posttest, and 6 months after completion of the intervention.

Study 4

Dimova et al. (2020) conducted a cluster randomized trial by randomly assigning 193 schools (1,156 children) across the United Kingdom to either receive the 20-week intervention program or continue instruction as usual. Language and reading skills were assessed at baseline and posttest.

Blueprints Certified Studies

Study 1

Fricke, S., Bowyer-Crane, C., Haley, A. J., Hulme, C., & Snowling, M. J. (2013). Efficacy of language intervention in the early years. Journal of Child Psychology and Psychiatry, 54(3), 280-290.

Study 2

Sibieta, L., Kotecha, M., & Skipp, A. (2016). Nuffield early language intervention: Evaluation report and executive summary. Education Endowment Foundation.

Study 3

Fricke, S., Burgoyne, K., Bowyer-Crane, C., Kyriacou, M., Zosimidou, A., Maxwell, L., . . . Hulme, C. (2017). The efficacy of early language intervention in mainstream school settings: A randomized controlled trial. Journal of Child Psychology and Psychiatry, 58(10), 1141-1151. doi:10.1111/jcpp.12737

Study 4

Dimova, S., Ilie, S., Brown, E. R., Broeks, M., Culora, A., & Sutherland, A. (2020). The Nuffield Early Language Intervention: Evaluation Report. United Kingdom: Education Endowment Foundation.

Risk and Protective Factors

Protective Factors

School: Instructional Practice

* Risk/Protective Factor was significantly impacted by the program

Subgroup Analysis Details

Gender Specific Findings

Female

Subgroup Analysis Details

Subgroup differences in program effects by race, ethnicity, or gender (coded in binary terms as male/female) or program effects for a sample of a specific race, ethnic, or gender group.

Study 1 (Fricke et al., 2013) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.
Study 2 (Sibieta et al., 2016) tested for within-subgroup program effects and found significant benefits for females but not in comparison to other groups.
Study 3 (Fricke et al., 2017) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.
Study 4 (Dimova et al., 2020) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.

Sample demographics including race, ethnicity, and gender for Blueprints-certified studies:

None of the studies reported the race or ethnicity of participants. The samples for two studies, Study 2 and Study 4, had approximately equal proportions of males and females (slightly more males).

Training and Technical Assistance

Nuffield Early Language Intervention is a European program and has not been assessed by Blueprints for dissemination readiness in the United States.

The training for teachers/teaching assistants to deliver the program is a two-day training course and covers the following:

Background to the Nuffield Early Language Intervention program
Oral language development (including listening and comprehension)
Oral language development (vocabulary, grammar and narrative)
Overview of the programme
Discovering the Manual (including record-keeping and progress assessment tool)
Individual sessions
Letter sounds and phonological awareness
Record-keeping and assessment tools

The Reception program costs £400 and this includes the training.

Benefits and Costs

Source: Washington State Institute for Public Policy
All benefit-cost ratios are the most recent estimates published by The Washington State Institute for Public Policy for Blueprint programs implemented in Washington State. These ratios are based on a) meta-analysis estimates of effect size and b) monetized benefits and calculated costs for programs as delivered in the State of Washington. Caution is recommended in applying these estimates of the benefit-cost ratio to any other state or local area. They are provided as an illustration of the benefit-cost ratio found in one specific state. When feasible, local costs and monetized benefits should be used to calculate expected local benefit-cost ratios. The formula for this calculation can be found on the WSIPP website.

Program Costs

No information is available

Funding Strategies

No information is available

Evaluation Abstract

Program Developer/Owner

Maggie Snowling
Professor and President
St. John's College, University of Oxford
Department of Experimental Psychology
United Kingdom

Program Outcomes

Preschool Communication/ Language Development
School Readiness

Program Specifics

Program Type

Academic Services
School - Individual Strategies
Skills Training

Program Setting

School

Continuum of Intervention

Selective Prevention

Program Goals

A school-based program designed to improve children's vocabulary, narrative skills, active listening, and confidence in independent speaking.

Population Demographics

Preschool children with poor language and literacy skills.

Target Population

Age

Early Childhood (3-4) - Preschool
Late Childhood (5-11) - K/Elementary

Gender

Both

Gender Specific Findings

Female

Race/Ethnicity

Subgroup Analysis Details

Subgroup differences in program effects by race, ethnicity, or gender (coded in binary terms as male/female) or program effects for a sample of a specific race, ethnic, or gender group.

Study 1 (Fricke et al., 2013) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.
Study 2 (Sibieta et al., 2016) tested for within-subgroup program effects and found significant benefits for females but not in comparison to other groups.
Study 3 (Fricke et al., 2017) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.
Study 4 (Dimova et al., 2020) did not test for subgroup effects defined by race, ethnicity, gender, sexual identity, economic disadvantage, geographic location, or birth origin.

Sample demographics including race, ethnicity, and gender for Blueprints-certified studies:

None of the studies reported the race or ethnicity of participants. The samples for two studies, Study 2 and Study 4, had approximately equal proportions of males and females (slightly more males).

Other Risk and Protective Factors

Poor language and literacy skills

Risk/Protective Factor Domain

Individual

Risk/Protective Factors

Risk Factors

Protective Factors

School: Instructional Practice

*Risk/Protective Factor was significantly impacted by the program

Brief Description of the Program

Description of the Program

Nuffield Early Language Intervention is a 30-week language intervention program delivered in the final term in Nursery school (ages 3-4) and the first two terms in Reception class (age 5). The first 10 weeks involve three 15-minute group sessions (2-4 children per group) per week delivered in preschool. This increases to three 30-minute group sessions plus two 15-minute individual sessions per week in Reception class. A separate 20-week version skips the initial 10-week preschool portion and begins with the 30-minute sessions in primary school.

Theoretical Rationale

The program is based on research that learning to read builds on oral language skills, and that children must learn to decode print fluently and develop skills to understand what they read to become literate. In addition to phoneme awareness and letter knowledge, reading comprehension requires broader language skills.

Theoretical Orientation

Skill Oriented

Brief Evaluation Methodology

Primary Evidence Base for Certification

Study 1

Study 2

Study 3

Study 4

Outcomes (Brief, over all studies)

Primary Evidence Base for Certification

Study 1

Fricke et al. (2013) found a significant impact at posttest and 6 months after the intervention on language skills, narrative skills, and phonological awareness, and a significant impact on reading comprehension at 6 months. There was no impact on literacy (early word reading and spelling) at either assessment.

Study 2

Sibieta et al. (2016) found a significant treatment effect for the 30-week intervention at posttest for the composite language score, driven mainly by the grammar measure and the expressive vocabulary measure. It found a marginal effect for the 20-week intervention.

Study 3

Fricke et al. (2017) found that, at posttest and the delayed 6-month follow-up, children in treatment (both the 30-week and 20-week interventions) demonstrated greater improvements than children in the control group on a composite measure of language skills (vocabulary, grammar, and listening comprehension). There was no impact on early literacy or reading comprehension.

Study 4

Dimova et al. (2020) found that children in the treatment condition displayed significantly better language and reading skills than children in the control condition at posttest.

Outcomes

Primary Evidence Base for Certification

Study 1

Fricke et al. (2013) found, compared to the control group, the program significantly improved scores for the treatment children on:

Language skills (vocabulary, grammar and listening comprehension)
Narrative awareness
Phonological awareness
Reading comprehension

Study 2

Sibieta et al. (2016) found at the posttest, compared to the control group, the 30-week treatment group showed significantly improved:

Language skills (vocabulary, grammar, and listening comprehension)

Study 3

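Fricke et al. (2017) found at posttest and delayed 6-month follow-up, compared to children in the control group, children in treatment (both in the 30-week and 20-week intervention groups) showed greater improvements in:

Language skills (vocabulary, grammar, and listening comprehension)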

Study 4

Dimova et al. (2020) found at the posttest, children in the intervention condition, compared to the control condition, demonstrated significantly better:

Language skills
Reading ability

Mediating Effects

In Study 1 (Fricke et al., 2013), language scores immediately after the intervention fully mediated reading six months after the intervention.

Effect Size

Study 1 (Fricke et al., 2013) reported small to large effects (Cohen's d = .30 to .83). Study 2 (Sibieta et al., 2016) reported small effect sizes (.16 to .27). Study 3 (Fricke et al., 2017) reported small effect sizes (d = .21 to .30). Study 4 (Dimova et al., 2020) reported small to medium effect sizes (g = .15 to .36).
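
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of the textbook formulas for Cohen's d (difference in means divided by the pooled standard deviation) and Hedges' g (d multiplied by a small-sample correction). The scores are hypothetical and chosen only for illustration. The certified studies derived their effect sizes from their own statistical models, so this sketch does not reproduce their calculations.

```python
import math
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


def hedges_g(treatment, control):
    """Cohen's d multiplied by the usual small-sample bias correction."""
    n1, n2 = len(treatment), len(control)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d(treatment, control) * correction


# Hypothetical posttest scores, not data from any of the studies.
treatment_scores = [12, 15, 14, 16, 13, 17]
control_scores = [11, 13, 12, 14, 12, 13]

d = cohens_d(treatment_scores, control_scores)
g = hedges_g(treatment_scores, control_scores)
print(f"d = {d:.2f}, g = {g:.2f}")
```

Because the correction factor is always below 1, g is slightly smaller in magnitude than d, and the gap shrinks as sample sizes grow.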

Generalizability

Four studies meet Blueprints standards for high quality methods with strong evidence of program impact (i.e., "certified" by Blueprints): Study 1 (Fricke et al., 2013), Study 2 (Sibieta et al., 2016), Study 3 (Fricke et al., 2017) and Study 4 (Dimova et al., 2020). All four studies took place in England and compared the treatment group to an instruction-as-usual control group.

Notes

Study 4 was certified at posttest for Preschool Communication/Language Development and School Readiness (Dimova et al., 2020). However, due to very high attrition, the long-term evaluation at the two-year follow-up (Groom et al., 2023; Hulme et al., 2025) was not certified for these outcomes.

Endorsements

Blueprints: Promising

Program Information Contact

Denise Cripps
Executive Officer to the President
St. John's College, Oxford
OX1 3JP
denise.cripps@sjc.ox.ac.uk
Tel: 01 865 277456

References

Study 1

Certified

Fricke, S., Bowyer-Crane, C., Haley, A. J., Hulme, C., & Snowling, M. J. (2013). Efficacy of language intervention in the early years. Journal of Child Psychology and Psychiatry, 54(3), 280-290.

Study 2

Certified

Sibieta, L., Kotecha, M., & Skipp, A. (2016). Nuffield early language intervention: Evaluation report and executive summary. Education Endowment Foundation.

Study 3

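Certified

Fricke, S., Burgoyne, K., Bowyer-Crane, C., Kyriacou, M., Zosimidou, A., Maxwell, L., . . . Hulme, C. (2017). The efficacy of early language intervention in mainstream school settings: A randomized controlled trial. Journal of Child Psychology and Psychiatry, 58(10), 1141-1151. doi:10.1111/jcpp.12737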

Study 4

Certified

Dimova, S., Ilie, S., Brown, E. R., Broeks, M., Culora, A., & Sutherland, A. (2020). The Nuffield Early Language Intervention: Evaluation Report. United Kingdom: Education Endowment Foundation.

Groom, M., Brown, R.B., & Lymperis, L. (2023). The Nuffield Early Language Intervention addendum report. London: Education Endowment Foundation.

Hulme, C., West, G., Rios Diaz, M., Hearne, S., Korell, C., Duta, M., & Snowling, M. J. (2025). The Nuffield Early Language Intervention (NELI) programme is associated with lasting improvements in children's language and reading skills. Journal of Child Psychology and Psychiatry. https://doi.org/10.1111/jcpp.14157

Study 1

Summary

Fricke et al. (2013) found, compared to the control group, the program significantly improved scores for the treatment children on:

Language skills (vocabulary, grammar and listening comprehension)
Narrative awareness
Phonological awareness
Reading comprehension

Evaluation Methodology

Design:

Recruitment: Nineteen nursery schools in Yorkshire (England) were involved at the outset of the study. In these Nursery schools, all children who were due to enter school (Reception) in the following academic year were screened. Following screening, one school withdrew and three schools were deemed unsuitable. In each of the remaining 15 nursery schools, 12 children with the lowest mean verbal composite scores were selected as participants in the trial.

Assignment: The 180 children from the 15 nursery schools were randomly allocated within each school to receive the 30-week language intervention (n = 90) or to a waiting control group (n = 90).

In addition, six children in each school, matched on gender and date of birth to a random sample of three children from the intervention and the waiting control groups, served as a representative peer comparison group against which to benchmark the progress of the trial children (n = 82).

Assessments and Attrition: Children were assessed before the intervention, immediately following the intervention, and 6 months after the intervention. A total of 7 children from the intervention group (7.8%) and 8 children from the control group (8.9%) moved schools and were lost to follow-up.
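
The attrition percentages above follow directly from the counts given (90 children randomized to each arm, with 7 and 8 lost to follow-up). A minimal sketch of that arithmetic is shown below; the function name is illustrative only.

```python
def attrition_rate(randomized, lost):
    """Percentage of randomized participants lost to follow-up."""
    return 100 * lost / randomized


# Counts taken from the text: 90 children per arm, 7 and 8 lost to follow-up.
intervention = attrition_rate(90, 7)
control = attrition_rate(90, 8)
overall = attrition_rate(180, 15)

print(f"Intervention: {intervention:.1f}%")
print(f"Control: {control:.1f}%")
print(f"Overall: {overall:.1f}%")
```

The overall figure of roughly 8% matches the rate cited under Differential Attrition for this study.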

Sample Characteristics:
The mean age of children at baseline was 4 years. No other demographic information was provided.

Measures:
All measures had high reliability (0.75 to 0.99). The structural equation models used multiple indicators for each of the following four outcomes.

Language Skills were measured with grammar and vocabulary information from the Renfrew Action Picture Test, vocabulary knowledge from the CELF Preschool IIUK Expressive Vocabulary test, and listening comprehension from answers to questions about two short stories read to the child.

Narrative skills were measured using a story retelling task.

Phonological awareness was measured by indicators of alliteration matching and sound isolation.

Literacy skills were measured by an early word reading scale and spelling responses.

Additional measures focused on taught vocabulary using Expressive Picture Naming and Receptive Picture Selection and the Picture Naming and Definitions task. Reading comprehension, available only at the 6-month follow-up, used the YARC beginner passage.

Analysis: The authors used hierarchical linear models or structural equation models, with Maximum Likelihood Missing Value estimators to allow for missing data and robust standard errors to allow for the clustering of children within schools. The structural equation models included baseline outcomes as predictors.

Intent-to-Treat: The study did not follow children who moved schools, but structural equation models with missing values estimators used all 180 subjects.

Outcomes

Implementation Fidelity: The article reported no fidelity results, though it says that fidelity was monitored: "teaching assistants attended regular tutorials and the research team observed each teaching assistant delivering intervention and provided feedback on five occasions. In addition, teaching assistants completed records of session plans, children's progress and attendance for each group and individual session."

Baseline Equivalence: The intervention and control groups were said to be approximately equated on all measures, but the study presented no d values or significance tests.

Differential Attrition: No tests were performed, perhaps because attrition was only about 8% and missing data were imputed.

Posttest: The intervention had significant and beneficial effects on language skills (d = .80), narrative skills (d = .39), and phonological awareness (d = .49). The effects on literacy (early word reading and spelling) were not significant.

6-month Follow-up: The above posttest effects were maintained at follow-up: language skills (d = .83), narrative skills (d = .30), and phonological awareness (d = .49).

Also, at 6 months, there was a significant effect on reading comprehension (marginal mean group difference = 0.91, 95% CI 0.42-1.41, p < .001). This was found to be fully mediated by language comprehension abilities at posttest.
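
As a rough consistency check on the interval and p-value reported above, the standard error can be back-calculated from the 95% confidence interval and converted to a two-sided p-value. The sketch below assumes a symmetric interval based on the normal approximation (critical value 1.96), which the source does not state; the function names are illustrative.

```python
import math


def se_from_ci(lower, upper, z_crit=1.96):
    """Standard error implied by a symmetric 95% CI (normal approximation assumed)."""
    return (upper - lower) / (2 * z_crit)


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate / se under a standard normal distribution."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))


# Values taken from the text: group difference 0.91, 95% CI 0.42 to 1.41.
se = se_from_ci(0.42, 1.41)
p = two_sided_p(0.91, se)

print(f"Implied SE = {se:.3f}, approximate p = {p:.4f}")
```

Under these assumptions the implied p-value falls below .001, in line with the reported result.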

Additional tests on the vocabulary taught by the program showed higher scores for the intervention group in reception classes but not in nursery classes.

Long-term effects: Not evaluated.

Study 2

Summary

Sibieta et al. (2016) conducted a randomized controlled trial with 394 children from 34 nursery schools in Yorkshire. From each school, approximately 12 children with the lowest mean verbal composite scores were selected as participants in the trial. Children were assigned to a 30-week treatment group, 20-week treatment group, or waitlisted control group. The study conducted assessments to measure language and literacy skills at baseline, posttest, and 6 months after completion of the intervention.

Sibieta et al. (2016) found at the posttest, compared to the control group, the 30-week treatment group showed significantly improved:

Language skills (vocabulary, grammar, and listening comprehension)

Evaluation Methodology

Design:

Recruitment: The study recruited primary schools with attached nurseries that were located in disadvantaged areas of Yorkshire, England. A total of 34 schools were included in the study out of 302 approached. Within selected schools, children were screened using a composite measure of language skills including vocabulary and sentence structure. Children with the 12 lowest scores were invited to participate with parent consent.

Assignment: The randomization process allocated pupils within each nursery to two treatments and one control group and minimized differences across groups in terms of age, gender, and pretest scores using an iterative optimization process. Assignment was conducted at the individual level. Of the 394 assigned students, 132 were in the 30-week treatment, 133 in the 20-week treatment, and 129 in the control group. Participating schools were offered alternate early language development programs after completion of the intervention for waitlisted control participants.

Assessments and Attrition: Assessments were conducted at pretest, posttest, and 6-month follow-up. Of 394 students enrolled, 350 remained at the 6-month follow-up. Of the 34 schools enrolled, 3 left the program before completion. Reasons for attrition included changing schools (n = 34) and not completing one of the assessments (n = 10). In addition, some of the moderator variables gathered from a national database were available for a sample of only 239.

Sample Characteristics: The sample was approximately half female (49%) and an average of 46.1 months old. Given the targeting of disadvantaged schools, about 29% of the students qualified for free school meals and 16% were learning English as an additional language.

Measures: The study distinguished primary and secondary measures. The measures were gathered by research assistants blind to condition. Although the study reported no information on validity or reliability, it appears that the measures are well standardized and commonly used.

For the primary outcome, the study used a composite language measure that consisted of four components: information scores and grammar scores from the Renfrew Action Picture Test (APT), which asks students to describe a set of pictures; expressive vocabulary from the CELF-Preschool 2 UK test; and a listening comprehension test using short stories.

For the secondary outcome, the study used a word-level literacy composite measure consisting of three components: letter-sound knowledge, early word reading, and spelling.

Analysis: The study used fully-interacted linear matching, which linearly interacts the treatment effect with all pre-treatment characteristics and outcomes. The models controlled for gender, age, English as a second language, known speech or language difficulties, and pre-treatment scores for the language composite assessment. They also adjusted for school-level clustering in the estimation of standard errors, and Appendix C checks the results with several other estimation techniques.

Intent-to-Treat: The analysis included all cases with complete data. Three schools dropped out of the intervention condition, but the students were followed and included in the analysis. Only students leaving the schools or not completing an assessment were excluded.

Outcomes

Implementation Fidelity: Three schools dropped the program and 5 of the other 31 schools showed significant deviation related to both the structure and delivery of the program. However, the study stated that overall the delivery of the program structure and session components were generally in line with the prescribed model. On average, students attended 80% of classroom sessions and 56% of individual sessions.

Baseline Equivalence: Tests for baseline differences across conditions (Table 9) used the analysis sample of 350 rather than the randomized sample of 394. There were no significant differences between conditions for demographic and outcome measures. Measures based on the subsample of 239 used in the moderation tests did show some differences, however.

Differential Attrition: Tests for baseline equivalence of the analysis sample, which excluded dropouts, indicated that attrition did not compromise the balance between conditions.

Posttest: At posttest, the 30-week intervention group showed significantly more improvement in the primary measure of the composite language score as compared to the control group. This improvement was driven by significant improvement in grammar scores and expressive vocabulary. The 20-week intervention group showed marginal improvement in the language composite score.

For the secondary measure of composite word-level literacy, neither the 30-week nor the 20-week program had significant effects.

The study conducted a 6-month follow-up, finding that the composite language score but not the composite literacy score was significantly improved among both the 30-week and 20-week intervention groups. However, in the intervening 6 months, some schools implemented other reading programs at varying times, which complicates results.

Finally, tests for moderation suggest that the program was most effective for students without known speech and language difficulties or students learning English as an additional language.

Long-Term: The study did not conduct long-term follow-up.

Study 3

Summary

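Fricke et al. (2017) conducted a replication and extension of the study by Fricke et al. (2013) that involved a randomized controlled trial with 394 children from 34 nursery schools in England (17 in Greater London and 17 in Yorkshire/Nottinghamshire). Within each school, children were assigned to the 30-week intervention, the 20-week intervention, or the waitlist control group. Literacy and language skills were measured at baseline, posttest, and 6 months after completion of the intervention.

Fricke et al. (2017) found at posttest and delayed 6-month follow-up, compared to children in the control group, children in treatment (both in the 30-week and 20-week intervention groups) showed greater improvements in:

Language skills (vocabulary, grammar, and listening comprehension)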

Evaluation Methodology

Design:

Recruitment: A total of 302 primary schools with attached nurseries in generally disadvantaged areas were approached, and 34 schools agreed to participate. All children in these nurseries who were expected to enter school (Reception in England) the following academic year were screened. Children in need of special education and children learning English as an additional language were not included in the screening. Up to 12 children were selected within each school using the following criteria: (a) having the lowest mean verbal composite scores in their school, and (b) entering the same primary school they attended for nursery. A total of 394 students met these criteria.

Assignment: Within each school, children were assigned to either the 30-week intervention (n=132), 20-week intervention (n=133), or the waitlist control group (n=129). The control group received business-as-usual. After the posttest, schools were given permission to deliver to control group students additional language and literacy support provided by the research team (which was different from the Nuffield Early Language Intervention). Fifteen schools opted for training in this additional support, but by the 6-month follow-up only 8 schools had begun to implement it (the specific nature, quality and intensity varied widely). The other 19 schools continued with business-as-usual.

Attrition: Overall attrition rates were 7% at the posttest and 16% at the 6-month follow-up.

Sample: Information about sample characteristics was not provided.

Measures:

The measures were gathered by research assistants blind to condition. The primary outcome, language skills, was assessed using a latent measure that consisted of the following six assessments:

Expressive vocabulary knowledge was measured using the CELF Expressive Vocabulary subtest (alpha = .82) and the Information Score from the Renfrew Action Picture Test (APT; interrater reliability = .83).
Receptive vocabulary skills were assessed using the BPVS (alpha = .91).
Grammatical skills were measured using the CELF Sentence Structure subtest (alpha = .78) and the APT Grammar Score (interrater reliability = .89).
Listening comprehension skills were tested by asking children to listen to two short stories adapted from the York Assessment of Reading for Comprehension (YARC) and answer questions about them (interrater reliability = .99).

Secondary outcomes were early literacy skills and reading comprehension. Early literacy skills were assessed using a latent measure that consisted of two assessments: the YARC letter-sound knowledge subtest (alpha = .95) and the YARC early word reading subtest (alpha = .98). Reading comprehension was measured using the two beginner passages from the YARC passage reading test (alpha = .77).

Analysis: Structural equation models (SEM) were constructed using Mplus 7.4 with Full Information Maximum Likelihood estimators to allow for missing data and robust (Huber-White) standard errors to allow for the clustering of children within schools. Although baseline outcomes were adjusted in these models, it appears that the models did not control for demographic variables. In this model, the unstandardized regression weights from the language pretest to the two language post-test factors were fixed to be equal.

Intent-to-Treat: The authors argued that all analyses were performed on an intention-to-treat basis.

Outcomes

Implementation Fidelity: Teaching assistants delivered on average 28/30 group sessions in nursery and 49/57 group sessions in reception for the 30-week intervention group. For the 20-week intervention group, teaching assistants delivered on average 49/57 group sessions in Reception. The number of sessions each child attended varied considerably (30-weeks group: nursery group sessions M = 24.69, reception group sessions M = 38.51; individual sessions M = 21.91; 20-weeks: reception group sessions M = 41.11, individual sessions M = 23.01). Some teaching sessions were also observed to assess treatment fidelity. The quality of teaching of different session components was graded on a 5-point scale with the manual instructions as a reference point (1 = several aspects missing/not satisfactory, 2 = some aspects missing/not satisfactory, 3 = according to manual, 4 = according to manual with good use of resources/questions/techniques to support language, 5 = according to manual with very good use of resources/questions/techniques). On average, teaching assistants achieved a mean quality rating of 2.83 for group session observations in nursery, 2.95 in the first 10 weeks in reception, and 3.20 in the second 10 weeks in reception. Fidelity and quality ratings for individual sessions tended to be lower than for group sessions (first 10 weeks in reception: M = 2.74, second 10 weeks: M = 2.83).

Baseline Equivalence: Baseline equivalence was tested only on outcome measures without significance tests.

Differential Attrition: In response to a Blueprints request, the lead author sent results for complete tests of differential attrition for the 20- and 30-week intervention groups at posttest and 6-month follow-up. In 36 tests using baseline measures, condition, and the interaction of condition by each baseline measure to predict attrition, no interaction coefficients were significant.

Posttest: At the posttest and the 6-month follow-up, Fricke et al. (2017) reported that children in treatment (both in 20-week and 30-week intervention groups) showed greater improvements in language skills than children in control. Effect sizes were slightly larger for the 30-week intervention group. There were no significant differences between the two treatment groups. Also, there were no significant differences between treatment and control in secondary outcome measures assessing early literacy and reading comprehension.

The interaction between condition and pretest scores was not significant, which suggests that children with the most severe language difficulties at pretest responded to the intervention to the same degree as children with less severe difficulties.

Long-Term: Not conducted.

Study 4

The program was delivered in the reception or kindergarten year. Groom et al. (2023) conducted an independent evaluation without the participation of the intervention team and, given high attrition, labeled their follow-up results as exploratory. Hulme et al. (2025) presented the results from the intervention team and, given high attrition, treated the follow-up data as quasi-experimental.

Summary

Dimova et al. (2020), Groom et al. (2023), and Hulme et al. (2025) conducted a cluster randomized trial by randomly assigning 193 schools (1,156 children) across the United Kingdom to either receive the 20-week intervention program or continue instruction as usual. Language and reading skills were assessed at baseline, posttest, and two-year follow-up.

Dimova et al. (2020), Groom et al. (2023), and Hulme et al. (2025) found that children in the intervention condition, compared to the control condition, demonstrated significantly better:

Language skills at posttest and follow-up
Reading ability at posttest and follow-up

Evaluation Methodology

Design:

Recruitment: In the summer of 2018, 1,100 schools were approached to participate in the study, and 207 schools expressed interest. A total of 193 schools that contained 240 reception or kindergarten classrooms agreed to participate. The schools were recruited from 13 geographic regions across the UK and included a balance of rural and urban schools. To be eligible, schools needed to have not implemented the program before, have above average free school meal eligibility, and agree to implement their assigned condition. From each classroom, the five children with the lowest scores on a composite language skills measure (LanguageScreen) were recruited for the study (total children n = 1,156 in Dimova et al., 2020, and Groom et al., 2023, or 1,173 in Hulme et al., 2025).

Assignment: Using a stratified cluster randomized controlled trial design, schools were randomly assigned to either the treatment condition (n = 97 schools; n = 585 children in Dimova et al., 2020, and Groom et al., 2023, or 581 in Hulme et al., 2025) or a business-as-usual control condition (n = 96 schools; n = 571 children in Dimova et al., 2020, and Groom et al., 2023, or 592 in Hulme et al., 2025). The geographic area and number of classes within a school defined the assignment strata. As an incentive, control schools received £1,000 for their participation.

Assessments/Attrition: Dimova et al. (2020) assessed children at baseline (prior to randomization) and at posttest (immediately after the 20-week program). Overall, 0.5% of schools (n = 1) did not complete the posttest; 7% of students (n = 85) did not complete the posttest. Groom et al. (2023) and Hulme et al. (2025) added a long-term follow-up assessment that occurred during the period from May 2021 to October 2021, approximately two years on average after the posttest (a planned assessment in July 2020 was cancelled due to the COVID-19 pandemic). Attrition at the follow-up was 42.5% of schools and 55.5% of students in Groom et al. (2023) or 54.3% of students in Hulme et al. (2025).

Sample: The only individual-level demographic variables described in the report were gender, with both conditions having slightly more boys than girls (57.4% and 53.2%), and age, with both conditions having average ages around 52 months. At the school level, about 34% of students were eligible for free school meals.

Measures:

Dimova et al. (2020) examined measures gathered by research assistants blind to condition. The primary outcome, language skills, was assessed using a latent measure that consisted of the following four language tests:

the Clinical Evaluation of Language Fundamentals (CELF) recalling sentences subtest and expressive vocabulary subtest
the Renfrew Action Picture Test (RAPT) information description subtest and grammar subtest
the York Assessment of Reading for Comprehension (YARC) early word reading test
the LanguageScreen, a composite scale of expressive vocabulary, receptive vocabulary, sentence repetition, and listening comprehension.

Groom et al. (2023) did not distinguish primary and secondary outcomes, and the assessment method had changed to have school assessors rather than research assistants collect the data. The study examined three measures (early word reading, reading fluency, and reading comprehension) from the York Assessment of Reading for Comprehension test. School assessors recorded children's verbatim responses and uploaded the responses to the University of Oxford team via Qualtrics for scoring by the research team. The study also examined a latent oral language scale created from four LanguageScreen measures and two Renfrew Action Picture Test measures. The LanguageScreen was scored automatically in the App used by the school assessors, while the Renfrew Action Picture Tests were uploaded for scoring by the research team.

Note that most students (93%) were tested while in the last term of Year 2 (i.e., first grade), but 7% were tested in the first term of Year 3 (i.e., second grade) after the summer break. However, sensitivity tests in Appendix D in Groom et al. (2023) that controlled for age at the assessment did not significantly change the results.

Hulme et al. (2025) used measures similar to those of Groom et al. (2023) but without the latent factor. The five outcomes included the total score from the LanguageScreen (the sum of expressive vocabulary, receptive vocabulary, sentence repetition, and listening comprehension), two items from the Renfrew Action Picture Test (information and grammar), and two items from the York Assessment of Reading for Comprehension (early word reading, passage comprehension). The authors noted that the number of items in the LanguageScreen subtests differed across time points.

Analysis: Dimova et al. (2020) used multilevel models to account for the clustering of students nested within schools. Models controlled for school-level stratification variables (geographic location and number of classes per school), as well as student baseline scores.

Groom et al. (2023) also used a two-level multilevel model to account for the clustering of students in schools and included controls for the baseline outcome and other variables. However, the authors stated, "Due to the high attrition rate (55.5%), analysis is both underpowered and at greater risk of bias. As the analysis is exploratory, this report does not place much weight on measures of statistical significance." The authors did not report p-values, only confidence intervals for effect sizes.

Given the high attrition rate and assuming that the randomization was "no longer intact," Hulme et al. (2025) presented results from a quasi-experimental design (QED). They estimated a propensity score matching model that aimed to equalize condition means on baseline measures for the analysis sample of completers and approximate randomization. The baseline measures used in the model included the baseline outcome, EAL status, age, gender, and baseline Early Word Reading score. The measures do not directly include influences related to the school response to the COVID-19 pandemic, a likely source of non-response according to Groom et al. (2023).
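
The matching step can be illustrated with a minimal sketch. It assumes that propensity scores have already been estimated (for example, by regressing condition on the baseline measures) and uses greedy 1:1 nearest-neighbor matching with a caliper. The function name, the caliper value, and the scores are hypothetical; the exact matching algorithm used by Hulme et al. (2025) is not described in this summary and may differ.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping a child id to an estimated propensity score.
    Returns a list of (treated_id, control_id) pairs within the caliper.
    """
    available = dict(controls)
    pairs = []
    # Process treated children in a fixed order so results are reproducible.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Hypothetical propensity scores, for illustration only.
treated = {"t1": 0.62, "t2": 0.48, "t3": 0.91}
controls = {"c1": 0.50, "c2": 0.60, "c3": 0.30}
pairs = match_nearest(treated, controls)
print(pairs)
```

In this toy input, one treated child has no control within the caliper and is left unmatched, which shows how matching can shrink the analysis sample while improving balance on the measured baseline variables only.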

Missing Data Methods: Dimova et al. (2020, p. 25) stated that no missing data imputation was used given complete data on the primary outcome measure. Groom et al. (2023) also used complete case analysis rather than multiple imputation because their data were likely missing not at random. Hulme et al. (2025) used complete cases in selecting the matched sample.

Intent-to-Treat: Most analyses used all schools and students with complete data in the conditions to which they were originally assigned, regardless of the extent or quality of the treatment received. Hulme et al. (2025) used matched respondents who had complete data in the propensity score analysis, irrespective of the treatment dose.

Outcomes

Implementation Fidelity: The researchers calculated a compliance measure based on the share of eligible school staff attending NELI training, the proportion of group NELI sessions delivered, and the number of individual NELI sessions delivered. Based on these measures, only 11 schools were rated as "high compliers," indicating differences in the quality of program implementation. Data on teacher attendance and externally recorded number of intervention program sessions completed were not available.

Baseline Equivalence: Dimova et al. (2020) did not perform formal tests to assess baseline equivalence, stating that they followed CONSORT guidelines and provided baseline descriptive characteristics in Table 12. Other than school-level "OFSTED" ratings, intervention and control condition baseline characteristics appeared similar for both school- and individual child-level variables. Following WWC guidelines, they did provide effect sizes for the baseline differences for outcome measures, and all differences were small (g < .128).

Differential Attrition: Because overall attrition was relatively low, Dimova et al. (2020) performed no formal tests for differential attrition. At the student level, there was 5.3% (n = 30) attrition in the control condition and 9.4% (n = 55) attrition in the intervention condition. The overall attrition rate combined with the difference in condition rates meets the WWC cautious and optimistic standards.
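
The attrition figures above follow from simple arithmetic on the reported counts. The sketch below recomputes them, assuming the child-level denominators are the assigned sample sizes reported for Dimova et al. (2020) (571 control, 585 intervention); the variable and function names are illustrative.

```python
# Student counts at randomization and posttest dropouts, as reported for
# Dimova et al. (2020) in this summary.
assigned = {"control": 571, "intervention": 585}
lost = {"control": 30, "intervention": 55}


def attrition_rate(n_lost: int, n_assigned: int) -> float:
    """Share of assigned students without posttest data, in percent."""
    return 100.0 * n_lost / n_assigned


rates = {arm: attrition_rate(lost[arm], assigned[arm]) for arm in assigned}
overall = attrition_rate(sum(lost.values()), sum(assigned.values()))
# Differential attrition: absolute gap between the two condition rates.
differential = abs(rates["intervention"] - rates["control"])

for arm, rate in rates.items():
    print(f"{arm}: {rate:.1f}%")
print(f"overall: {overall:.1f}%")
print(f"differential: {differential:.1f} percentage points")
```

Overall and differential attrition are the two quantities that the WWC attrition standards evaluate jointly.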

Groom et al. (2023) reported a high attrition rate of 55.5% at the long-term follow-up. They used a multilevel model to test for missingness at follow-up as a function of baseline covariates, including treatment. The results in Appendix B showed that only one of 30 predictors in the model reached statistical significance (the dropout rate was higher in the Durham region). There was a non-significant tendency for dropout rates to be higher in more economically disadvantaged schools and lower-performing schools. The authors acknowledged that attrition was likely related to other unobservable factors, such as the differential impact of COVID-19 on schools and pupils, but they argued that the factors likely affected the two conditions equally. Despite the high attrition rate, the condition difference in attrition rates was so small (0.1%) that the study meets the WWC cautious and optimistic standards.

Hulme et al. (2025) reported similar tests by comparing baseline measures for children who completed and did not complete the follow-up. Baseline age and gender did not differ between completers and dropouts, but completers had significantly better language skills than dropouts. Schools taking part in the follow-up were situated in significantly more advantaged postcodes than those dropping out. The authors also reported on baseline equivalence tests for the analysis sample. They found no significant condition differences on age, gender, or the LanguageScreen total. The condition attrition rates differed from those in Groom et al. (2023), and the difference between the rates meets the WWC optimistic standard but not the cautious standard.

Posttest: At posttest, Dimova et al. (2020) found that, for the primary outcome, the intervention school students demonstrated significantly better language skills (as assessed by the latent language skills variable) than control school students (g = .26). For secondary outcomes, intervention school students also demonstrated significantly better single-word reading ability (as assessed by the YARC early word reading test, g = .15) and significantly better language skills (as measured by the LanguageScreen, g = .36).
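
For readers unfamiliar with the effect size metric, the following is a minimal sketch of how Hedges' g is conventionally computed from two group means and standard deviations. The input values are hypothetical and are not taken from any of the reports. The published estimates came from multilevel models that adjusted for baseline scores and stratification variables, so this unadjusted formula would not reproduce them.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with the small-sample correction."""
    # Pooled variance across the two groups.
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    d = (mean_t - mean_c) / math.sqrt(pooled_var)
    # Common approximation to the small-sample correction factor J.
    j = 1 - 3 / (4 * (n_t + n_c) - 9)
    return j * d


# Hypothetical illustration values only; they are not taken from any report.
g = hedges_g(mean_t=52.0, mean_c=50.0, sd_t=10.0, sd_c=10.0, n_t=100, n_c=100)
print(round(g, 3))
```

With samples of this size the correction factor is close to 1, so g is nearly identical to the uncorrected standardized difference.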

Long-Term: At the two-year follow-up, Groom et al. (2023) found that one of three reading measures was significant. The intervention group scored higher on the early word reading measure (g = .18, with a confidence interval of .02 to .39). The authors noted that these measures are designed for younger students, and ceiling effects for older students may limit the intervention impact. The other outcome, the latent measure of early language, showed a higher outcome for the intervention group than the control group (g = .18, with a confidence interval of .02 to .35).

Comparing the long-term outcomes to the posttest outcomes showed that the long-term effects resulted from the persistence of posttest treatment effects rather than any additional gains made during the follow-up period. The intervention effects thus appear to have been sustained.

Moderation tests suggested a larger intervention effect on language for pupils eligible for free school meals than for other pupils.

In the propensity score matching analysis, Hulme et al. (2025) reported on five outcomes (see the text on pages 5-6), two of which showed significant effects. The matched intervention group had higher LanguageScreen scores (d = .22) and reading comprehension (d = .16).