Volume 7, Issue 1: 2014
Linguistic microfeatures to predict L2 writing proficiency: A case study in Automated Writing Evaluation
by Scott A. Crossley, Kristopher Kyle, Laura K. Allen, Liang Guo, & Danielle S. McNamara
This study investigates the potential for linguistic microfeatures related to length, complexity, cohesion, relevance, topic, and rhetorical style to predict L2 writing proficiency. Computational indices were calculated by two automated text analysis tools (Coh-Metrix and the Writing Assessment Tool) and used to predict human essay ratings in a corpus of 480 independent essays written for the TOEFL. A stepwise regression analysis indicated that six linguistic microfeatures explained 60% of the variance in human scores for essays in a test set, providing an exact accuracy of 55% and an adjacent accuracy of 96%. To examine the limitations of the model, a post-hoc analysis was conducted to investigate differences in the scoring outcomes produced by the model and the human raters for essays with score differences of two or greater (N = 20). Essays scored as high by the regression model and low by human raters contained more word types and perfect tense forms compared to essays scored high by humans and low by the regression model. Essays scored high by humans but low by the regression model had greater coherence, syntactic variety, syntactic accuracy, word choices, idiomaticity, vocabulary range, and spelling accuracy as compared to essays scored high by the model but low by humans. Overall, findings from this study provide important information about how linguistic microfeatures can predict L2 essay quality for TOEFL-type exams and about the strengths and weaknesses of automatic essay scoring models.
An important area of development for second language (L2) students is learning how to share ideas with an audience through writing. Some researchers suggest that writing is a primary language skill that holds greater challenges for L2 learners than speaking, listening, or reading (Bell & Burnaby, 1984; Bialystok, 1978; Brown & Yule, 1983; Nunan, 1989; White, 1981). Writing skills are especially relevant for L2 learners involved with English for specific purposes (i.e., students primarily interested in using language in business, science, or the law) and for L2 learners who engage in standardized writing assessments used for admittance into, advancement within, and eventual graduation from academic programs.
Given the importance of writing to L2 learners, it is no surprise that investigating how the linguistic features in a text can explain L2 written proficiency has been an important area of research for the past 30 years. Traditionally, this research has focused on propositional information (i.e., the lexical, syntactic, and discoursal units found within a text; Crossley, 2013; Crossley & McNamara, 2012), such as lexical diversity, word repetition, text length, and word frequency (e.g., Connor, 1990; Engber, 1995; Ferris, 1994; Frase, Faletti, Grant, & Ginther, 1999; Jarvis, 2002; Jarvis, Grant, Bikowski, & Ferris, 2003; Reid, 1986, 1990; Reppen, 1994). More recent studies have begun to assess lexical proficiency using linguistic features found in situational models (i.e., a text’s temporality, spatiality, and causality; Crossley & McNamara, 2012) and rhetorical features more closely related to argument structure (Attali & Burstein, 2005; Attali, 2007). While research in this area continues to advance, a coherent understanding of the links between linguistic features in the text and how these features influence human judgments of writing proficiency is still lacking (Jarvis et al., 2003). There are many reasons for this dearth of understanding, chief among them the sheer variety of topics, prompts, genres, and tasks used to portray writing proficiency. Another is the incongruence among the types, numbers, and sophistication of the methods used to investigate writing proficiency, which makes it challenging to compare measures and studies (Crossley & McNamara, 2012).
This study focused specifically on how linguistic microfeatures produced in an independent writing task can explain L2 writing proficiency. Our focus is on one specific domain common to L2 writing research studies: a standardized independent writing assessment as found in the Test of English as a Foreign Language Internet-Based Test (TOEFL iBT). These independent writing tasks require students to respond to a writing prompt and assignment without the use of secondary sources. Standardized independent writing assessments like those found in the TOEFL are an important element of academic writing for L2 learners and provide a reliable representation of many underlying writing abilities because they assess writing performance beyond morphological and syntactic manipulation. These assessments ask writers to provide extended written arguments that tap into textual features, discourse elements, style issues, topic development, and word choice and that are built exclusively from their experience and prior knowledge (Camp, 1993; Deane, 2013; Elliot et al., 2013). Our study builds on previous research in the area, but overcomes previous shortcomings by relying on advanced computational and machine learning techniques, which provide detailed information about the linguistic features investigated and the models of writing proficiency developed in this study. Although general explanations of the variables and models employed in AES systems have been described in previous publications (e.g., Enright & Quinlan, 2010), detailed explanations of the variables and scoring algorithms are uncommon in published studies on TOEFL writing proficiency because of the need to protect proprietary information (Attali, 2004; 2007; Attali & Burstein, 2005; 2006). In one of the most detailed of these published studies, for example, Enright and Quinlan (2010) provided an outline of the aggregated features included in e-rater (e.g., organization, development, mechanics, usage, etc.) and the relative weights for each of these aggregated features. While some of the aggregated features are fairly straightforward (e.g., lexical complexity, which is comprised of the microfeatures average word length and word frequency), others are much more opaque. For instance, the aggregated features “organization” and “development,” which when combined are responsible for 61% of the e-rater score, are reported to be comprised of the “number of discourse elements” and the “length of discourse elements,” respectively. However, little information is provided with regard to what qualifies as a “discourse element” or how these elements are computationally identified (though see Burstein, Marcu, and Knight, 2003). Thus, our goal in the current study is to contribute to the growing body of knowledge regarding the identification of microfeatures by providing a comprehensive, rigorous, and elaborative model of L2 writing quality using microfeatures, which can be used as a guide in writing instruction, essay scoring, and teacher training.
In addition, we examined the limitations and weaknesses of statistical models of writing proficiency (cf. Deane, 2013; Haswell & Ericsson, 2006; Herrington & Moran, 2001; Huot, 1996; Perelman, 2012; Weigle, 2013a). We did this by qualitatively and quantitatively assessing mismatches between our automated scoring model and human raters. This analysis provides a critical examination of potential weaknesses of automated models of writing proficiency that questions elements of model reliability and, at the same time, provides suggestions for improving such models. By noting both the strengths and weaknesses of an automatic essay scoring model, we hope to address concerns among scholars and practitioners about issues of model reliability (Attali & Burstein, 2006; Deane, Williams, Weng, & Trapani, 2013; Perelman, 2014; Shermis, in press) and construct validity (Condon, 2013; Crusan, 2010; Deane et al., 2013; Elliot et al., 2013; Haswell, 2006; Perelman, 2012).
Writing in a second language (L2) is an important component of international education and business. Research into L2 writing has investigated a wide spectrum of variables that explain writing development and proficiency, including first language background (Connor, 1996), writing purpose, writing medium (Biesenbach-Lucas & Weasenforth, 2000), cultural expectations (Matsuda, 1997), writing topic, writing audience (Jarvis et al., 2003), and the production of linguistic microfeatures (Crossley & McNamara, 2012; Ferris, 1994; Frase et al., 1999; Grant & Ginther, 2000; Jarvis, 2002; Reid, 1986, 1990, 1992; Reppen, 1994). The need to teach and assess L2 writing quickly and efficiently at a global scale increases the importance of AES systems (Weigle, 2013b).
The current study operated under the simple premise that the linguistic microfeatures in a text are strongly related to human judgments of perceived writing proficiency. The production of linguistic microfeatures in written text, especially in timed written texts where the writer does not have access to outside sources, reflects writers’ exposure to a second language and the amount of experience and practice they have in understanding and communicating in that second language (Crossley, 2013; Dunkelblau, 1990; Kamel, 1989; Kubota, 1998). Also, unlike L1 writing, L2 writing can strongly vary in terms of linguistic production (e.g., syntax, morphology, and vocabulary) and is dependent on both writing ability and language ability (Weigle, 2013b). Thus, linguistic microfeatures in a text are reliable cues from which to judge L2 writing proficiency (although not the only cues). Common cues found in writing studies relate to propositional features, such as the lexical, syntactic, and discourse units found in the text. Researchers have used such cues to investigate L2 writing development and L2 writing constraints using longitudinal approaches (Arnaud, 1992; Laufer, 1994), approaches that predict essay quality (Crossley & McNamara, 2012; Ferris, 1994; Engber, 1995), approaches that examine differences between L1 and L2 writers (Connor, 1984; Crossley & McNamara, 2009; Reid, 1992; Grant & Ginther, 2000), approaches that examine differences in writing topics (Carlman, 1986; Hinkel, 2002; Bonzo, 2008; Hinkel, 2009), and approaches that examine different writing tasks (Cumming et al., 2005, 2006; Guo, Crossley, & McNamara, 2013; Reid, 1990). More recently, researchers have started to investigate situational cues (Zwaan, Magliano, & Graesser, 1995) related to a text’s temporality, spatiality, or causality (Crossley & McNamara, 2009; 2012). 
Such studies provide foundational understandings about writing proficiency, the linguistic development of L2 writers, how L2 writers differ linguistically from L1 writers, and how prompt and task influence written production.
L2 Writing Proficiency
Our main interest in this paper is the examination of L2 writing proficiency. The most common approach to assessing writing proficiency is to assess relationships between linguistic microfeatures in an essay and the scores attributed to that essay by an expert human rater. In general, researchers have focused on lexical, syntactic, and cohesion microfeatures and how such features can predict essay scores.
Studies that have examined lexical features have found that higher rated L2 essays contain more words (Carlson, Bridgeman, Camp, & Waanders, 1985; Ferris, 1994; Frase et al., 1999; Reid, 1986, 1990), use words with more letters or syllables (Frase et al., 1999; Grant & Ginther, 2000; Reid, 1986, 1990; Reppen, 1994), and demonstrate greater lexical diversity (Engber, 1995; Grant & Ginther, 2000; Jarvis, 2002; Reppen, 1994). Syntactically, L2 essays that are rated as higher quality include more subordination (Grant & Ginther, 2000) and instances of passive voice (Ferris, 1994; Grant & Ginther, 2000). From a cohesion perspective, researchers have investigated explicit connections and referential links within text. The findings from these studies do not demonstrate the same level of agreement as those that investigate lexical and syntactic features of text. For instance, some past studies have shown that more advanced L2 writers produce a greater number of connectives (Jin, 2001) and pronouns (Reid, 1992), while more recent studies demonstrate that higher rated essays contain fewer conditional connectives (e.g., if-then), fewer positive logical connectives (e.g., and, also, then), less content word overlap, less given information, and less temporal cohesion (e.g., aspect repetition; Crossley & McNamara, 2012; Guo et al., 2013). In general, the findings from these studies indicate that linguistic variables related to lexical sophistication, syntactic complexity, and, to some degree, cohesion can be used to distinguish high proficiency L2 essays from low proficiency L2 essays.
Automatic Essay Scoring (AES)
Once it is established that linguistic microfeatures in a text can be used to separate high and low quality essays, it becomes possible to consider using such features to automatically score essays. Any computerized approach to analyzing texts falls under the field of natural language processing (NLP). NLP investigations of writing focus on how computers can be used to understand and analyze L2 written texts for the purpose of studying L2 writing development and proficiency. Prior to the development of NLP tools, such research required manually coding texts for linguistic microfeatures of interest, a process that is error prone, time consuming, and cost prohibitive (Higgins, Xi, Zechner, & Williamson, 2011). However, advances in computational linguistics have led to new techniques that allow researchers to automatically extract linguistic information from texts (Brill & Mooney, 1997; Dikli, 2006). These extraction techniques have led to the development of computer systems that can automatically provide assessments of the content, structure, and quality of written prose (Shermis & Barrera, 2002; Shermis & Burstein, 2003; Shermis, Burstein, & Leacock, 2006). Such systems are known as Automatic Essay Scoring (AES) systems.
AES systems can assist teachers in scoring essays in low-stakes classroom assessments and can also offer students greater opportunities for writing practice and feedback (Dikli, 2006; Page, 2003). Additionally, AES systems can benefit large-scale testing services by providing automated and reliable ratings for high-stakes writing assessments, such as the Graduate Record Exam (GRE) or the TOEFL (Dikli, 2006). In both situations, AES systems reduce the demands and complications often associated with human writing assessment, such as time, cost, and reliability (Bereiter, 2003; Burstein, 2003; Myers, 2003; Page, 2003). Of course, AES systems are not without their detractors. A position statement written by the Conference on College Composition and Communication (2004), the largest writing conference in North America, categorically opposes the use of AES systems in writing assessment. Other researchers voice concerns that AES systems cannot assess the entire construct of writing because they fail to address issues of argumentation, purpose, audience, and rhetorical effectiveness, which are hallmarks of quality writing attended to by human raters (Condon, 2013; Deane, 2013; Haswell, 2006; Haswell & Ericsson, 2006; Herrington & Moran, 2001; Huot, 1996; Perelman, 2012). More importantly, AES systems are generally only successful at scoring limited writing genres, such as the independent writing genre found in the TOEFL, and less successful at assessing other genres, such as authentic performance tasks and portfolio-based writing, which are considered more credible and valid forms of writing (as compared to large-scale commercial assessments; Condon, 2013; Elbow & Belanoff, 1986; Wardle & Roozen, 2013).
A few examples of AES systems that rely on NLP to assess writing are e-rater® (Burstein, 2003; Burstein, Chodorow, & Leacock, 2004), IntelliMetric (Rudner, Garcia, & Welch, 2005; 2006), Intelligent Essay Assessor (IEA; Landauer, Laham, & Foltz, 2003), and the Writing Pal (W-Pal) system (Crossley, Roscoe, & McNamara, 2013; McNamara, Crossley, & Roscoe, 2013). All of these systems provide scores for original essays through a comparison with a training set of annotated essays. Thus, the systems are based on the notion that essay quality is associated with specific and measurable groups of linguistic measures found in the text. AES methods first require human raters to code a set of essays for holistic quality as well as the presence of certain text properties, such as topic sentences, thesis statements, and evidence statements. The essays are then analyzed by the AES system along numerous linguistic dimensions related to lexical sophistication, syntactic complexity, grammatical accuracy, rhetorical features, and cohesion. This step allows the engine to extract linguistic features from the essays that can serve to discriminate higher- and lower-quality essays. In the last step, the extracted linguistic features are given weights and combined to create statistical models. These weighted statistical models can then be used to score essays along the previously selected dimensions.
Reliability and accuracy of Automated Essay Scoring systems. High agreement between AES engines and human raters has been reported in a number of studies (Attali, 2004; Landauer, Laham, & Foltz, 2003; Landauer, Laham, Rehder, & Schreiner, 1997; McNamara et al., 2013; Vantage Learning, 2003; Warschauer & Ware, 2006). The reported correlations for most AES systems typically range from .70 to .85 with one human rater, which is consistent with the range found between two human raters (Warschauer & Ware, 2006). For instance, in an unpublished study (Attali, 2008) reported in Enright and Quinlan (2010) involving e-rater, two human raters reached an agreement of r = .70 while one human rater and e-rater reached an agreement of r = .76. Weigle (2010) reported that correlations between the scores assigned by two human raters for 772 essays written by 386 ESL students in response to two prompts ranged from .64 (topic 1) to .67 (topic 2), while the correlations between the averaged human scores and the e-rater scores ranged from .76 (topic 1) to .81 (topic 2). IntelliMetric has reported mean correlations between automated scores and a human score of .83 to .84 (Rudner et al., 2006). Using linguistic microfeatures taken from the computational tool Coh-Metrix, which helps power the W-Pal AWE system, McNamara et al. (2013) reported a correlation of r = .81 between their regression model and human scores for 240 TOEFL independent essays. Unlike the models reported for e-rater and IntelliMetric, McNamara et al. provided details on the linguistic microfeatures that informed their models: the number of words in the text, the average syllables per word, noun hypernymy scores, past participle verbs, and conditional connectives.
The true agreement between human raters and AES engines is typically reported in two ways: perfect agreement and perfect-adjacent agreement. Perfect agreement reports the number of identical scores between humans and an AES system, while perfect-adjacent agreement reports the number of scores that are within one point of each other. In an investigation of the e-rater system, Attali and Burstein (2006) reported perfect agreement ranging from 46% to 58%, based on the test and grade level of examinees. Attali (2008) reported that two human raters reached 56% exact and 97% adjacent agreement, while one human rater and e-rater achieved 57% exact and 98% adjacent agreement. In a large study of TOEFL essay scores (152,000 independent essays), Ramineni, Trapani, Williamson, Davey, and Bridgeman (2012) reported a 60% exact agreement and 98% adjacent agreement between two raters, while e-rater reached 59% exact and 99% adjacent agreement with one human rater. Similarly, Rudner et al. (2006) investigated the accuracy of the IntelliMetric system across two studies and reported perfect agreement from 42% to 65% and adjacent agreement from 92% to 100%.
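The two agreement measures described above can be sketched in a few lines of Python. The score lists below are invented for illustration and are not data from the studies discussed here.

```python
# Sketch of the two agreement measures; the score lists are
# hypothetical, not data from the studies discussed above.

def exact_agreement(human, machine):
    """Proportion of essays receiving identical scores."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def adjacent_agreement(human, machine):
    """Proportion of essays scored within one point of each other
    (this includes the exact matches)."""
    return sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)

human = [3, 4, 2, 5, 3, 4]
machine = [3, 3, 2, 3, 4, 4]
print(exact_agreement(human, machine))     # 0.5 (3 of 6 identical)
print(adjacent_agreement(human, machine))  # 5 of 6 within one point
```

Because adjacent agreement counts exact matches as well, it is always at least as high as exact agreement, which is why reported adjacent figures approach 100%.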
Our purpose in this study was to examine the potential for automatic indices reported by the computational tools Coh-Metrix and WAT to predict human scores of essay quality in a corpus of independent essays written for the TOEFL.
Our selected corpus of independent essay samples was collected from two administrations of the TOEFL-iBT. The essays were composed by two groups of 240 test-takers who were stratified by quartiles for each task (N = 480). The essays were written on two different prompts (one prompt per form). The essays, the final scores, and the demographic information of the test-takers were provided directly by the Educational Testing Service (ETS). The 480 test-takers included both English as a Second Language (ESL) and English as a Foreign Language (EFL) learners. They came from a variety of home countries and linguistic backgrounds.
The TOEFL independent writing rubric, which describes five levels of writing performance (scored 1 through 5), was used to score the independent essays (the 2008 version of the rubric). In the rubric, linguistic sophistication at the lexical and syntactic levels is emphasized in addition to the development and coherence of the arguments along with syntactic accuracy. An independent essay with a score of 5 is defined as a well-organized and developed response to the given topic that displays linguistic sophistication and contains only minor language mistakes. In contrast, an essay with a score of 1 has serious problems in organization, idea development, or language use.
Two expert raters trained by ETS scored each essay using the standardized holistic rubrics described above. The final holistic score of each essay was the average of the human rater scores if the two scores differed by fewer than two points. Otherwise, a third rater scored the essay, and the final score was the average of the two closest scores. While inter-rater reliability scores are not provided for the TOEFL-iBT scores in the public use dataset, Attali (2008) reported that weighted Kappas for similarly double scored TOEFL writing samples were .70.
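The adjudication rule described above can be sketched as follows. The rater scores are invented, and the handling of an exact tie between the two candidate pairs (the lower pair is averaged) is an assumption not specified in the text.

```python
# Minimal sketch of the final-score rule; rater scores are invented.
# Assumption: if the third rating is equidistant from the other two,
# the lower pair of scores is averaged.

def final_score(rater1, rater2, rater3=None):
    if abs(rater1 - rater2) < 2:
        # Scores differ by fewer than two points: average them.
        return (rater1 + rater2) / 2
    # Otherwise a third rater adjudicates and the two closest of the
    # three scores are averaged.
    s = sorted([rater1, rater2, rater3])
    low_pair, high_pair = (s[0], s[1]), (s[1], s[2])
    a, b = min((low_pair, high_pair), key=lambda p: p[1] - p[0])
    return (a + b) / 2

print(final_score(4, 5))     # 4.5
print(final_score(2, 5, 4))  # third rater sides with 5: (4 + 5) / 2 = 4.5
```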
The primary research instruments we used in this study were Coh-Metrix (e.g., Graesser, McNamara, Louwerse, & Cai, 2004; McNamara & Graesser, 2012; McNamara, Graesser, McCarthy, & Cai, 2014) and the Writing Assessment Tool (WAT; Crossley et al., 2013). Both Coh-Metrix and WAT represent the state of the art in computational tools and together report on hundreds of linguistic indices related to text structure, text difficulty, rhetorical patterns, and cohesion through the integration of pattern classifiers, lexicons, shallow semantic interpreters, part-of-speech taggers, syntactic parsers, and other components that have been developed in the field of computational linguistics (Jurafsky & Martin, 2008). WAT, unlike Coh-Metrix, was developed specifically to assess writing quality. As such, it includes a number of writing-specific indices related to global cohesion, contextual cohesion, n-gram accuracy, lexical sophistication, key word use, and rhetorical features. The majority of Coh-Metrix and WAT indices are normed for text length (except raw counts such as token counts, type counts, and type-token ratio). Unlike other scoring engines such as e-rater, Coh-Metrix and WAT do not calculate errors in grammar, usage, mechanics and style, which have been important predictors of essay quality in previous studies (Enright & Quinlan, 2010). In total we selected 189 indices from Coh-Metrix and WAT that all had theoretical links to writing quality. The various linguistic constructs measured along with their associated indices and theoretical links are discussed briefly in Appendix A. For more detailed descriptions, we refer readers to Graesser et al., 2004, McNamara & Graesser, 2012, and McNamara et al., 2014.
We first divided the TOEFL essay corpus into a training and a test set following a 67/33 split (Witten, Frank, & Hall, 2011). Thus, we had a training set of 320 essays and a test set of 160 essays. To control for prompt-based effects, we conducted a MANOVA that examined whether the selected linguistic variables demonstrated significant differences between the two prompts. All variables that showed prompt-based effects were removed. We then conducted Pearson correlations to assess relationships between the selected variables and the human scores using the training set only. Those variables that demonstrated significant correlations (p < .050) with the human scores were retained as predictors in a subsequent regression analysis. Prior to inclusion, all significant variables were checked for multicollinearity to ensure that the variables were not measuring similar constructs. Our cut-off for multicollinearity was r ≥ .70. If two or more indices were highly correlated with each other, we selected the index with the highest correlation to the human raters for inclusion in the regression and removed the other, redundant variable(s).
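The correlation and multicollinearity screening steps can be sketched as follows. The feature names and values are invented for demonstration, and the significance test on each correlation is omitted for brevity.

```python
# Sketch of the two screening steps: correlate each candidate index
# with the human scores, then drop one variable from any pair
# correlated at or above the cut-off, keeping whichever variable is
# more strongly tied to the human scores. Feature names and values
# are invented; the p-value check on each correlation is omitted.
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_collinear(features, scores, cutoff=0.70):
    """features maps index name -> per-essay values; returns the
    names retained after multicollinearity pruning."""
    # Rank features by |r| with the human scores, strongest first.
    ranked = sorted(features, key=lambda f: -abs(pearson_r(features[f], scores)))
    kept = []
    for name in ranked:
        if all(abs(pearson_r(features[name], features[k])) < cutoff for k in kept):
            kept.append(name)
    return kept

scores = [2, 3, 3, 4, 5]                   # human essay scores
features = {
    "n_types": [80, 120, 130, 170, 210],   # word types
    "n_words": [200, 250, 260, 380, 460],  # collinear with n_types
    "hedges": [3, 1, 4, 2, 5],             # weakly related to both
}
print(prune_collinear(features, scores))   # ['n_types', 'hedges']
```

Here `n_words` is dropped because it correlates with `n_types` above the cut-off while relating slightly less strongly to the scores, mirroring the Number of words / Number of types example reported later in the paper.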
The selected indices were next regressed against the holistic scores for the 320 essays in the training set, with the essay scores as the dependent variable and the Coh-Metrix and WAT indices as the predictor variables, using a stepwise method. The R2 for this model provided an estimate of the amount of variance in human scores explained by the indices. The model was then applied to the essays held back in the test set to assess how well it worked on an independent set of essays (i.e., how generalizable the model was to essays it was not trained on).
Exact and adjacent matches between the model and the human raters provided us with another means of assessing the reliability of essay scoring rubrics and automated scoring algorithms. The premise behind such an analysis is that a score that is only off by one point (i.e., adjacent accuracy) is more acceptable than a score that is off by two or more points (Attali & Burstein, 2006; Dikli, 2006; Rudner, Garcia, & Welch, 2006; Shermis, Burstein, Higgins, & Zechner, 2010).
To control for prompt-based writing effects, which can affect linguistic production during writing (Crossley, Weston, Sullivan, & McNamara, 2011; Hinkel, 2002; 2003), a MANOVA was conducted using the selected Coh-Metrix and WAT indices as the dependent variables and the two TOEFL prompts as the independent variable. Of the 189 selected indices, only 59 did not demonstrate prompt-based effects (defined as p > .05 in the MANOVA). These 59 indices were thus candidates for inclusion in our models of essay writing quality.
Correlations with Human Ratings
Pearson correlations were then conducted between the human scores of essay quality in the TOEFL dataset and the 59 Coh-Metrix and WAT variables that did not demonstrate prompt-based effects. Of these 59 variables, 43 demonstrated significant correlations with the human scores.
We next checked for multicollinearity (defined as r > .700) between the 43 variables to ensure they were not measuring similar or overlapping microfeatures (i.e., we selected one independent feature for each linguistic construct). Eighteen of the 43 variables yielded strong correlations with at least one other variable. From each strongly correlated pair, we removed the variable that demonstrated the weaker correlation with the human scores of writing quality. For instance, Number of words correlated strongly with Number of types (r = .836), but, because Number of types exhibited a stronger correlation with ratings of essay quality than Number of words, Number of types was kept for the analysis and Number of words was removed. After controlling for multicollinearity, we were left with 34 variables for our regression analysis. These 34 variables are presented in Table 1, ordered by the strength of their correlation with the human judgments of essay quality.
Table 1: Correlations between selected indices and human ratings of essay quality
| Index | r | p |
| --- | --- | --- |
| Number of types | 0.680 | < .001 |
| Frequency of spoken bi-grams | -0.546 | < .001 |
| Incidence of 'and' | 0.392 | < .001 |
| Word familiarity content words | -0.367 | < .001 |
| CELEX written frequency for content words | -0.366 | < .001 |
| Incidence of agentless passives | 0.358 | < .001 |
| Incidence of perfect verb forms | 0.331 | < .001 |
| Average of word hypernymy | 0.329 | < .001 |
| Word meaningfulness all words | -0.304 | < .001 |
| Incidence of downtoners | 0.267 | < .001 |
| LSA body to conclusion | 0.254 | < .001 |
| Incidence of conjuncts | 0.254 | < .001 |
| Incidence of noun phrases | -0.243 | < .001 |
| Number of motion verbs per verb phrase | 0.224 | < .001 |
| Word concreteness component score | -0.216 | < .001 |
| Subordinating conjunctions | 0.198 | < .001 |
| Relative clause pronoun deletion in present participles | 0.195 | < .001 |
| Incidence of determiners | 0.182 | < .001 |
| Incidence of possibility modals | -0.164 | < .001 |
| Incidence of split infinitives | 0.158 | < .001 |
| Word imageability every word | -0.157 | < .001 |
| Mean of location and motion ratio scores | -0.157 | < .001 |
| Total number of paragraphs in essay | 0.154 | < .001 |
| Minimal edit distance (all stems mean) | 0.150 | < .001 |
| Incidence of hedges | 0.150 | < .001 |
| Incidence of emphatics | 0.148 | < .001 |
| Stem overlap | -0.147 | < .001 |
| Incidence of amplifiers | 0.134 | < .010 |
| Incidence of positive causal connectives | -0.122 | < .010 |
| Incidence of split auxiliaries | 0.106 | < .050 |
| Incidence of adjectival phrases | -0.105 | < .050 |
| Incidence of the verb 'seem' | 0.099 | < .050 |
| Incidence of body paragraph n-grams | 0.098 | < .050 |
| Proportion of key words | 0.091 | < .050 |
Training set. A stepwise regression analysis using the 34 indices as the independent variables to predict the human scores yielded a significant model, F(6, 313) = 55.176, p < .001, r = .716, R2 = .512, for the training set. Six Coh-Metrix and WAT indices were included as significant predictors of the essay scores. The six indices were: Number of types, Word imageability every word, Proportion of key words, Incidence of 'and', LSA body to conclusion, and Incidence of perfect verb forms.
The model demonstrated that the six indices together explained 51% of the variance in the evaluation of the 320 independent essays in the training set (see Table 2 for additional information). t-test information for the six indices together with the amount of variance explained are presented in Table 3.
Table 2: Stepwise regression analysis for indices predicting the independent essay scores: Training set
| Entry | Variable added | R | R2 | B | β | S.E. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Number of types | 0.654 | 0.428 | 0.017 | 0.608 | 0.001 |
| 2 | Word imageability every word | 0.676 | 0.456 | -0.017 | -0.162 | 0.004 |
| 3 | Proportion of key words | 0.696 | 0.485 | 4.339 | 0.146 | 1.303 |
| 4 | Incidence of 'and' | 0.703 | 0.495 | 0.036 | 0.116 | 0.013 |
| 5 | LSA body to conclusion | 0.710 | 0.504 | 0.173 | 0.114 | 0.064 |
| 6 | Incidence of perfect verb forms | 0.716 | 0.512 | 0.010 | 0.099 | 0.004 |
Note: B = unstandardized coefficient; β = standardized coefficient; S.E. = standard error. The estimated constant term is 5.414.
Table 3: t-value, p-values, and variance explained for the six indices in the regression analysis: Training set
| Index | t | p | Variance explained |
| --- | --- | --- | --- |
| Number of types | 12.325 | < .001 | 0.428 |
| Word imageability every word | -4.026 | < .001 | 0.028 |
| Proportion of key words | 3.330 | < .001 | 0.028 |
| Incidence of 'and' | 2.693 | < .010 | 0.010 |
| LSA body to conclusion | 2.712 | < .010 | 0.009 |
| Incidence of perfect verb forms | 2.342 | < .050 | 0.008 |
Test set. We used the model reported for the training set to predict the human scores in the test set. To determine the predictive power of the six variables retained in the regression model, we computed an estimated score for each essay in the independent test set using the B weights and the constant from the training set regression analysis. A Pearson’s correlation was then conducted between the estimated score and the actual score assigned to each essay in the test set. This correlation and its R2 were used to determine the predictive accuracy of the training set regression model on the independent data set.
The regression model, when applied to the test set, reported r = .773, R2 = .598. The results from the test set model demonstrated that the combination of the six predictors accounted for 60% of the variance in assigned scores of the 160 essays in the test set, providing increased confidence for the generalizability of our model.
Exact and Adjacent Matches
We used the scores derived from the regression model to assess the exact and adjacent accuracy of the regression scores when compared to the human-assigned scores. For this analysis, we rounded the essay scores to the nearest integer, with halves rounded up (i.e., a score of 4.5 became a 5). Our baseline comparison for this model was against a default score of 3 for each essay. A default score of 3 would provide an exact accuracy of 37% and an adjacent accuracy of 78%. The regression model produced exact matches between the predicted essay scores and the human scores for 263 of the 480 essays (55% exact accuracy). The model produced exact or adjacent matches for 460 of the 480 essays (96% exact/adjacent accuracy). The measure of agreement between the actual score and the predicted score produced a weighted Cohen’s Kappa for the adjacent matches of .463, demonstrating a moderate agreement. A confusion matrix for the results is presented in Table 4.
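A minimal sketch of the exact/adjacent match computation described above, assuming halves are rounded up as in the paper (note that Python's built-in round() uses banker's rounding, so the sketch rounds explicitly). The score lists are invented for illustration.

```python
# Invented predicted (continuous) and human (integer) scores.
predicted_raw = [3.4, 4.6, 2.1, 3.9, 5.2]
human = [3, 4, 3, 4, 4]

# Round halves up (e.g., 4.5 -> 5), matching the procedure in the text.
predicted = [int(p + 0.5) for p in predicted_raw]

# Exact match: identical integer scores; adjacent: within one point.
exact = sum(p == h for p, h in zip(predicted, human))
adjacent = sum(abs(p - h) <= 1 for p, h in zip(predicted, human))

print(f"exact: {exact}/{len(human)}")
print(f"exact or adjacent: {adjacent}/{len(human)}")
```

On the full 480-essay set, the same computation yields the 55% exact and 96% exact/adjacent figures reported above.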
Table 4: Confusion matrix for the total set of essays showing actual and predicted essay scores
|Actual Essay Score||Predicted Essay Scores|
Our model differed from the human scores by more than one point in 4% of the data (i.e., the predicted scores were beyond adjacent scores; n = 20). These 20 essays provide an opportunity to analyze model components that may influence reliability and text components that may influence human ratings that are not captured by our automated indices. The purpose of the following analysis is to explore the extent to which the misscored essays can provide valuable information concerning model reliability and human scoring. Our corpus consisted of those essays in which the human raters assigned a score of 5 and the model assigned a score of 3 (n = 14) and in which the human raters assigned a score of 4 and the model assigned a score of 2 (n = 1). We label these essays high/low. We also examined essays in which the human raters assigned a score of 3 and the model assigned a score of 5 (n = 2) and in which the human raters assigned a score of 2 and the model assigned a score of 4 (n = 3). We label these essays low/high.
Our post-hoc analysis was both qualitative and quantitative in nature. The 20 essays were scored by expert raters on linguistic features found in the TOEFL scoring rubric. These analytic scores were used to statistically assess the elements of the text that human raters found important and to examine differences between the high/low and low/high essays. The 20 essays were also assessed using the computational indices taken from the regression analysis. These linguistic features were used to examine differences between the high/low and low/high essays.
Analytic TOEFL scoring rubric. A coding scheme was developed to assess the major linguistic features found in the TOEFL scoring rubric (see Appendix B). These features included topic, task, development, coherence, and language use (i.e., syntactic variety, syntactic structure/accuracy, word choice, idiomaticity, vocabulary range, and spelling).
Human judgments. Two expert raters, each with over five years of experience teaching English to non-native speakers both abroad and in the United States, were trained on the rubric using a training set of 25 TOEFL independent essays. The raters reached agreement on each analytic item after the first training session (r > .70). The raters then independently scored each of the essays in the post-hoc analysis corpus. If the raters disagreed by two or more points, they adjudicated the scores. After adjudication, all items except syntactic variety showed acceptable inter-rater reliability (see Table 5 for results).
Table 5: Inter-rater reliability statistics for the analytic features in the essay scoring rubric
|Feature||Cronbach's alpha||Pearson's r||Weighted Kappa|
|Syntactic structure||0.921||0.864||0.845|
|Word choice||0.853||0.772||0.680|
Correlations between analytic features and human scores. To assess links between the analytic scores and the holistic scores, we conducted correlations between the mean scores for the two raters on each analytic feature and the holistic scores assigned to the essays (see Table 6). Seven of the 10 analytic features showed significant correlations with the holistic scores. The strongest correlations were reported for spelling, appropriate word choice, syntactic structure/accuracy, and idiomaticity. Although Coh-Metrix and WAT measure lexical and syntactic structures, they do not measure the accuracy of the structures produced.
Table 6: Correlations between analytic features and holistic scores
|Analytic feature||r||p|
|Syntactic variety||0.620||< .010|
|Syntactic structure/accuracy||0.787||< .001|
|Word choice||0.819||< .001|
|Vocabulary range||0.595||< .010|
MANOVA high/low and low/high essays (analytic features). A MANOVA was conducted using the analytic indices as the dependent variables and the high/low and low/high categorizations as the independent variables. Seven of the ten analytic features demonstrated significant differences between the categorizations, and eight of the ten showed a medium or larger effect size (Cohen, 1988; see Table 7). The seven features that demonstrated significant differences were the same features that showed significant correlations with the holistic scores. The results indicate that essays scored high/low were rated as having greater coherence, greater syntactic variety, greater syntactic structure/accuracy, better word choices, better idiomatic language, greater vocabulary range, and better spelling than low/high essays. As in the correlation analysis, the strongest effect sizes were reported for spelling, word choice, and syntactic structure.
Table 7: MANOVA results for predicting high/low and low/high classifications using analytic essay features
|Analytic features||High/Low||Low/High||F||p||ηp2||Cohen’s d|
|Topic||5.433 (0.904)||5.100 (1.084)||0.465||> .050||0.025||0.334|
|Task||4.667 (1.319)||4.800 (1.441)||0.037||> .050||0.002||-0.096|
|Development||3.900 (0.870)||3.100 (0.548)||3.661||> .050||0.169||1.100|
|Coherence||4.133 (0.694)||3.000 (1.323)||6.313||< .050||0.260||1.073|
|Syntactic variety||4.400 (0.507)||3.600 (0.652)||8.151||< .050||0.312||1.370|
|Syntactic structure/accuracy||4.500 (0.627)||2.600 (0.652)||33.844||< .001||0.653||2.971|
|Word choice||4.467 (0.550)||2.800 (0.274)||41.36||< .001||0.697||3.837|
|Idiomaticity||4.200 (0.592)||2.800 (0.671)||19.746||< .001||0.523||2.213|
|Vocabulary range||4.467 (0.667)||3.300 (0.837)||10.171||< .010||0.361||1.542|
|Spelling||4.933 (0.594)||2.200 (1.323)||96.363||< .001||0.843||2.724|
MANOVA high/low and low/high essays (computational indices). A MANOVA was conducted using the computational indices from the regression as the dependent variables and the high/low and low/high categorizations as the independent variables. Two of the six indices demonstrated significant differences between the categorizations, and four of the six demonstrated medium or larger effect sizes (Cohen, 1988; see Table 8). The two features that demonstrated significant differences were the number of word types and the incidence of perfect verb forms. The results indicate that essays scored high/low contained fewer word types and fewer perfect verb forms than essays scored low/high.
Table 8: MANOVA results for predicting high/low and low/high classifications using computational indices
|Computational indices||High/Low||Low/High||F||p||ηp2||Cohen’s d|
|Number of types||131.800 (14.447)||167.400 (32.067)||11.477||< .010||0.389||-1.431|
|Word imageability every word||317.721 (6.239)||318.801 (10.964)||0.077||> .050||0.004||-0.121|
|Proportion of key words||0.104 (0.032)||0.102 (0.017)||0.028||> .050||0.002||0.078|
|Incidence of 'and'||3.600 (2.586)||5.400 (2.191)||1.939||> .050||0.097||-0.751|
|LSA body to conclusion||0.952 (0.770)||1.484 (1.242)||1.320||> .050||0.068||-0.514|
|Incidence of perfect verb forms||8.841 (6.847)||18.552 (7.730)||8.141||< .050||0.311||-1.329|
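The Cohen's d values in Tables 7 and 8 can be recovered from the group means and standard deviations alone. The sketch below pools the two group SDs as their root mean square; this formulation reproduces the reported d values for the Table 8 rows checked (number of types and incidence of perfect verb forms), though the paper does not state which pooling it used.

```python
from math import sqrt

def cohens_d(m1, s1, m2, s2):
    """Cohen's d from group summary statistics.

    Pools the two group SDs as their root mean square (equal-weight
    pooling), which matches the d values reported in Table 8.
    """
    pooled = sqrt((s1 ** 2 + s2 ** 2) / 2)
    return (m1 - m2) / pooled

# Number of types: high/low group vs. low/high group (Table 8).
d_types = cohens_d(131.800, 14.447, 167.400, 32.067)

# Incidence of perfect verb forms (Table 8).
d_perfect = cohens_d(8.841, 6.847, 18.552, 7.730)

print(f"types: d = {d_types:.3f}")    # reported as -1.431
print(f"perfect: d = {d_perfect:.3f}") # reported as -1.329
```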
Automated models of human essay scoring can provide strong evidence for how microfeatures of language found in the text can predict essay quality. The current study demonstrates that a regression model using six linguistic microfeatures related to breadth of lexical production, lexical sophistication, key words use, local and global cohesion, and tense can explain 60% of the variance in the human scores for the TOEFL essays in our test set. The same microfeatures can be used to predict human essay scores 55% of the time and provide adjacent matches 96% of the time. These six microfeatures, their calculations, and their weights in a regression analysis provide a straightforward, comprehensive, rigorous, and elaborative model of TOEFL independent writing quality that has applications in classroom teaching and assessment, teacher training, standardized testing situations, and industrial development.
A post-hoc analysis of our findings demonstrated that the essays scored high by the regression model and low by human raters contained a greater number of word types and perfect tense forms compared to essays scored high by human raters and low by the regression analysis. On the other hand, the analytic feature analysis demonstrated that essays scored high by humans but low by the regression model had greater coherence, syntactic variety, syntactic structure/accuracy, word choice, idiomaticity, vocabulary range, and spelling accuracy as compared to essays scored high by the model but low by human raters. These findings highlight potential problems with automated models of writing quality, especially those based on Coh-Metrix and WAT, and provide examples to better understand issues of model reliability (Attali & Burstein, 2006; Deane et al., 2013; Perelman, 2014; Shermis, in press) and construct validity (Condon, 2013; Crusan, 2010; Deane et al., 2013; Elliot et al., 2013; Haswell, 2006; Perelman, 2012). Conversely, the findings also afford us the opportunity to consider where scoring mismatches in human and automated approaches occur and, thus, provide convenient examples to guide the development of natural language processing tools.
The regression model demonstrates that the strongest predictor of essay quality is the number of word types used by an L2 writer, explaining about 43% of the variance in the model (see Table 2). This index is informative for a number of reasons. First, the index relates to a writer’s breadth of vocabulary knowledge, with more word types indicating a greater vocabulary. Second, the number of word types also strongly correlates with the number of words in a text (r = .836) indicating that test-takers who produce more types (and thus more words) will receive a higher essay score. The next strongest predictor is word imageability scores, which explained about 3% of the variance in the regression model. The regression model indicates that test-takers who produce less imageable words will receive a higher score than those who produce more imageable words. Thus, lexical sophistication is an important predictor of L2 writing quality. The third strongest predictor is the proportion of key words in the essay, which explains about 3% of the variance in the human scores. The regression model indicates that test-takers who use more key words specific to the prompt (i.e., words commonly used by other test-takers for the same prompt) will receive a higher score. Those who use fewer key words are presumably less on topic and will receive a lower score. The next two indices in the regression analysis are related to cohesion. The first index, the incidence of ‘and,’ is related to local cohesion and explains 1% of the variance in the human scores. The index demonstrates that test-takers who use a greater number of ‘and’s are rated as high proficiency writers, presumably because they make greater connections between words. The second index of cohesion, LSA body to conclusion score, is related to global cohesion. 
This index explains 1% of the variance and indicates that writers who have greater semantic overlap between their body paragraphs and their conclusion paragraph will receive higher scores. The final index is the incidence of perfect forms. This index explained 1% of the variance and demonstrates that test-takers who produce more complex verb forms will be rated as more proficient writers.
Of secondary interest are the indices that demonstrated significant correlations with human ratings (see Table 1), but were not included in the regression analysis. These correlations generally support the findings from the regression model in that better rated essays correlated with indices of lexical sophistication (e.g., contained less frequent n-grams, less familiar words, less frequent words, less meaningful words, and less concrete words). The correlations are less clear in terms of text cohesion. Some cohesion indices show positive correlations with essay quality (e.g., conjuncts and subordinating conjunctions), while others show negative correlations (e.g., stem overlap and positive causal connectives) or indicate lower cohesion through a positive correlation (minimal edit distance). The correlations seem to indicate that conjuncts and connectives are important indicators of essay quality, but overlap, causality, and minimal edit distance are not. The correlations also seem to demonstrate that essays that are more verbal (i.e., contain more perfect forms and motion verbs) are rated higher than essays that are more nominal (i.e., contain a greater incidence of noun phrases). Two other trends are evident in the correlation analysis. The first is that more syntactically complex essays are rated higher, as evidenced by the positive correlations between essay score and indices such as incidence of agentless passives, incidence of relative clause deletion, incidence of split infinitives, and split auxiliaries. The second trend is that essays with more rhetorical features, such as downtoners, hedges, emphatics, amplifiers, and body paragraph n-grams are scored higher by human raters. These correlations, along with the indices included in the regression model, provide strong indications of the types of linguistic microfeatures that predict human ratings of essay quality.
Importantly, many of these indices relate directly to textual elements that are of concern for automated models of essay quality. These include elements such as text cohesion and coherence, text relevance, and rhetorical purposes. While indices related to text length, lexical sophistication, and syntactic complexity may not overlap with writing concerns voiced by both the writing studies and educational measurement communities (e.g., domain knowledge, cultural and background knowledge, and variation in rhetorical purposes; Condon, 2013; Deane, 2013; Haswell, 2006; Haswell & Ericsson, 2006; Herrington & Moran, 2001; Huot, 1996; Perelman, 2012), they do indicate an ability to quickly and easily produce complex text, which should free up cognitive resources that can be used to address rhetorical and conceptual concerns in the text, both of which are needed for writing mastery (Deane, 2013). They thus can be used to provide empirical evidence to help represent a more robust construct representation of writing quality (Elliot & Klobucar, 2013; Kane, 2013). However, this suggestion warrants empirical investigation.
Our post-hoc analysis provides an overview of the weaknesses of the tested regression model. While the adjacent accuracy reported for our regression model is on par with previous analyses of TOEFL independent essays (Attali, 2008; Rudner et al., 2006), 4% of the essays were scored outside the adjacent range with the human ratings. The majority of these essays (n = 15) were scored lower by the model than by human raters. The main reason for this scoring discrepancy appears to be the model’s reliance on assigning a stronger weight to the number of word types in the essay. While essay length can be a strong indicator of an essay’s organization and development (Attali & Powers, 2008), this may not be the case for all writers (Crossley, Roscoe, & McNamara, 2014). The largest effect size between the high/low and low/high essays in the post-hoc analysis was for number of word types (see Table 8), with essays scored high by humans and low by the model averaging 132 word types and essays scored high by the model and low by the humans averaging 167 word types. This finding indicates that vocabulary breadth and essay length are not always synonymous with essay quality for human raters and that the described model assigns too much importance to these microfeatures. Similar findings, based on effect sizes, are reported for the incidence of perfect verb forms, incidence of ‘and,’ and LSA body to conclusion scores. However, since these indices only explained about 1% of the variance each in the regression model, their weight is less predictive than that reported for word type count.
The analyses of the human ratings for analytic features provide some indications about which linguistic elements are associated with essay quality when fewer word types are used and the text is of shorter length (see Table 7). Foremost, it appears that human raters rely on spelling accuracy to assign scores in such cases. The problem is compounded by the notion that, according to the Coh-Metrix calculations, the number of word types may increase with each misspelled word. Thus, if a test-taker spells ‘the’ as both ‘the’ and ‘hte,’ that essay will be judged to have a greater number of types by the model and thus be given a higher score, because the type count does not consider whether words are spelled correctly. Conversely, the essay will be scored lower by human raters because of the increased number of spelling errors.
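The type-counting issue described above can be illustrated with a naive type count, in which any distinct string, including a misspelling, registers as a new type. The whitespace tokenizer below is a deliberate simplification of whatever Coh-Metrix actually computes; the sentences are invented.

```python
def count_types(text):
    """Count distinct word forms, treating any distinct string as a type."""
    return len(set(text.lower().split()))

# Identical sentences except for two misspellings ('hte', 'teh').
correct = "the cat sat on the mat and the dog sat too"
misspelled = "the cat sat on hte mat and teh dog sat too"

# The misspelled version yields MORE types, since 'hte' and 'teh'
# no longer collapse into the existing type 'the'.
print(count_types(correct))     # fewer types
print(count_types(misspelled))  # more types, despite worse spelling
```

A human rater would penalize the second text for its spelling, while a raw type count rewards it, which is exactly the mismatch observed in the high/low versus low/high essays.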
The next two most important analytic features for human raters in essays scored high/low and low/high are word choice and syntactic structure/accuracy. The essays scored high by humans and low by the regression model had better word choices as compared to those essays scored low by humans and high by the model. Thus, it is not just that words are spelled correctly, but that words are used appropriately, something that neither the Coh-Metrix nor WAT indices assess (or for that matter, any AES system of which we are aware). This is a major limitation of current AES systems. In addition, those essays scored high by humans and low by the model had fewer syntactic errors as compared to those essays scored low by humans and high by the model. Again, while Coh-Metrix calculates indices that can assess syntactic complexity, it does not report indices for syntactic accuracy (although some AES systems do assess grammatical errors).
Implications for AES systems
Considering the correlations reported in Table 6 between the analytic features and the holistic scores for the essays, the findings from this paper indicate that, at a minimum, a more successful AES system should include computational indices of spelling and syntactic accuracy (at least when L2 writing is being assessed). Automated assessments of mechanical and spelling errors are linguistic microfeatures that computers are relatively accurate at capturing, so such an implementation should not be difficult (see e-rater as an example). Other linguistic elements such as accurate word choice and idiomaticity are more difficult to implement and point to areas in which AES systems need improvement, while features such as coherence and syntactic variety are already measured by Coh-Metrix and WAT. It should be noted that the limitations discussed here are linguistic in nature and do not address larger conceptual concerns expressed by many writing researchers regarding the inability of AES systems to assess the effectiveness of written arguments, stated purposes, rhetorical moves, and audience awareness.
The findings also provide evidence that researchers should use caution when examining writing quality in a corpus of essays written on a number of different prompts. Prompt-based effects occur when linguistic features found in a writing prompt influence the writing patterns found in essays written on that prompt (Brown, Hilgers, & Marsella, 1991; Huot, 1990). Past studies have demonstrated prompt-based differences in text cohesion (Crossley, Varner, & McNamara, 2013), syntactic complexity (Crowhurst & Piche, 1979; Hinkel, 2002; Tedick, 1990), and lexical sophistication (Crossley, Weston, et al., 2011; Hinkel, 2002). The present analysis controlled for prompt-based differences, but found that of 189 potential indices, only 59 did not show prompt-based differences (i.e., 69% of the indices had to be removed from the analysis because they were influenced by the prompt). Thus, prompt should be a major concern for researchers interested in AES.
Microfeatures or Component Scores?
A final issue relates to the specificity of the linguistic indices that were used to predict essay scores and their subsequent representation of the constructs within the essays. The microfeatures calculated by Coh-Metrix and WAT are, by their nature, extremely fine-grained indices, which are intended to represent certain characteristics of learners’ essays. In many ways, this is a strength of such microfeatures because researchers can use the linguistic indices to investigate specific questions about language use in various forms of texts. However, the use of these fine-grained microfeatures as predictors of essay quality could potentially lead to less stable models, which do not generalize to different prompts and tasks. It is yet to be seen whether microfeatures, aggregated features (like those used by e-rater), or a combination of both are most informative and predictive. In future studies, we plan to address this issue by developing component scores based on these linguistic microfeatures. The development of such component scores may improve the stability of AES algorithms and provide more representative features, as well as provide more informative means for formative writing feedback to students.
Overall, this study provides important information about how linguistic microfeatures can predict L2 essay quality and about the strengths and weaknesses of automatic essay scoring models. Unlike previous research, this study provides specific information on the linguistic microfeatures that correlate with L2 essay quality and a regression model that can be used to automatically assign scores to the TOEFL essays using these linguistic microfeatures. While the results are strong, future studies should consider similar approaches using a larger corpus of data (this study was limited to the 480 essays in the TOEFL iBT public use dataset). A larger corpus is especially needed in analyses that examine model miscalculations, because, in strong models, miscalculations are infrequent, leading to potentially small sample sizes (e.g., in our study only 20 essays with mismatched scores existed).
Overall, the results of this study advance our knowledge of how linguistic features in an L2 essay predict human judgments of quality. Follow-up analyses discuss some of the weaknesses of AES systems and provide suggestions for AES system development including the incorporation of lexical and syntactic accuracy indices. These improvements to AES systems should provide greater overlap between human and automated ratings of essay quality. Automating essay scoring should free teachers from many elements of essay grading that are time consuming and cost prohibitive, allowing them to focus more on other aspects of essay quality that AES systems are poor at assessing, such as argumentation, style, and idea development.
This research was supported in part by the Institute for Education Sciences (IES R305A080589 and IES R305G20018-02). Ideas expressed in this material are those of the authors and do not necessarily reflect the views of the IES. TOEFL(R) test materials are reprinted by permission of Educational Testing Service, the copyright owner.
Scott Crossley is an Associate Professor at Georgia State University. His interests include computational linguistics, corpus linguistics, and second language acquisition. He has published articles on second language lexical acquisition, second language writing, second language reading, discourse processing, language assessment, intelligent tutoring systems, and text linguistics.
Arnaud, P. J. (1992). Objective lexical and grammatical characteristics of L2 written compositions and the validity of separate-component tests. In P. J. Arnaud & H. Bejoint (Eds.), Vocabulary and applied linguistics (pp. 133-145). London, England: Macmillan.
Attali, Y. (2004, April). Exploring the feedback and revision features of Criterion. Paper presented at the National Council on Measurement in Education (NCME), San Diego, CA.
Attali, Y. (2007). Construct validity of e-rater in scoring TOEFL essays. Princeton, NJ: ETS.
Attali, Y. (2008). E-rater performance for TOEFL iBT independent essays. Unpublished manuscript.
Attali, Y., & Burstein, J. (2005). Automated essay scoring with e-rater(R) v.2.0. Princeton, NJ: ETS.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater(R) v.2.0. The Journal of Technology, Learning and Assessment, 4(3), (np).
Attali, Y., & Powers, D. (2008). A developmental writing scale (ETS Research Report RR-08-19). Princeton, NJ: Educational Testing Service.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Bereiter, C. (2003). Foreword. In Mark D. Shermis, & Jill C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach (pp. vii-ix). Mahwah, NJ: Lawrence Erlbaum Associates.
Biesenbach-Lucas, S., & Weasenforth, D. (2001). E-mail and word-processing in the ESL classroom: How the medium affects the message. Language Learning and Technology, 5, 35-165.
Bell, J., & Burnaby, B. (1984). A handbook for ESL literacy. Toronto, Canada: Ontario Institute for Studies in Education/Hodder and Stoughton.
Biber, D. (1988). Variation across speech and writing. Cambridge, England: Cambridge University Press.
Bialystok, E. (1978). A theoretical model of second language learning. Language Learning, 28, 69-83.
Bonzo, J. D. (2008). To assign a topic or not: Observing fluency and complexity in intermediate foreign language writing. Foreign Language Annals, 41(4), 722-735.
Brill, E., & Mooney, R. J. (1997). An overview of empirical natural language processing. AI Magazine, 18, 4-13.
Brown, J. D., Hilgers, T., & Marsella, J. (1991). Essay prompts and topics: Minimizing the effect of mean differences. Written Communications, 8, 533-556.
Brown, G., & Yule, B. (1983). Discourse analysis. Cambridge, England: Cambridge University Press.
Burstein, J. (2003). The e-rater scoring engine: Automated Essay Scoring with natural language processing. In M. D. Shermis and J. C. Burstein (Eds.), Automated Essay Scoring: A cross-disciplinary approach (pp. 113-121). Mahwah, NJ: Lawrence Erlbaum Associates.
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online service. AI Magazine, 25, 27-36.
Burstein, J., Marcu, D., & Knight, K. (2003). Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems: Special Issue on Natural Language Processing, 18(1): 32-39.
Camp, R. (1993). Changing the model for the direct writing assessment. In M. M. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 45-78). Cresskill, NJ: Hampton Press, Inc.
Carlman, N. (1986). Topic differences on writing tests: How much do they matter? English Quarterly, 19, 39-49.
Carlson, S., Bridgeman, B., Camp, R., & Waanders, J. (1985). Relationship of admission test scores to writing performance of native and non-native speakers of English. (TOEFL Research Rep. No. 19). Princeton, NJ: ETS.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Conference on College Composition and Communication. (2004, February 25). CCCC position statement on teaching, learning, and assessing writing in digital environments. Retrieved from http://www.ncte.org/cccc/resources/positions/digitalenvironments
Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing, 18, 100-108.
Connor, U. (1984). A study of cohesion and coherence in English as a second language students' writing. Papers in Linguistics, 17(3), 301-316.
Connor, U. (1996). Contrastive rhetoric: Cross-cultural aspects of second-language writing. Cambridge, England: Cambridge University Press.
Costerman, J., & Fayol, M. (1997). Processing interclausal relationships: Studies in production and comprehension of text. Hillsdale, NJ: Lawrence Erlbaum Associates.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34, 213-238.
Crismore, A., Markkanen, R., & Steffensen, M. (1993). Metadiscourse in persuasive writing: A study of texts written by American and Finnish university students. Written Communication, 10, 39-71.
Crossley, S. A. (2013). Advancing research in second language writing through computational tools and machine learning techniques: A research agenda. Language Teaching, 46(2), 256-271.
Crossley, S. A., Cai, Z., & McNamara, D. S. (2012). Syntagmatic, paradigmatic, and automatic n-gram approaches to assessing essay quality. In P. M. McCarthy & G. M. Youngblood (Eds.), Proceedings of the 25th International Florida Artificial Intelligence Research Society (FLAIRS) Conference (pp. 214-219). Menlo Park, CA: The AAAI Press.
Crossley, S. A., & McNamara, D. S. (2009). Computationally assessing lexical differences in L1 and L2 writing. Journal of Second Language Writing, 17(2), 119-135.
Crossley, S. A., & McNamara, D. S. (2011). Understanding expert ratings of essay quality: Coh-Metrix analyses of first and second language writing. IJCEELL, 21, 170-191.
Crossley, S. A., & McNamara, D. S. (2012). Predicting second language writing proficiency: The role of cohesion, readability, and lexical difficulty. Journal of Research in Reading, 35, 115-135.
Crossley, S. A., Roscoe, R. D., & McNamara, D. S. (2011). Predicting human scores of essay quality using computational indices of linguistic and textual features. In G. Biswas, S. Bull, J. Kay, & A. Mitrovic (Eds.), Proceedings of the 15th International Conference on Artificial Intelligence in Education (pp. 438-440). Auckland, New Zealand: AIED.
Crossley, S. A., Roscoe, R., & McNamara, D. S. (2013). Using automatic scoring models to detect changes in student writing in an intelligent tutoring system. In P. M. McCarthy & G. M. Youngblood (Eds.). Proceedings of the 26th International Florida Artificial Intelligence Research Society (FLAIRS) Conference (pp. 208-213). Menlo Park, CA: The AAAI Press.
Crossley, S. A., Roscoe, R., & McNamara, D. S. (2014). What is successful writing? An investigation into the multiple ways writers can write successful essays. Written Communication, 31(2), 184-215.
Crossley, S. A, Salsbury, T., & McNamara, D. S. (2009). Measuring second language lexical growth using hypernymic relationships. Language Learning. 59
BrodkeyD. & YoungR. (1981). Composition correctness scores. TESOL Quarterly, 15, 159–67.
BrownG. (1989). Making sense: the interaction of linguistic expression and contextual information. Applied Linguistics, 10, 1, 97–108.
BrownH. (ed.) (1976). Papers in Second Language Acquisition, Language Learning Special Issue No. 4.
BruceB., RubinA. & StarrK. (1981). Why readability formulas fail. Reading Education Report No. 28. Urbana, ILL: University of Illinois Center for the Study of Reading, ED 205 915.
BurtM. K. (1975). Error analysis in the adult EFL classroom. TESOL Quarterly, 9, 53–63.
BurtM. K. & DulayH. (eds.) New directions in second language learning, teaching, and bilingual education. On TESOL '75. Washington, D.C.: TESOL.
ButeauM. F. (1970). Students' errors and the learning of French as a second language: a pilot study. International Review of Applied Linguistics, 8, 133–45.
CharollesM. (1978). Introduction aux problèmes de la cohérence des textes. Langue Française, 38, 7–41.
ChristensenF. (1965). A generative rhetoric of the paragraph. College Composition and Communication, 16, 144–56.
CombettesB. (1977). Ordre des éléments de la phrase et linguistique du texte. Pratiques, 13, 91–101.
CombettesB. (1978). Thématisation et progression thématique dans les récits d'enfants. Langue Française, 38, 74–89.
CombsW. E. (1976). Further effects of sentence-combining practice on writing ability. Research in the Teaching of English, 10, 137–49.
CombsW. E. (1977). Sentence-combining practice: Do gains in judgment of writing ‘quality’ persist?Journal of Educational Research, 70, 318–21.
ConnorU. (1987a). Research frontiers in writing analysis. TESOL Quarterly, 21, 4, 677–96.
ConnorU. (1987b). Argumentative patterns in student essays: cross-cultural differences. In U. Connor and R. B. Kaplan (eds.), 57–72.
ConnorU. & KaplanR. B. (eds.) (1987). Writing across languages: analysis of L2 text. Reading, MA: Addison Wesley.
ConnorU. & LauerJ. (1985). Understanding persuasive essay writing: linguistic/rhetorical approach. Text, 5, 4, 309–26.
CoombsV. (1986). Syntax and communicative strategies in intermediate German composition. Modern Language Journal, 70, 2, 114–24.
CooperC. R. (1977). Holistic evaluation of writing. In C. R. Cooper & L. Odell (eds.).
CooperC. R. (1983). Procedures for describing written texts. In P. Mosenthal, L. Tamor & S. Walmsley (eds.), 287–313.
CooperC. R. & OdellL. (1977). Evaluating writing: describing, measuring, judging. Urbana, ILL: National Council of Teachers of English.
CooperC. R. & GreenbaumS. (1986). Studying writing: linguistic approaches. Beverly Hills, CA: Sage Publications.
CooperT. C. (1976). Measuring written syntactic patterns of second-language learners of German. Journal of Educational Research, 69, 5, 176–83.
CooperT. C. (1977). A strategy for teaching writing. Modern Language Journal, 61, 251–56.
CooperT. C. (1981). Sentence combining: an experiment in teaching writing. Modern Language Journal, 65, 158–65.
CooperT. C. & MorainG. (1980). A study of sentence-combining techniques for developing written and oral fluency in French. French Review, 53, 411–23.
CooperT. C., MorainG. & KalivodaT. (1980). Sentence combining in second language instruction. Language in Education Series, No. 31. Washington, D.C.: Center for Applied Linguistics/ERIC Clearinghouse on Languages and Linguistics ED 195 167.
CorderS. P. (1967). The significance of learners' errors. International Review of Applied Linguistics, 5, 161–70.
CorderS. P. (1971). Idiosyncratic dialects and error analysis. International Review of Applied Linguistics, 9, 147–59.
CorderS. P. (1974). Error analysis. In J. P. B. Allen & S. P. Corder (eds.), 122–54.
CoutureB. (1985). A systematic analysis of writing quality. In J. D. Benson & W. S. Greaves (eds.), 67–87.
CoutureB. (ed.) (1986). Functional approaches to writing: research perspectives. London: Frances Pinter.
CrowhurstM. (1980). Syntactic complexity and teachers' quality ratings of narrations and arguments. Research in the Teaching of English, 14, 223–31.
CrowhurstM. (1987). Cohesion in argument and narration at 3 grade levels. Research in the Teaching of English, 21, 185–201.
CrowhurstM. & PichéG. L. (1979). Audience and mode of discourse effects on syntactic complexity in writing at two grade levels. Research in the Teaching of English, 13, 101–9.
CrowleyS. (1989). Linguistics and composition instruction: 1950–1980. Written Communication, 6, 4, 480–505.
CummingA. (1989). Writing expertise and second-language proficiency. Language Learning, 39, 1, 81–141.
DaikerD., KerekA. & MorenbergM. (1978). Sentence combining and syntactic maturity in freshman English. College Composition and Communication, 29, 36–41.
DaikerD., KerekA. & MorenbergM. (eds.) (1979). Sentence combining and the teaching of writing. Conway, Arkansas: University of Akron and University of Central Arkansas.
DaleE. & ChallJ. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 37–54.
D'AngeloF. (1974). A generative rhetoric of the essay. College Composition and Communication, 25, 388–96.
DavisonA., KantorR. N., HannahJ., HermonG., LutzR. & SalzilloR. (1980). Limitations of readability formulas in guiding adaptations of texts. Technical Report No. 162. Urbana, ILL: University of Illinois Center for the Study of Reading, ED 184 090.
DehghanpishehE. (1978). Language development in Farsi and English: implications for the second-language learner. International Review of Applied Linguistics, 16, 1, 45–61.
Van DijkT. A. (1980). Macrostructures. Hillsdale, N.J.: Erlbaum.
Van DijkT. A. (1983). Discourse analysis: its development and application to the structure of news. Journal of Communication, 33, 20–43.
DixonE. (1970). Indexes of syntactic maturity (Dixon-Hunt-Christensen). ERIC Clearinghouse on Reading and Communication Skills ED 091 748.
DommerguesJ.-Y. & LaneH. (1978). On two independent sources of error in learning the syntax of a second language. Language Learning, 26, 111–24.
DoushaqM. H. (1986). An investigation into stylistic errors of Arab students learning English for academic purposes. English for Specific Purposes, 5, 1, 27–39.
DurstR. K. (1987). Cognitive and linguistic demands of analytic writing. Research in the Teaching of English, 21, 4, 347–76.
EnkvistN. E. (1985a). Coherence, composition, and text linguistics. In N. E. Enkvist (ed.), 11–26.
EnkvistN. E. (1985b). Stylistics, text linguistics, and composition. Text, 5, 251–67.
EnkvistN. E. (ed.) (1985). Coherence and composition: a symposium. Åbo, Finland: Publications of the Research Institute of the Åbo Akademi Foundation.
EnkvistN. E. (1987). Text linguistics for the applier: an orientation. In U. Connor & R. B. Kaplan (eds.), 23–43.
EvensenL. S. (1985). Discourse-level interlanguage studies. In N. E. Enkvist (ed.).
FaganW. T. & HaydenH. M. (1988). Writing processes in French and English of fifth grade French immersion students. Canadian Modern Language Review, 44, 4, 653–58.
FaigleyL. (1979). Another look at sentences. Freshman English News, 7, 3, 18–21.
FaigleyL. (1980). Names in search of a concept: maturity, fluency, complexity and growth in written syntax. College Composition and Communication, 31, 3, 291–300.
FitzgeraldJ. & SpiegelD. L. (1986). Textual cohesion and coherence in children's writing. Research in the Teaching of English, 20, 3, 263–80.
FlahiveD. & SnowB. G. (1980). Measures of syntactic complexity in evaluating ESL compositions. In J. W. Oiler & K. Perkins (eds.).
FleschR. (1949). The art of readable writing. New York: Harper & Row.
FreedmanA. & PringleI. (1980). Writing in the college years: some indices of growth. College Composition and Communication, 31, 3, 311–24.
FreedmanA. & PringleI. (1984). Why students can't write arguments. English in Education, 18, 2, 72–84.
FreedmanA. & PringleI. (eds.) (1980). Reinventing the rhetorical tradition. Urbana, ILL: National Council of Teachers of English.
FreedmanA., PringleI. & YaldenJ. (eds.) (1983). Learning to write: first language/second language. New York: Longman.
GivónT. (1983). Topic continuity in discourse: an introduction. In T. Givón (ed.).
GivónT. (ed.) (1983). Topic continuity in discourse. Amsterdam, Philadelphia: John Benjamins.
GoldenJ., HaslettB. & GaunttH. (1988). Structure and content in 8th graders' summary essays. Discourse Processes, 11, 139–62.
GötzJ. & HülmannH. (1988). Von der Textaufgabe zum Zieltext (II). Die Neueren Sprachen, 87, 5, 506–38.
GrabeW. (1984). Written discourse analysis. Annual Review of Applied Linguistics, 5, 101–23.
GradyM. (1971). A conceptual rhetoric of the composition. College Composition and Communication, 22, 348–54.
GraubergW. (1971). Error analysis in German of first year university students. In G. E. Perren & J. L. M. Trim (eds.).
GreenP. S. & HechtK. (1985). Native and non-native evaluation of learners' errors in written discourse. System, 13, 2, 77–97.
GrobeC. (1981). Syntactic maturity, mechanics, and vocabulary as predictors of quality ratings. Research in the Teaching of English, 15, 75–85.
HadenR. (1987). Discourse error analysis. In J. Monaghan (ed.), 134–46.
HakeR. L. & WilliamsJ. M. (1979). Sentence expanding: not can, or how, but when. In D. Daiker, A. Kerek & M. Morenberg (eds.).
HallidayM. A. K. & HasanR. (1976). Cohesion in English. London: Longman.
HallowayD. W. (1981). Semantic grammars: how they help us teach writing. College Composition and Communication, 32, 189–204.
HammarbergB. (1974). The insufficiency of error analysis. International Review of Applied Linguistics, 12, 185–92.
HartnettC. G. (1986). Static and dynamic cohesion: signals of thinking in writing. In B. Couture (ed.), 142–53.
HaswellR. H. (1988). Length of text and the measurement of cohesion. Research in the Teaching of English, 22, 4, 428–33.
HatimB. (1987). A text linguistic model for the analysis of discourse errors: contributions from Arabic linguistics. In J. Monaghan (ed.), 102–13.
HennigK. R.Jr., (1980). Composition writing and the functions of language. DAI41: 2424–A.
HillocksG.Jr., (1986). Research on written composition. New directions for teaching. ERIC Clearinghouse on Reading and Communication Skills, Urbana, ILL: NCTE.
HoeyM. (1983). On the surface of discourse. London: Allen & Unwin.
HomburgT. J. (1984). Holistic evaluation of ESL compositions: can it be validated objectively?TESOL Quarterly, 18, 87–107.
HultC. A. (1986). Global marking of rhetorical frame in text and reader evaluation. In B. Couture (ed.), 154–68.
HuntK. W. (1965). Grammatical structures written at three grade levels. NCTE Research Report No. 3. Champaign, ILL: NCTE.
HuntK. W. (1970a). Syntactic maturity in schoolchildren and adults. Monographs of the Society for Research in Child Development, 35, 1. Chicago, ILL: University of Chicago Press.
HuntK. W. (1970b). Do sentences in the second language grow like those in the first?TESOL Quarterly, 4, 3, 195–202.
HuntK. W. (1971). Teaching syntactic maturity. In G. E. Perren & J. L. M. Trim (eds.).
InghilleriM. (1989). Learning to mean as a symbolic and social process: the story of ESL writers. Discourse Processes, 12, 3, 391–411.
JakobovitsL. A. (1969). Second language learning and transfer theory: a theoretical assessment. Language Learning, 19, 55–86.
JacobsS. E. (1981). Rhetorical information as prediction. TESOL Quarterly, 15, 237–49.
JakobsonR. (1960). Closing statements: linguistics and poetics. In T. A. Sebeok (ed.).
KameenP. T. (1978). A mechanical, meaningful, and communicative framework for ESL sentence combining exercises. TESOL Quarterly, 12, 395–401.
KameenP. T. (1983). Syntactic skill and ESL writing quality. In A. Frcedman, I. Pringle & J. Yalden (eds.), 170.
KierasD. E. (1978). Good and bad structure in simple paragraphs: effects on apparent theme, reading time, and recall. Journal of Verbal Learning and Verbal Behavior, 17, 13–27.
KierasD. E. & JustM. (eds.) (1984). New methods in reading comprehension. Hillsdale, N.J.: Lawrence Erlbaum.
KingM. L. & RentelV. M. (1979). Towards a theory of early writing development. Research in the Teaching of English, 13, 243–53.
La BrantL. L. (1933). A study of certain language developments of children in grades 4–12 inclusive. Genetic Psychology Monographs, 14, 4, 387–491.
LantolfJ. P. (1988). The syntactic complexity of written texts in Spanish as a foreign language: a markedness perspective. Hispania, 71, 4, 933–40.
Larsen-FreemanD. & StromV. S. (1977). The construction of a second-language acquisition index of development. Language Learning, 27, 123–34.
LautamattiL. (1978). Some observations on cohesion and coherence in simplified texts. In J. O. O˚stman (ed.), 165–81.
LautamattiL. (1987). Observations on the development of the topic of simplified discourse. In U. Connor & R. B. Kaplan (eds.), 87–114.
LevenstonE. A. (1978). Error analysis of free composition: the theory and the practice. Indian Journal of Applied Linguistics, 4, 1, 1–11.
LindebergA. C. (1985). Abstraction levels in student essays. Text, 5, 4, 327–46.
Lintermann-RyghL. (1985). Connector density – an indicator of essay quality?Text, 5, 4, 347–58.
McCannT. M. (1989). Student argumentative writing knowledge and ability at three grade levels. Research in the Teaching of English, 23, 1, 62–76.
McLureE. & GevaE. (1983). The development of cohesive use of adversative conjunctions in discourse. Discourse Processes, 6, 4, 411–32.
MannW. C. & ThompsonS. A. (1988). Rhetorical structure theory: a theory of text organisation. In L. Polanyi (ed.).
MartlewM. (ed.) (1983). The psychology of written language. Chichester: J. Wiley & Sons.
MellonJ. C. (1969). Transformational sentence-combining: a method for enhancing the development of syntactic fluency in English composition, NCTE Research Report No. 10. Champaign, ILL: NCTE.
MellonJ. C. & KinneavyJ. (1979). Issues in the theory and practice of sentence combining: 20-year perspective. In D. Daiker, A. Kerek & M. Morenberg (eds.).
MeyerB. J. F. (1975). The organisation of prose and its effects on memory. Amsterdam: North Holland.
MeyerB. J. F. (1985). Prose analysis: purposes, procedures, and problems (part II). In B. K. Britton & J. B. Black (eds.), 269–97.
MichaelsS. (1987). Text and context: a new approach to the study of classroom writing. Discourse Processes, 10, 4, 321–46.
MillerS. (1979). Rhetorical maturity: definition and development. Paper given at the Carleton Conference on Learning to Write (CCTE), Canada, May 1979.
MonaghanJ. (ed.) (1987). Grammar in the construction of texts. London: Frances Pinter.
MonroeJ. H. (1975). Measuring and enhancing syntactic maturity in French. French Review, 6, 1023–31.
MorenbergM., DaikerD. & KerekA. (1978). Sentence combining at the college level: an experimental study. Research in the Teaching of English, 12, 245–56.
MorrisseyM. D. (1979 and 1980). A rule-based description of noun-phrase errors. Moderne Sprachen, 23, 7–18 (part 1) and 24, 1–16 (part 2).
MorrisseyM. D. (1981). Learners' errors and linguistic description. Lingua, 54, 277–94.
MosenthalP., TamorL. & WalmsleyS. A. (1983). Research on writing. Principles and methods. New York and London: Longman.
MulderJ. E. M., BraunC. & HollidayW. G. (1978). Effects of sentence-combining practice on linguistic maturity level of adult students. Adult Education, 28, 111–20.
MullinA. E. (1990). Errors as discourse of an other. Paper presented at the Annual Meeting of the Conference on College Composition and Communication (41st, Chicago, ILL, March 22–24, 1990).
NeunerJ. L. (1987). Cohesive ties and chains in good and poor freshman essays. Research in the Teaching of English, 21, 1, 92–103.
NeyJ. W. (1966). Review of Grammatical structures written at three grade levels (Hunt, K. W.). Language Learning, 16, 130–5.
NietzkeD. A. (1972). The influence of composition assignment upon grammatical structure. DAI32: 5746–A.
NoldE. W. & FreedmanS. W. (1977). An analysis of readers' responses to essays. Research in the Teaching of English, 11, 164–74.
NoldE. W. & DavisB. E. (1980). The discourse matrix. College Composition and Communication, 31, 141–52.
NystrandM. (1979). Using readability research to investigate writing. Research in the Teaching of English, 13, 231–42.
NystrandM. (1982). An analysis of errors in written communication. In M. Nystrand (ed.).
NystrandM. (ed.) (1982) What writers know. New York: Academic Press.
ObenchainA. (1979). Developing paragraph power through sentence combining. In D. Daiker, A. Kerek & M. Morenberg (eds.).
O'BrienT. (1987). Predictive items in student writing. In T. Bloor & J. Norrish (eds.), 70–84.
OdellL. (1977). Measuring changes in intellectual processes as one dimension of growth in writing. In C. R. Cooper & L. Odell (eds.).
O'DonnellR. C., GriffinW. J. & NorrisR. C. (1967). Syntax of kindergarten and elementary school children: a transformational analysis. NCTE Research Report No. 8. Champaign, ILL: NCTE.
O'HareF. (1973). Sentence combining: improving student writing without formal grammar instruction. NCTE Committee on Research Report Series, No. 15. Urbana, ILL: NCTE.
OllerJ. W.Jr., & PerkinsK. (1980). Research in language testing. Rowley, MA: Newbury House.
OnyeberechiS. E. (1986). Syntactic fluency and cohesive ties in college freshmen writing. Dissertation Abstracts International, 47, 08A.
OstlerS. E. (1987). English in parallels: a comparison of English and Arabic prose. In U. Connor and R. B. Kaplan (eds.), 169–85.
ÖstmanJ. O. (ed.) (1978). Cohesion and semantics. Åbo, Finland: Research Institute of the Åbo Akademi Foundation.