K-12 School Profile System

Testing Terminology


The Role of Educational Testing

What are the main types of educational tests carried out by the Department of Education in Newfoundland & Labrador?

Nature of the Tests

Why use a sample for the norming group and how do we know that the results for the norming sample are representative of the norm group population?

How do we compare the Newfoundland population to the norm group sample on norm-referenced tests such as the CTBS?

Criterion-Referenced Tests

How are CRT scores summarized and interpreted?

What is the standard of excellence and how is it different from the acceptable standard?

 

The Role of Educational Testing

Educational testing is useful (some would argue necessary) because students inevitably vary in their performance within the school environment. Because performance varies so much, it is important to have a means of evaluating differences across students and (at a higher level) across schools, school districts, and so on. Such testing is integral to educational policy since it gives educators and policy developers a better idea of which aspects of education are working effectively in the system and which are not. If aspects of the curriculum or of school operation are causing difficulties for students in specific areas of their school program, it is important to have a means of isolating what these areas are and what might be done to improve them over time.

Without some form of testing, teachers and education professionals would largely be "in the dark" as to how effective their individual schools and districts were in meeting the objectives of current educational programs. This is why most educational systems routinely test students at various grades each year. Routine testing is necessary because the results of any single administration provide only a single view or "snapshot" of educational performance for the group present at the time of testing. There is no reason to believe that the results of these tests will hold equally true for the next group of students to enter the school system. This is why routine testing and longitudinal comparisons of student performance over time are useful: they help ensure that the level of performance is acceptable in any given year and that it is being maintained, if not improved, over subsequent years. Looking at performance changes in the same grades over several years is one of the best means of accomplishing this.


 

What are the main types of educational tests carried out by the Department of Education in Newfoundland & Labrador?

The Department of Education employs two primary types of tests to evaluate student performance in schools: norm-referenced tests and criterion-referenced tests.

The norm-referenced tests consistently used by the Department of Education are the CTBS (the Canadian Tests of Basic Skills), which are administered to students in Grades 4, 7, 10, and 12. These administrations take place in a cyclical manner: in the year following testing of Grade 4 students, Grade 7 students are tested, followed by Grade 10 students in the next year, and so on. Following testing of the Grade 12 students, the cycle begins anew with Grade 4 students in the subsequent year. Based on this cycle, a given grade is tested on the CTBS at four-year intervals. Note also that the CTBS generally attempt to assess student performance in a wide variety of basic skill areas relative to the performance of an appropriate comparison group (see below for details regarding the use of such a comparison group in evaluation).

The criterion-referenced tests, or CRT’s, employed by the Department of Education are normally administered to specific grades and in specific subject areas each year. The CRT’s also change format from year to year, often varying the grade and subject area in which testing takes place. CRT’s employed in the recent past have included Grade 6 Core French, Grade 3 and 6 Math, and Grade 3 and 6 Writing. A more in-depth discussion of elements related to criterion-referenced testing is presented below.


 

Nature of the Tests

Below you will find a fairly comprehensive description of the two types of tests mentioned above, as well as the concepts that are important to their interpretation. These concepts are not always easy to grasp, but they are necessary if one wishes to make any useful inferences regarding how performance on these tests maps onto student achievement.


Norm-referenced tests, first and foremost, are meant to provide comparisons of different groups on similar measures. They are seldom in-depth enough to provide information regarding how well students are achieving the specific objectives of their educational program (this is much better covered by the CRT’s). Rather, norm-referenced tests focus more heavily on comparing different groups on a number of broad subject or skill areas. Above all, it is important to realize that the emphasis for norm-referenced tests is placed upon how well students perform in these areas relative to similar comparison groups. The scores on such tests are thus only truly meaningful when interpreted relative to some comparison group of students.

Given this primary focus on group comparisons, who exactly are the students in specific schools and school districts compared against? The most common choice is to compare the performance of various schools against some relevant norm group. The norm group is usually a sample of students taken from some greater population to which the students in a given school belong. For example, a Newfoundland school might have a national norm group consisting of students from schools in all parts of Canada. In other words, the performance of a Newfoundland school’s students is being compared against that for a population of students from schools all across Canada. Note as well that the students in this Newfoundland school would also be considered potential candidates for this norming population (i.e., Canadian school students) even though they will not all be incorporated into the final norming sample.

An implicit assumption here, of course, is that the norming group sample is indeed representative of the overall population of Canadian school students. If it is not, any comparisons between a Newfoundland school and the norming group may be inappropriate. To achieve such representativeness, testers usually construct the norm group of students using carefully implemented sampling procedures. Because the issue of the norming sample’s representativeness is so integral to the interpretation of norm-referenced tests, the sampling concept is further explained in detail below.

The biggest problem individuals have when interpreting scores on norm-referenced tests is remembering that the test scores only measure comparative differences and not absolute differences. For example, a high PR score obtained on a Science section of the CTBS does not necessarily mean that the students/schools/school districts are highly proficient in Science. What it means is that in the area of Science as measured by the CTBS norm-referenced test, these groups do much better than the majority of the norm group on the same questions. Always remember that the emphasis for norm-referenced tests is placed on a school’s performance relative to some relevant comparison group (i.e., the norm group). Once this point is grasped, interpretation of the various types of scores on a norm-referenced test (percentile ranks, grade equivalents, stanines, etc.) ultimately becomes much easier.




Why use a sample for the norming group and how do we know that the results for the norming sample are representative of the norm group population?

Sampling, as opposed to a census in which all possible students are tested, is usually employed simply because the resources (time, money, and effort, among others) required to test the entire population would be monumentally large. Particularly in today’s climate of limited resources, it is often necessary to make do with something less than a perfect census. The good news, however, is that when sampling is carried out effectively, it can give us results which are still highly representative of the entire population even though the entire population was never actually tested to obtain them. In this case, you can indeed get "more for less." But this gain necessarily depends on the assumption that the sampling procedure is carried out correctly in the first place.

The overall goal of a sampling procedure is to ensure that any results obtained for the sample group will be reasonably representative of the results that would be obtained if the entire population had actually been sampled. To achieve this goal, various sampling conditions must be met. First of all, it is necessary to specify in as much detail as possible the sampling frame which will be used in the testing. The sampling frame is, more or less, a focused definition of the population which will be tested during the evaluation. Thus, for example, in a norming group for Grade 4 students, the sampling frame might be defined as all students attending regular Grade 4 classes in schools within the country of Canada. Note that this frame implicitly filters out or removes any students from schools outside the country. As well, it also filters out any special education students and any Canadian students who attend Grade 4 under a specialized learning program. This is partly the reason why these types of students are never actually given the CTBS when these tests are administered within Newfoundland schools. Because they are never actually represented in the original norming sample, any comparisons between the special education population and the norming sample would be pointless (this would be largely the equivalent of comparing "apples to oranges").

After the sampling frame is established, the next step is to specify in some systematic way every possible member in the sampling frame population. This could, for example, be a list of all Canadian schools in different provinces coupled with a list of all Grade 4 students in these schools. Once this is done, the actual selection of sample units begins. Such selection can be accomplished using various units of sampling. For instance, sampling might be done at the level of individual students such that students are sampled one at a time from the available lists and put into the final norm group for testing. Alternatively, another choice would be to sample individual schools from the overall sampling frame of Canadian schools and include every regular class Grade 4 student from that school in the norm group.

An important, if not crucial, condition of sampling is that some form of random selection be used in choosing cases for the sample. Random selection is imperative in achieving a representative sample which accurately reflects the results for the larger population. Random selection requires that every element in the studied population have a known probability of selection. In most cases, we further require that every element in the studied population have an equal probability of being selected (also known as simple random sampling). Why is this important? Because for the sample results to be representative of the overall population, each individual in the population must have an equal probability of being chosen. If some individuals have a lower probability of being selected, they are less likely to be included in the sample and, consequently, the sample is less likely to be representative of that subgroup in the population. When various individuals in the population have a lesser chance of being selected, we run the risk of biasing our sample and obtaining results which are not very representative of the population from which the sample is drawn. For this reason, testers tend to emphasize the use of random sampling procedures as much as possible.
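To make the sampling ideas in the last few paragraphs concrete, the sketch below (written in Python purely for illustration; it is not part of any actual CTBS norming procedure, and all records and numbers in it are hypothetical) first applies the sampling frame definition to filter out ineligible students and then draws a simple random sample in which every eligible student has an equal chance of selection.

    import random

    # Hypothetical sampling frame: each record describes one student.
    frame = [
        {"id": 1, "grade": 4, "program": "regular", "province": "NL"},
        {"id": 2, "grade": 4, "program": "special education", "province": "NL"},
        {"id": 3, "grade": 5, "program": "regular", "province": "ON"},
        {"id": 4, "grade": 4, "program": "regular", "province": "BC"},
        # ... in practice, many thousands of records
    ]

    # Apply the frame definition: regular Grade 4 classes in Canadian schools only.
    eligible = [s for s in frame if s["grade"] == 4 and s["program"] == "regular"]

    # Simple random sampling: every eligible student has an equal
    # probability of being chosen for the norming sample.
    sample_size = 2  # illustrative; real norm groups use thousands of students
    norm_sample = random.sample(eligible, k=sample_size)
    print(norm_sample)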

In selecting cases for the sample, it is important to ensure that no individuals are selected who are not members of the population being studied. For instance, in obtaining the sample for our Grade 4 norming group, we should not end up with a small subgroup of Grade 5 students within the final norming sample. It should be rather obvious that these students would not be representative of the overall Grade 4 population, by any means. Such a situation is unlikely to arise, though, if the initial sampling frame (mentioned above) was in fact constructed properly.

Assuming that a random sampling procedure has been correctly used, will the norming sample then be representative of the overall population? This is at least partly dependent on another factor: the sample size. Issues surrounding what constitutes an appropriate sample size are complex and need not be dealt with in detail here. However, as a general rule of thumb, keep in mind that, all other things being equal, the representativeness of a sample increases as the sample size gets larger. This makes sense since, the greater the sample size chosen, the more elements from the original population are being taken. In fact, if we were to continue increasing the size of our sample indefinitely, we would eventually reach a point where the entire population had in fact been selected. On a final note, keep in mind that the makers of the CTBS usually employ norm groups consisting of many thousands of students (about 40,000) in an effort to ensure that the sample size requirement for representativeness is in fact met.
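The effect of sample size can also be shown with a short, purely hypothetical simulation (the population of scores below is invented, not CTBS data): as the sample grows, the sample mean tends to land closer and closer to the true population mean.

    import random
    import statistics

    random.seed(42)

    # Hypothetical population of 100,000 test scores with a mean of about 60.
    population = [random.gauss(60, 12) for _ in range(100_000)]
    true_mean = statistics.mean(population)

    # Larger random samples tend to estimate the population mean more closely.
    for n in (10, 100, 1_000, 10_000):
        sample = random.sample(population, n)
        error = abs(statistics.mean(sample) - true_mean)
        print(f"sample size {n:>6}: sample mean differs from population mean by {error:.2f}")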

At this point, suppose that we have selected a sufficiently large sample from the studied population using random selection and an appropriate sampling frame. Can we now guarantee that the results derived from this sample will be representative of the overall population? Unfortunately, the answer is no. There is always a very small chance that we will select a large number of unusual cases who are quite different from the remaining members of the population (for example, a subset of Grade 4 students who perform exceptionally well on the various CTBS tests and who score much higher than the average Grade 4 student). While there is always a small chance that such a biased sample will occur, the random selection procedure effectively minimizes the chances of such an event. This is why random sampling procedures are thought to be so beneficial for achieving representativeness.


Coin Toss Probability Example
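A short simulation (hypothetical, for illustration only) makes the coin toss point: a perfectly fair coin will occasionally produce a lopsided run of heads purely by chance, just as a properly drawn random sample will occasionally turn out to be unrepresentative; and, as with sample size above, the chance of such an extreme outcome shrinks rapidly as the number of tosses grows.

    import random

    random.seed(1)

    def fraction_heads(n_tosses):
        """Toss a fair coin n_tosses times and return the fraction of heads."""
        return sum(random.random() < 0.5 for _ in range(n_tosses)) / n_tosses

    # With only 10 tosses, getting 80% or more heads is not that rare;
    # with 1,000 tosses it essentially never happens.
    for n in (10, 100, 1_000):
        trials = [fraction_heads(n) for _ in range(2_000)]
        extreme = sum(f >= 0.8 for f in trials) / len(trials)
        print(f"{n:>5} tosses: proportion of trials with 80%+ heads = {extreme:.4f}")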



 

How do we compare the Newfoundland population to the norm group sample on norm-referenced tests such as the CTBS?

Let us assume that an appropriate and representative norm group of Canadian students was effectively obtained for many different grades. The next step would then be to evaluate the performance of the selected students in the norming group by having them complete the appropriate test sections of the CTBS. Following this, the number of students correctly answering each item can then be calculated. Such calculations are important since they will be the chief means through which the Newfoundland population will eventually be compared to the norming group.

After the performance of the norming group has been effectively summarized, the test can then be given to students in various Newfoundland schools and school districts. Note that every attempt is made here to administer these tests under the exact same conditions as the norming group experienced when they completed the tests. For example, Newfoundland students given the tests in the Fall will be compared to a norming group of the same grade that was also tested in the Fall. Such steps are necessary since, if one group is tested at a later point in the year than another, considerable differences in performance may arise simply due to the difference in time (and not because of aspects related to the educational program which is what we are primarily interested in). An attempt must therefore be made to keep the testing conditions consistent since any differences in testing conditions might influence the performance of students on the test. Remember, we are interested largely in differences which have arisen due to differences in educational programs and implementation. Differences in CTBS scores brought on by testing conditions only serve to muddle our interpretation of the results since we no longer know if the variation in scores arose from the varying testing conditions or from actual differences in the educational programs of these schools.

Once the Newfoundland students have completed the tests, their results too are scored and summarized on an item-by-item basis. The number of students getting each question correct can then be calculated. Similarly, for each section of the test which deals with a specific skill, the total number of questions answered correctly in that section can be found. Such summaries are typically known as raw scores. While raw scores are instrumental in interpreting norm-referenced tests, they do not usually tell us very much on their own. For example, if a school, on average, answers 6 of the 10 questions in a science area correctly, what does this tell us? There is, after all, no indication of the difficulty of these 10 items (are they all similar in difficulty or do they vary quite a lot?). Raw scores, however, take on much more meaning when they are directly compared to the performance of the norm group on the same measures. In this vein, several different types of scores can then be used to summarize the performance of Newfoundland students in comparison to the norm group. In all cases, however, remember that the results of a norm-referenced test always describe one group’s position relative to that of some similar group (and, as we now know, the second group would ordinarily be the norm group).


The different types of scores derived from conversions of the student raw scores include:

a) grade equivalents - these indicate the grade level, in years and months, for which an obtained score was the average score for the norm group. Thus, a score of 30 with a grade equivalent of 5.1 means that 30 was the average score for norm-group students in the first month of Grade 5. People sometimes generalize too much about the implications of grade equivalent scores. For example, when a Grade 3 student scores a grade equivalent of 4.8, there is a temptation to say that the Grade 3 student is operating at the Grade 4 level. This is unlikely to be the case, however. What it means is that, for the questions presented on the norm-referenced test, the student achieved a score that was in the average range for students in the 8th month of Grade 4. Grade 4 students have, of course, learned considerably more material than what is covered on the test, material with which a student in Grade 3 would likely not be familiar. Thus, to claim that a Grade 3 student is operating at a Grade 4 level based on grade equivalents is misguided. Perhaps more than any other type of score on the norm-referenced tests, the grade equivalent must be interpreted with care.
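The idea behind a grade equivalent can be sketched as follows (the norm-group averages in the sketch are invented for illustration and are not actual CTBS norms): a raw score is located among the average scores obtained by the norm group at different points in the school career, and the corresponding grade level is read off, interpolating roughly between grades where necessary.

    # Hypothetical norm-group average raw scores by grade level (grade.month).
    # A raw score of 30 sits at the Grade 5.1 average, so its grade equivalent is 5.1.
    norm_averages = [(3.1, 22.0), (4.1, 27.0), (5.1, 30.0), (6.1, 33.0)]

    def grade_equivalent(raw_score):
        """Roughly interpolate a grade equivalent from the norm-group averages."""
        if raw_score <= norm_averages[0][1]:
            return norm_averages[0][0]
        if raw_score >= norm_averages[-1][1]:
            return norm_averages[-1][0]
        for (g_lo, s_lo), (g_hi, s_hi) in zip(norm_averages, norm_averages[1:]):
            if s_lo <= raw_score <= s_hi:
                fraction = (raw_score - s_lo) / (s_hi - s_lo)
                return round(g_lo + fraction * (g_hi - g_lo), 1)

    print(grade_equivalent(30.0))   # 5.1
    print(grade_equivalent(28.5))   # falls between the Grade 4.1 and 5.1 averages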


b) percentile ranks - these are one of the more common types of summary scores used on norm-referenced tests such as the CTBS. One of the most important things to note about PR’s is that they refer not to the percentage of answers correct but, instead, to the percentage of students in the norm group who achieved the same score or lower on the test. Consider, for example, a score of 100 for one school which is found to translate into a PR of 50. What does this signify? What this means is that 50% of the norming group received this same score or less when they were given the test.
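Under this definition, a percentile rank can be sketched very simply (the norm-group scores below are hypothetical, and this is an illustration of the concept rather than the CTBS publisher’s actual scoring procedure): count the norm-group scores at or below the score in question and express the count as a percentage of the norm group.

    def percentile_rank(raw_score, norm_scores):
        """Percentage of norm-group scores at or below the given raw score."""
        at_or_below = sum(s <= raw_score for s in norm_scores)
        pr = round(100 * at_or_below / len(norm_scores))
        return min(pr, 99)  # PR's are conventionally capped at 99

    # Hypothetical norm-group raw scores for one subtest.
    norm_scores = [82, 90, 95, 100, 100, 104, 110, 118, 121, 130]

    print(percentile_rank(100, norm_scores))  # 50 -> average relative to the norm group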

Parents and even professionals may sometimes confuse aspects related to the usage of raw scores and percentile ranks. For example, parents might sometimes assume that a score of 100 on the CTBS is equivalent to 100%, or a perfect score. This is not actually the case, however. Note above, for example, that a score of 100 or less was received by 50% of the norm group and would therefore be considered only average performance relative to the norm group.

A similar confusion sometimes arises when educators seek to summarize the scores for their own schools and school districts. For example, there may be a desire to combine the various PR’s for the CTBS subtests in order to obtain a total or composite PR score. While this is mathematically possible, statistically it does not make sense. For instance, let us assume that the 4 PR scores obtained on the test were 23, 42, 50, and 52. Adding these scores together, we arrive at a composite score of 167. This does not make sense as a PR, since it is impossible for 167% of the norming group to achieve a score at or below that of this school or school district. The highest possible PR, in fact, is limited to 99.

Sometimes, composites can be correctly reported through other means. It is argued by some, however, that performance in one area of the test (e.g., math) has little bearing on performance in another area (e.g., language) and, therefore, collapsing the numbers into a composite is a misguided venture. In any case, while the proper calculation of composites is acceptable, the interpretation of such composites should always be made in conjunction with the scores found on the individual subtests. After all, it is at the level of the individual subtests where the areas of difficulty will be most evident and where possibilities for change in instruction will prove most useful.
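To illustrate the point (with hypothetical numbers, and showing only one of several possible approaches rather than the CTBS publisher’s own procedure, which would typically derive composite norms from the norm group’s distribution of composite scores), the sketch below contrasts the meaningless sum of the four PR’s with a composite formed by first converting each PR to a common standard scale, averaging, and converting back.

    from statistics import NormalDist

    norm_curve = NormalDist()  # standard normal curve, used only as an approximation

    def composite_pr(prs):
        """Illustrative composite: convert each PR to a z-score, average the
        z-scores, then convert the average back to a percentile rank."""
        zs = [norm_curve.inv_cdf(pr / 100) for pr in prs]
        mean_z = sum(zs) / len(zs)
        return min(round(100 * norm_curve.cdf(mean_z)), 99)

    subtest_prs = [23, 42, 50, 52]
    print(sum(subtest_prs))           # 167 -- meaningless as a percentile rank
    print(composite_pr(subtest_prs))  # an illustrative composite between 1 and 99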


c) stanine - a stanine is merely a score on a nine-unit scale from 1 to 9, with the middle score of 5 generally assumed to represent average performance. Stanine 1 is the minimum score while 9 is the maximum. All stanines except those at either end (i.e., stanines 1 and 9) cover intervals of equal width. The majority of students will score in the middle three stanines (4, 5, and 6) while very few will be located in the stanines at either end (1 and 9). This makes sense since the majority of students will perform around some average mark while only a much smaller percentage will perform exceptionally higher or lower than this average (think of this in terms of a common bell curve).

In truth, stanines merely represent a coarser grouping for percentile ranks (see Harcourt’s - Things Parents Should Know About Testing, page 5, in the back of this document). Stanines are generally easier to understand at face value compared to PR’s but this comes at the cost of providing less detailed information regarding group differences. Similar to what was stated for percentile ranks, it is important to note that stanines still represent differences in groups; they only offer information regarding comparative performance (e.g., school performance versus that of the norm group) and are not indicators of absolute performance.
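Because stanines are a coarser grouping of percentile ranks, the conversion can be sketched with the PR cut-points commonly published for stanine scales (roughly 4, 11, 23, 40, 60, 77, 89, and 96); treat the exact boundaries below as approximate rather than as the CTBS publisher’s own table.

    # Approximate percentile-rank upper bounds for stanines 1 through 8;
    # any PR above the last bound falls into stanine 9.
    STANINE_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

    def stanine(pr):
        """Convert a percentile rank (1-99) to a stanine (1-9)."""
        for s, bound in enumerate(STANINE_BOUNDS, start=1):
            if pr <= bound:
                return s
        return 9

    for pr in (3, 25, 50, 75, 98):
        print(f"PR {pr:>2} -> stanine {stanine(pr)}")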


Criterion-Referenced Tests

While the overall objective of criterion-referenced tests, or CRT’s, is similar to that of NRT’s (i.e., the evaluation of school children’s performance), they approach the problem from a much different perspective. Whereas the NRT’s place the emphasis solely on a school’s position relative to a comparison or norm group, the CRT’s place the emphasis on how school performance compares to what is judged to be an acceptable standard in that area of the school curriculum.

So what exactly is judged to be an acceptable standard? This inevitably varies with the specific area of the curriculum being tested (as well as the specific grade, among other factors). However, in almost all cases, the acceptable standard is determined by a specially constructed committee or panel of professionals and experts. Such a committee usually consists of teachers and curriculum experts, although individuals from other relevant occupations could conceivably be included if it were beneficial. Normally, this group considers at length what aspects of the subject should be addressed in the determination of an acceptable standard of performance. Once these aspects or criteria for acceptable standards have been agreed upon, the group then considers the best means by which to test for these various criteria among the school populations. Remember, the overall aim here is to determine whether students in various schools have acquired the skills and knowledge relevant to a given area of the school curriculum. This assessment is made using carefully outlined criteria regarding what constitutes acceptable performance. This overall objective can be contrasted with that for norm-referenced tests, which tend to focus more on differences across groups in broad areas of knowledge. On CRT’s, however, the performance of other examinees (whether in a norm group or otherwise) is irrelevant to how the student scores on the test. Only the student’s performance relative to the criteria of acceptable performance is deemed to be important on the CRT.


 

How are CRT scores summarized and interpreted?

There is usually a certain amount of flexibility regarding how scores on a criterion-referenced test are arrived at and summarized. Oftentimes, the method by which scores are summarized will depend on the nature of the test itself - specifically, what subject area is being tested and what the defining criteria for acceptable performance are judged to be. In some cases, the nature of the criteria makes it difficult to evaluate student performance by simple means. Consider the following two CRT’s administered in Newfoundland schools over the past few years. In the 1996 Grade 3 Math CRT, scores were summarized by finding the percentage of correct responses students made in various areas of the math curriculum, including: numbers & numeration, geometry, measurement, graphs, and problem solving. This represents a fairly straightforward and common method of summarizing CRT scores. A slightly modified approach, however, was used for the Grade 6 Writing CRT in 1997. While different areas were also analyzed for writing (content, organization, sentence fluency, voice, word choice, and conventions), scores on this CRT represented the percentage of students who were writing at different levels of competency. In total, there were 5 levels identified for each area of writing, with carefully defined criteria associated with each of them so that accurate and reliable assignment of all students to appropriate levels was possible. Note, then, that while percentages were used to summarize scores on both CRT’s, these percentages were employed in a slightly different way for each. The difference in scoring formats may be viewed as having arisen from the different objectives and criteria associated with the math and writing areas.
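The two summary styles just described can be sketched as follows (all numbers are invented for illustration and are not the actual 1996 or 1997 results): the first computes the percentage of correct responses in each area of the math curriculum, while the second computes the percentage of students assigned to each writing level.

    # Style 1: percentage of correct responses per curriculum area.
    # Each area maps to (total correct responses, total responses) across all students.
    math_results = {
        "numbers & numeration": (840, 1200),
        "geometry": (600, 1000),
        "measurement": (720, 900),
    }
    for area, (correct, total) in math_results.items():
        print(f"{area}: {100 * correct / total:.0f}% correct")

    # Style 2: percentage of students assigned to each of the 5 writing levels.
    # Each entry is the level assigned to one student for, say, "organization".
    levels_assigned = [3, 4, 2, 3, 5, 3, 4, 1, 3, 4]
    for level in range(1, 6):
        share = 100 * levels_assigned.count(level) / len(levels_assigned)
        print(f"level {level}: {share:.0f}% of students")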


 

What is the standard of excellence and how is it different from the acceptable standard?

CRT’s sometimes make mention of an additional item known as the standard of excellence. A standard of excellence is a score taken to indicate that a student is excelling in the subject area of interest. It generally represents the percentage on the CRT that would be received by a student who typically receives 80% or more on teacher-administered paper-and-pencil tests in the classroom. Students achieving at or above the percentage indicated for the standard of excellence are thought to be highly competent in the subject area being examined.

In contrast, the acceptable standard represents the minimum percentage on the CRT which students must obtain in order to be considered minimally competent in the subject area being tested. The "acceptable" or "minimum" standard represents the lowest percentage on the CRT that can be taken as evidence of students having achieved curriculum objectives in that subject area to the passing level.

To put these two forms of standards into context relative to one another, conceptualize it this way. There is a cut-off percentage below which a student would be considered as not having attained the objectives of the course subject at an acceptable level. Anyone at this cut-off percentage or above it is considered to have adequately attained these objectives to a passable level. Within this group of individuals who have reached the passable level, there is a subgroup who are at or above the percentage designated as the standard of excellence. This subgroup represents those who not only are passing but are considered to be excelling in the course subject.
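Putting the two standards together, the classification can be pictured as in the sketch below, where the specific cut-off percentages are purely illustrative (as the next paragraph explains, actual cut-offs depend on the difficulty of the particular CRT).

    # Hypothetical cut-offs for one CRT; real values vary with the test's difficulty.
    ACCEPTABLE_STANDARD = 50     # minimum % needed to be considered minimally competent
    STANDARD_OF_EXCELLENCE = 80  # % at or above which a student is considered excelling

    def classify(percent):
        if percent >= STANDARD_OF_EXCELLENCE:
            return "meets the standard of excellence"
        if percent >= ACCEPTABLE_STANDARD:
            return "meets the acceptable standard"
        return "below the acceptable standard"

    for score in (35, 50, 72, 80, 95):
        print(f"{score}% -> {classify(score)}")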

Note that the cut-off point for a CRT need not be the same as for a regular classroom-type test. For example, although 50% might be considered the cut-off mark for the acceptable standard on a regular classroom test, a similar cut-off point need not apply to the CRT. Cut-off marks will usually depend on the difficulty of the CRT in question. In cases where the CRT is judged to be particularly easy, the cut-off point is shifted upward to compensate. Conversely, in cases where the CRT is judged to be considerably harder, the cut-off point is usually located below the 50% mark.


 

Final Note Regarding Criterion-Referenced Tests

Criterion-referenced tests, overall, tend to be much easier to comprehend than their norm-referenced counterparts. They are, in many ways, similar to the tests that teachers administer within their own classrooms and are often scored in a similar manner. The biggest difference between CRT’s and the classroom variety of tests probably lies in their comprehensiveness. Criterion-referenced tests are usually much more structured and focused than the average classroom test. This is because CRT’s are generally designed to serve a much broader purpose (to assess students’ overall level of knowledge and skill in a given subject area) and accomplish this goal by comparing student performance to a specific set of criteria which are assumed to be indicative of competency in the area tested. In fact, it is the student’s performance relative to this structured set of criteria or standards which is most characteristic of criterion-referenced tests used in education.

 

This page and all contents are copyright, Government of Newfoundland and Labrador, all rights reserved.