Test Scoring and Analysis Using SAS (2014)

Chapter 4. Checking Your Test Data for Errors

Introduction

All of the programs developed so far have assumed that there are no errors in the test data. This is an unrealistic assumption. In the real world, it is likely that your student data contains errors. For example, if you are using a mark-sense reader (scanner), the student may have filled in more than one answer choice. More sophisticated scanners include darkest mark discrimination software that attempts to choose the darker mark if more than one answer is selected. This is useful when there are stray marks on the sheet or when the student fails to erase an item completely. Another possible error occurs when a student marks a selection so lightly that it is interpreted as an unanswered item.

Other possible errors include incorrect student IDs (in which case a matching name will not be found in the student name file). Because scanners can jam (and they do), you may wind up with multiple entries for a single student.

Other sources of test data (entered by hand, from the web, etc.) are also subject to data errors. Most of the programs described in this book will either produce strange-looking results or fail completely if certain types of errors in your data remain uncorrected.

You will see ways to identify and fix all of these errors with the SAS programs developed in this chapter.

Detecting Invalid IDs and Answer Choices

Suppose you are using a scanner that does not support darkest mark discrimination. If a student marks more than one choice, you will usually wind up with an ASCII character other than A, B, C, D, or E. One way to detect these errors is with the following short SAS program. (Note: This program, as well as the other programs in this book, is available in pre-packaged form in Chapter 11 for use by SAS programmers and other programmers alike.) A file called data_errors.txt was created from the original file sample_test.txt with three errors introduced: one invalid student ID, one answer choice not equal to A through E, and one blank answer. Here is a listing of file data_errors.txt:

Listing of File data_errors.txt

000000000 abcedaabed

123456789 a cdeaabed

111111111 abcedaabed

222222222 cddabedeed

333333333 bbdddaaccd

444444444 abce?aabbb

555555555 eecedabbca

666666666 aecedaabed

777x77777 dbcaaaabed

888888888 bbccdeebed

999999999 cdcdadabed

You can run the following program to identify the errors in this file:

Program 4.1: Detecting Invalid IDs and Answer Choices

title "Checking for Invalid IDs and Incorrect Answer Choices";

data _null_;

   file print;

   array ans[10] $ 1 Ans1-Ans10;

   infile 'c:\books\test scoring\data_errors.txt' pad;

   input @1  ID $9.

         @11 (Ans1-Ans10)($upcase1.);

   if notdigit(ID) and _n_ gt 1 then put "Invalid ID " ID "in line " _n_;

   Do Item = 1 to 10;

      if missing(Ans[Item]) then

      put "Item #" Item "Left Blank by Student " ID;

      else if Ans[Item] not in ('A','B','C','D','E') then

         put "Invalid Answer for Student " ID "for Item #"

         Item "entered as " Ans[Item];

   end;

run;

Because your multiple-choice responses may be in upper- or lowercase, this program reads the student data and converts it to uppercase. This is accomplished by using the $UPCASE1. informat. You can use the NOTDIGIT function to test whether the student ID contains any non-digit characters. This useful function searches every position of a character value (including trailing blanks) and returns the position of the first non-digit. If the character value contains only digits, the function returns 0. In SAS, any numeric value that is not 0 or missing is considered true. So, if the ID contains any non-digit characters, regardless of where that character is located in the ID, the function returns a number from 1 to 9 (since the ID variable in this example is nine digits) and is therefore considered true.

Because you used a FILE PRINT statement, the PUT statement following the test for non-digit values writes to the output device (the Output window if you are using SAS Display Manager). The automatic variable _N_ counts iterations of the DATA step. Because you are reading one line of input data with every iteration of the DATA step, the value of _N_ tells you which line of the text file contains an ID error.

Following the test for non-digit ID values, you test each of the 10 answer values to see if a value other than A through E is found. If so, you print out the ID, the item number, and the value.

This program assumes that the answer key is included in the first line of the file to be tested, so it does not check the ID in that line. If the key is not included in the file, omit the test for _N_ greater than one in the test for non-digit values.
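If you want to prototype the same checks outside SAS, the logic of Program 4.1 translates directly. The following Python sketch is illustrative only (the function name and file layout are assumptions, not part of the book's SAS code): it skips the ID check on the answer-key line, flags non-digit IDs, and flags blank or out-of-range answers.

```python
# Illustrative sketch of the checks in Program 4.1 (not the book's code).
# Assumed layout, matching data_errors.txt: a 9-character ID in columns
# 1-9 and ten answers in columns 11-20.
VALID_ANSWERS = set("ABCDE")

def check_line(line_number, line):
    """Return a list of error messages for one line of the test file."""
    errors = []
    student_id = line[0:9]
    answers = line[10:20].upper().ljust(10)   # uppercase, like $UPCASE1.
    # Line 1 holds the answer key, so skip its ID check (the _N_ gt 1 test).
    if line_number > 1 and not student_id.isdigit():
        errors.append(f"Invalid ID {student_id} in line {line_number}")
    for item, ans in enumerate(answers, start=1):
        if ans == " ":
            errors.append(f"Item #{item} Left Blank by Student {student_id}")
        elif ans not in VALID_ANSWERS:
            errors.append(
                f"Invalid Answer for Student {student_id} "
                f"for Item #{item} entered as {ans}"
            )
    return errors
```

Running this over each line of data_errors.txt reports the same three problems Program 4.1 finds: the blank answer, the invalid answer character, and the non-digit ID.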

Output from Program 4.1

Checking for Invalid IDs and Incorrect Answer Choices

Item #2 Left Blank by Student 123456789

Invalid Answer for Student 444444444 for Item #5 entered as ?

Invalid ID 777x77777 in line 9

Checking for ID Errors

You tested for non-digit values in the previous program. In this program, you want to go one step further and determine if any of the IDs in the student test file are missing from the student name file (containing ID values and names). To demonstrate this, another text file called invalid_test.txt was created from the original sample_test.txt file with one ID changed so that it cannot be found in the student name file (student_name.txt). When you merge the test data file and the student name file (by ID), you can determine if there are any IDs in the test data file that are not found in the student name file, as demonstrated next:

Program 4.2: Checking for Student IDs in the Test File that Are Not in the Student Name File

*Identifying IDs in the test data that are not present

 in the student name file;

data student_test;

   infile 'c:\books\test scoring\invalid_test.txt' firstobs=2;

   input @1  ID $9.;

run;

proc sort data=student_test;

   by ID;

run;

data student_name;

   length Name $ 15 ID $ 9;

   infile 'c:\books\test scoring\student_name.txt' dsd;

   input Name ID;

run;

proc sort data=student_name;

   by ID;

run;

title "Listing of Student IDs not in the Student Data File";

data _null_;

   file print;

   merge student_name(in=in_student_name) student_test;

   by ID;

   if not in_student_name then put "ID " ID "not found in name file";

run;

You start out by reading the student data from the file. Because the first line of this file contains the answer key, you use the INFILE option FIRSTOBS= to tell the program to begin reading the text data file starting with the second line. You are only checking for invalid IDs, so that is the only variable that you need to read. Next, you sort this data set by ID since that is required in the merge that is to follow.

You may already have a permanent student name file, but in this example you are re-creating it in a DATA step. You also sort this data set by ID to prepare for the merge.

In the final DATA step, you merge the two SAS data sets by ID. The SAS data set option IN= creates a temporary variable (here called in_student_name) that is true when that data set contributes to the current observation of the merge and false otherwise. If there is an ID in the student_test data set that is not in the student_name data set, in_student_name will be false for that observation. You want to print out the ID for any student whose name is not included in the student name file. The output looks like this:
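Outside SAS, the same cross-check amounts to a set difference: any ID in the test file that is absent from the name file is an error. Here is a minimal Python sketch of that idea (the function name is an illustrative assumption, not anything from the book):

```python
# Illustrative sketch of the ID cross-check in Program 4.2: report
# every test-file ID that has no match in the student name file.
def ids_not_in_name_file(test_ids, name_ids):
    """Return test-file IDs missing from the name file, in sorted order."""
    return sorted(set(test_ids) - set(name_ids))
```

For example, checking the altered ID against a name file that contains only 111111111 and 444444444 reports 434444444 as unmatched, just as the merge with IN= does.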

Output from Program 4.2

Listing of Student IDs not in the Student Data File

ID 434444444 not found in name file

Using “Fuzzy” Matching to Identify an Invalid ID

Two possible reasons why a student ID is missing from the name file are that the name was never entered or that the student marked it incorrectly on the answer sheet. This section shows you how to look for close matches between the incorrect ID and IDs in the student file using a method sometimes described as fuzzy matching. SAS has a spelling distance function (SPEDIS) that is used to match words (usually names) where they may not be spelled exactly the same. This same function can be used to identify closely matched IDs. Let’s try to see if we can use this method to find a close match to ID 434444444. Here is the program:

Program 4.3: Attempting to Find a Close Match for IDs in the Test File and the Name File

*Attempting to find a close match for ID 434444444 in the student name file;

data student_test;

   infile 'c:\books\test scoring\invalid_test.txt' firstobs=2;

   input @1  ID $9.;

run;

data student_name;

   length Name $ 15 ID $ 9;

   infile 'c:\books\test scoring\student_name.txt' dsd;

   input Name ID;

run;

title "Searching for a Close Match for ID 434444444";

proc sql;

   select test.ID as ID_from_Test,

          name.ID as ID_from_Name,

          Name

   from student_test as test,

        student_name as name

   where spedis(test.ID,name.ID) between 1 and 15;

quit;

Even though the IDs consist of digits and not letters, you can still use the spelling distance function (SPEDIS) to compare two IDs. The program starts by creating the student_test and student_name data sets, as in the previous program. To look for possible matches between the ID on the test and the IDs in the name file, you create a Cartesian product using PROC SQL. A Cartesian product is all possible combinations of observations in each file matched against every observation in the other file. As a simple example, if File One contained:

File One

111111111 Ron

222222222 Jim

And File Two contained:

File Two

111111110

222222222

333333333

The Cartesian product of File One and File Two would be:

Cartesian Product of File One and File Two

111111111 Ron 111111110

111111111 Ron 222222222

111111111 Ron 333333333

222222222 Jim 111111110

222222222 Jim 222222222

222222222 Jim 333333333

You can then look at each line in the Cartesian product data set to see if the two IDs are close. For those readers who enjoy the details, the SPEDIS function uses a spelling distance algorithm to compute a spelling distance between words. If the two words being compared are exactly the same, the spelling distance is 0. For each spelling mistake, the function assigns penalty points. Some spelling mistakes, such as getting the first letter wrong, result in a large penalty; other errors, such as transposing the position of two letters, result in a smaller penalty. Finally, after all the penalty points are computed, the spelling distance is computed as the sum of the penalty points as a percentage of the number of letters in the word you list as the first argument of the function. This makes sense, since one letter wrong in a three-letter word is a rather bad spelling mistake, whereas one letter wrong in a 10-letter word is a minor error.

Now that you see how the SPEDIS function can compare IDs, let’s see the result of running Program 4.3:

Output from Program 4.3

Searching for a Close Match for ID 434444444

ID_from_Test  ID_from_Name  Name

---------------------------------------------

434444444     444444444     Raul Heizer

Here you see a possible match for ID 434444444. You may want to experiment with spelling distance values in this program. If you pick a value too small, you may miss possible matches; if you pick a value too large, you may match IDs that do not belong to the same person.

Checking for and Eliminating Duplicate Records

There are a number of ways that duplicate records can occur. If you are using a scanner (with or without automatic feed), an answer sheet may cause a jam. When this happens, you usually scan the sheet a second time. Depending on the situation, the rescanning process may cause a duplicate record to be entered into your file. Another possible source of duplicate records results when your data is entered by more than one person and one or more answer sheets are entered multiple times.

The solution to this problem is very simple: When you use PROC SORT to sort your answer file, you can include an option to remove duplicate observations. To demonstrate this, two duplicate entries were added to the data_errors.txt file. This file was called duplicates.txt and is displayed next:

Listing of duplicates.txt

000000000 abcedaabed

123456789 a cdeaabed

111111111 abcedaabed

111111111 abcedaabed

222222222 cddabedeed

333333333 bbdddaaccd

444444444 abce?aabbb

555555555 eecedabbca

666666666 aecedaabed

888888888 bbccdeebed

777x77777 dbcaaaabed

888888888 bbccdeebed

999999999 cdcdadabed

The duplicate entries are the repeated lines for IDs 111111111 and 888888888. The following program detects and eliminates these duplicate observations:

Program 4.4: Program to Detect and Eliminate Duplicate Records in the Answer File

data check_for_dups;

   infile 'c:\books\test scoring\duplicates.txt' pad;

   input @1 ID $9.

         @11 (Ans1-Ans10)($upcase1.);

run;

proc sort data=check_for_dups noduprecs;

   by ID;

run;

The DATA step reads text data from the duplicates.txt data file. Although the first line of data is really the answer key, for purposes of detecting duplicates, you can treat it the same as student answers. The PROC SORT option NODUPRECS (no duplicate records) checks for duplicate records (that is, two records that are exact duplicates), keeps the first one, and deletes successive duplicates. It is a good idea to include this test before you attempt to score a test or perform item analysis. It is especially important to remove duplicate records if you plan to merge your test data with a file containing student names and IDs.
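The effect of NODUPRECS can be sketched in a few lines of Python. Note one deliberate simplification in this illustration (the function name is an assumption, not from the book): PROC SORT with NODUPRECS removes records that are identical to the preceding record after sorting, whereas this sketch removes all exact duplicates while preserving the original order.

```python
# Illustrative sketch of duplicate-record removal (cf. NODUPRECS):
# keep the first copy of each exact duplicate record, drop the rest.
def drop_duplicate_records(records):
    """Return records with exact duplicates removed, order preserved."""
    seen = set()
    kept = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            kept.append(rec)
    return kept
```

Applied to the 13 lines of duplicates.txt, this drops the two repeated records and leaves 11, matching the counts in the SAS Log.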

You can see in the listing of the SAS Log (below) that two duplicate records were detected and deleted.

SAS Log from Program 4.4

6    proc sort data=check_for_dups noduprecs;

7       by ID;

8    run;

NOTE: There were 13 observations read from the data set

      WORK.CHECK_FOR_DUPS.

NOTE: 2 duplicate observations were deleted.

NOTE: The data set WORK.CHECK_FOR_DUPS has 11 observations and 11

      variables.

NOTE: PROCEDURE SORT used (Total process time):

      real time           0.04 seconds

      cpu time            0.01 seconds

The notes in the SAS Log tell you that 13 observations were read, 2 duplicate observations were deleted, and data set CHECK_FOR_DUPS now has 11 observations.

Conclusion

You should routinely perform data validity checks before attempting to run any of the item analysis programs described in this book. A possible exception is a scanner with sophisticated software that performs data integrity checks as the tests are scanned, but most scanners on the market do not perform such checks.

A SAS macro is included in Chapter 11 to automate the task performed by Program 4.1.