Becoming an Alteryx Certified user is a goal for many customers and members of the Alteryx Community. The number of certifications offered at Alteryx has increased significantly since the first exams were launched more than five years ago.
To create quality content, the certification team develops certification exams following industry best practices. Exam questions are carefully created, verified, and reviewed by the certification team along with select groups of subject matter experts. However, this qualitative review is not the end of the test development process!
After an exam has been updated or added to the certification portfolio, candidates taking the exam create a wealth of real-world data that can be used for quantitative statistical analysis of the questions or items on the exam. This analysis provides a way to assess the exam itself and to identify items that might not be performing as well as intended.
The statistical analysis can be broken down into two parts: the overall exam statistics and the item-level statistics. The rest of this blog post will refer to a question on an exam as an “item,” and all items are from the same exam.
This part of the analysis focuses on descriptive statistics, including the total number of attempts, the passing rate, the distribution of overall exam scores, exam duration, and the number of correct responses for each item.
These metrics provide insight into the difficulty of the exam (is the passing score too low or high?), the number of items (is the exam too short or long?), and the overall difficulty of the items (proportion of candidates answering items correctly).
Another important metric to look at is the percentage of candidates that answer each item correctly. Items with either very high or very low percentages might be too easy or difficult, respectively. These items should be flagged for further review during the more detailed item-level analysis.
The overall exam statistics are helpful in highlighting items that need further review, so it’s important to look at the detailed statistics for each item. This includes the number of responses for each answer choice and how those responses correlate to the overall exam scores of the candidates.
The process involves the use of Designer from start to finish! Let’s jump right in with how the exam data is prepared.
As with many data science projects, the raw exam data requires some preparation before it’s ready to be sent through the Data Investigation tools. The exam platform generates an exam scores dataset, which includes a record for each candidate with the total points scored, exam duration, and the answer chosen on each item. (All personally identifiable information is removed from the dataset prior to analysis in Designer).
After some additional cleaning and transformation, it’s time to dive into the statistics!
For an item analysis, we want to look at how individual item responses correlate with a candidate's overall exam scores. The data has been formatted with individual candidate responses in each row, and the selected answer for each item is recorded in the corresponding column; each column is a different exam item. For this analysis, a correct answer is given one point, and an incorrect answer is zero points (further explained below).
For example, in the following table, the RecordID corresponds to a single candidate record, the Sum_Points column is the total number of correct responses on the exam for each candidate (their total exam score), and responses to each exam item are listed in individual columns by the item ID, e.g. 35093532.
The next step is to calculate the correlation between the responses for each item (item ID columns) and the total scores on the exam (Sum_Points column). In other words, there will be a correlation coefficient for each item ID-total score pair. The example data shown above has about 700 candidate records for an exam with 47 items.
In item analysis, this correlation is typically a special case of the Pearson correlation, where we measure the relationship between a normally distributed continuous variable (total exam score - Sum_Points column) and a dichotomous variable (a variable that has only two values, in this case correct/incorrect - an item ID column).
The output from the Pearson Correlation tool is the linear correlation between each of the columns, including the correlation of a column with itself (which is 1). The larger the coefficient, the more the values in each column correlate with each other. A positive coefficient means that as one variable increases, so does the other.
In item analysis, we want the responses for each item to correlate with the overall candidate exam scores. In other words, candidates with high exam scores should also have correct answers on most of the items.
For this analysis, we only need the correlation between the candidate total score column (Sum_Points) and the item ID columns, which are the first two columns shown below.
Finally, we’ll join the correlation coefficient of each item with the percentage correct (p_value), the number of attempts, and the number of null answers for that item. The data below is sorted by the percentage correct (p_value) column.
With this combined data, we can now look for patterns or general trends. While the table view of the data is helpful in providing a quick look for any outliers or other concerning values, an even better way to look for patterns is with a plot.
For the plot below, the variables plotted are the percentage correct and the correlation, where each item is represented by a point.
In this plot, most of the points are grouped together, but there are two items that are outliers. The orange oval highlights where both the correlation and percentage correct values are lower than the general group. These two points warrant further review, which will be discussed in more detail below.
Source: GIPHY
Now that we have summarized some of the overall exam item statistics, we can take a more detailed look at the individual item statistics. Detailed reports are compiled for all items in the exam, and outliers can be reviewed more closely using these summaries.
To simplify the analysis a bit here, the overall candidate exam scores were grouped in score quintiles (0-20%, 20-40%, 40-60%, 60-80%, and 80-100%). The responses for each option (Options A, Options B, etc.) were counted for each of the score quintiles. The final plot shows the score quintile on the x-axis and the response number (percentage of the total responses) for each option as a separate color. An item that is “performing” well will be answered correctly by candidates who score high on the exam; the number of responses for the correct option should increase with the score.
To look at how the performance of an item varies, we’ll look at two different items: one with a reasonable correct percentage score (~60-70%) and one with a low percent correct score (less than ~20%).
For a brief detour into exam development:
First, an item that is performing as expected on the exam, with a chart, text, and table to summarize the responses:
For this item, the plot shows higher-scoring candidates correctly answer this item more frequently than lower-scoring candidates (blue line with the plus symbol). The candidates answering incorrectly have lower overall exam scores, which is what we expect for a well-written item that is performing as intended.
And now an item with less desirable results:
This item will need additional review because we can see that the correct option A (black line with open circles) has a low correct percentage; even the high-scoring candidates are not answering correctly. Most candidates seem to view Option C (blue line with plus symbols) as the correct answer.
A few common reasons for this type of result would be an incorrect key (where Option C might be the correct response) or the question stem might need to be rephrased to clarify the correct response. Either way, this item can now be addressed and re-tested, improving the overall quality of the exam.
The whole item analysis process is relatively simple to automate. Due to limitations with the exam hosting platform, a manual download of the exam results data is required, with an initial (but quick!) editing to remove personal information and extra rows. Once this is complete, the entire workflow can be run a single time to produce reports summarizing the overall exam and item-level statistics. The reports can then be automatically uploaded for easy review and record keeping.
As a final observation, there is (expensive) commercial software available to perform the above analysis. Fortunately, Alteryx Designer makes this analysis easy and allows for flexibility in how the data is prepared and reported based on the needs of the exam development team.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.