Grade Inflation - or How I Learned to Stop Worrying and Love the AI

Two recent developments hint, I think, at the future of education with AI: an AI-assisted proof of an open Erdős problem was recently published here, and there was recent news about Harvard capping the percentage of A grades in undergraduate classes. Both are related to the evident fact that our current understanding of education is quickly becoming outdated. In this post I write about my own experience designing a class where students could freely use AI.

AI generated art of a Tiepolo-like painting of a student using AI.

A brief history of modern education

The current education system took shape, more or less, in the 15th century, with the adoption of Johannes Gutenberg's movable-type printing press. Mass-printed textbooks were an incredible innovation in education, as they allowed anyone to learn from great masters in any location at any time. This coincided with the Renaissance and the growth of university education. Many discoveries followed, because information could now be transmitted over spans of time longer than the lifetime of any single person. These include the invention of calculus, the discovery of the orbits of the planets, and pretty much all modern discoveries, including the Theory of Relativity.

The common factor is that, even within the finite lifespan of a single person, information and discoveries accumulated over hundreds of years became accessible to almost everyone. In some sense, written information is much more efficient than audio or video at information transmission. This leads to a very natural model of education: a curriculum is created such that a young person, through continuous education, can reach a level of mastery in multiple subjects that would be impossible to reach on their own.

What hasn’t changed is our brain: reading information is not the same as understanding it. This is why classes have tests to check understanding of the material. This is crucial, as knowledge is built through exercises, struggle and repetition. This part of learning is not optional, and it will not change unless there is a breakthrough in our understanding of the brain that lets us transmit information directly into it. Until then, we have the same biological bottleneck in our learning process.

What the AI-assisted proof of an Erdős problem tells us about research and AI in education

It is incredibly exciting that LLM tools are useful to researchers as assistive tools that increase research productivity and help tackle previously unsolved problems. Reading the proof, the strength of AI tools seems to lie in the fact that analogies, metaphors and other verbal forms allow the LLM to uncover connections that a proficient researcher can then explore exhaustively, leading to breakthroughs that a single person couldn’t reach with their own body of knowledge. This hints at a future with less specialization in research and a return to an era of polymaths.

The challenge in designing a class in the AI era

Taking these points into account, a class designed around the use of AI should satisfy these requirements:

  1. There is no textbook, as the AI is, in a way, all textbooks.
  2. Tests should be non-trivial to AI.
  3. Tests should result in actual learning.

The first point is immediate and the easiest. In some sense, the only detail to define is the scope of a curriculum that can be tackled in a semester or an academic year. This calibration is hard, because AI allows students to solve problems that they previously couldn’t solve on their own.

The second point is interesting as well: it requires forgetting much of what we know about education, because putting a band-aid on an old curriculum will not work. This connects to the first point. It’s not that the AI is actually thinking, but rather that the tasks we give it are too easy. It’s as if, in a PE class, you allowed students to use bicycles on a 400m course: clearly, every student on a bike will finish far faster than even the fastest athlete on foot. The point is that when AI is allowed, the goal should be raised proportionally. I discuss this further when describing the design of my own class.

The third point is perhaps the hardest: once a curriculum and a set of tests are established with AI in mind, how do we check that students are actually learning? Here I think the clue is that evaluation metrics need to be reimagined: instead of a single fixed goalpost, use a variable one, where the interaction between students becomes crucial in the learning process. More on this and grade inflation at the end.

Designing a Supply Chain Analytics class

I started designing my class loosely following the principles above. For the material, I initially tried to follow a collection of textbooks, but I soon realized that I first needed to settle the overall theme of the class (and I ended up writing the draft of a book): that is, answering the question of how to create a meaningful collection of problems that AI could not solve easily.

The beauty of Operations Research is that it lives at the intersection of incredibly relevant real-world problems, the limits of computational tractability, and probabilistic uncertainty. Most modern research tends to focus on mathematical complexity, and ends up stale because it does not take advantage of the true potential of the discipline.

I knew that the class could be built around solving real-world problems with sample data, mimicking a real-world consulting project. This also had the advantage that performance could be evaluated objectively and out-of-sample, by calling each student's algorithm sequentially, multiple times. It is a kind of Kaggle evaluation, except that the model is called sequentially.
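To make the idea of sequential out-of-sample evaluation concrete, here is a minimal sketch for a demand forecasting task. It is not the grading harness used in the class: the `Forecaster` interface, the `warmup` parameter, the mean-absolute-error metric and the demand numbers are all assumptions made up for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interface: a submitted model maps the demand history
# observed so far to a forecast for the next period.
Forecaster = Callable[[Sequence[float]], float]


def evaluate_sequentially(model: Forecaster, series: Sequence[float], warmup: int) -> float:
    """Mean absolute error of one-step-ahead forecasts after a warm-up period."""
    if not 0 < warmup < len(series):
        raise ValueError("warmup must leave at least one period to score")
    errors = []
    for t in range(warmup, len(series)):
        forecast = model(series[:t])  # the model only sees periods before t
        errors.append(abs(forecast - series[t]))
    return sum(errors) / len(errors)


def last_value(history: Sequence[float]) -> float:
    """Naive baseline: tomorrow looks like today."""
    return history[-1]


def moving_average(history: Sequence[float], window: int = 3) -> float:
    """Slightly less naive baseline: average of the last few periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Made-up demand series, for illustration only.
demand = [10, 12, 11, 13, 12, 14, 13, 15]
for name, model in [("last value", last_value), ("moving average", moving_average)]:
    print(name, evaluate_sequentially(model, demand, warmup=3))
```

Because the model is called once per period and never sees the value it is asked to predict, the score cannot be improved by fitting the evaluation data directly, which is the property the sequential setup is meant to provide.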

Some examples of the assignments were: a demand forecasting problem, a data-driven inventory management setting, a tanker-loading and transporting problem and a cloud-computing for AI allocation problem.

For those familiar with real-world consulting, it’s easy to see how to give nuance to an instance of a problem in multiple ways: by adding certain statistical patterns seen in real-world data, or by designing instances of the problem where a simple solution performs well, but where there is room for improvement for curious students.

Creating incentives for learning and exploring

The last ingredient was aligning the incentives of the class so that putting effort into learning has a tangible payoff. I clearly remember taking a stochastic models class where the assignments were graded only for completion and a solution manual was available. At the beginning of the semester I diligently solved the problem sets, but as the semester became busier, I just used the solution manual to complete them. Nowadays, as students face similar situations, it is naive to expect them not to use AI (whether allowed or not).

I decided to make grading relative to the performance of the class. This is not a silver bullet, but it at least ensures that the average student will put in an average amount of effort. Because grading becomes almost zero-sum, it also creates an incentive for academic integrity. The natural downside is that it might dissuade collaborative effort, but this can be mitigated by making the assignments group assignments rather than individual ones. In some sense, this is aligned with the philosophy behind Harvard's cap on A grades. In a way, the job market (like many other markets, such as the marriage market) works this way. Most markets work as intended when there is fair competition.
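As an illustration of what relative grading can look like, here is a small sketch that converts raw scores into z-scores and percentile ranks. The post does not specify the exact rule used in the class, so both functions and the example scores are assumptions, not the actual grading scheme.

```python
from statistics import mean, pstdev


def normalize_scores(raw):
    """Z-scores: distance from the class mean, in units of the class standard deviation."""
    mu = mean(raw)
    sigma = pstdev(raw)
    if sigma == 0:
        return [0.0 for _ in raw]  # everyone tied, nobody is above or below the class
    return [(x - mu) / sigma for x in raw]


def percentile_ranks(raw):
    """Fraction of classmates that each student strictly outperforms."""
    n = len(raw)
    if n < 2:
        raise ValueError("relative grading needs at least two students")
    return [sum(other < x for other in raw) / (n - 1) for x in raw]


# Made-up raw scores, for illustration only.
raw_scores = [50, 60, 70, 80, 90]
z_scores = normalize_scores(raw_scores)
print(z_scores)
print(percentile_ranks(raw_scores))
print(sum(z_scores))  # z-scores sum to zero up to rounding
```

The z-scores always sum to zero, which is one way to see the zero-sum nature of this kind of grading: a student can only move up relative to the class if someone else moves down.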

Did the students actually learn?

Allowing the students to use AI created a lingering question in my mind: If the AI is truly advanced, no matter the assignment the only variation I should observe in the scores should be due to the randomness in the data. Said another way: If the AI is going to replace white collar jobs, then all scores should be the same, as thinking is outsourced to the AI.

Fortunately, this was not the case! In fact, there is a clear pattern in which student scores differ meaningfully from the AI baseline. As a first example, take the tanker-loading and transporting problem, which is a difficult TSP with additional dynamic-programming structure. See the distribution of scores:

Raw scores of class for take-home Midterm using AI.

Most students do considerably better than the AI baseline (a one-shot prompt asking the AI to solve the problem). This shows that humans do make a difference in the scores, and that they do far better than a one-shot prompt that lacks the proper context and understanding of the problem.

For all assessments (including midterms and the final), I allowed students to freely use AI. If the human element made no difference, the correlation between the normalized scores of different assessments should be close to 0: the AI would reach the same solution for everyone, and the only remaining variation would come from randomness in the data. Here is the correlation matrix of all assessments in the chronological order they were presented:

Correlation matrix of the assignments in chronological order.

A positive correlation means that a good score in one assessment tends to go together with a good score in another. Here, we can see that most scores are positively correlated, except for the second midterm. This is expected, as this midterm evaluated a different set of skills (more focused on optimization and risk management than the others). This supports the view that AI is just a tool and that the students are making the difference. Moreover, the positive correlation hints at sustained engagement and consistency across the assignments.
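The reasoning behind the correlation matrix can be checked on synthetic data. The sketch below is not the class data: it simulates scores under two assumed scenarios, one where each score is pure noise (the "AI does everything" case) and one where a hypothetical per-student `skill` term is added to the noise. The simulation model, `skill_weight` and all sizes are assumptions chosen for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(scores):
    """scores[k] holds every student's score on assessment k."""
    return [[pearson(a, b) for b in scores] for a in scores]


def simulate(n_students, n_assessments, skill_weight, seed=0):
    """Synthetic scores: skill_weight * (student skill) + independent noise."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_students)]
    return [
        [skill_weight * s + rng.gauss(0, 1) for s in skill]
        for _ in range(n_assessments)
    ]


for label, weight in [("noise only", 0.0), ("skill + noise", 1.0)]:
    matrix = correlation_matrix(simulate(5000, 3, weight))
    print(label)
    for row in matrix:
        print(["%.2f" % value for value in row])
```

Under the noise-only scenario the off-diagonal entries hover around 0, while a persistent student-level term pushes them clearly above 0, which is the pattern the argument above relies on.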

Lastly, to evaluate how learning evolved over time, consider these metrics. Let $X_i$ be the normalized score of assignment $i$ and let $w_iX_i$ be its weighted score. Likewise, let $Y_j=\frac{1}{j}\sum_{i=1}^jX_i$ be the unweighted average score up to assignment $j$ and let $\tilde Y_j=\sum_{i=1}^jw_iX_i$ be the cumulative weighted score up to that assignment. The covariances $\operatorname{cov}(X_j,Y_{j-1})$ and $\operatorname{cov}(w_jX_j,\tilde Y_{j-1})$, taken across students, should be stable and positive as a sign that learning is effective. See the following plot of these:

Sequential correlation of assignments.

Except for midterm 2, the curve is positive and non-decreasing, which hints that students were successfully learning and that the AI was in fact helping rather than hindering learning.
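For readers who want to reproduce this kind of metric on their own class, here is a minimal sketch of how $\operatorname{cov}(X_j,Y_{j-1})$ and $\operatorname{cov}(w_jX_j,\tilde Y_{j-1})$ could be computed. It assumes the covariance is the sample covariance across students and that `scores[i]` lists every student's normalized score on assignment `i` in a fixed student order; the function names and the toy numbers are made up for illustration.

```python
from statistics import mean


def covariance(x, y):
    """Sample covariance across students (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def sequential_covariances(scores, weights=None):
    """Covariance of each assignment with the running score before it.

    Without weights: cov(X_j, Y_{j-1}), where Y is the running unweighted average.
    With weights:    cov(w_j X_j, Y~_{j-1}), where Y~ is the running weighted sum.
    Returns one value per assignment, starting from the second one.
    """
    running = [0.0] * len(scores[0])  # per-student sum over earlier assignments
    result = []
    for j, assignment in enumerate(scores):
        w = 1.0 if weights is None else weights[j]
        current = [w * x for x in assignment]
        if j > 0:
            past = [r / j for r in running] if weights is None else running
            result.append(covariance(current, past))
        running = [r + c for r, c in zip(running, current)]
    return result


# Toy data: 3 assignments (rows) for 3 students (columns), for illustration only.
toy_scores = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
print(sequential_covariances(toy_scores))
print(sequential_covariances(toy_scores, weights=[0.2, 0.3, 0.5]))
```

In the toy data the third assignment reverses the ranking of the students, so its covariance with the running score is negative, analogous to an assessment that tests a different set of skills.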

Conclusion

While not a silver bullet, the approach presented here is an encouraging sign that there are ways to incorporate AI in the classroom such that students acquire practical abilities that will serve them in a competitive job market. I think the future is very promising in this regard.