Let’s Stop Evaluating on Averages When the Mode Means More.

Original Post [Medium]



There are times I watch my sister, a 5th grader, mill about the world and I marvel at her energy. She bounces around, she sits, runs, does twirling back-handstands off the couch onto the ground, plays games, plays the viola, does her homework, etc. She has the capacity to do in one hour what I can do in a day. Sometimes, it feels like there are five of her around me.

And while that youthful energy astounds me, there’s something else I wonder sometimes. How could her teachers possibly center in on the real her?

Looked at another way, youthful energy is just another way of describing massive inconsistency.

And we all have that. Days where we swear we aren’t ourselves. We feel like our minds are somewhere else. Or the work we produced is unrecognizable weeks later. “That’s not me,” you say some foggy mornings. And you mean it.

And yet, from basic education on up to corporate education, we’re evaluated in a way that suggests all of these inconsistencies tell the story of who we are. You take Geometry and you get a ‘B’. You bust yourself for sales one quarter and you earn a 91 on your employee report.

These are the final demarcations of your actions for a period of time. This ‘B’ represents your knowledge of that semester’s Geometry curriculum.

But there was always that one section in Geometry you really got. And that one section you never did. You aced one test but failed another. One month of your quarter you killed it. The other was bleak. It’s been like this, this pitter-patter of performance, for decades now.

And we’re grading as such. On averages. What all of these wavering scores come out to when added up and divided by the number of evaluations.

This average is allegedly our story. Who we are. How we perform.

And yet what that gives you is a number that you never really are. It’s a balancing act, but it’s not your common state. It’s a number that tries to describe who you are, by not describing who you are.

And it’s doing us no favors.

We started this whole thing with an assumption:

Scoring things on a percentage scale should give us meaningful data. That the % you score at is the % you attained of perfection.

But that assumption has led us awry. It supposes that we could be described in this range somewhere, and that, in that description, it could prescribe us a marker. But we’ve seen that with averages, this marker is incomplete.

Wouldn’t you rather, on a scale that demarcates your approach of perfection, know where you stand on a consistent basis?

Isn’t the whole point to prescribe a reality? To understand a being and their place?

Who you are is not who you never are. It’s who you are most of the time. Who we expect, and, yes, might not always get, but when we aren’t getting this “you” it’s our perception of missing that consistency.

Anecdotally, have you ever tried to score something consistently on a 1-10 or 1-100 scale? Chances are, if you have, you’ve discovered something peculiar. Most of your grades come out to fit inside a much smaller range.

Take IMDB movie ratings, for example. Each movie on there is scored on a 1-10 scale by viewers and users. The top-rated movie is The Shawshank Redemption. Great flick. My favorite movie of all time, City of God, comes in right after. They both scored a 9.2 from user voters.

And, yet, the stunningly underwhelming 1999 feature Forces of Nature with Ben Affleck and Sandra Bullock (which you may find yourself watching at 3am on Comedy Central on lowly Tuesday nights) has a rating of 5.3.

Now, I’m more than willing to bow to the “voting effect” (in which people who like the movie are more likely to vote and vote high), but still. We’re talking about a difference of only 3.9 points between a masterpiece and a flop. That’s less than 40% of the scale.

The same is true in our education system. An A+ is a 100, right? But an F is anything less than, usually, 60%. Again, 40% differential between the best you can do and the “rest”. And then you have the drop-off where, in this system, a 12% is the same as a 51%, bell curve not involved.
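To make that drop-off concrete, here is a minimal sketch of a letter-grade lookup. The only cutoff stated above is that anything under 60% is an F; the ten-point bands for A through D, and the name `letter_grade`, are assumptions for illustration, not any particular school's scale.

```python
def letter_grade(percent):
    """Map a percentage to a letter grade.

    Only the F cutoff (below 60) comes from the post; the A-D bands
    are assumed ten-point bands for illustration.
    """
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return letter
    return "F"


# A 12% and a 51% collapse into the same letter.
print(letter_grade(12), letter_grade(51))  # F F
```

Sixty points of the scale get a single label, while the remaining forty are split across every other grade.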

This is bad. This is, quite simply, a lazy way to tell the story.

And yet this is what we perpetuate. A’s through F’s.

Here’s another one:

Margaret is a student. She takes five tests throughout the course (sounds familiar from your university days, right?). She scores a 92%, a 91%, an 87%, an 80% and then, on her last exam, she gets a 40%.

What’s her average? Well, if all is weighted equally, it’s a 78%. Margaret scored a C+.

But Margaret’s work in the class suggests much more than a C+. She aced two tests and nearly a third.
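Margaret's numbers are easy to check. The sketch below computes her equally weighted average, then two comparison figures that are not part of any grading system described here: the average of her first four exams, and the median (the middle score). Five exam percentages rarely repeat, so there is no mode to take; the median is swapped in only to show what a "typical" score looks like next to the average.

```python
from statistics import mean, median

# Margaret's five exam scores, in percent, as given above.
margaret = [92, 91, 87, 80, 40]

average = mean(margaret)                  # all five exams, equal weight
average_first_four = mean(margaret[:-1])  # same, ignoring the last exam
middle_score = median(margaret)           # the middle of the sorted scores

print(average)             # 78
print(average_first_four)  # 87.5
print(middle_score)        # 87
```

One bad exam drags her nearly ten points below the level she held on the other four.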

We’ve seen this. And if we haven’t seen it, certainly we’ve been riddled by fear of it.

Average-based grading systems both take into account the extreme examples (of both poor performance and stellar), and forget about them completely in the name of finding a middle ground.

So as an evaluator, you’re seeing inconsistencies play into a score without even being able to recognize the inconsistencies.

Consistency is how you evaluate things in your own life. Take your car for example.

If your car gave you a different output each day, even if it was mostly on the positive side, you’d go a bit crazy. Not knowing is an enormous human fear. Not being able to count on something we utilize is tough. Really, really tough. Say your car is an all-star 25% of the time, decent and average another 40%, and the other 35% it breaks down, leaks, etc. You’d start slamming your hand on the dash.

But, it’d still average out to being okay. It’d still, by average, be a “good car”. Well, a good used car.

What you ask for when buying something is consistency. What is this product going to give me day in and day out?

Same with a person who works with you, for you, or above you. It’s not going to win anyone over if you say Tom is uber-productive on Tuesdays and the rest of the week, well, who knows. You can’t rely on Tom in a standard environment. [And that is what I’m going for here: standard environments. The education system, for instance, I’d like to see measure on this rather than just total output or a ROWE-type assessment.]

I’ll take the Tom that I can measure accurately. As a teacher, I can see where he is and work with him on improving. But without knowing where he’s at, well, it makes that latter part nearly impossible.

I’m in a position where part of my job is to evaluate people. My team wanted to bring some objectivity into our evaluation systems and for a while we used a 1-10 system.

Guess what we found? 90% of scores were between 6 and 8. Worse, a rare “4/10” on a task could bring an average down egregiously. The stray and inconsistent “10” made candidates appear better than they were.

One day, I happened to meet an old teacher from my high school. He told me of a new grading system he was implementing that was loosely based off another teacher’s idea on an “evidence-based” assessment system for grading.

It struck a chord in me. It made more sense than 1-10, more than A’s, B’s and F’s.

So we adapted it. We use a 1-4 system now. No “.5’s” allowed. You have to pick 1, 2, 3 or 4.

We look in three different verticals for each evaluation. Each vertical gets a 1-4.

From there, we use a “Double Mode” system.

The mode, for those that can’t quite bring that to the front of the mind, is the most common number.

The mode is, in short, a measure of consistency. It tells you what shows up most often, where an average blends every inconsistency into one number.

We find Mode1, the most common number. And then we find Mode2, the second most common. Together, these give us our score and paint our picture.
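Here is a rough sketch of how Mode1 and Mode2 could be pulled out of a set of 1-4 scores. The function name `double_mode`, the sample scores, and the tie-breaking rule (the lower score wins a tie) are all assumptions for illustration; nothing above says how ties are settled.

```python
from collections import Counter


def double_mode(scores):
    """Return (mode1, mode2): the most and second most common scores.

    mode2 is None when only one distinct score appears.
    Ties go to the lower score; that rule is an assumption.
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(s not in (1, 2, 3, 4) for s in scores):
        raise ValueError("scores must be whole numbers from 1 to 4")
    counts = Counter(scores)
    # Most frequent first; on equal counts, the lower score first.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    mode1 = ranked[0]
    mode2 = ranked[1] if len(ranked) > 1 else None
    return mode1, mode2


# Hypothetical scores: three verticals across four evaluations.
scores = [3, 3, 2, 3, 4, 3, 2, 3, 3, 2, 3, 1]
print(double_mode(scores))  # (3, 2)
```

For these made-up scores the average is about 2.7, a number the person never actually received. The double mode says "usually a 3, sometimes a 2."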

This is how this person performs most of the time. And that information is far more important to know than where their work averages out.

We’re working on telling a story by prescribing a consistency to our people. It’s working so far and we can read into those we’re evaluating so much more fully than before (with averages).

It raises the larger question. In our wide systems (education, employee evaluation, etc.), why strive to paint an incomplete picture? Just because it’s easier?

Let’s ditch the system and work a la mode.