The task of clustering data in the age of AI has some parallels with multi-event athletic competitions
Clark Alexander and Sofya Akhmametyeva are sitting in an airy, modern downtown Chicago office, describing decathlons.
“How do you say who is the best athlete in a decathlon?” asks Alexander, a math professor at DePaul University and a mathematical engineer at Nousot, an AI-based tech startup. “You take every event that those athletes do. Then you score all of them, and the highest total gives you the best athlete. But in our case, the highest total gives you the best algorithm.”
Decathlons are both an analogy for and the basis of a solution that Alexander and Akhmametyeva didn’t set out to build, but designed and published nonetheless: a common, comprehensive standard to quantitatively assess the performance of clustering algorithms.
The two mathematicians and computer scientists’ original goal was development, not measurement. Nousot had already built an autonomous forecasting algorithm that used deep learning to deliver high initial accuracy and then improve over time, and the company wanted to do the same with a clustering algorithm.
“Clustering is perfect for big data,” says Akhmametyeva, Nousot’s lead machine learning engineer. “There’s a ton of data out there, and the user doesn’t have to be lost in it. An algorithm figures out the groupings in the data, and the user creates stories from the groupings.”
In fact, users have created literal world change from the groupings. Meaningful data clusters — those groups of elements that reveal something conclusive or useful — have helped people and organizations to do things like develop vaccines, discover species, run election campaigns, and see a tsunami coming, even before the advent of AI.
Now that AI is here, so is the technology to build algorithms that find even more precise and powerful groups in ever growing volumes of data, with little or no human input. But soon after Alexander and Akhmametyeva began the work of creating such an algorithm, they discovered that they had to move the goalposts.
“The baseline for architecting our algorithm had to address all the existing performance metrics for clustering algorithms, and improve on them,” says Akhmametyeva, who also founded AIR, a company that builds interfaces between humans and robots. “So our approach was to first identify all the quantitative metrics for determining how well a clustering algorithm does.”
That approach grew perplexing right away. “We kept not finding quantitative measures,” Alexander says.
“Every clustering algorithm is good at identifying about four types of clusters and not so good with about three others,” he continues, referring to the seven features by which clusters are commonly defined: stability, noise, complexity, homogeneity, intercluster distance, covolume, and shape.
“Every paper we read concluded with something like ‘this algorithm is good because it checks off more boxes that we care about,’ ” says Alexander. “They were qualitative evaluations — almost pros and cons — for a task that is inherently numerical.”
So Alexander and Akhmametyeva relegated algorithm development to step two, and made algorithm assessment step one, resolving to build the broad and rigorous evaluation framework that they hadn’t found. Landing on the idea of multi-event athletic competitions as a scoring model, they created a heptathlon for clustering algorithms, complete with seven “events”: those aforementioned seven cluster features that data scientists look for in unlabeled data.
Math gets in on the gift-giving too
Within their heptathlon-inspired assessment framework, Alexander and Akhmametyeva devised new, scrupulously constructed math to quantify each cluster feature.
“Each of the seven features now has a numerical range that we designed and implemented, and we can articulate what those numbers mean,” says Akhmametyeva. “Going forward, researchers can score algorithms with real measures that are tied to real values. They can pick up nuances in performance where previous methods could not.”
The “clustering heptathlon,” if you will, works as you might imagine. Just as a decathlete receives a point score for her performance in each individual event, a clustering algorithm, when put through the assessment framework, receives a point score for its performance in identifying clusters according to each of the seven features.
And just as a decathlete’s combined score determines his final standing in a competition, an algorithm’s combined score across the framework’s seven cluster features determines its performance overall.
Still, in order to appeal to a broad audience, the framework had to elegantly handle diverse units of measurement and allow researchers to designate certain cluster features as more critical than others. With these requirements in mind, Alexander and Akhmametyeva baked three parameters into the system: scale, reference point, and weight.
Scale allows point adjustments so that an overall score isn’t skewed when highly scattered scores (as those for intercluster distance can be) are added to tightly compact ones (as those for covolume can be). Reference point accounts for the fact that high scores are best for some features (like stability), while low scores are best for others (like noise). Weight enables any cluster feature to carry more or less importance, depending upon the goals of a project. Researchers can add other parameters as well, such as a maximum range of scores.
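The mechanics above can be sketched in a few lines of code. This is a hypothetical illustration only: the seven feature names come from the article, but the specific normalization formula, function names, and example numbers are assumptions, not the authors’ published math.

```python
# A minimal sketch of a heptathlon-style composite score for a clustering
# algorithm, assuming a simple linear normalization. The scale, reference
# point, and weight parameters play the roles described in the article.

FEATURES = [
    "stability", "noise", "complexity", "homogeneity",
    "intercluster_distance", "covolume", "shape",
]

def composite_score(raw, scale, reference, weight, higher_is_better):
    """Combine seven per-feature raw scores into one number.

    raw              : dict feature -> raw measurement for this algorithm
    scale            : dict feature -> divisor putting features on comparable ranges
    reference        : dict feature -> reference point each raw score is compared to
    weight           : dict feature -> relative importance of the feature
    higher_is_better : dict feature -> True if high raw values are good (stability),
                       False if low raw values are good (noise)
    """
    total = 0.0
    for f in FEATURES:
        # Normalize relative to the reference point and scale.
        normalized = (raw[f] - reference[f]) / scale[f]
        # Flip the sign for features where lower raw values are better.
        if not higher_is_better[f]:
            normalized = -normalized
        total += weight[f] * normalized
    return total

# Toy comparison of two algorithms with unit scale/weight and zero reference.
ones = {f: 1.0 for f in FEATURES}
zeros = {f: 0.0 for f in FEATURES}
direction = {f: True for f in FEATURES}
direction["noise"] = False  # low noise is good

alg_a = {f: 0.8 for f in FEATURES}
alg_a["noise"] = 0.1
alg_b = {f: 0.6 for f in FEATURES}
alg_b["noise"] = 0.4

score_a = composite_score(alg_a, ones, zeros, ones, direction)
score_b = composite_score(alg_b, ones, zeros, ones, direction)
```

Here `alg_a` wins overall: six features at 0.8 minus a flipped noise score of 0.1 gives 4.7, versus 3.2 for `alg_b`. Raising the weight on any single feature would let a researcher make that feature decisive, as the article describes.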
“We wanted to give users choices,” Akhmametyeva says. “So our assessment is like an autonomous car. You want it to drive, but sometimes you want to override it. Having both options is important.”
Putting the gifts to good use
With a well-defined clustering algorithm assessment, courtesy of the heptathlon, and a rigorous quantitative measure for each cluster feature, courtesy of advanced mathematics, Alexander and Akhmametyeva turned to their original objective: building an autonomous clustering algorithm that would score well against the assessment in any clustering project, using any type of data.
Here again the pair refer to all the papers they studied. The assessment criteria used in those studies may have been soft, but the clustering algorithms themselves were not — they were actually strong, but specialized.
“What we found is that all the authors had built up their algorithms to fit their research purposes,” says Alexander. Indeed, today there are scores of clustering algorithms that perform really well at finding certain types of clusters but not others. Each is like a decathlete with world-class sprinting skills but not distance running skills, or excellent jumping technique but not throwing technique.
Alexander and Akhmametyeva recognized the opportunity and the raw material to create the super-athlete of clustering algorithms.
“What we’ve been able to do is study what all of these authors have done, and pick and choose the best functional pieces underneath their algorithms,” Alexander says. “We took those and wove them together into our own product.”
Their autonomous clustering algorithm is currently in beta, with wide release scheduled for the first quarter of 2018. “I really wanted to name it after Jessica Ennis, the Olympic heptathlon champion from the UK,” says Alexander, “but I couldn’t make ENNIS a proper acronym.”
He’s taking suggestions.
Really putting the gifts to good use
An autonomous clustering algorithm will undoubtedly outperform humans at finding meaningful groups in massive and growing quantities of data. People excel at spotting patterns quickly, but they can’t observe and learn from hundreds of thousands, even millions, of data sets like AI-powered algorithms can.
But this does not spell the end of the human contribution to cluster analysis. Quite the opposite: it offers a kind of new beginning that highlights and even demands the human contribution. The machines merely find the clusters. People are uniquely qualified to decide how and where to use the knowledge those clusters bring. We, not the machines, will innovate and deploy the systems, services, products, and treatments that high-quality cluster discovery makes possible.
Three industries in particular are poised to benefit from the improved cluster analysis that an all-purpose clustering algorithm enables, according to Alexander and Akhmametyeva.
The first is waste management. “We’re throwing out a lot of things that have value,” says Alexander. “For example, coffee grounds add richness to soil, but we don’t have an efficient way to collect them and get them where they need to go. Now we can cluster collection efforts from coffee shops and other places in the best possible way.”
Second: medicine. “Ideally, you’d synthesize a drug for one individual in order to treat them best,” Alexander continues. “With a clustering algorithm that carves out extremely precise groups, we can keep getting closer to that.”
Energy is the third industry that Alexander and Akhmametyeva are excited about transforming with cluster analysis. “We can group buildings to a smart grid by type, size, hours of energy consumption, and a lot of other variables,” explains Alexander, “and then optimize energy load shift.”
Happy holidays, cluster analysis. You’ve been upgraded.