Foreword

Our company, ISL, was a provider of Artificial Intelligence tools and technology to organizations developing advanced software solutions. By 1992, what had started as a casual interest from our clients in applying some of our tools—the machine learning modules—to their historical data had evolved into a promising practice in what was to become known as data mining. This was developing into a nice line of business for us, but was frustrating in a couple of ways:

First, we'd always intended that ISL should be a software supplier. Yet here we were, because of the complexity of the technologies involved, providing data mining on a consulting services basis.

Second, we were finding that data mining projects involved a lot of hard work, and that most of that work was boring. Unearthing significant patterns and delivering accurate predictions…that part was fun. But most of our effort went on mundane tasks such as manipulating data into the formats required by the various modules and algorithms we applied.

So we built Clementine—to make our job easier and allow us to focus on the interesting parts of projects, and to give us a tool we could provide to our clients. When the first prototypes were ready, we tested them by using them to re-run projects we'd previously executed manually. We found that work which had previously taken several weeks was now reduced to under an hour; we'd obviously got something right.

As the embryonic data mining market grew, so did our business. We saw other vendors, with deeper pockets and vastly more resources than little ISL, introduce data mining tools, some of which tried to emulate the visual style of Clementine's user interface. We were relieved when, as the inevitable shoot-outs took place, we found time and time again that evaluators reported our product had a clear edge, both in productivity and in the problem-solving power it gave to analysts.

On reflection, the main reason for our success was that we got a number of crucial things right:

Clementine's design and implementation, from the ground up, were object-oriented. Our visual programming model was consistent and "pure"; learn the basics, and everything is done in the same way.

We stuck to a guiding principle of, wherever possible, insulating the user from technology details. This didn't mean we made it for dummies; rather, we ensured that default configurations were as sensible as possible (and in places, truly smart—we weren't AI specialists for nothing), and that expert options such as advanced parameter settings were accessible without having to drop below the visual programming level.

We made an important design decision that predictive models should have the same status within the visual workflow as other tools, and that their outputs should be treated as first-order data. This sounds like a simple point, but the repercussions are enormous. Want more than the basic analysis of your model's performance? No problem—run its output through any of the tools in the workbench. Curious to know what might be going on inside your neural network? Use rule induction to tell you how combinations of inputs map onto output values. Want to have multiple models vote? Easy. Want to combine them in more complex ways? Just feed their outputs, along with any data you like, into a supermodel that can decide how best to combine their predictions.
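To make the idea of treating model output as first-order data concrete, here is a minimal sketch in Python with scikit-learn (not Modeler itself, and not from the original text): a surrogate decision tree induces rules describing a neural network's predictions, and a simple "supermodel" learns how to combine two base models' outputs. The dataset, model choices, and feature layout are illustrative assumptions only.

```python
# Illustrative sketch: model predictions treated as ordinary data,
# rule induction used to describe a black-box model, and a "supermodel"
# that combines base models' outputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A "black box" model whose predictions we treat as first-order data.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
net_pred = net.predict(X_test)

# Rule induction on the network's output: a surrogate tree describing how
# combinations of inputs map onto the network's predicted values.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_test, net_pred)
print(export_text(surrogate))

# A simple "supermodel": feed the base models' outputs, along with any data
# you like, into another model that decides how to combine their predictions.
# (A production version would use out-of-fold predictions to avoid leakage.)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
meta_features = np.column_stack([
    net.predict_proba(X_train)[:, 1],
    tree.predict_proba(X_train)[:, 1],
    X_train[:, :2],  # original input columns can be mixed in as well
])
supermodel = LogisticRegression().fit(meta_features, y_train)
```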

The first two give productivity, plus the ability to raise your eyes from the technical details, think about the process of analysis at a higher level, and stay focused on each project's business objectives. Add the third, and you can experiment with novel and creative approaches that previously just weren't feasible to attempt.

So, 20 years on, what do I feel about Clementine/Modeler? A certain pride, of course, that the product our small team built remains a market leader. But mainly, over the years, awe at what I've seen people achieve with it: not just organizations that have made millions (sometimes even billions) in returns from their data mining projects, but those who've done things that genuinely make the world a better place: from hospitals and medical researchers discovering new ways to diagnose and treat pediatric cancer, to police forces dynamically anticipating levels of crime risk around their cities and deploying their officers accordingly, with the deterrent effect reducing rates of murder and violent crime by double-digit percentages. And also, a humble appreciation for what I've learned over the years from users who took what we'd created—a workbench and set of tools—and developed, refined, and applied powerful approaches and techniques we'd never thought of.

The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art.

When Keith McCormick approached me about writing this foreword, he suggested I might like to take a "then" and "now" perspective. This is certainly an interesting "now" in our industry. The advent of Big Data—huge volumes of data, of many varieties and varying veracity, available to support decision making at high velocity—presents unprecedented opportunities for organizations to use predictive analytics to gain value. There is a danger, though, that some of the hype around this will confuse potential adopters and confound their efforts to derive value for their business. One common misconception is that you just get all the data you can together, and then poke around in the hope of finding something valuable. This approach—"tell me something interesting in this data"—was what we always considered "the data mining question from hell", and is very unlikely to result in real, quantifiable benefit. Data mining is first and foremost a business activity, and needs to be focused on clear business objectives and goals, hence the crucial business understanding phase in CRISP-DM that starts every data mining project.

Yet more disturbing is the positioning of Big Data analytics as something that can only be done by a new breed of specialist: the "data scientist". Having dedicated experts drive projects isn't in itself problematic—it has always been the case that the majority of predictive analytics projects are led by skilled analytical specialists—but what is worrying is the set of skills being portrayed as core to Big Data projects. There is a common misapprehension that analytics projects can only be executed by geeks who are expert in the technical details of algorithms and who do their job by writing screeds of R code (with this rare expertise, of course, justifying immense salaries).

By analogy, imagine you're looking to have a new opera house built for your city. Certainly, you have to be sure that it won't collapse, but does that mean you hand the project to whoever has the greatest knowledge of the mathematics and algorithms around material stress and load bearing? Of course not. You want an architect who will consider the project holistically, and deliver a building that is aesthetically stunning, has acoustic properties that fit its purpose, is built in an environmentally sound manner, and so on. Of course, you want it to stay up, but applying the specialist algorithms to establish its structural integrity is something you can assume will be done by the tools (or perhaps specialist sub-contractors) the architect employs.

Back to analytics: 20 years ago, we moved on from applying the technology manually and programmatically to using tools that boosted the analyst's productivity and kept their focus on how best to achieve the desired business results. With the technology to support Big Data now able to fit behind a workbench like Modeler, you can deliver first-class results without having to revert to the analytical equivalent of chipping tools from lumps of flint. From this book, you can learn to be the right sort of data scientist!

Finally, for lovers of trivia: "Clementine" is not an acronym; it's the name of the miner's daughter with big feet immortalized in the eponymous American folk song. (It was my boss and mentor, Alan Montgomery, who started singing that one evening as we worked on the proposal for a yet-to-be-named data mining tool, and we decided it would do for the name of the prototype until we came up with something more sensible!) The first lines of code for Clementine were written on New Year's Eve 1992, at my parents' house, on a DECstation 3100 I'd taken home for the holidays. (They were for the tabular display that originally provided the output for the Table node and Distribution node, as well as the editing dialogs for the Filter and Type nodes.) And yes, I was paged immediately before the press launch in June 1994 to be told my wife had just gone into labor, but she had already checked with the doctor that there was time for me to see the event through before hurrying to the hospital! (But the story that I then suggested the name "Clementine" for my daughter is a myth.)

Colin Shearer

Co-founder of Integral Solutions Ltd.,

Creator of Clementine/Modeler