No dataset pitfall

Machine learning is not about solving some random problem that looks commercially appealing. It is all about finding a problem for which a good training dataset can be acquired.

For example, which is harder: speech recognition or OCR? The two tasks look similar at first glance. Both attempt to recognize a sequence of words. The only difference is that the first works on an audio stream, while the second works on a stream of glyphs. And yet speech recognition is much harder.

The trouble is that it is difficult to build a good dataset for speech recognition. The only way to do this is to hire an army of native speakers and ask them to manually transcribe a collection of audio files. This process is extremely expensive and time-consuming. It can take half a year to build a first dataset that is barely sufficient for training, and much longer to develop it into a high-quality, representative dataset that includes corner cases such as rare accents.

This problem is not so severe with OCR. A training dataset for OCR can be constructed automatically, without relying on manual labor. The idea is to render sample text into an image file and then distort it to resemble real-world scanned text: rotate it, blur it, add noise, apply a brightness gradient, etc. Repeating this process with different texts and fonts makes it possible to create a dataset of decent quality in fully automatic mode. Of course, some actually scanned texts are required to make the dataset representative, but this is not a bottleneck for ML engineers and can be done later.
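The render-then-distort loop can be sketched in a few lines. This is only an illustration under loud assumptions: the 3x5 digit "font", the function names `render`, `distort` and `make_dataset`, and the noise parameters are all made up for the example. A real generator would render real fonts with an imaging library and would also apply rotation and blur.

```python
import random

# Hypothetical 3x5 bitmap "font" for a few digits; a real pipeline would
# render real fonts with an imaging library instead.
FONT = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
}

def render(text):
    """Render text into a 5-row grayscale image (0.0 = paper, 1.0 = ink)."""
    rows = []
    for r in range(5):
        row = []
        for ch in text:
            row.extend(float(bit) for bit in FONT[ch][r])
            row.append(0.0)  # one-pixel gap after each glyph
        rows.append(row)
    return rows

def distort(image, rng, noise=0.2, gradient=0.3):
    """Imitate a scan: left-to-right brightness gradient plus pixel noise."""
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x, pixel in enumerate(row):
            value = pixel + gradient * x / (width - 1) + rng.uniform(-noise, noise)
            new_row.append(min(1.0, max(0.0, value)))  # clamp to [0, 1]
        out.append(new_row)
    return out

def make_dataset(n_samples, length=4, seed=0):
    """Return (distorted_image, label) pairs; the label is known for free."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        text = "".join(rng.choice(sorted(FONT)) for _ in range(length))
        dataset.append((distort(render(text), rng), text))
    return dataset
```

The point of the sketch is that the label is never typed in by a person: the generator already knows which text it rendered.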

Machine learning problems can be roughly classified into the following tiers depending on the source of the dataset: naturally existing datasets; programmatically constructed datasets; manually created datasets.


Tasks with naturally existing datasets form the lowest tier. For instance, the problem of detecting the sex and age of a person from a photo is exactly of this type. A dataset for this problem can be constructed by scraping social network profiles. Another problem of this type is predicting the next word while a user is typing a text message, provided that an archive of previously sent messages exists. Similar problems are predicting the geolocation of a user, forecasting stock prices, and recommending movies. Even when data already exists, problems in this category can pose a significant challenge because the data can be dirty or biased. People do not always specify their true sex and age in profiles, and recommender systems are prone to bias due to the bubble effect. Still, naturally existing dirty data is better than no data at all.

The second tier is formed by problems which do not have natural datasets, but for which datasets can be constructed automatically. One such problem, OCR, was described above. Datasets for a large number of image enhancement problems can be constructed in a similar manner, including image colorization, deblurring, and upscaling. The construction process consists of downloading a collection of good original images and then artificially degrading them, e.g. converting them to black and white. The goal of the ML model is to reconstruct the original image. Datasets for other problems, such as audio denoising, can be constructed by mixing multiple samples.
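As a rough sketch of the "mixing" idea for audio denoising, the snippet below adds scaled noise to a clean signal and keeps the clean signal as the training target. Everything here is an assumption made for illustration: the function names, the gain range, and the synthetic `tone` and `hiss` signals that stand in for real recordings.

```python
import math
import random

def make_denoising_pair(clean, noise, noise_gain):
    """Mix a clean signal with scaled noise; the clean signal is the target."""
    noisy = [c + noise_gain * n for c, n in zip(clean, noise)]
    return noisy, clean

def make_denoising_dataset(clean_clips, noise_clips, n_pairs, seed=0):
    """Build (noisy_input, clean_target) pairs from two pools of clips."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        clean = rng.choice(clean_clips)
        noise = rng.choice(noise_clips)
        gain = rng.uniform(0.1, 0.5)  # hypothetical range of noise levels
        pairs.append(make_denoising_pair(clean, noise, gain))
    return pairs

# Stand-ins for real recordings: a 440 Hz tone sampled at 8 kHz and white noise.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
hiss_rng = random.Random(1)
hiss = [hiss_rng.uniform(-1.0, 1.0) for _ in range(800)]

pairs = make_denoising_dataset([tone], [hiss], n_pairs=5)
```

The same pattern applies to image tasks: the degraded copy is the model input and the untouched original is the expected output.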

The third and highest tier consists of problems which require manual construction of the dataset. Speech recognition, adult content detection, search engine ranking: all these problems require data to be manually labeled. If a problem falls into this tier, it would take 2-10 times longer to solve than a "similar" problem from the other tiers. For example, the British National Corpus, which is widely used for training language models, was constructed over three years with input from multiple organizations. It is doubtful that a small company would be able to provide such resources. Manual labeling is not only time-consuming, it also requires extra effort from the ML engineer, who has to assemble the training dataset piece by piece, knowing that every data batch scheduled for labeling costs money. Every new batch must increase the overall representativeness of the dataset, or the money will be wasted. Selecting such batches is significantly harder than filtering dirty data (tier 1) or creating additional data mutators (tier 2).

When approached with a new ML project, the first question to ask is: how hard will it be to create the dataset? People without a technical background often forget about data. In fact, more effort is spent on dataset construction than on fitting the ML model. Projects which have to rely on manual data labeling are particularly prone to incorrect expectations. Think twice before committing to such projects.

No classification error allowed

When building a classifier, only one type of error can be minimized.

This is an example of a typical conversation between a manager and a tech lead:

Manager. Hello, I have an idea that will earn us a fortune.

TechLead. What is it?

Manager. There are a lot of prospective customers who are willing to pay for a classifier that is able to mark images with explicit content: forums, social networks, file hosting services, in fact all sorts of public web sites where people can upload photos. We will be the first to implement this classifier and will become rich.

TechLead. Nope, we won’t be able to create such a classifier.

Manager. What are you talking about? I do not see any offensive images in Google image search. Surely they can do this, and so can we.

At this point the tech lead has to explain the concepts of precision and recall to the manager. A classifier can be optimized for only one type of error. Either it will erroneously filter out some good images in addition to the bad ones (false positives, which lower precision), or it will allow some bad images to pass through (false negatives, which lower recall). But it is impossible to build a classifier that is free of both errors. In other words, there is a knob in the ML model that lets you choose which error is more important. Depending on how you treat the "gray" zone, the error will shift to the left or to the right. The lower one error type, the higher the other. If the knob is set to the middle position, nothing good will happen because both error types will be higher than zero.
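The "knob" is usually a decision threshold applied to the classifier's score. The sketch below shows the trade-off on a handful of made-up scores; the numbers, the labels, and the helper name `precision_recall` are assumptions chosen only to make the effect visible.

```python
def precision_recall(scores, labels, threshold):
    """Treat score >= threshold as 'explicit' and compare with the true labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up classifier scores; True marks an image that really is explicit.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [False, False, True, False, True, False, True, True]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these numbers, the low threshold catches every explicit image but also flags innocent ones, the high threshold flags only explicit images but misses half of them, and the middle setting makes both kinds of mistakes.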


For the hypothetical classifier discussed by the manager and the tech lead, neither error is acceptable. If the classifier lets bad images through, users will be afraid to visit the web site in public or to allow children to use it. And if it blocks the upload of innocent images, users will be frustrated because they do not understand what they did wrong. Google, on the other hand, is not so constrained. It can turn the ML knob to a position where all bad images are filtered out along with some good ones. The web is huge, so it will certainly be able to find other images which match the query and also pass the classifier. If the classifier complains even slightly about an image, that image is removed and replaced with another, absolutely innocent one.

Similarly, you may have noticed that a lot of signboards in Google Street View are blurred as if they were license plates. The reason is the same: the ML knob is in the position where it correctly detects all true license plates at the cost of misclassifying some ordinary signboards as license plates.

When dealing with spam, the situation is the opposite. It is better to let some spam messages through the antispam filter than to throw away important messages. A message is filtered out only when the classifier is 99.999% sure that it is spam.

Another example of this kind is an ads engine. Suppose that a classifier is able to use location information to predict that a user’s car has just broken down on the road. A positive classification instructs the ads engine to promote nearby service centers and dealerships to this user. It is no big deal to misclassify a user and display irrelevant ads if the user stopped on the road for some random reason, but it would be inexcusable to miss a click from a person who really needs such services. Even if the classifier is so bad that four out of five people are marked incorrectly, the ads should be displayed.

It should be noted that if neither error type is acceptable, it is not the end of the world. It just means that the classifier can’t work in fully automatic mode, but only as a prefilter. The final decision must be made by a real person. In the original example of explicit content identification, the classifier may be tuned to make one of the following decisions: a) publish the photo immediately if it is sure that the photo is OK; b) immediately reject the photo and notify the user if it is sure that the photo is not OK; c) add the photo to the queue for manual review in case of uncertainty. The trouble is that it may be prohibitively expensive to pay reviewers. However, if you manage to make people work for free (usually such people are proudly called "moderators"), then it can work out.
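The three decisions a), b) and c) amount to using two thresholds instead of one. A minimal sketch follows, assuming the classifier outputs a score between 0 and 1 that estimates how likely the photo is to be explicit; the threshold values and the names `route` and `review_load` are hypothetical.

```python
def route(score, approve_below=0.05, reject_above=0.95):
    """Map a classifier score (estimated chance the photo is explicit) to an action."""
    if score < approve_below:
        return "publish"
    if score > reject_above:
        return "reject"
    return "manual_review"

def review_load(scores, **thresholds):
    """Fraction of uploads that end up in the human review queue."""
    queued = sum(1 for s in scores if route(s, **thresholds) == "manual_review")
    return queued / len(scores)
```

Moving the two thresholds apart reduces automatic mistakes but sends more photos to reviewers, so `review_load` on a sample of real scores gives a first estimate of how much human time the prefilter will cost.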

Conclusion. When building a classifier, the important question to ask is: assuming that the cumulative error will be 20%, which way would you prefer to turn the knob?

Time limits restrict model quality

Everybody even slightly familiar with machine learning knows that training is a slow process, particularly in the case of neural networks. But not everybody realizes that training time is only a small nuisance. The real problem is not the training time but the running time of the already trained model.

For example, imagine a neural network that performs the ever-popular face morphing. If it is applied to a still photo, there are no hard time limits. Users will patiently wait for a couple of seconds while their face is being morphed, provided that they get some nice visual effect in return. However, if the same model is to be applied to a real-time video stream (e.g. you want to morph your face during an online video chat), time constraints create a challenge. Smartphones produce video streams of 30 frames per second, meaning that the model must process every single frame in under 33 milliseconds. This limit is too strict for a model performing high-quality morphing. As such, either the model must be simplified, or the video resolution must be lowered, or the frame rate must be reduced. Any such simplification reduces output quality.
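The per-frame budget is simple arithmetic, spelled out below. The 80 ms model latency is a hypothetical figure used only to show the reverse calculation, and both function names are made up for this example.

```python
def frame_budget_ms(fps):
    """Time available to process one frame of a real-time stream."""
    return 1000.0 / fps

def max_sustainable_fps(model_ms_per_frame):
    """Highest frame rate a model with the given per-frame latency can keep up with."""
    return 1000.0 / model_ms_per_frame

print(f"{frame_budget_ms(30):.1f} ms per frame at 30 fps")

# A hypothetical model that needs 80 ms per frame cannot keep up with 30 fps.
print(f"{max_sustainable_fps(80):.1f} fps at 80 ms per frame")
```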

Performance issues are not unique to neural networks. Old-school machine learning algorithms may also be affected. For example, web search engines traditionally use ensembles of decision trees to find the top N relevant documents to display. These models are computationally much cheaper than neural networks, but the trouble is that to build a single result page, the model must be applied to thousands of candidate documents. Only the N highest-ranked candidates are displayed. To satisfy the time limit, developers have to use various heuristics, the most drastic of which is dropping a large portion of documents from ranking altogether. This is one of the reasons why search engines may display irrelevant results: relevant documents are actually present in the search engine database, but they were skipped by the ranking engine because of the time limit.

Another commonly forgotten aspect is the time required to preprocess "raw" features into "model" features which can be fed to a model. Raw features (e.g. pixels of the original 3264x2448 photo; or age, sex and salary history in credit score computation) are rarely fed directly into the model. They are typically preprocessed into higher-level features. Different ML algorithms impose different restrictions and thus require different preprocessing steps to make the model work correctly. In the case of photo analysis, preprocessing may include converting the photo to black and white and downscaling it to fixed dimensions of 128x128. These 16384 pixels are then used directly as input features of the neural network. In the case of credit score computation, preprocessing may consist of generating feature combinations (Age × max(Age, 30) × log(CurrentSalary)), computing statistics ("median of monthly credit payment over the course of the last 15 years"), and more computationally intensive steps like Principal Component Analysis.

This preprocessing takes time, which may even exceed the time it takes to apply the model. Feature preprocessing is generally more expensive for old-school ML models than for neural networks. The latter impose fewer restrictions, which is one of the reasons why neural networks have become popular recently. In all cases the preprocessing step does exist and takes time.

The bad thing is that this time is not included in reports generated by model training frameworks. Training frameworks work directly with "model" features, which are typically prepared beforehand and supplied in a CSV file. When training is over, frameworks produce a report that includes, among other things, the time it takes to apply the model to a single sample: "test collection: 1200 samples; score 96.4%; 4 ms/sample". This timing may be misleading since it doesn’t include preprocessing time, which is outside the scope of model training. Therefore it is strongly recommended to perform a separate test that measures the timings of the end-to-end workflow.
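One possible shape for such an end-to-end test is sketched below. The `preprocess` and `model` functions are hypothetical stand-ins, not part of any framework; in a real test they would be replaced by the actual preprocessing pipeline and the trained model, and the measurement would run on the target hardware.

```python
import time

def preprocess(raw):
    """Stand-in for feature preprocessing (hypothetical): scale raw values to [0, 1]."""
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [(v - lo) / span for v in raw]

def model(features):
    """Stand-in for an already trained model (hypothetical): a fixed linear score."""
    return sum(f * 0.5 for f in features)

def time_end_to_end(raw_samples):
    """Return average (preprocessing, model) time per sample in milliseconds."""
    prep_total = model_total = 0.0
    for raw in raw_samples:
        t0 = time.perf_counter()
        features = preprocess(raw)
        t1 = time.perf_counter()
        model(features)
        t2 = time.perf_counter()
        prep_total += t1 - t0
        model_total += t2 - t1
    n = len(raw_samples)
    return 1000 * prep_total / n, 1000 * model_total / n
```

Reporting the two averages separately shows which part dominates, and their sum is the per-sample figure to compare against the time limit.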

To summarize, below is the overall timing formula for computing a single "logical" result. Any component of this formula may dominate the others.

\mathrm{NumberOfSamples} \times \left( \mathrm{FeaturesPreprocessingTime} + \mathrm{ModelRunningTime} \right)
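To see how different components can dominate, the formula can be evaluated for two scenarios. All numbers below are invented for illustration and do not come from measurements; `total_time_ms` is just the formula written as a function.

```python
def total_time_ms(number_of_samples, preprocessing_ms, model_ms):
    """NumberOfSamples * (FeaturesPreprocessingTime + ModelRunningTime)."""
    return number_of_samples * (preprocessing_ms + model_ms)

# One video frame through a heavy model: the model term dominates.
face_morphing = total_time_ms(1, preprocessing_ms=5.0, model_ms=40.0)

# Thousands of candidate documents through a cheap model: the sample count dominates.
search_ranking = total_time_ms(5000, preprocessing_ms=0.02, model_ms=0.01)

print(f"face morphing: {face_morphing:.0f} ms, search ranking: {search_ranking:.0f} ms")
```

In the second scenario each model application is over a thousand times cheaper, yet the total is several times larger because of the number of samples.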

The following factors contribute to running time:

  • Complexity of preprocessing raw features

  • Number of features fed into the model

  • Complexity of the model itself (number of layers and connections in a neural network; number of trees and their depth in decision tree models, etc.)

  • Number of samples that are fed into the model to produce a single "logical" result

  • Hardware on which the model will run. There is a huge difference between running a model on a GTX 1080 video card, a general-purpose Intel/AMD CPU, and a smartphone ARM CPU.

If you have tight time limits, a lot of things will have to be simplified, which reduces model quality. Be prepared for that.