The Importance of Data Quality in Machine Learning

Imagine you’re driving a high-performance car, and you fill its tank with contaminated diesel. You wouldn’t be surprised when the engine starts sputtering and coughing, leaving you stranded on the side of the road. Just as bad fuel can wreak havoc on your vehicle, poor-quality data can bring your machine learning models to a grinding halt, no matter the quality of the models themselves.

In the world of data analytics and artificial intelligence, data is the fuel that powers the engines of innovation and insight. The quality of this fuel is just as critical as its quantity – perhaps even more so. In any case, striking the right balance between quality and quantity is a tricky but vital task.

Data Quality: The Five Key Characteristics

Data quality is generally defined by the following traits:

  1. Accuracy – Is the information provided correct in every detail? Accuracy is the cornerstone of data quality. Ensuring that the information provided is not only precise but also free from errors is pivotal. Accuracy establishes trust in the insights derived from the data and is fundamental for any data-driven decision-making process. Inaccurate data can lead to misguided conclusions.
  2. Completeness – How comprehensive is the data? Complete data contains all the necessary information without significant gaps. Incomplete data, on the other hand, can introduce uncertainty and hinder accurate training of models. As well as affecting quality, incomplete data can decrease quantity when the extent of the missing data requires the removal of rows or columns.
  3. Relevance – Is the data sufficient to address your specific scenario? It’s essential to determine whether the data contains the necessary elements to provide valuable insights relevant to your objectives. Irrelevant data should also be minimised or excluded, as data that lacks relevance to the topic can lead to noise and distractions.
  4. Reliability – Is the data consistent with itself and other reliable sources? Reliable data stays stable when the same thing is measured again, which makes it suitable for analysis and decision-making. It can be assessed through methods such as test-retest reliability, inter-rater reliability, split-half testing, and calculating standard deviation.
  5. Timeliness – Is the information up-to-date, or has something changed since the data was recorded? Timeliness is vital because the data’s relevance may diminish over time. Outdated information can lead to flawed conclusions or, in some cases, missed opportunities. Therefore, depending on the goal, it is crucial to establish regular data updates. This is particularly relevant when dealing with real-time analytics or time-sensitive decision making.
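As a rough illustration of the reliability checks mentioned in point 4, test-retest reliability is commonly summarised as the correlation between two measurement rounds of the same items. The sketch below is only an assumption about how that might look in practice: the lists `first_round` and `second_round` and their values are invented for the example, and the `pearson` helper is written by hand rather than taken from any particular library.

```python
import math
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings of the same five items, taken on two occasions.
first_round = [10.1, 12.4, 9.8, 15.0, 11.2]
second_round = [10.3, 12.1, 9.9, 14.7, 11.5]

reliability = pearson(first_round, second_round)
spread = stdev(first_round)  # sample standard deviation of one round

print(f"test-retest correlation: {reliability:.3f}")
print(f"standard deviation of first round: {spread:.3f}")
```

A correlation close to 1 suggests the two rounds agree closely. What counts as "reliable enough" depends on the use case and is not something this sketch decides.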

While these five traits are fundamental, other considerations are equally important, such as addressing the presence of bias.

Bias in Data

Bias in data can stem from various sources and manifest in different ways. It is often unintentional and easily overlooked, but it can significantly affect the reliability and accuracy of machine learning models.

One example is Amazon’s AI recruitment tool, which developed a bias against women. The models used in this tool were trained on resumes submitted to the company over a 10-year period. Because the tech industry was male-dominated at the time, most of those resumes came from men. As a result, bias was incorporated into the system and the AI “learned” that male candidates were preferable. Resumes containing the word “women’s” were penalised.

Amazon edited the tool to make it neutral to gender-related terms, but there was no guarantee that it wouldn’t learn other forms of human bias. Amazon stated that the tool was never used by recruiters, and because of this uncertainty it was later scrapped.

Addressing bias in data for machine learning is a complex and delicate task. Sensitivity to bias varies among algorithms, so it’s crucial to thoroughly assess and mitigate potential biases in each specific case.

Data Quantity

While data quality is undeniably vital, the quantity of data also plays a pivotal role in training robust machine learning models. Generally, a larger dataset lets a model learn more effectively and generalise better. Models trained on small datasets may appear highly accurate, but this apparent accuracy is often misleading and the result of overfitting. Overfitting occurs when a model memorises the training data rather than learning the underlying pattern, so it performs well on the training data but poorly on new data.
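The gap between training and test performance can be shown with a toy example. The sketch below is an illustration under stated assumptions, not a realistic model: it uses a 1-nearest-neighbour classifier (which simply memorises its training points), a made-up rule where the true label is 1 when `x > 0.5`, and 20% randomly flipped labels. All function names and numbers are invented for the example.

```python
import random

def nearest_label(point, training):
    """Predict by copying the label of the closest training point (1-nearest neighbour)."""
    closest = min(training, key=lambda item: abs(item[0] - point))
    return closest[1]

def accuracy(samples, training):
    """Fraction of samples whose label matches the nearest-neighbour prediction."""
    correct = sum(nearest_label(x, training) == y for x, y in samples)
    return correct / len(samples)

def make_samples(n, rng, noise=0.2):
    """True rule: label is 1 when x > 0.5. A fraction `noise` of labels is flipped."""
    samples = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        samples.append((x, label))
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_samples(20, rng)   # deliberately small training set
test = make_samples(200, rng)   # unseen data drawn the same way

train_accuracy = accuracy(train, train)  # the model has memorised these points
test_accuracy = accuracy(test, train)    # unseen points expose the gap

print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

Because the model copies the label of its nearest training point, it scores perfectly on the data it has already seen, noise included. The lower score on unseen data is the more honest estimate of how it would perform in practice.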

However, it’s crucial to note that the required dataset size for achieving high accuracy varies depending on the specific use case. Models dealing with complex tasks or many classes generally demand more extensive datasets. For example, a computer vision model designed to identify many objects in an image will need a considerably larger dataset than one classifying an image as one of two classes like ‘dog’ or ‘cat’.

Another consideration is that obtaining more data may introduce noise, which in turn reduces data quality. While some noise can help reduce the chance of overfitting, it’s important to find a good balance between quantity and quality.

How Can You Improve Your Data Quality?

  1. Establish Boundaries – When recording and collecting data, consider setting predefined boundaries to prevent the inclusion of outliers or implausible values. This proactive approach helps maintain high-quality initial data, prior to any preprocessing.
  2. Data Quality Assessment – Before making any modifications to the dataset, it’s crucial to evaluate where data quality is lacking. Common checks in this process include:
    • Identifying missing values – looking for ‘null’, ‘NaN’ or blank values.
    • Detecting duplicates.
    • Recognising outliers or anomalies – may be done through statistical summaries or visualisations.
    • Verifying data validity, such as format and data type.
  3. Data Cleansing – Address any identified issues with the following steps:
    • Handle missing values. Whenever possible, fill missing values with the mean or median for numerical variables or the mode for categorical variables. If none of these options are suitable, consider omitting the affected data row. In extreme cases, when a substantial portion of a column is missing, it may be necessary to remove the column, except when it contains essential information. Removing rows or columns should be a last resort since it reduces the quantity.
    • Remove duplicate or irrelevant observations. Irrelevant observations are those outside of the scope of your problem, for example when the dataset includes people outside of your target audience. Additionally, removing irrelevant columns helps improve the focus of the study.
    • Filter outliers. If you believe an outlier is genuinely an error, such as a typo, or will skew your model in an unrealistic way, the value can be handled as another missing value, to be deleted or filled accordingly. However, sometimes outliers contain valuable information that can benefit your model by providing realistic variability.
    • Correct structural errors and data types. Rectify any structural errors and ensure that data adheres to the correct data types for consistency.
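The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a recommended pipeline: the records, the column names `age` and `city`, and the `AGE_BOUNDS` range are all invented for the example, and a real project would more likely use a dataframe library. It assesses missing values, duplicates and out-of-range values first, then removes duplicates, treats implausible values as missing, and fills gaps with the median (numerical) or mode (categorical).

```python
from statistics import median, mode

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 34, "city": "Leeds"},
    {"age": None, "city": "York"},
    {"age": 29, "city": None},
    {"age": 34, "city": "Leeds"},   # exact duplicate of the first row
    {"age": 410, "city": "Hull"},   # implausible age, probably a typo
    {"age": 41, "city": "Leeds"},
]

AGE_BOUNDS = (0, 120)  # assumed plausible range for this example

def is_plausible(age):
    """Missing ages are handled separately; present ages must sit within bounds."""
    return age is None or AGE_BOUNDS[0] <= age <= AGE_BOUNDS[1]

def row_key(row):
    """Hashable representation of a row, used to spot exact duplicates."""
    return tuple(sorted(row.items()))

# 1. Assessment: count the problems before changing anything.
missing = {col: sum(r[col] is None for r in rows) for col in ("age", "city")}
duplicates = len(rows) - len({row_key(r) for r in rows})
out_of_range = sum(not is_plausible(r["age"]) for r in rows)
print("missing:", missing, "duplicates:", duplicates, "out of range:", out_of_range)

# 2. Cleansing: drop duplicates, keeping the first occurrence and the row order.
seen, cleaned = set(), []
for r in rows:
    key = row_key(r)
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(r))  # copy so the raw records stay untouched

# Treat out-of-range values as missing.
for r in cleaned:
    if not is_plausible(r["age"]):
        r["age"] = None

# Fill gaps: median for the numerical column, mode for the categorical one.
age_fill = median(r["age"] for r in cleaned if r["age"] is not None)
city_fill = mode(r["city"] for r in cleaned if r["city"] is not None)
for r in cleaned:
    if r["age"] is None:
        r["age"] = age_fill
    if r["city"] is None:
        r["city"] = city_fill

print(cleaned)
```

Running the assessment before any cleansing keeps a record of how much of the dataset was altered, which helps when judging whether filling values or removing rows is the better trade-off.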

How Can You Increase Your Data Quantity?

  1. Gather more data – It may seem obvious, but if your dataset is extremely small, manually gathering more data is sometimes the only option, especially for more complex data. However, the amount of data that can realistically be gathered is often limited financially and logistically. Additionally, accumulating large datasets often raises ethical concerns in terms of privacy and copyright.
  2. Data Augmentation – This is the process of artificially increasing the amount of data by generating new data points from existing data. In computer vision this may involve altering original images in a variety of ways such as flipping, rotating, cropping, or changing brightness or contrast. In other cases, SMOTE (Synthetic Minority Oversampling Technique) can be used to add new data points to a dataset, specifically when there is a significant imbalance in the target classes.
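Both kinds of augmentation can be illustrated with a short sketch. The code below is a simplified assumption of how these ideas work, not the reference SMOTE algorithm or any library's implementation: `flip_horizontal` mirrors a tiny image stored as a list of pixel rows, and `smote_like` creates synthetic minority-class points by interpolating between a sample and one of its nearest minority neighbours. The function names and sample values are invented for the example.

```python
import random

def flip_horizontal(image):
    """Mirror a 2D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def smote_like(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points, each lying on the line segment between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Squared Euclidean distance is enough for ranking neighbours.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Image augmentation: one extra training image from an existing one.
image = [[1, 2, 3],
         [4, 5, 6]]
flipped = flip_horizontal(image)

# Tabular augmentation: hypothetical minority-class samples with two features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like(minority, n_new=4, rng=random.Random(42))

print(flipped)
print(new_points)
```

Because each synthetic point is an interpolation between two real minority samples, it stays inside the region those samples already occupy. This adds quantity, but it cannot add information the original data did not contain.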

Is Perfection Ever Achievable in Machine Learning Data?

While data can be significantly enhanced through cleaning and preprocessing, it’s essential to recognise that absolute perfection in data is rarely attainable. Data will almost always contain some level of imperfection due to factors such as human error in data collection, measurement inaccuracies and inherent biases, and these can’t always be detected and corrected. The objective of preprocessing is to make data fit for its intended purpose, rather than to make it “perfect”.

Issues have been detected even in the data used to train widely used tools like ChatGPT. ChatGPT was trained on a massive corpus of text data that underwent extensive preprocessing. However, it has been known to generate false information on many occasions. In addition, its training data is not always up to date, because updating it is computationally expensive, and it contains noticeable bias.

If tools like ChatGPT have data imperfections despite the significant time, money and effort invested in them, it’s only natural that smaller projects will experience the same thing. However, it’s crucial to recognise that data imperfections can have ethical implications. As demonstrated by the Amazon recruitment tool, data quality issues can negatively affect people and cause harm. In these cases, any imperfections, no matter how small, are significant. Ethical considerations must come to the forefront before these models are used in the real world.

The takeaway is that while we can significantly enhance data quality, we should have realistic expectations about the attainability of perfection. The focus should be on continually improving data quality to support machine learning models while simultaneously prioritising ethical considerations.

Get in touch!

If you want to delve deeper into data quality and discover how we can optimise the quality of your data through cleaning, get in touch with us today!

Erin Ward
