AI needs better human data, not bigger models

Opinion by: Rowan Stone, CEO at Sapien

AI is a paper tiger without human expertise in data management and training practices. Despite massive growth projections, AI innovations won’t be relevant if they continue training models based on poor-quality data. 

Besides improving data standards, AI models need human intervention for contextual understanding and critical thinking to ensure ethical AI development and correct output generation.

AI has a “bad data” problem

Humans have nuanced awareness. They draw on their experiences to make inferences and logical decisions. AI models are, however, only as good as their training data.

An AI model’s accuracy doesn’t entirely depend on the underlying algorithms’ technical sophistication or the amount of data processed. Instead, accurate AI performance depends on trustworthy, high-quality data during training and analytical performance tests.

Bad data has multiple ramifications for AI training: it produces prejudiced output and hallucinations born of faulty logic, and companies then lose time and money retraining models to unlearn those bad habits.

Biased and statistically underrepresented data disproportionately amplifies flaws and skewed outcomes in AI systems, especially in healthcare and security surveillance.

For example, an Innocence Project report lists multiple cases of misidentification, and a former Detroit police chief has admitted that relying solely on AI-based facial recognition would misidentify suspects roughly 96% of the time. Moreover, according to a Harvard Medical School report, an AI model used across US health systems prioritized healthier white patients over sicker Black patients.
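Skews like these can often be caught before training with a simple representation audit. The sketch below is a minimal illustration in Python; the record structure, the "group" field and the 5% threshold are assumptions made for the example, not a production fairness tool.

```python
from collections import Counter

def representation_report(records, group_key, min_share=0.05):
    """Flag subgroups that fall below a minimum share of the data set.
    The 5% threshold is illustrative, not an industry standard."""
    counts = Counter(record[group_key] for record in records)
    total = sum(counts.values())
    return {group: count / total
            for group, count in counts.items()
            if count / total < min_share}

# Example: a training set heavily skewed toward one group.
data = [{"group": "A"}] * 95 + [{"group": "B"}] * 5 + [{"group": "C"}] * 2
print(representation_report(data, "group"))  # flags groups B and C
```

An audit like this does not fix bias on its own, but it tells human curators where a data set needs rebalancing before a model ever trains on it.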

AI models follow the “Garbage In, Garbage Out” (GIGO) concept, as flawed and biased data inputs, or “garbage,” generate poor-quality outputs. Bad input data creates operational inefficiencies as project teams face delays and higher costs in cleaning data sets before resuming model training.
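A minimal pre-training filter makes the GIGO point concrete. The sketch below assumes a labeled tabular data set in pandas; the column names and label vocabulary are illustrative.

```python
import pandas as pd

def drop_garbage(df: pd.DataFrame, label_col: str, expected: set) -> pd.DataFrame:
    """Remove obvious 'garbage' before training: duplicate rows,
    missing labels and labels outside the expected vocabulary."""
    clean = df.drop_duplicates()
    clean = clean.dropna(subset=[label_col])
    return clean[clean[label_col].isin(expected)]

raw = pd.DataFrame({
    "text": ["good product", "good product", "terrible", "came late"],
    "label": ["positive", "positive", "negative", "negativ"],  # typo label
})
clean = drop_garbage(raw, "label", {"positive", "negative", "neutral"})
print(len(raw), "->", len(clean))  # 4 -> 2: one duplicate, one bad label
```

Catching garbage at this stage is far cheaper than discovering it after training, when the model has already learned from it.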

Beyond the operational effect, AI models trained on low-quality data erode companies’ confidence in deploying them, causing reputational damage that is hard to repair. One research paper put GPT-3.5’s hallucination rate at 39.6%, underscoring the need for additional human validation of model output.

Such reputational damage has far-reaching consequences: it becomes harder to attract investment, and the model’s market positioning suffers. At a CIO Network Summit, 21% of America’s top IT leaders named a lack of reliability as their most pressing reason for not using AI.

Poor training data also devalues projects and inflicts enormous economic losses. On average, incomplete and low-quality AI training data leads to misinformed decision-making that costs companies 6% of their annual revenue (roughly $60 million a year for a company with $1 billion in revenue).


Because poor-quality training data undermines both AI innovation and model training, the industry must search for alternative solutions.

The bad data problem has already forced AI companies to redirect their scientists toward data preparation: data scientists now spend almost 67% of their time building clean, correct data sets to keep AI models from delivering misinformation.

AI/ML models will struggle to produce relevant output unless specialists, real humans with proper credentials, work to refine them. This demonstrates the need for human experts to guide AI’s development by curating high-quality training data.

Human frontier data is key

Elon Musk recently said, “The cumulative sum of human knowledge has been exhausted in AI training.” Nothing could be further from the truth: human frontier data is the key to driving stronger, more reliable and unbiased AI models.

Musk’s dismissal of human knowledge amounts to a call to fine-tune AI models on artificially produced synthetic data. Unlike data curated by humans, however, synthetic data carries no real-world experience, and models trained on it have historically failed at ethical judgment.

Human expertise ensures meticulous data review and validation to maintain an AI model’s consistency, accuracy and reliability. Humans evaluate, assess and interpret a model’s output to identify biases or mistakes and ensure they align with societal values and ethical standards.

Moreover, human intelligence offers unique perspectives during data preparation, bringing contextual reference, common sense and logical reasoning to data interpretation. This helps resolve ambiguous results, understand nuances and solve problems in high-complexity AI model training.

The symbiotic relationship between artificial and human intelligence is crucial to harnessing AI’s potential as a transformative technology without causing societal harm. A collaborative approach between man and machine helps unlock human intuition and creativity to build new AI algorithms and architectures for the public good.

Decentralized networks could be the missing piece to finally solidify this relationship at a global scale.

Companies lose time and resources when they have weak AI models that require constant refinement from staff data scientists and engineers. Using decentralized human intervention, companies can reduce costs and increase efficiency by distributing the evaluation process across a global network of data trainers and contributors.

Decentralized reinforcement learning from human feedback (RLHF) makes AI model training a collaborative venture. Everyday users and domain specialists can contribute to training and receive financial incentives for accurate annotation, labeling, category segmentation and classification.
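To make the contribution step concrete, here is one hypothetical way a network could turn many contributors’ judgments into a single training signal: reliability-weighted voting over pairwise preferences. The contributor weights and data shapes below are assumptions for illustration, not any specific network’s protocol.

```python
from collections import defaultdict

def aggregate_votes(votes, reliability, default_weight=0.5):
    """Combine pairwise preference votes from many contributors,
    weighting each vote by that contributor's track record.
    Hypothetical scheme, not any specific network's protocol."""
    scores = defaultdict(float)
    for contributor, response in votes:
        scores[response] += reliability.get(contributor, default_weight)
    return max(scores, key=scores.get)

votes = [("alice", "resp_a"), ("bob", "resp_b"), ("carol", "resp_a")]
reliability = {"alice": 0.9, "bob": 0.6, "carol": 0.7}
print(aggregate_votes(votes, reliability))  # resp_a wins, 1.6 vs 0.6
```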

A blockchain-based decentralized mechanism automates compensation as contributors receive rewards based on quantifiable AI model improvements rather than rigid quotas or benchmarks. Further, decentralized RLHF democratizes data and model training by involving people from diverse backgrounds, reducing structural bias, and enhancing general intelligence.
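The “rewards for quantifiable improvements” idea can be sketched as a payout rule tied to a held-out evaluation metric. Everything below (the pool size, the metric, the linear payout) is an assumption for illustration; a real scheme would also need to handle attribution across contributors and noisy evaluations.

```python
def payout(baseline_metric, new_metric, reward_pool=1000.0):
    """Pay a contributor from a fixed pool in proportion to the measured
    improvement their labeled batch produced on a held-out eval set.
    Purely illustrative; pool size and linearity are assumptions."""
    gain = max(0.0, new_metric - baseline_metric)
    return round(reward_pool * gain, 2)

print(payout(0.82, 0.85))  # 30.0 tokens for a three-point accuracy gain
```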

According to a Gartner survey, companies will abandon over 60% of AI projects by 2026 because of a lack of AI-ready data. Human aptitude and competence in preparing AI training data are therefore crucial if AI is to contribute a projected $15.7 trillion to the global economy by 2030.

Data infrastructure for AI model training requires continuous improvement based on new and emerging data and use cases. Humans can ensure organizations maintain an AI-ready database through constant metadata management, observability and governance.
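In practice, that stewardship often starts with something as simple as a maintained metadata record per data set. A minimal sketch follows; the field names are illustrative, not a governance standard.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetCard:
    """Minimal metadata a human steward might maintain to keep a
    training set 'AI-ready'. Fields are illustrative, not a standard."""
    name: str
    version: str
    source: str
    last_validated: date
    steward: str
    known_issues: list = field(default_factory=list)

card = DatasetCard(
    name="support_tickets", version="2.3", source="crm_export",
    last_validated=date(2025, 1, 15), steward="data-governance-team",
    known_issues=["region field sparsely populated before 2023"],
)
```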

Without human supervision, enterprises will struggle to manage the massive volumes of data siloed across cloud and offshore storage. Companies must adopt a “human-in-the-loop” approach to fine-tune data sets for building high-quality, performant and relevant AI models.
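The “human-in-the-loop” pattern is often implemented as a confidence gate: the model handles what it is sure about and escalates the rest to a person. A minimal sketch, with an assumed confidence threshold that would be tuned per use case:

```python
def route(prediction, confidence, threshold=0.8):
    """Human-in-the-loop gate: auto-accept confident predictions and
    queue the rest for expert review. Threshold is an assumption."""
    if confidence >= threshold:
        return {"decision": prediction, "source": "model"}
    return {"decision": None, "source": "human_review_queue",
            "model_suggestion": prediction}

print(route("approve", 0.93))  # auto-accepted by the model
print(route("approve", 0.55))  # escalated to a human reviewer
```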


This article is for general information purposes and is not intended to be and should not be taken as legal or investment advice. The views, thoughts, and opinions expressed here are the author’s alone and do not necessarily reflect or represent the views and opinions of Cointelegraph.
