- LFQA models learn to compose paragraph-length answers from long-form content.
- Such content is available on sources like Reddit, Wikipedia, and other websites.
- Even popular LFQA models still struggle to solve complex mathematical problems.
Introduction:
Long-form question answering (LFQA) involves retrieving documents relevant to a given question and using those documents to compose a paragraph-length answer. According to a recent paper co-authored by researchers at the University of Massachusetts and Google, the machine learning models proposed for LFQA also pose challenges for researchers.
Their LFQA system achieves state-of-the-art performance on a popular dataset. Even so, the researchers found that the best LFQA models often do not ground their answers in the retrieved documents, nor do they demonstrate a real understanding of those documents.
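To make the pipeline concrete, here is a minimal retrieve-then-generate sketch. The TF-IDF retriever, the toy document collection, and the `generate_answer` stub are illustrative assumptions, not the system described in the paper.

```python
# Minimal retrieve-then-generate sketch (illustrative only; not the paper's system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Volcanoes erupt when magma rises through cracks in the Earth's crust.",
    "The stock market is a collection of exchanges where shares are traded.",
    "Photosynthesis converts sunlight, water, and CO2 into glucose and oxygen.",
]

def retrieve(question, docs, k=2):
    """Rank documents by TF-IDF cosine similarity to the question."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def generate_answer(question, evidence):
    """Placeholder for a seq2seq generator conditioned on retrieved evidence."""
    return f"(model output conditioned on {len(evidence)} retrieved passages)"

question = "Why do volcanoes erupt?"
evidence = retrieve(question, documents)
print(generate_answer(question, evidence))
```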
How do machine learning models learn language:
Massive language models, including OpenAI's GPT-3 and Google's GShard, have learned to write human-like text by internalizing billions of examples from the open web. They draw on sources like ebooks, Wikipedia, and social media platforms such as Reddit to complete sentences and form long paragraphs.
But research demonstrates the pitfalls of such a training approach. Open-domain question answering models often simply memorize answers found in the data on which they are trained. Because of this, language models can be prompted to reveal sensitive and private information when given certain words and phrases.
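As a concrete illustration of this completion behavior, the short sketch below uses the openly available GPT-2 model via the Hugging Face `transformers` pipeline as a small stand-in; GPT-3 and GShard themselves are not publicly downloadable, so the choice of model here is an assumption for demonstration purposes.

```python
# Sentence completion with a small open model (GPT-2), standing in for much
# larger systems like GPT-3 or GShard that are not publicly available.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Long-form question answering systems retrieve documents and then"
completion = generator(prompt, max_new_tokens=30, num_return_sequences=1)
print(completion[0]["generated_text"])
```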
What does the research say?
In a recent study, the co-authors evaluated their LFQA model on ELI5, a dataset of questions and long-form answers drawn from Reddit's "Explain Like I'm Five" forum. They found significant overlap between the data used to train and test the model: around 81% of test questions appeared in paraphrased form in the training set. This exposed issues with the model as well as with ELI5 itself.
“[Our] in-depth analysis reveals [shortcomings] not only with our model but also with the ELI5 dataset and evaluation metrics. We hope that the community works towards solving these issues so that we can climb the right hills and make meaningful progress,” they wrote in the paper.
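One rough way to picture this overlap analysis is to compare each test question against the training questions with a simple lexical similarity measure. The sketch below uses TF-IDF cosine similarity with a hand-picked threshold and toy data; it is an approximation for illustration, not the exact procedure used in the paper.

```python
# Rough sketch of flagging train/test question overlap via TF-IDF similarity.
# The threshold and toy data are illustrative assumptions, not the paper's method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_questions = [
    "Why do we get goosebumps when we are cold?",
    "How do airplanes stay in the air?",
]
test_questions = [
    "What makes goosebumps appear when it's cold?",  # paraphrase of the first training question
    "Why is the sky blue?",
]

vectorizer = TfidfVectorizer().fit(train_questions + test_questions)
train_vecs = vectorizer.transform(train_questions)
test_vecs = vectorizer.transform(test_questions)

similarity = cosine_similarity(test_vecs, train_vecs)
for question, sims in zip(test_questions, similarity):
    best = sims.max()
    flag = "possible paraphrase of a training question" if best > 0.5 else "looks novel"
    print(f"{question!r}: max similarity {best:.2f} -> {flag}")
```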
These models face many challenges, and memorization is only one of them. Other research shows that even state-of-the-art language models struggle to solve complex mathematical problems. A paper from the University of California, Berkeley found that massive language models, including OpenAI's GPT-3, can solve only 2.9% to 6.9% of the problems in a dataset of more than 12,500.
How do language models handle abusive language and sensitive words:
OpenAI itself notes that its flagship language model, GPT-3, places words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” A paper by Stanford University Ph.D. candidate and Gradio founder Abubakar Abid detailed the anti-Muslim tendencies of text generated by GPT-3. And the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 could reliably generate “informational” and “influential” text that might “radicalize individuals into violent far-right extremist ideologies and behaviors.”
Leading AI researcher Timnit Gebru questions the wisdom of developing massive language models, asking who benefits from them and who is disadvantaged by them. Gebru co-authored a paper highlighting the carbon footprint of large language models and its impact on marginalized communities. Such models also tend to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other language that dehumanizes specific groups of people.