Open-source large language models (LLMs) are often proficient in English, but they struggle with other languages, particularly those of Southeast Asia, because of a shortage of training data. To address this, Viettel Solutions, a subsidiary of Viettel Corporation, has adopted NVIDIA’s NeMo Curator to improve its processing of high-quality Vietnamese language data, as reported by NVIDIA.
Challenges with Language Models
LLMs typically excel in English because training data is abundant. Languages like Vietnamese, however, often lack sufficient data, which hurts model performance. NVIDIA’s NeMo Curator offers a solution by enabling the creation of the high-quality datasets needed to train effective language models.
Viettel’s Collaboration with NVIDIA
Viettel Solutions has used NeMo Curator to train its Llama 3 ViettelSolution 8B model, which now ranks among the top models on the VMLU leaderboard. The tool’s GPU-accelerated features, such as deduplication and filtering, have increased model accuracy by 10%, cut training time by a factor of three, and reduced dataset size by 60%, according to Tuan Nguyen, Head of Data Analytics at Viettel Solutions.
Data Curation Pipeline
The data curation process consists of downloading datasets from various sources, reformatting Unicode, deduplicating, and applying quality filtering. The datasets include the Vietnamese subsets of C4, OSCAR, and Wikipedia, combined into a single dataset for training. NeMo Curator applies heuristic and classifier-based filtering to improve data quality, removing noise while preserving essential content diversity.
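The combine, normalize, and deduplicate steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not NeMo Curator's API: the function names, the in-memory `sources` dictionary, and the choice of NFC normalization with SHA-256 exact matching are all assumptions made for the example.

```python
import hashlib
import unicodedata


def normalize_unicode(text: str) -> str:
    """Normalize to NFC so visually identical Vietnamese strings compare equal."""
    return unicodedata.normalize("NFC", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each document, keyed by a hash of its text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def curate(sources):
    """Combine all sources, normalize Unicode, then drop exact duplicates."""
    combined = [normalize_unicode(d) for docs in sources.values() for d in docs]
    return exact_dedup(combined)


# Hypothetical in-memory stand-ins for the C4, OSCAR, and Wikipedia subsets.
sources = {
    "c4": ["Ti\u1ebfng Vi\u1ec7t", "Xin ch\u00e0o"],
    # Same text as the first C4 entry, but written with combining marks.
    "oscar": ["Tie\u0302\u0301ng Vie\u0323\u0302t"],
    "wikipedia": ["Xin ch\u00e0o", "H\u00e0 N\u1ed9i"],
}
curated = curate(sources)
print(len(curated), curated)
```

Normalizing before hashing matters for Vietnamese in particular, because the same accented word can be encoded either as precomposed characters or as base letters plus combining marks; without normalization, the two encodings would survive deduplication as separate documents.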
Advanced Filtering Techniques
Heuristic filtering removes low-quality content using predefined rules, while classifier-based filtering uses a trained model to distinguish high-quality from low-quality data. This dual approach keeps the dataset both comprehensive and of high quality, which is crucial for effective language model training.
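To make the two filtering styles concrete, here is a small sketch that chains a rule-based check with a toy word-level Naive Bayes scorer. It does not reproduce NeMo Curator's filters or Viettel's classifier: the rules, thresholds, class name `TinyQualityClassifier`, and the tiny training examples are all hypothetical.

```python
import math
from collections import Counter


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Rule-based check: enough words, and not dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


class TinyQualityClassifier:
    """Toy Naive Bayes over words; a stand-in for a real trained quality model."""

    def __init__(self, good_docs, bad_docs):
        self.good = Counter(w for d in good_docs for w in d.lower().split())
        self.bad = Counter(w for d in bad_docs for w in d.lower().split())
        self.vocab = len(set(self.good) | set(self.bad))

    def score(self, text):
        """Log-odds that the text is high quality (positive means 'good')."""
        good_total = sum(self.good.values())
        bad_total = sum(self.bad.values())
        total = 0.0
        for w in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a probability.
            p_good = (self.good[w] + 1) / (good_total + self.vocab)
            p_bad = (self.bad[w] + 1) / (bad_total + self.vocab)
            total += math.log(p_good / p_bad)
        return total


def filter_docs(docs, clf, threshold=0.0):
    """Keep documents that pass the rules and score above the threshold."""
    return [d for d in docs if passes_heuristics(d) and clf.score(d) > threshold]


# Hypothetical training examples and candidate documents.
clf = TinyQualityClassifier(
    good_docs=[
        "the model learns from clean curated text data",
        "high quality articles improve language model training",
    ],
    bad_docs=[
        "click here buy now cheap cheap cheap",
        "win free prize click here now",
    ],
)
docs = [
    "clean text data improves model training quality",
    "buy cheap prize now click here free",
    "$$$ !!! ### @@@ %%% ^^^",
    "too short",
]
kept = filter_docs(docs, clf)
print(kept)
```

Running the cheap heuristic check first means the classifier only scores documents that already look like prose, which is one common reason to combine the two approaches.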
Impact on Dataset Quality
The curation process significantly reduces dataset size by removing low-quality and redundant content, with classifier-based filtering alone accounting for a 45% reduction. This filtering ensures that the remaining data is of high quality and suitable for pretraining language models.
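Reductions like these are usually reported per stage. The sketch below shows one way to compute them, assuming each percentage is measured relative to the input of its own stage; the article does not state the baseline for the 45% figure, and the document counts here are invented purely for illustration.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of documents removed by a stage, relative to its input."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Hypothetical document counts after each stage, for illustration only.
stage_counts = [
    ("raw", 1000),
    ("deduplicated", 800),
    ("heuristic filter", 700),
    ("classifier filter", 385),
]

# Pair each stage with the one before it to get per-stage reductions.
report = []
for (_, before), (name, after) in zip(stage_counts, stage_counts[1:]):
    report.append((name, reduction_pct(before, after)))

overall = reduction_pct(stage_counts[0][1], stage_counts[-1][1])

for name, pct in report:
    print(f"{name}: {pct:.1f}% removed")
print(f"overall: {overall:.1f}% removed")
```

Per-stage percentages do not add up to the overall figure, because each later stage works on a smaller input than the one before it.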
Conclusion
NVIDIA’s NeMo Curator provides a robust tool for processing high-quality Vietnamese language data and so improves the performance of language models. By improving data quality and efficiency, it supports Viettel Solutions’ goal of leading in generative AI and building AI-powered products for the Vietnamese market.