More than a thousand researchers from all over the world worked on “Bloom”. This automatic language prediction system covers a total of 46 languages. Designed to work equally well in English, French and Basque, it should provide unprecedented insight into language and the way we speak.
But to fully understand how Bloom works, you have to look at its history, going back a few months to the birth of the BigScience project. That project was the work of Hugging Face, an American startup founded by two French people. Last May the company completed a 100 million dollar funding round, with big names in the industry such as Airbus, the AI branch of Meta (formerly Facebook), and the French companies Orange and Ubisoft.
The idea behind Bloom has real appeal. As François Yvon, research director at the CNRS, explains, other language models already exist. Microsoft and Nvidia joined forces to build one a few months ago. But those programs, developed by private companies, are not completely transparent, unlike Bloom.
Indeed, the two founders behind this idea promised that Bloom would be 100% open source. It should thus give researchers all over the world a common working base.
A multilingual system able to understand us
With Bloom, scientists can work on 46 languages at the same time. Given the start of a text, the software attempts to guess the next word. It is a colossal project that required months of computation on the French supercomputer Jean Zay, but the repercussions could be enormous.
Aware of what is at stake for the research community, the new minister in charge of higher education and research has reacted. She welcomed the fact that the French language is “widely” integrated into this program.
Although the uses of Bloom are still limited for the moment, the system should mark a real turning point for machine translation. But as François Yvon explains, it is above all a tool to advance research: “Like a large telescope, it allows us to observe and understand how these language models work.”
A model with 176 billion parameters
To complete its training, Bloom was able to use the computing power of the Jean Zay supercomputer. Training ran at full capacity for four months to achieve this result. After so much time spent learning a wide variety of languages, the resulting model has 176 billion parameters, the internal values it adjusted during training.
Thanks to this immense range of possibilities, it is able to predict the next word knowing only the beginning of a text. This should be very useful for machine translation systems, and also for text generators, which stand to gain enormously in reliability.
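To give a feel for what "predicting the next word from the beginning of a text" means, here is a deliberately tiny sketch in Python. It is not how Bloom works: Bloom is a large neural network, whereas this toy simply counts which word most often follows another in a small sample text. The function names and the sample sentences are invented for the illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next_word(followers, prompt):
    """Return the most frequent follower of the prompt's last word, or None."""
    words = prompt.lower().split()
    if not words or words[-1] not in followers:
        return None  # empty prompt, or a last word never seen in training
    return followers[words[-1]].most_common(1)[0][0]


# Hypothetical toy corpus; a real model is trained on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram_model(corpus)

print(predict_next_word(model, "I saw the"))  # prints "cat"
```

In the sample text, "the" is followed by "cat" three times and by "mat" and "fish" once each, so "cat" is the guess. A model like Bloom takes the whole beginning of the text into account rather than only the last word, which is where its billions of parameters come in.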
Fully open source, the project should be made public in the coming weeks by Hugging Face, the young startup behind it.