From MIT, an artificial intelligence that recognizes objects from verbal descriptions
MIT scientists are studying an AI-based system that can recognize objects in a scene from a spoken description alone; it could be a starting point for machine translation systems.
The speech recognition systems built into consumer devices currently on the market, although significantly better than they were even a few years ago, still tend to behave awkwardly. Above all, they require large amounts of annotations and transcriptions to correctly understand what the user is referring to.
One viable path is to rely on artificial intelligence and learning algorithms. At MIT, a research project has tested a machine learning system able to identify the objects present in a scene from a spoken description of it. When the description mentions, for example, a pair of red trousers, the system can recognize the garment without having to resort to transcriptions.
The researchers started from an existing approach in which two neural networks process images and audio spectrograms, learning to match an audio fragment with images that contain a given object. They modified it, however: the network that handles the images now divides each image into a grid of boxes, while the network dedicated to the audio divides the spectrogram into fragments lasting 1-2 seconds.
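The division into boxes and fragments described above can be illustrated with a minimal Python sketch. It is not the MIT implementation: the function names are hypothetical, the embedding vectors stand in for whatever the two neural networks would output, and the dot product is only an assumed similarity measure.

```python
def split_grid(image, rows, cols):
    """Split a 2D image (list of pixel rows) into rows x cols boxes."""
    box_h, box_w = len(image) // rows, len(image[0]) // cols
    boxes = []
    for r in range(rows):
        for c in range(cols):
            box = [row[c * box_w:(c + 1) * box_w]
                   for row in image[r * box_h:(r + 1) * box_h]]
            boxes.append(box)
    return boxes


def split_segments(spectrogram, frames_per_segment):
    """Split a spectrogram (list of time frames) into fixed-length fragments.

    Trailing frames that do not fill a whole fragment are dropped.
    """
    last_start = len(spectrogram) - frames_per_segment
    return [spectrogram[i:i + frames_per_segment]
            for i in range(0, last_start + 1, frames_per_segment)]


def dot(u, v):
    """Dot product, used here as an assumed similarity measure."""
    return sum(a * b for a, b in zip(u, v))


def match_matrix(box_embeddings, segment_embeddings):
    """Similarity of every grid box (rows) to every audio fragment (columns)."""
    return [[dot(b, s) for s in segment_embeddings] for b in box_embeddings]


def best_box_per_segment(matrix):
    """For each audio fragment, the index of the most similar grid box."""
    n_segments = len(matrix[0])
    return [max(range(len(matrix)), key=lambda i: matrix[i][j])
            for j in range(n_segments)]
```

With toy embeddings such as `[[1, 0], [0, 1], [0.5, 0.5]]` for three boxes and `[[0, 2], [3, 0]]` for two fragments, `best_box_per_segment(match_matrix(...))` points the first fragment at the second box and the second fragment at the first box, which is the kind of fragment-to-region association the article describes.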
After the right image has been paired with the relevant audio segment, the training process evaluates how well the AI system matches the audio segments to the objects in the grid. In a sense, the process can be pictured as teaching a child to recognize objects by pointing at a specific one and saying its name. The researchers trained the system on a total of 400,000 pairings of images and audio fragments, and used 1,000 random pairings for testing.
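One common way to score this kind of matching on a held-out set is a retrieval test: for each audio description, rank all test images and check whether the correct one comes out on top. The sketch below assumes that setup; the article does not say which metric the researchers used, and the pooling in `pair_score` (best box per fragment, averaged over fragments) is likewise an assumption made for illustration.

```python
def dot(u, v):
    """Dot product, used here as an assumed similarity measure."""
    return sum(a * b for a, b in zip(u, v))


def pair_score(box_embeddings, segment_embeddings):
    """Score one image/audio pairing.

    Assumed pooling: each audio fragment takes its best-matching grid box,
    and the per-fragment scores are averaged.
    """
    best = [max(dot(b, s) for b in box_embeddings) for s in segment_embeddings]
    return sum(best) / len(best)


def recall_at_k(images, audios, k=1):
    """Fraction of audios whose paired image ranks in the top k.

    images[i] and audios[i] are assumed to form a correct pairing.
    """
    hits = 0
    for i, audio in enumerate(audios):
        scores = [pair_score(image, audio) for image in images]
        ranked = sorted(range(len(images)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(audios)
```

Here each image is a list of box embeddings and each audio clip a list of fragment embeddings. On a test set like the 1,000 pairings mentioned above, `recall_at_k(images, audios, k=1)` would report how often the system picks exactly the right image for a description.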
A system of this kind has many potential uses, but the researchers seem most interested in pursuing automatic translation. It becomes possible, for example, to have several people who speak different languages describe the same object, and to have the system infer that an audio fragment in one language is simply the translation of an audio fragment in another. Such a technology could significantly expand the speech recognition capabilities of assistant systems and broaden their use cases.