A Smarter Captioning AI

June 10, 2020

Algorithms have been used to caption images for some time now. However, the results tend to be rather generic and treat images as separate from accompanying text. Tech Xplore reports on “A System to Produce Context-Aware Captions for News Images.” Journalist Ingrid Fadelli writes:

“Alasdair Tran, Alexander Mathews and Lexing Xie at the Australian National University have been trying to develop new systems that can generate more sophisticated and descriptive image captions. In a paper recently pre-published on arXiv, they introduced an automatic captioning system for news images that takes the general context behind an image into account while generating new captions. The goal of their study was to enable the creation of captions that are more detailed and more closely resemble those written by humans. … The three researchers went on to develop and implement the first end-to-end system that can generate captions for news images. The main advantage of end-to-end models is their simplicity. This simplicity ultimately allows the researchers’ model to be linguistically rich and generate real-world knowledge such as the names of people and places.”

Instead of ignoring rare words, the model analyzes them. The team eschewed the typical LSTM architecture in favor of the Transformer, a more recent architecture now widely used in language modeling and machine translation. The shift allows for richer vocabulary and sentence structure. The team also worked to improve their model’s accuracy in identifying individuals in photos. This is particularly useful since, they found, most newspaper images feature people. The curious can check out a demo of the system, titled Transform and Tell.

Fadelli describes the researchers’ hopes for the system’s future:

“Tran, Mathews and Xie would also like to train their model to complete a slightly different task to that tackled in their recent work, namely, that of picking an image that could go well with an article from a large database, based on the article text. Their model’s attention mechanism could also allow it to identify the best place for the image within the text, which could ultimately speed up news publishing processes.”

The team also suggests their system could be used to extrapolate longer passages or summarize related background information. We are curious to see how this technology evolves.

Cynthia Murrell, June 10, 2020
