Generating multi-sentence image descriptions is a challenging task, which requires a good model to produce coherent and accurate paragraphs, describing salient objects in the image. We argue that multiple sources of information are beneficial when describing visual scenes with long sequences. These include (i) perceptual information and (ii) semantic (language) information about how to describe what is in the image. We also compare the effects of using two different pooling mechanisms on either a single modality or their combination. We demonstrate that the model which utilises both visual and language inputs can be used to generate accurate and diverse paragraphs when combined with a particular pooling mechanism. The results of our automatic and human evaluation show that learning to embed semantic information along with visual stimuli into the paragraph generation model is not trivial, raising a variety of proposals for future experiments.