Saturday, April 25, 2015

Scrapy creating XML feed wraps content in "value" tags


I've had a bit of help on here, and my code pretty much works. The only issue is that, when generating the XML, it wraps the content in "value" tags when I don't want it to. According to the docs, this is due to the following:

Unless overridden in the serialize_field method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.

This is my output:

<?xml version="1.0" encoding="UTF-8"?>
<items>
   <item>
      <body>
         <value>Don't forget me this weekend!</value>
      </body>
      <to>
         <value>Tove</value>
      </to>
      <who>
         <value>Jani</value>
      </who>
      <heading>
         <value>Reminder</value>
      </heading>
   </item>
</items>
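
To check my understanding of the documented behaviour, here is a minimal sketch that models it with the standard library only. It is not Scrapy's actual code; `export_field` and `export_item` are hypothetical names I made up for illustration. The assumption is that a list or tuple gets one <value> child per element, while a plain string is written directly as text.

```python
import io
from xml.sax.saxutils import XMLGenerator


def export_field(xg, name, value):
    """Write one field; lists and tuples get one <value> child per element."""
    xg.startElement(name, {})
    if isinstance(value, (list, tuple)):
        # Multi-valued field: wrap every element in its own <value> tag.
        for element in value:
            export_field(xg, "value", element)
    else:
        # Single value: write it as the element's text (escaped by XMLGenerator).
        xg.characters(str(value))
    xg.endElement(name)


def export_item(item):
    """Return the XML for a single item as a string (hypothetical helper)."""
    buffer = io.StringIO()
    xg = XMLGenerator(buffer)
    xg.startElement("item", {})
    for name, value in item.items():
        export_field(xg, name, value)
    xg.endElement("item")
    return buffer.getvalue()


wrapped = export_item({"to": ["Tove"]})  # value is a one-element list
plain = export_item({"to": "Tove"})      # value is a plain string
print(wrapped)
print(plain)
```

In this model, the one-element list produces the extra <value> tag and the plain string does not, which matches the shape of the output I am seeing.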

This seems to be what I send to the XML exporter, so I don't know why it thinks the fields are multi-valued:

{'body': [u"Don't forget me this weekend!"],
 'heading': [u'Reminder'],
 'to': [u'Tove'],
 'who': [u'Jani']}
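
Looking at that dict again, every value is a one-element list rather than a string, which may be all it takes for a field to count as multi-valued. A rough sketch of unwrapping such lists before export, assuming that is the cause (`unwrap_single` is a hypothetical helper, not part of Scrapy):

```python
def unwrap_single(value):
    """Return the only element of a one-element list; leave anything else untouched."""
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return value[0]
    return value


# Same data as the item above, written as a plain dict for illustration.
item = {
    "body": ["Don't forget me this weekend!"],
    "heading": ["Reminder"],
    "to": ["Tove"],
    "who": ["Jani"],
}

# Build a copy whose single-element lists are replaced by their sole string.
flattened = {name: unwrap_single(value) for name, value in item.items()}
print(flattened)
```

Something like this could be applied in parse_node or in process_item before the item reaches the exporter. In Scrapy 1.0 and later, selectors also offer extract_first(), which returns a single string instead of a list.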

pipeline.py

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

spider.py

from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://ift.tt/1j1cMKy']
    itertag = 'note'

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['to'] = selector.xpath('//to/text()').extract()
        item['who'] = selector.xpath('//from/text()').extract()
        item['heading'] = selector.xpath('//heading/text()').extract()
        item['body'] = selector.xpath('//body/text()').extract()
        return item

Any help would be really appreciated. I just want the same output without the redundant tags.

