I've had a bit of help on here, and my code now pretty much works. The only issue is that when it generates the XML, it wraps the content in "value" tags when I don't want it to. According to the docs, this is due to the following:
Unless overridden in the :meth:`serialize_field` method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.
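To make the quoted behaviour concrete, here is a rough standard-library model of it. This is only a sketch of the rule as the docs describe it, not Scrapy's actual code; `export_field` is a hypothetical helper name, and treating only lists and tuples as multi-valued is an assumption.

```python
import xml.etree.ElementTree as ET


def export_field(parent, name, value):
    """Hypothetical helper: mimic the documented rule for one field."""
    field = ET.SubElement(parent, name)
    if isinstance(value, (list, tuple)):
        # Multi-valued field: each entry goes inside its own <value> element.
        for entry in value:
            ET.SubElement(field, 'value').text = str(entry)
    else:
        # Single value: the text goes directly inside the field element.
        field.text = str(value)
    return field


item_el = ET.Element('item')
export_field(item_el, 'to', ['Tove'])  # a one-element list still counts as multi-valued
export_field(item_el, 'who', 'Jani')   # a plain string does not
xml_text = ET.tostring(item_el, encoding='unicode')
print(xml_text)
```

In this model a list with a single entry still gets a <value> wrapper, while a plain string does not.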
This is my output:
<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    <body>
      <value>Don't forget me this weekend!</value>
    </body>
    <to>
      <value>Tove</value>
    </to>
    <who>
      <value>Jani</value>
    </who>
    <heading>
      <value>Reminder</value>
    </heading>
  </item>
</items>
What I send to the XML exporter seems to be this, so I don't know why it thinks the fields are multi-valued:
{'body': [u"Don't forget me this weekend!"],
'heading': [u'Reminder'],
'to': [u'Tove'],
'who': [u'Jani']}
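One thing that stands out in that dict is that every value is a one-element list, which is what `extract()` returns, so each field may well look multi-valued to the exporter. Below is a minimal sketch of unwrapping such lists into plain values. `unwrap_single_values` is a hypothetical helper, it assumes each field really holds at most one value, and I have not checked it against the exporter itself.

```python
def unwrap_single_values(item):
    """Hypothetical helper: return a dict where one-element lists become plain values."""
    flattened = {}
    for key, value in item.items():
        if isinstance(value, (list, tuple)) and len(value) == 1:
            # Exactly one entry: keep the entry itself instead of the list.
            flattened[key] = value[0]
        else:
            # Empty or genuinely multi-valued fields are left untouched.
            flattened[key] = value
    return flattened


item = {'body': [u"Don't forget me this weekend!"],
        'heading': [u'Reminder'],
        'to': [u'Tove'],
        'who': [u'Jani']}
flat = unwrap_single_values(item)
print(flat)
```

The same idea could be applied to each field in `parse_node` before the item is returned, so that the exporter receives plain strings rather than lists.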
pipeline.py
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter


class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
spider.py
from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem


class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://ift.tt/1j1cMKy']
    itertag = 'note'

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['to'] = selector.xpath('//to/text()').extract()
        item['who'] = selector.xpath('//from/text()').extract()
        item['heading'] = selector.xpath('//heading/text()').extract()
        item['body'] = selector.xpath('//body/text()').extract()
        return item
Any help would be really appreciated. I just want the same output without the redundant tags.