ezoic

Thursday, February 15, 2018

How to read in a file in python scrapy and query

In python scrapy, we can query and scrapy webpage, and use callback option to parse the html got from scrape. If we want to query a list of keywords from a file, and parse some number from the resulted html, how to do it? Here is an example, it query some keywords from a file, and output how many articles mentions the keywords within one day in weixin.sogou.com.

Here is the code:

#!/usr/bin/python
# coding: utf-8


import scrapy
import time
from bs4 import BeautifulSoup
class QuotesSpider(scrapy.Spider):
   name="quotes"

   headers = {
    'Cookie': cookie,
    "User-Agent": UA,
    "Referer": "http://weixin.sogou.com/weixin?type=2"
    }

   def start_requests(self,filename=None):
             with open('your_file.txt','r') as f:
                for query in f:
                  self.log("%s" % query)
                  yield scrapy.http.FormRequest(url='http://weixin.sogou.com/weixin',
                           formdata={'type':'2',
                                     'ie':'utf8',
                                     'query':query,
                                     'tsn':'1',
                                     'ft':'',
                                     'et':'',
                                    #  'sst0': str(int(time.time()*1000)),
                                    # 'page': str(1),
                                     'interation':'',
                                     'wxid':'',
                                     'usip':''},
                           headers=self.headers,method='get', dont_filter=True,
                          meta = {'dont_redirect': True, "handle_httpstatus_list" : [301, 302, 303]},
                           callback=self.parse)

   def parse(self, response):

                   filename1="quotes-111.txt"
                   with open(filename1,"a") as k:

                      soup = BeautifulSoup(response.body, 'html.parser')

                      cc_rating_text="约".encode('utf8')
                      dd_rating_text="条".encode('utf8')
                      for row in soup.find_all('div',attrs={"class" : "mun"}):
                         line=row.text.strip()
                         tag_found = line.find(cc_rating_text)
                         tag_found2 = line.find(dd_rating_text)


                         rating = line[tag_found+1:tag_found2]
                         k.write(str(rating)+"\n")

                   self.log("Saved file %s" % filename1)

No comments:

Post a Comment

R is not a simple programming language, and it does better on reading excel files than python

R is not a simple programming language, and it does better on reading excel files than python . tried to read excel files to python and R. i...