
Friday, February 23, 2018

HTML tags, attributes, elements

Here is a video about HTML tags, attributes, and elements.

https://www.youtube.com/watch?v=naowH2LzuVg

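Tags tell the browser how to treat the content they mark up.

Most tags come in pairs, a start tag and an end tag. <p> </p> is the paragraph tag.

An HTML element usually consists of a start tag and an end tag, with the content inserted in between, for example <p>some text</p>.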

Attributes provide additional information about HTML elements.

Here is a list of HTML tags:

https://www.w3schools.com/tags/default.asp

Suppose we have an element:

<div class="mun"> found 31 results <!--resultbarnum:31--></div>

To find it with BeautifulSoup:


for row in soup.find_all('div', attrs={"class": "mun"}):
    print(row.text.strip())

To check whether it does not exist:

if not soup.find_all('div', attrs={"class": "mun"}):
    print("no div with class mun on this page")
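The two snippets above can be tried on the sample element without any web request. This is only a minimal sketch: the helper name result_count and the use of a regular expression to pull out the number are my own assumptions, not something from the video.

```python
import re
from bs4 import BeautifulSoup

# The sample element from this post.
html = '<div class="mun"> found 31 results <!--resultbarnum:31--></div>'
soup = BeautifulSoup(html, "html.parser")


def result_count(soup):
    """Return the first number in the first div with class "mun", or None."""
    rows = soup.find_all("div", attrs={"class": "mun"})
    if not rows:
        # Same test as "if not soup.find_all(...)" above.
        return None
    match = re.search(r"\d+", rows[0].get_text())
    return int(match.group()) if match else None


count = result_count(soup)
missing = result_count(BeautifulSoup("<p>nothing here</p>", "html.parser"))
print(count, missing)
```

find_all returns an empty list when nothing matches, which is why "if not soup.find_all(...)" works as an existence check.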








Thursday, February 15, 2018

How to read in a file in python scrapy and query

In Scrapy, we can request a web page and use the callback option to parse the HTML that comes back. If we want to query a list of keywords from a file and extract a number from each result page, how do we do it? Here is an example: it reads keywords from a file and outputs how many articles on weixin.sogou.com mentioned each keyword within the last day.

Here is the code:

#!/usr/bin/env python3
# coding: utf-8

import scrapy
import time
from bs4 import BeautifulSoup

# Placeholders: paste the cookie and user-agent strings from your own browser.
cookie = "your cookie string"
UA = "your user-agent string"


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    headers = {
        "Cookie": cookie,
        "User-Agent": UA,
        "Referer": "http://weixin.sogou.com/weixin?type=2",
    }

    def start_requests(self):
        # One keyword per line in your_file.txt.
        with open("your_file.txt", "r", encoding="utf-8") as f:
            for line in f:
                query = line.strip()  # drop the trailing newline
                if not query:
                    continue
                self.log("%s" % query)
                yield scrapy.FormRequest(
                    url="http://weixin.sogou.com/weixin",
                    formdata={
                        "type": "2",
                        "ie": "utf8",
                        "query": query,
                        "tsn": "1",  # search within one day
                        "ft": "",
                        "et": "",
                        # "sst0": str(int(time.time() * 1000)),
                        # "page": str(1),
                        "interation": "",
                        "wxid": "",
                        "usip": "",
                    },
                    headers=self.headers,
                    method="GET",
                    dont_filter=True,
                    # Hand redirects to the callback instead of following them.
                    meta={"dont_redirect": True,
                          "handle_httpstatus_list": [301, 302, 303]},
                    callback=self.parse,
                )

    def parse(self, response):
        filename1 = "quotes-111.txt"
        soup = BeautifulSoup(response.body, "html.parser")
        with open(filename1, "a") as k:
            for row in soup.find_all("div", attrs={"class": "mun"}):
                line = row.text.strip()
                # The count sits between "约" (about) and "条" (items).
                tag_found = line.find("约")
                tag_found2 = line.find("条")
                if tag_found == -1 or tag_found2 == -1:
                    continue
                rating = line[tag_found + 1:tag_found2]
                k.write(rating + "\n")
        self.log("Saved file %s" % filename1)
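
Two steps of the spider above, reading the keywords and cutting the number out of the result text, can be checked without Scrapy or a network connection. The sketch below uses only the standard library. The function names read_keywords and extract_count, the file name keywords.txt, and the sample string are my own assumptions for illustration; only the markers "约" and "条" come from the spider.

```python
import os
import tempfile


def read_keywords(path):
    """Return the non-empty lines of a file, without trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def extract_count(text, start="约", end="条"):
    """Return the text between the start and end markers, or None."""
    i = text.find(start)
    j = text.find(end, i + 1)
    if i == -1 or j == -1:
        return None
    return text[i + len(start):j]


# Write a small keyword file (hypothetical contents) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keywords.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("the shape of water\n\nscrapy\n")
    keywords = read_keywords(path)

# Assumed shape of the result text: "found about 31 items".
count = extract_count("找到约31条结果")
print(keywords, count)
```

Stripping each line matters because a keyword read straight from the file still carries its newline, which would otherwise end up in the query.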

How I resolved the Python scrapy wechat 302 error

I tried to scrape the weixin.sogou.com website, limiting the results to a certain time frame. The site has a parameter, tsn, that controls the search time frame. tsn=1 means you are searching within one day.
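
To see what such a search looks like as a plain URL, here is a small sketch with the standard library. It only builds the address and does not send anything. The parameter values are the ones used in the spiders in these posts; leaving out the empty parameters is my own simplification.

```python
from urllib.parse import urlencode

# tsn=1 restricts the search to the last day.
params = {
    "type": "2",
    "ie": "utf8",
    "query": "the shape of water",
    "tsn": "1",
}
url = "http://weixin.sogou.com/weixin?" + urlencode(params)
print(url)
```

With method GET, Scrapy's FormRequest appends the form data to the URL in much the same way.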

I wrote a Scrapy spider to do this. Here is the code:

import scrapy

class QuotesSpider(scrapy.Spider):
   name="quotes"
   def start_requests(self):
      yield scrapy.FormRequest(url='http://weixin.sogou.com/weixin',
                           formdata={'type':'2',
                                     'ie':'utf8',
                                     'query':"the shape of water",
                                     'tsn':'1',
                                     'ft':'',
                                     'et':'',
                                     'interation':'',
                                     'wxid':'',
                                     'usip':''},
                            method='get',callback=self.parse)
   def parse(self, response):

        filename="quotes.html"
        with open(filename,"wb") as f:
                f.write(response.body)
        self.log("Saved file %s" % filename)

When I ran the code, the request was always redirected to the homepage, weixin.sogou.com.

I tried many methods. Finally, someone suggested sending http://weixin.sogou.com/weixin?type=2 as the Referer header. I tried it and it works: the request is no longer redirected.


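import scrapy
import time

# Placeholders: paste the cookie and user-agent strings from your own browser.
Cookie = "your cookie string"
UA = "your user-agent string"


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    headers = {
        "Cookie": Cookie,
        "User-Agent": UA,
        "Referer": "http://weixin.sogou.com/weixin?type=2",
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            url="http://weixin.sogou.com/weixin",
            formdata={
                "type": "2",
                "ie": "utf8",
                "query": "the shape of water",
                "tsn": "1",
                "ft": "",
                "et": "",
                # "sst0": str(int(time.time() * 1000)),
                # "page": str(1),
                "interation": "",
                "wxid": "",
                "usip": "",
            },
            headers=self.headers,
            method="GET",
            dont_filter=True,
            # Hand redirects to the callback instead of following them.
            meta={"dont_redirect": True,
                  "handle_httpstatus_list": [301, 302, 303]},
            callback=self.parse,
        )

    def parse(self, response):
        # Same as in the first version: save the page for inspection.
        filename = "quotes.html"
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)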






Monday, February 12, 2018

A python Scrapy tutorial


https://www.youtube.com/watch?v=OJ8isyws2yw

Scrapy is a Python web crawling package.

The YouTuber gives a Scrapy example.

First, type scrapy startproject tutorial

This creates a project directory called tutorial. In its spiders directory (tutorial/tutorial/spiders), create a Python file called quotes_spider.py.



The code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in /page/N/, so element [-2] is the page number.
        page = response.url.split("/")[-2]
        filename = "quotes-%s.html" % page
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)



   

   



How to run it:
From the spiders directory, run
scrapy crawl quotes





Results, in the spiders folder:

two files, quotes-1.html and quotes-2.html.
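
The file names come from the page number in each URL. A quick way to check this without running the spider is the sketch below; the helper name filename_for is my own, while the split and the file name pattern are the ones used in parse.

```python
urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]


def filename_for(url):
    """Build the output file name the same way the spider's parse does."""
    # Splitting on "/" leaves an empty string after the final slash,
    # so the page number is the second-to-last element.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


filenames = [filename_for(u) for u in urls]
print(filenames)
```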






How to draw a straight line in Word


Type three "-" at the beginning of a line, and hit "enter".

If you type three "_" at the beginning of a line and hit "enter", you will get a thicker line in Word.

looking for a man

I am a middle-aged woman. I live in Southern California. I was born in 1980. I do not have any kids. No complicated dating. I am looking for ...