Online LaTeX resume editor:
https://www.sharelatex.com/
I write about solutions to problems I have encountered in programming and data analytics. I hope they help you in your work. Thank you.
Tuesday, February 27, 2018
Friday, February 23, 2018
HTML tags, attributes, elements
Here is a video about HTML tags, attributes, and elements:
https://www.youtube.com/watch?v=naowH2LzuVg
Tags tell the browser what to do with the content. Tags come in pairs, a start tag and an end tag: <p> </p> is a paragraph tag pair.
An HTML element usually consists of a start tag and an end tag, with the content inserted in between.
Attributes provide additional information about HTML elements.
Here is an HTML tag list:
https://www.w3schools.com/tags/default.asp
Suppose we have an element:
<div class="mun"> found 31 results <!--resultbarnum:31--></div>
To select it in BeautifulSoup:
for row in soup.find_all('div', attrs={"class": "mun"}):
To check that no such element exists:
if not soup.find_all('div', attrs={"class": "mun"}):
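The two checks above can be tried in a self-contained script; the HTML string below is just the example element from this post wrapped in a tiny document:

```python
from bs4 import BeautifulSoup

# The example element from above.
html = '<div class="mun"> found 31 results <!--resultbarnum:31--></div>'
soup = BeautifulSoup(html, 'html.parser')

# Find every div whose class attribute is "mun".
for row in soup.find_all('div', attrs={"class": "mun"}):
    print(row.text.strip())   # found 31 results

# The same call returns an empty list when nothing matches.
if not soup.find_all('div', attrs={"class": "xyz"}):
    print("no match")
```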
Thursday, February 15, 2018
How to read a file in Python Scrapy and query each line
In Python Scrapy, we can query and scrape a webpage, and use the callback option to parse the HTML returned by the request. What if we want to read a list of keywords from a file and parse a number out of each resulting page? Here is an example: it reads keywords from a file and, for each one, outputs how many articles on weixin.sogou.com mention that keyword within one day.
Here is the code (fill in your own cookie and User-Agent strings):
#!/usr/bin/python
# coding: utf-8
import scrapy
import time  # used if you uncomment the sst0 parameter below
from bs4 import BeautifulSoup

cookie = 'your cookie string here'
UA = 'your User-Agent string here'

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    headers = {
        'Cookie': cookie,
        "User-Agent": UA,
        "Referer": "http://weixin.sogou.com/weixin?type=2"
    }

    def start_requests(self):
        # Read one query keyword per line from the input file.
        with open('your_file.txt', 'r') as f:
            for query in f:
                query = query.strip()
                self.log("%s" % query)
                yield scrapy.http.FormRequest(
                    url='http://weixin.sogou.com/weixin',
                    formdata={'type': '2',
                              'ie': 'utf8',
                              'query': query,
                              'tsn': '1',  # tsn=1: search within one day
                              'ft': '',
                              'et': '',
                              # 'sst0': str(int(time.time()*1000)),
                              # 'page': str(1),
                              'interation': '',
                              'wxid': '',
                              'usip': ''},
                    headers=self.headers, method='get', dont_filter=True,
                    meta={'dont_redirect': True,
                          "handle_httpstatus_list": [301, 302, 303]},
                    callback=self.parse)

    def parse(self, response):
        filename1 = "quotes-111.txt"
        with open(filename1, "a") as k:
            soup = BeautifulSoup(response.body, 'html.parser')
            # The result bar looks like "约123条结果"; grab the digits
            # between the two marker characters. In Python 3, row.text
            # is already a str, so the markers need no encoding.
            cc_rating_text = "约"
            dd_rating_text = "条"
            for row in soup.find_all('div', attrs={"class": "mun"}):
                line = row.text.strip()
                tag_found = line.find(cc_rating_text)
                tag_found2 = line.find(dd_rating_text)
                rating = line[tag_found + 1:tag_found2]
                k.write(str(rating) + "\n")
        self.log("Saved file %s" % filename1)
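The extraction step in parse can be exercised on its own, without running the spider. A sketch in plain Python 3; the HTML below is a made-up result bar in the same shape sogou returns, purely for illustration:

```python
from bs4 import BeautifulSoup

# A made-up result bar in the same shape as the real one (illustrative only).
html = '<div class="mun">约123条结果</div>'
soup = BeautifulSoup(html, 'html.parser')

for row in soup.find_all('div', attrs={"class": "mun"}):
    line = row.text.strip()
    start = line.find("约")        # marker character before the count
    end = line.find("条")          # marker character after the count
    rating = line[start + 1:end]   # the digits between the two markers
    print(rating)                  # 123
```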
How I resolved the Python Scrapy wechat 302 error
I tried to scrape the weixin.sogou.com website for data within a certain time frame. The site has a parameter, tsn, that controls the search window; tsn=1 means you are searching within one day.
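For reference, the GET URL such a request produces can be sketched with the standard library. The parameter names are taken from the request below; whether the site accepts them this way is up to sogou:

```python
from urllib.parse import urlencode

# Same core parameters the spider sends; tsn=1 restricts results to one day.
params = {'type': '2', 'ie': 'utf8', 'query': 'the shape of water', 'tsn': '1'}
url = 'http://weixin.sogou.com/weixin?' + urlencode(params)
print(url)
```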
I wrote Python Scrapy code to do this. Here is the code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.FormRequest(
            url='http://weixin.sogou.com/weixin',
            formdata={'type': '2',
                      'ie': 'utf8',
                      'query': "the shape of water",
                      'tsn': '1',
                      'ft': '',
                      'et': '',
                      'interation': '',
                      'wxid': '',
                      'usip': ''},
            method='get', callback=self.parse)

    def parse(self, response):
        filename = "quotes.html"
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)
When I ran the code, the request was always redirected to the homepage, weixin.sogou.com.
I tried many fixes. Finally, someone suggested using http://weixin.sogou.com/weixin?type=2 as the Referer in the request headers. I tried it, and it works; the request is no longer redirected.
import scrapy
import time  # used if you uncomment the sst0 parameter below

Cookie = 'your cookie string here'
UA = 'your User-Agent string here'
query = "the shape of water"

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    headers = {
        'Cookie': Cookie,
        "User-Agent": UA,
        # This Referer header is what stops the 302 redirect.
        "Referer": "http://weixin.sogou.com/weixin?type=2"
    }

    def start_requests(self):
        yield scrapy.http.FormRequest(
            url='http://weixin.sogou.com/weixin',
            formdata={'type': '2',
                      'ie': 'utf8',
                      'query': query,
                      'tsn': '1',
                      'ft': '',
                      'et': '',
                      # 'sst0': str(int(time.time()*1000)),
                      # 'page': str(1),
                      'interation': '',
                      'wxid': '',
                      'usip': ''},
            headers=self.headers, method='get', dont_filter=True,
            meta={'dont_redirect': True,
                  "handle_httpstatus_list": [301, 302, 303]},
            callback=self.parse)
Monday, February 12, 2018
A python Scrapy tutorial
https://www.youtube.com/watch?v=OJ8isyws2yw
Python Scrapy is a web crawling package.
The YouTuber gave a Scrapy example.
First, type: scrapy startproject tutorial
This creates a directory called tutorial. Under tutorial/spiders, create a Python file called quotes_spider.py.
The code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = "quotes-%s.html" % page
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)
How to run it: under the spiders directory, run
scrapy crawl quotes
Result: two quotes-*.html files appear in the spiders folder.
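The filename line in parse derives the page number from the URL; that step can be checked in isolation:

```python
# The second-to-last path segment of the quotes URL is the page number.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]
filename = "quotes-%s.html" % page
print(filename)   # quotes-1.html
```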
How to draw a straight line in word
Type three "-" characters at the beginning of a line and hit "Enter" to get a straight horizontal line.
If you type three "_" characters at the beginning of a line and hit "Enter", you get a thicker line.