
Monday, September 10, 2018

write the data out to a file, python script

output the data to a file, sample python script:

filename = "filename11" + '.txt'
File = open("C:/path/to/file/" + filename, 'w')
# 'a' is the list of strings to write out, one per line
for item in a:
    File.write(item + "\n")
File.close()

Wednesday, August 29, 2018

sum case when pyspark



https://stackoverflow.com/questions/40762066/sum-of-case-when-in-pyspark

https://stackoverflow.com/questions/49524501/spark-sql-sum-based-on-multiple-cases
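
A minimal PySpark sketch of the pattern from the links above; the DataFrame, column names ("category", "amount"), and values are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 5), ("A", 7)], ["category", "amount"])

# SQL: SUM(CASE WHEN category = 'A' THEN amount ELSE 0 END)
df.agg(
    F.sum(F.when(F.col("category") == "A", F.col("amount")).otherwise(0)).alias("sum_a")
).show()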




pyspark timestamp function, from_utc_timestamp function



http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

https://stackoverflow.com/questions/45469438/pyspark-creating-timestamp-column
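
A minimal sketch based on the links above; the column name "ts_str" and the time zone are made-up examples. Note that to_timestamp needs Spark 2.2+; on 2.1 you can use unix_timestamp(...).cast("timestamp") instead:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018-08-29 12:00:00",)], ["ts_str"])

# Parse the string into a timestamp column, then shift it from UTC to a local zone.
df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("ts_la", F.from_utc_timestamp(F.col("ts"), "America/Los_Angeles"))
df.show(truncate=False)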




regular expression extract pyspark



https://stackoverflow.com/questions/46410887/pyspark-string-matching-to-create-new-column
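
A minimal sketch of regexp_extract; the column name and pattern are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("order_123",), ("order_456",)], ["code"])

# Pull the digits out of strings like "order_123" into a new column.
df = df.withColumn("order_id", F.regexp_extract(F.col("code"), r"order_(\d+)", 1))
df.show()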


regular expression for pyspark



https://stackoverflow.com/questions/45580057/pyspark-filter-dataframe-by-regex-with-string-formatting
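
A minimal sketch of filtering a DataFrame by a regex with rlike; the column name and pattern are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple",), ("banana",), ("cherry",)], ["fruit"])

# Keep only rows whose fruit column matches the pattern.
df.filter(F.col("fruit").rlike("^(a|b)")).show()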


pyspark sql case when to pyspark when otherwise


https://www.programcreek.com/python/example/98243/pyspark.sql.functions.when
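
A minimal sketch (my own made-up column and thresholds) of translating SQL CASE WHEN into when/otherwise:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5,), (15,), (25,)], ["value"])

# CASE WHEN value < 10 THEN 'low' WHEN value < 20 THEN 'mid' ELSE 'high' END
df = df.withColumn(
    "bucket",
    F.when(F.col("value") < 10, "low")
     .when(F.col("value") < 20, "mid")
     .otherwise("high"),
)
df.show()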


pyspark user defined function


https://stackoverflow.com/questions/34803855/pyspark-dataframe-udf-on-text-column
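
A minimal sketch of a UDF on a text column, along the lines of the link above; the word-count function and column name are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world",), ("pyspark udf",)], ["text"])

# Wrap a plain Python function as a UDF that counts words.
count_words = F.udf(lambda s: len(s.split()), IntegerType())
df = df.withColumn("n_words", count_words(F.col("text")))
df.show()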


pyspark sql functions



https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#when
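
The linked page lists the built-in SQL functions (when, etc.). A small made-up sketch showing the same logic written once as a SQL expression via expr and once with the when/otherwise API:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (0,)], ["flag"])

# Same CASE WHEN logic, SQL expression vs. Python API.
df = df.withColumn("label_sql", F.expr("CASE WHEN flag = 1 THEN 'yes' ELSE 'no' END"))
df = df.withColumn("label_api", F.when(F.col("flag") == 1, "yes").otherwise("no"))
df.show()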


Sunday, July 29, 2018

differences between t test and z test



             distribution requirement                              sample size      population variance
t test       data assumed normally distributed                     can be small     unknown
z test       no normality requirement (because of the CLT)         large            known

Saturday, July 28, 2018

Odds and odds ratio in statistics

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

The odds of success are defined as the ratio of the probability of success to the probability of failure.
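
A small worked example of this definition (the probabilities are made up):

# Odds of success = P(success) / P(failure) = p / (1 - p)
p = 0.8
odds = p / (1 - p)            # 0.8 / 0.2 = 4.0, i.e. 4 successes per failure

# Odds ratio: the odds in one group divided by the odds in another group.
p_treatment, p_control = 0.8, 0.5
odds_ratio = (p_treatment / (1 - p_treatment)) / (p_control / (1 - p_control))  # 4.0 / 1.0 = 4.0
print(odds, odds_ratio)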



confidence interval

https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Odds_Ratio_in_Logistic_Regression_with_One_Binary_X.pdf


Assumptions of linear models



https://www.theanalysisfactor.com/assumptions-of-linear-models/


  1. The residuals are independent
  2. The residuals are normally distributed
  3. The residuals have a mean of 0 at all values of X
  4. The residuals have constant variance
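
Not from the linked article, just a quick way I'd eyeball assumptions 3 and 4: fit a line to simulated data and plot the residuals, which should scatter around 0 with roughly constant spread across X.

import numpy as np
import matplotlib.pyplot as plt

# Simulated data: y is linear in x plus constant-variance normal noise.
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Fit a line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals should scatter around 0 with roughly constant spread.
plt.scatter(x, residuals)
plt.axhline(0, color='r')
plt.xlabel('x')
plt.ylabel('residual')
plt.show()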

Data science interview questions




http://alexbraunstein.com/2011/08/09/hire-data-scientist-statistician/



https://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists


Apply, sapply, tapply differences in R



https://www.guru99.com/r-apply-sapply-tapply.html

apply: works on a matrix

apply(mat, 1, var)   # variance of each row of matrix mat


MARGIN argument: 1 = rows, 2 = columns


lapply: applies a function over a vector or list and returns a list; there is no MARGIN argument.

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- lapply(movies, tolower)
[[1]]
[1] "spyderman"
[[2]]
[1] "batman"
...

sapply does the same job as lapply, but returns a vector (or matrix) instead of a list.


tapply applies a function (min, max, median, etc.) to a vector, grouped by the levels of a factor.

data(iris)

tapply(iris$Sepal.Width, iris$Species, median)   # median sepal width for each species






Tuesday, July 10, 2018

Randomly generate user agents and ip in python

1. randomly generate user agent

installation:
pip install fake_useragent

usage:

from fake_useragent import UserAgent

ua = UserAgent()

ua.random

This returns a random user agent string.


2. randomly generate an IP address

import random

'.'.join('%s' % random.randint(0, 255) for i in range(4))



Thursday, July 5, 2018

How to send emails on linux.

I use an Ubuntu system. (To find out which system you use, the command is "uname -a".)

I tried to send email from Ubuntu.

I tried it on the command line first.

I first installed postfix:

sudo apt-get install postfix

Then I tried the command:

echo "test message" | mailx -s "test subject" XXXX@xxx.com

And I got the following:


The program 'mailx' is currently not installed. You can install it by typing:
sudo apt-get install mailutils


So I installed mailutils, which provides mailx.

After that, the test message was delivered.

Then I put the command in a Linux shell script, and it worked.

Tuesday, June 26, 2018

Scrapy 302 error

In settings.py, change COOKIES_ENABLED to True.

Linux buffers and cache, YouTube video


https://www.youtube.com/watch?v=eqahEvCb8NM


stderr, stdout, stdin: how to redirect output to a file on Linux



http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

There are 3 file descriptors, stdin, stdout and stderr (std=standard).

Basically you can:
  1. redirect stdout to a file
  2. redirect stderr to a file
  3. redirect stdout to stderr
  4. redirect stderr to stdout
  5. redirect stderr and stdout to a file
  6. redirect stderr and stdout to stdout
  7. redirect stderr and stdout to stderr


I redirect stderr to a file:

scrapy crawl XXX 2> nohup.txt

or

scrapy crawl XXX 2>> nohup.txt

">>" means append, while ">" overwrites the file.






Monday, June 11, 2018

Scrapy Spider, one url, multiple requests, sample code


from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.selector import Selector


class PabhSpider(CrawlSpider):
    name = 'pabh'
    allowed_domains = ['xxx']

    def start_requests(self):
        url = 'http://xxx'
        num = '01'
        formdata = {
            "depart": num,
            "years": '2014'
        }
        return [FormRequest(url=url, formdata=formdata, method='get', callback=self.parse)]


    def parse(self, response):
        item = XXXItem()
        hxs = Selector(response)
        item['bh'] = hxs.xpath('/html/body/form/p/font/select[3]/option/@value').extract()
        yield item

        num = ['02','03','04','05','06','07','08','09','10','11','12','13','14','21','31','40','51','61']

        for x in  num:
            url = 'http://xxx'
            formdata={
                "depart":x,
                "years":'2014'
            }
            yield FormRequest(url=url,formdata=formdata,method='get',callback=self.parse)

Wednesday, June 6, 2018

how to get rid of garbage characters when opening a txt file with excel and the txt file has east Asian characters

Sometimes when we open a txt file that contains East Asian characters with Excel, we see garbage characters. How do we get rid of them?

To open a txt file with Excel: first open an empty Excel workbook, then click File => Open => go to the file you want to open => click Open.









Some people say that if we open it with the Windows (ANSI) option, the garbage characters go away. But when I tried that with my file, it did not get rid of them.



So I tried some other options; Unicode (UTF-8) got rid of the garbage characters.


Wednesday, May 16, 2018

A tech forum.


https://slashdot.org/


some python code to plot subplots




import matplotlib.pyplot as plt
import numpy as np

# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

plt.close('all')

# Just a figure and one subplot
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')

# Two subplots, the axes array is 1-d
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)

# Two subplots, unpack the axes array immediately
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)

# Three subplots sharing both x/y axes
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
# Fine-tune figure; make subplots close to each other and hide x ticks for
# all but bottom plot.
f.subplots_adjust(hspace=0)
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)

# row and column sharing
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
ax1.plot(x, y)
ax1.set_title('Sharing x per column, y per row')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
ax4.plot(x, 2 * y ** 2 - 1, color='r')

# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')

plt.show()

Thursday, May 10, 2018

Scrapy linux cron job not work, how I make it work

I tried to automate a Scrapy job using cron on Linux, but it did not work. I searched and found the solution.
First, use "which scrapy" to find where scrapy is installed. On my machine it is:

/home/ubuntu/anaconda2/bin/scrapy


Then, in the shell script, write:

cd /path/to/spider 

nohup  /home/ubuntu/anaconda2/bin/scrapy crawl quotes   >> log.txt &

That resolved the problem. Alternatively, you can write:

cd /path/to/spider 
PATH=$PATH:/home/ubuntu/anaconda2/bin
export PATH
nohup  /home/ubuntu/anaconda2/bin/scrapy crawl quotes   >> log.txt &








Tuesday, May 8, 2018

Compare Unix date and time


To get Unix time, the command is:

current time:

now=`date +"%T"`

we will get a time like "20:55:01"

now=`date +"%H%M%S"`

we will get a time like "205501"

If we want to compare times, we cannot compare them in the "%H:%M:%S" format; we can only compare them in the "%H%M%S" format. Otherwise we will get an error like: Illegal number: 20:59:22

To get Unix date, the command is:

date1=`date +"%m/%d/%Y %H:%M:%S"`

We will get a date like "05/08/2018 20:55:01"

If we want the Unix timestamp, we use:

date2=`date +"%s"`

We will get a Unix timestamp (seconds since the epoch).

We can compare dates by their Unix timestamps; it seems we cannot compare two dates like "05/08/2018 20:55:01" directly, otherwise we will get an error: Illegal number: 05/08/2018 20:59:22.

Monday, April 23, 2018

Code for drawing a plot in python

I scraped some web pages and got a dictionary of dates versus number of mentions. Here is the code:



import matplotlib.pyplot as plt

def sortdict(d):
    for key in sorted(d): yield d[key]

counter1={'2018-02-01': 22, '2018-01-31': 19, 
'2018-01-30': 10, '2018-01-29': 5, '2018-01-27': 4, 
'2018-01-28': 3, '2018-01-25': 3, '2018-01-23': 3, '2018-01-26': 3, 
'2018-01-24': 2, '2018-01-01': 2, '2018-01-15': 2, '2018-01-12': 2,
 '2018-01-09': 1, '2017-12-18': 1, '2017-12-26': 1, '2018-01-11': 1, 
'2018-01-13': 1, '2017-11-28': 1, '2017-12-21': 1, '2017-12-22': 1,
 '2017-02-09': 1, '2018-01-04': 1, '2017-01-17': 1, '2017-03-02': 1,
 '2018-01-08': 1, '2017-12-09': 1, '2017-12-24': 1, '2017-02-20': 1,
 '2018-01-14': 1, '2018-01-21': 1, '2017-12-28': 1, '2017-12-11': 1}


x_labels = []  # create an empty list to store the labels
# for key in counter1.keys():
#     x_labels.append(key)


fig, ax = plt.subplots()
lists = sorted(counter1.items())  # sorted by key, return a list of tuples
print(lists)
x, y = zip(*lists)
print(x)
print(y)
# unpack a list of pairs into two tuples
plt.scatter(x, y)
plt.xticks(range(len(x)), x, rotation=90, fontsize=5)


plt.show()


results:






