Write the data out to a file, Python script

Output the data to a file, sample Python script:

for item in a:
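
The loop above is cut off in the source. As a minimal sketch of what such a script might look like, assuming the goal is to write one item per line to a text file (the function name `write_items` and the file name in the usage note are hypothetical, not from the original):

```python
def write_items(items, path):
    """Write each item on its own line to the file at `path`."""
    # "w" overwrites an existing file; use "a" instead to append.
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            # str.format handles non-string items such as numbers.
            f.write("{}\n".format(item))
```

For example, `write_items(a, "output.txt")` would write every element of the list `a` to output.txt.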

sum case when pyspark



pyspark timestamp function, from_utc_timestamp function



regular expression extract pyspark


regular expression for pyspark


pyspark sql case when to pyspark when otherwise


pyspark user defined function


pyspark sql functions


differences between t test and z test

          Distribution                                      Sample size     Population variance
t test    normally distributed                              can be small    unknown
z test    normality not required (central limit theorem)    large           known
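
To make the table concrete, here is a rough sketch of the two one-sample test statistics using only the Python standard library. The function names are made up for illustration. The only difference between the two statistics is whether the known population standard deviation or the sample standard deviation is used.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_statistic(sample, mu0, sigma):
    """One-sample z statistic; sigma is the KNOWN population std deviation."""
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))


def t_statistic(sample, mu0):
    """One-sample t statistic; the std deviation is estimated from the sample."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))


def z_two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

The t statistic has to be compared against a t distribution with n - 1 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that p-value.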

Odds and odds ratio in statistics


The odds of success are defined as the ratio of the probability of success to the probability of failure.
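
A small sketch of the definition in Python. The function names are chosen for illustration; the odds ratio is simply the odds of one group divided by the odds of another.

```python
def odds(p):
    """Odds of success for a success probability p, i.e. p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Ratio of the odds in group 1 to the odds in group 2."""
    return odds(p1) / odds(p2)
```

For instance, a success probability of 0.8 gives odds of about 4 (4 to 1), and comparing it with a group whose success probability is 0.5 (odds of 1) gives an odds ratio of about 4.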

confidence interval


Assumptions of linear models


  1. The residuals are independent
  2. The residuals are normally distributed
  3. The residuals have a mean of 0 at all values of X
  4. The residuals have constant variance

Data science interview questions



Apply, sapply, tapply differences in R


apply: operates on a matrix


MARGIN = 1: apply the function over rows
MARGIN = 2: apply the function over columns

lapply: applies a function to each element of a vector or list and returns a list; it takes no MARGIN argument

movies_lower <- lapply(movies, tolower)
[1] "spyderman"

sapply does the same job as lapply, but returns a vector (or matrix) when the result can be simplified.

tapply applies a function (min, max, median, etc.) to the values of a vector within each level of a factor.


tapply(iris$Sepal.Width, iris$Species, median)

Randomly generate user agents and IP addresses in Python

1. randomly generate user agent

pip install fake_useragent


from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)

This prints a random user agent string.

2. randomly generate ip

import random

'.'.join(str(random.randint(0, 255)) for _ in range(4))

How to send emails on linux.

I use Ubuntu. To find out which system you are running, use the command "uname -a".

I wanted to send email from Ubuntu, and tried it on the command line first.

I first installed postfix:

sudo apt-get install postfix

Then I tried the command:

echo "test message" | mailx -s "test subject" XXXX@xxx.com

And I got the following:

The program 'mailx' is currently not installed. You can install it by typing:
sudo apt-get install mailutils

So I installed mailutils, which provides mailx, and ran the command again.

This time the email arrived.

I then put the command in a Linux shell script, and it worked.

Scrapy 302 error

In settings.py, set COOKIES_ENABLED to True.

Linux buffers , cache youtube video

stderr, stdout, stdin, how to redirect output to a file on Linux


There are 3 file descriptors, stdin, stdout and stderr (std=standard).

Basically you can:
  1. redirect stdout to a file
  2. redirect stderr to a file
  3. redirect stdout to stderr
  4. redirect stderr to stdout
  5. redirect stderr and stdout to a file
  6. redirect stderr and stdout to stdout
  7. redirect stderr and stdout to stderr

I redirected stderr to a file:

scrapy crawl XXX 2> nohup.txt

To append to the file instead of overwriting it:

scrapy crawl XXX 2>> nohup.txt

">" overwrites the file; ">>" appends to it.

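The same stderr redirection can also be done when launching a command from Python. This is only a sketch using the standard `subprocess` module; the helper name `run_with_stderr_log` is made up for illustration. Opening the log in mode "a" corresponds to `2>>` and mode "w" corresponds to `2>`.

```python
import subprocess


def run_with_stderr_log(cmd, log_path, append=True):
    """Run cmd and send its stderr to log_path; return the exit code."""
    mode = "a" if append else "w"  # "a" behaves like 2>>, "w" like 2>
    with open(log_path, mode) as log:
        # stdout is left untouched; only stderr goes to the file.
        return subprocess.run(cmd, stderr=log).returncode
```

For example, `run_with_stderr_log(["scrapy", "crawl", "XXX"], "nohup.txt")` would roughly correspond to the `2>>` command above.
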
Scrapy Spider, one url, multiple request sample code

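from scrapy import FormRequest, Selector
from scrapy.spiders import CrawlSpider

# XXXItem is the project's item class; import it from your items module.


class PabhSpider(CrawlSpider):
    name = 'pabh'
    allowed_domains = ['xxx']

    def start_requests(self):
        url = 'http://xxx'
        num1 = '01'
        # The form field name is cut off in the original; 'xxx' is a placeholder.
        formdata = {'xxx': num1}
        return [FormRequest(url=url, formdata=formdata, method='get', callback=self.parse)]

    def parse(self, response):
        item = XXXItem()
        hxs = Selector(response)
        item['bh'] = hxs.xpath('/html/body/form/p/font/select[3]/option/@value').extract()
        yield item

        num = ['02','03','04','05','06','07','08','09','10','11','12','13','14','21','31','40','51','61']

        # Request the same url once for each remaining value.
        for x in num:
            url = 'http://xxx'
            formdata = {'xxx': x}  # 'xxx' is the same placeholder field name
            yield FormRequest(url=url, formdata=formdata, method='get', callback=self.parse)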

how to get rid of garbage characters when opening a txt file with excel and the txt file has east Asian characters

Sometimes when we open a txt file containing East Asian characters with Excel, the characters show up as garbage. Here is how to get rid of it.

To open a txt file with Excel: first open an empty Excel workbook, then click File => Open, go to the file you want to open, and click Open.

Some people say that opening the file with the option Windows (ANSI) gets rid of the garbage characters. I tried Windows (ANSI) on my file, but the garbage characters remained.

So I tried some other options. With Unicode (UTF-8), the garbage characters went away.
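
A related approach, not from the original post, is to re-save the file with a UTF-8 byte order mark (BOM), which Excel generally uses as a hint that the file is UTF-8. The sketch below assumes the source file really is UTF-8 encoded; the function name `add_utf8_bom` is made up for illustration.

```python
def add_utf8_bom(src_path, dst_path):
    """Copy a UTF-8 text file, writing the copy with a leading UTF-8 BOM."""
    # "utf-8-sig" on read drops an existing BOM, so it is never doubled.
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # "utf-8-sig" on write puts the BOM bytes EF BB BF at the start.
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)
```

If the file is in another encoding (for example a legacy East Asian code page), the `encoding` used for reading would have to be changed accordingly.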

A tech forum.

some python code to plot subplots in python

import matplotlib.pyplot as plt
import numpy as np

# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)


# Just a figure and one subplot
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')

# Two subplots, the axes array is 1-d
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)

# Two subplots, unpack the axes array immediately
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)

# Three subplots sharing both x/y axes
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
# Fine-tune figure; make subplots close to each other and hide x ticks for
# all but bottom plot.
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)

# row and column sharing
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
ax1.plot(x, y)
ax1.set_title('Sharing x per column, y per row')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
ax4.plot(x, 2 * y ** 2 - 1, color='r')

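# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 1].scatter(x, y)
axarr[1, 0].plot(x, y ** 2)
axarr[1, 1].scatter(x, y ** 2)

# Display all figures
plt.show()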

Scrapy linux cron job not work, how I make it work

I tried to automate a Scrapy job using cron on Linux, but it did not work. I searched and found the solution.
First, use "which scrapy" to find where scrapy is installed. On my machine it is:

/home/ubuntu/anaconda2/bin/scrapy

Then, in the shell script, call scrapy by its full path, because cron runs jobs with a minimal PATH:

cd /path/to/spider 

nohup  /home/ubuntu/anaconda2/bin/scrapy crawl quotes   >> log.txt &

This resolved the problem. Alternatively, add the directory to PATH in the script and call scrapy by name:

cd /path/to/spider 
export PATH=/home/ubuntu/anaconda2/bin:$PATH
nohup scrapy crawl quotes >> log.txt &

Compare Unix date and time

To get the current time, the command is:

now=`date +"%T"`

we will get a time like "20:55:01"

now=`date +"%H%M%S"`

we will get a time like "205501"

To compare times, use the format "%H%M%S" rather than "%H:%M:%S". The shell's numeric comparison operators (such as -gt) accept only integers, so a value containing colons produces an error like: Illegal number: 20:59:22

To get Unix date, the command is:

date1=`date +"%m/%d/%Y %H:%M:%S"`

We will get a date like "05/08/2018 20:55:01"

If we want to get timestamp, we will use:

date2=`date +"%s"`

We will get a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC).

We can compare dates by their Unix timestamps. We cannot numerically compare two dates in a format like "05/08/2018 20:55:01"; that produces an error: Illegal number: 05/08/2018 20:59:22.
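
For comparison, the same idea sketched in Python: parse the formatted string into a value that can be ordered, then compare. The format string mirrors the `date` format used above; the function names are made up for illustration.

```python
from datetime import datetime

# Same layout as date +"%m/%d/%Y %H:%M:%S"
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_date(text):
    """Turn a string like '05/08/2018 20:55:01' into a datetime object."""
    return datetime.strptime(text, DATE_FORMAT)


def is_later(a, b):
    """Return True if date string a is strictly later than date string b."""
    # datetime objects compare chronologically, unlike the raw strings.
    return parse_date(a) > parse_date(b)
```

Here `is_later("05/08/2018 20:59:22", "05/08/2018 20:55:01")` evaluates to True.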

Code for drawing plot in python

I scraped some web pages and got a dictionary mapping each date to its number of mentions. Here is the code to plot it:

import matplotlib.pyplot as plt

def sortdict(d):
    for key in sorted(d): yield d[key]

counter1={'2018-02-01': 22, '2018-01-31': 19, 
'2018-01-30': 10, '2018-01-29': 5, '2018-01-27': 4, 
'2018-01-28': 3, '2018-01-25': 3, '2018-01-23': 3, '2018-01-26': 3, 
'2018-01-24': 2, '2018-01-01': 2, '2018-01-15': 2, '2018-01-12': 2,
 '2018-01-09': 1, '2017-12-18': 1, '2017-12-26': 1, '2018-01-11': 1, 
'2018-01-13': 1, '2017-11-28': 1, '2017-12-21': 1, '2017-12-22': 1,
 '2017-02-09': 1, '2018-01-04': 1, '2017-01-17': 1, '2017-03-02': 1,
 '2018-01-08': 1, '2017-12-09': 1, '2017-12-24': 1, '2017-02-20': 1,
 '2018-01-14': 1, '2018-01-21': 1, '2017-12-28': 1, '2017-12-11': 1}

x_labels = []  # create an empty list to store the labels
# for key in counter.keys():
#        x_labels.append(key)

fig, ax = plt.subplots()
lists = sorted(counter1.items())  # sorted by key, return a list of tuples
print(lists)
x, y = zip(*lists)  # unpack a list of pairs into two tuples
plt.scatter(x, y)
plt.xticks(range(len(x)), x, rotation=90, fontsize=5)
plt.show()



