
Monday, September 10, 2018

write the data out to a file, python script

output the data to a file, sample python script:

filename = "filename11" + '.txt'
File = open("C:/path/to/file/" + filename, 'w')
# 'a' is the list of strings to write out, one per line
for item in a:
    File.write(item + "\n")
File.close()

Wednesday, August 29, 2018

sum case when pyspark



https://stackoverflow.com/questions/40762066/sum-of-case-when-in-pyspark

https://stackoverflow.com/questions/49524501/spark-sql-sum-based-on-multiple-cases
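
A minimal PySpark sketch of the pattern from the links above; the DataFrame, column names ("category", "amount"), and values are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 5), ("A", 7)], ["category", "amount"])

# SQL: SUM(CASE WHEN category = 'A' THEN amount ELSE 0 END)
df.agg(
    F.sum(F.when(F.col("category") == "A", F.col("amount")).otherwise(0)).alias("sum_a")
).show()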




pyspark timestamp function, from_utc_timestamp function



http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

https://stackoverflow.com/questions/45469438/pyspark-creating-timestamp-column
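
A minimal sketch based on the links above; the column name "ts_str" and the time zone are made-up examples. Note that to_timestamp needs Spark 2.2+; on 2.1 you can use unix_timestamp(...).cast("timestamp") instead:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018-08-29 12:00:00",)], ["ts_str"])

# Parse the string into a timestamp column, then shift it from UTC to a local zone.
df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("ts_la", F.from_utc_timestamp(F.col("ts"), "America/Los_Angeles"))
df.show(truncate=False)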




regular expression extract pyspark



https://stackoverflow.com/questions/46410887/pyspark-string-matching-to-create-new-column
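
A minimal sketch of regexp_extract; the column name and pattern are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("order_123",), ("order_456",)], ["code"])

# Pull the digits out of strings like "order_123" into a new column.
df = df.withColumn("order_id", F.regexp_extract(F.col("code"), r"order_(\d+)", 1))
df.show()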


regular expression for pyspark



https://stackoverflow.com/questions/45580057/pyspark-filter-dataframe-by-regex-with-string-formatting
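
A minimal sketch of filtering a DataFrame by a regex with rlike; the column name and pattern are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("apple",), ("banana",), ("cherry",)], ["fruit"])

# Keep only rows whose fruit column matches the pattern.
df.filter(F.col("fruit").rlike("^(a|b)")).show()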


pyspark sql case when to pyspark when otherwise


https://www.programcreek.com/python/example/98243/pyspark.sql.functions.when
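
A minimal sketch (my own made-up column and thresholds) of translating SQL CASE WHEN into when/otherwise:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5,), (15,), (25,)], ["value"])

# CASE WHEN value < 10 THEN 'low' WHEN value < 20 THEN 'mid' ELSE 'high' END
df = df.withColumn(
    "bucket",
    F.when(F.col("value") < 10, "low")
     .when(F.col("value") < 20, "mid")
     .otherwise("high"),
)
df.show()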


pyspark user defined function


https://stackoverflow.com/questions/34803855/pyspark-dataframe-udf-on-text-column
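
A minimal sketch of a UDF on a text column, along the lines of the link above; the word-count function and column name are made up:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world",), ("pyspark udf",)], ["text"])

# Wrap a plain Python function as a UDF that counts words.
count_words = F.udf(lambda s: len(s.split()), IntegerType())
df = df.withColumn("n_words", count_words(F.col("text")))
df.show()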


pyspark sql functions



https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#when
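
The linked page lists the built-in SQL functions (when, etc.). A small made-up sketch showing the same logic written once as a SQL expression via expr and once with the when/otherwise API:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (0,)], ["flag"])

# Same CASE WHEN logic, SQL expression vs. Python API.
df = df.withColumn("label_sql", F.expr("CASE WHEN flag = 1 THEN 'yes' ELSE 'no' END"))
df = df.withColumn("label_api", F.when(F.col("flag") == 1, "yes").otherwise("no"))
df.show()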


Sunday, July 29, 2018

differences between t test and z test



             distribution requirement                              sample size      population variance
t test       data assumed normally distributed                     can be small     unknown
z test       no normality requirement (because of the CLT)         large            known

Saturday, July 28, 2018

Odds and odds ratio in statistics

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

The odds of success are defined as the ratio of the probability of success to the probability of failure.
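
A small worked example of this definition (the probabilities are made up):

# Odds of success = P(success) / P(failure) = p / (1 - p)
p = 0.8
odds = p / (1 - p)            # 0.8 / 0.2 = 4.0, i.e. 4 successes per failure

# Odds ratio: the odds in one group divided by the odds in another group.
p_treatment, p_control = 0.8, 0.5
odds_ratio = (p_treatment / (1 - p_treatment)) / (p_control / (1 - p_control))  # 4.0 / 1.0 = 4.0
print(odds, odds_ratio)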



confidence interval

https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Odds_Ratio_in_Logistic_Regression_with_One_Binary_X.pdf


Assumptions of linear models



https://www.theanalysisfactor.com/assumptions-of-linear-models/


  1. The residuals are independent
  2. The residuals are normally distributed
  3. The residuals have a mean of 0 at all values of X
  4. The residuals have constant variance
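
Not from the linked article, just a quick way I'd eyeball assumptions 3 and 4: fit a line to simulated data and plot the residuals, which should scatter around 0 with roughly constant spread across X.

import numpy as np
import matplotlib.pyplot as plt

# Simulated data: y is linear in x plus constant-variance normal noise.
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Fit a line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals should scatter around 0 with roughly constant spread.
plt.scatter(x, residuals)
plt.axhline(0, color='r')
plt.xlabel('x')
plt.ylabel('residual')
plt.show()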

Data science interview questions




http://alexbraunstein.com/2011/08/09/hire-data-scientist-statistician/



https://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists


Apply, sapply, tapply differences in R



https://www.guru99.com/r-apply-sapply-tapply.html

apply: works on a matrix

apply(mat, 1, var)   # variance of each row of matrix mat


MARGIN argument: 1 = rows, 2 = columns


lapply: applies a function over a vector or list and returns a list; there is no MARGIN argument.

movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- lapply(movies, tolower)
[[1]]
[1] "spyderman"
[[2]]
[1] "batman"
...

sapply does the same job as lapply, but returns a vector (or matrix) instead of a list.


tapply applies a function (min, max, median, etc.) to a vector, grouped by the levels of a factor.

data(iris)

tapply(iris$Sepal.Width, iris$Species, median)   # median sepal width for each species






Tuesday, July 10, 2018

Randomly generate user agents and ip in python

1. randomly generate user agent

installation:
pip install fake_useragent

usage:

from fake_useragent import UserAgent

ua = UserAgent()

ua.random

This returns a random user agent string.


2. randomly generate an IP address

import random

'.'.join('%s' % random.randint(0, 255) for i in range(4))



Thursday, July 5, 2018

How to send emails on linux.

I use an Ubuntu system. (To find out which system you use, the command is "uname -a".)

I tried to send email from Ubuntu.

I tried it on the command line first.

I first installed postfix:

sudo apt-get install postfix

Then I tried the command:

echo "test message" | mailx -s "test subject" XXXX@xxx.com

And I got the following:


The program 'mailx' is currently not installed. You can install it by typing:
sudo apt-get install mailutils


So I installed mailutils, which provides mailx.

After that, the test message was delivered.

Then I put the command in a Linux shell script, and it worked.

Tuesday, June 26, 2018

Scrapy 302 error

In settings.py, change COOKIES_ENABLED to True.

Linux buffers and cache, YouTube video


https://www.youtube.com/watch?v=eqahEvCb8NM


stderr, stdout, stdin: how to redirect output to a file on Linux



http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

There are 3 file descriptors, stdin, stdout and stderr (std=standard).

Basically you can:
  1. redirect stdout to a file
  2. redirect stderr to a file
  3. redirect stdout to stderr
  4. redirect stderr to stdout
  5. redirect stderr and stdout to a file
  6. redirect stderr and stdout to stdout
  7. redirect stderr and stdout to stderr


I redirect stderr to a file:

scrapy crawl XXX 2> nohup.txt

or

scrapy crawl XXX 2>> nohup.txt

">>" means append, while ">" overwrites the file.






Monday, June 11, 2018

Scrapy Spider, one url, multiple requests, sample code


from scrapy.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.selector import Selector


class PabhSpider(CrawlSpider):
    name = 'pabh'
    allowed_domains = ['xxx']

    def start_requests(self):
        url = 'http://xxx'
        num = '01'
        formdata = {
            "depart": num,
            "years": '2014'
        }
        return [FormRequest(url=url, formdata=formdata, method='get', callback=self.parse)]


    def parse(self, response):
        item = XXXItem()
        hxs = Selector(response)
        item['bh'] = hxs.xpath('/html/body/form/p/font/select[3]/option/@value').extract()
        yield item

        num = ['02','03','04','05','06','07','08','09','10','11','12','13','14','21','31','40','51','61']

        for x in  num:
            url = 'http://xxx'
            formdata={
                "depart":x,
                "years":'2014'
            }
            yield FormRequest(url=url,formdata=formdata,method='get',callback=self.parse)

Wednesday, June 6, 2018

how to get rid of garbage characters when opening a txt file with excel and the txt file has east Asian characters

Sometimes when we open a txt file that contains East Asian characters with Excel, we see garbage characters. How do we get rid of them?

To open a txt file with Excel: first open an empty Excel workbook, then click File => Open => go to the file you want to open => click Open.









Some people say that if we open it with the Windows (ANSI) option, the garbage characters go away. But when I tried that with my file, it did not get rid of them.



So I tried some other options; Unicode (UTF-8) got rid of the garbage characters.


Wednesday, May 16, 2018

A tech forum.


https://slashdot.org/


some python code to plot subplots




import matplotlib.pyplot as plt
import numpy as np

# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

plt.close('all')

# Just a figure and one subplot
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')

# Two subplots, the axes array is 1-d
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)

# Two subplots, unpack the axes array immediately
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)

# Three subplots sharing both x/y axes
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
# Fine-tune figure; make subplots close to each other and hide x ticks for
# all but bottom plot.
f.subplots_adjust(hspace=0)
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)

# row and column sharing
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey='row')
ax1.plot(x, y)
ax1.set_title('Sharing x per column, y per row')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
ax4.plot(x, 2 * y ** 2 - 1, color='r')

# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')

plt.show()

Thursday, May 10, 2018

Scrapy linux cron job not work, how I make it work

I tried to automate a Scrapy job using cron on Linux, but it did not work. I searched and found the solution.
First, use "which scrapy" to find where scrapy is installed. On my machine it is:

/home/ubuntu/anaconda2/bin/scrapy


Then, in the shell script, write:

cd /path/to/spider 

nohup  /home/ubuntu/anaconda2/bin/scrapy crawl quotes   >> log.txt &

That resolved the problem. Alternatively, you can write:

cd /path/to/spider 
PATH=$PATH:/home/ubuntu/anaconda2/bin
export PATH
nohup  /home/ubuntu/anaconda2/bin/scrapy crawl quotes   >> log.txt &








Tuesday, May 8, 2018

Compare Unix date and time


To get Unix time, the command is:

current time:

now=`date +"%T"`

we will get a time like "20:55:01"

now=`date +"%H%M%S"`

we will get a time like "205501"

If we want to compare times, we cannot compare them in the "%H:%M:%S" format; we can only compare them in the "%H%M%S" format. Otherwise we will get an error like: Illegal number: 20:59:22

To get Unix date, the command is:

date1=`date +"%m/%d/%Y %H:%M:%S"`

We will get a date like "05/08/2018 20:55:01"

If we want the Unix timestamp, we use:

date2=`date +"%s"`

We will get a Unix timestamp (seconds since the epoch).

We can compare dates by their Unix timestamps; it seems we cannot compare two dates like "05/08/2018 20:55:01" directly, otherwise we will get an error: Illegal number: 05/08/2018 20:59:22.

Monday, April 23, 2018

Code for drawing a plot in python

I scraped some web pages and got a dictionary of dates versus number of mentions. Here is the code:



import matplotlib.pyplot as plt

def sortdict(d):
    for key in sorted(d): yield d[key]

counter1={'2018-02-01': 22, '2018-01-31': 19, 
'2018-01-30': 10, '2018-01-29': 5, '2018-01-27': 4, 
'2018-01-28': 3, '2018-01-25': 3, '2018-01-23': 3, '2018-01-26': 3, 
'2018-01-24': 2, '2018-01-01': 2, '2018-01-15': 2, '2018-01-12': 2,
 '2018-01-09': 1, '2017-12-18': 1, '2017-12-26': 1, '2018-01-11': 1, 
'2018-01-13': 1, '2017-11-28': 1, '2017-12-21': 1, '2017-12-22': 1,
 '2017-02-09': 1, '2018-01-04': 1, '2017-01-17': 1, '2017-03-02': 1,
 '2018-01-08': 1, '2017-12-09': 1, '2017-12-24': 1, '2017-02-20': 1,
 '2018-01-14': 1, '2018-01-21': 1, '2017-12-28': 1, '2017-12-11': 1}


x_labels = []  # create an empty list to store the labels
# for key in counter1.keys():
#     x_labels.append(key)


fig, ax = plt.subplots()
lists = sorted(counter1.items())  # sorted by key, return a list of tuples
print(lists)
x, y = zip(*lists)
print(x)
print(y)
# unpack a list of pairs into two tuples
plt.scatter(x, y)
plt.xticks(range(len(x)), x, rotation=90, fontsize=5)


plt.show()


results:






