Thursday, December 19, 2019

A book recommendation

On Intelligence, a book by Jeff Hawkins.

This is the website the author built for the book:

http://www.onintelligence.org/

Saturday, November 23, 2019

A good article on building a data analysis portfolio

https://blog.udacity.com/2016/02/how-to-build-a-data-analysis-portfolio-that-will-get-you-hired.html

An example of a data analysis portfolio

http://matatat.org/blog/markdown-latex-react

Use Jupyter and Pelican

Another good example of a data analysis portfolio

http://davidventuri.com/portfolio#scroll

One of the portfolio projects on GitHub

https://davidventuri.github.io/eda-project/

Tuesday, November 19, 2019

How to run a Python script in Atom on Ubuntu

To run a Python script in Atom on Ubuntu, press:

Ctrl+Shift+B

Friday, November 15, 2019

An interesting website where one can watch Times Square and Hollywood Blvd live

It also covers some other countries and areas.

https://www.earthcam.com/

Times Square

https://www.earthcam.com/usa/newyork/timessquare/?cam=tsnorth_hd

Hollywood Blvd

https://www.earthcam.com/usa/california/losangeles/hollywoodblvd/?cam=hollywoodblvd

Monday, October 28, 2019

Skype did not work on Ubuntu 14.04; here is how I fixed it

I use Ubuntu 14.04 and tried to use Skype there, but it did not work.
Skype was already installed, but I could not sign in with my Microsoft account.
I uninstalled that Skype.
Then I downloaded a Skype rpm from here:
https://www.skype.com/en/get-skype/

I converted the rpm to a deb with alien and installed it:

sudo apt-get install alien dpkg-dev debhelper build-essential

sudo alien packagename.rpm

sudo dpkg -i packagename.deb

After that, it finally worked.

Monday, October 21, 2019

List of English forums etc

twitter, reddit, 4chan, quora, github, tumblr, snapchat, telegram, line, facebook

Tuesday, September 10, 2019

natural language processing is fun

a good article

https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

word tokenization
stop words
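
The two terms above can be illustrated with a minimal sketch in plain Python. The regular expression and the tiny STOP_WORDS set are assumptions made for illustration; they do not come from the linked article, and real NLP libraries ship much longer stop word lists.

```python
import re

# Hypothetical, very small stop word list (real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to"}

def tokenize(text):
    """Word tokenization: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("London is the capital of England.")
content_words = remove_stop_words(tokens)
print(tokens)
print(content_words)
```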

Sunday, September 8, 2019

reverse a string in python

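a = "imdelda"
a1 = list(a)
# swap characters from both ends, moving toward the middle
# (// is integer division, so range() gets an int in Python 3)
for i in range(len(a1) // 2):
    tmp = a1[i]
    a1[i] = a1[len(a1) - i - 1]
    a1[len(a1) - i - 1] = tmp
a2 = "".join(a1)
print(a2)
# a shorter way to do the same thing: print(a[::-1])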

Monday, September 2, 2019

How to improve coding efficiency

How do you improve the efficiency of your scripts? It can take years of practice.

There are hours-long programming videos on YouTube, for example:

https://www.youtube.com/watch?v=PJlAnR3asGQ&t=18011s

They can help people learn scripting from the beginning.

There are also some books on writing effective code:

https://www.amazon.com/Effective-Python-Specific-Software-Development/dp/0134034287

https://www.amazon.com/Effective-Specific-Improve-Programs-Designs/dp/0321334876

But to improve your coding efficiency, you need to keep studying code on GitHub and similar sites. GitHub only shows a portion of the scripts in the world, though. Many companies store their scripts internally on Bitbucket, and those scripts are not public.

I have seen some people's scripts that are very efficient. I will post some here.

Deep learning: what it is

Deep learning is a machine learning technique used in AI. It is based on neural networks with multiple layers.

Here is a tutorial for it on r-bloggers.com:

https://www.r-bloggers.com/step-by-step-tutorial-deep-learning-with-tensorflow-in-r/

Here is a video for it:

https://livevideo.manning.com/module/52_1_1/deep-learning-with-r-in-motion/getting-started/welcome-to-the-video-series?utm_source=rstudio&utm_medium=partner_website&utm_campaign=livevideo_deeplearningwithrinmotion&utm_content=unit1_rstudio

And an r-bloggers.com post for it:

https://www.r-bloggers.com/getting-started-with-deep-learning-in-r/

r-bloggers.com

r-bloggers.com is a comprehensive website for statistics and R programming. If you want to learn about a topic in statistics or R programming, search Google for the topic plus "r-bloggers.com"; most of the time you will find what you need.

Thursday, August 29, 2019

An example of deep learning with Python code, using Keras

https://www.guru99.com/keras-tutorial.html

Friday, August 16, 2019

Read data into R and check for missing values

Code to read the data into R:

data1 <- read.csv("data1.csv", stringsAsFactors = FALSE)
View(data1)

A line of code to count the rows with missing values:
length(which(!complete.cases(data1)))

This returns 0 if there are no missing values in the data.
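
For comparison, the same check can be sketched in Python using only the standard library. This is an illustration, not a translation of R's exact rules: the function name count_incomplete_rows, the sample data, and the choice to treat empty cells and the string "NA" as missing are all assumptions.

```python
import csv
import io

# Assumption: an empty cell or the string "NA" marks a missing value.
MISSING = {"", "NA"}

def count_incomplete_rows(csv_file):
    """Count data rows that have at least one missing value."""
    rows = csv.reader(csv_file)
    header = next(rows)  # first line holds the column names
    return sum(
        1
        for row in rows
        if len(row) < len(header) or any(cell.strip() in MISSING for cell in row)
    )

# Hypothetical sample data standing in for data1.csv.
sample = io.StringIO("id,age,city\n1,34,Ames\n2,,Boston\n3,51,NA\n")
print(count_incomplete_rows(sample))
```

As in the R version, a result of 0 means no row has a missing value.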

Sunday, August 11, 2019

Twitter sentiment analysis

Sentiment analysis of tweets using a naive Bayes classifier

https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed

Saturday, August 10, 2019

very good article on text mining using r and corpus

Text Mining - Exploration of a Text Corpus with R

https://rstudio-pubs-static.s3.amazonaws.com/163802_0f005a14bcfb4c4b8ee17ac8a8e6c3e9.html

Friday, August 9, 2019

Possible fundamental limitations of predictive models based on data fitting

1) History cannot always accurately predict the future. Using relations derived from historical data to predict the future implicitly assumes there are certain lasting conditions or constants in a complex system. This almost always leads to some imprecision when the system involves people.

2) The issue of unknown unknowns. In all data collection, the collector first defines the set of variables for which data is collected. However, no matter how extensive the collector considers his/her selection of the variables, there is always the possibility of new variables that have not been considered or even defined, yet are critical to the outcome.

3) Adversarial defeat of an algorithm. After an algorithm becomes an accepted standard of measurement, it can be taken advantage of by people who understand the algorithm and have the incentive to fool or manipulate the outcome. This is what happened to the CDO rating described above. The CDO dealers actively fulfilled the rating agencies' input to reach an AAA or super-AAA on the CDO they were issuing, by cleverly manipulating variables that were "unknown" to the rating agencies' "sophisticated" models.

https://indatalabs.com/blog/predictive-models-performance-evaluation-important

Wednesday, August 7, 2019

building classifier using naive bayes algorithm

https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
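
To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing, in plain Python. It is not the code from the linked article; the function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs is a list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return label_counts, word_counts, vocab

def predict(model, words):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                # add-one smoothing avoids log(0) for unseen word/label pairs
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
training = [
    ("great fun movie".split(), "pos"),
    ("loved it great".split(), "pos"),
    ("boring bad movie".split(), "neg"),
    ("bad awful plot".split(), "neg"),
]
model = train(training)
print(predict(model, "great movie".split()))
print(predict(model, "awful boring".split()))
```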

Tuesday, August 6, 2019

A comprehensive Python tutorial

https://www.youtube.com/watch?v=_uQrJ0TkZlc

Django

Overleaf is a good website for LaTeX

Overleaf is a good website for online LaTeX editing:

overleaf.com

Tuesday, July 9, 2019

Underfitting and overfitting, n and p

Overfitting refers to a model that fits the training data too closely, noise included, and so generalizes poorly to new data.

Underfitting refers to a model that can neither model the training data nor generalize to new data.

Suppose we have p parameters and n samples.
Overfitting results from trying to estimate too many parameters from too small a sample, for example when p > n.

Removing a feature lowers p, so it tends to decrease the degree of overfitting.
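
The effect of having as many parameters as samples can be sketched in plain Python. The example below is an illustration under made-up assumptions (a true relation y = 2x + 1, Gaussian noise, 8 training points): a straight line has 2 parameters, while a polynomial through every training point has 8. The polynomial reproduces the noisy training data exactly, which is the signature of overfitting; on fresh points it will typically, though not always, do worse than the line.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line: 2 parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every point: as many parameters as samples."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x = [i / 7 for i in range(8)]
train_y = [2 * x + 1 + rng.gauss(0, 0.3) for x in train_x]
test_x = [(i + 0.5) / 7 for i in range(7)]
test_y = [2 * x + 1 + rng.gauss(0, 0.3) for x in test_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)
print("line: train", mse(line, train_x, train_y), "test", mse(line, test_x, test_y))
print("poly: train", mse(poly, train_x, train_y), "test", mse(poly, test_x, test_y))
```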

ECS/EKS container services, Docker, Airflow, Snowflake database

ECS/EKS container services

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

Docker is a software platform for building applications based on containers — small and lightweight execution environments that make shared use of the operating system kernel but otherwise run in isolation from one another. While containers as a concept have been around for some time, Docker, an open source project launched in 2013, helped popularize the technology, and has helped drive the trend towards containerization and microservices in software development that has come to be known as cloud-native development.

Docker is a software platform that allows you to build, test, and deploy applications quickly. Docker packages software into standardized units called containers that have everything the software needs to run including libraries, system tools, code, and runtime. Using Docker, you can quickly deploy and scale applications into any environment and know your code will run.

containers amazon offers

https://aws.amazon.com/containers/services/

I used EMR before

https://aws.amazon.com/emr/

a tutorial for docker

https://www.youtube.com/watch?v=K6WER0oI-qs

Airflow: a platform to programmatically author, schedule and monitor workflows.

a short summary

https://blog.insightdatascience.com/airflow-101-start-automating-your-batch-workflows-with-ease-8e7d35387f94

https://airflow.apache.org/project.html

how to install

https://airflow.apache.org/installation.html

video tutorial

https://www.youtube.com/watch?v=AHMm1wfGuHE
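
The core idea behind a workflow tool like Airflow, tasks that run in dependency order, can be sketched with the Python standard library. To be clear, this is not Airflow's API: the task names, the toy functions, and the use of graphlib (Python 3.9+) are assumptions made only to illustrate the concept.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks; each receives the results of the tasks that ran before it.
tasks = {
    "extract": lambda results: [1, 2, 3],
    "transform": lambda results: [x * 10 for x in results["extract"]],
    "load": lambda results: len(results["transform"]),
}

# Each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Work out an order in which every task runs after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

results = {}
for name in run_order:
    results[name] = tasks[name](results)

print(run_order)
print(results)
```

A real scheduler adds what this sketch leaves out, such as run schedules, retries, and monitoring.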

Snowflake database: a cloud-based data warehouse

https://docs.snowflake.net/manuals/user-guide/getting-started-tutorial.html

Monday, July 8, 2019

7 tips to learn programming faster

https://www.codingdojo.com/blog/7-tips-learn-programming-faster

#3 will land you a job

1. Learn by doing
2. Grasp the fundamentals for long-term benefit
3. Code by hand, using a pen and paper
4. Ask for help
5. Seek out more online resources
6. Don't just read the sample code, tinker with it
7. Take breaks when debugging

How to run a Python script in Atom

How to run a Python script in Atom:

Mac: Shift + Command + I
Mac: Command + I

Linux/Windows: Shift + Ctrl + B

A thesis from a PhD and what he has done since graduation

Here is a PhD thesis:

https://lib.dr.iastate.edu/etd/13537/

The title of the thesis is

A balanced approach to the multi-class imbalance problem

After graduation, the author did not go to work for a company; he opened his own consulting firm instead:

Omni Analytics Group

https://omnianalytics.io/

one good sql tutorial and some good machine learning channels

one good sql tutorial

https://www.youtube.com/watch?v=nWeW3sCmD2k

some good machine learning channels

https://www.youtube.com/user/joshstarmer/videos

https://www.youtube.com/user/mathtutordvd/videos

https://www.youtube.com/channel/UCq8JbYayUHvKvjimPV0TCqQ/videos

https://www.youtube.com/user/edurekaIN/videos

https://www.youtube.com/channel/UC8butISFwT-Wl7EV0hUK0BQ/videos

Monday, June 24, 2019

Understanding self and the __init__ method in a Python class

https://micropyramid.com/blog/understand-self-and-__init__-method-in-python-class/

self :

self represents the instance of the class. Through "self" we can access the attributes and methods of the class in Python. Note that self is a naming convention for the first parameter of a method, not a reserved keyword.

__init__ :

"__init__" is a reserved method in Python classes. It is commonly called a constructor in object-oriented terms. This method is called when an object is created from the class, and it allows the class to initialize the attributes of the object.

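A small example may make self and __init__ concrete. The Point class and its method name are hypothetical and chosen only for illustration.

```python
class Point:
    def __init__(self, x, y):
        # __init__ runs when Point(...) is called; self is the new instance
        self.x = x
        self.y = y

    def distance_from_origin(self):
        # self gives the method access to this instance's attributes
        return (self.x ** 2 + self.y ** 2) ** 0.5

p = Point(3, 4)  # Python passes the new object as self automatically
print(p.x, p.y)
print(p.distance_from_origin())
```

Calling p.distance_from_origin() is equivalent to Point.distance_from_origin(p), which is why every method lists self as its first parameter.
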
Friday, June 14, 2019

classification

classification, how to do it

http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions67/ClassificationAnalysisReading.html

Monday, June 10, 2019

L1 and L2 regularization

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

https://www.linkedin.com/pulse/l1-l2-regularization-why-neededwhat-doeshow-helps-ravi-shankar

When p >> n, OLS overfits. To reduce overfitting we use regularization, L1 or L2. L1 (lasso) forces some parameters to be exactly zero. L2 (ridge) shrinks the parameters toward zero but does not set them exactly to zero, so it keeps all the parameters in the model.
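
The difference shows up clearly in the simplest case, a single coefficient with unpenalized estimate b. Minimizing 0.5*(b - beta)^2 + lam*|beta| gives the lasso solution sign(b)*max(|b| - lam, 0), while minimizing 0.5*(b - beta)^2 + 0.5*lam*beta^2 gives the ridge solution b/(1 + lam). The sketch below applies both formulas; the coefficient values and lam = 1 are made up for illustration.

```python
import math

def lasso_shrink(b, lam):
    """L1 solution for one coefficient: soft-thresholding."""
    return math.copysign(max(abs(b) - lam, 0.0), b)

def ridge_shrink(b, lam):
    """L2 solution for one coefficient: proportional shrinkage."""
    return b / (1.0 + lam)

# Hypothetical unpenalized estimates and penalty strength.
estimates = [3.0, 0.5, -0.2, -2.0]
lam = 1.0

lasso = [lasso_shrink(b, lam) for b in estimates]
ridge = [ridge_shrink(b, lam) for b in estimates]
print("lasso:", lasso)  # small coefficients become exactly zero
print("ridge:", ridge)  # every coefficient shrinks, none becomes zero
```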

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

1. Key Difference

Ridge: It includes all (or none) of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity.
Lasso: Along with shrinking coefficients, lasso performs feature selection as well. (Remember the ‘selection‘ in the lasso full-form?) As we observed earlier, some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model.

Traditionally, techniques like stepwise regression were used to perform feature selection and make parsimonious models. But with advancements in Machine Learning, ridge and lasso regression provide very good alternatives as they give much better output, require fewer tuning parameters and can be automated to a large extent.

2. Typical Use Cases

Ridge: It is majorly used to prevent overfitting. Since it includes all the features, it is not very useful in case of exorbitantly high #features, say in millions, as it will pose computational challenges.
Lasso: Since it provides sparse solutions, it is generally the model of choice (or some variant of this concept) for modelling cases where the #features are in millions or more. In such a case, getting a sparse solution is of great computational advantage as the features with zero coefficients can simply be ignored.

It's not hard to see why the stepwise selection techniques become practically very cumbersome to implement in high dimensionality cases. Thus, lasso provides a significant advantage.

3. Presence of Highly Correlated Features

Ridge: It generally works well even in presence of highly correlated features as it will include all of them in the model but the coefficients will be distributed among them depending on the correlation.
Lasso: It arbitrarily selects any one feature among the highly correlated ones and reduces the coefficients of the rest to zero. Also, the chosen variable changes randomly with change in model parameters. This generally doesn’t work that well as compared to ridge regression.

Wednesday, June 5, 2019

Word header and footer

https://www.youtube.com/watch?v=lNdjuIYuB3o

How to Create Stunning Flowcharts in Microsoft Word

https://www.youtube.com/watch?v=iiS7aAFI2Cs

https://www.youtube.com/watch?v=hjhJ3-jSBM8

Thursday, May 30, 2019

Classifier for imbalanced data

Classification with imbalanced data can be handled with resampling methods such as SMOTE.

Here is an example and some sample scripts:

https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18

This thesis presents a method for this problem:
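
The core step of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration, not the reference algorithm or any library's implementation; the function name smote_like, the parameter defaults, and the sample points are assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class points.

    Needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # the k nearest minority neighbours of base, excluding base itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # pick a random point on the segment between base and neighbour
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority_points, n_new=6)
print(len(new_points))
```

The synthetic points are added to the minority class before training, so the classifier sees a more balanced data set.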

https://lib.dr.iastate.edu/cgi/viewcontent.cgi?referer=https://www.bing.com/&httpsredir=1&article=4544&context=etd

A balanced approach to the multi-class imbalance problem

R package , climm, climer

Tuesday, May 28, 2019

Timeline of programming languages

https://en.m.wikipedia.org/wiki/Timeline_of_programming_languages

Monday, May 27, 2019

Data science blogs

R for Data Science
https://r4ds.had.co.nz/introduction.html
A Complete Tutorial to learn Data Science in R from Scratch
https://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-science-scratch/
Yhat Blog
http://blog.yhat.com/
A Complete Tutorial to Learn Data Science with Python from Scratch
https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
R-bloggers
https://www.r-bloggers.com/
Python Bloggers
http://www.pybloggers.com/
Data Science Central
https://www.datasciencecentral.com/
A data scientist's blog
https://machinelearningmastery.com/blog/
Apache Spark Machine Learning Tutorial
https://mapr.com/blog/apache-spark-machine-learning-tutorial/
Data Science 101
https://101.datascience.community/page/2/
Win Vector Blog
http://www.win-vector.com/blog/
Big data and data science review
bigdatadatascience.docx
infoq
https://www.infoq.com/ai-ml-data-eng
datatau
https://www.datatau.com/
Lambda the Ultimate
http://lambda-the-ultimate.org/
Simply Statistics
https://simplystatistics.org/
Statistical Modeling, Causal Inference and Social Science
https://statmodeling.stat.columbia.edu/
Flowing data
https://flowingdata.com/
Data 36
https://data36.com/
Kaggle Blog
http://blog.kaggle.com/
Linear Digressions
https://lineardigressions.com/
Towards Data Science
https://towardsdatascience.com/
Seeing Theory
https://seeing-theory.brown.edu/
Mode Blog
https://mode.com/blog/

Consumer analytics blogs

Top 50 blogs on Consumer Analytics
some of the blogs no longer exist
https://www.ngdata.com/best-customer-analytics-blogs/
How to Use Customer Behavior Data to Drive Revenue (Like Amazon, Netflix & Google)
https://www.pointillist.com/blog/customer-behavior-data/
Using R for Customer Analytics
https://ds4ci.files.wordpress.com/2013/09/ciwr_2introandpracticals.pdf
Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn
https://www.business-science.io/business/2017/11/28/customer_churn_analysis_keras.html
Marketing Analytics and Data Science
https://www.r-bloggers.com/marketing-analytics-and-data-science/
Using R to predict if a customer will buy
https://www.masterdataanalysis.com/r/using-r-predict-customer-will-buy/
Customer Segmentation using python
http://blog.yhat.com/posts/customer-segmentation-using-python.html
Using R for customer segmentation
https://ds4ci.files.wordpress.com/2013/09/user08_jimp_custseg_revnov08.pdf
Using r to analyze your customer data warehouse
https://www.bedrockdata.com/blog/using-r-to-analyze-your-customer-data-warehouse

Thursday, May 2, 2019

Two data science blogs that seem pretty good

The Win-Vector blog seems pretty good for data science

http://www.win-vector.com/blog/

Data Science Dojo blog

https://blog.datasciencedojo.com/

Tuesday, April 30, 2019

Consumer analytics articles

https://www.ngdata.com/best-customer-analytics-blogs/

https://ds4ci.files.wordpress.com/2013/09/ciwr_2introandpracticals.pdf

https://www.business-science.io/business/2017/11/28/customer_churn_analysis_keras.html

https://www.r-bloggers.com/marketing-analytics-and-data-science/

https://www.masterdataanalysis.com/r/using-r-predict-customer-will-buy/

http://blog.yhat.com/posts/customer-segmentation-using-python.html

https://ds4ci.files.wordpress.com/2013/09/user08_jimp_custseg_revnov08.pdf

https://www.bedrockdata.com/blog/using-r-to-analyze-your-customer-data-warehouse

https://legacy.gitbook.com/book/josepcurtodiaz/customer-analytics-with-r/details

Wednesday, April 24, 2019

Two articles about classifiers

Metrics to evaluate machine learning algorithms

https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/
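
A few common classification metrics can be computed by hand, as in this minimal sketch. It is not the code from the linked article; the function name and the labels are made up for illustration. The example is deliberately imbalanced, which shows why accuracy alone can look better than precision and recall.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: only 2 of 10 samples are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```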

How to handle imbalanced data in classification

https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/

Thursday, April 4, 2019

One trick on big data analytics

I once worked on big data projects where I analyzed 5,000,000,000 rows of data each day using Hadoop/Hive. Running the scripts over the data took a long time. When a script had an error, the program would break and I had to start over, which cost more time. So projects sometimes took a relatively long time to finish.

So when you have this problem, start with small samples of the data. The programs run faster, errors show up sooner, and you get the job done sooner.
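
One way to draw a small random sample from data too large to hold in memory is reservoir sampling, sketched below in plain Python. This is an illustration rather than what was used on the Hadoop/Hive projects; the function name, the sample size, and the generated rows are assumptions.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of any length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row  # replace an old row with decreasing probability
    return sample

# Hypothetical stand-in for a large input, read one row at a time.
big_input = (f"row-{i}" for i in range(100_000))
small = reservoir_sample(big_input, k=1000)
print(len(small))
```

Scripts can then be developed and debugged against the sample, and run on the full data only once they work.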

Thursday, March 21, 2019

Feature engineering for machine learning

https://perso.limsi.fr/annlor/enseignement/ensiie/Feature_Engineering_for_Machine_Learning.pdf

Feature engineering is an important topic in predictive modeling.
