Saturday, 30 March 2013

Business Application IT Lab--Plotting in R



IT Business Application Lab Assignments#10

Session 10
Date:  26th March,2013



Assignment 1:

Create 3 vectors x, y and z of equal length, fill them with random values, and bind them together. Then create 3-dimensional plots of the result.

Solution:


Step 1: Creating a random data set of 50 items with mean =30 and standard deviation =10

> data <- rnorm(50,mean=30,sd=10)
> data

Step 2:

Taking sample data of length 5 from the created data set in three different vectors x,y,z
> x <- sample(data,5)
> x

> y <- sample(data,5)
> y

> z <- sample(data,5)
> z

Binding the three vectors x, y and z column-wise into a matrix c using cbind
> c <- cbind(x,y,z)
> c

Output:




Plotting of 3 dimensional graphs:

Command:

> library(rgl)   # plot3d() comes from the rgl package
> plot3d(c[,1:3])

Output:



Plotting of graph with labels for axis and colors

Command:

> plot3d(c[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(500)) 

Output:



Plotting of graph with labels for axis and colors and type "Spheres"

Command:

> plot3d(c[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(500),type="s")

Output:



Plotting of graph with labels for axis and colors and type "Points"

Command:

> plot3d(c[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(500),type="p")

Output:



Plotting of graph with labels for axis and colors and type "Line"

Command:

> plot3d(c[,1:3], xlab="X Axis" , ylab="Y Axis" , zlab="Z Axis", col=rainbow(500),type="l")

Output:



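For a reproducible version of the steps above, a minimal sketch follows. It assumes the rgl package is installed; the seed value and the matrix name m are illustrative choices and not part of the original lab. Using m instead of c avoids any confusion with R's built-in c() function.

```r
# Reproducible variant of Assignment 1 (seed and names are illustrative)
set.seed(42)
data <- rnorm(50, mean = 30, sd = 10)

x <- sample(data, 5)
y <- sample(data, 5)
z <- sample(data, 5)

# cbind() returns a 5 x 3 matrix with columns named x, y and z
m <- cbind(x, y, z)
print(dim(m))

# Draw the 3D scatter plot only in an interactive session with rgl available
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(m, xlab = "X Axis", ylab = "Y Axis", zlab = "Z Axis",
              col = rainbow(nrow(m)), type = "s")
}
```

Passing rainbow(nrow(m)) gives exactly one color per point, whereas rainbow(500) generates 500 colors of which only the first five are used.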

Assignment 2:


Choose 2 random variables, x and y, and create the following plots:
1. X-Y 
2. X-Y|Z (introduce a third variable z with 5 different categories and cbind it to x and y)
3. Color code and draw the graph 
4. Smooth and best fit line for the curve

Solution:

Command:


> x <- rnorm(5000, mean= 20 , sd=10)
> y <- rnorm(5000, mean= 10, sd=10)
> z1 <- sample(letters, 5)
> z2 <- sample(z1, 5000, replace=TRUE)
> z <- as.factor(z2)
> z

Output:



Creating Quickplots

Command:

> library(ggplot2)   # qplot() comes from the ggplot2 package
> qplot(x,y)

Output:


Command:

> qplot(x,z)

Output:


Creating Semi-Transparent plot

Command:


> qplot(x,z, alpha=I(2/10))


Output:


Creating Colored plot

Command:

> qplot(x,y, color=z)

Output:


Creating Logarithmic Color plot

Command:

> qplot(log(x),log(y), color=z)

Output:


Best fit and smooth curve using "geom"

Command:

> qplot(x,y,geom=c("path","smooth"))

Output:


Command:

> qplot(x,y,geom=c("point","smooth"))

Output:


Command:

> qplot(x,y,geom=c("boxplot","jitter"))

Output:


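In recent ggplot2 releases qplot() is deprecated, so the same plots may be easier to maintain in the full ggplot() grammar. The sketch below is one possible translation, assuming ggplot2 is installed; the data frame name df and the seed are illustrative.

```r
# Simulated data mirroring the lab: two numeric variables and a 5-level factor
set.seed(1)
df <- data.frame(
  x = rnorm(5000, mean = 20, sd = 10),
  y = rnorm(5000, mean = 10, sd = 10),
  z = factor(sample(sample(letters, 5), 5000, replace = TRUE))
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Semi-transparent, color-coded scatter plot with a linear best-fit line per category
  p <- ggplot(df, aes(x = x, y = y, colour = z)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", se = FALSE)
  # X-Y | Z: one panel per category of z
  print(p + facet_wrap(~ z))
}
```

Here facet_wrap(~ z) produces the conditional X-Y|Z view, and method = "lm" draws a straight best-fit line; dropping the method argument lets geom_smooth() pick a smoother instead.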



Saturday, 23 March 2013

QlikView-A Data Visualization Tool




Session 9
Date:  19th March,2013


QlikView is an infographics and data visualization tool that I have studied and found to be sophisticated yet user-friendly.

The QlikView Business Discovery platform delivers true self-service BI that empowers business users by driving innovative decision-making.


Features:

QlikView is one of the most widely used data visualization tools. It enables the user to:
  • Consolidate relevant data from multiple sources into a single application
  • Explore the associations in the data
  • Make decisions socially through secure, real-time collaboration
  • Visualize data with engaging, state-of-the-art graphics
  • Search across all data, directly and indirectly
  • Interact with dynamic apps, dashboards and analytics
  • Access, analyze and capture data from mobile devices

How QlikView differs from other tools:
  • Has an inference engine that maintains the associations in the data automatically
  • Calculates aggregations on the fly, as needed, for a super-fast user experience
  • Compresses data down to 10% of its original size to optimize the power of the processors
  • Accomplishes both within a single, comprehensive product

Go to http://ap.demo.qlikview.com/download/.

Install the application with valid credentials.

The home screen looks like:



Choose any supported file.

I chose an Excel file containing a few days of NIFTY historical data, as follows:

Date        Open     High     Low      Close    Shares Traded  Turnover (Rs. Cr)
1-Oct-12    5704.75  5722.95  5694     5718.8   123138510      4798.17
3-Oct-12    5727.7   5743.25  5715.8   5731.25  165037864      6654.02
4-Oct-12    5751.55  5807.25  5751.35  5787.6   171404290      6954.74
5-Oct-12    5815     5815.35  4888.2   5746.95  255569804      12995.8
8-Oct-12    5751.85  5751.85  5666.2   5676     142319000      5853.56
9-Oct-12    5708.15  5728.65  5677.9   5704.6   119300415      5047.01
10-Oct-12   5671.15  5686.5   5647.05  5652.15  126294361      4564.39

After loading the data, several types of visualization are available, such as:
Bar chart
Line chart
Combo chart
Scatter chart
Grid chart
Straight Table
Pivot Table

I used some of the charts mentioned above to arrive at some observations:

Fig 1:


Fig 2:



Fig 3:




Some areas where QlikView falls short:
  • Scalability: QlikView works well when the database is small, but in practical cases the database is rarely small.
  • Alerts: the capability to create alerts and deliver them not only by email but also to BlackBerry devices, handheld devices, mobile phones etc.
  • Multi-user development environment: this feature lets multiple developers work on a single project, with a utility that synchronizes the pieces each developer is working on with the main project. QlikView completely lacks this feature.
  • Connecting to and extracting data from multidimensional objects.
  • Support for advanced features such as an embedded browser (available in Hyperion Interactive Reporting), flickers (rolling messages) etc. as standard options.

Friday, 15 March 2013

Business Application IT Lab


IT Business Application Lab Assignments#8

Session 8
Date:  12th March,2013



The data set we have used in this assignment is "Produc".

The description for the same follows

- state : the state
- year : the year
- pcap: public capital stock
- hwy : highway and streets
- pc: private capital stock
- gsp: gross state product
- emp: labor input measured by the employment in non–agricultural payrolls
- unemp: state unemployment rate



Assignment :
Estimate all 3 models (pooling, fixed and random) and decide which model best fits the data set for panel estimation.


Solution:

Calculating value for Pooling Model




Calculating value for Fixed Model



Calculating value for Random Model



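The commands that fit the three models are not shown in this post. The sketch below is one plausible way to produce the objects pooled, fixed1 and random1 used in the tests that follow, assuming the plm package is installed. The formula is an assumed Cobb-Douglas style specification and may differ from the one used in the lab.

```r
if (requireNamespace("plm", quietly = TRUE)) {
  library(plm)
  data("Produc", package = "plm")

  # Assumed specification; the lab's actual formula is not shown in the post
  fm <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp

  # Same formula and panel index, three different estimators
  pooled  <- plm(fm, data = Produc, index = c("state", "year"), model = "pooling")
  fixed1  <- plm(fm, data = Produc, index = c("state", "year"), model = "within")
  random1 <- plm(fm, data = Produc, index = c("state", "year"), model = "random")

  print(summary(fixed1))
}
```

In plm, the fixed effects estimator is requested with model = "within"; the index argument names the individual (state) and time (year) columns.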

To choose the best model that fits the data set "Produc" ,we need to run pairwise hypothesis tests among the 3 models and select the best fit in the end.


Test 1:


Between pooling and fixed model

Command :
pFtest (fixed1 , pooled)




Test details :
H0 (null): the individual index and time based params are all zero
H1: At least one of the index and time based params is non-zero

The p-value is very low, so we reject the null hypothesis: the effects in the alternative hypothesis are significant.

Hence Fixed model is better than the pooling model.



Test2:
Between pooling and random model

Command :
plmtest (pooled)





Test details :
H0 (null): the individual index and time based params are all zero : Pooling Model
H1: At least one of the index and time based params is non-zero : Random Model

The p-value is very low, so we reject the null hypothesis: the effects in the alternative hypothesis are significant.

Hence random model is better than the pooling model.



Test3:
Between fixed and random model

Command :
We use the Hausman test:
phtest(random1 , fixed1)




Test details :
H0 (null): individual effects are not correlated with any regressor : Random Model
H1: Individual effects are correlated with the regressors : Fixed Model

The p-value is very low, so we reject the null hypothesis: the test indicates that the random model is inconsistent.

Hence fixed model is better than random model.



Conclusion :-
We can conclude that the fixed model best fits the "Produc" panel data, i.e. the individual effects are significant and are correlated with the regressor variables.
Hence, we would choose the "Fixed" model to estimate the panel data presented by the "Produc" data set.








Wednesday, 13 February 2013

Business Application IT Lab




IT Business Application Lab Assignments#6

Session 6
Date:  12th Feb,2013


Assignment 1:
Create log return data for a period of more than 1 year (1.1.2012-1.1.2013) and use this data to calculate historical volatility.

Commands:

> stockprice<-read.csv(file.choose(),header=T)
> head(stockprice)
> closingprice<-stockprice[,5]   # column 5 holds the closing price; stockprice$5 is not valid R
> closingprice.ts<-ts(closingprice,frequency=252)
> returns<-(closingprice.ts-lag(closingprice.ts,k=-1))/lag(closingprice.ts,k=-1)
> z<-scale(returns)+10
> returnslog<-log(z)
> returnslog
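The commands above shift the standardized returns by 10 before taking logs. For comparison, the more conventional definition of a log return is the difference of log prices. The sketch below shows that alternative on simulated prices; the price series, seed and variable names are hypothetical stand-ins for the CSV data.

```r
# Hypothetical closing prices standing in for the CSV column
set.seed(7)
price <- 5000 * exp(cumsum(rnorm(252, mean = 0, sd = 0.01)))

# Log return for day t: log(P_t / P_{t-1})
logret <- diff(log(price))

# Annualise the daily standard deviation with 252 trading days
hist_vol <- sd(logret) * sqrt(252)
print(hist_vol)
```

With this definition the volatility is on the scale of the returns themselves, so it can be read directly as an annualised fraction.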




Assignment 2:
Plot the ACF of the log returns data, run an ADF test and interpret the result.

For acf plotting
> acf(returnslog)

Output:



From the above graph we can see that the autocorrelations lie within the 95% confidence bounds. There is no significant autocorrelation, which is consistent with a stationary time series.
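The visual check can also be done numerically. The following sketch, on a simulated series rather than the lab data, computes the approximate 95% bound that acf() draws and counts how many lags fall outside it.

```r
set.seed(3)
series <- rnorm(250)              # stand-in for the return series (white noise)
n <- length(series)

a <- acf(series, plot = FALSE)
bound <- qnorm(0.975) / sqrt(n)   # approximate 95% bound drawn by acf()

# Lag 0 is always 1, so drop it before comparing
lags <- a$acf[-1]
outside <- sum(abs(lags) > bound)
cat("bound:", round(bound, 4), " lags outside:", outside, "of", length(lags), "\n")
```

Even for pure noise, roughly 1 lag in 20 is expected to cross the bound by chance, so an occasional excursion is not evidence of autocorrelation.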



Commands:

> library(tseries)   # adf.test() comes from the tseries package
> T=252^0.5
> historicalvolatility<-sd(returnslog)*T
> historicalvolatility
> adf.test(returnslog)

Output:



From the above output we find that the p-value is 0.01, which is < 0.05.

So we reject the null hypothesis of a unit root (non-stationarity) and accept the alternative hypothesis that the time series is stationary.
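To see how the test behaves, it can help to run it on two series whose properties are known. The sketch below assumes the tseries package is installed and uses simulated data. Note that adf.test() reports p-values only within the range of its lookup table (0.01 to 0.1) and warns when the true value lies outside it, so a printed 0.01 is best read as "0.01 or smaller".

```r
set.seed(11)
noise <- rnorm(500)            # stationary by construction
rw    <- cumsum(rnorm(500))    # random walk: has a unit root

if (requireNamespace("tseries", quietly = TRUE)) {
  # adf.test() returns an "htest" object; p.value holds the reported p-value
  p_noise <- suppressWarnings(tseries::adf.test(noise))$p.value
  p_rw    <- suppressWarnings(tseries::adf.test(rw))$p.value
  cat("p-value, noise:", p_noise, " random walk:", p_rw, "\n")
}
```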


Thursday, 7 February 2013

Business Application IT Lab

IT Business Application Lab Assignments#5

Session 5
Date:  5th Feb,2013


Assignment 1 

Find the returns of NSE data covering more than 6 months, taking the 10th data point as the start and the 95th data point as the end.

Plot the returns.


Data Set: S&P CNX NIFTY from 01/06/2012 to 31/01/2013

Output:


> z<-read.csv(file.choose(),header=T)
> head(z)
         Date    Open    High     Low   Close Shares.Traded Turnover..Rs..Cr.
1 01-Jun-2012 4910.85 4925.00 4831.75 4841.60     138767416           4989.22
2 04-Jun-2012 4797.30 4858.30 4770.35 4848.15     152339865           5143.54
3 05-Jun-2012 4869.45 4898.95 4847.70 4863.30     141476962           5016.41
4 06-Jun-2012 4886.65 5010.45 4886.15 4997.10     185340406           7188.42
5 07-Jun-2012 5035.35 5059.65 5007.75 5049.65     150558164           6077.37
6 08-Jun-2012 5044.25 5084.45 4994.80 5068.35     138389395           5249.81
> open<-z$Open[10:95]
> open.ts<-ts(open,deltat=1/252)
> open.ts
Time Series:
Start = c(1, 1)
End = c(1, 86)
Frequency = 252
 [1] 5105.10 5069.55 5174.00 5050.80 5114.55 5097.35 5101.75 5158.50 5107.45 5149.45 5148.95 5191.25 5283.85 5298.85 5310.40 5297.05 5324.70 5283.70 5286.60 5315.25
[21] 5240.00 5242.75 5232.35 5228.05 5199.10 5249.85 5233.55 5163.25 5128.80 5118.40 5126.30 5124.30 5129.75 5214.85 5220.70 5233.10 5195.60 5260.85 5295.40 5345.25
[41] 5348.30 5308.20 5316.35 5343.25 5385.95 5368.60 5368.70 5395.75 5426.15 5392.60 5387.85 5348.05 5343.85 5268.60 5298.20 5276.50 5249.15 5243.90 5217.65 5309.45
[61] 5343.65 5361.90 5336.10 5404.45 5435.20 5528.35 5631.75 5602.40 5536.95 5577.00 5691.95 5674.90 5653.40 5673.75 5684.80 5704.75 5727.70 5751.55 5815.00 5751.85
[81] 5708.15 5671.15 5663.50 5681.70 5674.25 5705.60
> summary(open.ts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5051    5218    5309    5356    5433    5815
> z.diff<-diff(open.ts)
> z.diff
Time Series:
Start = c(1, 2)
End = c(1, 86)
Frequency = 252
 [1]  -35.55  104.45 -123.20   63.75  -17.20    4.40   56.75  -51.05   42.00   -0.50   42.30   92.60   15.00   11.55  -13.35   27.65  -41.00    2.90   28.65  -75.25
[21]    2.75  -10.40   -4.30  -28.95   50.75  -16.30  -70.30  -34.45  -10.40    7.90   -2.00    5.45   85.10    5.85   12.40  -37.50   65.25   34.55   49.85    3.05
[41]  -40.10    8.15   26.90   42.70  -17.35    0.10   27.05   30.40  -33.55   -4.75  -39.80   -4.20  -75.25   29.60  -21.70  -27.35   -5.25  -26.25   91.80   34.20
[61]   18.25  -25.80   68.35   30.75   93.15  103.40  -29.35  -65.45   40.05  114.95  -17.05  -21.50   20.35   11.05   19.95   22.95   23.85   63.45  -63.15  -43.70
[81]  -37.00   -7.65   18.20   -7.45   31.35
> returns<-cbind(open.ts,z.diff,lag(open.ts,k=-1))
> returns
Time Series:
Start = c(1, 1)
End = c(1, 87)
Frequency = 252
         open.ts  z.diff lag(open.ts, k = -1)
1.000000 5105.10      NA                   NA
1.003968 5069.55  -35.55              5105.10
1.007937 5174.00  104.45              5069.55
1.011905 5050.80 -123.20              5174.00
1.015873 5114.55   63.75              5050.80
1.019841 5097.35  -17.20              5114.55
1.023810 5101.75    4.40              5097.35
1.027778 5158.50   56.75              5101.75
1.031746 5107.45  -51.05              5158.50
1.035714 5149.45   42.00              5107.45
1.039683 5148.95   -0.50              5149.45
1.043651 5191.25   42.30              5148.95
1.047619 5283.85   92.60              5191.25
1.051587 5298.85   15.00              5283.85
1.055556 5310.40   11.55              5298.85
1.059524 5297.05  -13.35              5310.40
1.063492 5324.70   27.65              5297.05
1.067460 5283.70  -41.00              5324.70
1.071429 5286.60    2.90              5283.70
1.075397 5315.25   28.65              5286.60
1.079365 5240.00  -75.25              5315.25
1.083333 5242.75    2.75              5240.00
1.087302 5232.35  -10.40              5242.75
1.091270 5228.05   -4.30              5232.35
1.095238 5199.10  -28.95              5228.05
1.099206 5249.85   50.75              5199.10
1.103175 5233.55  -16.30              5249.85
1.107143 5163.25  -70.30              5233.55
1.111111 5128.80  -34.45              5163.25
1.115079 5118.40  -10.40              5128.80
1.119048 5126.30    7.90              5118.40
1.123016 5124.30   -2.00              5126.30
1.126984 5129.75    5.45              5124.30
1.130952 5214.85   85.10              5129.75
1.134921 5220.70    5.85              5214.85
1.138889 5233.10   12.40              5220.70
1.142857 5195.60  -37.50              5233.10
1.146825 5260.85   65.25              5195.60
1.150794 5295.40   34.55              5260.85
1.154762 5345.25   49.85              5295.40
1.158730 5348.30    3.05              5345.25
1.162698 5308.20  -40.10              5348.30
1.166667 5316.35    8.15              5308.20
1.170635 5343.25   26.90              5316.35
1.174603 5385.95   42.70              5343.25
1.178571 5368.60  -17.35              5385.95
1.182540 5368.70    0.10              5368.60
1.186508 5395.75   27.05              5368.70
1.190476 5426.15   30.40              5395.75
1.194444 5392.60  -33.55              5426.15
1.198413 5387.85   -4.75              5392.60
1.202381 5348.05  -39.80              5387.85
1.206349 5343.85   -4.20              5348.05
1.210317 5268.60  -75.25              5343.85
1.214286 5298.20   29.60              5268.60
1.218254 5276.50  -21.70              5298.20
1.222222 5249.15  -27.35              5276.50
1.226190 5243.90   -5.25              5249.15
1.230159 5217.65  -26.25              5243.90
1.234127 5309.45   91.80              5217.65
1.238095 5343.65   34.20              5309.45
1.242063 5361.90   18.25              5343.65
1.246032 5336.10  -25.80              5361.90
1.250000 5404.45   68.35              5336.10
1.253968 5435.20   30.75              5404.45
1.257937 5528.35   93.15              5435.20
1.261905 5631.75  103.40              5528.35
1.265873 5602.40  -29.35              5631.75
1.269841 5536.95  -65.45              5602.40
1.273810 5577.00   40.05              5536.95
1.277778 5691.95  114.95              5577.00
1.281746 5674.90  -17.05              5691.95
1.285714 5653.40  -21.50              5674.90
1.289683 5673.75   20.35              5653.40
1.293651 5684.80   11.05              5673.75
1.297619 5704.75   19.95              5684.80
1.301587 5727.70   22.95              5704.75
1.305556 5751.55   23.85              5727.70
1.309524 5815.00   63.45              5751.55
1.313492 5751.85  -63.15              5815.00
1.317460 5708.15  -43.70              5751.85
1.321429 5671.15  -37.00              5708.15
1.325397 5663.50   -7.65              5671.15
1.329365 5681.70   18.20              5663.50
1.333333 5674.25   -7.45              5681.70
1.337302 5705.60   31.35              5674.25
1.341270      NA      NA              5705.60
> returns<-z.diff/lag(open.ts,k=-1)
> returns
Time Series:
Start = c(1, 2)
End = c(1, 86)
Frequency = 252
 [1] -6.963625e-03  2.060341e-02 -2.381136e-02  1.262176e-02 -3.362955e-03  8.631936e-04  1.112363e-02 -9.896288e-03  8.223282e-03 -9.709775e-05  8.215267e-03
[12]  1.783771e-02  2.838839e-03  2.179718e-03 -2.513935e-03  5.219887e-03 -7.699964e-03  5.488578e-04  5.419362e-03 -1.415738e-02  5.248092e-04 -1.983692e-03
[23] -8.218105e-04 -5.537437e-03  9.761305e-03 -3.104851e-03 -1.343256e-02 -6.672154e-03 -2.027765e-03  1.543451e-03 -3.901449e-04  1.063560e-03  1.658950e-02
[34]  1.121796e-03  2.375160e-03 -7.165925e-03  1.255870e-02  6.567380e-03  9.413831e-03  5.706001e-04 -7.497710e-03  1.535360e-03  5.059862e-03  7.991391e-03
[45] -3.221344e-03  1.862683e-05  5.038464e-03  5.634064e-03 -6.183021e-03 -8.808367e-04 -7.386991e-03 -7.853330e-04 -1.408161e-02  5.618191e-03 -4.095731e-03
[56] -5.183360e-03 -1.000162e-03 -5.005816e-03  1.759413e-02  6.441345e-03  3.415269e-03 -4.811727e-03  1.280898e-02  5.689756e-03  1.713828e-02  1.870359e-02
[67] -5.211524e-03 -1.168249e-02  7.233224e-03  2.061144e-02 -2.995458e-03 -3.788613e-03  3.599604e-03  1.947566e-03  3.509358e-03  4.022963e-03  4.163975e-03
[78]  1.103181e-02 -1.085985e-02 -7.597556e-03 -6.481960e-03 -1.348933e-03  3.213561e-03 -1.311227e-03  5.524959e-03
> plot(returns)

Output:


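The diff/lag construction above is the simple return (P_t - P_{t-1}) / P_{t-1}. As a quick check, the sketch below recomputes it with plain vectors for the first four opening values printed earlier; the results agree with the first three values of the returns output.

```r
# First four opening values from the series printed above
p <- c(5105.10, 5069.55, 5174.00, 5050.80)

# Simple return: (P_t - P_{t-1}) / P_{t-1}
r <- diff(p) / head(p, -1)
print(r)
```

Working on plain vectors avoids the time-index alignment that lag() performs on ts objects, which makes the formula easier to verify by hand.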


Assignment 2:

Data for observations 1-700 is available. Predict the outcome for observations 701-850 using GLM estimation with logit analysis.

Commands:

z<-read.csv(file.choose(),header=T)

head(z)

z.data<-z[1:700,1:9]

sapply(z.data,mean)

z.data$ed<-factor(z.data$ed)

logit.est<-glm(default~age+employ+address+income+debtinc+creddebt+othdebt,data=z.data,family="binomial")

summary(logit.est)

confint.default(logit.est)

logit.eg2<-with(z[701:850,1:8],data.frame(age=age,employ=employ,address=address,income=income,debtinc=debtinc,creddebt=creddebt,othdebt=othdebt,ed=factor(ed)))

logit.eg2$prob<-predict(logit.est,newdata=logit.eg2,type="response")

head(logit.eg2)



Output:



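The train-then-predict pattern used above can be tried without the lab's CSV file. The sketch below uses simulated data; the data frame loans, its two predictors and the coefficients that generate the outcome are all hypothetical.

```r
# Simulated stand-in for the loan data (columns and coefficients are hypothetical)
set.seed(5)
n <- 850
loans <- data.frame(income = rnorm(n, 45, 15), debtinc = runif(n, 0, 30))
loans$default <- rbinom(n, 1, plogis(-3 + 0.15 * loans$debtinc - 0.01 * loans$income))

train <- loans[1:700, ]     # rows used for estimation
test  <- loans[701:850, ]   # rows to predict

fit <- glm(default ~ income + debtinc, data = train, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
test$prob <- predict(fit, newdata = test, type = "response")
test$pred <- as.integer(test$prob > 0.5)   # the 0.5 cutoff is an arbitrary choice
print(head(test))
```

Because the simulated test rows also carry the true outcome, the predicted classes can be compared with it, for example with table(test$default, test$pred).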




Wednesday, 23 January 2013

Business Application IT Lab


IT Business Application lab Assignment#3

Session 3
Date: 22nd Jan,2013

Assignment 1A
Based on the groove and mileage data: mileage is affected by groove.
Fit lm and comment on the applicability of lm
Plot 
      a) Res vs independent variable
      b) Sres vs Independent variable
      c) qqplot and add a qqline.


Solution:





As the residual plot is not random but shows a roughly parabolic pattern, a linear model is not appropriate in this case.
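Since the commands are not shown, here is a minimal sketch of how the three diagnostic plots can be produced in base R. The data is synthetic, generated with deliberate curvature so that the residual pattern is visible; it is not the lab's groove and mileage data.

```r
# Synthetic data with deliberate curvature (not the lab's groove data)
set.seed(2)
groove  <- seq(100, 400, length.out = 30)
mileage <- 5 + 0.1 * groove - 0.0002 * (groove - 250)^2 + rnorm(30, sd = 0.3)

fit  <- lm(mileage ~ groove)
res  <- resid(fit)       # raw residuals
sres <- rstandard(fit)   # standardized residuals

par(mfrow = c(1, 3))
plot(groove, res,  main = "Res vs groove");  abline(h = 0, lty = 2)
plot(groove, sres, main = "Sres vs groove"); abline(h = 0, lty = 2)
qqnorm(sres); qqline(sres)
```

A curved band of residuals around the zero line, as this data produces, is the pattern that argues against a straight-line fit.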

Assignment 1B
Based on alpha pluto data: pluto is dependent variable and alpha is the independent variable.
Fit lm and comment on the applicability of lm
      a) Plot Res vs independent variable

Output:


b) Plot Sres vs independent variable

Output:


Since the above plot shows no pattern and is random in nature, we can safely apply the linear model.

c) qqplot and qqline

Output:



Assignment 2

Based on the chair type and comfort level data, determine whether the comfort level given by all types of chairs is the same, using the ANOVA technique.


Output:


As seen from the solution, the p-value comes out to be 0.687, which is greater than the 5% significance level (95% confidence level). So we cannot reject the null hypothesis that all chair types give the same comfort level.
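A one-way ANOVA of this kind can be sketched in base R as follows. The chair data here is synthetic (three hypothetical chair types with the same true mean), so the p-value will not match the 0.687 reported above.

```r
# Synthetic comfort scores for three chair types (hypothetical data)
set.seed(9)
chairs <- data.frame(
  type    = factor(rep(c("A", "B", "C"), each = 10)),
  comfort = rnorm(30, mean = 6, sd = 1)
)

fit <- aov(comfort ~ type, data = chairs)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Reject H0 (equal mean comfort) only if p < 0.05
decision <- if (p_value < 0.05) "reject H0" else "fail to reject H0"
cat("p-value:", round(p_value, 3), "->", decision, "\n")
```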

Tuesday, 15 January 2013

Business Application IT Lab

IT Business Application lab Assignment#2

Session 2:
Date:15th Jan,2013

Today we learnt about the creation, inverse, transpose and multiplication of matrices. Then we moved on to regression and residual analysis using NSE historical data for the NIFTY index over a certain period. Finally, we got an introductory idea of how to plot a normal distribution curve.


Assignment 1: 
Create two matrices of size, say, 3 X 3 and select column 1 from one matrix and column 3 from the second matrix. After storing the selected columns in objects, say x1 and x2, merge these two columns using cbind to create a new matrix.

Solution:

To create a matrix:
x <- c(1:9)
dim(x) <- c(3,3)

y <- c(10:18)
dim(y) <- c(3,3)

To select a column (column 1 of x and column 3 of y, as the assignment asks):
z1 <- x[ ,1]
z2 <- y[ ,3]

z3<- cbind(z1,z2)

Output:




Assignment 2:

Multiply both the matrices.

Solution:

z <- x %*% y

Output:


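The matrix operations from this session (creation, column selection, cbind, multiplication, transpose and inverse) can be collected in one short sketch. The variable names xy, xt, a and a_inv are illustrative. The inverse is shown on a separate 2 x 2 matrix because the 3 x 3 matrix built from 1:9 is singular and cannot be inverted.

```r
x <- matrix(1:9,   nrow = 3)   # filled column by column
y <- matrix(10:18, nrow = 3)

z3 <- cbind(x[, 1], y[, 3])    # column 1 of x next to column 3 of y
xy <- x %*% y                  # matrix product, not element-wise x * y
xt <- t(x)                     # transpose

# x itself is singular (its columns are linearly dependent), so invert another matrix
a <- matrix(c(2, 1, 1, 3), nrow = 2)
a_inv <- solve(a)
print(a %*% a_inv)             # numerically the 2 x 2 identity
```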

Assignment 3:

Read historical data of NIFTY indices from NSE for the period 1st Dec 2012 to 31st Dec 2012. Find regression and residuals


Solution:

To read the csv file:

nse <- read.csv(file.choose(),header=T)

For finding the regression and residuals the following commands are used

reg <- lm(High ~ Open , data = nse)
residuals(reg)

Output:


Assignment 4:

Generate a normal distribution data and plot it.

Solution:

To generate the normally distributed data, the following commands are used:

x<-rnorm(40,0,1)
y<-dnorm(x)

For plotting the data

plot(x,y)

Output:

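With unsorted random draws, plot(x,y) shows scattered points along the bell curve. If a connected curve is wanted, one option is to sort the draws first, as in this sketch (the seed and plot labels are illustrative):

```r
set.seed(4)
x <- sort(rnorm(40, mean = 0, sd = 1))   # sort so the line is drawn left to right
y <- dnorm(x)                            # standard normal density at each draw

plot(x, y, type = "b", main = "Standard normal density", ylab = "density")
curve(dnorm(x), add = TRUE, lty = 2)     # smooth reference curve
```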
 


Tuesday, 8 January 2013

BUSINESS APPLICATION IT LAB

IT Business Application Lab Assignment#1

Session 1 :
Date: 8th Jan 2013

Briefing R:

  • R is a software package especially suitable for data analysis and graphical representation. 
  • Functions and results of analysis are all stored as objects, allowing easy function modification and model building.
  • R provides the language, tool, and environment in one convenient package.
Benefits of R:
  • It is very flexible and highly customizable.
  • Excellent graphical tools make R an ideal environment for EDA (Exploratory Data Analysis)
  • Since most high-level functions are written in the R language itself, the language can be learnt by studying the function code.
Weaknesses of R:
  • R is not particularly efficient in handling large data sets.
  • R is rather slow in executing a large number of for loops, compared to compiled languages.


Assignment 1:
Draw a histogram concatenating 3 data points.

Solution:

>x<-c(1,2,3)
>plot(x,type="h")

Output :




Assignment 2:  
Drawing a line graph with points and naming the graph and the axis.

Solution:
Step 1:
Let z be the variable that contains data from the .csv file selected.
Reading from the csv file

> z<-read.csv(file.choose(), header=T)

This command asks the user to select the file from the saved location.

Step 2:
Let zcol1 be the variable that contains all rows of column 3 from the data sheet.

> zcol1<-z[,3]
> plot(zcol1 , type="b" , main="NSE Graph" , xlab="Time" , ylab="indices")


Output:



Assignment 3:


Create a scatter plot by using share HIGH and LOW values from the NSE Historical data as obtained from the .csv file.

Solution :

HIGH values are obtained from column 3 from the csv file
> zcol1<-z[,3]
LOW values are obtained from column 4 from the csv file
> zcol2<-z[,4]

Now, to plot the scatter plot:
> plot(zcol1,zcol2)

Output:



Assignment 4:


Find the volatility of the share values obtained from the NSE historical data and obtain the range for the same.

Solution :-
To obtain the volatility , we require the highest value in the HIGH values column and the lowest value among the LOW values column.

Merging both the columns into one vector variable 'y' to get the HIGH and LOW values together can be done by using the following command:

> y<-c(zcol1,zcol2)
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4888    5660    5723    5758    5884    6021

Now that we have the max and min values, we can find the range and hence the required volatility.

> range(y)

[1] 4888.20 6020.75

Output:

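range() returns the minimum and maximum rather than a single number. A small sketch of turning the two values reported above into a width follows; the names spread and rel_spread are illustrative. Note that this is only a crude spread measure, and volatility is more commonly taken as the standard deviation of returns.

```r
# Min and max reported by range(y) above
r <- c(4888.20, 6020.75)

spread <- diff(r)                  # max - min
rel_spread <- spread / mean(r)     # spread relative to the midpoint of the range
cat("range width:", spread, " relative:", round(rel_spread, 3), "\n")
```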


Thank You!!!