
Saturday, April 8, 2023

Cassandra

Cassandra is a NoSQL database designed for write-heavy workloads and eventual consistency.
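
A minimal sketch of writing to Cassandra from Python with the DataStax cassandra-driver package (the keyspace, table, and local node address here are made up for illustration):

# pip install cassandra-driver
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])    # assumes a local Cassandra node
session = cluster.connect()

session.execute("""CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("CREATE TABLE IF NOT EXISTS demo.events (id int PRIMARY KEY, msg text)")

# Consistency is tunable per statement: ONE favours write latency/availability,
# QUORUM favours consistency.
insert = SimpleStatement("INSERT INTO demo.events (id, msg) VALUES (1, 'hello')",
                         consistency_level=ConsistencyLevel.ONE)
session.execute(insert)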

---------------------------------------------------------------------------- 
All the messages below are forwarded messages; if anyone feels hurt by a post, please add a comment and we will remove it.
The host/author is not responsible for these posts.

Thursday, March 9, 2023

INSTALLING HADOOP & HIVE ON WINDOWS & WORKING WITH HDFS -- Ramana (along with recording)

########-----------INSTALLING HADOOP, HIVE IN WINDOWS & WORKING WITH HDFS -------------- ########
################ ------------ AKULA VENKATA RAMANA ------------ #################

Link to download Java JDK 8: https://www.oracle.com/java/technologies/downloads/#java8 (use Java version 1.8.0_261)
Link to download Hadoop (2.8.0): https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/


HIVE Installation on WINDOWS:
https://www.youtube.com/watch?v=npyRXkMhrgk

Download Hive -
https://archive.apache.org/dist/hive/hive-2.1.0/

Download Derby Metastore-
https://archive.apache.org/dist/db/derby/db-derby-10.12.1.1/

Download hive-site.xml-
https://drive.google.com/file/d/1qqAo7RQfr5Q6O-GTom6Rji3TdufP81zd/view


# How to extract .tar.gz files on Windows 11
Run cmd as Admin:
tar -xvzf <source-file-path> -C <destination-path>
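
For example (the paths here are illustrative, assuming the archive was saved to the Downloads folder):
tar -xvzf C:\Users\venka\Downloads\hadoop-2.8.0.tar.gz -C C:\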


# Format the HDFS filesystem: run the following command in an admin command prompt:
hdfs namenode -format

# Start Hadoop services: run the following commands in the command prompt:
start-dfs
start-yarn

# Verify the installation: run jps in the command prompt to check that the Hadoop daemons are running:
C:\hadoop-2.8.0\sbin>jps
16592 ResourceManager
19296 DataNode
15096 Jps
18776 NameNode
27644 NodeManager

If all the daemons and Jps are printed on the console, you have successfully configured Hadoop in pseudo-distributed mode on Windows.
Pseudo-distributed mode means a single-node Hadoop cluster, where all the Hadoop daemons run on a single machine.
This mode is useful for testing and development purposes, when we want to simulate a distributed Hadoop environment on a single machine.

In pseudo-distributed mode, all the Hadoop daemons, such as the NameNode, DataNode, ResourceManager, and NodeManager, run in separate Java Virtual Machines (JVMs)
on the same machine. This means that we can use Hadoop commands and APIs to perform the same tasks as we would on a fully distributed cluster.

********************************************** Installation completed *****************************************************************************

What next, once the Hadoop cluster is running?

# Start Derby: go to the derby-10.12.1.1 bin directory

# Start the network server:
startNetworkServer -h 0.0.0.0

You should see the lines below once it is ready to accept connections on port 1527:
Fri Mar 03 14:51:19 IST 2023 : Security manager installed using the Basic server security policy.
Fri Mar 03 14:51:19 IST 2023 : Apache Derby Network Server - 10.12.1.1 - (1704137) started and ready to accept connections on port 1527

# Start Hive (from the hive-2.1.0 bin directory)

# If you get this error when running C:\hive-2.1.0\bin>hive:
"Missing hadoop installation: C:\hadoop-2.8.0\bin must be set", then set the below path first:

set HADOOP_HOME=C:\hadoop-2.8.0

Now start Hive:


C:\hive-2.1.0\bin>hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/hive-2.1.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.1.0 by Apache Hive
hive> 


# Show Databases / Dropping Tables / Dropping Databases etc
show databases;
use bdsassignment;
DROP TABLE ret_table_data;
DROP DATABASE bdsassignment;


# Creating a database in Hive
create database bdsassignment;

# Use that database
use bdsassignment;

# Creating table in Hive
CREATE TABLE ret_table_data (
    record_no INT,
    invoice STRING,
    stockcode STRING,
    description STRING,
    quantity INT,
    invoicedate TIMESTAMP,
    price DOUBLE,
    customer_id INT,
    country STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

(Note: OpenCSVSerde reads every column as STRING, so numeric expressions such as quantity*price rely on implicit casts.)



# Now load the data into the Hive table
load data local inpath 'C:\Assignment -2  2023 BDS DATA SET online_retail_data.csv' into table ret_table_data;


# To check if the data is properly loaded
select * from ret_table_data limit 20;


# For checking the Table description
describe formatted ret_table_data;


# Location where the table is stored in HDFS (check this from the hadoop-2.8.0 command prompt):
hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data

# To check the length (row count) of the data set
select count(*) from ret_table_data;

# 2) Country from which the maximum revenue was collected from sales in the month of March 2010

SELECT country, SUM(quantity * price) AS revenue
FROM ret_table_data
WHERE invoicedate LIKE '%03-2010%' OR invoicedate LIKE '%03/2010%'
GROUP BY country
ORDER BY revenue DESC
LIMIT 1;

# To monitor the job status:
http://localhost:8088 (YARN ResourceManager web UI)

http://localhost:50070 (HDFS NameNode web UI)


# Add the following property to your mapred-site.xml file, replacing the path with the correct path to your Python executable:
<property>
  <name>mapreduce.executable</name>
  <value>C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe</value>
</property>

# To leave HDFS safe mode
hdfs dfsadmin -safemode leave

# Run the MapReduce job using the following command (make sure the mapper and reducer .py files exist at the specified paths):

hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^
-files "file:///C:/wordcoutmapper.py,file:///C:/wordcoutreducer.py" ^
-mapper "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutmapper.py" ^
-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutreducer.py" ^
-input /BDS_Assignment2/word_count_data.txt ^
-output /BDS_Assignment2/result
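
The mapper and reducer scripts themselves are not included in this post; a minimal sketch of what wordcoutmapper.py and wordcoutreducer.py could look like for Hadoop Streaming (assuming the usual tab-separated key/value convention on stdin/stdout):

# wordcoutmapper.py -- emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# wordcoutreducer.py -- sum the counts; streaming delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))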

hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^
-files "file:///C:/Q1mapper.py,file:///C:/Q1reducer.py" ^
-mapper "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1mapper.py" ^
-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1reducer.py" ^
-input hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data ^
-output /BDS_Assignment2/result

# Command to list the contents of the /BDS_Assignment2/result directory:
hadoop fs -ls /BDS_Assignment2/result

# Once the MapReduce job completes, you can view the output using the following command:
hadoop fs -cat /BDS_Assignment2/result/part-00000



# To delete the directory /BDS_Assignment2/result in HDFS,
hadoop fs -rm -r /BDS_Assignment2/result

# Stop Hadoop services:
stop-dfs
stop-yarn




*********************************************************** THANK YOU **************************************************************************



---------------------------------------------------------------------------- 

Friday, March 3, 2023

Install Hive/Derby on top of Hadoop in Windows


To install Hive/Derby on top of Hadoop, you can follow these general steps:

The compatible versions are:
Hadoop: 2.8.0
Hive: 2.1.0
Derby: 10.12.1.1

1. Download and install Hadoop: you can download the Hadoop distribution from the Apache Hadoop website (https://hadoop.apache.org/)
and follow the installation instructions.

file name: hadoop-2.8.0.tar.gz


2. Download and install Hive: you can download the Hive distribution from the Apache Hive website (https://hive.apache.org/)
and follow the installation instructions.
file name: apache-hive-2.1.0-bin.tar.gz


3. Configure Hive to work with Hadoop: in the Hive configuration files, you need to set the location of the Hadoop
installation and configure other parameters such as the database location and the Hive metastore.
4. Start the Hadoop services using the start-all.cmd script located in the Hadoop sbin directory.
5. Start the Hive services using the hive command.
6. Download and install Derby: you can download the Derby distribution from the Derby website
and follow the installation instructions.
file name: db-derby-10.12.1.1-bin.tar.gz

Quick commands:

http://localhost:50070/dfshealth.html#tab-overview
http://localhost:8088/cluster
C:\hadoop\hadoop-2.8.0\sbin>start-all.cmd
C:\hive\apache-hive-2.1.0-bin\bin>hive
C:\derby\db-derby-10.12.1.1-bin\bin>startNetworkServer -h 0.0.0.0



---------------------------------------------------------------------------- 

Monday, January 9, 2023

BITS-WILP-BDS-Regular 2023-Mid Semester

===================================================================

Name : MTDSE CLUSTER-MID SEM-1st Sem 2022-2023 EC-2R Jan23
Subject : DSECLZG522 
               BIG DATA SYSTEMS EC 2R
===================================================================

Q1. Discuss briefly 3 key issues that will impact the performance of a data parallel application and need careful optimization. 

Q2. The CPU of a movie streaming server has L1 cache reference of 0.5 ns and main memory reference of 100 ns. The L1 cache hit during peak hours was found to be 23% of the total memory references.         [Marks: 4]

  1. Calculate the cache hit ratio h.
  2. Find out the average time (Tavg) to access the memory.
  3. If the size of the cache memory is doubled, what will be the impact on h and Tavg?
  4. If there is a total failure of the cache memory, calculate h and Tavg.

Q3. A travel review site stores (user, hotel, review) tuples in a data store. E.g. tuple is (“user1”, “hotel ABC”, “<review>”). The data analysis team wants to know which user has written the most reviews and the hotel that has been reviewed the most. Write MapReduce pseudo-code to answer this question.         [Marks: 4]

Q4. An e-commerce site stores (user, product, rating) tuples for data analysis. E.g. tuple is (“user1”, “product_x”, 3), where rating is from 1-10 with 10 being the best. A user can rate many products and products can be rated by many users. Write MapReduce pseudo-code to find the range (min and max) of ratings received for each product. So each output record contains (<product>, <min rating> to <max rating>).              [Marks: 4]
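
For the shape such MapReduce pseudo-code usually takes, here is a Python-style sketch for Q4 (a sketch only, not a model answer; emit is the usual MapReduce pseudo-code primitive):

def map(key, value):
    # value is one (user, product, rating) tuple
    user, product, rating = value
    emit(product, rating)

def reduce(product, ratings):
    # all ratings for one product arrive at the same reducer
    emit(product, str(min(ratings)) + " to " + str(max(ratings)))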

Q5. Name a system and explain how it utilises the concepts of data and tree parallelism.           [Marks: 3]

Q6. An enterprise application consists of a 2 node active-active application server cluster connected to a 2 node active-passive database (DB) cluster. Both tiers need to be working for the system to be available. Over a long period of time it has been observed that an application server node fails every 100 days and a DB server node fails every 50 days. A passive DB node takes 12 hours to take over from the failed active node. Answer the following questions.            [Marks: 4]

  1. What is the overall MTTF of the 2-tier system?
  2. Assume only a single failure at any time, either in the App tier or in the DB tier, and an equal probability of an App or a DB node failure. What is your estimate of the availability of the 2-tier system?

Q7. In the following application scenarios, point out what is most important - consistency or availability, when a system failure results in a network partition in the backend distributed DB. Explain briefly the reason behind your answer.          [Marks: 4]

(a) A limited quantity discount offer on a product for 100 items at an online retail store is almost 98% claimed.
(b) An online survey application records inputs from millions of users across the globe.
(c) A travel reservation website is trying to sell rooms at a destination that is seeing very few bookings.
(d) A multi-player game with virtual avatars and users from all across the world needs a set of sequential steps between team members to progress across game milestones.

Q8. Assume that you have a NoSQL database with 3 nodes and a configurable replication factor (RF). R is the number of replicas that participate to return a Read request. W is the number of replicas that need to be updated to acknowledge a Write request. In each of the cases below, explain why data is consistent or inconsistent for read requests.               [Marks: 4]

1. RF=1, R=1, W=1.
2. RF=2, R=1, W=Majority/Quorum.
3. RF=3, R=2, W=Majority/Quorum.
4. RF=3, R=Majority/Quorum, W=3.
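
A common rule of thumb for reasoning about these cases, sketched in Python (assuming Dynamo-style quorums, where a read is guaranteed to see the latest write when R + W > RF):

def quorum(rf):
    # majority of RF replicas
    return rf // 2 + 1

# (RF, R, W) for the four cases above
cases = [(1, 1, 1), (2, 1, quorum(2)), (3, 2, quorum(3)), (3, quorum(3), 3)]
for rf, r, w in cases:
    status = "consistent" if r + w > rf else "possibly stale reads"
    print("RF=%d R=%d W=%d: %s" % (rf, r, w, status))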





---------------------------------------------------------------------------- 

Thursday, January 5, 2023

Hadoop 2 - Architecture

---------------------------------------------------------------------------- 

MapReduce Programming Architecture and flow

1. The input dataset is split into multiple pieces of data (several small sets).
2. The framework creates a master and several worker processes, and executes the worker processes remotely.
3. Several Map tasks work simultaneously and read the pieces of data assigned to them. Each map worker uses the map function to extract only the data present on its server, and generates key/value pairs for the extracted data.
4. The map worker uses the partitioner function to divide the data into regions. The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin theirs.
6. The reduce workers in turn contact the map workers to get the key/value data for their partition (shuffle). The data received from the various mappers is merge-sorted by key.
7. The reduce function is then called on every unique key, and it writes its output to the file.
8. When all the reduce workers complete their work, the master transfers control back to the user program.
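
To make steps 3-7 concrete, here is a toy, in-memory illustration in Python (no Hadoop involved; the partitioner, shuffle/merge-sort, and per-key reduce are the point):

from collections import defaultdict

NUM_REDUCERS = 2

def map_fn(line):
    # step 3: extract key/value pairs from one piece of input
    return [(word, 1) for word in line.split()]

def partition(key):
    # step 4: decide which reducer region a key belongs to
    return hash(key) % NUM_REDUCERS

def reduce_fn(key, values):
    # step 7: called once per unique key
    return (key, sum(values))

data = ["big data systems", "big data"]

regions = defaultdict(list)                  # steps 3-4: map, then partition
for line in data:
    for key, value in map_fn(line):
        regions[partition(key)].append((key, value))

for r, pairs in sorted(regions.items()):     # steps 5-7: shuffle, sort, reduce
    grouped = defaultdict(list)
    for key, value in sorted(pairs):         # merge-sort by key
        grouped[key].append(value)
    for key, values in grouped.items():
        print("reducer", r, reduce_fn(key, values))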






---------------------------------------------------------------------------- 

Big Data Analytics Lifecycle

---------------------------------------------------------------------------- 

Mean Time Failure and other Formulas

Important formulas for Mean Time to Failure

MTTF - Mean Time To Failure
• MTTF = 1 / failure rate = Total #hours of operation / Total #units
• MTTF is an averaged value. In reality the failure rate changes over time,
because it may depend on the age of the component.
• Failure rate = 1 / MTTF (assuming an average value over time)

MTTR - Mean Time to Recovery / Repair
• MTTR = Total #hours for maintenance / Total #repairs

MTTD - Mean Time to Diagnose

MTBF - Mean Time Between Failures
• MTBF = MTTD + MTTR + MTTF

MTTF - Serial assembly
• MTTF of system = 1 / SUM(1/MTTFi) over all components i
• Failure rate of system = SUM(1/MTTFi) over all components i

MTTF - Parallel assembly
• MTTF of system = SUM(MTTFi) over all components i

Availability
• Availability = Time system is UP and accessible / Total time observed
• Availability = MTTF / (MTTD + MTTR + MTTF)
or
• Availability = MTTF / MTBF
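
A quick numeric sketch of the serial-assembly and availability formulas in Python (the component values are illustrative only):

# Two components in series: the system fails if either component fails
mttf_components = [100.0, 50.0]              # MTTF of each component, in days
failure_rate = sum(1.0 / m for m in mttf_components)
mttf_serial = 1.0 / failure_rate             # 1 / (1/100 + 1/50) ~= 33.3 days

mttd, mttr = 0.5, 1.5                        # diagnose/repair times, in days
availability = mttf_serial / (mttd + mttr + mttf_serial)
print(round(mttf_serial, 1), round(availability, 3))   # 33.3  0.943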

---------------------------------------------------------------------------- 

Wednesday, January 4, 2023

Cache performance and Access time of memories

• Cache hit
✓ When the CPU refers to memory and finds the data or instruction in the cache memory
• Cache miss
✓ When the desired data or instruction is not found in the cache memory and the CPU refers to
main memory to find that data or instruction

Hit + Miss = Total CPU References
Hit Ratio h = Hit / ( Hit + Miss )

The average access time of a memory system consisting of two levels:
✓ Cache Memory
✓ Main Memory
• If Tc is the time to access cache memory, Tm is the time to access main
memory, and h is the cache hit ratio, then
Tavg = Average time to access memory
Tavg = h * Tc + ( 1-h ) * ( Tm + Tc )
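
As a quick worked example in Python, using the numbers from the mid-semester Q2 earlier on this blog (Tc = 0.5 ns, Tm = 100 ns, h = 23%):

tc, tm = 0.5, 100.0     # cache and main-memory reference times, in ns
h = 0.23                # cache hit ratio

# A miss pays the cache lookup plus the main-memory access
tavg = h * tc + (1 - h) * (tm + tc)
print(tavg)             # 0.115 + 77.385 = 77.5 ns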

---------------------------------------------------------------------------- 

Apache Technology Ecosystem

---------------------------------------------------------------------------- 

Big Data architecture style

---------------------------------------------------------------------------- 