Enhancing Data Migration Performance via Parallel Data Compression
#1

[attachment=12240]
CHAPTER 1
INTRODUCTION
1.1 Data Migration: - Data migration is the process of transferring data between storage types, formats, or computer systems. It is usually performed programmatically to achieve an automated migration, freeing human resources from tedious tasks. It is required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge (such as when the organizations that use them undergo a merger or takeover).

To achieve an effective data migration procedure, data on the old system is mapped to the new system, providing a design for data extraction and data loading. The design relates old data formats to the new system's formats and requirements. Programmatic data migration may involve many phases, but it minimally includes data extraction, where data is read from the old system, and data loading, where data is written to the new system. If a set input file specification has been provided for loading data onto the target system, a pre-load 'data validation' step can be put in place, interrupting the standard ETL process. Such a validation step interrogates the data to be transferred to ensure that it meets the predefined criteria of the target environment and the input file specification. An alternative strategy is on-the-fly data validation at the point of loading, which can be designed to report load rejection errors as the load progresses. However, if the extracted and transformed data elements are highly 'integrated' with one another, and the presence of all extracted data in the target system is essential to system functionality, this strategy can have detrimental and not easily quantifiable effects.

After loading into the new system, the results are subjected to data verification to determine whether the data was accurately translated, is complete, and supports processes in the new system. During verification, there may be a need for a parallel run of both systems to identify areas of disparity and forestall erroneous data loss. Automated and manual data cleaning is commonly performed during migration to improve data quality, eliminate redundant or obsolete information, and match the requirements of the new system. The data migration phases (design, extraction, cleansing, load, verification) for applications of moderate to high complexity are commonly repeated several times before the new system is deployed.
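The extract-validate-load-verify flow described above can be summarized in a short sketch. The code below is a minimal illustration only: the record format, validation rule, and in-memory "old" and "new" systems are hypothetical stand-ins introduced for this example, not part of the original report.

[code]
# Minimal sketch of a programmatic migration pipeline:
# extract -> pre-load validation -> load -> verify.
# The source, target, and specification here are simplified stand-ins.

TARGET_SPEC = {"required_fields": ("id", "name")}   # hypothetical input file specification

def extract(legacy_records):
    """Data extraction: read rows from the old system (here, an in-memory list)."""
    return list(legacy_records)

def validate(row, spec):
    """Pre-load validation against the target's input file specification."""
    return all(row.get(field) is not None for field in spec["required_fields"])

def load(rows, target):
    """Data loading: write validated rows to the new system (here, another list)."""
    target.extend(rows)

def verify(loaded_rows, target):
    """Data verification: confirm that everything loaded is present in the target."""
    return all(row in target for row in loaded_rows)

legacy = [{"id": 1, "name": "pump-A"}, {"id": 2, "name": None}]   # second row fails validation
target = []
rows = extract(legacy)
valid = [r for r in rows if validate(r, TARGET_SPEC)]
rejected = [r for r in rows if not validate(r, TARGET_SPEC)]
load(valid, target)
print("loaded:", len(valid), "rejected:", len(rejected), "verified:", verify(valid, target))
[/code]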
1.2 Data Compression: - Compression is useful because it helps to reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme, as shown in Figure 1), and the computational resources required to compress and decompress the data.
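The trade-off between degree of compression and CPU time can be observed directly with a general-purpose lossless compressor. The snippet below is only an illustrative measurement, using Python's standard zlib module (not a tool named in this report) on synthetic data; real simulation output will give different numbers.

[code]
# Measure the ratio/CPU-time trade-off of lossless compression at different effort levels.
import random
import time
import zlib

random.seed(0)
# Synthetic stand-in for simulation output: mildly repetitive binary data.
data = bytes(random.choice(b"abcdefgh") for _ in range(2_000_000))

for level in (1, 6, 9):                       # low, default, and maximum effort
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)       # compressed size / uncompressed size
    print(f"level {level}: ratio {ratio:.3f}, compression time {elapsed:.3f} s")
[/code]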
1.3 Overview: - Large-scale scientific simulation codes produce huge amounts of output, which is placed on secondary storage for fault-tolerance purposes or future time-based analysis. This analysis is usually conducted on a machine that is geographically separated from the machine where the simulation ran, such as a scientist's local workstation. To reduce application turnaround time, including data migration, one can overlap computation and migration, but with typical network bandwidths of less than 1 Mbps from supercomputers to the outside world, migration is often still the longest part of a simulation run. Compression can help by reducing data size, but it is very computation-intensive for the relatively incompressible dense floating-point data that simulation codes typically produce. We therefore need to incorporate compression into migration in a way that reliably reduces application turnaround time on today's popular parallel platforms. Generic, data-specific, and parallel compression algorithms can improve file I/O performance and apparent Internet bandwidth. However, several issues arise when integrating compression with long-distance transport of data from today's parallel simulations: What kind of compression ratios can we expect? Will they fluctuate over the lifetime of the application? If so, how should we make the decision whether to compress? Will the compression ratios differ with the degree of parallelism? If so, how can we handle the resulting load imbalance during migration? Are special compression algorithms needed? Can we exploit the time-series nature of simulation snapshots and checkpoints to further improve compression performance? What kind of performance can we expect on today's supercomputers and the Internet?
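One of these questions, whether to compress at all, can be framed with a simple back-of-the-envelope model (an illustration added here, not a model from the report): migrating S bytes uncompressed over a link of bandwidth B takes S/B seconds, while compressing first takes roughly t_c + (S*r)/B seconds, where t_c is the compression time and r the compression ratio (compressed size divided by uncompressed size). Compression pays off only when the bandwidth saved outweighs the CPU time spent.

[code]
# Back-of-the-envelope check: is compress-then-migrate faster than migrating raw data?
# The numbers below are made-up examples, not measurements from the report.

def should_compress(size, bandwidth, compress_throughput, ratio):
    """All sizes in bytes, rates in bytes/second; ratio = compressed size / uncompressed size."""
    plain_time = size / bandwidth
    compress_time = size / compress_throughput
    migrate_time = (size * ratio) / bandwidth
    return compress_time + migrate_time < plain_time

# Example: a 1 GB snapshot, a 1 Mbps wide-area link, a 20 MB/s compressor, ratio 0.6.
ONE_GB = 2**30
LINK = 1e6 / 8            # 1 Mbps expressed in bytes per second
COMPRESSOR = 20 * 2**20   # 20 MB/s
print(should_compress(ONE_GB, LINK, COMPRESSOR, 0.6))   # True: the slow link favours compression
[/code]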
CHAPTER 2
CORE TECHNOLOGY

Multidimensional arrays are the most common output from scientific simulations. Simulation codes may assign separate arrays to each processor or may divide large arrays into disjoint subarrays, each of which is assigned to a processor. Many visualization tools can only read array data in traditional row-major or column-major order, while the simulation uses a different distribution. In this case, reorganization of array data between its memory and disk distributions is required (Figure 2).
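As a concrete illustration of such a reorganization (a minimal NumPy sketch assuming a 2-D array distributed as column blocks across four clients and a row-major layout on disk; the distributions in Figure 2 may differ), the subarrays held by individual processors must be recombined into global row-major order before a standard visualization tool can read them.

[code]
# Reorganize per-processor subarrays (column blocks in memory) into one
# row-major byte stream for disk.  The layout is a simplified example.
import numpy as np

rows, cols, nprocs = 4, 8, 4
global_array = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)

# In-memory distribution: each of the 4 clients owns a contiguous block of columns.
client_blocks = np.hsplit(global_array, nprocs)

# Disk distribution: a single row-major stream, as many visualization tools expect.
# Concatenating the blocks' raw bytes directly would interleave values incorrectly,
# so a server reassembles the global array first and then flattens it row-major.
reassembled = np.hstack(client_blocks)
row_major_bytes = np.ascontiguousarray(reassembled).tobytes(order="C")

assert np.array_equal(reassembled, global_array)
print(len(row_major_bytes), "bytes ready to write in row-major order")
[/code]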
Simulations typically perform output operations at regular intervals. The two most important output operations are snapshots and checkpoints. A snapshot stores the current “image” of simulation data, for future visualization or analysis. A checkpoint saves enough simulation data for the computation to restart from the most recent checkpoint if the system crashes. These I/O operations bracket each computation phase; correspondingly, an I/O phase is defined as the period between two consecutive computation phases.

The processors in a parallel run can be divided into two broad types: I/O servers and I/O clients. I/O clients perform the simulation code’s computation, and I/O servers do the file I/O. Dedicated I/O servers are used only for I/O, so they are usually idle while I/O clients are busy. Non-dedicated I/O servers act as I/O clients during computation phases and as I/O servers at I/O time. Often, the I/O operations of a simulation are collective, where all processors co-operate to carry out I/O tasks. We use I/O servers to store the output to the local file system and then migrate it. This local staging prevents compute processors from stalling while data is migrated. We could also migrate output immediately without local staging, but this does not allow overlap between data transfer and other application activities and performs poorly in current typical hardware configurations, so we do not consider it further.

Figure 3 shows the data flow in a simulation run with I/O and migration, along with three possible spots for performing compression: Client-side Compression (CC), Server-side Compression (SC), and Server-side Compression on Already-Stored Data (SC2). These are as follows:
1. Client-side Compression during an I/O Phase (CC): Under CC, each client compresses each of its output parts and sends them to a server, along with metadata such as the compressed part size and the compression method (a schematic sketch of this data path is given after this list). I/O servers receive compressed parts from clients and store them to disk. CC’s advantage is its high degree of parallelism in compression. However, CC’s compression cost is fully visible to clients. Further, if the array distributions in memory and on disk are different, servers must decompress the parts, reorganize the data, and recompress the new parts. Therefore, we assume the same array distribution in memory and on disk for CC, with clients’ array parts assigned to servers in a round-robin manner. Codes with independent arrays on each processor, such as codes for irregular problems, fit this model.
2. Server-side Compression during an I/O Phase (SC): In SC, servers receive output data from clients during an I/O phase, compress it, and store it to disk. SC allows the array to be reorganized to any target distribution before compression, and thus is more flexible than CC. However, some or all of the compression cost will not be hidden with SC: even if the servers start to stage the data before all of it has arrived from the clients, clients may still be forced to wait during compression. Further, scientists usually use far fewer dedicated I/O processors than compute processors, so aggregate compression performance with SC will be worse than with CC. To keep the flexibility of SC while exploiting as many processors as CC does, we can choose to use all the processors as non-dedicated I/O servers and perform SC on them.
3. Server-side Compression on Already-Stored Outputs (SC2): Before being transferred to a remote machine, a staged output needs to be read into memory from the local file system. SC2 reads and compresses the stored output, and then migrates it. This overlaps the compression with computation, so the visible I/O cost may be lower than under CC and SC. However, concurrent compression and computation would slow down the simulation so much that SC2 is only suitable for dedicated I/O servers, thus limiting the degree of parallelism during compression. Further, more time will be spent in file I/O, because the staged data is uncompressed.
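The schematic sketch below illustrates the CC data path from item 1: each client compresses its own output part and ships it with a small metadata header, and the server stores whatever arrives without touching it. It is only an illustration under simplifying assumptions (in-memory "clients" and "server", zlib as the compressor, a plain dictionary as the metadata header); it is not the report's actual implementation.

[code]
# Schematic sketch of Client-side Compression (CC): every client compresses its own
# output part and sends it, plus metadata, to an I/O server that stores it unchanged.
import zlib

def client_prepare_part(part_id, raw_part):
    """Client side: compress one output part and attach metadata."""
    compressed = zlib.compress(raw_part)
    metadata = {"part_id": part_id,
                "method": "zlib",
                "compressed_size": len(compressed),
                "original_size": len(raw_part)}
    return metadata, compressed

def server_store(storage, metadata, compressed):
    """Server side: store the compressed part as-is; no decompression or reorganization."""
    storage[metadata["part_id"]] = (metadata, compressed)

# Four clients, each owning an independent array part, all assigned to one server here.
storage = {}
for pid in range(4):
    raw_part = bytes([pid]) * 250_000            # toy stand-in for a dense array part
    meta, comp = client_prepare_part(pid, raw_part)
    server_store(storage, meta, comp)

total_in = sum(m["original_size"] for m, _ in storage.values())
total_out = sum(m["compressed_size"] for m, _ in storage.values())
print(f"stored {len(storage)} parts, overall ratio {total_out / total_in:.4f}")
[/code]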
To see how data migration performance can be improved, we consider several data sets. Floating-point arrays are the most common data type in scientific simulations. Typical floating-point arrays are dense, i.e. most of the array elements contain important numbers, but they can also be sparse and therefore highly compressible. Sparse output arrays are not unusual near the beginning and end of a simulation run. Integer arrays are also widely used; for example, floating-point data often have an accompanying integer array describing the coordinate system. Simulations typically include text annotations and textual formatting information in their HDF (Hierarchical Data Format) output. To reflect this wide spectrum of data, we used the eight data sets in Table 1. Astrophysics, Cactus, ZEUS-MP, and Flash are simulation results from four different astrophysics codes. Gen1 is the output of a rocket simulation code. SCAR-B and AVHRR are direct observation data from an airborne scanning spectrometer and a satellite, respectively. Bible contains the whole Bible. The compression ratios in Table 1 were calculated as the compressed size, using UNIX gzip, divided by the uncompressed size. Compression is discussed further in the next section.
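The Table 1 ratios can be reproduced for any file with a few lines of code. The sketch below computes the same compressed-size-over-uncompressed-size measure with Python's gzip module at level 6 (the gzip command-line default); the file name is only a placeholder, not one of the report's data sets.

[code]
# Compute a compression ratio as defined for Table 1:
# compressed size (gzip) divided by uncompressed size.
import gzip
import os

def gzip_ratio(path):
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw, compresslevel=6)   # level 6 matches the gzip CLI default
    return len(compressed) / len(raw)

# 'snapshot_0001.hdf' is a placeholder file name used only for this example.
if os.path.exists("snapshot_0001.hdf"):
    print(f"compression ratio: {gzip_ratio('snapshot_0001.hdf'):.3f}")
[/code]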
#2
[attachment=13705]
[attachment=13706]
[attachment=13707]
[attachment=13708]
Figure 1. Demonstration of lossy and lossless compression
Figure 2. Different array distributions in memory and on disk
Figure 3. Data flow with migration and three possible compression points
#3


To get information about the topic "data migration" (full report and PPT) and related topics, refer to the page links below:

http://studentbank.in/report-cross-platf...e=threaded

http://studentbank.in/report-database-mi...ork--16409

http://studentbank.in/report-an-overview...ethodology

http://studentbank.in/report-enhancing-d...ompression