The book I read to research this post was the Amazon EMR Service Developer Guide, a very good book which I downloaded for free from Kindle. You might be interested to know I have written about various web services offered by Amazon on the following blogs.
My computing blog http://scratbag.me
My technology blog http://scratbagroberts.com
My business and finance blog http://melissaball86.com
EMR, or Elastic MapReduce, is a web service that works in conjunction with Amazon EC2 (Elastic Compute Cloud), S3 (Simple Storage Service) and the SimpleDB database service. It's a service for companies that do intensive processing, as in data mining or web indexing. It tends to use EC2 to do the processing and then sends the results to be stored in S3. Some companies like to load the results into DynamoDB or Amazon RDS (Relational Database Service) so they can work with them as a database. RDS lets you use SQL, which a lot of people like to use. EMR uses Hadoop: the processing runs on a Hadoop cluster, and while it runs the data is held in HDFS, the Hadoop Distributed File System. When the data is sent via the internet to your workstation, it travels over the secure internet protocol HTTPS. Hadoop works with Hive, whose query language HiveQL handles your queries, and often a scripting language called Pig is used. You have to be careful that the versions of Hadoop, Hive and Pig are compatible with each other and with EMR. Often a script is needed to do your processing, and there are plenty of supported languages like Ruby and Python. Officially you are limited to 256 steps in your processes, although there are workarounds like connecting with secure shell (SSH) and submitting the work directly. A final comment if you are doing anything like this: see if you can download a similar script that you can adapt. Often they are free or very cheap, and you save yourself a load of work.
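To give a feel for what a list of steps looks like, here is a minimal Python sketch that builds step definitions and refuses to go past the 256-step limit mentioned above. It is only an illustration under assumptions: the bucket name, script paths and the helper names `make_step` and `build_steps` are made up for this post, and using `command-runner.jar` with `hive -f` is one common way to run a Hive script on newer EMR releases, not necessarily what the guide recommends. The dictionaries follow the shape that boto3's EMR client takes in `add_job_flow_steps`, but nothing here talks to AWS.

```python
MAX_STEPS = 256  # the step limit quoted in this post


def make_step(name, script_uri):
    """Build one step definition as a plain dict.

    Assumption: the Hive script at script_uri is run via command-runner.jar.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep going if this step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive", "-f", script_uri],
        },
    }


def build_steps(script_uris):
    """Turn a list of script locations into step definitions, enforcing the limit."""
    if len(script_uris) > MAX_STEPS:
        raise ValueError(
            f"{len(script_uris)} steps requested, but the limit is {MAX_STEPS}"
        )
    return [
        make_step(f"step-{i}", uri)
        for i, uri in enumerate(script_uris, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical bucket and script names, for illustration only.
    steps = build_steps([
        "s3://example-bucket/scripts/clean.q",
        "s3://example-bucket/scripts/report.q",
    ])
    for step in steps:
        print(step["Name"], step["HadoopJarStep"]["Args"])
```

With boto3 installed and a cluster running, a list like this could then be handed to the EMR client's `add_job_flow_steps` call along with your cluster ID. Checking the count before submitting means you find out about the limit on your own machine rather than halfway through a job.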