When I execute the AWS CLI command I get the message: 'unicode' object has no attribute 'append'
You'll get this message if the command contains (a) any carriage returns or (b) additional white space. White space separates the command line options (those beginning with --). However, some of these options, e.g. --steps, take multiple arguments, and in some cases (e.g. within the bracketed section that sets the Args option) these are separated by commas only, with no white space. If you follow the syntax in the examples exactly you will be fine.
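One way to catch both problems before running the command is to check the value you pass to --steps. The sketch below is a hypothetical helper (check_steps_arg is not part of the AWS CLI). It only looks for a carriage return anywhere in the value and for white space inside the bracketed Args list, so treat it as a rough check rather than a full validator.

```shell
# Hypothetical helper, not part of the AWS CLI: checks a --steps value for
# the two problems described above.
check_steps_arg() {
  steps_arg=$1
  cr=$(printf '\r')

  # (a) a carriage return anywhere in the value
  case $steps_arg in
    *"$cr"*) echo "carriage return found"; return 1 ;;
  esac

  # (b) white space inside the bracketed Args=[...] list
  args_list=$(printf '%s\n' "$steps_arg" | sed -n 's/.*Args=\[\([^]]*\)].*/\1/p')
  case $args_list in
    *[[:space:]]*) echo "white space inside Args=[...]"; return 1 ;;
  esac

  echo "ok"
}

# Commas only inside Args: passes
check_steps_arg 'Type=Hive,Name="Program Name",ActionOnFailure=CONTINUE,Args=[-f,s3n://script_bucket/script_name]'

# A space after the comma inside Args: reported
check_steps_arg 'Type=Hive,Name="Program Name",ActionOnFailure=CONTINUE,Args=[-f, s3n://script_bucket/script_name]' || true
```

The space inside "Program Name" is fine because it sits inside quotation marks; the check deliberately looks only at the Args list.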
I've logged into the AWS console but I can't see my cluster
You need to be in the EMR part of the console: select Services > EMR. You also need to be in the right region. In the top right corner you'll see a location drop-down, which you should change to Ireland.
Do I need to restart a cluster every time I want to run a job?
No, you can submit further steps to a running cluster using the following syntax:
aws emr add-steps --cluster-id Cluster_ID --steps Type=Hive,Name="Program Name",ActionOnFailure=CONTINUE,Args=[-f,s3n://script_bucket/script_name]
Thanks to Alberto for sharing this!
My table is loading in Hive but some fields have ended up in the wrong columns
Hive is not great at reading comma-separated values: it will split a field that contains a comma even if the text value is surrounded by quotes. You need to use a SerDe to get round this. It is fairly straightforward and you can find details in lesson B4.
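As a rough illustration of what a SerDe-based table definition can look like, the sketch below writes a small Hive script to a local file. It assumes Hive 0.14 or later, which ships org.apache.hadoop.hive.serde2.OpenCSVSerde; lesson B4 may well use a different SerDe. The table name, column names, file name and S3 location are all placeholders.

```shell
# Write a Hive script that reads quoted CSV through OpenCSVSerde.
# example_table, its columns and the S3 location are placeholders.
cat > create_table.q <<'EOF'
CREATE EXTERNAL TABLE example_table (
  id STRING,
  description STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 's3n://data_bucket/example_table/';
EOF

# Show the script that was written
cat create_table.q
```

Note that OpenCSVSerde reads every column as a string, so numeric columns would need to be cast in later queries.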
How come the output is in many separate files?
The files come directly from the reducers and there is no subsequent step that concatenates them. Concatenating them yourself is simple, however; see lesson B5.
Why does the output not contain column headers?
This relates to the previous question. If each file had column headers then it would be more difficult to concatenate the files. You might ask why there isn't something that just concatenates the files and adds a header row. Other tools such as Hue will do this for you, and perhaps this functionality is being developed for future releases of Hive.
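If you would rather do this yourself, a minimal sketch is below. It assumes the output files have already been copied from S3 into a local directory (for example with aws s3 cp and its --recursive option) and that you know the column names. The helper combine_parts and all file names are hypothetical, and this is not necessarily the method used in lesson B5.

```shell
# Hypothetical helper: write a header line, then append every file in a
# directory in name order.
# usage: combine_parts <parts_dir> <header_line> <output_file>
combine_parts() {
  parts_dir=$1
  header=$2
  output_file=$3

  printf '%s\n' "$header" > "$output_file"
  # The glob expands in sorted order, so the files keep their sequence
  for part in "$parts_dir"/*; do
    [ -f "$part" ] && cat "$part" >> "$output_file"
  done
  return 0
}

# Example with two made-up output files
mkdir -p parts
printf '1,apple\n' > parts/000000_0
printf '2,pear\n'  > parts/000001_0
combine_parts parts 'id,name' combined.csv
cat combined.csv
```

Keep the output file outside the directory that holds the parts, otherwise it would be appended to itself on a second run.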
Where can I find the error messages for a failed job?
Click on the triangle next to your cluster to get the cluster details, then click on Hive Program. About 5 minutes after your job has run you will see some log files. Click on stderr. This will contain a lot of Java output that you don't need to worry about. Scroll down to find the error message, which is usually informative.
Why is my hiveconf variable not working?
Be careful about including quotation marks around the variable's value when you define it. This only makes sense if the variable will be used in place of a string. If it is used in place of, say, a column name, then quotation marks are not needed.
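Hive replaces ${hiveconf:name} with the variable's value as plain text, so any quotation marks in the value end up in the query. The sketch below imitates that substitution with sed purely to show the effect. It is not how Hive is invoked, and the table, column and variable names are made up.

```shell
# Imitates Hive's textual ${hiveconf:NAME} substitution (illustration only).
# usage: substitute <name> <value>   (query text on stdin)
# Values containing | or & would confuse this simple sed expression.
substitute() {
  sed "s|\${hiveconf:$1}|$2|g"
}

query='SELECT ${hiveconf:col} FROM sales WHERE country = ${hiveconf:country};'

# Column name unquoted, string value quoted: the query reads as intended
printf '%s\n' "$query" | substitute col price | substitute country "'UK'"

# Column name quoted as well: the query now selects the literal string 'price'
printf '%s\n' "$query" | substitute col "'price'" | substitute country "'UK'"
```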
Why does my cluster have a status of waiting?
This just means that the job you have run has finished. It may or may not have run successfully. You will need to click on cluster details to find out.
Which region should we be running the cluster in?
You should run the clusters in Ireland. The code for this when configuring the CLI is eu-west-1.
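If you are unsure whether the CLI is pointing at Ireland, one option is to set the region explicitly for your shell session. This is a sketch of one approach; the alternatives are shown as comments so that running the block does not change your saved configuration.

```shell
# Set the default region for this shell session; the AWS CLI reads this
# environment variable.
export AWS_DEFAULT_REGION=eu-west-1
echo "$AWS_DEFAULT_REGION"

# Alternatives, left commented out:
#   aws configure set region eu-west-1          # saves the region in your CLI config
#   aws emr list-clusters --region eu-west-1    # sets the region for one command only
```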
If you are running a MapReduce job (rather than, say, running a SQL statement in Hive) then you cannot output to an existing directory. You must either delete the existing directory or output to a new one. Although this is inconvenient there is some logic to it. Each directory will often contain many parts of a single data set. If you could add to an output directory, the contents could become inconsistent.
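One way to avoid the clash is to give every run its own output directory. The sketch below builds a timestamped path; output_bucket and the results prefix are placeholders. The commented line shows the other option, deleting the old output first.

```shell
# Build a unique output location for each run (bucket and prefix are placeholders)
run_stamp=$(date +%Y%m%d-%H%M%S)
output_dir="s3n://output_bucket/results/${run_stamp}/"
echo "$output_dir"

# Alternative: remove the previous output before rerunning. This is
# destructive, so it is left commented out. The aws s3 commands use the
# s3:// scheme.
#   aws s3 rm s3://output_bucket/results/ --recursive
```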