Specifying the data to retrieve from a Hadoop system

To specify what data rows to retrieve from a Hadoop system running Hive, you write a query using HQL. As mentioned earlier, HQL is similar to SQL. HQL supports many of the same keywords as SQL, for example, SELECT, WHERE, GROUP BY, ORDER BY, JOIN, and UNION.

Hive translates HQL statements into MapReduce jobs, which Hadoop runs and manages in parallel across the servers in the cluster. You can embed your own MapReduce scripts in the query by using the TRANSFORM clause. You make these scripts available to Hadoop through the Add File property when you configure the connection properties, as described in the previous topic.

The following is an example of an HQL query that uses the TRANSFORM clause:

SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data
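
The query above names a script, weekday_mapper.py, but does not show its contents. The following is a minimal sketch of what such a script might look like, not the actual script. It relies on Hive's default TRANSFORM behavior, which passes each input row to the script on standard input as tab-separated fields and reads tab-separated rows back from standard output. The function name map_line, the use of UTC for the weekday calculation, and the use of a Python 3 interpreter are assumptions made for illustration.

```python
import sys
from datetime import datetime, timezone


def map_line(line):
    """Convert one tab-separated input row into an output row with a weekday.

    Input fields:  userid, movieid, rating, unixtime
    Output fields: userid, movieid, rating, weekday
    """
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    # ISO weekday: 1 = Monday ... 7 = Sunday.
    # Assumption: the timestamp is interpreted in UTC.
    weekday = datetime.fromtimestamp(int(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


def main():
    # Hive streams rows to the script on stdin, one row per line.
    for line in sys.stdin:
        if line.strip():  # skip blank lines
            print(map_line(line))


if __name__ == "__main__":
    main()
```

To check the script outside Hadoop, you can pipe a few tab-separated rows to it from a command line before adding it through the Add File property.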

How to specify what data to retrieve from a Hadoop system

1 In Data Explorer, right-click Data Sets, then choose New Data Set.

2 In New Data Set, specify the following information:

   1 In Data Source Selection, select the Hive data source to use. Data Set Type displays HQL Select Query.

   2 In Data Set Name, type a name for the data set.

   3 Choose Next.

3 In HQL Query, in Query Text, type an HQL statement that indicates what data to retrieve. Figure 6‑19 shows an example of an HQL query specified in the data set editor.

Figure 6‑19 Data set editor displaying an HQL query

4 Choose Finish to save the data set. Edit Data Set displays the output columns and provides options for editing the data set, as shown in Figure 6‑20.

Figure 6‑20 Data set editor displaying the output columns

5 Choose Preview Results to view the data rows returned by the data set.