英国房地产支付价格
该数据集包含自 1995 年以来有关英格兰和威尔士房地产价格的数据。未压缩的大小约为 4 GiB,在 ClickHouse 中大约需要 278 MiB。
来源:https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads 字段说明:https://www.gov.uk/guidance/about-the-price-data
包含 HM Land Registry data © Crown copyright and database right 2021.。此数据集需在 Open Government License v3.0 的许可下使用。
创建表
CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);
预处理和插入数据
我们将使用 url 函数将数据流式传输到 ClickHouse。我们需要首先预处理一些传入的数据,其中包括:
- 将postcode拆分为两个不同的列 -postcode1和postcode2,因为这更适合存储和查询
- 将time字段转换为日期因为它只包含 00:00 时间
- 忽略 UUid 字段,因为我们不需要它进行分析
- 使用 transform 函数将 Enum字段type和duration转换为更易读的Enum字段
- 将 is_new字段从单字符串(Y/N) 到 [UInt8](/docs/zh/sql-reference/data-types/int-uint.md#uint8-uint16-uint32-uint64-uint256-int8-int16-int32-int64 -int128-int256) 字段为 0 或 1
- 删除最后两列,因为它们都具有相同的值(即 0)
url 函数将来自网络服务器的数据流式传输到 ClickHouse 表中。以下命令将 500 万行插入到 uk_price_paid 表中:
INSERT INTO uk_price_paid
WITH
   splitByChar(' ', postcode) AS p
SELECT
    toUInt32(price_string) AS price,
    parseDateTimeBestEffortUS(time) AS date,
    p[1] AS postcode1,
    p[2] AS postcode2,
    transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
    b = 'Y' AS is_new,
    transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
    addr1,
    addr2,
    street,
    locality,
    town,
    district,
    county
FROM url(
    'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv',
    'CSV',
    'uuid_string String,
    price_string String,
    time String,
    postcode String,
    a String,
    b String,
    c String,
    addr1 String,
    addr2 String,
    street String,
    locality String,
    town String,
    district String,
    county String,
    d String,
    e String'
) SETTINGS max_http_get_redirects=10;
需要等待一两分钟以便数据插入,具体时间取决于网络速度。
验证数据
让我们通过查看插入了多少行来验证它是否有效:
SELECT count()
FROM uk_price_paid
在执行此查询时,数据集有 27,450,499 行。让我们看看 ClickHouse 中表的大小是多少:
SELECT formatReadableSize(total_bytes)
FROM system.tables
WHERE name = 'uk_price_paid'
请注意,表的大小仅为 221.43 MiB!
运行一些查询
让我们运行一些查询来分析数据:
查询 1. 每年平均价格
SELECT
   toYear(date) AS year,
   round(avg(price)) AS price,
   bar(price, 0, 1000000, 80
)
FROM uk_price_paid
GROUP BY year
ORDER BY year
结果如下所示:
┌─year─┬──price─┬─bar(round(avg(price)), 0, 1000000, 80)─┐
│ 1995 │  67934 │ █████▍                                 │