On HDX, you can download and use the administrative boundaries of Mauritania, with one caveat: the names of the administrative divisions are given only in English, translated from Arabic. For some analyses, it is useful to have the Arabic names in the same table. In this post, we will scrape a table with the Arabic names from a website and join it to our administrative boundaries data. We will need the rhdx package (not yet on CRAN) and the following packages:
library(tidyverse)
library(sf)
library(purrr)
library(rvest)
library(stringdist)
library(httr)
library(rhdx) ## remotes::install_gitlab("dickoa/rhdx")
We can use rhdx::pull_dataset to read the Mauritania administrative boundaries dataset in R and use rhdx::get_resources to list available resources (aka files).
pull_dataset("cod-ab-mrt") %>%
  get_resources() %>%
  as_tibble() %>%
  slice_head(n = 3)
# A tibble: 3 × 5
resource_id resou…¹ resou…² resou…³ resource
<chr> <chr> <chr> <chr> <list>
1 cab34844-8dd1-4a1b-af55-db5… MRT_Ad… xlsx https:… <HDXResrc>
2 dacb6ad2-13b6-4f14-b1e9-44b… mrt_ad… shp https:… <HDXResrc>
3 1e7d4873-1151-4d83-a38d-11f… mrt_ad… emf https:… <HDXResrc>
# … with abbreviated variable names ¹resource_name,
# ²resource_format, ³resource_url
We can see from the output that the second resource is the shapefile, which contains the regions layer.
mrt_adm1 <- pull_dataset("cod-ab-mrt") %>%
  get_resource(2) %>%
  read_resource(layer = "mrt_admbnda_adm1_gov_20200801")
glimpse(mrt_adm1)
Rows: 13
Columns: 13
$ Shape_Leng <dbl> 22.8673631, 12.5340794, 8.6261225, 13.117716…
$ Shape_Area <dbl> 19.40005835, 3.04749757, 2.82872414, 3.24448…
$ ADM1_EN <chr> "Adrar", "Assaba", "Brakna", "Dakhlet-Nouadh…
$ ADM1_PCODE <chr> "MR01", "MR02", "MR03", "MR04", "MR05", "MR0…
$ ADM1_REF <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM1ALT1EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM1ALT2EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM0_EN <chr> "Mauritania", "Mauritania", "Mauritania", "M…
$ ADM0_PCODE <chr> "MR", "MR", "MR", "MR", "MR", "MR", "MR", "M…
$ date <date> 2020-06-12, 2020-06-12, 2020-06-12, 2020-06…
$ validOn <date> 2020-07-31, 2020-07-31, 2020-07-31, 2020-07…
$ validTo <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geometry <POLYGON [°]> POLYGON ((-6.3422 22.8704, ..., POLY…
We can see that the Arabic names are not available in this data. We can also visualize the available names using ggplot2 and sf.
mrt_adm1 %>%
  ggplot() +
  geom_sf() +
  geom_sf_label(aes(label = ADM1_EN)) +
  theme_minimal()
We need the Arabic names, and this Wikipedia page has a table with the region names in Arabic and English. We can use the rvest R package to scrape the table and map it to our geospatial layer.
url <- "https://en.wikipedia.org/wiki/Regions_of_Mauritania"
arabic_adm1 <- url |>
  read_html() |>
  html_nodes("table.wikitable") |>
  html_table() |>
  first() |>
  select(ADM1_EN = Name, ADM1_AR = `Native name`)
glimpse(arabic_adm1)
Rows: 15
Columns: 2
$ ADM1_EN <chr> "Adrar", "Assaba", "Brakna", "Dakhlet Nouadhibo…
$ ADM1_AR <chr> "أدرار", "لعصابة", "لبراكنة", "داخلة نواذيبو", …
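To see what the rvest pipeline does without hitting the network, here is a minimal sketch that parses a small hand-written HTML table. The HTML snippet is made up for illustration (its two rows reuse names shown in the output above), and the object names toy_tbl and toy_adm1 are hypothetical; only minimal_html, html_element and html_table come from rvest.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page: a tiny table with the same
# two columns that the post selects ("Name" and "Native name").
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Name</th><th>Native name</th></tr>
    <tr><td>Adrar</td><td>أدرار</td></tr>
    <tr><td>Assaba</td><td>لعصابة</td></tr>
  </table>')

# html_element() returns the first node matching the CSS selector;
# html_table() turns that <table> node into a tibble.
toy_tbl <- page |>
  html_element("table.wikitable") |>
  html_table()

# Rename the two columns to the names used in the post (base R here).
toy_adm1 <- data.frame(ADM1_EN = toy_tbl[["Name"]],
                       ADM1_AR = toy_tbl[["Native name"]])
toy_adm1
```

On the real page the layout of the table can change over time, so it is worth checking the column names of the scraped table before selecting from it.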
As you can see, this table contains the Arabic names (ADM1_AR); we now need to join it to our boundaries data. However, because of spelling differences between the ADM1_EN columns of the two tables, we need to apply approximate matching (stringdist::amatch).
ind <- amatch(arabic_adm1$ADM1_EN, mrt_adm1$ADM1_EN, maxDist = 4)
arabic_adm1$ADM1_EN <- mrt_adm1$ADM1_EN[ind]
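To make the role of the distance threshold concrete, here is a dependency-free sketch of the same idea using base R's utils::adist (generalized Levenshtein distance) as a stand-in. It is not what amatch does internally: amatch defaults to the optimal string alignment distance, so results can differ in edge cases. The three names in each vector are toy spellings chosen for illustration, and max_dist, nearest and ind_toy are hypothetical names.

```r
# Toy vectors; the spellings are illustrative, not the full datasets.
wiki_names <- c("Adrar", "Dakhlet Nouadhibou", "Nouakchott-Nord")
hdx_names  <- c("Adrar", "Dakhlet-Nouadhibou", "Nouakchott")

# Edit distances: rows follow wiki_names, columns follow hdx_names.
d <- utils::adist(wiki_names, hdx_names)

# For each scraped name, take the closest boundary name, but only accept it
# when the distance does not exceed the threshold (the role of maxDist).
max_dist <- 4
nearest <- apply(d, 1, which.min)
nearest_dist <- d[cbind(seq_along(wiki_names), nearest)]
ind_toy <- ifelse(nearest_dist <= max_dist, nearest, NA_integer_)

data.frame(wiki = wiki_names,
           matched = hdx_names[ind_toy],
           distance = nearest_dist)
```

In this toy case the space versus hyphen difference costs one edit and is accepted, while "Nouakchott-Nord" is five edits away from "Nouakchott" and stays unmatched (NA) with a threshold of 4.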
We are missing Nouakchott, since it was divided into three sections (North, South and West), but since we have most of the regions, we can join the two datasets and check the final result on a map.
final <- left_join(mrt_adm1,
                   select(arabic_adm1, ADM1_EN, ADM1_AR),
                   by = "ADM1_EN")
ggplot(final) +
  geom_sf() +
  geom_sf_label(aes(label = ADM1_AR)) +
  theme_minimal()
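A left join keeps every boundary row and leaves the Arabic name as NA where no match was found, so the gaps are easy to list afterwards. The sketch below shows this on toy data frames with base R's merge (all.x = TRUE) as a stand-in for left_join; the data frames boundaries and arabic and the objects joined and missing_ar are hypothetical and contain only three illustrative regions.

```r
# Toy boundary table: three region names, no geometry.
boundaries <- data.frame(ADM1_EN = c("Adrar", "Assaba", "Nouakchott"),
                         stringsAsFactors = FALSE)

# Toy lookup table: Arabic names for the two regions that were matched.
arabic <- data.frame(ADM1_EN = c("Adrar", "Assaba"),
                     ADM1_AR = c("أدرار", "لعصابة"),
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of `boundaries` (left join semantics).
joined <- merge(boundaries, arabic, by = "ADM1_EN", all.x = TRUE)

# Regions that still have no Arabic name after the join.
missing_ar <- joined$ADM1_EN[is.na(joined$ADM1_AR)]
missing_ar
```

The same check on the real data would filter final for rows where ADM1_AR is NA; those names can then be filled in by hand if needed.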
Session info for this analysis.
devtools::session_info()
─ Session info ────────────────────────────────────────────────
setting value
version R version 4.2.2 Patched (2022-11-12 r83340)
os Arch Linux
system x86_64, linux-gnu
ui X11
language en_US.UTF-8
collate en_US.UTF-8
ctype en_US.UTF-8
tz UTC
date 2022-12-28
pandoc 2.19.2 @ /usr/bin/ (via rmarkdown)
─ Packages ────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.2)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.2)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.2)
broom 1.0.2 2022-12-15 [1] CRAN (R 4.2.2)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.2)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.2)
class 7.3-20 2022-01-16 [1] CRAN (R 4.2.2)
classInt 0.4-8 2022-09-29 [1] CRAN (R 4.2.2)
cli 3.5.0 2022-12-20 [1] CRAN (R 4.2.2)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.2)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.2)
crul 1.3 2022-09-03 [1] CRAN (R 4.2.2)
curl 4.3.3 2022-10-06 [1] CRAN (R 4.2.2)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.2)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.2)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.2)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.2)
e1071 1.7-12 2022-10-24 [1] CRAN (R 4.2.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.2)
evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.2)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.2)
forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.2)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.2)
gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.2)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.2)
ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.2)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.2)
googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.2)
googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.2)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.2)
haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.2)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.2)
hoardr 0.5.2 2018-12-02 [1] CRAN (R 4.2.2)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.2)
htmlwidgets 1.6.0 2022-12-15 [1] CRAN (R 4.2.2)
httpcode 0.3.0 2020-04-10 [1] CRAN (R 4.2.2)
httpuv 1.6.7 2022-12-14 [1] CRAN (R 4.2.2)
httr * 1.4.4 2022-08-17 [1] CRAN (R 4.2.2)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.2)
KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.2)
knitr 1.41 2022-11-18 [1] CRAN (R 4.2.2)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.2)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.2)
lwgeom 0.2-10 2022-11-19 [1] CRAN (R 4.2.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.2)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.2)
modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.2)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.2)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.2.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.2)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.2.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.2)
processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.2)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.2)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.2)
proxy 0.4-27 2022-06-09 [1] CRAN (R 4.2.2)
ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.2)
purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.2)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.2)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.2)
readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.2)
readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.2)
reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.2)
rhdx * 0.1.0.9000 2022-11-03 [1] gitlab (dickoa/rhdx@c443336)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.2)
rmarkdown 2.19 2022-12-15 [1] CRAN (R 4.2.2)
rvest * 1.0.3 2022-08-19 [1] CRAN (R 4.2.2)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.2)
sf * 1.0-9 2022-11-08 [1] CRAN (R 4.2.2)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.2)
stars 0.6-0 2022-11-21 [1] CRAN (R 4.2.2)
stringdist * 0.9.10 2022-11-07 [1] CRAN (R 4.2.2)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.2)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.2)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.2)
tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.2)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.2)
tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.2)
timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.2)
triebeard 0.3.0 2016-08-04 [1] CRAN (R 4.2.2)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.2)
units 0.8-1 2022-12-10 [1] CRAN (R 4.2.2)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.2)
urltools 1.7.3 2019-04-14 [1] CRAN (R 4.2.2)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.2)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.2)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.2)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.2)
xfun 0.36 2022-12-21 [1] CRAN (R 4.2.2)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.2)
yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.2)
[1] /usr/lib/R/library
───────────────────────────────────────────────────────────────