On HDX, you can download and use the administrative boundaries of Mauritania, with one caveat: the names of the administrative divisions are translated from Arabic to English. For some analyses, it is useful to have the Arabic names in the same table as well. In this post, we are going to scrape a table with the Arabic names from a website and then join it to our administrative boundaries data. We will need the rhdx package (not yet on CRAN) and the following packages:
library(tidyverse)
library(sf)
library(purrr)
library(rvest)
library(stringdist)
library(httr)
library(rhdx) ## remotes::install_gitlab("dickoa/rhdx")
We can use rhdx::pull_dataset to read the Mauritania administrative boundaries dataset into R and rhdx::get_resources to list its available resources (i.e. files).
pull_dataset("cod-ab-mrt") %>%
  get_resources() %>%
  as_tibble() %>%
  slice_head(n = 3)
# A tibble: 3 × 5
resource_id resou…¹ resou…² resou…³ resource
<chr> <chr> <chr> <chr> <list>
1 cab34844-8dd1-4a1b-af55-db5… MRT_Ad… xlsx https:… <HDXResrc>
2 dacb6ad2-13b6-4f14-b1e9-44b… mrt_ad… shp https:… <HDXResrc>
3 1e7d4873-1151-4d83-a38d-11f… mrt_ad… emf https:… <HDXResrc>
# … with abbreviated variable names ¹resource_name,
# ²resource_format, ³resource_url
We can see from the output that the second resource is the shapefile, which contains the regions layer.
mrt_adm1 <- pull_dataset("cod-ab-mrt") %>%
  get_resource(2) %>%
  read_resource(layer = "mrt_admbnda_adm1_gov_20200801")
glimpse(mrt_adm1)
Rows: 13
Columns: 13
$ Shape_Leng <dbl> 22.8673631, 12.5340794, 8.6261225, 13.117716…
$ Shape_Area <dbl> 19.40005835, 3.04749757, 2.82872414, 3.24448…
$ ADM1_EN <chr> "Adrar", "Assaba", "Brakna", "Dakhlet-Nouadh…
$ ADM1_PCODE <chr> "MR01", "MR02", "MR03", "MR04", "MR05", "MR0…
$ ADM1_REF <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM1ALT1EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM1ALT2EN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ADM0_EN <chr> "Mauritania", "Mauritania", "Mauritania", "M…
$ ADM0_PCODE <chr> "MR", "MR", "MR", "MR", "MR", "MR", "MR", "M…
$ date <date> 2020-06-12, 2020-06-12, 2020-06-12, 2020-06…
$ validOn <date> 2020-07-31, 2020-07-31, 2020-07-31, 2020-07…
$ validTo <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geometry <POLYGON [°]> POLYGON ((-6.3422 22.8704, ..., POLY…
We can see that the Arabic names are not available in this data. We can also visualize the available names using ggplot2 and sf.
mrt_adm1 %>%
  ggplot() +
  geom_sf() +
  geom_sf_label(aes(label = ADM1_EN)) +
  theme_minimal()
We need the Arabic names, and this Wikipedia page has a table with the names in both Arabic and English. We can use the rvest R package to scrape the data and map it to our geospatial layer.
url <- "https://en.wikipedia.org/wiki/Regions_of_Mauritania"
arabic_adm1 <- url |>
  read_html() |>
  html_nodes("table.wikitable") |>
  html_table() |>
  first() |>
  select(ADM1_EN = Name, ADM1_AR = `Native name`)
glimpse(arabic_adm1)
Rows: 15
Columns: 2
$ ADM1_EN <chr> "Adrar", "Assaba", "Brakna", "Dakhlet Nouadhibo…
$ ADM1_AR <chr> "أدرار", "لعصابة", "لبراكنة", "داخلة نواذيبو", …
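Since the Wikipedia page can change over time, it may help to try the same selector logic offline first. The sketch below is only an illustration: the inline HTML fragment is a hypothetical stand-in for the real page (the two rows reuse names from the output above), and `toy_names` is a name made up for this example.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page: a single "wikitable" with
# the two columns used in this post plus one extra column to drop.
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Name</th><th>Native name</th><th>Capital</th></tr>
    <tr><td>Adrar</td><td>أدرار</td><td>Atar</td></tr>
    <tr><td>Assaba</td><td>لعصابة</td><td>Kiffa</td></tr>
  </table>
')

# html_element() returns the first table matching the CSS selector,
# and html_table() parses it into a tibble using the <th> row as header.
toy_names <- page |>
  html_element("table.wikitable") |>
  html_table()

# Keep and rename only the two columns we need (base R, no dplyr required).
toy_names <- toy_names[, c("Name", "Native name")]
names(toy_names) <- c("ADM1_EN", "ADM1_AR")
toy_names
```

If the real page ever renames its columns, the column selection step is where the scraping code would fail, so it is the first place to look.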
As you can see, this table contains the Arabic names (ADM1_AR). We now need to join it to our boundaries data. However, because the spelling differs between the ADM1_EN columns of the two tables, we need to apply approximate matching (stringdist::amatch).
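To see how the maxDist threshold behaves, here is a small sketch on made-up vectors. The spellings in `wiki_names` and `cod_names` are hypothetical stand-ins for the two ADM1_EN columns, not the actual values in the data; only the space versus hyphen difference in "Dakhlet Nouadhibou" is visible in the outputs above.

```r
library(stringdist)

# Hypothetical spellings standing in for the two ADM1_EN columns.
wiki_names <- c("Adrar", "Dakhlet Nouadhibou", "Nouakchott-Nord")
cod_names  <- c("Adrar", "Dakhlet-Nouadhibou", "Nouakchott")

# For each element of wiki_names, amatch() returns the index of the closest
# element of cod_names, or NA if no candidate is within maxDist edits.
idx <- amatch(wiki_names, cod_names, maxDist = 4)
idx
#> [1]  1  2 NA

# The edit distances behind those decisions (default method: "osa").
stringdist("Dakhlet Nouadhibou", "Dakhlet-Nouadhibou")
#> [1] 1
stringdist("Nouakchott-Nord", "Nouakchott")
#> [1] 5
```

A larger maxDist catches more spelling variants but also raises the risk of matching two different regions, so it is worth inspecting the matched pairs by eye. The real call on the two tables follows.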
ind <- amatch(arabic_adm1$ADM1_EN, mrt_adm1$ADM1_EN, maxDist = 4)
arabic_adm1$ADM1_EN <- mrt_adm1$ADM1_EN[ind]
We are missing Nouakchott since it was divided into three sections (North, South and West), but since we have most of the regions, we can join the two tables and check the final result on a map.
final <- left_join(mrt_adm1,
                   select(arabic_adm1, ADM1_EN, ADM1_AR),
                   by = "ADM1_EN") # explicit key, avoids the join message
ggplot(final) +
  geom_sf() +
  geom_sf_label(aes(label = ADM1_AR)) +
  theme_minimal()
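Besides looking at the map, the regions left without an Arabic name can be listed directly after the join. The sketch below is a minimal illustration under stated assumptions: `boundaries` and `arabic` are hypothetical plain tibbles standing in for the sf layer and the scraped table, and the third Arabic value is a placeholder string, not real data.

```r
library(dplyr)

# Toy stand-ins for the boundaries layer and the scraped table.
boundaries <- tibble(ADM1_EN = c("Adrar", "Assaba", "Nouakchott"))
arabic <- tibble(
  ADM1_EN = c("Adrar", "Assaba", NA),        # NA: no approximate match found
  ADM1_AR = c("أدرار", "لعصابة", "placeholder")
)

# A left join keeps every boundary row; unmatched rows get NA in ADM1_AR.
joined <- left_join(boundaries, arabic, by = "ADM1_EN")

# Rows that did not receive an Arabic name.
missing_ar <- filter(joined, is.na(ADM1_AR))
missing_ar$ADM1_EN
#> [1] "Nouakchott"
```

The same filter applied to the real joined table would show which regions still need their Arabic name filled in by hand.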
Session info for this analysis.
devtools::session_info()
─ Session info ────────────────────────────────────────────────
setting value
version R version 4.2.2 Patched (2022-11-12 r83340)
os Arch Linux
system x86_64, linux-gnu
ui X11
language en_US.UTF-8
collate en_US.UTF-8
ctype en_US.UTF-8
tz UTC
date 2022-12-28
pandoc 2.19.2 @ /usr/bin/ (via rmarkdown)
─ Packages ────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.2)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.2)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.2)
broom 1.0.2 2022-12-15 [1] CRAN (R 4.2.2)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.2)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.2)
class 7.3-20 2022-01-16 [1] CRAN (R 4.2.2)
classInt 0.4-8 2022-09-29 [1] CRAN (R 4.2.2)
cli 3.5.0 2022-12-20 [1] CRAN (R 4.2.2)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.2)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.2)
crul 1.3 2022-09-03 [1] CRAN (R 4.2.2)
curl 4.3.3 2022-10-06 [1] CRAN (R 4.2.2)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.2)
dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.2)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.2)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.2)
dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.2)
e1071 1.7-12 2022-10-24 [1] CRAN (R 4.2.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.2)
evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.2)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.2)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.2)
forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.2)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.2)
gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.2)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.2)
ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.2)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.2)
googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.2)
googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.2)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.2)
haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.2)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.2)
hoardr 0.5.2 2018-12-02 [1] CRAN (R 4.2.2)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.2)
htmlwidgets 1.6.0 2022-12-15 [1] CRAN (R 4.2.2)
httpcode 0.3.0 2020-04-10 [1] CRAN (R 4.2.2)
httpuv 1.6.7 2022-12-14 [1] CRAN (R 4.2.2)
httr * 1.4.4 2022-08-17 [1] CRAN (R 4.2.2)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.2)
KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.2)
knitr 1.41 2022-11-18 [1] CRAN (R 4.2.2)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.2)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.2)
lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.2)
lwgeom 0.2-10 2022-11-19 [1] CRAN (R 4.2.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.2)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.2)
modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.2)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.2)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.2.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.2)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.2.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.2)
processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.2)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.2)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.2)
proxy 0.4-27 2022-06-09 [1] CRAN (R 4.2.2)
ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.2)
purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.2)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.2)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.2)
readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.2)
readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.2)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.2)
reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.2)
rhdx * 0.1.0.9000 2022-11-03 [1] gitlab (dickoa/rhdx@c443336)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.2)
rmarkdown 2.19 2022-12-15 [1] CRAN (R 4.2.2)
rvest * 1.0.3 2022-08-19 [1] CRAN (R 4.2.2)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.2)
sf * 1.0-9 2022-11-08 [1] CRAN (R 4.2.2)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.2)
stars 0.6-0 2022-11-21 [1] CRAN (R 4.2.2)
stringdist * 0.9.10 2022-11-07 [1] CRAN (R 4.2.2)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.2)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.2)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.2)
tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.2)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.2)
tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.2)
timechange 0.1.1 2022-11-04 [1] CRAN (R 4.2.2)
triebeard 0.3.0 2016-08-04 [1] CRAN (R 4.2.2)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.2)
units 0.8-1 2022-12-10 [1] CRAN (R 4.2.2)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.2)
urltools 1.7.3 2019-04-14 [1] CRAN (R 4.2.2)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.2)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.2)
vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.2)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.2)
xfun 0.36 2022-12-21 [1] CRAN (R 4.2.2)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.2)
yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.2)
[1] /usr/lib/R/library
───────────────────────────────────────────────────────────────