# Internet Archive Universal Library / Decentralized Web README

The Internet Archive (http://archive.org) is famous for its Wayback Machine,
which has saved more than 362 billion web pages, and more recently for its
Decentralized Web project.

This Ansible role installs the Internet Archive's dweb-mirror project on
Internet-in-a-Box (IIAB). Use it to build up a dynamic offline library
from the materials you can explore at http://dweb.archive.org

The project is a local server that lets users browse resources from the
Internet Archive stored on local drives, including USB drives.

It includes a crawler that can regularly synchronize local collections against
a list of Internet Archive items and collections, and those collections can be
moved between installations.

When connected to the internet, the server works as a proxy, i.e. it stores
the Internet Archive (IA) content the user views for later offline viewing.

There are components to integrate the IA server with decentralized tools
including IPFS, WebTorrent, GUN and WOLK, both for fetching content and for
serving it back to the net or locally.

This is an ongoing project that is continually adding support for new Internet
Archive content types, new platforms and new decentralized transports.

## Using it
### Starting server

The server is started and restarted automatically. It can be turned on or off
from a terminal window with `service internetarchive start` or
`service internetarchive stop`.

### Browsing

The server can be accessed at [http://box:4244](http://box:4244) or
[http://box.lan:4244](http://box.lan:4244). If you don't have name resolution
set up to reach your Internet-in-a-Box, try
[http://box.local:4244](http://box.local:4244), which uses mDNS over the local
network.

_In future, we also hope to get [http://box/archive](http://box/archive) and
[http://box.lan/archive](http://box.lan/archive) working. As of 2019-05-25 the
error "Cannot GET /archive" appears; if you can help us fix
[/etc/apache2/sites-available/internetarchive.conf](https://github.com/iiab/iiab/blob/master/roles/internetarchive/templates/internetarchive.conf)
that would be incredible!_

If you don’t get an Archive UI, look at the server log (in the browser console)
for any “FAILING” log lines, which indicate a problem.

Expect to see errors in the browser log for
`http://localhost:5001/api/v0/version?stream-channels=true`. This request checks
for a local IPFS server, which is not started here.

On slower machines or slower network connections, expect to see no images the
first time. Refresh after a little while and most should appear.

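
If you are unsure which of the addresses above resolves on your network, a small script can probe each one in turn. The following is a minimal sketch, assuming `curl` is installed; the function names `probe` and `first_reachable` are illustrative and not part of dweb-mirror.

```shell
# probe URL: succeed if the URL answers within 5 seconds.
# Assumes curl is installed; probe and first_reachable are illustrative names.
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# first_reachable HOST...: print the first http://HOST:4244 that answers.
# Returns non-zero if none of the hosts respond.
first_reachable() {
  for host in "$@"; do
    if probe "http://$host:4244"; then
      echo "http://$host:4244"
      return 0
    fi
  done
  return 1
}
```

Calling `first_reachable box box.lan box.local` prints the first address that answers, or returns a non-zero status if none of them do.
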
## Administration

Administration is carried out mostly through the same user interface as browsing.

Access [http://box.lan:4244/local](http://box.lan:4244/local) to see a
display of local content. This interface is under development and various admin
tools will be added here. Unless your box has been configured differently, this
should also be the page you get at [http://box.lan:4244](http://box.lan:4244).

Access [http://box.lan:4244/home](http://box.lan:4244/home) to get the Internet
Archive main interface if connected to the net.

While viewing an item or collection, the "Crawl" button in the top bar
indicates whether the item is being crawled or not. Clicking it cycles
through three levels:

* No crawling
* Details - sufficient information will be crawled to display the page. For a
  collection this also means getting the thumbnails and metadata for the top
  items.
* Full - crawls everything on the item. This can be a LOT of data, including
  full-size videos etc., so use with care if bandwidth or disk space is limited.

### Disks

The server checks for caches of content in directories called `archiveorg` in
all the likely places. In particular, it looks in `/media/pi/*/archiveorg` for
any inserted USB drives, and if none are found, it uses `/library/archiveorg`.

The list of places it checks in an unmodified installation can be seen at
`https://github.com/internetarchive/dweb-mirror/blob/master/configDefaults.yaml#L7`.

You can override this in `dweb-mirror.config.yaml` in the home directory of the
user that runs the server. This is currently `/root/dweb-mirror.config.yaml`
(see 'Advanced' below).

The Archive's `Items` are stored in subdirectories of the first of these
directories found, but are read from any of the locations.

If your disk is getting full, it's perfectly safe to delete any
subdirectories, or to move them to an attached USB drive. It's also safe to move
attached USB drives from one device to another.

The one directory you should not move or delete is `archiveorg/.hashstore` in
any of these locations. The server will refetch anything else it needs if you
browse to the item again when connected to the internet.

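
Before deleting or moving cached items, it can help to see which cache directories exist and which items take the most space. The sketch below is illustrative only: `list_cache_dirs` and `largest_items` are hypothetical helper names, and only the two locations named in this section are checked, not the full list in `configDefaults.yaml`.

```shell
# list_cache_dirs [ROOT]: print existing archiveorg cache directories, USB
# drives first, then the /library fallback. ROOT is an optional prefix that
# exists only so the sketch can be tried against a scratch directory.
list_cache_dirs() {
  root="${1:-}"
  for dir in "$root"/media/pi/*/archiveorg "$root"/library/archiveorg; do
    if [ -d "$dir" ]; then
      echo "$dir"
    fi
  done
}

# largest_items CACHE_DIR [COUNT]: print the COUNT largest cached items,
# as size in KiB followed by the path, largest first.
# The * glob skips dot-directories, so .hashstore is never listed.
largest_items() {
  cache="$1"
  count="${2:-10}"
  du -sk "$cache"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_items /library/archiveorg 5` lists the five largest items in the fallback cache. Because `.hashstore` is excluded from the listing, nothing it prints is a directory the section above warns against removing.
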
### Maintenance

If you are worried about corruption, or after, for example, hand-editing or
moving cached items around, run:

```
# Run everything as root
sudo sh
# cd into the location of your installation
cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
# Rebuild the table of checksums
./internetarchive -m
```

This will usually take about 5-10 minutes, depending on the amount of material
cached, just to rebuild a table of checksums.

### Advanced

Most functionality of the tool is controlled by two YAML files, the second of
which you can edit if you have access to the shell.

You can view the current configuration by going to
[http://box.lan:4244/info](http://box.lan:4244/info) or
[http://localhost:4244/info](http://localhost:4244/info), depending on how you
are connected.

The default and user configurations are displayed as the `0` and `1` items in
the `/info` call.

The repo contains a commented
[default YAML file](https://github.com/internetarchive/dweb-mirror/blob/master/configDefaults.yaml).
It would be a bad idea to edit this, so I'm not going to
tell you where it is on your installation! But anything from this file can be
overridden by lines in `/root/dweb-mirror.config.yaml`. Make sure you
understand how YAML works before editing this file. If you break it, you can
copy a new default from
[dweb-mirror.config.yaml on the repo](https://github.com/internetarchive/dweb-mirror/blob/master/dweb-mirror.config.yaml).

TODO: Note this file will probably move location.

Note that this file is also edited automatically when the Crawl button
described above is clicked.

As the project develops, this file will be editable via a UI.

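
Since the user configuration file is edited both by hand and by the Crawl button, keeping a timestamped copy before hand-editing is cheap insurance. This is a minimal sketch; `backup_config` is a hypothetical helper, not something dweb-mirror provides, and the default path is the one given above.

```shell
# backup_config [FILE]: copy FILE to FILE.bak-YYYYMMDD-HHMMSS and print the
# name of the copy. backup_config is an illustrative helper, not part of
# dweb-mirror; the default path is the one this README names.
backup_config() {
  config="${1:-/root/dweb-mirror.config.yaml}"
  if [ ! -f "$config" ]; then
    echo "no config file at $config" >&2
    return 1
  fi
  backup="$config.bak-$(date +%Y%m%d-%H%M%S)"
  cp -p "$config" "$backup" && echo "$backup"
}
```

Run `backup_config` as root before editing. If the edited file turns out to be broken, copying the backup over it restores the previous state without having to start again from the repo default.
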
## Update

Dweb-mirror is under rapid development, as is the JavaScript UI. It's
recommended to update frequently.

From a terminal window:

```
sudo sh # Run all commands as root
cd /opt/iiab/internetarchive
yarn upgrade # Currently this can take up to about 20 minutes to run; we hope to reduce that time
```

## Crawling

The crawler will be built into the UI fairly soon. For now it has to be run in
a terminal window.

It's highly configurable, either through the YAML file described above or from
the command line.

In a shell:

```
# Run all commands as root from dweb-mirror's directory
sudo sh

# cd into the location of your installation
cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror

# To get a full list of possible arguments
./internetarchive --help

# Perform a standard crawl
./internetarchive --crawl

# To fetch the "foobar" item from IA
./internetarchive --crawl foobar

# To crawl the top 10 items in the prelinger collection sufficiently to display them,
# and put them on a disk plugged in at /media/pi/xyz
# TODO check where pi actually put them.
./internetarchive --copydirectory /media/pi/xyz/archiveorg --crawl --rows 10 --level details prelinger
```

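
To fetch several items in one go, the single-item command above can be wrapped in a loop. This is a sketch under assumptions: `crawl_items` and the `CRAWLER` variable are hypothetical names, only the `--crawl` and `--level details` flags shown above are used, and it assumes the crawler exits with a non-zero status when a crawl fails.

```shell
# crawl_items ITEM...: crawl each item at the "details" level, one at a time.
# CRAWLER defaults to ./internetarchive, so run this from dweb-mirror's
# directory. Assumes the crawler exits non-zero on failure.
crawl_items() {
  crawler="${CRAWLER:-./internetarchive}"
  failed=0
  for item in "$@"; do
    if ! "$crawler" --crawl --level details "$item"; then
      echo "crawl failed for $item" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A failed item does not stop the loop, so one bad identifier will not prevent the rest of the list from being crawled. The function returns a non-zero status at the end if any item failed.
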
## Troubleshooting

There are two logs of relevance: the browser and the server.

**Browser**: If using Chrome, this is at View / Developer Tools /
JavaScript Console or something similar.

**Server**:
From a terminal window:

```
journalctl -u internetarchive
```

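
The server log can get long, so it may help to filter it for the "FAILING" lines mentioned under Browsing. The sketch below assumes those lines also appear in the systemd journal, which this README does not confirm; `failing_lines` is a hypothetical helper name.

```shell
# failing_lines: read log text on stdin and print only the lines that
# contain "FAILING". Returns success even when nothing matches.
# Assumption: the server writes its FAILING lines to the journal.
failing_lines() {
  grep -F "FAILING" || true
}
```

A typical use is `journalctl -u internetarchive --since "1 hour ago" --no-pager | failing_lines`, which limits the search to the last hour of server output.
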
## Known Issues

See
[github dweb-mirror issues](https://github.com/internetarchive/dweb-mirror/issues)
and
[github dweb-archive issues](https://github.com/internetarchive/dweb-archive/issues).

## More info

Dweb-Mirror lives on GitHub at:

* [dweb-mirror](https://github.com/internetarchive/dweb-mirror)
* [source](https://github.com/internetarchive/dweb-mirror)
* [issues](https://github.com/internetarchive/dweb-mirror/issues)
* [API.md](./API.md) API documentation for dweb-mirror

This project is part of the Internet Archive's larger Dweb project. See also:

* [dweb-universal](https://github.com/internetarchive/dweb-universal) info about others distributing the web
* [dweb-transport](https://github.com/internetarchive/dweb-transport) miscellaneous, incl. GUN gateway and WebTorrent
* [dweb-objects](https://github.com/internetarchive/dweb-objects) library of dweb objects
* [dweb-archive](https://github.com/internetarchive/dweb-archive) archive UI in JavaScript
* [dweb-archivecontroller](https://github.com/internetarchive/dweb-archivecontroller) knows about the structure of archive objects