internetarchive: Update README

2025-03-09 15:40:17 +00:00 · 2019-10-16 15:09:57 +11:00 · 2019-10-16 15:09:57 +11:00 · 7ea25a3a63
commit 7ea25a3a63
parent b524d93fdc
1 changed files with 275 additions and 130 deletions
--- a/roles/internetarchive/README.md
+++ b/roles/internetarchive/README.md
@ -1,71 +1,284 @@
-# Internet Archive Offline / Universal Library / Decentralized Web README
+# Offline Internet Archive
-The Internet Archive (http://archive.org) is famous for their WayBack Machine
+The Internet Archive offers perhaps the world’s largest online store of open content. 
-that has saved 384+ Billion web pages, and more recently their Decentralized
+The wisdom of the ages, just a few clicks away. As Wikipedia has become the world’s encyclopedia, 
-Web project.
+the Internet Archive has become its library. 
 Central to our mission is establishing “Universal Access to All Knowledge”. 
 Access to our library of millions of books, journals, audio and video recordings and beyond is free to anyone.
 This Ansible role installs the Internet Archive's dweb-mirror project on
 Internet-in-a-Box (IIAB).  Use this to build up a dynamic offline library
 arising from the materials you can explore at http://dweb.archive.org
-The project is a local server that allows users to browse resources from the
+The Offline Internet Archive server:
 Internet Archive stored on local drives - including USB drives.  
-It includes a crawler that can regularly synchronize local collections, against
+* Crawls Internet Archive collections to a local server.
-a list of Internet Archive items and collections, and those collections can be
+* Serves that content locally,
-moved between installations.
+* Caches content while browsing.
 * Moves content between servers by sneakernet — on disks, USB sticks, and SD cards.
 * Delivers (mostly) the Internet Archive UI offline in javascript in the browser,
 * Is open source
 * And is being made available in other languages.
-When connected to the internet, the server works as a Proxy, i.e. it will store
+## Starting server
 Internet Archive (IA) content the user views for later off-line viewing. 
 There are components to integrate the IA server with decentralized tools
 including IPFS, WebTorrent, GUN, WOLK, both for fetching content and for
 serving it back to the net or locally. 
 This is an ongoing project, continually adding support for new Internet Archive
 content types; new platforms; and new decentralized transports.
 ## Using it
 ### Starting server
 The server is started and restarted automatically.  It can be turned on or off
 at a terminal window with `service internetarchive start` or `service
 internetarchive stop` 
-### Browsing
+## Browsing
-The server can be accessed at [http://box:4244](http://box:4244) or
+Open the web page at [http://box:4244](http://box:4244) or
 [http://box.lan:4244](http://box.lan:4244) (try
 [http://box.local:4244](http://box.local:4244) via mDNS over a local network,
 if you don't have name resolution set up to reach your Internet-in-a-Box).
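 The addresses above can be probed from a terminal to see which one reaches the server. This is a minimal sketch, assuming `curl` is installed; the helper names `ia_url` and `find_ia_server` are made up for illustration and are not part of dweb-mirror.
 ```shell
 # Hypothetical helper: build the server URL for a host name (port 4244 as above).
 ia_url() {
   printf 'http://%s:4244\n' "$1"
 }
 # Hypothetical helper: print the first address that answers within 5 seconds.
 find_ia_server() {
   for host in box box.lan box.local; do
     if curl --silent --fail --max-time 5 --output /dev/null "$(ia_url "$host")"; then
       ia_url "$host"
       return 0
     fi
   done
   echo "no Internet Archive server reachable" >&2
   return 1
 }
 ```
 Running `find_ia_server` prints the first address that responds, or returns a non-zero status if none does.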
-_If future, we also hope to get [http://box/archive](http://box/archive) and
+There are several aspects to managing content on the Internet Archive’s Universal Library which are covered below, 
-[http://box.lan/archive](http://box.lan/archive) working (as of 2019-05-25 the
+these include crawling content to your own system, or to an external drive suitable for moving to another system, 
-error "Cannot GET /archive" appears — if you can help us fix
+and managing a collection of material on the archive that others can download automatically. 
 [/etc/apache2/sites-available/internetarchive.conf](https://github.com/iiab/iiab/blob/master/roles/internetarchive/templates/internetarchive.conf)
 that would be incredible!)_
-If you don’t get an Archive UI then look at the server log (in browser console)
+Try walking through the following steps to get a tour of the system and understand more about:
 to see for any “FAILING” log lines which indicate a problem. 
-Expect to see errors in the Browser log for
+* Using the interface
-`http://localhost:5001/api/v0/version?stream-channels=true` which is checking
+* Details page - viewing a single item
-for a local IPFS server which is not started here.
+* Collection and Search pages - multiple items
 * Accessing Internet Archive resources
 * Managing Crawling
 * Downloading content for a different box
 * Managing collections on Internet Archive
-Expect, on slower machines or slower network connections, to see no images the
+or you can click `Home` or the Internet Archive logo, 
-first time, refresh after a little while and most should appear. 
+if you just want to explore the Internet Archive's resources.
 ## Using the page
 Whichever of the addresses above works, it should bring you to your `local` start page.
 You can get back here at any time, via the `Local` button.
 If you have used the Internet Archive then the interface will be familiar, 
 but there are a few differences to support offline use. 
 At the top you'll see the Internet Archive's usual interface, a few of these buttons will (for now) only work 
 while online, and don't appear when offline.
 Below that is a row of information specific to the offline application.
 First are health indicators. 
 * If it shows "Mirror" in Red, it means we can't communicate with the mirror gateway, 
 this will only happen if the gateway goes offline part way through a process.
 * Normally you'll see an indicator for GATEWAY, which is Green when the gateway can talk to the Archive, 
  and Red when you are offline.
 * Then comes an indicator for this page, whether it is being crawled, and if so approximately how much has been stored. 
 * If the mirror is online to the Internet Archive (GATEWAY shows Green) then next comes a "Reload" button, 
 you can click this to force it to check with the Archive for an up to date list. 
 It is most useful on collections when someone else might have added something, 
 but your gateway might be remembering an old version.
 * Then there is a Settings button which brings up a page that includes status of any crawls.
 * Finally there is a Home button which will bring you back to this page. 
 Each tile on this page represents an item that your server will check for when it “crawls”.  
 The first time you access the server this will depend on what was installed on the server, and it might be empty. 
 Notice that most of the tiles should have a White, Green or Blue dot in the top right to indicate that you are crawling them. 
 * A White dot means enough of the item has been downloaded for it to be viewed offline. 
 * The Green dot indicates that we are checking this item each time we crawl and getting enough to display offline. 
 * A Blue dot indicates we are crawling all the content of the item, this could be a lot of data, 
 for example a full resolution version of the video. It's rare that you’ll use this. 
 This button also shows how much has been downloaded: for an item it's the total size of downloaded files/pages,
 for a collection it's the total amount in all collection members. 
 Tiles come in two types. Most show items that can be displayed - books, videos, audio etc, 
 clicking on these will display the item. 
 Some of the tiles will show a collection which is a group of items that someone has collected together, 
 most likely there will be at least one collection relevant to your project put on the page during installation.  
 It shows you how many items are in the collection and how many have been downloaded 
 e.g. 400Mb in 10 of 123 items, means 10 of the 123 items in the collection are downloaded sufficient to view offline,
 and a total of 400Mb is downloaded in this collection. (Which includes some files, like thumbnails, in other items).
 ## Details page - viewing a single item
 If you click on an item that is already downloaded (Blue, Green or White dot) then it will be displayed offline, 
 the behavior depends on the kind of item.
 * Images are displayed and saved for offline use
 * Books display in a flip book format, pages you look at will be saved for offline use. 
 * Video and Audio will play immediately and you can skip around in them as normal
 The crawl button at the top will indicate whether the object is being crawled and if not, whether it has been downloaded, 
 in the same way tiles do, and also show you (approximately) the total downloaded for this item. 
 Click on the Crawl button till it turns Green and it will download a full copy of the book, video or audio.
 It waits about 30 seconds to do this, allowing time to cycle back to the desired level of crawling.
 These items will also appear on your Local page.  
 See the note above: usually you won’t want to leave it at Blue (all), as this will usually try
 (there are some size limits) to download all the files.
 There is a Reload button which will force the server to try archive.org, 
 this is useful if you think the item has changed, or for debugging.
 If you want to Save this item to a specific disk, for example to put it on a USB-drive then click the Save button.  
 This button brings up a dialogue with a list of the available destinations. 
 These should include any inserted drive with "archiveorg" as a directory at its top level. 
 The content will be copied to that drive, which can then be removed and inserted into a different server.
 The server checks whether these disks are present every 15 seconds, so to use a new USB disk:
 * Insert the USB 
 * Create a folder at its top level called `archiveorg`
 * Wait about 15 seconds
 * Reload the page you are on
 * Hitting `Save` should now allow this USB disk to be selected. 
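 The folder-creation step above can also be done from a terminal. This is a minimal sketch; `prepare_drive` is a hypothetical helper name, and the mount point you pass in (for example `/media/pi/xyz`) depends on your system.
 ```shell
 # Hypothetical helper: create the top-level archiveorg folder on a mounted drive.
 # $1 is the drive's mount point, e.g. /media/pi/xyz (an example path).
 prepare_drive() {
   if [ ! -d "$1" ]; then
     echo "mount point $1 not found" >&2
     return 1
   fi
   mkdir -p "$1/archiveorg"
 }
 ```
 After `prepare_drive /media/pi/xyz` succeeds, wait about 15 seconds and reload the page so the server can pick up the new drive.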
 ## Collection and Search pages - multiple items
 If you click on a Collection, then we’ll display a grid of tiles for all the items that have been placed in the collection. 
 White, Green and Blue indicators mean the same as on the Local page. 
 If you click on the crawl button till it's Green then it will check this collection each time it crawls, 
 download the tiles for the first page or so, and can be configured to get some of the items as well. 
 ## Accessing Internet Archive resources
 The Internet Archive logo tile on the local page will take you to the Archive front page collection, 
 content here is probably not already downloaded or crawled, 
 but can be selected for crawling as for any other item.
 ## Managing crawling
 If you click on the "Settings" button, it should bring up a page of settings to control Crawling.
 This page is still under development (as of June 2019). 
 On here you will see a list of crawls.
 You should get useful information about status, any errors etc. 
 Hitting `<<` will restart the crawl and `||` or `>` pause and resume,
 but note that any file already being downloaded will continue to do so when you hit pause. 
 Hitting `||` `<<` `>` will stop the current crawl, reset and retry, which is a good way to try again if,
 for example, you lost connection to the server part way through.   
 ## Crawling
 The Crawler runs automatically at startup and when you add something to the crawl, 
 but it can also be configured through the YAML file described above
 or run at a command line for access to more functionality.
 In a shell
 ```
 sudo sh
 ```
 cd into the location for your installation, on most platforms it is:
 ```
 cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
 ```
 Perform a standard crawl
 ```
 ./internetarchive --crawl 
 ```
 To fetch the "foobar" item from IA. 
 ```
 ./internetarchive --crawl foobar 
 ```
 To crawl top 10 items in the prelinger collection sufficiently to display and put 
 them on a disk plugged into the /media/pi/xyz.
 ```
 ./internetarchive --copydirectory /media/pi/xyz/archiveorg --crawl --rows 10 --level details prelinger
 ```
 To get a full list of possible arguments and some more examples
 ```
 ./internetarchive --help
 ```
 ### Advanced crawling
 If you have access to the command line on the server, then there is a lot more you can do with the crawler.
 The items selected for crawling (Green or Blue dots) are stored in a file `dweb-mirror.config.yaml` 
 in the home directory of the user that runs the server, e.g. on IIAB it's in /root/dweb-mirror.config.yaml 
 and on your laptop it's probably in ~/dweb-mirror.config.yaml.
 You can edit this file, with care! 
 From the command line, cd into the directory holding the service to run the crawler, e.g. on IIAB
 ```
 cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
 ./internetarchive --crawl
 ```
 There are lots of options possible, try `./internetarchive --help` to get guidance.
 This functionality will be gradually added to the UI in future releases.
 In the meantime if you have something specific you want to do feel free to post it as a new issue on 
 [github](https://github.com/internetarchive/dweb-mirror/issues/new).
 ## Downloading content for a different box
 You can copy one or more items that are downloaded to a new storage device (e.g. a USB drive), 
 take that device to another Universal Library server, and plug it in.  
 All the content will appear as if it was downloaded there. 
 To put content onto a device, you can either:
 * put the `copydirectory` field in the yaml file described above, 
 * hit `Save` while on an item or search
 * or run a crawl at the command line 
 ``` 
 # CD into your device e.g. on an IIAB it would be 
 cd /media/pi/foo
 # Create a directory to use for the content, it must be called "archiveorg"
 mkdir archiveorg 
 # CD to the installation
 cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
 # Copy the current crawl to the directory created above
 ./internetarchive --crawl --copydirectory /media/pi/foo/archiveorg
 ```
 When it's finished, you can unplug the USB drive and plug it into any other device. 
 Alternatively if you want to crawl a specific collection e.g. `frenchhistory` to the drive, you would use:
 ```
 ./internetarchive --crawl --copydirectory /media/pi/foo/archiveorg frenchhistory
 ```
 If you already have this content on your own device, then the crawl is quick, 
 and just checks the content is up to date. 
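 To put several collections on the same drive, the crawl can be repeated once per identifier. This is a sketch only: `crawl_to_drive` and the `DRY_RUN` variable are hypothetical names, it uses only the `--crawl` and `--copydirectory` flags shown above, and it assumes you have already changed into the installation directory.
 ```shell
 # Hypothetical helper: crawl each named item or collection onto a drive, one at a time.
 # $1 is the target archiveorg directory; the remaining arguments are identifiers.
 # Set DRY_RUN=1 to print the commands instead of running them.
 crawl_to_drive() {
   target=$1
   shift
   for id in "$@"; do
     if [ "${DRY_RUN:-0}" = 1 ]; then
       echo "./internetarchive --crawl --copydirectory $target $id"
     else
       ./internetarchive --crawl --copydirectory "$target" "$id" || return 1
     fi
   done
 }
 ```
 For example, `DRY_RUN=1 crawl_to_drive /media/pi/foo/archiveorg frenchhistory prelinger` prints the two commands that would run, which is a cheap way to check paths before starting a long crawl.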
 ## Managing collections on Internet Archive
 You can create and manage your own collections on the [Internet Archive site](http://www.archive.org).  
 Other people can then crawl those collections. 
 First get in touch with Mitra Ardron at mitra@archive.org, as processes may have changed since this was written.
 You'll need to create an account for yourself at [archive.org](https://archive.org).
 We'll set up a collection for you of type "texts" - don't worry, you can put any kind of media in it. 
 Once you have a collection, let's say `kenyanhistory`,
 you can upload materials to the Archive by hitting the Upload button and following the instructions.
 You can also add any existing material on the Internet Archive to this collection.  
 * Find the material you are looking for
 * You should see a URL like `https://archive.org/details/foobar`
 * Copy the identifier which in this case would be 'foobar'
 * Go to `https://archive.org/services/simple-lists-admin/?identifier=kenyanhistory&list_name=items` 
 replacing `kenyanhistory` with the name of your collection.
 * Enter the name of the item `foobar` into the box and click "Add". 
 * It might take a few minutes to show up, you can add other items while you wait. 
 * The details page for the collection should then show your new item `https://archive.org/details/kenyanhistory`
 On the device, go to `kenyanhistory` and hit Reload; `foobar` should then show up. 
 If `kenyanhistory` is marked for crawling it should update automatically.
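 The two URLs used in the steps above differ only in the identifier, so they can be generated rather than typed. A minimal sketch, with hypothetical helper names `lists_admin_url` and `details_url`; the URL patterns are the ones quoted above and may change on archive.org.
 ```shell
 # Hypothetical helper: print the list-admin URL for a collection identifier.
 lists_admin_url() {
   printf 'https://archive.org/services/simple-lists-admin/?identifier=%s&list_name=items\n' "$1"
 }
 # Hypothetical helper: print the details page URL for an item or collection.
 details_url() {
   printf 'https://archive.org/details/%s\n' "$1"
 }
 ```
 For instance, `lists_admin_url kenyanhistory` gives the page where `foobar` is added, and `details_url kenyanhistory` gives the page where it should later appear.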
 ## Administration
 Administration is carried out mostly through the same User Interface as browsing. 
-Access [http://box.lan:4244/local](http://box.lan:4244/local) to see a
+Select `local` from any of the pages to access a display of local content. 
-display of local content, this interface is under development and various admin
+Administration tools are under `Settings`.
 tools will be added here.  Unless your box has been configured differently this 
 should also be the page you get at [http://box.lan:4244/local](http://box.lan:4244/local).
-Access [http://box.lan:4244/home](http://box.lan:4244/home) to get the Internet
+Click on the Archive logo, in the center-top, to get the Internet
 Archive main interface if connected to the net. 
 While viewing an item or collection, the "Crawl" button in the top bar
@ -79,29 +292,31 @@ through three levels:
 * Full - crawls everything on the item, this can be a LOT of data, including
  full size videos etc, so use with care if bandwidth/disk is limited.
-### Disks
+### Disk storage
 The server checks for caches of content in directories called `archiveorg` in
-all the likely places, in particular it looks in `/media/pi/*archiveorg` for
+all the likely places, in particular it looks for any inserted USB drives
-any inserted USB drives, and if none are found, it uses `/library/archiveorg`.
+on most systems, and if none are found, it uses `~/archiveorg`.
 The list of places it checks, in an unmodified installation can be seen at 
 `https://github.com/internetarchive/dweb-mirror/blob/master/configDefaults.yaml#L7`.
 You can override this in `dweb-mirror.config.yaml` in the home directory of the
-user that runs the server, this is currently `/root/dweb-mirror.config.yaml`
+user that runs the server. (Note on IIAB this is currently in `/root/dweb-mirror.config.yaml`)
 (see 'Advanced' below)
 Archive's `Items` are stored in subdirectories of the first of these
 directories found, but are read from any of the locations. 
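 The "first directory found" rule can be illustrated from a terminal. This is a sketch of the idea, not the server's own code: `first_existing_dir` is a hypothetical helper, and the candidate paths in the usage note are examples only; the real list is in `configDefaults.yaml`.
 ```shell
 # Hypothetical helper: print the first argument that is an existing directory.
 first_existing_dir() {
   for dir in "$@"; do
     if [ -d "$dir" ]; then
       printf '%s\n' "$dir"
       return 0
     fi
   done
   return 1
 }
 ```
 Calling `first_existing_dir /media/pi/xyz/archiveorg "$HOME/archiveorg"` prints whichever of the two exists first, which is roughly where new items would be expected to land under the rule described above.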
 If your disk space is getting full, it's perfectly safe to delete any
-subdirectories, or to move them to an attached USB.  Its also safe to move
+subdirectories (except `archiveorg/.hashstore`), and the server will refetch anything else it needs 
-attached USB's from one device to another.
+next time you browse to the item while connected to the internet. 
 It's also safe to move directories to an attached USB drive
 (underneath an `archiveorg` directory at the top level of the disk). 
 It is also safe to move attached USB drives from one device to another.
-The one directory you should not move or delete is `archiveorg/.hashstore` in
+Some of this functionality for handling disks is still under active development, 
-any of these locations, the server will refetch anything else it needs if you
+but most of it works now.
 browse to the item again when connected to the internet. 
 ### Maintenance
@ -109,7 +324,7 @@ If you are worried about corruption, or after for example hand-editing or
 moving cached items around. 
 ```
 # Run everything as root
-sudo sh
+sudo su
 # cd into location for your installation
 cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
 ./internetarchive -m
@ -122,11 +337,7 @@ cached,  just to rebuild a table of checksums.
 Most functionality of the tool is controlled by two YAML files, the second of
 which you can edit if you have access to the shell. 
-You can view the current configuration by going to
+You can view the current configuration by going to `/info` on your server.
 [http://box.lan:4244/info](http://box.lan:4244/info) or
 [http://localhost:4244/info](http://localhost:4244/info) depending on how you
 are connected.
 The default, and user configurations are displayed as the `0` and `1` item in
 the `/info` call. 
@ -139,86 +350,20 @@ understand how yaml works before editing this file, if you break it, you can
 copy a new default from
 [dweb-mirror.config.yaml on the repo](https://github.com/internetarchive/dweb-mirror/blob/master/dweb-mirror.config.yaml)
 TODO Note this file will probably move location. 
 Note that this file is also edited automatically when the Crawl button
 described above is clicked. 
-As the project develops, this file will be editable via a UI. 
+As the project develops, this file will be more and more editable via a UI. 
 ## Update
 Dweb-mirror is under rapid development, as is the JavaScript UI.  It's
 recommended to update frequently. 
 From a Terminal window
 ```
 sudo sh # Run all commands as root
 cd /opt/iiab/internetarchive
 yarn upgrade  # Currently this can take up to about 20 minutes to run, we hope to reduce that time
 ```
 ## Crawling
 The Crawler will be built into the UI fairly soon, for now it has to be run in
 a terminal window.
 It's highly configurable either through the YAML file described above, or from
 the command line.
 In a shell 
 ```
 # Run all commands as root from dweb-mirror's directory
 sudo sh
 # cd into location for your installation 
 cd /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-mirror
 # To get a full list of possible arguments
 ./internetarchive --help
 # Perform a standard crawl
 ./internetarchive --crawl 
 # To fetch the "foobar" item from IA. 
 ./internetarchive --crawl foobar 
 # To crawl top 10 items in the prelinger collection sufficiently to display and put 
 # them on a disk plugged into the /media/pi/xyz
 # TODO check where pi actually put them. 
 ./internetarchive --copydirectory /media/pi/xyz/archiveorg --crawl --rows 10 --level details prelinger
 ```
 ## Troubleshooting
 There are two logs of relevance, the browser and the server.
 **Browser**: If using Chrome then this is at View / Developer Tools /
 JavaScript Console or something similar.
 **Server**: 
 From a Terminal window. 
 ```
 journalctl -u internetarchive
 ```
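 The server log can be long, so it may help to filter it. This is a sketch under an assumption: that problem lines in the server journal contain the word "FAILING", as mentioned earlier for the log seen in the browser. `failing_lines` is a hypothetical helper name.
 ```shell
 # Hypothetical helper: keep only the lines on standard input that mention FAILING.
 failing_lines() {
   grep 'FAILING'
 }
 ```
 A typical use would be `journalctl -u internetarchive --no-pager | failing_lines`; if it prints nothing, look through the unfiltered log instead.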
 ## Known Issues
 See
 [github dweb-mirror issues](https://github.com/internetarchive/dweb-mirror/issues);
 and
 [github dweb-archive issues](https://github.com/internetarchive/dweb-archive/issues);
 ## More info
 Dweb-Mirror lives on GitHub at:
-* [dweb-mirror](https://github.com/internetarchive/dweb-mirror)
+* dweb-mirror (the server) [source](https://github.com/internetarchive/dweb-mirror),
-* [source](https://github.com/internetarchive/dweb-mirror)
+  and [issues tracker](https://github.com/internetarchive/dweb-mirror/issues)
-* [issues](https://github.com/internetarchive/dweb-mirror/issues)
+* dweb-archive (the UI) [source](https://github.com/internetarchive/dweb-archive),
-* [API.md](./API.md) API documentation for dweb-mirror
+  and [issues tracker](https://github.com/internetarchive/dweb-archive/issues)
 This project is part of the Internet Archive's larger Dweb project, see also:
-* [dweb-universal](https://github.com/internetarchive/dweb-universal) info about others distributing the web
+* [dweb-universal](https://github.com/mitra42/dweb-universal) info about others working to bring access offline.
-* [dweb-transport](https://github.com/internetarchive/dweb-transport) miscellaneous incl GUN gateway and WebTorrent
+* [dweb-transports](https://github.com/internetarchive/dweb-transports) for our transport library to IPFS, WEBTORRENT, WOLK, GUN etc
-* [dweb-objects](https://github.com/internetarchive/dweb-objects) library of dweb objects
+* [dweb-archivecontroller](https://github.com/internetarchive/dweb-archivecontroller) for an object oriented wrapper around our APIs
 * [dweb-archive](https://github.com/internetarchive/dweb-archive) archive UI in JavaScript
 * [dweb-archivecontroller](https://github.com/internetarchive/dweb-archivecontroller) Knows about the structure of archive objects